
Loading Gt2 File Formatted Data

With the database designed, a tool was needed to parse files in the gt2 format and load their ground-truthed information into the appropriate database tables. Python⁴ was selected for this project for its simple syntax, clean semantics, expressiveness, exception handling, dynamic typing and binding, orthogonal structure, stability, portability, extensibility, and flexibility. The source for this program can be found in Appendix B.

The program begins with the get_files() function, which checks whether the program has been imported as a module by another program or run from the command line. If it was run from the command line, the system arguments are checked and the input and output file names are stored. process_file() is then called; it stores the output of get_known_targets() in the variable known_targets. get_known_targets() connects to the database, runs a query for the class, type, variant, and specific columns of the table target, and returns the result. Once this list has been returned, ``gt2import.py'' steps through the current input file line by line (thereby allowing files of extreme length to be processed).
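The start-up steps described above can be sketched as follows. This is only an illustration and not the program from Appendix B: it uses Python's built-in sqlite3 module as a stand-in for the project's database, and the function signatures and the shape of the command-line arguments are assumptions.

```python
import sqlite3
import sys


def get_files(argv):
    """Return (input_file, output_file) from command-line style arguments.

    Assumed layout: argv[0] is the script name, followed by one input
    and one output file name. Returns None if the count is wrong.
    """
    if len(argv) != 3:
        return None
    return argv[1], argv[2]


def get_known_targets(conn):
    """Return the class, type, variant, and specific columns of every target row."""
    cursor = conn.execute("SELECT class, type, variant, specific FROM target")
    return cursor.fetchall()


if __name__ == "__main__":
    # This branch runs only when the file is executed from the command
    # line, not when it is imported as a module by another program.
    files = get_files(sys.argv)
    print(files if files else "usage: gt2import.py INPUT OUTPUT")
```

Keeping the database connection as a parameter, rather than opening it inside get_known_targets(), is a choice made here so that the sketch can be tried against an in-memory database.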

Upon encountering the first line, the bytes per pixel value is found by calling get_bytes_pixel(), which opens the file again, searches for the first reference, and parses it to extract its bytes per pixel value. The sequence location is then stored and load_sequence_id(bytes_pixel) is called; it connects to the database, checks that this sequence has not already been entered, and, if this is a new sequence location, inserts the needed information into the sequence_location table (Appendix C.2). Control then returns to process_file(), and the time at which this sequence was recorded is stored.
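The check-then-insert behaviour of load_sequence_id() might look roughly like the sketch below. The column names (id, location, bytes_pixel), the extra parameters, and the use of sqlite3 are all assumptions made for illustration; the actual schema is the one given in Appendix C.2.

```python
import sqlite3


def load_sequence_id(conn, location, bytes_pixel):
    """Insert a sequence location if it is new; return its row id either way.

    The column names used here are hypothetical.
    """
    row = conn.execute(
        "SELECT id FROM sequence_location WHERE location = ?", (location,)
    ).fetchone()
    if row is not None:
        # Sequence already entered: skip the insert.
        return row[0]
    cursor = conn.execute(
        "INSERT INTO sequence_location (location, bytes_pixel) VALUES (?, ?)",
        (location, bytes_pixel),
    )
    conn.commit()
    return cursor.lastrowid


# Small demonstration against an in-memory database.
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE sequence_location "
    "(id INTEGER PRIMARY KEY, location TEXT, bytes_pixel INTEGER)"
)
first_id = load_sequence_id(demo, "/data/seq01", 2)
second_id = load_sequence_id(demo, "/data/seq01", 2)  # same location again
```

Calling the function twice with the same location leaves a single row in the table, which is the property the text describes.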

The second line of a gt2 file contains the number of targets in the file; this value is stored as number_targets for later use. The number of references for each target is stored starting on line three, so number_targets lines are read and each is parsed for one target's reference count. These values are then stored in a dictionary for use in parsing the main sections of the file.
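As a rough illustration of this header parsing, the sketch below reads the target count and then one reference count per target into a dictionary. The exact gt2 layout is not reproduced in this section, so the assumption that each of these lines begins with a single integer, and the function name read_reference_counts(), are hypothetical.

```python
def read_reference_counts(lines):
    """Read the target count, then one reference count per target.

    `lines` is an iterator positioned at the second line of the file.
    Assumes (hypothetically) that each line starts with one integer.
    """
    number_targets = int(next(lines).split()[0])
    reference_counts = {}
    for target_index in range(number_targets):
        reference_counts[target_index] = int(next(lines).split()[0])
    return number_targets, reference_counts
```

Because the function consumes an iterator, the caller can keep reading the following lines from the same open file without rereading the header.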

Target-specific information is stored in lines four through four plus number_targets. ``gt2import.py'' iterates number_targets times, reading each line and storing its values in a list, which is then transformed into a dictionary. A consistency check is also performed (through check_consistency()) to help ensure the file format is valid. In the gt2 format, priority levels are stored as words (``PRIMARY,'' ``SECONDARY,'' ...), but the database sorts priority by numeral, so this value is converted. The class, type, variant, and specific values are then stored as a single string in the list target_ids to enable checking. The current target code is checked against the list to which it was just added; if there is no match, ``gt2import.py'' proceeds to query the database to see whether this target has already been entered. If it has not, all needed target information is stored in the database through an INSERT query; if the target is already present, this step is skipped.
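The priority conversion and the skip-if-present logic can be sketched as below. The numeric values assigned to each priority word, the ":" separator in the target code, and the helper names are assumptions; only the two priority words named in the text are included.

```python
# Assumed mapping; the text names only these two levels explicitly.
PRIORITY_LEVELS = {"PRIMARY": 1, "SECONDARY": 2}


def priority_to_number(word):
    """Convert a gt2 priority word into a sortable numeral."""
    return PRIORITY_LEVELS[word.strip().upper()]


def target_code(target):
    """Join class, type, variant, and specific into one string."""
    keys = ("class", "type", "variant", "specific")
    return ":".join(str(target[key]) for key in keys)


def register_target(target, known_codes, insert_target):
    """Call insert_target(target) only if its code is not yet known.

    Returns True when an insert was performed, False when it was skipped.
    """
    code = target_code(target)
    if code in known_codes:
        return False
    insert_target(target)
    known_codes.add(code)
    return True
```

Here insert_target stands in for whatever routine issues the INSERT query, and known_codes is a set of codes built from the rows already in the database.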

We have now entered the ``main'' section of the gt2 file, where reference data is stored. ``gt2import.py'' iterates through list_targets, checking each time whether it should append or overwrite. The temporary output file is opened, and the number of lines indicated earlier is read, checked for consistency, and stored in a dictionary. The output is then formatted as tab-delimited text and written to the output file. If EOF is then encountered, ``gt2import.py'' breaks; otherwise it continues to the next line. Once the current target's references have been read and written, the output file is closed, the current directory is stored, and load_frame_rows() is called to load the output data into the database (the LOAD command is significantly faster than inserting rows individually). Once the data is loaded, the temporary output file is erased and the next target is processed.
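The write, load, and erase cycle for one target's references might be sketched as follows. The bulk-load statement itself is not shown in this section, so it is left to a caller-supplied function; the name write_and_load() and the single path argument passed to load_frame_rows are assumptions.

```python
import csv
import os
import tempfile


def write_and_load(rows, load_frame_rows):
    """Write rows to a temporary tab-delimited file, load it, then erase it.

    `load_frame_rows` is any callable that takes the file path and
    bulk-loads its contents into the database.
    """
    descriptor, path = tempfile.mkstemp(suffix=".txt")
    try:
        with os.fdopen(descriptor, "w", newline="") as handle:
            writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
            writer.writerows(rows)
        # The file is closed before loading so all rows are flushed to disk.
        load_frame_rows(path)
    finally:
        # The temporary file is erased even if loading fails.
        os.remove(path)
    return path
```

Writing every row to a file and loading it in one step mirrors the text's point that a single bulk load is faster than issuing one INSERT per row.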


Chris Frost
1999-08-07