Parsing Shoebox Data in Python

Source Code

The code base is in a very early stage of development (beta). All the usual disclaimers apply. Use at your own risk.

Download: [ shoebox.tar.gz ]

Background

Shoebox data files are marked up in Standard Format, a loosely defined data format that is becoming obsolete, largely because of XML. As with XML, it is useful to distinguish between two levels of conformance for Shoebox data files: well-formedness and validity. A well-formed file obeys the syntax of Standard Format; a valid file additionally conforms to the constraints in its Shoebox metadata.

Well-formedness is therefore a necessary (but not sufficient) condition for validity.

Standard Format

Without validating information from a Shoebox metadata file, the following assumptions will be made about standard format:

  1. Field Marker:
    1. occurs at the beginning of a line
    2. starts with a single backslash
    3. first character after backslash is alphabetic
    4. cannot contain whitespace
    5. second and subsequent characters can be alphanumeric, an underscore (_), and/or dash (-)
    6. is separated from associated field data by a single space
  2. Field Data:
    1. unrestricted text
    2. can include carriage returns
    3. cannot contain a line-initial backslash
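
The assumptions above can be sketched as a simple line classifier. This is a minimal illustration, not the parser shipped in shoebox.tar.gz: the names MARKER_RE and classify_line are hypothetical, and "alphabetic" is assumed to mean ASCII letters. It checks only the marker rules listed above, so it flags a marker such as \1 but makes no attempt to detect stray backslashes inside field data.

```python
import re

# Hypothetical pattern for the marker rules listed above: a line-initial
# backslash, an ASCII letter (assumed meaning of "alphabetic"), then letters,
# digits, underscores or dashes, optionally followed by one space and the data.
MARKER_RE = re.compile(r"^\\([A-Za-z][A-Za-z0-9_-]*)(?: (.*))?$")


def classify_line(line):
    """Return a (kind, marker, text) tuple for one line of Standard Format.

    kind is "field", "continuation" or "ill-formed".
    """
    match = MARKER_RE.match(line)
    if match:
        # A field with no data (e.g. "\nt") yields an empty string.
        return ("field", match.group(1), match.group(2) or "")
    if line.startswith("\\"):
        # Line-initial backslash that does not start a valid marker.
        return ("ill-formed", None, line)
    # Anything else is treated as a continuation of the previous field's data.
    return ("continuation", None, line)


for sample in ["\\ref orang", "\\nt", "\\1 orang", "more text"]:
    print(classify_line(sample))
```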

Some examples of well-formed and ill-formed Shoebox data are provided below:

Well-Formed
\ref orang
\ps N
\ge person
 
\ref orang
\ps N
\nt
\ge person
 
\ref orang
\ps N
\ge person
\ge people

Ill-Formed
\ref orang \ps N \ge person
 
\1 orang
\2 N
\3
\4 person
 
\ref orang \
\ge person
\ge people

Shoebox reserves some special field markers that violate the rule on the first character of a field marker, for example \+mkrset and \-mkrset. So far, + and - appear to be the only exceptions, so the rule for the first character might be relaxed to admit them.
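
If the rule is relaxed in this way, a marker pattern might admit + and - in first position only. The sketch below is an assumption about how such a relaxation could be expressed (RELAXED_MARKER_RE is a hypothetical name), not a description of what the shoebox modules do.

```python
import re

# Hypothetical relaxed pattern: the first character after the backslash may
# be an ASCII letter, "+" or "-"; the remaining characters are unchanged.
RELAXED_MARKER_RE = re.compile(r"^\\([A-Za-z+-][A-Za-z0-9_-]*)(?: (.*))?$")

for line in ["\\+mkrset", "\\-mkrset", "\\ref orang", "\\1 orang"]:
    match = RELAXED_MARKER_RE.match(line)
    # Print the marker name, or None when the line has no valid marker.
    print(line, "->", match.group(1) if match else None)
```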

Parsing

There are three ways of parsing a Shoebox file into entries and their associated fields, depending on the information available:

Available Information: No Metadata
  Description: If no metadata is available, it is assumed that the first field of the first entry encountered is the head field.
  Can be Parsed? Yes
  Can be Validated? No

Available Information: Head Field Known
  Description: If the head field marker is known, multiline fields can be handled properly.
  Can be Parsed? Yes
  Can be Validated? No

Available Information: Shoebox Metadata
  Description: With the full metadata available (i.e., the *.typ file), it is possible to parse the data file and ensure that the contents conform to the metadata constraints.
  Can be Parsed? Yes
  Can be Validated? Yes
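
A minimal sketch of the first two cases follows. It is not the API of the shoebox modules: parse_entries, FIELD_RE and the head_marker parameter are hypothetical names, and the sketch makes no attempt to check well-formedness. When head_marker is omitted, the marker of the first field encountered is taken as the head field, as described above.

```python
import re

# Hypothetical field pattern (ASCII letters assumed for "alphabetic").
FIELD_RE = re.compile(r"^\\([A-Za-z][A-Za-z0-9_-]*)(?: (.*))?$")


def parse_entries(text, head_marker=None):
    """Split Standard Format text into entries.

    Each entry is a list of (marker, data) pairs and starts at a field whose
    marker equals head_marker. Fields that precede the first head field are
    dropped; blank lines are skipped.
    """
    entries = []
    for line in text.splitlines():
        match = FIELD_RE.match(line)
        if match:
            marker, data = match.group(1), match.group(2) or ""
            if head_marker is None:
                # No metadata: assume the first field seen is the head field.
                head_marker = marker
            if marker == head_marker:
                entries.append([])
            if entries:
                entries[-1].append((marker, data))
        elif line.strip() and entries:
            # Non-marker line: append it to the data of the previous field.
            marker, data = entries[-1][-1]
            entries[-1][-1] = (marker, (data + "\n" + line) if data else line)
    return entries


SAMPLE = "\n".join([
    "\\ref orang", "\\ps N", "\\ge person", "",
    "\\ref anak", "\\ps N", "\\ge child", "of someone",
])

for entry in parse_entries(SAMPLE):
    print(entry)
```

Passing head_marker="ref" gives the same result for this sample; passing a different marker changes where entries are split, which is why knowing the real head field matters.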

Comments on Shoebox Parsing

Validation

When a full metadata description of a Shoebox file is available, the Shoebox parser can validate the file's contents against the metadata specification. All validation errors derive from the class ShoeboxValidationError. The following validation errors are recognized:
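
As a purely illustrative sketch of how such an error hierarchy can be used, consider the code below. Only the name ShoeboxValidationError comes from the text; the subclass UnknownMarkerError, the function validate_markers and the check it performs are hypothetical and are not the errors the parser actually defines.

```python
class ShoeboxValidationError(Exception):
    """Base class named in the text; the body here is an assumption."""


class UnknownMarkerError(ShoeboxValidationError):
    """Hypothetical subclass: a field marker not declared in the metadata."""

    def __init__(self, marker):
        super().__init__("unknown field marker: \\" + marker)
        self.marker = marker


def validate_markers(entry, allowed_markers):
    """Raise UnknownMarkerError for the first marker not in allowed_markers.

    entry is a list of (marker, data) pairs; allowed_markers stands in for
    the marker set that would be read from a *.typ metadata file.
    """
    for marker, _data in entry:
        if marker not in allowed_markers:
            raise UnknownMarkerError(marker)


entry = [("ref", "orang"), ("ps", "N"), ("xx", "?")]
try:
    validate_markers(entry, {"ref", "ps", "ge"})
except ShoeboxValidationError as error:
    # Catching the base class handles every (hypothetical) subclass.
    print("validation failed:", error)
```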

Functionality

A good deal of the functionality for Shoebox per se (rather than Standard Format) has been reverse-engineered. If anyone is aware of explicit specifications for the make-up of Shoebox metadata, please contact the author.

Feature Wish List

  1. the shoebox reader should probably read from a file instead of a string, to make it more similar to the corpus readers
  2. include inline epydoc documentation like the rest of NLTK (http://epydoc.sourceforge.net/)
  3. tutorial (include the bin/*.py examples as demo code so that each module can be run from the command line and demonstrate something useful; cf. the rest of NLTK)
  4. make sure we have methods to create new fields and new entries, and save them to a shoebox file (NB this may involve updating a project file as well)
  5. Unicode handling for Toolbox
  6. Better error handling for Shoebox validation
  7. Automatic parsing and alignment of interlinearized tiers

Known Bugs or Limitations

  1. Font information for markers ignored
  2. File sets, jump sets, templates, and (RTF) export sets ignored
  3. Interlinearized texts do not always parse properly (\id in metadata, but \ref is actual head field)

Sample Data

You can test the Standard Format parser by running some test scripts (bin.tar.gz) on the sample data that comes with Shoebox (samples.tar.gz). Add the shoebox directory containing the Shoebox modules to the environment variable PYTHONPATH so that Python can find it.

For purposes of illustration, assume that all of the above-linked files reside in a directory called foo in your home directory. The following Bash session illustrates how to run a sample script:

~/foo $ ls
bin.tar.gz  samples.tar.gz  shoebox.tar.gz
~/foo $ gunzip bin.tar.gz
~/foo $ tar xf bin.tar
~/foo $ gunzip samples.tar.gz
~/foo $ tar xf samples.tar
~/foo $ gunzip shoebox.tar.gz
~/foo $ tar xf shoebox.tar
~/foo $ rm -f *.tar
~/foo $ ls
bin  samples  shoebox
~/foo $ export PYTHONPATH=$PYTHONPATH:~/foo/shoebox/
~/foo $ python bin/print-shoebox.py -s samples/Frisian1/FriRt.dic
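
As an alternative to exporting PYTHONPATH, a script can extend the module search path itself. The snippet below is a sketch that assumes the ~/foo/shoebox location used in the session above; it affects only the running Python process.

```python
import os
import sys

# Assumed location from the session above: ~/foo/shoebox
shoebox_dir = os.path.expanduser("~/foo/shoebox")

# Make the directory importable for this process only.
if shoebox_dir not in sys.path:
    sys.path.append(shoebox_dir)
```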

The following commands cover each of the three use cases:

Available Information    Test Scripts

No Metadata
  $ python bin/print-shoebox.py -s samples/Frisian1/FriRt.dic
  $ python bin/print-shoebox.py -s samples/Frisian2/FriRt.dic
  $ python bin/print-shoebox.py -s samples/Axint/Ax.lex

Head Field Known
  $ python bin/print-shoebox.py -s samples/Frisian1/FriRt.dic -f lx
  $ python bin/print-shoebox.py -s samples/Frisian2/FriRt.dic -f lx
  $ python bin/print-shoebox.py -s samples/Axint/Ax.lex -f lx

Shoebox Metadata
  $ python bin/print-shoebox.py -s samples/Frisian1/FriRt.dic -m samples/Frisian1/FrisianD.typ
  $ python bin/print-shoebox.py -s samples/Frisian2/FriRt.dic -m samples/Frisian2/FrisianD.typ
  $ python bin/print-shoebox.py -s samples/Axint/Ax.lex -m samples/Axint/Axininc2.typ

For more examples and explanation, see the tutorial.


The author may be contacted at Stuart DOT Robinson AT mpi DOT nl.