Handling Shoebox Dictionaries with the Natural Language Toolkit

Stuart Robinson

This tutorial introduces the Shoebox capabilities of the Natural Language Toolkit (NLTK) for Python. Here we concentrate on Shoebox dictionaries rather than interlinearized texts.


Table of Contents

  1. What is Shoebox?
  2. How Does nltk.shoebox Work?
  3. Some Case Studies
  4. Conclusion
  5. Links


What is Shoebox?

Shoebox (and its latest incarnation, Toolbox) is a computer program used by many linguists to handle fieldwork data. The description of the program provided on the Shoebox homepage sums it up nicely:

"Shoebox is a computer program that helps field linguists and anthropologists integrate various kinds of text data: lexical, cultural, grammatical, etc. It has flexible options for sorting, selecting, and displaying data. It is especially useful for helping researchers build a dictionary as they use it to analyze and interlinearize text. The name Shoebox recalls the use of shoe boxes to hold note cards on which definitions of words were written in the days before researchers could use computers in the field."

A sample entry from a Shoebox dictionary of Rotokas (East Papuan, spoken on Bougainville) is provided below (taken from ROTRT.DIC). Note the program displays the data in two columns: on the left are the field markers, which identify different data fields; on the right are the field values, which provide data for the fields identified by the markers on the left.

Alternate views of the data are possible. For example, below we find the same data displayed in three columns: on the left are the field markers and their descriptions; on the right are the field values (the actual data for the entry).

It is possible to look at this data in its raw form using any word processor (e.g., MS Word) or text editor (e.g., Notepad, vi, or emacs). Here is what the sample entry looks like in raw form:

\lx korau
\ps V.A
\ge clear
\ge unobstructed
\gp klia
\dt 14/Feb/2005
\cmt What is aue doing in the first example?
\ex Korauvira toupai aue evaoa.
\xp Diwais em i stap long ples klia.
\xe The trees are in the clearing.
\ex Ezra korauvira rutu toreparoi.
\xp Ezra i sanap long ples klia.
\xe Ezra is standing up in the clearing.

This raw data is simply text and can be manipulated programmatically. Although Shoebox is a very full-featured program with good data analysis capabilities, there is no substitute for the power and flexibility of a bona fide programming language. Anyone who uses Shoebox has no doubt at some point wanted to perform some type of analysis but found the inherent capabilities of Shoebox inadequate for the task. For example, it would be quite difficult to query a Shoebox dictionary and obtain every example sentence for entries that, say, consist of four segments, begin with a particular consonant, end with a particular suffix, and belong to a particular part of speech. Yet the Shoebox functionality of the NLTK makes this possible---in fact, it makes the job fairly simple (see Case 8 for an indication).

Here we will provide a tutorial on the manipulation of Shoebox dictionary files with the NLTK for Python. Although this tutorial does not require intimate knowledge of Shoebox, it is a good idea to familiarize yourself with at least the basics of the application. Fortunately, there is a good deal of documentation available (see the links section). (Shoebox and Toolbox are quite similar, and most of the skills acquired on one transfer to the other. For this reason, all subsequent references to Shoebox can be assumed to apply equally well to Toolbox.)

How Does nltk.shoebox Work?

Let's begin by looking at how nltk.shoebox is organized. Within nltk.shoebox, there are two main modules: standardformat.py and shoebox.py. The standardformat module supplies most of the low-level functions for dealing with files in standard format (which in theory encompasses more than Shoebox files). The shoebox module handles a good deal more, providing functionality for handling various aspects of standard format files that are specific to the Shoebox program. We will look at each in turn.

Standard Format

The first is a module that provides functionality for handling Standard Format, the file format used by Shoebox. Standard format is not well described (and is arguably rendered obsolete by other formats, such as XML). It consists of a collection of entries, which are generally separated from one another by double carriage returns. Technically, however, what defines the beginning of an entry is a particular field, referred to as the head field.

A sample entry from a Frisian Shoebox dictionary is provided below:

\fri do
\ps Pron
\g you
\eng you

It is broken down into its constituent parts here (the first field, \fri, is the head field):

Field  Field Marker  Field Value
1      \fri          do
2      \ps           Pron
3      \g            you
4      \eng          you

Note that the field marker occurs at the beginning of a line and begins with a backslash, and that the field value is separated from the field marker by a single space. The following is assumed to be true of standard format:

  1. Field Marker:
    1. occurs at the beginning of a line
    2. starts with a single backslash
    3. first character after backslash is alphabetic
    4. cannot contain whitespace
    5. second and subsequent characters can be alphanumeric, an underscore (_), or dash (-)
    6. is separated from associated field data by a single space
  2. Field Data:
    1. unrestricted text
    2. can include carriage returns
    3. cannot contain a line-initial backslash
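
The rules above translate fairly directly into a regular expression. The following sketch is written for Python 3 and does not use nltk.shoebox; the names MARKER_RE and parse_fields() are illustrative assumptions, not part of the toolkit. It splits raw text into (marker, value) pairs, treats a line without a marker as a continuation of the previous field's data, and rejects a line-initial backslash that does not introduce a well-formed marker.

```python
import re

# A field marker: a line-initial backslash, an alphabetic first character,
# then alphanumerics, underscores, or dashes, optionally followed by a
# single space and the field data. (Hypothetical helper, not the NLTK parser.)
MARKER_RE = re.compile(r'^\\([A-Za-z][A-Za-z0-9_-]*)(?: (.*))?$')

def parse_fields(text):
    """Split raw standard format text into a list of (marker, value) pairs."""
    fields = []
    for line in text.splitlines():
        match = MARKER_RE.match(line)
        if match:
            fields.append((match.group(1), match.group(2) or ''))
        elif line.startswith('\\'):
            # Field data may not contain a line-initial backslash.
            raise ValueError('ill-formed field marker: %r' % line)
        elif fields and line.strip():
            # No marker on this line: continue the previous field's data.
            marker, value = fields[-1]
            fields[-1] = (marker, (value + '\n' + line).strip('\n'))
    return fields

sample = "\\fri do\n\\ps Pron\n\\g you\n\\eng you\n"
print(parse_fields(sample))
```

Note that this sketch accepts a line such as "\ref orang \ps N \ge person" as a single \ref field; it checks only the shape of the line-initial marker. It also rejects the special metadata markers such as \+mkrset.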

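As is the case with XML, it is useful to distinguish between two kinds of Shoebox data files: well-formed files, which conform to the rules of standard format given above, and valid files, which additionally conform to the constraints specified in their metadata.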

(Note: There are some special field markers used in metadata files that violate the above-given rules concerning the initial-character of the field marker--for example, \+mkrset and \-mkrset.)

Some examples of well-formed and ill-formed Shoebox data are provided below:

Well-Formed

\ref orang
\ps N
\ge person

\ref orang
\ps N
\nt
\ge person
\ref orang
\ps N
\nt

\ge person

\ref orang
\ps N
\ge person
\ge people

Ill-Formed

\ref orang \ps N \ge person

\1 orang
\2 N
\3
\4 person

\ref orang \
\ge person
\ge people

Well-formedness is therefore a necessary (but not sufficient) condition for validity.

Shoebox

Although most users interact with only a few Shoebox files (typically, only the dictionary file and interlinearized texts), many more are generated by the program. These files are normally modified from within Shoebox, using its handy graphical interface, but they can also be directly edited. However, this should be done with caution, since minor changes can have dramatic effects. We recommend that changes be made to copies of files and not to the originals. This significantly reduces the likelihood of irretrievable errors that lead to data loss.

To understand a little better what goes on under the hood, we will look at the files from Samples/Frisian1/.

Type      File          Description
Metadata  Fri.prj       Frisian Project File
          Default.lng   Language encoding files
          Frisian.lng
          FrisianD.typ  Metadata for Frisian dictionary
          FrisianT.typ  Metadata for Frisian texts
Data      FriRt.dic     Frisian dictionary
          FriSampl.txt  A sample interlinear text in Frisian

It is useful to distinguish between two types of files: data files and metadata files.

Data files are directly modified by the user through the Shoebox program, whereas metadata files are modified indirectly, by the program itself.

As an illustration, consider a particular entry from the Frisian dictionary, the one for the indefinite article a. It has four field markers: eng, fri, g, and ps. Information about these field markers is found in FrisianD.typ. Excerpts from this file are provided below:

Field Markers: eng, fri, g, ps
Metadata Definition:
\+mkr eng
\nam English
\lng Default
\mkrOverThis fri
\-mkr
\+mkr fri
\nam Frisian Word
\lng Frisian
\-mkr
\+mkr g
\nam Gloss
\lng Default
\mkrOverThis fri
\-mkr
\+mkr ps
\nam Part of Speech
\lng Default
\mkrOverThis fri
\-mkr

When metadata of this sort is available, it is possible to validate Shoebox data against it in order to ensure that the data is valid. In one of the cases examined below (Case 8: Validating Field Data Against Range Sets), we will see how the fields of a Shoebox dictionary can be validated against metadata to ensure that all of the field values for a particular field marker belong to a fixed list of possible values.

Parsing and Validation

There are three ways of parsing a Shoebox dictionary file into entries and their associated fields, depending on how much information is available:

Available Information  Can be Parsed?  Can be Validated?
No Metadata            Yes             No
Head Field Known       Yes             No
Shoebox Metadata       Yes             Yes

No Metadata: If no metadata is available, it is assumed that the first field of the first entry encountered is the head field.
Head Field Known: If the head field marker is known, it is possible to handle multiline fields properly.
Shoebox Metadata: With the full metadata available (i.e., the *.typ file), it is possible to parse the data file and ensure that the contents conform to the metadata constraints.
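
The first two strategies can be sketched as follows. Assuming the fields have already been read into a list of (marker, value) pairs, the hypothetical function split_entries() starts a new entry at every occurrence of the head field; if no head field is supplied, it falls back on the marker of the first field. This is written for Python 3 and illustrates the idea only; it is not the toolkit's implementation.

```python
def split_entries(fields, head=None):
    """Group (marker, value) pairs into entries, one per head field."""
    if not fields:
        return []
    if head is None:
        # No metadata: assume the first field encountered is the head field.
        head = fields[0][0]
    entries = []
    for marker, value in fields:
        # Start a new entry at each head field (or at the very first field).
        if marker == head or not entries:
            entries.append([])
        entries[-1].append((marker, value))
    return entries

fields = [("lx", "kou"), ("ps", "V.B"), ("ge", "lay egg"),
          ("lx", "karu"), ("ps", "V.B"), ("ge", "open"), ("ge", "unlock")]
for entry in split_entries(fields):
    print(entry)
```

Guessing the head field in this way fails if the file begins with a field other than the head field, which is one reason to supply the head field explicitly when it is known.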

When a full metadata description of a Shoebox file is available, the Shoebox parser can validate its contents against their metadata specification. All validation errors extend from the class ShoeboxValidationError. The following validation errors are recognized:

Error Type               Description
BadMetadataFile          Something is wrong with the metadata file and it cannot be parsed.
NoMetadataFound          No metadata file has been supplied even though one is required.
BadFieldValue
  ValueOutsideRangeSet   The value falls outside of the fixed set specified for a particular field.
  NoWordWrap             The value of the field has a carriage return despite being specified for no word wrap.
  EmptyValue             No value is provided for a field that requires one.
  SingleWord             The value of a field consists of multiple words despite being specified for only a single word.

Some Case Studies

For these demonstrations of nltk.shoebox functionality, we will manipulate some sample Shoebox files (in the folder Samples). Because the sample files can be useful when learning how to use Shoebox, most users prefer to keep an unmodified version of them. Therefore, we recommend making a backup of these files before trying the scripts below.

Case 1: Formatting a Shoebox Lexicon for Display

Raw Shoebox data isn't very easy to read. It's therefore useful to be able to reformat a Shoebox dictionary file according to your wishes. Although Shoebox has built-in facilities for producing formatted dictionaries, they do not rival the possibilities provided by the NLTK. Here we will stick to the basics for the sake of illustration and show how a minimally formatted plain text version of a Shoebox dictionary can be produced using the NLTK.

The script reformat-dict.py does the job. It is run as follows:

$ python bin/reformat-dict.py samples/Rotokas/ROTRT.DIC

The script is quite simple. It parses a Shoebox dictionary file, goes through each entry, retrieves selected fields, and then prints them out with some bare-bones formatting: the lexeme, the part of speech in parentheses, and the English translation in single quotes. Note that the translation is chosen dynamically, by first checking whether the field eng exists for an entry and then falling back on ge if it does not.

import sys
from shoebox.standardformat import StandardFormatFileParser

def main() :
    try :
        filepath = sys.argv[1]
    except IndexError :
        # No file was given on the command line: print a usage message.
        sys.stderr.write("Usage: %s FILE\n" % sys.argv[0])
        sys.exit(1)
    fp = StandardFormatFileParser(filepath)
    sff = fp.parse()
    print sff.getHeader()
    for e in sff.getEntries() :
        lex   = e.getHeadField()[1]
        pos   = e.getFieldValuesByFieldMarkerAsString("ps")
        gloss = e.getFieldValuesByFieldMarkerAsString("ge")
        eng   = e.getFieldValuesByFieldMarkerAsString("eng", "/")
        # Prefer the English translation; fall back on the gloss.
        if eng :
            print "%s (%s) '%s'" % (lex, pos, eng)
        else :
            print "%s (%s) '%s'" % (lex, pos, gloss)

There are two different methods of the Entry class that can be used to obtain specific fields of an entry: getFieldValuesByFieldMarker() and getFieldValuesByFieldMarkerAsString(). The method getFieldValuesByFieldMarker() returns all of the field values for a given field marker as a list. If the field is not found, the list will be empty. Otherwise, the number of items will equal the number of times the field is found in the entry. Because a missing field must then be handled separately by the caller, it is sometimes preferable to use the method getFieldValuesByFieldMarkerAsString(), which always returns a string. If the specified field is not found, a blank string is returned (rather than an empty list or the value None). If the specified field occurs more than once, the values are combined into a single string. The default separator in the returned string is a space, but an alternative separator can be specified as a second argument to the method. Calling getFieldValuesByFieldMarkerAsString("eng", "/") returns all of the \eng fields as a single string separated by slashes.
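
The difference between the two styles of accessor can be illustrated with a toy stand-in for the Entry class. The class below and its method names values() and values_as_string() are hypothetical and written for Python 3; they mimic the behaviour described above rather than reproduce the toolkit's code.

```python
class Entry:
    """A toy stand-in for the toolkit's Entry class (illustration only)."""

    def __init__(self, fields):
        self.fields = list(fields)  # list of (marker, value) pairs

    def values(self, marker):
        """Return every value for the marker as a list (possibly empty)."""
        return [v for m, v in self.fields if m == marker]

    def values_as_string(self, marker, sep=" "):
        """Return the values joined by sep, or '' if the marker is absent."""
        return sep.join(self.values(marker))

entry = Entry([("lx", "korau"), ("ps", "V.A"),
               ("ge", "clear"), ("ge", "unobstructed")])
print(entry.values("ge"))                   # ['clear', 'unobstructed']
print(entry.values_as_string("ge", "/"))    # clear/unobstructed
print(repr(entry.values_as_string("eng")))  # ''
```

Because the string-returning accessor yields an empty string for a missing field, a test such as "if eng" works without any special handling.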

Case 2: Adding a Field to a Shoebox Lexicon Automatically

For this case study, we will manipulate the Frisian dictionary that comes with Shoebox. Our goal is to add to each entry in the lexicon a field that provides the CV skeleton for that entry. For example, the CV skeleton for brek is CCVC and for bikwaam, CVCCVVC. The Shoebox functionality of the NLTK significantly simplifies the job of going through each entry and computing its CV skeleton, as can be seen in add-cv-skeleton.py. To see the script in operation, we can run it on one of the Frisian dictionaries that come with Shoebox as part of a sample project (FriRt.dic):

$ python bin/add-cv-skeleton.py samples/Frisian1/FriRt.dic

Because the script writes to standard output, its output can be redirected to a file in order to create a new version of the processed Shoebox dictionary file, as illustrated below:

$ python bin/add-cv-skeleton.py samples/Frisian1/FriRt.dic > samples/Frisian1/FriRt-new.dic

When the new version of the lexicon is opened again with Shoebox, the only change is the inclusion of a cv field in every entry, as can be seen in the following before-and-after screenshots:

[Screenshots: the Frisian dictionary entry before and after the cv field is added]

Here's how the script works. First, the script imports the StandardFormatFileParser class from the standardformat module. The parser builds a StandardFormatFile object from a Shoebox dictionary file.

from shoebox.standardformat import StandardFormatFileParser

It then defines three functions, handle_options(), cv() and main(). The function cv() is a very simple function that takes a lexical entry as input and returns its CV skeleton. It does this by using regular expressions to replace consonants with C and vowels with V.

import re

def cv(s):
    s = s.lower()
    s = re.sub(r'[^a-z]',     r'-', s)   # anything that is not a letter
    s = re.sub(r'[^aeiou\-]', r'C', s)   # remaining non-vowels are consonants
    s = re.sub(r'[aeiou]',    r'V', s)   # vowels
    return s

The function main() does the major work. First, the path to the Shoebox dictionary file is obtained as a command line argument, and a usage message is printed if one is not supplied. Second, a StandardFormatFileParser object is created and the parse() method is called to obtain a StandardFormatFile object. The header information of the dictionary is printed out. This header information is crucial since Shoebox will not recognize the dictionary without it. Here's what it looks like:

\_sh v3.0  400  Frisian Dictionary

The entries within the dictionary are next obtained by calling the getEntries() method, which produces a list of Entry objects. These are then iterated over and the head field of each is retrieved by calling the getHeadField() method. The CV skeleton is constructed using the cv() function and added back to the entry with the addField() method. The entry is then printed by relying upon the __str__ method of the Entry class for formatting.

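def main() :
    try :
        filepath = sys.argv[1]
    except IndexError :
        # No file was given on the command line: print a usage message.
        sys.stderr.write("Usage: %s FILE\n" % sys.argv[0])
        sys.exit(1)

    fp = StandardFormatFileParser(filepath)
    sff = fp.parse()

    # The header must be written first, or Shoebox will not recognize the file.
    print sff.getHeader()
    for entry in sff.getEntries() :
        headField = entry.getHeadField()
        frisian = headField[1]
        entry.addField("cv", cv(frisian))
        print entry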

Case 3: Filtering out Specific Fields in a Database

It is sometimes useful to be able to remove extraneous fields from a dictionary. The script filter-fields.py prints out the contents of a Shoebox dictionary, omitting the field specified by the user on the command-line. To see how it works, we will run the filter on the Rotokas dictionary ROTRT.DIC, filtering out the date field (\dt), as follows:

$ python bin/filter-fields.py -f dt samples/Rotokas/ROTRT.DIC

In order to filter out multiple fields, the script can be run multiple times, as illustrated below:

$ python bin/filter-fields.py -f dt  samples/Rotokas/ROTRT.DIC > /tmp/foo1.txt
$ python bin/filter-fields.py -f cmt /tmp/foo1.txt             > /tmp/foo2.txt
$ python bin/filter-fields.py -f nt  /tmp/foo2.txt             > samples/Rotokas/ROTRT-FILTERED.DIC

To understand how the script works, we'll look at the main() function, provided below:

def main() :
    fn, field2Filter = handle_options()
    fp = StandardFormatFileParser(fn)
    sff = fp.parse()
    print sff.getHeader()
    for e in sff.getEntries() :
        e.removeField(field2Filter)
        print e

The function handle_options() first obtains from the command line the path to a Shoebox file and the field to be filtered. Then the Shoebox dictionary is parsed into a StandardFormatFile object and its header is printed. A list of entries is obtained using the getEntries() method and then iterated over. From each entry, the specified field is removed with the removeField() method, and the entry is then printed.
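
Running the script once per field works, but the same idea extends to several fields in a single pass. The sketch below is written for Python 3, is independent of nltk.shoebox, and operates on raw text. The function filter_fields() is a hypothetical helper; it assumes that a field marker is the first whitespace-delimited token on a line beginning with a backslash, and it drops the continuation lines of a removed field together with the field itself.

```python
def filter_fields(text, unwanted):
    """Return text without the fields whose markers are in unwanted."""
    kept = []
    skipping = False
    for line in text.splitlines():
        if line.startswith("\\"):
            # A new field begins: decide whether to keep it.
            parts = line[1:].split(None, 1)
            marker = parts[0] if parts else ""
            skipping = marker in unwanted
        elif not line.strip():
            # A blank line ends any field that is being skipped.
            skipping = False
        if not skipping:
            kept.append(line)
    return "\n".join(kept)

raw = ("\\lx korau\n"
       "\\ps V.A\n"
       "\\ge clear\n"
       "\\dt 14/Feb/2005\n"
       "\\cmt What is aue doing in the first example?\n")
print(filter_fields(raw, {"dt", "cmt"}))
```

Passing a set of markers avoids the intermediate files needed when the script is run repeatedly.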

Case 4: Automatically Extracting Minimal Pairs

When studying the phonology of a language, it is useful to have a list of minimal pairs---which for our purposes we will define as a pair of words of the same length (i.e., identical number of characters) that differ from one another by a single character (e.g., bill and pill in English). It is relatively easy to extract minimal pairs automatically from word lists, provided that the orthography in the word list is phonemic (i.e., characters represent phonemes) and that there is a one-to-one relationship between characters and phonemes (i.e., no digraphs).

The script find-min-pairs.py will find all minimal pairs within a Shoebox dictionary. Below we see the first few lines of output obtained by running it on a Shoebox dictionary file for Rotokas (ROTRT.DIC):

$ python bin/find-min-pairs.py samples/Rotokas/ROTRT.DIC
a/e:kaa/kae
a/u:kaa/kau
a/e:kaa/kea
a/o:kaa/koa
a/e:kaa/kae
...

To see the total number of minimal pairs obtained, we can pipe the output to the Unix utility wc with the -l flag so that the number of lines is counted:

$ python bin/find-min-pairs.py samples/Rotokas/ROTRT.DIC | wc -l
456

Here's how the script works. With a monographic, phonemic orthography, finding minimal pairs is a fairly trivial task. One simple algorithm for identifying minimal pairs goes as follows: The length of every word in a Shoebox dictionary is determined. Every word is compared to every other word of the same length. (There is no point in examining words of different lengths, since they cannot form a minimal pair.) Words of identical length are lined up and their segments are compared one by one, in sequential order. A minimal pair is then simply a pair of identical-length words that differ in exactly one segment. Consider a pair of words like mint and lint:

Index    0 1 2 3
Letters  m i n t
         l i n t
Same?    N Y Y Y
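
The position-by-position comparison in the table can be written compactly. This is a minimal Python 3 sketch; differences() and is_minimal_pair() are illustrative names and are not taken from find-min-pairs.py.

```python
def differences(w1, w2):
    """Return the indices at which two equal-length words differ."""
    if len(w1) != len(w2):
        raise ValueError("words must have the same length")
    return [i for i, (a, b) in enumerate(zip(w1, w2)) if a != b]

def is_minimal_pair(w1, w2):
    """True if the words have equal length and differ in exactly one position."""
    return len(w1) == len(w2) and len(differences(w1, w2)) == 1

print(differences("mint", "lint"))      # [0]
print(is_minimal_pair("mint", "lint"))  # True
print(is_minimal_pair("bill", "pill"))  # True
print(is_minimal_pair("mint", "mint"))  # False
```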

Here we provide a simple script that takes a Shoebox dictionary and extracts all of the minimal pairs in it: find-min-pairs.py. The first part of the program uses the NLTK's Shoebox functionality to extract all of the words from the Shoebox dictionary.

def extractWords(filepath) :
    words = []
    sffp = StandardFormatFileParser(filepath)
    sff = sffp.parse()
    for e in sff.getEntries() :
        hf = e.getHeadField()
        words.append(hf[1])
    return words

The words are then fed into a function that classifies them according to length, returning a dictionary in which a key-value pair is a particular word length and a list of words of that length.

def sortWordsByLength(words) :
    wordLengths = {}
    for w in words :
        wl = len(w)
        # Start a new list the first time a given length is encountered.
        if wl not in wordLengths :
            wordLengths[wl] = []
        wordLengths[wl].append(w)
    return wordLengths

The dictionary of word lengths is then passed to the function findMinPairs(), which goes through words of identical length and finds any minimal pairs among them.

def findMinPairs(wordsByLength) :
    for l in wordsByLength.keys() :
        words = wordsByLength[l]
        # Compare each word only with the words that follow it, so that
        # every pair is examined exactly once and the list is never
        # modified while it is being iterated over.
        for n, w1 in enumerate(words) :
            for w2 in words[n + 1:] :
                diffCount = 0
                diffChar1 = ''
                diffChar2 = ''
                for i in range(l) :
                    if w1[i] != w2[i] :
                        diffCount = diffCount + 1
                        diffChar1 = w1[i]
                        diffChar2 = w2[i]
                        if diffCount > 1 :
                            break  # more than one difference: not a minimal pair
                if diffCount == 1 :
                    print "%s/%s:%s/%s" % (diffChar1, diffChar2, w1, w2)
Case 5: Handling Entry Date Stamps

If properly configured (see date stamps documentation), Shoebox will automatically update the date field of a dictionary entry whenever that entry is modified (i.e., created or edited). We will refer to this date field as a date stamp. Date stamps are a very useful feature with a number of possible applications, but one obvious benefit is that they provide an inherent log of activity in a database. This can be quite useful if one user wishes to review the changes made to a database by another user, as might be the case when a Shoebox dictionary is shared by multiple parties. (Shoebox makes few provisions for multi-user set-ups.)

Obtaining a Log of Activity on a Shoebox Database

By looking at date stamps, it is possible to determine quickly the general patterns of activity on a database. Essentially, this means being able to answer quickly and easily questions such as the following: When was the database originally created? When was it first modified? When was it last modified? These questions can be answered using the script list-modified-dates.py, which takes a Shoebox dictionary and provides a summary of the activity on a dictionary file by examining its entry date stamps. Its use with ROTRT.DIC is illustrated below:

$ python bin/list-modified-dates.py samples/Rotokas/ROTRT.DIC
2003 May    2 00%
2004 Jan    1 00%
2004 Feb   64 07%
2004 May    1 00%
2004 Jul   14 01%
2004 Aug    4 00%
2004 Sep   49 05%
2004 Oct    5 00%
2004 Nov    5 00%
2004 Dec  151 18%
2005 Jan  123 14%
2005 Feb  307 36%
2005 Mar   37 04%
2005 Apr   29 03%
2005 May   46 05%

If the flag -g is provided, the output takes the form of a histogram:

$ python bin/list-modified-dates.py -g samples/Rotokas/ROTRT.DIC
2003 May    2
2004 Jan    1
2004 Feb   64 *******
2004 May    1
2004 Jul   14 *
2004 Aug    4
2004 Sep   49 *****
2004 Oct    5
2004 Nov    5
2004 Dec  151 ******************
2005 Jan  123 **************
2005 Feb  307 ************************************
2005 Mar   37 ****
2005 Apr   29 ***
2005 May   46 *****

The output of the program is a breakdown of the number of entries modified during a particular month of a particular year. In the example provided, we see that the entries in the dictionary were modified between 2003 and 2005, with a peak of activity between December 2004 and February 2005.

The Shoebox functionality of the NLTK greatly simplifies this programming task. In fact, the only real complication is the handling of dates and times in Python (see python.org's datetime documentation). To understand how the script works, we will first look at the function main(), which calls a number of custom functions.

def main() :
    fn, histogram = handle_options()
    d = process_file(fn)
    print_results(d, histogram)

First, the file to be processed and any options are obtained from the command line. The main option is -g or --histogram, which determines the nature of the output. If the flag is provided, the output takes the form of a histogram; otherwise, percentages are provided. Second, the file is parsed and the date fields of every entry are put into a nested dictionary, keyed first by year and then by month, whose values are lists of the date stamps that fall within that month. Finally, the contents of this dictionary are printed out for display by print_results().

def print_results(d, histogram) :
    # First pass: count all date stamps so that percentages can be computed.
    total = 0
    for yr in d.keys() :
        for m in d[yr].keys() :
            total = total + len(d[yr][m])

    # Second pass: print one line per month, in chronological order.
    years = d.keys()
    years.sort()
    for yr in years :
        months = d[yr].keys()
        months.sort()
        for m in months :
            count = len(d[yr][m])
            if histogram :
                print "%s %s %4s %s" % (yr, format_month(m), count, ((count * 100 / total) * "*") )
            else :
                print "%s %s %4s %02d%%" % (yr, format_month(m), count, (count * 100.0 / total) )

Note that for display the months are converted from integers to strings for the sake of readability. This is done using a custom function format_month(intMonth). It would also be possible to do this conversion with built-in functionality from the Python standard library. We leave this as an exercise for the reader.
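
For readers who want to check their solution, here is one possible approach, written for Python 3 and independent of nltk.shoebox. It tallies date stamps by year and month with collections.Counter and uses strftime("%b") for the month abbreviation. The function tally_by_month() is a hypothetical name, this format_month() is a stand-in rather than the script's own, and the abbreviations produced by %b depend on the current locale.

```python
from collections import Counter
from datetime import datetime

def tally_by_month(date_stamps, date_format="%d/%b/%Y"):
    """Count date stamps per (year, month)."""
    counts = Counter()
    for stamp in date_stamps:
        parsed = datetime.strptime(stamp, date_format)
        counts[(parsed.year, parsed.month)] += 1
    return counts

def format_month(month):
    """Convert a month number (1-12) to a three-letter abbreviation."""
    return datetime(2000, month, 1).strftime("%b")

stamps = ["14/Feb/2005", "12/Feb/2005", "01/Dec/2004", "30/May/2005"]
counts = tally_by_month(stamps)
total = sum(counts.values())
# Sorting the (year, month) keys gives chronological order.
for (year, month), count in sorted(counts.items()):
    print("%d %s %4d %02d%%" % (year, format_month(month), count,
                                 100 * count // total))
```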

Finding All Entries Modified During a Particular Time Range

Once we have a general idea of when a database has been modified, it would be useful to be able to view only the entries within a particular time range. The script find-modified-entries.py takes a Shoebox dictionary and lists all entries modified within a time range specified by the user using command-line options. The logic of the script is reasonably straightforward and is described further below.

Here we illustrate its use with ROTRT.DIC, specifying only the start date (May 30th, 2005), which means that every entry modified on or after the start date will be printed out. In this case, this amounts to only two entries.

$ python bin/find-modified-entries.py -s 30/May/2005 samples/Rotokas/ROTRT.DIC
kou [V.B] 'lay egg defecate' (30/May/2005)
karu [V.B] 'open unlock untie unhook' (30/May/2005)

To obtain the entries modified before that date, the script can be run with only the end date specified, which means that every entry modified on or before the end date will be printed out. (Note that the date range is inclusive. To prevent the two entries with the date stamp 30/May/2005 from being printed out, the specified end date is one day earlier: May 29th, 2005 rather than May 30th, 2005.)

$ python bin/find-modified-entries.py -e 29/May/2005 samples/Rotokas/ROTRT.DIC
kasiarao [N.F] 'limbum' (15/Sep/2004)
kerikerisi [V.B] 'evaluate judge carefully' (14/Feb/2005)
karuvira [ADV] 'open' (12/Feb/2005)
kogo [V.B] 'cut chop' (01/Dec/2004)
koroto [V.B] 'meet together' (02/Dec/2004)
...

If the user specifies a date on the command line in the wrong format, an exception is raised, as illustrated below:

$ python bin/find-modified-entries.py -s 30/5/2005 samples/Rotokas/ROTRT.DIC
Traceback (most recent call last):
  File "bin/find-modified-entries.py", line 95, in ?
    main()
  File "bin/find-modified-entries.py", line 89, in main
    startDate = string_to_datetime(startDateStr, dateFormat)
  File "bin/find-modified-entries.py", line 43, in string_to_datetime
    epochSecs = time.mktime(time.strptime(dateString, dateFormat))
  File "/usr/local/lib/python2.3/_strptime.py", line 424, in strptime
    raise ValueError("time data did not match format:  data=%s  fmt=%s" %
ValueError: time data did not match format:  data=30/5/2005  fmt=%d/%b/%Y

The problem is that the script expects dates to be formatted as a one- or two-digit day, a three-letter month, and a four-digit year (separated by slashes), but the user-specified date does not conform to that format. It gives an integer for the month rather than a three-letter code (that is, 5 rather than May). (It is possible to use different date formats, and the script provides for this possibility with the option -f.)
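
A traceback is not a friendly way to report a mistyped date. A minimal sketch of one alternative, in Python 3, catches the ValueError and reports the expected format by example. The function parse_date() is a hypothetical replacement for illustration and is not the script's string_to_datetime().

```python
import sys
from datetime import datetime

def parse_date(date_string, date_format="%d/%b/%Y"):
    """Parse a date string, raising ValueError with a readable message."""
    try:
        return datetime.strptime(date_string, date_format)
    except ValueError:
        # Show the user what a correctly formatted date looks like.
        example = datetime(2005, 5, 30).strftime(date_format)
        raise ValueError("cannot read date %r; expected a date such as %s"
                         % (date_string, example)) from None

for candidate in ("30/May/2005", "30/5/2005"):
    try:
        print(parse_date(candidate).date())
    except ValueError as err:
        sys.stderr.write("error: %s\n" % err)
```

Because the example date is generated from the same format string, the message stays accurate when a custom format is supplied.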

In broad strokes, the script works by parsing the Shoebox dictionary into a StandardFormatFile object, iterating over its entries, and checking whether the entry's date stamp belongs to the time range defined by the user. We can see how this is done by first looking at the function main().

def main() :
    fn, startDateStr, endDateStr, userDateFormat = handle_options()
    dateFormat = "%d/%b/%Y"
    if userDateFormat :
        dateFormat = userDateFormat
    startDate = string_to_datetime(startDateStr, dateFormat)
    endDate = string_to_datetime(endDateStr, dateFormat)
    d = process_file(fn, startDate, endDate, dateFormat)
    print_results(d)
    return

In the function main(), the options are obtained from the command line with the handle_options() function. The default date format is overridden if a custom one is provided by the user. The date format is then used to parse the start and end dates into datetime objects. These are then passed to the function process_file(). Note that the date format is also passed, since the script assumes that the same date format used to parse the user-specified dates will also be used to parse the dictionary's date stamps.

def process_file(fn, startDate, endDate, dateFormat) :
    d = {}
    fp = StandardFormatFileParser(fn)
    sff = fp.parse()
    for e in sff.getEntries() :
        lexeme = e.getHeadField()[1]
        modDateStr = e.getFieldValuesByFieldMarkerAsString(FM_DATE)
        modDate = string_to_datetime(modDateStr, dateFormat)
        if modDate and in_time_range(startDate, endDate, modDate) :
            d[lexeme] = e
    return d

The main work is done by the function in_time_range(), which takes three dates---the start date, the end date, and the date stamp of an entry---and determines whether the entry's date stamp falls within the date range defined by the start and end date.

def in_time_range(startDate, endDate, modDate) :
    if ( startDate and endDate ) and ( modDate >= startDate and modDate <= endDate ) :
        return True
    elif ( startDate and not endDate ) and ( modDate >= startDate ) :
        return True
    elif ( endDate and not startDate ) and ( modDate <= endDate ) : 
        return True
    else :
        return False

If the date stamp of an entry does fall within the desired date range, the entry is added to a dictionary, which is then passed to the function print_results(), which prints out all of the entries in the dictionary with minimal formatting.

Case 6: Guess Entry Template of Shoebox Dictionary Without Metadata

If you work with Shoebox data, it's fairly likely that at some point you will come across a Shoebox dictionary by itself without any of the supporting metadata (language definitions, field marker definitions, etc.). In such a scenario, it is useful to be able to eyeball the data in order to learn more about the template used for entries. The idea is to query the database structure implicit in the dictionary by asking questions of the following sort: Which elements are obligatory? Which are optional? What are the dependencies?

Here we will present a short script, query-template-structure.py, that attempts to automate this game of twenty questions. Below we see the result of running the script on the Rotokas dictionary file ROTRT.DIC:

$ python bin/query-template-structure.py samples/Rotokas/ROTRT.DIC
\ps     845     100%
\ge     845     100%
\gp     840     99%
\dt     838     99%
\xp     705     83%
\xe     705     83%
\ex     705     83%
\rt     339     40%
\nt     162     19%
\cmt    119     14%
\eng    87      10%
\sf     51      6%
\rdp    36      4%
\arg    31      3%
\cd     28      3%
\sa     20      2%
\cm     19      2%
\ig     10      1%
\dx     9       1%
\vx     8       0%
\alt    8       0%
\cl     7       0%
\am     7       0%
\wf     2       0%
\sc     1       0%

The output tells us that the fields \ps and \ge are found in all of the entries. Two others (\gp and \dt) appear in virtually all entries, and the rest appear with varying degrees of frequency, from \xp, \xe, and \ex (83% of entries) down to \sc, which occurs in a single entry.

The script works as follows. First, the path to a Shoebox dictionary file is obtained from the command line. Second, the file is parsed into a dictionary object and a list of entries is retrieved with the function get_entries(), which, given the path to a Shoebox dictionary, returns a list of entries.

def get_entries(filepath) :
    fp = StandardFormatFileParser(filepath)
    sff = fp.parse()
    return sff.getEntries()

The function process_entries() then keeps track of the distribution of fields across entries using a dictionary data structure.

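def process_entries(entries) :
    # Maps each field marker to the number of times it has been seen.
    counter = {}
    for e in entries :
        for fm in e.getFieldMarkers() :
            try :
                counter[fm] = counter[fm] + 1
            except KeyError :
                # First occurrence of this field marker.
                counter[fm] = 1
    return counter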
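The same tally can be written without the try/except block. The sketch below uses collections.Counter and, as an assumption made purely for illustration, represents each entry as a plain list of (marker, value) pairs rather than as an entry object from the shoebox module. It counts each marker at most once per entry, so the totals are numbers of entries rather than numbers of fields.

```python
from collections import Counter


def count_field_markers(entries):
    """Count, for each field marker, the number of entries containing it.

    Each entry is assumed to be a list of (marker, value) pairs.
    """
    counter = Counter()
    for entry in entries:
        # A set ensures a marker repeated within an entry is counted once.
        markers = {marker for marker, _value in entry}
        counter.update(markers)
    return counter
```

For the sample entry for korau, which has two \ge fields, this adds one to the count for ge, not two.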

The results are then printed out by the function print_results(), which sorts the field markers by their counts in ascending order and then reverses that order. The result is that the most common field markers come first and the least common field markers come last. The field markers are then printed out along with the number of entries in which they appear and what percentage of all entries that represents.

def print_results(entries, counter) :
    totalEntries = len(entries)
    # Order the field markers from most to least frequent.
    fieldMarkers = sort_by_value(counter)
    fieldMarkers.reverse()
    for fieldMarker in fieldMarkers :
        numEntries = counter[fieldMarker]
        pctEntries = (100.0 * numEntries) / totalEntries
        # "\\" prints a literal backslash; %i truncates the percentage to a whole number.
        print("\\%s\t%i\t%i%%" % (fieldMarker, numEntries, pctEntries))
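The helper sort_by_value() is not shown in this tutorial. A minimal sketch of what such a helper might look like is given below, together with a hypothetical format_report() function that builds the report lines instead of printing them. Both are assumptions for illustration, not the original implementation.

```python
def sort_by_value(counter):
    """Return the keys of a dictionary ordered by ascending value."""
    return sorted(counter, key=counter.get)


def format_report(counter, total_entries):
    """Build one report line per field marker, most frequent first."""
    lines = []
    for marker in reversed(sort_by_value(counter)):
        num_entries = counter[marker]
        # int() truncates, which is why 8 of 845 entries shows as 0%.
        pct_entries = int(100.0 * num_entries / total_entries)
        lines.append("\\%s\t%i\t%i%%" % (marker, num_entries, pct_entries))
    return lines
```

Because the percentage is truncated rather than rounded, a marker found in 840 of 845 entries is reported as 99%, as in the output shown earlier.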

More sophisticated analysis of entry templates is of course possible. For example, this script does not reveal whether a given field has more than one value---i.e., it does not distinguish between unique and non-unique fields. We leave this as an exercise for the reader.
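As a starting point for that exercise, the sketch below counts, for each field marker, the number of entries in which the marker occurs more than once. It assumes, for illustration only, that each entry is available as a list of (marker, value) pairs. The function name find_repeated_markers() is hypothetical.

```python
from collections import Counter


def find_repeated_markers(entries):
    """Count, per marker, the entries in which that marker is repeated."""
    repeated = Counter()
    for entry in entries:
        # Tally how often each marker occurs within this one entry.
        occurrences = Counter(marker for marker, _value in entry)
        for marker, count in occurrences.items():
            if count > 1:
                repeated[marker] += 1
    return repeated
```

A marker that never shows up in the result behaves as a unique field in the data at hand, although the template may still allow it to repeat.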

Case 7: Validating Field Data Against Range Sets

One particularly useful feature of Shoebox is the ability to restrict the possible values of a field. For example, part of speech information usually refers to a closed set of categories (e.g., Noun, Verb, Adjective, Adverb, etc.). In the part-of-speech field of a Shoebox lexicon, one may therefore wish to limit the possible values to a fixed inventory (e.g., N, V, ADJ, and ADV), which in Shoebox is called a range set.

To see the range set for a particular field, open the Database pull-down menu and select Properties, as shown in the following screenshot.

You can then select a specific field from the list of those recognized by the database, as shown in the following screenshot.

From this list, we will select a particular field and examine its range set. In the following screenshot, the part-of-speech field has been selected.

If we select the range set for the part-of-speech field, we can see whether the field has a range set defined for it---in this case, it does---and which elements are in it.

Unfortunately, when you add or edit a range set within Shoebox, any old entries that are in conflict with the new range set will not be automatically flagged. In fact, the only time that Shoebox enforces the range set for a data field is when you attempt to save changes to that field. In other words, only when an entry is created or edited will the range set be enforced.

What is interesting about this particular problem is that its solution involves two files: the actual Shoebox dictionary file and a metadata file used by Shoebox which defines the range set. Below we contrast two metadata definitions of the part-of-speech (\ps) field: one with and one without a range set.

No Range Set:

\+mkr ps
\nam Part of Speech
\lng Default
\mkrOverThis fri
\-mkr

Range Set:

\+mkr ps
\nam Part of Speech
\lng Default
\rngset N V
\mkrOverThis fri
\-mkr
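To make the structure of these definitions concrete, here is a rough sketch of how range sets could be read out of metadata text of this shape using plain Python. It is not the MetadataParser from the shoebox module. It assumes that each marker definition is delimited by \+mkr and \-mkr lines and that the values after \rngset are separated by whitespace, as in the example above. The function name parse_range_sets() is hypothetical.

```python
def parse_range_sets(typ_text):
    """Map each field marker to the set of values in its range set.

    Markers without a range set do not appear in the result.
    """
    range_sets = {}
    current_marker = None
    for line in typ_text.splitlines():
        parts = line.split()
        if not parts:
            continue
        tag, values = parts[0], parts[1:]
        if tag == "\\+mkr" and values:
            current_marker = values[0]      # start of a marker definition
        elif tag == "\\-mkr":
            current_marker = None           # end of a marker definition
        elif tag == "\\rngset" and current_marker is not None:
            range_sets.setdefault(current_marker, set()).update(values)
    return range_sets
```

Applied to the second definition above, this yields a mapping from ps to the set containing N and V. Applied to the first, it yields an empty mapping.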

Using the NLTK, we can validate the Shoebox file against the metadata and ensure that all of the field values conform to the defined range sets for their field markers (see Parsing and Validation for an explanation of the validation errors).

The script validate-shoebox.py takes a Shoebox dictionary file and a dictionary type file and validates the dictionary file against the metadata of the dictionary type file. It is run as follows:

$ python tutorial/scripts/validate-shoebox.py --s=samples/Frisian1/FriRt.dic --m=samples/Frisian1/FrisianD.typ

In this case, running the script should do nothing. When a dictionary conforms to its metadata, the script produces no output. It is only when there are discrepancies between data and metadata that output is produced. To see how this works, we will run the same script on a modified version of the metadata, FrisianDAlt.typ, where the field marker ps has the range set of N and V. When the script is run on this alternative metadata, the results are quite different.

$ python tutorial/scripts/validate-shoebox.py --s=samples/Frisian1/FriRt.dic --m=tutorial/FrisianDAlt.typ
[\_sh v3.0  400  Frisian Dictionary]
Traceback (most recent call last):
  File "tutorial/scripts/validate-shoebox.py", line 37, in ?
    ev.validate()
  File "/home/stuart/workspace/Shoebox/shoebox/shoebox/shoebox.py", line 395, in validate
    raise BadFieldValue(BadFieldValue.FIELD_VALUE_ERROR_RANGE_SET, e, f, fmm)
shoebox.shoebox.BadFieldValue: 'Range Set' error in '\ps' field of record 4!
Record:
\fri -ber
\ps V>Adj
\g able

To understand what is going on here, we need to look at how the script works.

  1. The filenames for the Shoebox dictionary and the metadata file are taken from the command line.
  2. A MetadataParser is used to construct a metadata object from the metadata file.
  3. A StandardFormatFileParser is used to construct a standard format file object from the dictionary file.
  4. A ShoeboxValidator object uses the metadata object and the standard format file object to validate the dictionary data.

If the dictionary data conforms to the range set in the metadata, nothing happens: the validate() method simply returns true. If the dictionary data does not conform to the range set in the metadata, a BadFieldValue exception is raised.

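import sys
from optparse               import OptionParser
from shoebox.shoebox        import MetadataParser, ShoeboxValidator
from shoebox.standardformat import StandardFormatFileParser

# Read the file paths from the command line (assumed setup; the option
# names follow the --s/--m invocation shown above).
parser = OptionParser()
parser.add_option("--s", dest="shoebox",  help="path to the Shoebox dictionary file")
parser.add_option("--m", dest="metadata", help="path to the dictionary type file")
(options, args) = parser.parse_args()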

# Deal with metadata
fo = open(options.metadata, 'rU')
mdFc = fo.read()
fo.close()
mp = MetadataParser(mdFc)
md = mp.parse()

# Deal with Shoebox
fo = open(options.shoebox, 'rU')
sbFc = fo.read()
fo.close()
fp = StandardFormatFileParser(sbFc)
fp.setHeadFieldMarker(md.getHeadFieldMarker())
sb = fp.parse()

# Validate
ev = ShoeboxValidator()
ev.setMetadata(md)
ev.setShoebox(sb)
ev.validate()

If there are any inconsistencies, the method validate() will raise a BadFieldValue exception.
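The core of a range set check is simple enough to sketch in a few lines of plain Python. The version below is only an illustration of the idea, not the ShoeboxValidator implementation. It assumes that entries are lists of (marker, value) pairs, that range sets are given as a mapping from marker to a set of permitted values, and that records are numbered from 1. RangeSetError is a hypothetical stand-in for BadFieldValue.

```python
class RangeSetError(ValueError):
    """Hypothetical stand-in for the BadFieldValue exception."""


def validate_range_sets(entries, range_sets):
    """Raise RangeSetError at the first value outside its range set.

    Fields whose marker has no range set are not checked.
    """
    for number, entry in enumerate(entries, start=1):
        for marker, value in entry:
            allowed = range_sets.get(marker)
            if allowed is not None and value not in allowed:
                raise RangeSetError(
                    "'Range Set' error in '\\%s' field of record %d: %r"
                    % (marker, number, value)
                )
    return True
```

With a range set of N and V for ps, a record whose part of speech is V>Adj triggers the error, which corresponds to the failure reported in the traceback above.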

Case 8: Finding Entries that Conform to a Particular Profile

One common task in the analysis of dictionaries is the extraction of entries conforming to a particular profile. For example, it is sometimes useful to find words that are of a particular length or words that begin with a particular prefix or end with a particular suffix. By way of illustration, the script find-long-vowel-final-entries.py extracts all dictionary entries that end with a long vowel. Here we illustrate its use with ROTRT.DIC:

$ python bin/find-long-vowel-final-entries.py samples/Rotokas/ROTRT.DIC
kaa [V.A] 'gag'
kaa [V.B] 'strangle'
kaa [N.M] 'cooking banana'
kaepaa [N.N] 'wheelbarrow/basket'
kakupaa [N.N] 'landslide/mudslide'
...

Here's how it works. The dictionary file is first parsed using the StandardFormatFileParser. The entries obtained are then iterated over in a for loop. As the loop goes through each entry, various fields are obtained from it, and the entry is printed out if its head field matches the regular expression that defines a word-final long vowel. (The regular expression simply looks for a string-final sequence of two identical vowel letters. For more on regular expressions, see this regular expression tutorial.)
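The script's own regular expression is not reproduced here, but a pattern of the kind described can be sketched as follows. The sketch assumes that the vowel letters are a, e, i, o, and u and that head words are in lower case. The names LONG_VOWEL_FINAL and ends_with_long_vowel() are chosen for illustration.

```python
import re

# A vowel letter, captured as group 1, followed by the same letter again
# (the backreference \1), at the end of the string.
LONG_VOWEL_FINAL = re.compile(r"([aeiou])\1$")


def ends_with_long_vowel(headword):
    """Return True if the head word ends in two identical vowel letters."""
    return LONG_VOWEL_FINAL.search(headword) is not None
```

The backreference is what requires the two vowels to be identical: kaa matches, whereas korau does not, because its final two vowel letters differ.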

The desired profile for entries can be made stricter by adding further conditions. For example, instead of printing all entries ending with a long vowel, the script can be modified so that it prints only entries that are nouns and end with a long vowel. This is done by obtaining the part-of-speech field in addition to the head field and checking whether the part of speech is a noun. In the case of ROTRT.DIC, the part-of-speech field uses abbreviations that conform to a particular convention: major lexical categories (e.g., noun, verb) are indicated by a single capital letter (e.g., N, V) followed by a dot and a subclass abbreviation (e.g., N.N for neuter noun or N.M for masculine noun).

$ python bin/find-long-vowel-final-noun-entries.py samples/Rotokas/ROTRT.DIC
kaa [N.M] 'cooking banana'
kaepaa [N.N] 'wheelbarrow/basket'
kakupaa [N.N] 'landslide/mudslide'
kapopaa [N.N] 'wrench/spanner'
kapupiepaa [N.N] 'pincers'
...
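The combined test can be sketched as below. This is an illustration, not the text of find-long-vowel-final-noun-entries.py. It assumes that entries have already been reduced to (head word, part of speech, gloss) tuples and that the major category is whatever precedes the first dot in the part-of-speech value. The function names are hypothetical.

```python
import re

# Two identical vowel letters at the end of the string.
LONG_VOWEL_FINAL = re.compile(r"([aeiou])\1$")


def is_noun(part_of_speech):
    """Return True if the major category (before the first dot) is N."""
    return part_of_speech.split(".")[0] == "N"


def find_long_vowel_final_nouns(entries):
    """Keep the entries that are nouns and end in a long vowel."""
    return [
        (headword, pos, gloss)
        for headword, pos, gloss in entries
        if is_noun(pos) and LONG_VOWEL_FINAL.search(headword)
    ]


def format_entry(headword, pos, gloss):
    """Format an entry in the style of the output shown above."""
    return "%s [%s] '%s'" % (headword, pos, gloss)
```

Further conditions, such as a minimum word length, can be added to the same filter with additional "and" clauses.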

Conclusion

In this tutorial we have seen how the NLTK can make it easier to manipulate dictionary files produced by Shoebox. In another tutorial, we will see how some of the same tools can be used to work with interlinear texts produced by the same program.

Links

Python   Natural Language Toolkit (NLTK) Homepage
NLTK Tutorials
Shoebox   Python Homepage
Shoebox Homepage
Toolbox Homepage
User Tips for Shoebox


The author may be contacted at stuart-at-zapata-dot-org. Many thanks to those who read drafts of this tutorial and provided feedback: Steven Bird, Brian McWhinney, and Loretta O'Connor. All errors are of course my own.