Handling Shoebox Interlinear Texts with the Natural Language Toolkit

Stuart Robinson

This tutorial introduces the Shoebox capabilities of the Natural Language Toolkit (NLTK) for Python. Here we concentrate on interlinearized texts.

  1. Interlinear Text Formatting in Shoebox
  2. Manipulating Interlinear Text with nltk.interlineartext
  3. Some Case Studies
  4. Conclusion
  5. Links


Interlinear Text Formatting in Shoebox

Interlinear texts are texts that have been broken down into lines that have multiple levels of information (tiers). (For more on the nature of interlinear text, see Towards a General Model for Interlinear Text by Cathy Bow, Baden Hughes, and Steven Bird.) Typically, there are at least four tiers, listed below:

Label   Description
Text   the actual text being analyzed
Morpheme breakdown   provides a breakdown of the words into morphemes
Morpheme gloss   provides a gloss (short description) of each individual morpheme
Part-of-speech   provides labels for the part-of-speech (grammatical function) of each word/morpheme

This can be illustrated with the following excerpt from a story in an elementary school reader in Rotokas (Papuan, Bougainville) (SampleText1.txt):

\ref 15
\t Igei      uuko  regapaiei.
\m igei      uuko  rega       -pa       -i         -ei
\g 1.PL.EXCL water thirst for -PROG     -1.PL.EXCL -PRES
\p PRO       N.N   V.I        -SUFF.V.3 -SUFF.V.4  -SUFF.VI.5
\f Mipela i nek drai.
\fe We are thirsty.

This line of text is in standard format and consists of the following fields:

Field Marker   Description
\ref   head field which marks the beginning of a new line and labels it
\t   word-by-word breakdown of text
\m   morpheme-by-morpheme breakdown of each word
\g   short description of morpheme's meaning (gloss)
\p   part-of-speech label for morpheme
\f   free translation in Tok Pisin
\fe   free translation in English
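
To make the field structure concrete, here is a minimal sketch that splits one standard-format record into (marker, value) pairs using only the Python standard library. It is not part of the toolkit: the names RECORD, FIELD and parse_fields are made up for illustration, and the sketch assumes that each field starts on its own line with a backslash and that any other non-blank line continues the previous field.

```python
import re

# The Rotokas record shown above, kept as a raw string so that the
# backslashes in the field markers are not treated as escapes.
RECORD = r"""\ref 15
\t Igei      uuko  regapaiei.
\m igei      uuko  rega       -pa       -i         -ei
\g 1.PL.EXCL water thirst for -PROG     -1.PL.EXCL -PRES
\p PRO       N.N   V.I        -SUFF.V.3 -SUFF.V.4  -SUFF.VI.5
\f Mipela i nek drai.
\fe We are thirsty.
"""

# A field is a backslash, a marker, one optional space, and the value.
FIELD = re.compile(r"^\\(\S+) ?(.*)$")


def parse_fields(record):
    """Return a list of (marker, value) pairs, one per field (hypothetical helper)."""
    fields = []
    for raw in record.splitlines():
        match = FIELD.match(raw)
        if match:
            fields.append((match.group(1), match.group(2)))
        elif fields and raw.strip():
            # Assumption: a non-blank line without a marker continues the
            # previous field (Shoebox wraps long lines this way).
            marker, value = fields[-1]
            fields[-1] = (marker, value + " " + raw.strip())
    return fields


for marker, value in parse_fields(RECORD):
    print("%-4s %s" % (marker, value))
```

Only the single space after the marker is removed, so the internal spacing of the \t, \m, \g and \p values is left untouched.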

First, there are three columns, defined by the words in the first tier (marked by \t). This is shown more clearly below:

1   2   3
Igei   uuko   regapaiei.
igei   uuko   rega -pa -i -ei
1.PL.EXCL   water   thirst for -PROG -1.PL.EXCL -PRES
PRO   N.N   V.I -SUFF.V.3 -SUFF.V.4 -SUFF.VI.5

Second, within these columns, there are in some cases multiple columns, defined by the morphemes of the word. This is the case for the third word in our example:

rega   -pa   -i   -ei
thirst for   -PROG   -1.PL.EXCL   -PRES
V.I   -SUFF.V.3   -SUFF.V.4   -SUFF.VI.5
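
The two levels of columns can be recovered from character offsets alone. The sketch below is an illustration rather than the toolkit's implementation: it takes the word offsets from the \t tier, cuts every tier at those offsets, and then repeats the step inside each word using the offsets of the \m tier. The names TIERS, token_starts, slice_at and align are hypothetical, and the approach assumes that the tiers are padded with spaces (not tabs) and that their markers have the same length, as in the Rotokas excerpt.

```python
import re

# Aligned tiers of the Rotokas example with the field markers removed.
# The spacing is significant and is copied from the excerpt unchanged.
TIERS = {
    "t": "Igei      uuko  regapaiei.",
    "m": "igei      uuko  rega       -pa       -i         -ei",
    "g": "1.PL.EXCL water thirst for -PROG     -1.PL.EXCL -PRES",
    "p": "PRO       N.N   V.I        -SUFF.V.3 -SUFF.V.4  -SUFF.VI.5",
}


def token_starts(tier):
    """Return the offset at which each whitespace-separated token begins."""
    return [match.start() for match in re.finditer(r"\S+", tier)]


def slice_at(tier, starts):
    """Cut a tier into pieces that begin at the given offsets (padding kept)."""
    ends = starts[1:] + [None]
    return [tier[a:b] for a, b in zip(starts, ends)]


def align(tiers):
    """Return one dict per word with its (form, gloss, part-of-speech) triples."""
    word_starts = token_starts(tiers["t"])
    cols = {marker: slice_at(tier, word_starts) for marker, tier in tiers.items()}
    words = []
    for i, word in enumerate(cols["t"]):
        # Morpheme boundaries come from the \m tier of this word only.
        morph_starts = token_starts(cols["m"][i])
        parts = [
            [piece.strip() for piece in slice_at(cols[marker][i], morph_starts)]
            for marker in ("m", "g", "p")
        ]
        words.append({"word": word.strip(), "morphemes": list(zip(*parts))})
    return words


for entry in align(TIERS):
    print(entry["word"], entry["morphemes"])
```

Cutting by offset rather than splitting on spaces is what keeps a gloss such as "thirst for" together as a single cell.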

Although the formatting of interlinearized text is essentially the same format used for dictionaries---that is, standard format---it poses some unique challenges.

One challenge is the fact that a line of interlinearized text consists of some fields which are aligned with one another and others that are not, which we will refer to as aligned and non-aligned fields, respectively. Below we see the breakdown of the fields from the previous Rotokas example in terms of this distinction:

Aligned   \t, \m, \g, \p
Non-Aligned   \ref, \f, \fe

We can look at the text excerpt again with this distinction in mind. The aligned fields are provided below in ??? and the non-aligned fields in ???.

\ref   15
\t   Igei uuko regapaiei.
\m   igei uuko rega -pa -i -ei
\g   1.PL.EXCL water thirst for -PROG -1.PL.EXCL -PRES
\p   PRO N.N V.I -SUFF.V.3 -SUFF.V.4 -SUFF.VI.5
\f   Mipela i nek drai.
\fe   We are thirsty.

There is an additional distinction to be made within aligned fields, and that is between word-alignment and morpheme-alignment. The \t field defines the word alignment, whereas the \m field defines morpheme alignment.

Another challenge posed by interlinearized Shoebox texts is their two-level structure. In the Rotokas example provided above, the interlinear text line begins with the head field \ref, which identifies the beginning of a new line of interlinearized text. Some interlinearized texts have an additional level of structure that sits between the level of the text and the line. We will refer to this as paragraph structure, although it is normally identified by the head field \id, as shown in the following excerpt from a Frisian sample (FriSampl.txt):

\id First sample sentence from the tutorial
\ref 1
\t Berne      en   opgroeid         yn   Ynje,
\m bern -e    en   op-  groei -e    yn   Ynje
\g bear -PSTP and  up-  grow  -PSTP in   Indonesia
\p V    -Tns  Conj Dir- V     -Tns  Prep N

\f Born and raised in Indonesia, 

\ref 2
\t sil  dêr   syn  grêf  wêze.
\f 

\id frilake.txt Tutorial Exercise
\ref 1
\t Fryslân yn maitiidpracht, wylst de sinne skynde oer
de marren en de wide greiden mei fee.
\f 

\ref 2
\t Noarwegen, doe't de hege sinne dreamde yn 'e fjorden.
\f 

\ref 3
\t Heite en Memme lân.
\f 

\ref 4
\t Mar sines?
\f 

\ref 5
\t Hij hat der nea werom west.
\f 
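
The two-level structure can be sketched in a few lines of plain Python. The example below is only an illustration under stated assumptions: SAMPLE is an abridged version of the Frisian excerpt with the alignment padding removed, group_records is a hypothetical helper rather than a toolkit function, and wrapped continuation lines are simply skipped. A text that has no \id field at all is treated as a single unnamed paragraph.

```python
# Abridged from the Frisian excerpt above (assumption: padding removed).
SAMPLE = r"""\id First sample sentence from the tutorial
\ref 1
\t Berne en opgroeid yn Ynje,
\f Born and raised in Indonesia,

\ref 2
\t sil dêr syn grêf wêze.
\f

\id frilake.txt Tutorial Exercise
\ref 1
\t Heite en Memme lân.
\f
"""


def group_records(text, para_head="id", line_head="ref"):
    """Group fields into paragraphs (by \\id) and lines (by \\ref)."""
    paragraphs = []
    for raw in text.splitlines():
        if not raw.startswith("\\"):
            continue  # blank or wrapped line: ignored in this sketch
        marker, _, value = raw[1:].partition(" ")
        if marker == para_head:
            paragraphs.append({"id": value, "lines": []})
        elif marker == line_head:
            if not paragraphs:
                # No paragraph head seen yet: open an implicit paragraph.
                paragraphs.append({"id": None, "lines": []})
            paragraphs[-1]["lines"].append({"ref": value, "fields": []})
        elif paragraphs and paragraphs[-1]["lines"]:
            paragraphs[-1]["lines"][-1]["fields"].append((marker, value))
    return paragraphs


for paragraph in group_records(SAMPLE):
    print(paragraph["id"], [line["ref"] for line in paragraph["lines"]])
```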

Manipulating Interlinear Text with nltk.interlineartext

The heavy lifting in interlinear text processing with the NLTK is done by the InterlinearTextParser object, which is a factory object that will parse an interlinear Shoebox file and build an InterlinearText object. This can be done with only a few lines of Python code, as illustrated below:

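import sys
from shoebox.text import TextParser

# The path to the Shoebox text file is the first command-line argument.
try :
  filepath = sys.argv[1]
except IndexError :
  # No file was given: print a usage hint and stop instead of parsing None.
  sys.stderr.write("Usage: %s FILE\n" % sys.argv[0])
  sys.exit(1)

# Parse the file and build the interlinear text object.
tp = TextParser(filepath)
t = tp.parse()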

The TextParser parses the Shoebox file by delegating the work to a number of other parsers. The main two are the ParagraphParser and LineParser. The ParagraphParser simply builds Paragraph objects, which are little more than collections of Line objects. Most of the sophistication of the object model resides in these Line objects, which possess a collection of fields and provide various means for accessing the information contained within them.

In order to parse an interlinearized text, a number of different pieces of information are required. They are listed below:

Information   Status   Default Value   Description
paragraph head field   optional   \id   divides the text into paragraphs
line head field   required   \ref   divides the text into lines
text field marker   required   \t   field in which the original text is broken down into words
morpheme field marker   required   \m   field in which each word is broken down into morphemes
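
As a rough illustration of how these settings might be used, the sketch below stores the default markers from the table and reports which required markers are absent from one line of text. The names DEFAULT_MARKERS and missing_required are hypothetical and do not belong to the toolkit; markers are written without the leading backslash.

```python
# Default markers from the table above, without the leading backslash.
DEFAULT_MARKERS = {
    "paragraph_head": "id",   # optional
    "line_head": "ref",       # required
    "text": "t",              # required
    "morpheme": "m",          # required
}

# Roles that the table lists as required.
REQUIRED_ROLES = ("line_head", "text", "morpheme")


def missing_required(line_markers, markers=DEFAULT_MARKERS):
    """Return the required roles whose marker does not occur in one line."""
    return [role for role in REQUIRED_ROLES if markers[role] not in line_markers]


# The Rotokas line has every required field ...
print(missing_required(["ref", "t", "m", "g", "p", "f", "fe"]))
# ... whereas a line that has not been interlinearized yet lacks \m.
print(missing_required(["ref", "t", "f"]))
```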

COMMENT ON METADATA

Once an InterlinearText object has been obtained, the information within the interlinear text file can be queried using its various methods. The various possibilities afforded by its object model will be presented and discussed in the case studies provided in the following section.

Some Case Studies

For these demonstrations of nltk.shoebox functionality, we will manipulate some sample Shoebox files (in the folder Samples). Because the sample files can be useful when learning how to use Shoebox, most users prefer to keep an unmodified version of them. Therefore, we recommend making a backup of these files before trying the scripts below.

Case 1: Counting Paragraphs and Lines

Perhaps the simplest analysis that one could perform on an interlinear text in Shoebox is to determine how many paragraphs it contains and how many lines each contains. This is what the script shoebox-text-tally.py does. It is run as follows:

$ python bin/shoebox-text-tally.py samples/Rotokas/SampleText1.txt
Par.	Line
1	34
---	---
1	34

The main work is done by the tally() function.

def tally(txt) :
    # Running totals across the whole text.
    totalParagraphs = 0
    totalLines = 0
    print("Par.\tLine")
    for p in txt.getParagraphs() :
        totalParagraphs += 1
        numberLines = len(p.getLines())
        totalLines += numberLines
        # One row per paragraph: its number and the number of lines it holds.
        print("%i\t%i" % (totalParagraphs, numberLines))
    print("---\t---")
    # Final row: total number of paragraphs and total number of lines.
    print("%i\t%i" % (totalParagraphs, totalLines))

A list of Paragraph objects is obtained using the accessor method getParagraphs(). The list is iterated over and from each paragraph a list of Line objects is obtained with the accessor method getLines(). The number of lines in each paragraph is printed as the loop runs, and the totals for paragraphs and lines are printed at the end.
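
The counting and formatting logic can also be tried without a parsed text. The following sketch assumes that the number of lines in each paragraph is already known and builds the same report as a string. The function tally_report is a hypothetical stand-in written for this illustration, not part of the script.

```python
def tally_report(line_counts):
    """Build the tally table from a list holding the line count of each paragraph."""
    rows = ["Par.\tLine"]
    for number, count in enumerate(line_counts, start=1):
        rows.append("%i\t%i" % (number, count))
    rows.append("---\t---")
    # Last row: number of paragraphs, then the total number of lines.
    rows.append("%i\t%i" % (len(line_counts), sum(line_counts)))
    return "\n".join(rows)


# A text with a single paragraph of 34 lines, as in the run shown above.
print(tally_report([34]))
```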

Case 2: Reformatting Interlinear Text

Interlinear text provides a great deal of analytical information, but sometimes one may wish to display only a subset of that information, providing essentially a stripped-down version of a text. The script shoebox-text-reformat.py prints out only the text itself (with no morphemic analysis) on one line, followed by its free translation in English. It is run as follows:

$ python bin/shoebox-text-reformat.py samples/Rotokas/SampleText1.txt
Gaurai raga oisoa toupareve vegoaro.
Gaurai lived alone in the bush.

Vaiterei kaakau vaio rera vaitereiaro taporo oisoa toupareve.
He lived with his two dogs.

Rera raga oisoa toupareve uva viapau oisoa rorupareve.
He was very lonely and unhappy.

...

The main work is done by the reformat() function.

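import re

def reformat(txt) :
    for p in txt.getParagraphs() :
        for l in p.getLines() :
            # Text tier (\t) and English free translation (\fe) of this line.
            rawText = l.getFieldValueByFieldMarker("t")
            english = l.getFieldValueByFieldMarker("fe")
            # Collapse the padding that aligns the words with their glosses.
            cookedText = re.sub(r" +", " ", rawText)
            print("%s\n%s\n" % (cookedText, english))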

A list of Paragraph objects is obtained using the accessor method getParagraphs(). The list is iterated over and from each paragraph a list of Line objects is obtained with the accessor method getLines(). From each line, the original Rotokas (\t) and its free translation in English (\fe) are obtained. Since the original Rotokas is word-aligned with its glossing, extraneous spaces must be removed, and this is done using regular expression substitution:

cookedText = re.sub(r" +", " ", rawText)

The Rotokas and its English translation are then printed out in the order in which they are iterated through.
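
The effect of the substitution is easy to check in isolation. In this small sketch the function name collapse_spaces is made up for illustration, and the input is the \t value of the Rotokas example from the first section.

```python
import re


def collapse_spaces(raw_text):
    """Replace every run of spaces with a single space."""
    return re.sub(r" +", " ", raw_text)


raw_text = "Igei      uuko  regapaiei."
print(collapse_spaces(raw_text))
```

The pattern matches spaces only. If a file pads its tiers with tabs, a pattern such as r"\s+" would be needed instead.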

Case 3: Searching Multiple Tiers for a Specific Morpheme

One of the virtues of an interlinear text is that the breakdown of the text into words and morphemes makes it possible to search for particular words or morphemes on the basis of more than their surface form. It is possible to search for a particular morpheme on the basis of, say, its surface form as well as its gloss or part-of-speech.

$ python bin/shoebox-text-search-morphemes.py -m='pa' -g='PROG' samples/Rotokas/SampleText1.txt
\ref Paragraph 1
\t Gaurai raga oisoa    toupareve                           vegoaro.
\m Gaurai raga oisoa    tou -pa       -re         -ve       vegoa  -ro
\g name   only always   be  -PROG     -3.SG.M     -SUB      jungle -PL
\p N.PN   ADV  ADV.TEMP V.B -SUFF.V.3 -SUFF.V.B.4 -SUFF.V.5 N.N    -SUFF.N.???
\f Gaurai em wanpela i save stap long bus.
\fe Gaurai lived alone in the bush.

...

In the example provided, we are searching the text SampleText1.txt for lines containing the morpheme pa when it has the gloss PROG. It will not print a line such as the following, since it does not have the searched-for gloss---i.e., it has DERIV rather than PROG:

\ref Paragraph 25
\t Vokiarovi kareroepa                      Gaurai voroopa      tapiva.
\m vokiarovi kare   -ro         -epa        Gaurai voroo -pa    tapi  -va
\g afternoon return -3.SG.M     -RP         name   hunt  -DERIV place -ABL
\p TIME      V.A    -SUFF.V.A.4 -SUFF.V.A.5 N.PN   V.B   -SUFF  N.N   -ENC.N.6
\f Long apinun Gaurai i bin kam bek long painim pik.
\fe In the afternoon Gaurai arrived back from hunting.

The morphemes of a line can be extracted with the method getMorphemes(). The morphemes can then be iterated over and their surface form and gloss examined. If the surface form is pa and the gloss is PROG, the line is printed and the next one examined.

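def search(txt, surfaceForm, gloss) :
    for p in txt.getParagraphs() :
        for l in p.getLines() :
            for m in l.getMorphemes() :
                sf = m.getForm()
                gl = m.getGloss()
                # Compare against the requested form and gloss, not fixed values.
                if sf == surfaceForm and gl == gloss :
                    print("%s\n" % l.getRawText())
                    # One match is enough; move on to the next line.
                    break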

Note, however, that the extraction of morphemes will only work if the InterlinearTextParser is provided with the field markers for the tiers that contain the surface form and gloss of the morphemes.

def main() :
    surfaceForm, gloss, filepath = handle_options()
    itp = InterlinearTextParser(filepath)
    itp.setMorphemeFieldMarker("m")
    itp.setMorphemeGlossFieldMarker("g")
    txt = itp.parse()
    search(txt, surfaceForm, gloss)
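
The matching step itself does not depend on the parser, so it can be illustrated with plain data. In the sketch below each line is a dict holding a reference and a list of (form, gloss) pairs taken from the two Rotokas lines shown above. The names LINES and find_lines are hypothetical, and stripping the affix hyphens before comparing is an assumption of this sketch, since the tutorial does not say how the toolkit normalizes forms and glosses.

```python
# (form, gloss) pairs copied from the two example lines above.
LINES = [
    {"ref": "Paragraph 1",
     "morphemes": [("tou", "be"), ("-pa", "-PROG"), ("-re", "-3.SG.M"), ("-ve", "-SUB")]},
    {"ref": "Paragraph 25",
     "morphemes": [("voroo", "hunt"), ("-pa", "-DERIV"), ("tapi", "place"), ("-va", "-ABL")]},
]


def find_lines(lines, form, gloss):
    """Return the refs of lines that contain a morpheme with this form and gloss."""
    hits = []
    for line in lines:
        for morpheme_form, morpheme_gloss in line["morphemes"]:
            # Assumption: affix hyphens are ignored when comparing.
            if morpheme_form.strip("-") == form and morpheme_gloss.strip("-") == gloss:
                hits.append(line["ref"])
                break  # one match per line is enough
    return hits


print(find_lines(LINES, "pa", "PROG"))
print(find_lines(LINES, "pa", "DERIV"))
```

Searching on the form alone would return both lines. The gloss is what separates the progressive suffix from the derivational one.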

Case 4: Reformatting Morpheme Alignment

The ability to extract morphemes makes it possible to achieve other types of reformatting. ???.

$ python bin/shoebox-text-realign-morphemes.py samples/Rotokas/SampleText1.txt
???

???

???

Case 5: Checking an Interlinear Text Against a Modified Dictionary

One serious shortcoming of Shoebox is that once a text has been interlinearized, it cannot be automatically re-synched against the dictionary originally used to gloss it. It is therefore useful to be able to update an interlinear text automatically once the dictionary has been updated. Here we show how an interlinear text can be compared to a dictionary for updating.
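
The general idea of such a comparison can be sketched independently of any particular script. The example below assumes that the dictionary has already been reduced to a mapping from each morpheme form to the set of glosses it currently allows. The function outdated_glosses and all of the sample data are made up for illustration, including the gloss -CONT, which stands in for a gloss that the dictionary no longer uses.

```python
def outdated_glosses(morphemes, lexicon):
    """Return the (form, gloss) pairs whose gloss the lexicon does not list for that form."""
    problems = []
    for form, gloss in morphemes:
        known = lexicon.get(form)
        if known is None or gloss not in known:
            problems.append((form, gloss))
    return problems


# Hypothetical lexicon: form -> glosses allowed by the updated dictionary.
LEXICON = {
    "rega": {"thirst for"},
    "-pa": {"-PROG", "-DERIV"},
}

# Hypothetical text morphemes; "-CONT" is an invented outdated gloss.
TEXT_MORPHEMES = [("rega", "thirst for"), ("-pa", "-CONT")]

print(outdated_glosses(TEXT_MORPHEMES, LEXICON))
```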

??? ???.py ???:

$ python bin/???.py ???

???

???

???

Case 6: Using Interlinear Text Metadata

Shoebox stores information about texts in a file called a type definition file. This file contains information about the types of information found in the interlinear texts of a project. This metadata can be very useful when parsing an interlinear text and manipulating its contents.

???

Conclusion

In this tutorial we have seen how interlinear texts formatted by Shoebox can be manipulated using the functionality provided by the Natural Language Toolkit for Python. The formatting of these files was discussed as well as the object model that allows for their simple and easy processing. Finally, a number of case studies illustrating the functionality at work were provided, and these should illustrate all of the basics. For more detailed information, consult the NLTK documentation.


Links

Python   Natural Language Toolkit (NLTK) Homepage
NLTK Tutorials
Shoebox   Python Homepage
Shoebox Homepage
Toolbox Homepage
User Tips for Shoebox


The author may be contacted at stuart-at-zapata-dot-org.