This tutorial introduces the Shoebox capabilities of the Natural Language Toolkit (NLTK) for Python. Here we concentrate on interlinearized texts.
Interlinear texts are texts that have been broken down into lines, each carrying multiple levels of information (tiers). (For more on the nature of interlinear text, see Towards a General Model for Interlinear Text by Cathy Bow, Baden Hughes, and Steven Bird.) Typically, there are at least four tiers, listed below:
| Label | Description |
| Text | the actual text being analyzed |
| Morpheme breakdown | provides a breakdown of the words into morphemes |
| Morpheme gloss | provides a gloss (short description) of each individual morpheme |
| Part-of-speech | provides labels for the part-of-speech (grammatical function) of each word/morpheme |
This can be illustrated with the following excerpt from a story in an elementary school reader in Rotokas (Papuan, Bougainville) (SampleText1.txt):
\ref 15
\t Igei      uuko  regapaiei.
\m igei      uuko  rega       -pa       -i         -ei
\g 1.PL.EXCL water thirst for -PROG     -1.PL.EXCL -PRES
\p PRO       N.N   V.I        -SUFF.V.3 -SUFF.V.4  -SUFF.VI.5
\f Mipela i nek drai.
\fe We are thirsty.
This line of text is in standard format and consists of the following fields:
| Field Marker | Description |
| \ref | head field which marks the beginning of a new line and labels it |
| \t | word-by-word breakdown of text |
| \m | morpheme-by-morpheme breakdown of each word |
| \g | short description of morpheme's meaning (gloss) |
| \p | part-of-speech label for morpheme |
| \f | free translation in Tok Pisin |
| \fe | free translation in English |
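The field layout described in the table can be sketched in a few lines of Python. This is a minimal illustration only: `parse_fields` and `FIELD_RE` are hypothetical names, not part of NLTK, and the sketch assumes that every field starts with a backslash marker at the beginning of a line.

```python
import re

# Hypothetical helper (not part of NLTK): split one standard-format record
# into (marker, value) pairs. A field starts with a backslash marker at the
# beginning of a line and runs until the next marker.
FIELD_RE = re.compile(r"^\\(\S+)[ \t]*(.*)$")


def parse_fields(record):
    fields = []
    for raw in record.splitlines():
        match = FIELD_RE.match(raw)
        if match:
            fields.append((match.group(1), match.group(2).rstrip()))
        elif fields and raw.strip():
            # Continuation line: append it to the value of the previous field.
            marker, value = fields[-1]
            fields[-1] = (marker, (value + " " + raw.strip()).strip())
    return fields


RECORD = r"""\ref 15
\t Igei      uuko  regapaiei.
\m igei      uuko  rega       -pa       -i         -ei
\g 1.PL.EXCL water thirst for -PROG     -1.PL.EXCL -PRES
\p PRO       N.N   V.I        -SUFF.V.3 -SUFF.V.4  -SUFF.VI.5
\f Mipela i nek drai.
\fe We are thirsty.
"""

for marker, value in parse_fields(RECORD):
    print("%-4s %s" % (marker, value))
```

The values are kept with their internal padding intact, because the column positions inside the aligned tiers carry information.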
First, there are three columns, defined by the words in the first tier (marked by \t). This is shown more clearly below:
| 1 | 2 | 3 | | | |
| Igei | uuko | regapaiei. | | | |
| igei | uuko | rega | -pa | -i | -ei |
| 1.PL.EXCL | water | thirst for | -PROG | -1.PL.EXCL | -PRES |
| PRO | N.N | V.I | -SUFF.V.3 | -SUFF.V.4 | -SUFF.VI.5 |
Second, some of these columns are subdivided into further columns, defined by the morphemes of the word. This is the case for the third word in our example:
| rega | -pa | -i | -ei |
| thirst for | -PROG | -1.PL.EXCL | -PRES |
| V.I | -SUFF.V.3 | -SUFF.V.4 | -SUFF.VI.5 |
Although the formatting of interlinearized text is essentially the same format used for dictionaries---that is, standard format---it poses some unique challenges.
One challenge is the fact that a line of interlinearized text consists of some fields which are aligned with one another and others that are not, which we will refer to as aligned and non-aligned fields, respectively. Below we see the breakdown of the fields from the previous Rotokas example in terms of this distinction:
| Aligned | \t, \m, \g, \p |
| Non-Aligned | \ref, \f, \fe |
We can look at the text excerpt again with this distinction in mind. In the table below, the aligned fields spread across several columns, while each non-aligned field occupies a single cell.
| \ref | 15 | | | | | |
| \t | Igei | uuko | regapaiei. | | | |
| \m | igei | uuko | rega | -pa | -i | -ei |
| \g | 1.PL.EXCL | water | thirst for | -PROG | -1.PL.EXCL | -PRES |
| \p | PRO | N.N | V.I | -SUFF.V.3 | -SUFF.V.4 | -SUFF.VI.5 |
| \f | Mipela i nek drai. | | | | | |
| \fe | We are thirsty. | | | | | |
There is an additional distinction to be made within aligned fields, and that is between word-alignment and morpheme-alignment. The \t field defines the word alignment, whereas the \m field defines morpheme alignment.
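Word alignment and morpheme alignment can both be recovered from column positions. The following is a minimal sketch, not the NLTK implementation: the function names are hypothetical, and it assumes that the tiers are padded with spaces so that each morpheme, gloss, and word starts in the same column as the material it is aligned with.

```python
import re
from bisect import bisect_right


def token_starts(tier):
    """Return (start_offset, token) for each whitespace-delimited token."""
    return [(m.start(), m.group()) for m in re.finditer(r"\S+", tier)]


def slice_at(tier, starts):
    """Cut a tier at the given column offsets and strip the padding."""
    bounds = list(starts) + [None]
    return [tier[a:b].strip() for a, b in zip(bounds, bounds[1:])]


def align(text, morphemes, glosses):
    """Group (form, gloss) pairs under the word they are aligned with."""
    word_cols = token_starts(text)
    morph_cols = token_starts(morphemes)
    word_starts = [start for start, _ in word_cols]
    morph_starts = [start for start, _ in morph_cols]
    # Glosses may contain spaces ("thirst for"), so they are cut at the
    # morpheme columns instead of being split on whitespace.
    gloss_cells = slice_at(glosses, morph_starts)
    words = [(word, []) for _, word in word_cols]
    for (start, form), gloss in zip(morph_cols, gloss_cells):
        # The owning word is the last one starting at or before this column.
        index = max(bisect_right(word_starts, start) - 1, 0)
        words[index][1].append((form, gloss))
    return words


TEXT      = "Igei      uuko  regapaiei."
MORPHEMES = "igei      uuko  rega       -pa       -i         -ei"
GLOSSES   = "1.PL.EXCL water thirst for -PROG     -1.PL.EXCL -PRES"

for word, parts in align(TEXT, MORPHEMES, GLOSSES):
    print(word, parts)
```

Slicing by column rather than splitting on whitespace is the design choice that matters here: a gloss such as "thirst for" would otherwise be counted as two cells and shift every gloss after it.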
Another challenge posed by interlinearized Shoebox texts is their two-level structure. In the Rotokas example provided above, the interlinear text line begins with the head field \ref, which identifies the beginning of a new line of interlinearized text. Some interlinearized texts have an additional level of structure that sits between the level of the text and the line. We will refer to this as paragraph structure, although it is normally identified by the head field \id, as shown in the following excerpt from the Frisian sample (FriSampl.txt):
\id First sample sentence from the tutorial
\ref 1
\t Berne      en   opgroeid         yn   Ynje,
\m bern -e    en   op-  groei -e    yn   Ynje
\g bear -PSTP and  up-  grow  -PSTP in   Indonesia
\p V    -Tns  Conj Dir- V     -Tns  Prep N
\f Born and raised in Indonesia,
\ref 2
\t sil dêr syn grêf wêze.
\f
\id frilake.txt Tutorial Exercise
\ref 1
\t Fryslân yn maitiidpracht, wylst de sinne skynde oer de marren en de wide greiden mei fee.
\f
\ref 2
\t Noarwegen, doe't de hege sinne dreamde yn 'e fjorden.
\f
\ref 3
\t Heite en Memme lân.
\f
\ref 4
\t Mar sines?
\f
\ref 5
\t Hij hat der nea werom west.
\f
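The two-level structure can be illustrated with a short sketch that groups fields into paragraphs and lines. The function `split_paragraphs` and the dictionary layout are assumptions made for this illustration, not the NLTK object model, and the sample is an abridged version of the Frisian excerpt.

```python
def split_paragraphs(source, paragraph_marker="id", line_marker="ref"):
    """Group standard-format fields into paragraphs (\\id) and lines (\\ref)."""
    paragraphs = []
    for raw in source.splitlines():
        if not raw.startswith("\\"):
            continue  # sketch only: blank and continuation lines are ignored
        marker, _, value = raw[1:].partition(" ")
        value = value.strip()
        if marker == paragraph_marker:
            paragraphs.append({"id": value, "lines": []})
        elif marker == line_marker:
            if not paragraphs:
                # A text without \id fields is treated as one paragraph.
                paragraphs.append({"id": None, "lines": []})
            paragraphs[-1]["lines"].append({"ref": value, "fields": []})
        elif paragraphs and paragraphs[-1]["lines"]:
            paragraphs[-1]["lines"][-1]["fields"].append((marker, value))
    return paragraphs


# Abridged from the Frisian excerpt above.
SAMPLE = r"""\id First sample sentence from the tutorial
\ref 1
\t Berne en opgroeid yn Ynje,
\f Born and raised in Indonesia,
\ref 2
\t sil dêr syn grêf wêze.
\f
\id frilake.txt Tutorial Exercise
\ref 1
\t Heite en Memme lân.
\f
"""

for paragraph in split_paragraphs(SAMPLE):
    print(paragraph["id"], len(paragraph["lines"]))
```

Treating a text without \id fields as a single implicit paragraph keeps one code path for both kinds of file, which is consistent with the Rotokas sample being tallied as one paragraph later in this tutorial.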
The heavy lifting in interlinear text processing with the NLTK is done by the InterlinearTextParser object, which is a factory object that will parse an interlinear Shoebox file and build an InterlinearText object. This can be done with only a few lines of Python code, as illustrated below:
import sys
from shoebox.text import TextParser
filepath = None
try:
    filepath = sys.argv[1]
except IndexError:
    sys.stderr.write("%s
The TextParser parses the Shoebox file by delegating the work to a number of other parsers. The main two are the ParagraphParser and LineParser. The ParagraphParser simply builds Paragraph objects, which are little more than collections of Line objects. Most of the sophistication of the object model resides in these Line objects, which possess a collection of fields and provide various means for accessing the information contained within them.
In order to parse an interlinearized text, a number of different pieces of information are required. They are listed below:
| Information | Status | Default Value | Description |
| paragraph head field | optional | \id | divides the text into paragraphs |
| line head field | required | \ref | divides the text into lines |
| text field marker | required | \t | field in which the original text is broken down into words |
| morpheme field marker | required | \m | field in which each word is broken down into morphemes |
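One way to picture these settings is as a small configuration object whose defaults mirror the table. The class `MarkerConfig` and the function `missing_markers` below are hypothetical names for illustration; the NLTK parser exposes its own setter methods instead.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MarkerConfig:
    """Hypothetical container for the field markers listed in the table."""
    line_head: str = "ref"
    text: str = "t"
    morpheme: str = "m"
    # Optional: set to None for texts that have no paragraph level.
    paragraph_head: Optional[str] = "id"

    def required_markers(self):
        return (self.line_head, self.text, self.morpheme)


def missing_markers(config, markers_in_file):
    """Return the required markers that never occur in the file."""
    present = set(markers_in_file)
    return [m for m in config.required_markers() if m not in present]


config = MarkerConfig()
print(missing_markers(config, ["ref", "t", "f", "fe"]))
```

A check of this kind before parsing makes it easier to tell a file that uses different markers from a file that simply has not been interlinearized yet.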
Once an InterlinearText object has been obtained, the information within the interlinear text file can be queried using its various methods. The possibilities afforded by its object model are presented and discussed in the case studies in the following section.
For these demonstrations of nltk.shoebox functionality, we will manipulate some sample Shoebox files (in the folder Samples). Because the sample files can be useful when learning how to use Shoebox, most users prefer to keep an unmodified version of them. Therefore, we recommend making a backup of these files before trying the scripts below.
Perhaps the simplest analysis that one could perform on an interlinear text in Shoebox is to determine how many paragraphs it contains and how many lines each paragraph contains. The script shoebox-text-tally.py does this; it is run as follows:
$ python bin/shoebox-text-tally.py samples/Rotokas/SampleText1.txt
Par.    Line
1       34
---     ---
1       34
The main work is done by the tally() function.
def tally(txt):
    # Running totals across the whole text.
    totalParagraphs = 0
    totalLines = 0
    print("Par.\tLine")
    for p in txt.getParagraphs():
        totalParagraphs = totalParagraphs + 1
        numberLines = len(p.getLines())
        totalLines = totalLines + numberLines
        # One row per paragraph: its index and its line count.
        print("%i\t%i" % (totalParagraphs, numberLines))
    print("---\t---")
    print("%i\t%i" % (totalParagraphs, totalLines))
A list of Paragraph objects is obtained using the accessor method getParagraphs(). The list is iterated over and from each paragraph a list of Line objects is obtained with the accessor method getLines(). Counts are obtained and counters incremented and the results are printed out at the end.
Interlinear text provides a great deal of analytical information, but sometimes one may wish to display only a subset of that information, providing essentially a stripped-down version of a text. The script shoebox-text-reformat.py prints out only the text itself (with no morphemic analysis) on one line, followed by its free translation in English. It is run as follows:
$ python bin/shoebox-text-reformat.py samples/Rotokas/SampleText1.txt
Gaurai raga oisoa toupareve vegoaro.
Gaurai lived alone in the bush.

Vaiterei kaakau vaio rera vaitereiaro taporo oisoa toupareve.
He lived with his two dogs.

Rera raga oisoa toupareve uva viapau oisoa rorupareve.
He was very lonely and unhappy.

...
The main work is done by the reformat() function.
import re

def reformat(txt):
    for p in txt.getParagraphs():
        for l in p.getLines():
            rawText = l.getFieldValueByFieldMarker("t")
            english = l.getFieldValueByFieldMarker("fe")
            # Collapse the runs of spaces used to pad the aligned tiers.
            cookedText = re.sub(r" +", " ", rawText)
            print("%s\n%s\n" % (cookedText, english))
A list of Paragraph objects is obtained using the accessor method getParagraphs(). The list is iterated over and from each paragraph a list of Line objects is obtained with the accessor method getLines(). From each line, the original Rotokas text (\t) and its free translation in English (\fe) are obtained. Since the original Rotokas is word-aligned with its glossing, it is padded with extra spaces, which must be removed; this is done using regular expression substitution:
cookedText = re.sub(r" +", " ", rawText)
The Rotokas and its English translation are then printed out in the order in which they are iterated through.
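The effect of the substitution can be seen on a single padded line. This is a small standalone illustration; the amount of padding in `raw_text` is an assumption about how the aligned \t tier might look.

```python
import re

# A word-aligned \t value: the gaps leave room for the morpheme columns below.
raw_text = "Gaurai raga oisoa    toupareve                           vegoaro."

# Replace every run of one or more spaces with a single space.
cooked_text = re.sub(r" +", " ", raw_text)
print(cooked_text)

# An alternative that also handles tabs and leading or trailing whitespace.
also_cooked = " ".join(raw_text.split())
print(also_cooked)
```

The regular expression only touches space characters, so it leaves tabs and surrounding whitespace in place; the `split`/`join` form is the more forgiving of the two when the padding is not uniform.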
One of the virtues of an interlinear text is that the breakdown of the text into words and morphemes makes it possible to search for particular words or morphemes on the basis of more than their surface form. It is possible to search for a particular morpheme on the basis of, say, its surface form as well as its gloss or part-of-speech.
$ python bin/shoebox-text-search-morphemes.py -m='pa' -g='PROG' samples/Rotokas/SampleText1.txt
\ref Paragraph 1
\t Gaurai raga oisoa    toupareve                           vegoaro.
\m Gaurai raga oisoa    tou -pa       -re         -ve       vegoa  -ro
\g name   only always   be  -PROG     -3.SG.M     -SUB      jungle -PL
\p N.PN   ADV  ADV.TEMP V.B -SUFF.V.3 -SUFF.V.B.4 -SUFF.V.5 N.N    -SUFF.N.???
\f Gaurai em wanpela i save stap long bus.
\fe Gaurai lived alone in the bush.

...
In the example provided, we are searching the text SampleText1.txt for lines containing the morpheme pa when it has the gloss PROG. It will not print a line such as the following, since it does not have the searched-for gloss---i.e., it has DERIV rather than PROG:
\ref Paragraph 25
\t Vokiarovi kareroepa                      Gaurai voroopa      tapiva.
\m vokiarovi kare   -ro         -epa        Gaurai voroo -pa    tapi  -va
\g afternoon return -3.SG.M     -RP         name   hunt  -DERIV place -ABL
\p TIME      V.A    -SUFF.V.A.4 -SUFF.V.A.5 N.PN   V.B   -SUFF  N.N   -ENC.N.6
\f Long apinun Gaurai i bin kam bek long painim pik.
\fe In the afternoon Gaurai arrived back from hunting.
The morphemes of a line can be extracted with the method getMorphemes(). The morphemes can then be iterated over and their surface form and gloss examined. If the surface form is pa and the gloss is PROG, the line is printed and the next one examined.
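def search(txt, surfaceForm, gloss):
    for p in txt.getParagraphs():
        for l in p.getLines():
            for m in l.getMorphemes():
                sf = m.getForm()
                gl = m.getGloss()
                # Compare against the arguments rather than hard-coded values.
                if sf == surfaceForm and gl == gloss:
                    print("%s\n" % l.getRawText())
                    # Print each matching line once, then go on to the next line.
                    break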
Note, however, that the extraction of morphemes will only work if the InterlinearTextParser is provided with the field markers for the tiers that contain the surface form and gloss of the morphemes.
def main():
    surfaceForm, gloss, filepath = handle_options()
    itp = InterlinearTextParser(filepath)
    # Tell the parser which tiers hold the morpheme forms and their glosses.
    itp.setMorphemeFieldMarker("m")
    itp.setMorphemeGlossFieldMarker("g")
    txt = itp.parse()
    search(txt, surfaceForm, gloss)
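The matching logic can be tried out without the toolkit by representing each line as a reference plus a list of (form, gloss) pairs. This is a sketch under stated assumptions: `find_lines` and `normalize` are hypothetical names, and stripping the affix hyphens before comparison is an assumption about how a search for pa matches the morpheme written -pa in the text.

```python
def normalize(label):
    """Drop affix hyphens so that '-pa' compares equal to 'pa'."""
    return label.strip("-")


def find_lines(lines, surface_form, gloss):
    """Return the refs of lines containing the morpheme with the given gloss."""
    hits = []
    for ref, morphemes in lines:
        if any(normalize(form) == surface_form and normalize(gl) == gloss
               for form, gl in morphemes):
            hits.append(ref)
    return hits


# Morphemes and glosses taken from the two Rotokas lines shown above.
LINES = [
    ("Paragraph 1", [
        ("Gaurai", "name"), ("raga", "only"), ("oisoa", "always"),
        ("tou", "be"), ("-pa", "-PROG"), ("-re", "-3.SG.M"), ("-ve", "-SUB"),
        ("vegoa", "jungle"), ("-ro", "-PL"),
    ]),
    ("Paragraph 25", [
        ("vokiarovi", "afternoon"), ("kare", "return"), ("-ro", "-3.SG.M"),
        ("-epa", "-RP"), ("Gaurai", "name"), ("voroo", "hunt"),
        ("-pa", "-DERIV"), ("tapi", "place"), ("-va", "-ABL"),
    ]),
]

print(find_lines(LINES, "pa", "PROG"))
```

Requiring both the form and the gloss to match is what separates the progressive -pa in the first line from the derivational -pa in the second.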
The ability to extract morphemes makes it possible to achieve other types of reformatting, such as realigning the morpheme tiers of a text. The script shoebox-text-realign-morphemes.py is run as follows:
$ python bin/shoebox-text-realign-morphemes.py samples/Rotokas/SampleText1.txt
One serious shortcoming of Shoebox is that once a text has been interlinearized, it cannot be automatically re-synched against the dictionary originally used to gloss it. It is therefore useful to be able to update an interlinear text automatically once the dictionary has been updated. Here we show how an interlinear text can be compared to a dictionary for updating.
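The comparison itself can be sketched as a lookup of each (form, gloss) pair in a dictionary that maps forms to their current glosses. Everything in this example is hypothetical: the function `find_stale_glosses`, the data layout, and the sample lexicon (including the revised gloss "be thirsty") are invented for illustration and do not reflect an actual Rotokas dictionary or an NLTK API.

```python
def find_stale_glosses(lines, lexicon):
    """Report morphemes whose gloss no longer agrees with the dictionary."""
    stale = []
    for ref, morphemes in lines:
        for form, gloss in morphemes:
            known = lexicon.get(form)
            if known is None:
                stale.append((ref, form, gloss, "missing from dictionary"))
            elif gloss not in known:
                stale.append((ref, form, gloss, "gloss differs"))
    return stale


# Hypothetical dictionary state after an update: form -> accepted glosses.
LEXICON = {
    "uuko": {"water"},
    "rega": {"be thirsty"},
    "-pa": {"-PROG", "-DERIV"},
}

# Part of line 15 of the Rotokas sample as (form, gloss) pairs.
TEXT_LINES = [
    ("15", [("igei", "1.PL.EXCL"), ("uuko", "water"),
            ("rega", "thirst for"), ("-pa", "-PROG")]),
]

for report in find_stale_glosses(TEXT_LINES, LEXICON):
    print(report)
```

A report of this kind only flags candidates for updating; whether a gloss should be replaced automatically or reviewed by hand is a separate decision.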
Shoebox stores information about texts in a file called a type definition file. This file describes the types of information found in the interlinear texts of a project. This metadata can be very useful when parsing an interlinear text and manipulating its contents.
In this tutorial we have seen how interlinear texts formatted by Shoebox can be manipulated using the functionality provided by the Natural Language Toolkit for Python. The formatting of these files was discussed as well as the object model that allows for their simple and easy processing. Finally, a number of case studies illustrating the functionality at work were provided, and these should illustrate all of the basics. For more detailed information, consult the NLTK documentation.
| NLTK | Natural Language Toolkit (NLTK) Homepage, NLTK Tutorials |
| Python | Python Homepage |
| Shoebox | Shoebox Homepage, Toolbox Homepage, User Tips for Shoebox |
The author may be contacted at stuart-at-zapata-dot-org.