Shoebox texts differ from Shoebox lexicons in possessing two levels of structure: a paragraph structure, in which records are usually marked by the record marker \id, and a line structure, in which records are usually marked by \ref. Because there are two levels, we are faced with a conundrum ("a paradoxical, insoluble, or difficult problem"): either we parse texts like lexicons and ignore their first level of structure (the paragraph structure imposed by \id), or we parse texts differently and obtain a two-level structure in which files contain paragraphs, which in turn contain lines.
The conundrum rears its ugly head when we try to parse one of the sample texts that come with Shoebox, FriSampl.txt. (Warning: Some characters will look strange when pasted straight into HTML without conversion.)
\_sh v3.0 397 Frisian Text

\id Adjectives with ûn
\ref 1
\t ûnbrekber. ûnwis. ûntrou. ûngeef. ûnfreonlik. ûnwolkom. ûntagonklik. ûnbikwaam. ûnferlykber. ûnkrekt. ûngenêslik. ûnytber. ûnbegeanber. ûngeduldich. ûnfolmakke. ûnbeskamme. ûnmooglik. ûnwierskynlik.

\id First sample sentence from the tutorial
\ref 1
\t Berne en opgroeid yn Ynje,
\m bern -e en op- groei -e yn Ynje
\g bear -PSTP and up- grow -PSTP in Indonesia
\p V -Tns Conj Dir- V -Tns Prep N
\f Born and raised in Indonesia,
\ref 2
\t sil dêr syn grêf wêze.
\f

\id frilake.txt Tutorial Exercise
\ref 1
\t Fryslân yn maitiidpracht, wylst de sinne skynde oer de marren en de wide greiden mei fee.
\f
\ref 2
\t Noarwegen, doe't de hege sinne dreamde yn 'e fjorden.
\f
\ref 3
\t Heite en Memme lân.
\f
\ref 4
\t Mar sines?
\f
\ref 5
\t Hij hat der nea werom west.
\f

\id Verb present and past paradigm
\ref 1
\t miene.
\f
\ref 2
\t ik mien. do mienst. hy mient. wy miene.
\f
\ref 3
\t ik miende. do miendest. hy miende. wy mienden.
\f
The hierarchical structure of the text looks something like this:
| Paragraphs | Contents |
| \id Adjectives with ûn | \ref 1 |
| \id First sample sentence from the tutorial | \ref 1, \ref 2 |
| \id frilake.txt Tutorial Exercise | \ref 1, \ref 2, \ref 3, \ref 4, \ref 5 |
| \id Verb present and past paradigm | [...] |
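The two-level reading of the file can be sketched in a few lines of Python. This is only an illustration, not an existing parser: the names `parse_fields` and `parse_text`, the dictionary layout, and the shortened `SAMPLE` string are all assumptions made for the example, and the header field \_sh is simply skipped.

```python
import re

# Shortened excerpt of the sample text, used only for illustration.
SAMPLE = r"""\_sh v3.0 397 Frisian Text
\id Adjectives with ûn
\ref 1
\t ûnbrekber. ûnwis.
\id First sample sentence from the tutorial
\ref 1
\t Berne en opgroeid yn Ynje,
\f Born and raised in Indonesia,
\ref 2
\t sil dêr syn grêf wêze.
\f
"""

# A field line is a backslash, a marker, and an optional value.
FIELD = re.compile(r"^\\(\S+)[ \t]*(.*)$")


def parse_fields(text):
    """Return a list of (marker, value) pairs, one per field."""
    fields = []
    for line in text.splitlines():
        match = FIELD.match(line)
        if match:
            fields.append((match.group(1), match.group(2).strip()))
        elif fields and line.strip():
            # Treat a non-marker line as a continuation of the previous field.
            marker, value = fields[-1]
            fields[-1] = (marker, (value + " " + line.strip()).strip())
    return fields


def parse_text(text, para_marker="id", line_marker="ref"):
    """Group fields into paragraphs (\\id), each holding lines (\\ref)."""
    paragraphs = []
    for marker, value in parse_fields(text):
        if marker == para_marker:
            paragraphs.append({"id": value, "lines": []})
        elif marker == line_marker and paragraphs:
            paragraphs[-1]["lines"].append({"ref": value, "fields": []})
        elif paragraphs and paragraphs[-1]["lines"]:
            paragraphs[-1]["lines"][-1]["fields"].append((marker, value))
        # Anything else (the \_sh header, or a field between \id and the
        # first \ref) is dropped in this simplified sketch.
    return paragraphs


for paragraph in parse_text(SAMPLE):
    print(paragraph["id"], [line["ref"] for line in paragraph["lines"]])
```

Each paragraph ends up owning its own list of lines, so \ref 1 under one \id is never confused with \ref 1 under another.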
What's interesting is that the metadata for this text treats the field marker \id as the record marker. The marker metadata below is excerpted from FrisanT.typ:
\+DatabaseType Frisian Text
\ver 5.0
\desc Frisian Interlinear Text for Tutorial
\+mkrset
\lngDefault Default
\mkrRecord id
\+mkr f
\nam Free Translation
\lng Default
\mkrOverThis id
\-mkr
\+mkr g
\nam Gloss
\lng Default
\mkrOverThis id
\-mkr
\+mkr id
\nam Identifier
\lng Default
\-mkr
\+mkr m
\nam Morphemes
\lng Default
\mkrOverThis id
\-mkr
\+mkr p
\nam Part of Spch
\lng Default
\mkrOverThis id
\-mkr
\+mkr ref
\nam Reference
\lng Default
\mkrOverThis id
\-mkr
\+mkr t
\nam Text
\lng Frisian
\mkrOverThis id
\-mkr
\-mkrset
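To see what the metadata itself declares, one could read \mkrRecord and the \mkrOverThis entries out of the .typ file. The sketch below is an assumption-laden illustration: `read_marker_metadata` is a hypothetical helper, `TYP_EXCERPT` is a trimmed copy of the excerpt above, and only the three keys needed here are interpreted.

```python
import re

# Trimmed copy of the metadata excerpt, kept short for illustration.
TYP_EXCERPT = r"""\+DatabaseType Frisian Text
\+mkrset
\lngDefault Default
\mkrRecord id
\+mkr ref
\nam Reference
\lng Default
\mkrOverThis id
\-mkr
\+mkr id
\nam Identifier
\lng Default
\-mkr
\-mkrset
"""


def read_marker_metadata(typ_text):
    """Return (record_marker, parents), where parents maps each declared
    marker to its \\mkrOverThis value (None if it has none)."""
    record_marker = None
    parents = {}
    current = None  # marker whose \+mkr ... \-mkr block we are inside
    for line in typ_text.splitlines():
        match = re.match(r"\\(\S+)[ \t]*(.*)", line)
        if not match:
            continue
        key, value = match.group(1), match.group(2).strip()
        if key == "mkrRecord":
            record_marker = value
        elif key == "+mkr":
            current = value
            parents[current] = None
        elif key == "-mkr":
            current = None
        elif key == "mkrOverThis" and current is not None:
            parents[current] = value
    return record_marker, parents


record_marker, parents = read_marker_metadata(TYP_EXCERPT)
print(record_marker, parents)
```

In this excerpt every marker other than \id names \id as the marker over it, so the metadata gives \ref no special status as a second-level record marker.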
If we ignore the paragraph structure and treat \ref as the record marker, we are essentially declaring the metadata incorrect. So either we ignore the metadata and parse the text by lines rather than paragraphs, or we split the file parser into two subclasses (one for lexicons and another for texts) and define different behavior for each. I favor having two subclasses, but I'd like to solicit the opinion of others.
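As a rough sketch of the two-subclass option, the classes below share one field reader and differ only in how they group fields. All class and method names (`ShoeboxParser`, `LexiconParser`, `TextParser`, `group`) are hypothetical and not taken from any existing library, and the default markers are assumptions based on the sample file.

```python
import re

_FIELD = re.compile(r"\\(\S+)[ \t]*(.*)")

# One paragraph from the sample text, used only for illustration.
SAMPLE = r"""\id Verb present and past paradigm
\ref 1
\t miene.
\f
\ref 2
\t ik mien. do mienst. hy mient. wy miene.
\f
"""


class ShoeboxParser:
    """Shared behavior: read fields and split them at a record marker."""

    def __init__(self, record_marker):
        self.record_marker = record_marker

    def fields(self, text):
        """Yield (marker, value) for every line that starts with a marker."""
        for line in text.splitlines():
            match = _FIELD.match(line)
            if match:
                yield match.group(1), match.group(2).strip()

    def group(self, fields, marker):
        """Start a new record at each occurrence of `marker`."""
        records = []
        for field in fields:
            if field[0] == marker:
                records.append([field])
            elif records:
                records[-1].append(field)
        return records

    def parse(self, text):
        return self.group(self.fields(text), self.record_marker)


class LexiconParser(ShoeboxParser):
    """One level of structure: inherits the flat parse unchanged."""


class TextParser(ShoeboxParser):
    """Two levels: paragraphs at the record marker, lines at \\ref."""

    def __init__(self, record_marker="id", line_marker="ref"):
        super().__init__(record_marker)
        self.line_marker = line_marker

    def parse(self, text):
        paragraphs = super().parse(text)
        # Keep the \id field as the paragraph head, regroup the rest by \ref.
        return [(para[0], self.group(para[1:], self.line_marker))
                for para in paragraphs]


head, lines = TextParser().parse(SAMPLE)[0]
print(head, len(lines))
```

With this layout the text parser still honors the \mkrRecord value from the metadata, and the line level is an extra grouping step rather than a replacement for it. A flat `ShoeboxParser("ref")` on a multi-paragraph file would, in this naive sketch, attach each following \id field to the preceding line's record.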
Contact the author at: Stuart DOT Robinson AT mpi DOT nl