The Conundrum of Paragraph Structure in Shoebox Texts

Shoebox texts differ from Shoebox lexicons in having two levels of structure: a paragraph structure, in which records are usually marked by the record marker \id, and a line structure, in which records are usually marked by \ref. Because there are two levels, we face a conundrum ("a paradoxical, insoluble, or difficult problem"): either we parse texts like lexicons and ignore their first level of structure (the paragraph structure imposed by \id), or we parse texts differently and obtain a two-level structure in which files contain paragraphs, which in turn contain lines.

The conundrum rears its ugly head when we try to parse one of the sample texts that come with Shoebox, FriSampl.txt. (Warning: Some characters will look strange when pasted straight into HTML without conversion.)

\_sh v3.0  397  Frisian Text

\id Adjectives with ûn
\ref 1
\t ûnbrekber.
ûnwis.
ûntrou.
ûngeef.
ûnfreonlik.
ûnwolkom.
ûntagonklik.
ûnbikwaam.
ûnferlykber.
ûnkrekt.
ûngenêslik.
ûnytber.
ûnbegeanber.
ûngeduldich.
ûnfolmakke.
ûnbeskamme.
ûnmooglik.
ûnwierskynlik.

\id First sample sentence from the tutorial
\ref 1
\t Berne      en   opgroeid         yn   Ynje,
\m bern -e    en   op-  groei -e    yn   Ynje
\g bear -PSTP and  up-  grow  -PSTP in   Indonesia
\p V    -Tns  Conj Dir- V     -Tns  Prep N

\f Born and raised in Indonesia, 

\ref 2
\t sil  dêr   syn  grêf  wêze.
\f 
\id frilake.txt Tutorial Exercise
\ref 1
\t Fryslân yn maitiidpracht, wylst de sinne skynde oer
de marren en de wide greiden mei fee.
\f 

\ref 2
\t Noarwegen, doe't de hege sinne dreamde yn 'e fjorden.
\f 

\ref 3
\t Heite en Memme lân.
\f 

\ref 4
\t Mar sines?
\f 

\ref 5
\t Hij hat der nea werom west.
\f 

\id Verb present and past paradigm
\ref 1
\t miene.
\f 

\ref 2
\t ik mien.
do mienst.
hy mient.
wy miene.
\f 

\ref 3
\t ik miende.
do miendest.
hy miende.
wy mienden.
\f 
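Parsing this file the way a lexicon is parsed means reading it as a flat sequence of fields. Here is a minimal Python sketch of that first step, assuming that every field starts with a backslash marker at the beginning of a line and that unmarked lines continue the previous field. The name parse_fields is hypothetical and is not part of Shoebox.

```python
import re

# Assumption: a field is a backslash, a marker with no spaces, then optional text.
FIELD_RE = re.compile(r"^\\(\S+)(?:\s+(.*))?$")


def parse_fields(text):
    """Return a flat list of (marker, value) pairs.

    Blank lines are skipped; a line without a marker is appended to the
    value of the preceding field, separated by a newline.
    """
    fields = []
    for raw in text.splitlines():
        line = raw.rstrip()
        if not line:
            continue
        match = FIELD_RE.match(line)
        if match:
            fields.append((match.group(1), match.group(2) or ""))
        elif fields:
            marker, value = fields[-1]
            fields[-1] = (marker, value + "\n" + line if value else line)
    return fields


# A few lines taken from the sample text above.
SAMPLE = r"""\id Verb present and past paradigm
\ref 2
\t ik mien.
do mienst.
hy mient.
wy miene.
\f
"""

for marker, value in parse_fields(SAMPLE):
    print(marker, repr(value))
```

At this stage \id and \ref are ordinary fields; nothing yet says which of them starts a record.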

The hierarchical structure of the text looks something like this:

Paragraphs Contents
\id Adjectives with ûn
    Lines Contents
    \ref 1
        \t ûnbrekber.
        ûnwis.
        ûntrou.
        ûngeef.
        ûnfreonlik.
        ûnwolkom.
        ûntagonklik.
        ûnbikwaam.
        ûnferlykber.
        ûnkrekt.
        ûngenêslik.
        ûnytber.
        ûnbegeanber.
        ûngeduldich.
        ûnfolmakke.
        ûnbeskamme.
        ûnmooglik.
        ûnwierskynlik.
\id First sample sentence from the tutorial
    Lines Contents
    \ref 1
        \t Berne      en   opgroeid         yn   Ynje,
        \m bern -e    en   op-  groei -e    yn   Ynje
        \g bear -PSTP and  up-  grow  -PSTP in   Indonesia
        \p V    -Tns  Conj Dir- V     -Tns  Prep N

        \f Born and raised in Indonesia,
    \ref 2
        \t sil  dêr   syn  grêf  wêze.
        \f
\id frilake.txt Tutorial Exercise
    Lines Contents
    \ref 1
        \t Fryslân yn maitiidpracht, wylst de sinne skynde oer
        de marren en de wide greiden mei fee.
        \f
    \ref 2
        \t Noarwegen, doe't de hege sinne dreamde yn 'e fjorden.
        \f
    \ref 3
        \t Heite en Memme lân.
        \f
    \ref 4
        \t Mar sines?
        \f
    \ref 5
        \t Hij hat der nea werom west.
        \f
\id Verb present and past paradigm
    [...]
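The two-level reading can be sketched in Python as a grouping pass over a flat list of (marker, value) fields. This is only an illustration under stated assumptions: the function name group_two_levels and the dictionary layout are hypothetical, the paragraph and line markers default to \id and \ref as in the sample, and any field that appears between an \id and its first \ref is silently dropped.

```python
def group_two_levels(fields, paragraph_marker="id", line_marker="ref"):
    """Group flat (marker, value) pairs into paragraphs that contain lines."""
    paragraphs = []
    for marker, value in fields:
        if marker == paragraph_marker:
            # A new paragraph starts at every paragraph marker.
            paragraphs.append({"id": value, "lines": []})
        elif not paragraphs:
            # Header fields such as \_sh come before the first paragraph.
            continue
        elif marker == line_marker:
            paragraphs[-1]["lines"].append({"ref": value, "fields": []})
        elif paragraphs[-1]["lines"]:
            # Any other field belongs to the most recent line.
            paragraphs[-1]["lines"][-1]["fields"].append((marker, value))
    return paragraphs


# Fields as a flat parse of part of the sample text might yield them.
FIELDS = [
    ("_sh", "v3.0  397  Frisian Text"),
    ("id", "frilake.txt Tutorial Exercise"),
    ("ref", "3"),
    ("t", "Heite en Memme lân."),
    ("f", ""),
    ("ref", "4"),
    ("t", "Mar sines?"),
    ("f", ""),
    ("id", "Verb present and past paradigm"),
    ("ref", "1"),
    ("t", "miene."),
    ("f", ""),
]

for paragraph in group_two_levels(FIELDS):
    print(paragraph["id"], [line["ref"] for line in paragraph["lines"]])
```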

What's interesting is that the metadata for this text treats the field marker \id as the record marker (\mkrRecord id) and lists \id as the marker over every other field, including \ref (\mkrOverThis id). The marker metadata below is excerpted from FrisanT.typ:

\+DatabaseType Frisian Text
\ver 5.0
\desc Frisian Interlinear Text for Tutorial
\+mkrset 
\lngDefault Default
\mkrRecord id

\+mkr f
\nam Free Translation
\lng Default
\mkrOverThis id
\-mkr

\+mkr g
\nam Gloss
\lng Default
\mkrOverThis id
\-mkr

\+mkr id
\nam Identifier
\lng Default
\-mkr

\+mkr m
\nam Morphemes
\lng Default
\mkrOverThis id
\-mkr

\+mkr p
\nam Part of Spch
\lng Default
\mkrOverThis id
\-mkr

\+mkr ref
\nam Reference
\lng Default
\mkrOverThis id
\-mkr

\+mkr t
\nam Text
\lng Frisian
\mkrOverThis id
\-mkr

\-mkrset
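To check what the metadata declares, one can read the record marker and the \mkrOverThis entries out of the type file. The following Python sketch assumes that \mkrOverThis names the marker immediately above the current one in the hierarchy and that the file follows the layout of the excerpt above. The function name read_marker_hierarchy is hypothetical.

```python
def read_marker_hierarchy(typ_text):
    """Return (record_marker, parents), where parents maps marker -> marker over it."""
    record_marker = None
    parents = {}
    current = None  # marker whose \+mkr ... \-mkr block we are inside
    for raw in typ_text.splitlines():
        parts = raw.strip().split(None, 1)
        if not parts:
            continue
        tag = parts[0]
        value = parts[1] if len(parts) > 1 else ""
        if tag == "\\mkrRecord":
            record_marker = value
        elif tag == "\\+mkr":
            current = value
            parents.setdefault(current, None)
        elif tag == "\\-mkr":
            current = None
        elif tag == "\\mkrOverThis" and current is not None:
            parents[current] = value
    return record_marker, parents


# A shortened copy of the excerpt above.
TYP_EXCERPT = r"""\+DatabaseType Frisian Text
\+mkrset
\lngDefault Default
\mkrRecord id

\+mkr id
\nam Identifier
\lng Default
\-mkr

\+mkr ref
\nam Reference
\lng Default
\mkrOverThis id
\-mkr

\+mkr t
\nam Text
\lng Frisian
\mkrOverThis id
\-mkr

\-mkrset
"""

record_marker, parents = read_marker_hierarchy(TYP_EXCERPT)
print(record_marker, parents)
```

On this excerpt, \t is declared directly under \id rather than under \ref, so the metadata itself describes a single level of structure.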

If we ignore the paragraph structure and treat \ref as the record marker, we are essentially declaring the metadata incorrect. So either we ignore the metadata and parse the text by lines rather than by paragraphs, or we split the file parser into two subclasses (one for lexicons and another for texts) and define different behavior for each. I favor having two subclasses, but I'd like to solicit the opinions of others.
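As a rough illustration of the two-subclass option, here is a Python sketch. All class and method names are hypothetical, the input is assumed to be a flat list of (marker, value) fields, and the lexicon record marker \lx is an assumption that does not come from the files discussed here.

```python
class ShoeboxFile:
    """Hypothetical base class: splits flat fields into records."""

    record_marker = None

    def __init__(self, fields):
        self.fields = list(fields)

    @staticmethod
    def split_on(fields, marker):
        """Split fields into chunks, each starting at an occurrence of marker."""
        chunks = []
        for field in fields:
            if field[0] == marker:
                chunks.append([field])
            elif chunks:
                chunks[-1].append(field)
        return chunks

    def parse(self):
        # One level of structure: a list of records.
        return self.split_on(self.fields, self.record_marker)


class LexiconFile(ShoeboxFile):
    record_marker = "lx"  # assumption: a typical lexicon record marker


class TextFile(ShoeboxFile):
    record_marker = "id"  # paragraph level, as in the metadata
    line_marker = "ref"   # line level

    def parse(self):
        # Two levels of structure: (paragraph id, list of lines).
        paragraphs = super().parse()
        return [
            (para[0][1], self.split_on(para[1:], self.line_marker))
            for para in paragraphs
        ]


text = TextFile([("id", "P1"), ("ref", "1"), ("t", "a"), ("ref", "2"), ("t", "b")])
for paragraph_id, lines in text.parse():
    print(paragraph_id, lines)
```

Both classes share the splitting logic; only TextFile applies it a second time, so the metadata's record marker is respected while the line level is still exposed.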


Contact the author at: Stuart DOT Robinson AT mpi DOT nl