“I think, yes, it’s also available with Latin texts,” Ines Rehbein and Josef Ruppenhofer answered when I asked them during a lecture in one of our InFoDiTex sessions. Immediately, I was electrified. To me this was a magic moment, because at most digital humanities conferences and summer schools, I had got to know powerful and valuable tools for English or German text corpora. Yet my corpus comprises 252 Latin letters written by St. Augustine around the beginning of the 5th century CE. At least in my experience, useful tools for Latin texts are quite rare (I know there are many more possibilities with some programming skills, but in this context, I’m thinking of hands-on tools for less technical researchers to start with). Rehbein and Ruppenhofer gave us some basic information about automatic text annotation, and in doing so they mentioned the TreeTagger. After the session, I watched a (German) YouTube video on how to install it, downloaded the Latin parameter file by Gabriele Brandolini, and finally started to convert my plain text files in order to get a beautiful tokenized, lemmatized, and POS-tagged list to enable analyses like the following ones.
But getting valid data was not that easy. I faced many problems, which I want to present here in detail.
A first limit becomes evident when looking at the Italian tagset LAMAP (pdf). You can analyze parts of speech only in the categories offered there. However, very interesting linguistic research questions are imaginable that are unfortunately not possible in this setting. For example, just take the five dimensions of linguistic variation by Douglas Biber (p. 229):
- Informational versus Involved Production
- Narrative versus Nonnarrative Concerns
- Elaborated versus Situation-Dependent Reference
- Overt Expression of Persuasion
- Abstract versus Nonabstract Style
None of these dimensions can be treated adequately with the given tagset. For this, we would need to know, for example, the tense of verbs (present, perfect, …), the grammatical person (1st, 2nd, or 3rd), the kind of adverb (temporal, local, …), or the voice (active, passive).
Another problem arises from the fact that the TreeTagger parameter file was trained on particular sources. So it is possible to use it for Latin texts, but the validity of the results differs tremendously depending on how similar your texts are to the training data. The parameter file by Brandolini was trained on the following data:
- PROIEL (PROIEL Project, University of Oslo): Peregrinatio Aetheriae and Jerome’s Vulgate (NT) (lines 1-52308 ca.)
- PERSEUS (Latin Dependency Treebank, The Perseus Project, Tufts University): 8 Classical texts (lines 52310-108782 ca.)
- IT (Index Thomisticus, Catholic University of the Sacred Heart, Milan): texts of St. Thomas Aquinas (lines 108784-195911 ca.)
Since Jerome’s Vulgate is part of the training data, you might think there are very few problems with the correspondence of Augustine, who was a contemporary of Jerome. However, there are still enough hurdles you should be aware of:
Proper nouns were usually not recognized. Named Entity Recognition would be desirable.
Augustine loves to use contracted forms (quaesissent instead of quaesivissent). However, in most cases the TreeTagger has problems identifying them properly.
Neologisms, of course, go unrecognized, because they were not part of the training data.
The spelling of u/v and i/j might be more of a problem of my text version. Where my text files have “u,” the tagger expects “v,” and where my files have “i,” the tagger expects “j.” This leads to many misspellings and even markers.
The enclitics were very annoying. The tagset includes the category CLI for enclitics, but there was not a single case where it was used properly. I had to correct every -que, every -ne, and every -ve manually, with only a little help from regular expressions.
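To give an idea of what such a regular expression could look like, here is a minimal Python sketch that proposes a split into stem and enclitic for manual review. It is not the exact expression I used; the function name `split_enclitic` and the small stoplist are my own assumptions for illustration, and the stoplist is far from complete.

```python
import re

# Words that end in -que/-ne/-ve but do not carry a detachable enclitic here.
# Illustrative stoplist only (an assumption), far from complete.
NOT_ENCLITIC = {"atque", "neque", "quoque", "itaque", "usque",
                "denique", "bene", "sine", "paene", "sive", "neve"}

# A stem of at least two characters followed by one of the three enclitics.
ENCLITIC_RE = re.compile(r"^(?P<stem>\w{2,})(?P<clitic>que|ne|ve)$",
                         re.IGNORECASE)


def split_enclitic(token):
    """Return (stem, clitic) if the token looks like stem + enclitic, else None."""
    if token.lower() in NOT_ENCLITIC:
        return None
    match = ENCLITIC_RE.match(token)
    if match is None:
        return None
    return match.group("stem"), match.group("clitic")


if __name__ == "__main__":
    for word in ["populusque", "atque", "ratione"]:
        print(word, split_enclitic(word))
```

Every hit is only a candidate. Ablatives such as ratione also end in -ne and will be proposed as splits, so each result still has to be checked by hand.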
Comparison was another surprise. The TreeTagger has huge problems with it, although it is covered by the tagset. Comparatives and superlatives are very often not recognized by the tagger. By the way, it is a pity that it is not possible to detect diminutives, which Augustine also uses quite often.
Hyphens turned out to be a tripping hazard for the TreeTagger. I had to remove all of them manually.
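Part of this clean-up could probably be automated before tagging. The following Python sketch assumes two kinds of hyphens, words split across a line break and hyphens inside a word; whether this covers the hyphens in your own files is something you have to check. The function name `dehyphenate` is hypothetical.

```python
import re


def dehyphenate(text):
    """Remove hyphens that split a word, leaving free-standing dashes alone."""
    # Rejoin words hyphenated across a line break: "episto-\nlam" -> "epistolam"
    text = re.sub(r"(\w)-[ \t]*\n[ \t]*(\w)", r"\1\2", text)
    # Drop remaining hyphens directly between two word characters
    text = re.sub(r"(?<=\w)-(?=\w)", "", text)
    return text


if __name__ == "__main__":
    print(dehyphenate("episto-\nlam misi"))
```

Note that joining a word across a line break also removes that line break, so line numbers in the cleaned file no longer match the original.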
If you think you “just” have to search for <unknown> in the third column to find all assignment errors, that is unfortunately wrong. Sometimes there is a hyphen instead.
There is a third option. If the TreeTagger is unsure, it will present more than one option separated by a vertical bar. You have to search for these cases manually and decide which one is correct.
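Finding these rows does not have to be done by eye. As a minimal sketch, assuming the tagger output is a tab-separated file with token, tag, and lemma in that order, the following Python function collects every row whose third column is <unknown>, a hyphen, or contains a vertical bar. The function name `flag_rows` and the placeholder values in the sample are made up for illustration.

```python
def flag_rows(lines):
    """Collect rows whose lemma column needs a manual decision."""
    flagged = []
    for number, line in enumerate(lines, start=1):
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip empty or malformed rows
        token, tag, lemma = parts
        if lemma == "<unknown>":
            reason = "unknown"
        elif lemma == "-":
            reason = "hyphen"
        elif "|" in lemma:
            reason = "ambiguous"
        else:
            continue
        flagged.append((number, token, tag, lemma, reason))
    return flagged


if __name__ == "__main__":
    # Placeholder rows, not real tagger output
    sample = [
        "token1\tTAG\t<unknown>",
        "token2\tTAG\tlemma2",
        "token3\tTAG\t-",
        "token4\tTAG\tlemmaA|lemmaB",
    ]
    for row in flag_rows(sample):
        print(row)
```

The decision which lemma is correct remains manual; the sketch only tells you where to look.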
The TreeTagger is a helpful tool, but you must be careful when you apply parameter files that other people trained on other datasets to your own corpora. In this case, it is necessary to correct the data manually, at least partly with the help of regular expressions. If you don’t, you will have many errors that invalidate your results.