When I started my dissertation project a few years ago, I was looking for proper tools to transcribe handwritten documents like this:
My first naive idea was to google “handprint character recognition” with the thought: “If my tablet can transform my handwritten notes into a computer font (e.g. pen input in OneNote), there must be powerful software for handwritten documents.” I don’t know what to say: I found Transkribus and left it untouched. What a mistake…
At the time, my reasons were simple: I was not interested in publishing a source edition, so the effort of transcribing c. 100 pages to train an HTR model seemed too high. My personal cost-benefit analysis led me to the decision to read the texts selectively, excerpt some of them, and work with what I got. Without any doubt, I gained many interesting insights into sermons of the 19th century and was able to work out a relevant issue. But 15–20 sermons out of c. 300 are a sad foundation for – until now – unknown literary remains.
Then, I slowly slid into the Digital Humanities and visited www.transkribus.eu again. A handy tip: if you have been working with handwritten papers for a long time and want to finish your work, never – trust me – never ever visit this page! I became nostalgic, looked back to the time when I made my methodological decisions, and thought: “Oh, Transkribus, I did you wrong!” But doctorates come to an end, and I decided to let it go.
A few weeks ago, I finished another section of my thesis. My motivation to keep writing was very low, and the procrastination began. After I had finished watching every video on YouTube and reading everything on Wikipedia, I felt the urge to do something useful. … … … www.transkribus.eu.
The download and setup were pretty easy, and for test purposes I uploaded some pages of the sermon corpus. My first aha moment concerned the Layout Analysis tool. Detecting the lines and baselines of a text greatly simplifies the transcription process. Transkribus highlights the line you are working on, so the document “moves” along with the text editor. In doing so, the software helps you stay focused on the relevant part of the document.
Now, I decided to put the tool to the test. I uploaded more than 2,400 pages and asked for the opportunity to train an HTR model. My next moment of blank astonishment was the support: fast, friendly, collegial. I had a problem, they had the solution. What more could one want? The decision to train the model was based on the realization that I had already transcribed a large number of pages and had to transcribe some more for the thesis anyway.
To make it short: I transcribed some pages, trained the model, tested it, and smiled… transcribed more pages, trained a new model, tested it, and smiled some more. That’s it!
Of course you want to see results and the CER (Character Error Rate) percentages. On the transcribed corpus of 31,024 words, the model reached a CER of 8.44% on the training set and 10.88% on a test set of randomly selected pages. I must confess that the results could have been better if I had transcribed more conscientiously and with fewer careless mistakes.
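For readers unfamiliar with the metric: the CER is simply the edit (Levenshtein) distance between the model’s output and the ground-truth transcription, divided by the length of the ground truth. A minimal sketch (not Transkribus’ internal code, just an illustration of the definition):

```python
# Character Error Rate: Levenshtein distance between HTR output and
# ground truth, divided by the length of the ground truth.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    return levenshtein(reference, hypothesis) / len(reference)

# One substituted character in an 11-character reference:
print(round(cer("die Kranken", "die Franken"), 4))  # -> 0.0909
```

So a CER of 10.88% means roughly one wrong, missing, or extra character in every nine.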
The results look “okay” but, in fact, the transcriptions are not suitable for publication without further editorial work. However, I realized that an “okay” result makes corrections easier and accelerates manual transcribing. In some cases, the HTR even correctly recognized words that I was not able to make out at first glance. In other cases, it consistently repeated the same mistakes: “Jesus heals the sick (die Kranken)” became “Jesus heals the Franks (die Franken).”
Because the letters “K” and “F” are very similar in Bassermann’s hand and were confused systematically, a “search and replace” revision raised the quality of the transcriptions in many cases. As mentioned above, the goal of the Transkribus test was not to produce a source edition. Primarily, I wanted to get an overview of the sermon corpus and find important keywords. In the example below, you can see a typical manuscript page and a – unfortunately – very bad transcription.
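Such systematic confusions can be fixed in one pass over the exported plain text. A small sketch of this post-correction idea, with an invented correction table (the word list below is an assumption for illustration, not my actual revision list):

```python
import re

# Hypothetical post-correction table for the systematic K/F confusion.
# The entries are illustrative examples, not an exhaustive list.
corrections = {
    "Franken": "Kranken",  # "the Franks" -> "the sick"
    "Firche": "Kirche",    # -> "church"
}
pattern = re.compile("|".join(map(re.escape, corrections)))

def post_correct(text: str) -> str:
    """Replace every known misrecognized word in one pass."""
    return pattern.sub(lambda m: corrections[m.group(0)], text)

print(post_correct("Jesus heilt die Franken in der Firche."))
# -> Jesus heilt die Kranken in der Kirche.
```

In practice you would grow such a table while proofreading: every recurring error becomes one more entry.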
What matters to me are the words of a specific “sermon language,” like “Frömmigkeit” (piety) or “Religion.” “Frömmigkeit” is often transcribed with a single m because the m is overlined in the manuscript: the overline doubles the letter and was used to save space or speed up the writing. For my purposes, this poses no problem. A truncated search for “fröm*” lists the variants “frömigkeit, frömmigkeit, frömigkt,” etc., and I got what I wanted: a full-text, searchable document to work with.
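The truncated search amounts to a simple prefix match. A sketch of the idea with a regular expression (the sample line is invented for illustration):

```python
import re

# Any token beginning with "fröm" matches, so spelling variants caused
# by the overlined single "m" are caught too. The sample text is made up.
text = "Die Frömigkeit wächst, doch ächte Frömmigkeit ist selten; Frömigkt. fehlt."

hits = re.findall(r"\bfröm\w*", text, flags=re.IGNORECASE)
print(hits)  # -> ['Frömigkeit', 'Frömmigkeit', 'Frömigkt']
```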
Additionally, I was curious what would happen if I uploaded the whole corpus to VoyantTools for a fast and simple linguistic analysis (VoyantTools is a great web application for quick analyses and visualizations of digital texts, aimed at users without brilliant coding skills). The results were far from what I expected. They were magnificent! I split the sermons by the years of their origin and then displayed the frequency of the search keyword “fröm|from.” The following visualization is the result.
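Conceptually, the analysis behind the visualization is just a keyword count per year. A minimal sketch of that idea, with invented sample data (the sermon snippets and years below are placeholders, not my corpus):

```python
import re
from collections import Counter

# Invented sample data: plain-text sermons grouped by year of origin.
sermons_by_year = {
    1899: "Die Religion lebt, doch die Frömmigkeit schwindet.",
    1901: "Frömmigkeit, wahre Frömigkeit, ist fromme Pflicht.",
}

# Tokens starting with "fröm" or "from", matching the Voyant query.
keyword = re.compile(r"\b(?:fröm|from)\w*", re.IGNORECASE)

counts = Counter()
for year, text in sermons_by_year.items():
    counts[year] = len(keyword.findall(text))

print(dict(sorted(counts.items())))  # -> {1899: 1, 1901: 3}
```

Voyant does this (and the plotting) for you, but seeing the count laid out makes clear what the trend line actually measures.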
The visualization shows a striking result: the use of words related to “Frömmigkeit” rises sharply from 1901 onwards. In the historical context, this is not surprising: during those years, Bassermann began to complain about a growing estrangement from the church and, of course, had to support activities to reinforce piety (“Frömmigkeit”) with his sermons.
I’m sure that working with this corpus without the technical capabilities of Transkribus would be impossible. Anyone who works with manuscripts should take it into account. Even if the HTR does not deliver the desired results, the Layout Analysis alone is a huge simplification of the work. And the ground rule remains: the more correctly transcribed data you provide, the better the HTR model becomes – and the more smiles it brings. I’d be happy to hear about your Transkribus experiences!
Finished my handwritings #experiment with @Transkribus. The results are remarkable!!! And a fast & simple analysis of the #HTR made text (975 pages plain txt, >700k words) with @VoyantTools already brought tons of new insights.
— Stefan Karcher (@streka) July 10, 2018