Achim Rabus, Martin Meindl & Milanka Matić-Chalkitis (MultiHTR project) · PyLaia · Published November 20, 2022

DEK_German_combined

Text Recognition

Description

This is the first version of a combined model for Deutsche Einheitskurzschrift (DEK, German standard shorthand), based on natural and synthetic training data. The natural ground-truth data consists of several diaries of a private individual and was kindly provided by the German Diary Archive (DTA) (https://tagebucharchiv.de/). Special thanks go to the director of the DTA, Marlene Kayen. The synthetic training data (electronically available longhand texts converted into German standard shorthand) comprises Goethe's “Faust” (https://jens-wawrczeck.de/stenogenerator/goethe/Faust%201%20(Goethe)%20-%20A4%20oL.pdf and https://www.projekt-gutenberg.org/goethe/faust1/) and Grimm's fairy tales. The model was trained by Achim Rabus; Martin Meindl and Milanka Matić-Chalkitis also contributed to its creation as part of the MultiHTR project at the Department of Slavic Languages and Literatures of the University of Freiburg (Germany). The model is suitable for transcribing natural manuscripts written in DEK. It can also be useful as a base model for other German shorthand systems.

Low error rate: 9.5% CER

Character Error Rate (CER) measures the percentage of characters incorrectly recognised. Lower is better. This model scored 9.5% on its validation set. As a rule of thumb, a CER below 10% is considered good for most handwritten material. This is a larger model trained on diverse material, which generally makes it more robust across different handwriting styles. That said, larger training sets also make it harder to push the CER down further.
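For illustration, CER is conventionally computed as the Levenshtein edit distance between the recognised text and the reference transcription, divided by the length of the reference. The following is a minimal sketch of that calculation, not the implementation PyLaia or Transkribus actually uses:

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of single-character edits (insertions,
    deletions, substitutions) needed to turn `hyp` into `ref`,
    computed with the standard dynamic-programming recurrence."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, recognised: str) -> float:
    """Character Error Rate: edit distance relative to reference length."""
    return levenshtein(reference, recognised) / len(reference)

# One substitution in a 5-character reference gives a CER of 0.2 (20%)
print(cer("hello", "hallo"))  # -> 0.2
```

Note that because insertions are counted, a CER above 100% is possible when the recognised text is much longer than the reference.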

Measured on the model's own validation data. Results on your documents may differ depending on handwriting style, document condition, language, and how closely your material resembles the training data.

Words: 144,709
Lines: 16,538
Training Pages: 698
Model ID: 47882