Milanka Matić-Chalkitis (MultiHTR project) · PyLaia · Published November 12, 2023

OttomanTurkish_generic

Text Recognition

Description

The model was trained on handwritten and printed material in the Ottoman-Turkish language, written in Arabic-Persian script. It was trained by Milanka Matić-Chalkitis as part of the MultiHTR project (project leader: Prof. Dr. Achim Rabus) at the Department of Slavic Languages and Literatures of the University of Freiburg (Germany).

The handwritten training data consists largely of the poetry collection 'Mecmua' (https://mecmua.acdh.oeaw.ac.at/toc.html), the poetry collection of the Ottoman poet Keşfī, and a smaller collection of travelogs and correspondence of the Ottoman military apparatus from the QHoD project (Digital Edition of Sources on Habsburg-Ottoman Diplomacy 1500-1918; https://qhod.net/). We would like to thank Prof. Dr. Hülya Çelik (University of Bochum), Prof. Dr. Yavuz Köse (University of Vienna) and Dr. Stephan Kurz (Austrian Academy of Sciences) for kindly providing the data and for their close cooperation.

The printed data includes parts of various newspapers and journals from the late Ottoman period, provided by Suphan Kirmizialtin (Digital Ottoman Corpora, https://www.digitalottomancorpora.org/). Many thanks to Suphan for the great support and help.

The ground truth was reused according to the 'data recycling' principle, so the training data is highly diverse in physical quality, layout, font, age and transcription rules. The model is intended as an auxiliary transcription model for users with little or no knowledge of the Ottoman-Turkish language and/or the Arabic-Persian script.

Moderate error rate: 11.6% CER

Character Error Rate (CER) measures the percentage of characters incorrectly recognised. Lower is better. This model scored 11.6% on its validation set. As a rule of thumb, a CER below 10% is considered good for most handwritten material. This is a larger model trained on diverse material, which generally makes it more robust across different handwriting styles. That said, larger training sets also make it harder to push the CER down further.
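To make the metric concrete, here is a minimal sketch of how a CER is commonly computed: the Levenshtein (edit) distance between the recognised text and the reference transcription, divided by the length of the reference. This is the standard textbook definition, not necessarily the exact procedure Transkribus uses; the function names `levenshtein` and `cer` and the sample strings are illustrative assumptions. The sketch counts Unicode code points, so in Arabic-Persian script a combining diacritic counts as a character of its own unless the text is normalised first.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of substitutions, insertions and deletions
    needed to turn `hypothesis` into `reference`."""
    # prev[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,                      # reference char missing from hypothesis
                curr[j - 1] + 1,                  # extra char in hypothesis
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate as a fraction (multiply by 100 for percent)."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Hypothetical example: one wrong character in a five-character reference.
print(f"{cer('kitab', 'kitap'):.1%}")
```

Read this way, a CER of 11.6% corresponds to roughly 12 character-level corrections per 100 characters of reference text. Because insertions also count as errors, a CER computed like this can exceed 100% on very poor output.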

Measured on the model's own validation data. Results on your documents may differ depending on handwriting style, document condition, language, and how closely your material resembles the training data.

Words: 201,096
Lines: 25,480
Training pages: 789
Model ID: 56496
Languages: Turkish Ottoman (1500-1928)