aleksej.tikhonov · PyLaia · Published May 5, 2023

Transcription Aid Model 1 for Handwritten Yiddish (Hebrew to Latin)

Text Recognition

Description

This first version of a transcription assistance model for handwritten Yiddish texts was developed by Aleksej Tikhonov as part of the MultiHTR project (Freiburg, Germany; project leader: Achim Rabus). The bulk of the ground-truth (GT) data was supplied by the DYBBUK project, funded by the European Union (ERC StG, No. 958150). The texts were sourced from dramas written by Moyshe Hurwitz (1844-1910) and Joseph Lateiner (1853-1935). We extend our gratitude to Ruthie Abeliovich and Sinai Rusinek (DYBBUK project, Tel Aviv University).

Another part of the GT data comes from Astrid Lembke (University of Mannheim) and consists of Yiddish texts from the 16th century: the manuscript MS Cambridge, Trinity College, F.12.45, which includes two poems by Elia Levita (ca. 1469-1549) as well as three narrative texts, the Mayse mi-Danzek (story from Danzig), the Mayse mi-Menz (story from Mainz) and the Mayse fun Würms (story from Worms). We would like to thank Astrid Lembke for the professional exchange and advice.

For the semi-automatic transfer of the GT from the Hebrew to the Latin alphabet, the tool Protea t3xt conv3rt3r by Gal Abramovitz was used.

This model is not a transliteration model but a transcription assistance model. It is aimed at a broader audience and intended to help those who cannot read the Hebrew alphabet to read, understand, and, if necessary, learn Yiddish in the Hebrew script using Latin transcriptions. Since the phonological inventories of the Hebrew and Latin alphabets are not identical, the same grapheme may have different phonetic realizations depending on its consonantal environment. For more precise transcription, you can use the “Transcription Aid Model 2 for Handwritten Yiddish (Hebrew to Latin)”. The first model can be used to infer the content of a text or to enable limited keyword searches, even for people who have not mastered or are still learning the Hebrew alphabet in its Yiddish application.

Very low error rate: 5% CER

Character Error Rate (CER) measures the percentage of characters incorrectly recognised. Lower is better. This model scored 5% on its validation set. As a rule of thumb, a CER below 10% is considered good for most handwritten material. This is a larger model trained on diverse material, which generally makes it more robust across different handwriting styles. That said, larger training sets also make it harder to push the CER down further.

Measured on the model's own validation data. Results on your documents may differ depending on handwriting style, document condition, language, and how closely your material resembles the training data.
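
To make the metric concrete, here is a minimal sketch of how a CER figure can be computed for a single line. It assumes the common definition (Levenshtein edit distance between the recognised text and the reference, divided by the reference length); this page does not state the exact formula or text normalisation Transkribus applies, so treat it as an illustration only. The function names and the sample line are hypothetical and not taken from the model's data.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of character insertions, deletions and substitutions."""
    # prev[j] holds the distance between the reference prefix processed so far
    # and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER as a fraction of the reference length (assumed definition)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Hypothetical example: a 20-character reference line with one wrong character.
reference = "der mentsh trakht un"
hypothesis = "der mentsh trakht on"
cer = character_error_rate(reference, hypothesis)
```

Under this definition, one wrong character in a 20-character line gives a CER of 1/20, which is the 5% level reported for this model. Because insertions count as errors, the value can exceed 100% when the output is much longer than the reference.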

Words: 135,982
Lines: 16,849
Training Pages: 621
Model ID: 51946
Languages: Yiddish