Sinai Rusinek · PyLaia · Published June 13, 2025

DiJeSt 3.0

Text Recognition

Description

A model for printed (or typed) text in Hebrew script (mainly Hebrew and Yiddish, both modern and Weiberteitch). The data includes the following contributions:

- DiJeSt 2.0. 1,757 pages. The basis for the previous Transkribus model of that name (https://app.transkribus.org/models/text/46003), collected with the support of Rothschild Foundation Hanadiv Europe in the framework of the project DiJeSt: Digitizing Jewish Studies. For more details see https://dijest.net/gtmodel/
- Hasidic Stories. 446 pages. Funded by the project “Historical Digital Analysis of Hasidic Stories Until 1914”, ISF research grant no. 1478/2, headed by Gadi Sagiv, the Open University of Israel.
- Zylbercweig Lexicon. 285 pages. Funded by the project “Historical Digital Analysis of Zalmen Zylbercweig’s Lexicon of Yiddish Theatre”, ISF grant no. 284/24, headed by Ruthie Abeliovich, Tel Aviv University.
- 20th-century Hebrew Newspapers. 32 pages. Funded by the project “The Double Movement? Towards a Socioeconomic Historiography of the Right in Israel (1948-1984)”, ISF grant no. 198/23, headed by Amir Goldstein, Tel Hai Academic College.
- Community regulations, 1711-1929, High German Jewish community in Amsterdam. 250 pages. Ronny Reshef and Mirjam Gutschow. Cf. https://zenodo.org/records/7692989, https://zenodo.org/records/11179901

Very low error rate: 1.79% CER

Character Error Rate (CER) measures the percentage of characters incorrectly recognised. Lower is better. This model scored 1.79% on its validation set. As a rule of thumb, a CER below 10% is considered good for most handwritten material; this model is trained on printed and typed text. It is a larger model trained on diverse material, which generally makes it more robust across different typefaces and print styles. That said, larger training sets also make it harder to push the CER down further.

Measured on the model's own validation data. Results on your documents may differ depending on typeface, document condition, language, and how closely your material resembles the training data.
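To make the metric concrete, here is a minimal sketch of one common way to compute CER: the Levenshtein (edit) distance between the recognised text and the reference transcription, divided by the length of the reference. This is an assumption about the definition, not necessarily the exact procedure Transkribus uses, and the function names `levenshtein` and `cer` are illustrative.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn hyp into ref."""
    # prev[j] holds the distance between the first i-1 chars of ref
    # and the first j chars of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # character missing from hyp
                curr[j - 1] + 1,     # extra character in hyp
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate as a fraction of the reference length."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return levenshtein(ref, hyp) / len(ref)


if __name__ == "__main__":
    reference = "שלום עולם"   # ground-truth line (9 characters)
    recognised = "שלום עולס"  # one wrong final letter
    print(f"{cer(reference, recognised):.2%}")  # prints 11.11%
```

Under this definition, a CER of 1.79% corresponds to roughly 18 character errors per 1,000 characters of reference text. Note that the sketch counts Unicode code points, so vocalised Hebrew (with niqqud as combining marks) would count each mark as its own character.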

Words: 1,498,332
Lines: 173,190
Training Pages: 2,853
Model ID: 357765
Languages: Hebrew, Judeo-Arabic, Ladino, Yiddish
Centuries: 15th c., 16th c., 17th c., 18th c., 19th c., 20th c., 21st c.