Roxana Patras · PyLaia · Published April 19, 2023

RTA2 (Romanian Transition Alphabet)

Text Recognition

Description

This is a first attempt at a model trained on texts in the Romanian Transition Alphabet (1830-1862). To train an HTR model for these texts, I chose 5 samples dating from before and after 1859, when the 2 Romanian provinces became a country with an official language. The samples show the progression from a heavy use of Cyrillic letters to a more eye-friendly one, which makes reading more fluent. As a general rule, Latin capital letters are preferred for titles after 1859.

The Latin letters Z/z, M/m, D/d, S/s, T/t, N/n, A/a, I/i, E/e, O/o, Î/î, U/u, Ŭ/ŭ, Ĭ/ĭ are present from the oldest sampled text (1853), whereas the following Cyrillic letters are still in use: Х/х (ha), Ш/ш (sha), Щ/щ (shcha), Ц/ц (tze), Џ/џ (dze), Ч/ч (che), Ъ/ъ (ă), П/п (pe), Р/р (er), Ж/ж (zhe), Ф/ф (ef), К/к (ca), В/в (ve), Л/л (el), Г/г (ghe), Б/б (be). Among these Cyrillic letters, the first to receive a Latin equivalent are: Ф/ф (ef) → f; Г/г (ghe) → g; Л/л (el) → l; Ж/ж (zhe) → j. At the same time, Р/р (er), П/п (pe), Ъ/ъ (ă), Ч/ч (che), В/в (ve), Ш/ш (sha), Щ/щ (shcha), Ц/ц (tze) tend to be maintained until 1862, when some of them are replaced with glyphs such as “ḑ” (dz), “ş” (sh) and “ț” (tz), which were imported from the Livonian alphabet but entered the printing circuit only after 1865.

The general guidelines for transcription have been established as follows:
1. Creation of the collection “ALFABET DE TRANZITIE” containing 6 items.
2. Random transcription of initial, middle, and end pages.
3. One-to-one transliteration of all Cyrillic letters, except where K/k stands for the group Ch/ch (e.g. Бukete → Bukete): Х/х → H/h; Ш/ш → Ș/ș; Щ/щ → Șt/șt; Ц/ц → Ț/ț; Ч/ч → C/c; Ъ/ъ → Ă/ă; П/п → P/p; C/c → S/s; Р/р → R/r; Ж/ж → J/j; Ф/ф → F/f; К/к → C/c; В/в → V/v; Л/л → L/l; Г/г → G/g; Б/б → B/b; Џ/џ → G/g.
4. Customization of the following glyphs: apostrophe, right double quotation mark, double low-9 quotation mark, Ŭ/ŭ, Ĭ/ĭ, á.
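The one-to-one transliteration in guideline 3 can be sketched in Python as a plain character table. This is only an illustration under stated assumptions, not the workflow actually used for the transcriptions: the names CYRILLIC_TO_LATIN and transliterate are hypothetical, the pair "C/c → S/s" is read here as the Cyrillic letter С/с (es), and the context-dependent exception for K/k standing for Ch/ch is not detected.

```python
# Hypothetical helper; not part of the model or of Transkribus.
# The pairs follow guideline 3 of the description.
# Assumption: "C/c -> S/s" refers to the Cyrillic letter С/с (es).
CYRILLIC_TO_LATIN = {
    "Х": "H", "х": "h",
    "Ш": "Ș", "ш": "ș",
    "Щ": "Șt", "щ": "șt",
    "Ц": "Ț", "ц": "ț",
    "Ч": "C", "ч": "c",
    "Ъ": "Ă", "ъ": "ă",
    "П": "P", "п": "p",
    "С": "S", "с": "s",
    "Р": "R", "р": "r",
    "Ж": "J", "ж": "j",
    "Ф": "F", "ф": "f",
    "К": "C", "к": "c",
    "В": "V", "в": "v",
    "Л": "L", "л": "l",
    "Г": "G", "г": "g",
    "Б": "B", "б": "b",
    "Џ": "G", "џ": "g",
}

# str.maketrans accepts single-character keys and string values of any length,
# so the two-letter output for Щ/щ needs no special handling.
_TABLE = str.maketrans(CYRILLIC_TO_LATIN)


def transliterate(text: str) -> str:
    """Replace each mapped Cyrillic letter; leave all other characters as they are."""
    return text.translate(_TABLE)
```

With this table, the example from guideline 3, transliterate("Бukete"), gives "Bukete": the Cyrillic Б is replaced and the Latin k is left alone because it is not in the table. A Cyrillic к is always turned into c, so cases where it stands for Ch/ch would still need manual review.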

Very low error rate: 2.8% CER

Character Error Rate (CER) measures the percentage of characters incorrectly recognised. Lower is better. This model scored 2.8% on its validation set. As a rule of thumb, a CER below 10% is considered good for most handwritten material. This is a smaller, specialised model. It may achieve a very low CER on material similar to its training data, but could be less robust on unfamiliar handwriting or layouts.
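As a rough illustration of how a CER figure of this kind can be computed, the Python sketch below uses the common definition: the character-level edit (Levenshtein) distance between a reference transcription and the recognised text, divided by the length of the reference. The function names are hypothetical, and the page does not say how Transkribus normalises text or aggregates over lines, so this should not be read as the platform's exact procedure.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of character insertions, deletions and substitutions."""
    # prev[j] holds the distance between the processed prefix of `reference`
    # and the first j characters of `hypothesis`.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = prev[j - 1] + (0 if ref_char == hyp_char else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate as a fraction of the reference length."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)
```

Multiplying the result by 100 gives a percentage comparable to the one quoted above. Because the count is per character, a letter with a diacritic stored as one code point and the same letter stored as base letter plus combining mark would be treated as different, so both texts should use the same Unicode normalisation before comparison.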

Measured on the model's own validation data. Results on your documents may differ depending on handwriting style, document condition, language, and how closely your material resembles the training data.

Words: 10,250
Lines: 1,211
Training Pages: 41
Model ID: 51515
Centuries: 19th c.