lara.piva.95 · PyLaia · Published January 11, 2025

DiploLatina

Text Recognition

Description

Model for printed editions of Latin classics: diplomatic transcription with abbreviation tags. This model is part of the PhD project "AI-powered text recognition models and classical philology: Livy's 16th-century editions as a case study" (it. Modelli pubblici di riconoscimento testuale basati sull'intelligenza artificiale e filologia classica: il caso delle cinquecentine di Tito Livio), University of Padua, partially funded by the PON REACT-EU programme.

Transcription criteria. This model is meant to recognise several editions from across the 16th century, so expansions are normalized (e.g. ſ>s; circū->circum-; quēquam>quemquam; solēnis>solemnis; Pyrenȩum>Pyrenaeum) but not corrected (e.g. ꝓro>proro vs. pro). In case of graphic variation (e.g. prælium) the abbreviation is normalized (e.g. p̄lium>proelium). The approximant "v" is converted to "u" (caps excluded). Diphthongs and vocalic ligatures (e.g. æ>ae; ij>ii), numbers (e.g. .xxi.>XXI) and clitic coordination (e.g. q́./>que) are standardised. An unusual "et" ligature found in the corpus is conventionally transcribed as "£", since there is almost no chance of finding the British pound sign in Italian editions of classical Latin works. In case of ink gaps and line breaks, the tag is split (e.g. æta[aeta]_is). Where a proper noun has no capital, lowercase is kept (e.g. auētino>auentino vs. Auentino). The consonant ligatures (e.g. the ones for ct, ſt, tt) are not transcribed as a single character, both because they are consistent (unlike diphthongs) and because this model is meant to be usable with generic fonts (e.g. Gentium).

Purpose and training data. The combination of diplomatic transcription and normalized expansions has a philological purpose: on the one hand, 'recensio' and 'collatio' of exemplars from the same edition; on the other hand, comparison between different editions.
The dataset comes from the third decade of the following editions of Livy, mostly in Roman type, some in Italic type: Rusconi, Venice 1501; Minuziano, Milan 1505; heirs of Manuzio and Torresano, Venice 1518-1533, 1520-1522, 1555, 1566, 1572, 1592; Sessa and Ravani, Venice 1520; Giunta family, Florence 1522-1532, 1533, 1542. Some pages are from Rosso brothers, Venice 1507, and Pincio, Venice 1511. To prevent overfitting, the dataset has been enriched with further text: introductions to Livy's editions from 1469 to 1592 (many of them in Gothic-Antiqua type), and some pages from other Latin works and from editions printed in other countries.

Parameters.
- TS: 1072; VS: 41
- 200 training cycles
- Early stopping: 30
- Learning rate: 0.0003
- Train Abbrevs with expansion
- Dewarping method: dewarp

Last version trained and released: 10/01/2024

WARNINGS:
- The first and the last text rows of a page may be poorly recognized.
- Occasionally, given two words, the first is correctly transcribed but carries the text of the second one as an "extension"; similar problems may occur less frequently.
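A few of the purely character-level conventions in the transcription criteria above can be sketched in Python, for instance to bring an existing reference transcription in line with them before comparing it with the model's output. This is only an illustrative sketch and not part of the model: the function name `normalize` and the regular expression are assumptions, and context-dependent expansions (such as the macron in quēquam>quemquam vs. auētino>auentino) are deliberately left out.

```python
import re

# Hypothetical helper: lowercase Roman numerals written between dots, e.g. ".xxi."
ROMAN_NUMERAL = re.compile(r"\.([ivxlcdm]+)\.")


def normalize(text: str) -> str:
    """Apply a small subset of the transcription conventions to a string."""
    # Numerals first, so that a "v" inside a numeral is not turned into "u".
    text = ROMAN_NUMERAL.sub(lambda m: m.group(1).upper(), text)
    text = text.replace("ſ", "s")    # long s
    text = text.replace("æ", "ae")   # vocalic ligature
    text = text.replace("ij", "ii")  # ij > ii
    text = text.replace("v", "u")    # lowercase v > u; capital V is untouched
    return text


if __name__ == "__main__":
    for sample in ["ſenatus", "prælij", ".xxi.", "vrbs Veii"]:
        print(sample, "->", normalize(sample))
```

The order of the steps matters: numerals are uppercased before the v>u substitution, which keeps ".xvi." from ending up as "XUI".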

Very low error rate: 0.34% CER

Character Error Rate (CER) measures the percentage of characters incorrectly recognised. Lower is better. This model scored 0.34% on its validation set. As a rule of thumb, a CER below 10% is considered good for most handwritten material. This is a larger model trained on diverse material, which generally makes it more robust across different handwriting styles. That said, larger training sets also make it harder to push the CER down further.

Measured on the model's own validation data. Results on your documents may differ depending on handwriting style, document condition, language, and how closely your material resembles the training data.
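To check the error rate on your own pages, CER can be computed from a ground-truth transcription and the model's output. The sketch below assumes the common definition (Levenshtein edit distance divided by the length of the reference, as a percentage); the page does not state exactly how the platform computes its figure, and the function names `levenshtein` and `cer` are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate in percent, relative to the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return 100.0 * levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # One substituted character out of six.
    print(f"{cer('circum', 'circvm'):.2f}% CER")
```

For scale, a CER of 0.34% corresponds to roughly 3 to 4 character errors per 1,000 characters of reference text.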

Words: 870,793
Lines: 76,160
Training Pages: 1,072
Model ID: 262949
Languages: Latin