wilkenjonathan · PyLaia · Published October 1, 2022

Greek_Medieval-and-Modern-Minuscule

Text Recognition

Description

This model is intended for medieval and modern Greek minuscule manuscripts. The initial data set consisted of ten manuscripts dating from the 10th to the 19th centuries. Because several of these manuscripts were the work of more than one scribe, the model is trained on at least 15 unique hands. The texts are all drawn from the Old and New Testaments or the Testaments of the Twelve Patriarchs.

Spaces have been included between words in the training data even when not present in the manuscript (as was the case in Cambridge University Library manuscript Ff 1.24). The model therefore attempts to separate words even in scriptio continua manuscripts.

The model has been trained to extract the text only and no other features. Diacritics, accents, punctuation, etc. have been excluded, and capitalization has likewise been ignored. The model has not been trained to resolve abbreviations, alphabetic representations of numerals, or nomina sacra. It has, however, been trained to resolve ligatures.

The list of manuscripts on which this model was trained is as follows:

Bodleian Library - Barocci 133
Bodleian Library - Holmes 94
Bodleian Library - Holmes 155
Bodleian Library - Smith 117
British Library - Harley 7522A
Cambridge University Library - Ff 1.24
Cambridge University Library - Oo.VI.91,8
Queens College, Oxford - 214
Trinity College, Cambridge - O.4.24
Trinity College, Cambridge - B.10.3

Note: transcriptions of the entire codices were not always used.
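Because the model outputs bare text (no diacritics, accents, punctuation, or capitals), an existing polytonic transcription has to be reduced to the same form before it can be compared with the model's output. The following is a minimal sketch of such a reduction using only Python's standard library. The function name `normalize_for_model` is hypothetical, and the exact normalisation applied to the training data is not documented here, so treat the rules below as an assumption rather than the author's procedure.

```python
import unicodedata


def normalize_for_model(text: str) -> str:
    """Reduce a Greek transcription to bare lowercase letters and single spaces.

    Assumed rules (not confirmed by the model description):
    - strip all combining marks (breathings, accents, iota subscript, diaeresis)
    - drop every character in a Unicode punctuation category
    - lowercase the remaining letters
    - collapse runs of whitespace to a single space
    """
    # NFD splits precomposed letters into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    kept = []
    for ch in decomposed:
        category = unicodedata.category(ch)
        if category == "Mn":          # nonspacing combining mark
            continue
        if category.startswith("P"):  # any punctuation category
            continue
        kept.append(ch)
    bare = "".join(kept).lower()
    return " ".join(bare.split())


if __name__ == "__main__":
    print(normalize_for_model("Ἐν ἀρχῇ ἦν ὁ λόγος,"))
```

Abbreviations, numerals, and nomina sacra are left untouched by this sketch, in line with the statement that the model does not resolve them.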

Very low error rate: 2.4% CER

Character Error Rate (CER) measures the percentage of characters incorrectly recognised. Lower is better. This model scored 2.4% on its validation set. As a rule of thumb, a CER below 10% is considered good for most handwritten material. This is a larger model trained on diverse material, which generally makes it more robust across different handwriting styles. That said, larger training sets also make it harder to push the CER down further.

Measured on the model's own validation data. Results on your documents may differ depending on handwriting style, document condition, language, and how closely your material resembles the training data.
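To estimate the error rate on your own material, you can compare the model's output with a manual transcription of a few sample lines. The sketch below uses the common definition of CER as the character-level edit (Levenshtein) distance divided by the length of the reference; the text above does not specify how Transkribus computes its figure, so this definition and the function names `levenshtein` and `cer` are assumptions for illustration.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions, and substitutions
    needed to turn `hyp` into `ref` (row-by-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate as a fraction of the reference length."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    reference = "εν αρχη ην ο λογος"   # manual transcription
    hypothesis = "εν αρχη ην ο λογοσ"  # hypothetical model output
    print(f"CER: {cer(reference, hypothesis):.1%}")
```

A CER of 2.4% corresponds to roughly 24 character errors per 1,000 reference characters. Both strings should follow the same conventions (no diacritics, punctuation, or capitals) before they are compared; otherwise the differences in convention are counted as errors.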

Words: 269,905
Lines: 32,481
Training Pages: 1,051
Model ID: 45032
Languages: Greek Modern (1453-), Greek Ancient (to 1453)