NOSCEMUS project (Stefan Zathammer) · PyLaia · Published October 29, 2021

Noscemus GM 5

Text Recognition

Description

The "Noscemus General Model" is tailored towards recognizing Latin prints from the early modern period. Although the model is designed for Latin prints set in Antiqua-based typefaces, it can also recognize passages in Greek and passages set in (German) Fraktur.

In creating the Ground Truth, the following transcription guidelines were followed:

- ligatures (e.g. Æ or æ, Œ or œ) and standard abbreviations (e.g. -que, -us, -tur, …mm…, …nn…) were expanded
- long s (ſ) was transcribed as a normal s
- small caps were transcribed as majuscules
- special characters and diacritics (e.g. &, ë, ï or ę) were kept

The model was released by Stefan Zathammer and is based on training data from the Digital Sourcebook of the NOSCEMUS project (https://transkribus.eu/r/noscemus/#/).

If you use the Noscemus model as a base model for your own model, or if your edition is based on a transcription made with the help of the Noscemus model, you are kindly requested to mention the Noscemus model.

The NOSCEMUS project (https://www.uibk.ac.at/projects/noscemus/) has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 741374).
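If you want to compare your own diplomatic transcription against the model's output, it can help to bring your text into the same conventions first. The following is a minimal sketch of the ligature and long-s rules from the transcription guidelines above. The function name `normalize_to_noscemus` and the exact casing of the expansions (Æ to "AE", Œ to "OE") are assumptions for illustration and not part of the Noscemus project's tooling. Abbreviation expansion and small caps are not covered, because they cannot be resolved by simple character replacement.

```python
# Hypothetical helper: applies only the ligature and long-s guidelines.
# The casing of the expansions (e.g. "Æ" -> "AE") is an assumption.
LIGATURES = {"Æ": "AE", "æ": "ae", "Œ": "OE", "œ": "oe"}
LONG_S = {"ſ": "s"}

# str.maketrans accepts single-character keys mapped to replacement strings.
_TABLE = str.maketrans({**LIGATURES, **LONG_S})


def normalize_to_noscemus(text: str) -> str:
    """Expand ligatures and replace long s; keep all other characters."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    print(normalize_to_noscemus("Cæſar"))
    # Special characters and diacritics such as &, ë, ï and ę are left alone.
    print(normalize_to_noscemus("poëta & cęlum"))
```

Characters such as &, ë, ï and ę pass through unchanged, in line with the guideline that special characters and diacritics were kept.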

Very low error rate: 0.6% CER

Character Error Rate (CER) measures the percentage of characters incorrectly recognised. Lower is better. This model scored 0.6% on its validation set. As a rule of thumb, a CER below 10% is considered good for most handwritten material. This is a larger model trained on diverse material, which generally makes it more robust across different handwriting styles. That said, larger training sets also make it harder to push the CER down further.

Measured on the model's own validation data. Results on your documents may differ depending on handwriting style, document condition, language, and how closely your material resembles the training data.
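To check how the model performs on your own pages, you can compute a CER yourself from a manually corrected reference and the model's output. The sketch below uses the common definition of CER as the character-level edit distance (insertions, deletions, substitutions) divided by the length of the reference. That Transkribus computes its reported figure in exactly this way, with no additional normalization, is an assumption; the function names are illustrative.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of character insertions, deletions and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate as a fraction of the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    reference = "arma virumque cano"
    hypothesis = "arma uirumque cano"  # one substituted character
    print(f"CER: {cer(reference, hypothesis):.1%}")
```

For orientation, a CER of 0.6% corresponds to roughly 6 character errors per 1,000 characters of reference text.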

Words: 607,837
Lines: 92,740
Training Pages: 2,975
Model ID: 37855
Languages: German, Greek Ancient (to 1453), Latin