Saulo Rogério · PyLaia · Published January 18, 2025

Early Portuguese Printing (16th-19th)

Text Recognition

Description

This model was trained on a dataset of selected Portuguese grammars and linguistic publications spanning the 16th to the 18th centuries. These documents, along with many others, are publicly accessible through the Portuguese National Digital Library (bndigital.bnportugal.gov.pt). The training set for this version comprises 142,606 words (745 pages) printed in Portuguese from 1536 onward.

The texts contain unique letters, diacritics, historical acronyms, typography, and fleurons characteristic of the historical Portuguese writing system as it was adapted to the new press technology, all of which this model has been trained to recognize. Because the training set has a grammatical and historical linguistic focus, the model can also recognize certain Greek letters, Latin text, table patterns, and simple initial capitals. However, training in these areas was limited, so the model is not recommended for those uses.

This model was developed as part of a master's degree project in the postgraduate linguistics program at the Universidade Federal de Santa Catarina (UFSC). The author (saulo.r@posgrad.ufsc.br) was financially supported by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior – Brasil (CAPES).

Very low error rate: 2.58% CER

Character Error Rate (CER) measures the percentage of characters incorrectly recognised. Lower is better. This model scored 2.58% on its validation set. As a rule of thumb, a CER below 10% is considered good for most handwritten material. This is a larger model trained on diverse material, which generally makes it more robust across different handwriting styles. That said, larger training sets also make it harder to push the CER down further.

Measured on the model's own validation data. Results on your documents may differ depending on handwriting style, document condition, language, and how closely your material resembles the training data.
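To make the CER definition concrete, here is a minimal sketch in Python that computes it as the character-level edit distance (insertions, deletions, substitutions) between a reference transcription and the recognised text, divided by the reference length. This is the common textbook definition; the exact procedure Transkribus uses (for example, how it normalises whitespace) is not stated on this page, so treat the details as assumptions. The function names and the sample strings are illustrative only and are not taken from the training data.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of character insertions, deletions and
    substitutions needed to turn `hypothesis` into `reference`."""
    # prev[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,         # character missing from the hypothesis
                curr[j - 1] + 1,     # extra character in the hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate as a fraction of the reference length."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Illustrative pair: one character misread in the hypothesis.
    truth = "Grammatica da lingoagem portuguesa"
    output = "Grammatica da lingoagem portuguefa"
    print(f"CER: {cer(truth, output):.2%}")
```

Running the same calculation on a few manually corrected lines from your own documents gives a more realistic estimate than the validation score alone.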

Words: 142,606
Lines: 23,045
Training Pages: 745
Model ID: 267229
Languages: Portuguese