Natalia C. Salvador · PyLaia · Published November 24, 2023

XXth century Typewritten Portuguese

Text Recognition

Description

This model was trained on Portuguese typewritten transcriptions from the mid-20th century, which were based on a confraternity statute from the 18th century. A great number of 18th-century documents from Minas Gerais were transcribed by researchers from SPHAN (Serviço do Patrimônio Histórico e Artístico Nacional, 1936-1970). These transcriptions were then typewritten and are now available digitally in IPHAN's archives. The model allows fast and reliable recognition of this type of document.

For this model we followed the exact characters and spacing of the original as closely as possible, even where there was a lapsus calami (slip of the pen), so that the model learns to read exactly what is on the page. Diacritics have been kept as in the original, except where they were faded or too far from the letter, in which case we ignored them. Commas, full stops, and other punctuation marks remain exactly where they appear. Line-break markings, which sometimes appear at the bottom of the word, have been standardized to follow the last word of the line. Letters that are absent or too faded have been ignored, leaving a space where they should be.

Very low error rate: 2.6% CER

Character Error Rate (CER) measures the percentage of characters incorrectly recognised. Lower is better. This model scored 2.6% on its validation set. As a rule of thumb, a CER below 10% is considered good for most handwritten material. This is a smaller, specialised model. It may achieve a very low CER on material similar to its training data, but could be less robust on unfamiliar handwriting or layouts.

Measured on the model's own validation data. Results on your documents may differ depending on handwriting style, document condition, language, and how closely your material resembles the training data.
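To make the metric concrete, here is a minimal Python sketch of how a CER can be computed. It assumes the common definition (character-level edit distance divided by the number of characters in the reference transcription); the exact procedure used by the platform is not stated on this page, and the function names and sample strings are hypothetical.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of character substitutions, deletions and insertions
    needed to turn `hypothesis` into `reference`."""
    # previous[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER as edit distance divided by reference length (assumed definition)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Hypothetical example: one wrong character in a 20-character line.
reference_line = "Irmandade do Rosario"
recognised_line = "Irmandade do Rosarlo"
print(f"{character_error_rate(reference_line, recognised_line):.1%}")  # 5.0%
```

Under this definition, a CER of 2.6% corresponds to roughly one character error in every 38 characters of reference text.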

Words: 7,468
Lines: 697
Training Pages: 25
Model ID: 56926
Languages: Portuguese
Centuries: 20th c.