TraPrInq Project · PyLaia · Published July 5, 2023

Portuguese Handwriting 16th-19th c.

Text Recognition

Description

Generic model created within the framework of the TraPrInq Project (01.2022 to 07.2023), funded by the FCT (Portuguese Agency for Scientific Research), by the following team members: Carla Vieira, Jorge Ferreira Paulo, Hervé Baudry, Leonor Dias Garcia, Ana Margarida Dias da Silva, Maria Olinda Alves Pereira, Mário Soares Fatela, Marize Helena de Campos, Natalia Casagrande Salvador, Susana Tavares Pedro, Suzana Maria de Sousa Santos Severs.

This HTR model is based on the trial records of the Portuguese Inquisition produced between 1536 (with some documents dating from even earlier) and 1821. The ground truth consists of careful transcriptions of 6,226 pages (validation set: 505 pages; training set: 5,721 pages) extracted from 830 trial records (processos), mainly from the Lisbon court, with a total of 1,268,040 words (validation set: 107,760 words; training set: 1,160,280 words). Digitized files can be found on the website of the Portuguese National Archive (Arquivo Nacional da Torre do Tombo). The model has also proved effective on hybrid texts (fill-in forms) and on documents from non-inquisitorial areas.

Broadly, the transcription reproduces the original spelling of words and abbreviations, uses special characters for baseline abbreviation signs and a single COMBINING MACRON for all superscript abbreviation signs, and modernises word separation. The detailed transcription protocol and character list are available at: https://site-2011948.mozfiles.com/files/2011948/Grelha_Criterios.pdf
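Because every superscript abbreviation sign is transcribed with the same Unicode character, COMBINING MACRON (U+0304), the model's output can be post-processed mechanically. The following is a minimal sketch of how one might count or remove these marks in recognised text. The function names and the sample line are hypothetical, and the sketch does not expand abbreviations; the protocol linked above remains the authority on how they should be read.

```python
import unicodedata

# U+0304 is the Unicode character named COMBINING MACRON.
COMBINING_MACRON = "\u0304"


def count_abbreviation_marks(line: str) -> int:
    """Count COMBINING MACRON marks in a transcribed line.

    NFD normalisation splits precomposed letters (e.g. U+0101) into
    base letter + combining mark, so those are counted as well.
    """
    return unicodedata.normalize("NFD", line).count(COMBINING_MACRON)


def strip_abbreviation_marks(line: str) -> str:
    """Remove COMBINING MACRON marks and leave other diacritics intact."""
    decomposed = unicodedata.normalize("NFD", line)
    stripped = decomposed.replace(COMBINING_MACRON, "")
    # Recompose so that letters such as "ã" return to their usual form.
    return unicodedata.normalize("NFC", stripped)


# Hypothetical sample line: "q" carrying a superscript abbreviation sign.
sample = "q\u0304 na\u0303o disse"
print(count_abbreviation_marks(sample))
print(strip_abbreviation_marks(sample))
```

Stripping the marks is only useful for tasks such as rough full-text search; it discards the information that a word was abbreviated.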

Low error rate: 5.2% CER

Character Error Rate (CER) measures the percentage of characters incorrectly recognised. Lower is better. This model scored 5.2% on its validation set. As a rule of thumb, a CER below 10% is considered good for most handwritten material. This is a larger model trained on diverse material, which generally makes it more robust across different handwriting styles. That said, larger training sets also make it harder to push the CER down further.
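To check how the model performs on your own material, CER can be estimated by comparing recognised text against a manual transcription. The sketch below assumes the common definition of CER: the Levenshtein edit distance (insertions, deletions, substitutions) divided by the length of the reference text. This page does not state exactly how Transkribus computes its figure, so treat the result as an approximation; the function names and sample strings are hypothetical.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of single-character edits turning hyp into ref."""
    # prev[j] holds the distance between the processed prefix of ref
    # and the first j characters of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # character of ref missing from hyp
                curr[j - 1] + 1,     # extra character in hyp
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character Error Rate in percent, relative to the reference length."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return 100.0 * levenshtein(ref, hyp) / len(ref)


# Hypothetical example: manual transcription vs. recognised text.
reference = "processo do santo officio"
hypothesis = "processo do samto oficio"
print(f"{cer(reference, hypothesis):.1f}% CER")
```

Both strings should use the same Unicode normalisation form before comparison, since a combining mark such as the macron counts as a character of its own.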

Measured on the model's own validation data. Results on your documents may differ depending on handwriting style, document condition, language, and how closely your material resembles the training data.

Words: 1,159,586
Lines: 153,467
Training Pages: 5,721
Model ID: 53270
Languages: Portuguese
Centuries: 16th c., 17th c., 18th c., 19th c.