hyj384412940 · PyLaia · Published December 5, 2025

Corpus Gradia_prensa_SIGLO XIX v1.0 b

Text Recognition

Description

This model is designed for transcribing 19th-century Spanish (Castilian) printed newspapers from Catalonia. It is based on 150 titles sourced from the Arxiu de Revistes Catalanes Antigues (ARCA: https://arca.bnc.cat/arcabib_pro/es/inicio/inicio.do); the Arxiu Municipal de Lleida (https://arxiu.paeria.cat/es/el-archivo-conserva/hemeroteca); the Arxiu i Documentació Municipal de Tarragona (https://www.tarragona.cat/patrimoni/arxiu-municipal/fons/hemeroteca-1/premsa-digitalitzada-1); the Servei de Gestió Documental, Arxius i Publicacions del Ajuntament de Girona (https://www.girona.cat/sgdap/cat/premsa.php); and the Hemeroteca Digital of the Biblioteca Nacional de España (https://hemerotecadigital.bne.es/hd/es/advanced). The model was trained on 538,687 words and 59,176 lines across 1,006 training pages plus 111 validation pages, and it achieves a 1.09% CER on the validation set. Developed within the Grup de Gramàtica i Diacronia (GRADIA) [2017SGR1337] and the MINECO project "Marginalia en el centro de la investigación diacrónica. Verbos en serie y perífrasis en cadena" (PID2022-138259NB-I00), University of Barcelona. Contact: Yujian Han, yujianhan@ub.edu.

Very low error rate: 1.09% CER

Character Error Rate (CER) measures the percentage of characters incorrectly recognised; lower is better. This model scored 1.09% on its validation set, which is roughly one error in every 90 characters. As a rule of thumb, a CER below 10% is considered good for most handwritten material. This is a larger model trained on diverse material, which generally makes it more robust across varied source material. That said, larger training sets also make it harder to push the CER down further.
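To make the metric concrete, here is a minimal sketch of how a CER is commonly computed: the character-level edit (Levenshtein) distance between a reference transcription and the recognised text, divided by the length of the reference. This is an assumption about the usual definition, not a description of how this platform computes its published figure; any text normalisation or whitespace handling it applies is not stated here. The function names and the sample strings are made up for illustration.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions and
    substitutions needed to turn ref into hyp."""
    prev = list(range(len(hyp) + 1))  # distances from the empty prefix of ref
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into an empty string costs i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r
                curr[j - 1] + 1,     # insert h
                prev[j - 1] + cost,  # substitute r with h, or match
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return levenshtein(ref, hyp) / len(ref)


# Hypothetical sample line (not taken from the training data).
reference = "la prensa del siglo XIX"
hypothesis = "la prensa del sigle XIX"  # one wrong character
print(f"CER: {cer(reference, hypothesis):.2%}")
```

Because the denominator is the reference length, a single wrong character weighs more on a short line than on a long one, which is why CER is normally reported over a whole validation set rather than per line.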

Measured on the model's own validation data. Results on your documents may differ depending on typeface or handwriting style, document condition, language, and how closely your material resembles the training data.

Words: 538,687
Lines: 59,176
Training Pages: 1,006
Model ID: 446925
Languages: Castilian
Centuries: 19th c.