Historisk Datalaboratorium, Aalborg Universitet · PyLaia · Published March 12, 2025

Danish Newspapers 1750-1850

Text Recognition

Description

This model was created to read Danish newspapers in their existing digitised form, as found in Mediestream or Loar. It was trained by Johan Heinsen, Camilla Bøgeskov and the team members of the project Klart som Blæk. For more information, see https://hislab.quarto.pub/aalborgonline/

The model performs best on running text. It reads Fraktur print better than Latin characters, although it can often still decipher the latter, since the newspapers used as training data occasionally include Latin characters. The model far outperforms OCR on deteriorated materials, small letterforms, and material scanned from microfilm, as is the case with the Danish newspaper collection held by the Danish Royal Library. It was trained on material from various advertisement papers, mainly from Copenhagen and Aalborg in the decades around 1800. On Danish material, it should be used together with its language model.

The model performs well on most newspapers from the period. The colonial papers are an exception and need a dedicated model, because they are often multilingual and use Latin characters far more frequently. As of this writing (October 2023), the Transkribus Print models perform better on the papers from St. Croix and St. Thomas. The model was updated in May 2025.

Very low error rate: 0.56% CER

Character Error Rate (CER) measures the percentage of characters incorrectly recognised. Lower is better. This model scored 0.56% on its validation set. As a rule of thumb, a CER below 10% is considered good for most handwritten material. This is a larger model trained on diverse material, which generally makes it more robust across different handwriting styles. That said, larger training sets also make it harder to push the CER down further.
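To make the metric concrete, here is a minimal sketch of how a CER can be computed for a single line, assuming the common definition: character-level edit distance between the recognised text and a ground-truth transcription, divided by the length of the ground truth. The function names `levenshtein` and `cer` are illustrative, and the exact normalisation Transkribus applies (whitespace, Unicode form, averaging across lines) is not stated here, so treat this as an approximation rather than the platform's own calculation.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn `hypothesis` into `reference`."""
    # previous[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            current.append(
                min(
                    previous[j] + 1,                      # character missing from hypothesis
                    current[j - 1] + 1,                   # extra character in hypothesis
                    previous[j - 1] + substitution_cost,  # match or substitution
                )
            )
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate as a fraction of the reference length
    (assumed definition; see the note above)."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical example line, not taken from the model's validation set.
    truth = "Kiøbenhavn"
    recognised = "Kiobenhavn"
    print(f"CER: {cer(truth, recognised):.1%}")
```

Under this definition, a CER of 0.56% corresponds to roughly 56 character errors per 10,000 characters of ground truth.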

Measured on the model's own validation data. Results on your documents may differ depending on handwriting style, document condition, language, and how closely your material resembles the training data.

Words: 420,266
Lines: 60,354
Training Pages: 642
Model ID: 306013
Languages: Danish