Description
This model was created to read Danish newspapers in their existing digitised form, as found in Mediestream or Loar. It was trained by Johan Heinsen, Camilla Bøgeskov and the team of the Klart som Blæk project. For more information, see https://hislab.quarto.pub/aalborgonline/
The model performs best on running text. It reads Fraktur print better than Latin characters, although it can often still decipher the latter, since the newspapers used as training data occasionally include Latin characters.
The model far outperforms OCR when dealing with deteriorated materials, small letterforms, or material that has been scanned from microfilm, as is the case with the Danish newspaper collection held by the Danish Royal Library. It has been trained on materials from various advertisement papers, mainly from Copenhagen and Aalborg in the decades around 1800. When used on Danish material, it should be used with its language model.
The model performs well on most newspapers from the period, but the colonial papers need a separate model, because they are often multilingual and use Latin characters far more extensively. As of this writing (October 2023), the Transkribus Print models perform better on the papers from St. Croix and St. Thomas.
The model was updated in May 2025.