Creating the Swedish Lion Ⅰ text recognition model

Simplicity, citizen engagement and AI-driven transcription were key factors that intrigued Olof Karsvall and the team at the Swedish National Archives when they discovered Transkribus.

Olof Karsvall, Research Manager at the Swedish National Archives, has been involved in several research projects, most recently the integration of AI, which has revolutionised the research of archival materials. In this blog post, he shares insights into how the Swedish Lion Ⅰ model is supporting this transformative journey.

*Archival Material from the Swedish Lion Ⅰ Project.*

A collaboration of archives, researchers and universities

The Swedish Lion Ⅰ text recognition model is a collaborative effort involving institutions such as the National Archives of Sweden and Finland, Stockholm City Archives, Jämtlands läns fornskriftsällskap and researchers from Stockholm and Uppsala Universities. “As we collectively focus on generating training data for [automatic text recognition] in Swedish, we recognised the advantages of collaboration. Consequently, we merged our training data to create a joint model”, says Olof Karsvall. The fully trained Swedish Lion Ⅰ model can automatically transcribe other documents with similar handwriting, making it a valuable tool for digitising and analysing historical manuscripts and archival materials.

At the heart of this collaboration is Transkribus, a platform that allows users to create and train models for specific handwriting styles and historical periods. A key moment came in 2019 when the Stockholm National Archives joined READ-COOP SCE, the cooperative behind Transkribus. Olof Karsvall emphasises, “Primarily, it was the ease of use and the opportunity to engage citizens and volunteers in utilising AI for machine transcription that fascinated us”.

With external funding from the Swedish Innovation Agency (Vinnova), and now most recently from the Swedish National Heritage Board, this fascination evolved into a transformative journey, resulting in innovative projects that seamlessly combine citizen science and text recognition, all made possible by Transkribus. In this way, the Swedish Lion Ⅰ model, together with Transkribus, opens up new possibilities for accessing and researching historical documents.

*Archival Material from the Swedish Lion Ⅰ Project and transcription in Transkribus.*

Expanding research possibilities

When looking more closely at the history of models, it is always interesting to find out what the aim and motivation behind their creation was. Karsvall explains that: “By incorporating texts of diverse types from various historical periods, the goal is for the model to generalise effectively and apply to archival material beyond its original training scope.” To create this model, it was necessary to include a variety of texts from different historical periods. This diversity of training data helps to make the model more effective and applicable to a wide range of archival materials, ensuring better accuracy and performance when transcribing handwritten documents from different periods and styles.

The Swedish Lion Ⅰ is envisioned as a basic model for Swedish historical texts, which will simplify access to handwritten materials and support data-driven research.

*The Swedish Lion Ⅰ model for Swedish handwriting.*

Training a versatile model

The Swedish Lion Ⅰ model, carefully trained using a wide range of historical documents, particularly court records and minutes from the 1600s, 1700s and 1800s, truly demonstrates the capabilities of Transkribus. Accordings to Karsvall: "Getting started with Transkribus was easy". The software's potential can be seen in the collaborative process of transcribing 3.3 million lines of text from 268 archival volumes. The final model was a result of various projects that have created Ground Truth data utilising specialised models and applying manual corrections The model's remarkable Character Error Rate (CER) of just 4% confirms the model's great performance. This is particularly evident in the processing of running text and marginal notes.

Olof Karsvall acknowledges a challenge in managing diverse documents: “As we manage a wide array of documents, a significant challenge has been the segmentation of regions and lines.” Fortunately, the introduction of new trainable Field Models and Table Models brought greater accuracy, easier segmentation, and better recognition of layout structures. After three years of careful transcription, manual review, and correction, the Swedish Lion Ⅰ model is now ready and available as a public model!

-> Swedish Lion Ⅰ Model

Next steps for the Swedish Lion Ⅰ Model

The hope is that the Swedish Lion I text recognition model will reach new users via Transkribus and stimulate the usage of historical archive material in Swedish. Its collaborative development, involving several institutions, researchers, and volunteers, is already a great inspiration. Karsvall highlights the intention to extend this collaboration, creating a larger model covering older periods and diverse materials, and thereby promoting citizen science. Colleagues and the archival community have already shown growing interest, leading to increased requests to collaborate. The team plans to apply the model to several large collections to meet the expectations of increased accessibility to archives, following the publication of the Swedish Lion Ⅰ model.

Thank you Olof Karsvall for the interview and for sharing the journey of the Swedish Lion Ⅰ model!

Olof Karsvall's Transkribus Tips:

“Seek advice from others who have undertaken similar projects previously”

“Share your data; everyone benefits if data can be reused”