Breaking the language barrier: 3 multilingual Transkribus research projects

Fiona Park

July 1, 2026·4 min read

Historical documents and archival collections are rarely confined to a single language or writing system. Across the globe, researchers, historians, and archivists frequently encounter manuscripts that seamlessly shift between different languages, dialects, or even scripts within the same page. While these multilingual collections are incredibly rich resources for understanding history, they have traditionally presented a major hurdle for automatic text recognition technology. Most standard transcription tools are designed to process one language at a time, often stumbling when faced with unexpected vocabulary, shifting grammatical structures, or diverse orthographies.

Yet many projects around the world have overcome these linguistic barriers with Transkribus. Because users can train text recognition models on diverse datasets, the platform can learn to read and transcribe multiple languages at once, showing that linguistic diversity need not be a barrier to digital accessibility. Here are three examples of how Transkribus is being used to create powerful multilingual text recognition models.

Creating a model for Irish and English

The first project focuses on a unique piece of diasporic history: An Gaodhal, a bilingual newspaper established in New York in the late 19th century to promote the Irish language and culture. This publication presented a double challenge for digital transcription. Not only did it contain both Irish and English text, but the Irish passages were also printed in Cló Gaelach, a traditional Gaelic script that looks entirely different from the standard Latin alphabet used for English.

To tackle this, a research team from the University of Galway and New York University used Transkribus to develop a specialised bilingual model. They trained the software to recognise and transcribe Irish texts in Cló Gaelach, as well as English texts in the standard Latin script. This innovative model makes it far easier to access historical Irish texts, showcasing how Transkribus can serve as a vital bridge for under-resourced languages, traditional scripts, and heritage preservation.

Read the full story: Training a bilingual Irish-English model in Transkribus using An Gaodhal

Decoding a trilingual classical lexicon at Cornell University

At Cornell University, Professor of Classics Jeff Rusten faced a daunting paleographical puzzle: a comprehensive lexicon to the playwright Aristophanes compiled in 1910 by German classicist Ernst Wüst. Wüst meticulously catalogued the playwright's unique words, puns, and phrases, but the primary challenge for modern digitisation was the lexicon's heavily multilingual nature. The handwritten entries contained text in Ancient Greek, German, and Latin simultaneously, meaning a standard text recognition model would struggle to process the three different languages on the same page.

Therefore, the team decided to use Transkribus to train a specialised, trilingual model capable of accurately transcribing all three languages as they appeared on the page. The resulting bespoke model has enabled the creation of a searchable, digital edition of the lexicon, providing an excellent case study for other projects dealing with multilingual scholarly texts from the 19th and 20th centuries.

Read the full story: Training a multilingual model in Transkribus

Transcribing the multilingual and multi-authored Sybren Valkema archive

This third project demonstrates how a single text recognition model can be trained to handle both multiple languages and multiple handwriting styles within modern collections. Published in the Heritage journal, this academic study explored the optimal automated transcription strategy for the personal archive of the influential Dutch glass artist Sybren Valkema (1916–1996). The collection consisted of a large quantity of uncategorised archival documents written by various authors in several different languages, including Dutch, English, and German.

Rather than training a separate text recognition model for every individual language or handwriting style, the researchers used Transkribus to conduct a comparative study. They successfully demonstrated that a single, robust multilingual model could handle the diverse variations in handwriting and language. This approach drastically reduced the manual annotation effort required, making the vast collection searchable and proving that high-accuracy automation is possible even for highly irregular modern archives.

Read the full study: Experimenting with Training a Neural Network in Transkribus to Recognise Text in a Multilingual and Multi-Authored Manuscript Collection

Unlocking your own multilingual collections

As these three diverse projects demonstrate, a mixed-language archive is no barrier to digital accessibility. Whether you are dealing with a 19th-century bilingual newspaper, a trilingual classical lexicon, or a multi-authored modern artist archive, Transkribus provides the flexibility and AI technology needed to recognise multiple languages and scripts at scale.

If you are ready to bring your own multilingual collections into the digital age, the next step is to explore how automated text recognition can be tailored to your specific materials. For a detailed, step-by-step guide on how to train advanced, custom AI models for complex or mixed-language documents, watch our comprehensive instructional webinar on our YouTube channel.
