Build text corpora from historical documents.

Historical linguistics and corpus research depend on machine-readable text, and that text has to come from somewhere. Transkribus converts handwritten and printed documents into structured text with XML markup that preserves layout, marginalia, deletions, and other features linguists need. From there, you can export the text to your corpus tools.


Historical text transcription for corpus building

100+ languages and scripts

300+ community-trained models

XML structured text export

Transcription editor with structural markup

Structured text with layout markup

The XML export preserves the document structure that matters for linguistic analysis: headings, columns, marginalia, footnotes, deletions, and insertions. It is not just a flat text dump.

Searchable across the entire collection

Once transcribed, your documents are full-text searchable. Find word forms, spelling variants, and patterns across thousands of pages — a concordancer for your manuscript corpus.
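To make the concordancer idea concrete, here is a minimal keyword-in-context sketch that runs over text you have already exported. It is not how Transkribus search works internally. The `kwic` helper, the sample pages, and the variant pattern are all hypothetical illustrations.

```python
import re

def kwic(pages, pattern, width=30):
    """Yield (page_id, left, match, right) for every regex match in the pages."""
    rx = re.compile(pattern, re.IGNORECASE)
    for page_id, text in pages.items():
        for m in rx.finditer(text):
            left = text[max(0, m.start() - width):m.start()]
            right = text[m.end():m.end() + width]
            yield page_id, left, m.group(0), right

# Hypothetical sample transcriptions, not real corpus data.
pages = {
    "p001": "The kynge rode to London and the king was glad.",
    "p002": "No mention here.",
    "p003": "Long live the Kyng!",
}

# One pattern covering several spelling variants: king, kyng, kinge, kynge.
hits = list(kwic(pages, r"\bk[iy]nge?\b", width=10))
for page_id, left, word, right in hits:
    print(f"{page_id}  {left:>10}[{word}]{right}")
```

A regular expression is a simple way to group spelling variants. For larger corpora, a dedicated concordancer or a normalisation layer is usually a better fit.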

Export for downstream analysis

Export as plain text, TEI-XML, PAGE XML, or ALTO XML. Feed into your NLP pipeline, concordancer, or corpus annotation tool. The structured markup carries over.
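As a rough sketch of how the structured markup can carry over into a pipeline, the example below reads a PAGE XML string and groups line text by region type, so marginalia can be kept apart from the main text. It makes several assumptions: the namespace is that of the 2013-07-15 PAGE schema, the region type sits in the `type` attribute of `TextRegion`, and the sample document is a simplified illustration rather than a real or schema-valid export. Your own export may record structure tags in a different attribute, so inspect a file before relying on this.

```python
import xml.etree.ElementTree as ET

# Assumed namespace (2013-07-15 PAGE schema); check the root element of your export.
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}

def region_texts(page_xml):
    """Return a list of (region_type, [line_text, ...]) in document order."""
    root = ET.fromstring(page_xml)
    out = []
    for region in root.iterfind(".//pc:TextRegion", NS):
        rtype = region.get("type", "unknown")  # assumption: type stored here
        lines = []
        for line in region.iterfind("pc:TextLine", NS):
            uni = line.find("pc:TextEquiv/pc:Unicode", NS)
            if uni is not None and uni.text:
                lines.append(uni.text)
        out.append((rtype, lines))
    return out

# Simplified, hypothetical sample (coordinates omitted for brevity).
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="p001.jpg" imageWidth="1000" imageHeight="1500">
    <TextRegion id="r1" type="heading">
      <TextLine id="r1l1"><TextEquiv><Unicode>Anno 1689</Unicode></TextEquiv></TextLine>
    </TextRegion>
    <TextRegion id="r2" type="marginalia">
      <TextLine id="r2l1"><TextEquiv><Unicode>nota bene</Unicode></TextEquiv></TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

regions = region_texts(SAMPLE)
# Keep everything except marginalia as the main text for downstream analysis.
main_text = [line for rtype, lines in regions if rtype != "marginalia" for line in lines]
print(regions)
print(main_text)
```

Keeping the region type next to each line lets you decide later whether marginalia, headings, or footnotes belong in a given analysis.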

Case study

KorBa: Building a digital corpus of 17th–18th century Polish texts

The KorBa project at the Polish Academy of Sciences uses Transkribus to build a large-scale linguistic corpus of historical Polish texts from the 17th and 18th centuries. The project trains custom models on period-specific handwriting and print, then processes entire manuscript collections into machine-readable text that feeds into the corpus analysis platform.

Custom HTR models trained on historical Polish handwriting and print

Structured text export preserving document layout and annotations

Corpus used for diachronic linguistic analysis of the Polish language


Image: Historical Polish manuscript from the KorBa corpus project

Multilingual

100+ languages and scripts — with models trained by the community

Transkribus supports over 100 languages and scripts, with 300+ public models trained by researchers around the world. Whether you're building a corpus of medieval Latin sermons, early modern French correspondence, or 19th-century Devanagari print, there's likely a model you can start with. If not, train your own with 50 or more pages of ground truth.

Latin, German, French, English, Dutch, Italian, Spanish, Portuguese, and 90+ more

Historical scripts: Kurrent, Sütterlin, Secretary Hand, Gothic textura, Caroline minuscule

Non-Latin: Hebrew, Arabic, Greek, Cyrillic, Devanagari, and more

Custom model training for any script or language with 50+ pages of ground truth
