
Build text corpora from historical documents.

Historical linguistics and corpus research depend on machine-readable text — and that text has to come from somewhere. Transkribus converts handwritten and printed documents into structured text with XML markup that preserves layout, marginalia, deletions, and other features linguists need. From there, export to your corpus tools.

Historical text transcription for corpus building
100+ languages and scripts
300+ community-trained models
XML structured text export

What you get for corpus work

Text output that preserves the features linguists and corpus researchers need.

Transcription editor with structural markup

Structured text with layout markup

Headings, columns, marginalia, footnotes, deletions, insertions — the XML export preserves document structure that matters for linguistic analysis. Not just a flat text dump.

Full-text search across corpus

Searchable across the entire collection

Once transcribed, your documents are full-text searchable. Find word forms, spelling variants, and patterns across thousands of pages — a concordancer for your manuscript corpus.
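For readers who want the same kind of lookup offline on exported plain text, here is a minimal keyword-in-context (KWIC) sketch. It is not how the Transkribus search works internally; the `kwic` function, the sample sentence, and the spelling-variant pattern are all hypothetical illustrations.

```python
import re

def kwic(text, pattern, width=30):
    """Return (left, match, right) context tuples for each regex match in text."""
    # Collapse whitespace so line breaks in the transcription do not split contexts.
    flat = " ".join(text.split())
    rows = []
    for m in re.finditer(pattern, flat, flags=re.IGNORECASE):
        left = flat[max(0, m.start() - width):m.start()]
        right = flat[m.end():m.end() + width]
        rows.append((left, m.group(0), right))
    return rows

# Hypothetical sample; a real corpus would be read from exported text files.
sample = "The kynge rode north. The king returned in May, and the Kyng was glad."

# One pattern covering three spelling variants: kynge, king, Kyng.
hits = kwic(sample, r"\bk[iy]nge?\b", width=12)
for left, word, right in hits:
    print(f"{left:>12} | {word} | {right}")
```

A regular expression keeps the variant list in one place; for larger corpora a proper index or a dedicated concordancer would likely scale better.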

Export to NLP and corpus tools

Export for downstream analysis

Export as plain text, TEI-XML, PAGE XML, or ALTO XML. Feed into your NLP pipeline, concordancer, or corpus annotation tool. The structured markup carries over.
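As a rough illustration of what "the structured markup carries over" can mean downstream, the sketch below reads a PAGE XML page with Python's standard library and separates body text from marginalia. The sample document is hypothetical and heavily trimmed, and it assumes the region label sits in the `type` attribute of `TextRegion` under the 2013-07-15 PAGE namespace. A real export may use a different schema version or record the label elsewhere (for example in a `custom` attribute), so check an actual file first.

```python
import xml.etree.ElementTree as ET

# Assumed namespace: the 2013-07-15 PAGE schema. An export may declare another version.
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}

# Hypothetical, trimmed page; real files also carry coordinates and metadata.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="p1.jpg" imageWidth="100" imageHeight="100">
    <TextRegion id="r1" type="paragraph">
      <TextLine id="l1"><TextEquiv><Unicode>first line of body text</Unicode></TextEquiv></TextLine>
      <TextLine id="l2"><TextEquiv><Unicode>second line of body text</Unicode></TextEquiv></TextLine>
    </TextRegion>
    <TextRegion id="r2" type="marginalia">
      <TextLine id="l3"><TextEquiv><Unicode>note in the margin</Unicode></TextEquiv></TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

def lines_by_region(xml_text):
    """Return (region_type, line_text) pairs in document order."""
    root = ET.fromstring(xml_text)
    out = []
    for region in root.iterfind(".//pc:TextRegion", NS):
        rtype = region.get("type", "unknown")
        for line in region.iterfind("pc:TextLine", NS):
            text = line.findtext("pc:TextEquiv/pc:Unicode", default="", namespaces=NS)
            out.append((rtype, text))
    return out

rows = lines_by_region(SAMPLE)
# Keep only running text for the corpus; marginalia could go to a separate layer.
body = "\n".join(text for rtype, text in rows if rtype != "marginalia")
print(body)
```

Keeping the region type next to each line lets you decide later whether marginalia, footnotes, or headings belong in the token counts.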

Case study

KorBa: Building a digital corpus of 17th–18th century Polish texts

The KorBa project at the Polish Academy of Sciences uses Transkribus to build a large-scale linguistic corpus of historical Polish texts from the 17th and 18th centuries. The project trains custom models on period-specific handwriting and print, then processes entire manuscript collections into machine-readable text that feeds into the corpus analysis platform.
Custom HTR models trained on historical Polish handwriting and print
Structured text export preserving document layout and annotations
Corpus used for diachronic linguistic analysis of the Polish language
Historical Polish manuscript — KorBa corpus project

Multilingual

100+ languages and scripts — with models trained by the community

Transkribus supports over 100 languages and scripts, with 300+ public models trained by researchers around the world. Whether you're building a corpus of medieval Latin sermons, early modern French correspondence, or 19th-century Devanagari print, there's likely a model you can start with. If not, train your own from 50 or more pages of ground truth.
Latin, German, French, English, Dutch, Italian, Spanish, Portuguese, and 90+ more
Historical scripts: Kurrent, Sütterlin, Secretary Hand, Gothic textura, Caroline minuscule
Non-Latin: Hebrew, Arabic, Greek, Cyrillic, Devanagari, and more
Custom model training for any script or language with 50+ pages of ground truth
Multilingual handwriting recognition models

Start building your corpus

Start for free with 50 credits per month. For large-scale corpus projects, talk to our team about institutional plans and research partnerships.

100+ languages
300+ public models
EU-hosted, GDPR-compliant