Build text corpora from historical documents.
Historical linguistics and corpus research depend on machine-readable text — and that text has to come from somewhere. Transkribus converts handwritten and printed documents into structured text with XML markup that preserves layout, marginalia, deletions, and other features linguists need. From there, export to your corpus tools.

What you get for corpus work
Text output that preserves the features linguists and corpus researchers need.

Structured text with layout markup
Headings, columns, marginalia, footnotes, deletions, insertions — the XML export preserves document structure that matters for linguistic analysis. Not just a flat text dump.

Searchable across the entire collection
Once transcribed, your documents are full-text searchable. Find word forms, spelling variants, and patterns across thousands of pages — a concordancer for your manuscript corpus.

Export for downstream analysis
Export as plain text, TEI-XML, PAGE XML, or ALTO XML. Feed into your NLP pipeline, concordancer, or corpus annotation tool. In the XML formats, the structured markup carries over.
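As a rough illustration of the downstream step, here is a minimal Python sketch that reads one PAGE XML file and pulls out the line text together with the type of the region it sits in. The sample document, the function names, and the choice to skip marginalia are assumptions made for this example, not part of Transkribus itself. Whether your export fills the region `type` attribute depends on how the pages were tagged, so treat this as a starting point.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written stand-in for one exported PAGE XML file.
SAMPLE_PAGE_XML = """
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="page_0001.jpg" imageWidth="1000" imageHeight="1500">
    <TextRegion id="r1" type="heading">
      <TextLine id="r1l1">
        <TextEquiv><Unicode>Of the Seasons</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
    <TextRegion id="r2" type="paragraph">
      <TextLine id="r2l1">
        <TextEquiv><Unicode>The winter was longe and colde,</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
    <TextRegion id="r3" type="marginalia">
      <TextLine id="r3l1">
        <TextEquiv><Unicode>nota bene</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
""".strip()


def local_name(tag):
    """Drop the '{namespace}' prefix so any PAGE schema version matches."""
    return tag.rsplit("}", 1)[-1]


def extract_lines(page_xml):
    """Return (region_type, line_id, text) for every TextLine in the page."""
    root = ET.fromstring(page_xml)
    rows = []
    for region in root.iter():
        if local_name(region.tag) != "TextRegion":
            continue
        region_type = region.get("type", "")
        for line in region:
            if local_name(line.tag) != "TextLine":
                continue
            text = ""
            # The line-level transcription lives in TextEquiv/Unicode.
            for equiv in line:
                if local_name(equiv.tag) != "TextEquiv":
                    continue
                for uni in equiv:
                    if local_name(uni.tag) == "Unicode":
                        text = uni.text or ""
            rows.append((region_type, line.get("id", ""), text))
    return rows


def body_text(rows, skip_types=("marginalia",)):
    """Join line text, leaving out region types you do not want in the corpus."""
    return "\n".join(text for rtype, _id, text in rows if rtype not in skip_types)


rows = extract_lines(SAMPLE_PAGE_XML)
print(body_text(rows))
```

Keeping the region type next to each line lets you decide later whether headings, footnotes, or marginalia belong in the running text or in a separate layer of your corpus.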

Case study
KorBa: Building a digital corpus of 17th–18th century Polish texts

Multilingual
100+ languages and scripts — with models trained by the community

Start building your corpus
Start for free with 50 credits per month. For large-scale corpus projects, talk to our team about institutional plans and research partnerships.