Your digitisation project, managed from start to finish

Whether you need proven text recognition at scale or a completely new approach for material no standard method can handle, our team of domain experts, AI specialists, and archival scientists runs the entire project, from understanding your corpus to delivering structured, searchable results integrated with your systems.

Tell us about your project

Your documents: scans, images, manuscripts

Analysis & proof of concept: model selection, CER evaluation

Processing & training: recognition, custom models, QA

Structured delivery: XML, CSV, Sites, system integration

20M+ pages in a single project

2,000+ institutions trust Transkribus

95%+ accuracy on trained models

Batch processing with proven models

For well-scanned material with standard scripts: we select the right models from 100+ publicly available text and layout recognition models, configure the workflow, run batch processing, perform quality checks, and deliver.

Printed books and government records
Standard handwriting (Latin, Kurrent, Fraktur)
Large volumes with consistent quality

Custom model training for your material

When standard models do not reach the accuracy you need — unusual handwriting, degraded scans, rare scripts — we train AI models specifically on your material. Multiple training rounds until we hit the target accuracy.

Rare or personal handwriting styles
Degraded scans or microfilm digitisation
Non-Latin writing systems

See the Bautzen project — custom Kurrent model for 200 years of council minutes →

Schema definition, data extraction & system integration

Beyond plain text: we define extraction schemas for your document types (tables, fields, structured records) and deliver data in the format your systems need. We can also publish the results as a searchable Transkribus Site with custom branding.

Table and field extraction from registers
CSV, Excel, or database-ready output
Integration with ArchivesSpace, AtoM, scopeArchiv
Transkribus Sites with full-text search

See the St. Gallen project — 200,000 pages published as a searchable Site →

New frameworks when standard approaches fail

Some collections cannot be solved with existing tools. We develop novel AI approaches: end-to-end Smart Extract models that understand document structure contextually, Named Entity Recognition for automatic tagging, and custom frameworks for problems no off-the-shelf method can handle.

Smart Extract: contextual document understanding
Named Entity Recognition and geo-enrichment
Novel frameworks for non-standard documents

See the MfN Berlin project — first real-world Smart Extract deployment →

Understanding your material

We analyse your collection: document types, scripts, layouts, condition, volume. What data do you need extracted? What systems does it need to integrate with? What does success look like for your institution?

Proof of concept

You send us a representative sample. We run the full pipeline — including custom model training if needed — and return results with Character Error Rate measurements and a realistic cost estimate.
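For readers unfamiliar with the metric, Character Error Rate is commonly defined as the edit distance between the recognised text and a manually verified reference transcription, divided by the length of the reference. The sketch below illustrates that general definition only; it is not the evaluation code used in a Transkribus project, the function names are hypothetical, and the sample strings are invented for illustration.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of character insertions, deletions and
    substitutions needed to turn `hypothesis` into `reference`."""
    # prev[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / number of characters in the reference."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Invented example: the recognised text is missing one final "l".
reference = "Ratsprotokoll"   # 13 characters, manually verified
hypothesis = "Ratsprotokol"   # hypothetical model output
cer = character_error_rate(reference, hypothesis)
print(f"CER: {cer:.2%}")
```

Under the common convention that character accuracy equals 1 minus CER (an assumption here, since the page does not define its accuracy figure), a CER of 5% corresponds to 95% accuracy.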

Project planning & kickoff

We define scope, timeline, milestones, deliverables, and pricing. A dedicated project manager with a background in digital humanities or archival science becomes your single point of contact.

Processing, training & quality assurance

Your PM coordinates the technical pipeline: recognition, model refinement, data extraction, quality checks. Bi-weekly sync meetings keep you informed.

Milestone delivery & review

Results are delivered progressively at agreed milestones, each with quality metrics and sample review. You review and approve before we continue.

Final handover & integration

You receive the complete dataset in your required format (PAGE XML, ALTO, TEI, CSV, or searchable PDF) or published as a Transkribus Site. All custom-trained models are yours to keep.

Museum für Naturkunde Berlin

Germany

The challenge

250,000 specimen labels with handwritten metadata spanning two centuries. Standard OCR failed entirely — faded ink, damaged paper, mixed scripts, and non-standard layouts.

What we did

Developed a Smart Extract model — a single-pass AI that understands label structure contextually. Added Named Entity Recognition with GeoNames enrichment to automatically tag species and resolve place names.

The outcome

First real-world Smart Extract deployment. Complete machine-readable dataset with NER-enriched metadata — species tagged, place names resolved via GeoNames. A replicable model for natural history collections worldwide.

Read the full story →

Zeitpunkt.NRW

North Rhine-Westphalia, Germany

The challenge

The complete historical newspaper holdings of North Rhine-Westphalia — 20 million pages spanning centuries. Complex multi-column layouts, Fraktur print, advertisements, and mixed content types.

What we did

Full-text recognition at unprecedented scale. AI layout segmentation for complex newspaper pages, batch processing with quality assurance, and publication through a state-level digital newspaper portal.

The outcome

Citizens and researchers can now full-text search across centuries of regional history through the publicly accessible Zeitpunkt.NRW portal — one of the largest HTR projects ever completed.

Visit zeitpunkt.nrw →

Noord-Hollands Archief

Haarlem, Netherlands

The challenge

Centuries of notarial archives — testaments, property transfers, inventories, witness statements — spanning 1570 to 1925. Nearly 2 million scans of handwritten documents across Haarlem, Kennemerland, and Amstel- en Meerlanden, inaccessible to anyone who cannot read historical scripts.

What we did

Applied HTR to the complete notarial archives. Published as a searchable Transkribus Site with fuzzy search for person names and locations. Part of the pioneering HTR project "De ijsberg zichtbaar maken" (2019–2021).

The outcome

Notarial acts spanning 1570–1925 now fully text-searchable online. Researchers, genealogists, and citizens can search for names, locations, and subjects across 350 years of North Holland's notarial history — with 93–98.6% character accuracy.

Explore the collection →

Council meeting minutes from the St. Gallen archive

State Archives of St. Gallen

Switzerland

The challenge

417 books, 200,000 pages of council meeting minutes — handwritten and typewritten, many digitised from older microfilm scans. Only accessible through in-person visits.

What we did

Custom model training on the council minutes. Combined automated transcription with manual correction. Published as a searchable Transkribus Site with side-by-side document and transcription views.

The outcome

Council minutes from 1803 onward publicly accessible online — searchable around the clock. No expertise in historical handwriting required to access two centuries of government records.

Read the full story →

Historical Kurrent handwriting from the Bautzen archive

Archivverbund Bautzen

Germany

The challenge

257 volumes of city council minutes spanning 1623–1832 — 55,000 pages of Kurrent script. Digitised but inaccessible because the handwriting was too difficult for untrained researchers.

What we did

Applied the Early Kurrent model, then trained a custom model to improve accuracy. Published as a Transkribus Site with permalinks integrating into Archivportal-D and Findbuch.

The outcome

200 years of Bautzen city history fully searchable. Seamless discovery through existing archival portals — Archivportal-D and Findbuch integration via permalinks.

Read the full story →

Trusted by leading institutions worldwide