
Archival Backlog Reduction with AI-Powered Text Recognition

Millions of unprocessed pages, not enough staff. Transkribus batch-processes entire collections — turning hidden holdings into searchable, discoverable records at institutional scale.



Trusted by 500,000+ users worldwide — 200M+ pages processed

2,000+ Archives and libraries
200M+ Pages processed
300+ Public AI models
250+ Cooperative members

The problem

The Hidden Collections Crisis: Archive Digitization Backlogs Keep Growing

OCLC estimates that more than 30% of archival collections in the United States alone remain "hidden" — unprocessed, uncatalogued, and effectively invisible to researchers. The situation is comparable across Europe and beyond. These are not marginal materials. They include correspondence, legal records, administrative files, and manuscripts that researchers cannot discover because no finding aid, catalogue entry, or searchable text exists for them. Every year the backlog grows as new acquisitions arrive faster than understaffed teams can process them.
Staff shortages are structural, not temporary — archives cannot hire their way out of the backlog
Manual transcription of a single archival box can take weeks of skilled labour
Unprocessed collections generate no citations, no research, and no public engagement
Grant-funded digitisation projects often cover imaging but not text recognition or metadata creation
Mixed collections — typescript, handwriting, printed forms — require different approaches that slow manual workflows further
Handwritten archival protocol from 1805 — typical of unprocessed backlog materials

The solution

Reduce Archival Backlog with AI: From Unprocessed Boxes to Searchable Records

Transkribus enables archives to process collections at a scale that manual workflows cannot achieve. Upload scanned images — entire boxes, series, or fonds — and run AI text recognition across thousands of pages in a single batch. The platform's handwritten text recognition (HTR) handles the scripts and document types most common in archival holdings: administrative handwriting, official correspondence, court records, municipal registers, and mixed-format files. The result is machine-readable, searchable text that can be exported directly into archival information systems.
Batch processing: queue thousands of pages and process them unattended — no page-by-page intervention
300+ public AI models trained on historical scripts from the 15th century onward
Export to PAGE XML, ALTO XML, and TEI-XML for ingest into ArchivesSpace, AtoM, and other systems
Metagrapho API enables fully automated pipelines for mass digitisation workflows
Publish processed collections directly as searchable digital editions via Transkribus Sites
Historical register document — the type of structured archival record processed at scale
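The export formats mentioned above are all XML-based, so the recognised text can be extracted with standard tooling. As a minimal sketch, here is how the transcription lines of a PAGE XML export might be pulled into plain text; the embedded sample is a synthetic fragment for illustration (real exports also carry coordinates, reading order, and region metadata), while the element names follow the published PAGE 2013-07-15 content schema.

```python
# Minimal sketch: extract transcribed text from a PAGE XML export.
import xml.etree.ElementTree as ET

NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}

# Synthetic fragment standing in for a real export file.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="box_01_page_001.jpg">
    <TextRegion id="r1">
      <TextLine id="l1"><TextEquiv><Unicode>Protokoll, den 3. Mai 1805</Unicode></TextEquiv></TextLine>
      <TextLine id="l2"><TextEquiv><Unicode>Anwesend: der Magistrat</Unicode></TextEquiv></TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

def page_xml_to_text(xml_string: str) -> str:
    """Collect the Unicode content of every text line, in document order."""
    root = ET.fromstring(xml_string)
    lines = root.findall(".//pc:TextLine/pc:TextEquiv/pc:Unicode", NS)
    return "\n".join(line.text or "" for line in lines)

print(page_xml_to_text(SAMPLE))
```

The same traversal works for whole directories of exported pages, which is typically how full-text search indexes for a collection are built.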

Comparison

AI-Assisted Processing vs. Manual Transcription for Archives

Archives face a fundamental throughput problem: millions of pages waiting to be catalogued, made searchable, and opened to researchers. Here is how AI-assisted processing compares with traditional manual workflows.

Feature | Transkribus AI Processing | Manual Transcription
Throughput | Thousands of pages per day with batch processing — scales with collection size | A skilled transcriber processes 5–15 pages per day depending on difficulty
Cost per page | Fraction of a cent per page with credit-based pricing | Labour-intensive — costs accumulate linearly with every page
Consistency | The same model produces consistent output across thousands of pages | Quality varies with transcriber skill, fatigue, and interpretation
Searchability | Every processed page becomes full-text searchable immediately | Only transcribed pages are searchable — the backlog remains dark
Handling historical scripts | 300+ public models covering scripts from the 9th century to the present | Requires specialised palaeography training — few staff have the necessary skills
Time to access | Collections become accessible within days or weeks of digitisation | Backlogs of years or decades are common in large institutions
Quality review | Confidence scores flag uncertain lines for targeted human review | Requires full proofreading of every transcription

Comparison reflects typical institutional workflows. AI processing works best as a complement to human expertise — automated first pass with targeted manual review.
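The throughput gap is easiest to see with a back-of-envelope estimate. The figures below are illustrative assumptions drawn from the comparison table (a hypothetical one-million-page backlog, the mid-range of 5–15 manual pages per day, and a deliberately conservative automated rate), not quoted benchmarks.

```python
# Back-of-envelope estimate of backlog clearance time (illustrative figures).
BACKLOG_PAGES = 1_000_000          # hypothetical hidden collection
MANUAL_PAGES_PER_DAY = 10          # mid-range of 5-15 pages per transcriber
BATCH_PAGES_PER_DAY = 5_000        # conservative automated throughput

manual_years = BACKLOG_PAGES / (MANUAL_PAGES_PER_DAY * 220)  # ~220 working days/year
batch_days = BACKLOG_PAGES / BATCH_PAGES_PER_DAY

print(f"Manual, one transcriber: ~{manual_years:.0f} years")
print(f"Automated batch run:     ~{batch_days:.0f} days")
```

Even allowing generous time for human review of low-confidence output, the difference is measured in orders of magnitude, not percentages.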

How to process an archival collection in 4 steps

Upload scanned collections

Upload entire series or fonds as multi-page PDFs, TIFFs, or image batches. Transkribus handles layout detection — columns, tables, marginalia — automatically.

Select an AI model

Choose from 300+ public models filtered by language, century, and script type. For mixed collections, run multiple models on different document groups within the same project.

Run batch recognition

Queue thousands of pages for processing. Transkribus runs text recognition in the background — no manual intervention required. Monitor progress from the dashboard.

Export and integrate

Export results as PAGE XML, ALTO XML, TEI-XML, plain text, or searchable PDF. Ingest directly into ArchivesSpace, AtoM, or publish via Transkribus Sites.

At scale

Automated Archival Processing with the Metagrapho API

For institutions running large-scale or recurring digitisation programmes, the Metagrapho REST API enables fully automated processing pipelines. Integrate text recognition directly into your existing imaging and cataloguing workflows — no manual uploads, no browser-based interaction. The API supports model selection, batch job management, and structured output retrieval, making it suitable for production-grade mass digitisation projects.
REST API with full documentation for integration into institutional workflows
Programmatic model selection — choose different models for different collection types automatically
Structured JSON output with text, coordinates, and confidence scores for each text region
Batch job management: submit, monitor, and retrieve results for thousands of pages
Combine with entity recognition to extract names, dates, and places for catalogue enrichment
batch_process.py
import time

import requests

API = "https://transkribus.eu/processing/v1"
TOKEN = "your-api-token"
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

# 1. Upload collection
upload = requests.post(f"{API}/uploads",
    headers=HEADERS,
    json={"collectionId": 12345},
)
upload.raise_for_status()

# 2. Start recognition on all pages
job = requests.post(f"{API}/processes",
    headers=HEADERS,
    json={
        "docId": upload.json()["docId"],
        "htrId": 53042,   # model ID
        "pages": "all",
    },
)
job.raise_for_status()

# 3. Poll until the batch job finishes
process_id = job.json()["processId"]
while True:
    status = requests.get(f"{API}/processes/{process_id}", headers=HEADERS).json()
    print(f"Status: {status['state']}")
    if status["state"] in ("FINISHED", "FAILED"):
        break
    time.sleep(30)
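The structured output mentioned above pairs each recognised line with a confidence score, which is what makes targeted review workflows possible. The sketch below shows one way to flag uncertain lines; the result dictionary is an assumed shape for illustration only, so the field names (`pages`, `lines`, `confidence`) should be checked against the Metagrapho API documentation before use.

```python
# Sketch of a quality-review filter over recognition output.
# The result shape below is an assumption for illustration.
REVIEW_THRESHOLD = 0.85

result = {
    "pages": [
        {"pageNr": 1, "lines": [
            {"text": "Sitzung des Stadtrates", "confidence": 0.97},
            {"text": "[unleserlich] 1805",     "confidence": 0.62},
        ]},
    ],
}

def lines_needing_review(result: dict, threshold: float = REVIEW_THRESHOLD):
    """Return (page number, text, confidence) for every uncertain line."""
    flagged = []
    for page in result["pages"]:
        for line in page["lines"]:
            if line["confidence"] < threshold:
                flagged.append((page["pageNr"], line["text"], line["confidence"]))
    return flagged

for page_nr, text, conf in lines_needing_review(result):
    print(f"page {page_nr}: {text!r} (confidence {conf:.2f})")
```

A report like this lets staff open only the flagged pages in the editor instead of proofreading entire boxes.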

Frequently Asked Questions

Processing speed depends on document complexity and page count, but as a benchmark: a single page typically takes 15–30 seconds. Batch processing runs in parallel, so a collection of 10,000 pages can be processed in hours rather than the weeks or months required for manual transcription. The Metagrapho API enables continuous, unattended processing for even larger volumes.
Accuracy varies by script type and document condition. On well-preserved 19th and 20th-century administrative handwriting, character error rates (CER) below 5% are typical with appropriate public models. Older or more challenging scripts may require custom model training to reach comparable accuracy. Every text line includes a confidence score, enabling quality-focused review workflows — staff can concentrate on low-confidence sections rather than re-reading entire documents.
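Character error rate is straightforward to spot-check on a proofread sample: it is the edit distance between the automatic transcription and a human-corrected reference, divided by the length of the reference. A minimal self-contained implementation, for teams that want to measure accuracy on their own material, might look like this:

```python
# Character error rate (CER) via Levenshtein edit distance.
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def cer(hypothesis: str, reference: str) -> float:
    """CER as a fraction of the reference length; 0.0 is a perfect transcription."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(hypothesis, reference) / len(reference)

# One dropped character in a 20-character reference gives a CER of 5%.
print(f"{cer('Protokol den 3. Mai', 'Protokoll den 3. Mai'):.3f}")
```

Measuring CER on a few representative pages before a full batch run is a quick way to decide whether a public model is good enough or custom training is warranted.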
Transkribus exports in PAGE XML, ALTO XML, TEI-XML, and other standard formats that ArchivesSpace, AtoM, and similar archival information systems can ingest. The API enables automated export pipelines. While there is no direct plug-in connector, the structured XML output is designed for interoperability with archival metadata standards (EAD, Dublin Core).
One trained staff member can manage a batch processing project covering thousands of pages. Transkribus handles layout detection, text recognition, and export automatically. Staff time is best spent on quality review of low-confidence segments and on curatorial decisions — selecting which collections to prioritise, choosing appropriate models, and validating results.
Transkribus offers institutional plans designed for high-volume processing. Pricing depends on page volume and whether API access is required. Contact our team at transkribus.org/contact for a tailored quote. Every account includes 50 free credits per month to evaluate the platform before committing.
All processing runs on Transkribus's own servers in Austria (EU). No data is sent to third-party cloud services. Documents and transcriptions remain under the institution's full ownership and can be deleted at any time. Transkribus is operated by READ-COOP SCE, a European cooperative — not a venture-backed startup. Data processing agreements are available for institutions that require them.
Institutions typically achieve the best return by starting with collections that are (1) already digitised (scanned) but lack searchable text, (2) in high demand from researchers, or (3) written in scripts for which strong public models already exist. This approach maximises immediate impact with minimal setup. Transkribus's model catalogue can be filtered by language, script type, and century to identify which collections will work well out of the box.
Yes. Archival collections frequently contain mixed materials — typescript forms with handwritten annotations, printed headers with cursive entries, or pages that alternate between print and script. Transkribus handles layout detection for these mixed formats and supports running different models on different document types within the same project.

Institutional-grade infrastructure for archival collections.

Transkribus is built and hosted in Europe by a cooperative of 250+ archives, libraries, and universities. Your collections stay under your control.

Your data stays yours

Full ownership. Delete anytime.

Hosted in Austria, EU

All processing on our own servers. GDPR-compliant. No third-party cloud dependencies.

Cooperative, not a startup

Thousands of archives, libraries, and universities as co-owners. Built for decades, not a VC exit.

Related resources

More for archives and institutions

Explore how Transkribus fits into your institutional workflows: Transkribus for archives · What is HTR? · Create searchable PDFs · Medieval manuscripts
Archive collections being digitised

Ready to address your archival backlog?

Speak with our team about institutional plans for large-scale collection processing, or create a free account to evaluate Transkribus on your own materials.

Used by 2,000+ archives and libraries worldwide
