
Technical Reference

Architecture, processing pipeline, system requirements, and performance benchmarks for Transkribus On-Prem.

Processing Pipeline

Image Input: TIFF, JPEG, PNG, PDF
Preprocessing: Binarization, deskew
Layout Analysis: Regions & baselines
Line Extraction: Text segmentation
Recognition: HTR / OCR (GPU)
Output: PageXML, PDF, ALTO

Stages execute as a streaming pipeline: while one page is being recognized, the next is already having its layout detected. Pipeline steps are customisable. Steps that are irrelevant to your current workflow can be omitted, and intermediate output can be submitted to humans-in-the-loop for review. Future AI models may provide end-to-end architectures that combine some or all of the steps (see Extensible Architecture). Build the tailored and efficient pipeline that you need.
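The streaming behaviour described above can be illustrated with a minimal sketch. It is not the product's implementation: the stage functions, the `Page` record, and `run_pipeline` are hypothetical names chosen for illustration. Each stage runs in its own thread and hands pages to the next stage through a queue, so one page can be in recognition while the next is still in layout analysis. Omitting a step means leaving it out of the stage list.

```python
import queue
import threading
from typing import Callable, Iterable, List

Page = dict  # hypothetical page record, e.g. {"name": "p1"}
Stage = Callable[[Page], Page]
_DONE = object()  # sentinel that marks the end of the page stream


def preprocess(page: Page) -> Page:
    return {**page, "deskewed": True}  # placeholder for binarization / deskew


def detect_layout(page: Page) -> Page:
    return {**page, "regions": 1}  # placeholder for region and baseline detection


def extract_lines(page: Page) -> Page:
    return {**page, "lines": 20}  # placeholder for line segmentation


def recognize(page: Page) -> Page:
    return {**page, "text": f"<text of {page['name']}>"}  # placeholder for HTR / OCR


def _worker(stage: Stage, inbox: queue.Queue, outbox: queue.Queue) -> None:
    """Apply one stage to each page as it arrives, then forward the sentinel."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)
            return
        outbox.put(stage(item))


def run_pipeline(pages: Iterable[Page], stages: List[Stage]) -> List[Page]:
    """Push pages through the stages; one thread per stage lets stages overlap.

    Sketch only: a stage that raises would stall the pipeline, so real code
    would need error handling.
    """
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(target=_worker, args=(stage, queues[i], queues[i + 1]))
        for i, stage in enumerate(stages)
    ]
    for t in threads:
        t.start()
    for page in pages:
        queues[0].put(page)
    queues[0].put(_DONE)

    results = []
    while True:
        item = queues[-1].get()
        if item is _DONE:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results
```

For scans that are already clean, the preprocessing step could be skipped with `run_pipeline(pages, [detect_layout, extract_lines, recognize])`. Page order is preserved because each stage has a single worker and the queues are first-in, first-out.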

Text-recognition Engines

Standard HTR

Encoder-decoder neural network for handwritten and printed text. Optimized for throughput. Supports custom model training on your own data and works with the full catalog of public and private Transkribus models. Language model support improves accuracy on domain-specific content.

Scripts: Latin, German (Kurrent, Fraktur), major European scripts
Accuracy: CER 2–5% on clean documents, 5–10% on challenging material
Throughput: ~2–3 s/page per GPU (warm, ~20 lines/page)
VRAM: ~4 GB per concurrent model

Best for: Large-scale batch processing, well-supported scripts, custom-trained models

Super Models

Larger architecture with broader script coverage and higher accuracy on difficult material. Provides access to the full Transkribus Super Models catalog, which covers dozens of scripts and languages, including historical German, Latin, Greek, Cyrillic, Hebrew, Arabic, and East Asian scripts.

Scripts: 70+ scripts including Latin, Greek, Cyrillic, Hebrew, Arabic, East Asian
Accuracy: CER 1–3% on common scripts, 3–7% on rare material
Throughput: ~4–5 s/page per GPU (warm, ~20 lines/page)
VRAM: ~8 GB per concurrent model

Best for: Rare scripts, mixed-language documents, highest-accuracy requirements

Both engines can be available simultaneously on the same installation. The user selects per job. Use Standard HTR for high-volume batch processing of well-supported scripts. Use Super Models when working with rare scripts, mixed-language documents, or when minimizing CER is the primary concern.
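The selection guidance and the per-model VRAM figures above can be turned into a small planning helper. This is a sketch under stated assumptions: the engine keys, `choose_engine`, and `fits_on_gpu` are hypothetical names, not part of any product API, and the check only counts recognition models (layout or table models, which also need memory, are ignored).

```python
from typing import List

# Approximate VRAM per concurrently loaded model, taken from the figures above.
# The dictionary keys are hypothetical labels, not product identifiers.
VRAM_GB = {"standard_htr": 4.0, "super_model": 8.0}


def choose_engine(rare_script: bool = False,
                  mixed_language: bool = False,
                  minimize_cer: bool = False) -> str:
    """Apply the rule of thumb from the text: Super Models for rare scripts,
    mixed-language documents, or when lowest CER matters most."""
    if rare_script or mixed_language or minimize_cer:
        return "super_model"
    return "standard_htr"


def fits_on_gpu(loaded_models: List[str], gpu_vram_gb: float,
                reserve_gb: float = 0.0) -> bool:
    """Return True if the listed models fit into the given VRAM budget.

    reserve_gb is an assumed safety margin for anything else on the GPU.
    """
    needed = sum(VRAM_GB[name] for name in loaded_models)
    return needed + reserve_gb <= gpu_vram_gb
```

Under these figures, one Standard HTR model plus one Super Model would just fit on a 12 GB card with no reserve, while two Super Models would not.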

Layout Analysis

Automatic detection of page structure before recognition. The layout model identifies where text, tables, headers, and other content regions are located, establishes baselines within text regions, and determines reading order. Multiple layout models are available for different document types and historical periods.

  • Text regions
  • Baselines
  • Reading order
  • Tables
  • Headers & footers
  • Marginalia
  • Illustrations
  • Drop capitals

Tables & Fields

Dedicated model types for structured data extraction. Table models detect row and column structure within table regions identified during layout analysis. Field models extract values from forms and standardized documents with known layouts. Both produce structured output ready for database ingestion or downstream processing.

  • Table extraction with row and column structure
  • Cell content recognition within detected tables
  • Field extraction from forms and standardized document types
  • Structured output as part of PageXML or standalone export
  • Custom field models for domain-specific document layouts

Output Formats

Format | What's included | Typical use
PageXML | Baselines, polygons, text, per-character confidence, metadata | Round-trip with Transkribus, scholarly editing, preservation
ALTO XML | Library-standard OCR structure | METS containers, institutional repositories, Europeana
Searchable PDF | Invisible word-level text layer over original scan | End-user access, full-text search, citation
Plain Text | UTF-8 text, one file per page | Full-text indexing, NLP pipelines, corpus building
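For downstream processing, PageXML output can be read with a standard XML parser. The sketch below uses only the Python standard library and pulls the text and baseline of each line. The sample document is hand-written for illustration and is not actual On-Prem output. `read_lines` is a hypothetical helper name. The namespace is read from the root element rather than hard-coded, on the assumption that the file follows the PAGE schema (`TextLine`, `Baseline`, `TextEquiv/Unicode`).

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Hand-written sample in the PAGE format; real output carries more metadata.
SAMPLE_PAGEXML = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="page_0001.jpg" imageWidth="2000" imageHeight="3000">
    <TextRegion id="r1">
      <TextLine id="r1l1">
        <Baseline points="100,200 900,205"/>
        <TextEquiv><Unicode>First line</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="r1l2">
        <Baseline points="100,260 900,262"/>
        <TextEquiv><Unicode>Second line</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def read_lines(xml_text: str) -> List[Dict]:
    """Return id, text, and baseline points for every TextLine in document order."""
    root = ET.fromstring(xml_text)
    # ElementTree stores tags as "{namespace}Name"; reuse the root's namespace.
    ns = root.tag[: root.tag.index("}") + 1] if root.tag.startswith("{") else ""

    lines = []
    for line in root.iter(f"{ns}TextLine"):
        unicode_el = line.find(f"{ns}TextEquiv/{ns}Unicode")
        baseline_el = line.find(f"{ns}Baseline")
        points = []
        if baseline_el is not None:
            # Points are written as "x1,y1 x2,y2 ..." with integer coordinates.
            points = [
                tuple(int(v) for v in pair.split(","))
                for pair in baseline_el.get("points", "").split()
            ]
        text = unicode_el.text if unicode_el is not None and unicode_el.text else ""
        lines.append({"id": line.get("id"), "text": text, "baseline": points})
    return lines
```

Joining the `text` values with newlines gives roughly what the Plain Text export contains for the same page.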

Model Training

Train custom recognition models on your own documents. All training runs locally on your GPU — no data leaves your infrastructure. Enterprise deployments can train heavier architectures, including Super Model-class models, for maximum accuracy on institution-specific collections.

  1. Prepare Ground Truth

    Transcribe a sample of your documents — typically 50–100 pages for fine-tuning an existing base model. The web dashboard includes ground truth editing tools.

  2. Train

    Select a base model and start training on your GPU. Training time is typically 2–6 hours for a fine-tuning run, depending on dataset size and hardware.

  3. Evaluate

    The system reports CER (Character Error Rate) on a held-out validation set. Compare against the base model to measure improvement.

  4. Deploy

    Publish the trained model to your local model registry. It becomes available for recognition jobs immediately — no restart needed.

Fine-tuning typically takes hours, not days. A base model trained on similar material can be adapted to a specific hand or document collection with relatively little ground truth, often the 50–100 pages mentioned in step 1.
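The CER reported in the Evaluate step is conventionally defined as the character-level edit distance between the recognized text and the ground truth, divided by the length of the ground truth. The sketch below shows that standard definition. Whether the built-in evaluation applies extra normalization (whitespace, Unicode forms) is not stated here, so treat the function names and details as assumptions.

```python
from typing import Iterable, Tuple


def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (0 if ref_char == hyp_char else 1)
            current.append(min(previous[j] + 1,      # deletion
                               current[j - 1] + 1,   # insertion
                               substitution))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate for one line or page."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def corpus_cer(pairs: Iterable[Tuple[str, str]]) -> float:
    """CER over a validation set: total edits divided by total reference length."""
    total_edits = 0
    total_chars = 0
    for reference, hypothesis in pairs:
        total_edits += edit_distance(reference, hypothesis)
        total_chars += len(reference)
    if total_chars == 0:
        raise ValueError("validation set contains no reference characters")
    return total_edits / total_chars
```

To compare a fine-tuned model with its base model, compute `corpus_cer` for both on the same held-out pages. A drop from, say, 0.08 to 0.03 would correspond to going from 8% to 3% CER.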

Extensible Architecture

The processing pipeline is designed as a framework, not a fixed sequence. New model architectures and recognition tasks can be integrated over time as they become available — the system is not limited to the current set of HTR, layout, table, and field models. The containerized architecture allows new processing stages to be added without disrupting existing workflows.

Architecture

Workstation

Access
Browser: Web Dashboard
Services
Web Server: nginx / port 443
Processing
Recognition: GPU-accelerated
Training: Optional
Data
Database: PostgreSQL
Storage: Local / NAS

Single-server deployment with Docker Compose. All services run on one machine — web dashboard, recognition engine, training, database, and local storage. Set up in an afternoon. No Kubernetes, no cluster infrastructure. Models stay loaded on the GPU across jobs for sub-second startup on subsequent pages.

Enterprise (Kubernetes / OpenShift)

Access
Ingress: API Gateway / LB
Services
REST API: Recognition Service
Dashboard: Web UI
Processing
GPU Worker 1: A100 / H100
GPU Worker 2: A100 / H100
GPU Worker N: Scale out
Training Jobs: K8s Jobs
Data
S3 Storage: MinIO / Ceph
Monitoring: Prometheus

Kubernetes-native deployment with horizontal scaling. Each pipeline stage scales independently via HPA. GPU inference uses a server/client architecture — a single GPU serves multiple client watchers. Supports full NVIDIA GPUs and MIG partitions. Event coordination via Redis pub/sub. Storage via S3-compatible object storage (MinIO, Ceph, AWS S3). Deployed via Helm with ArgoCD recommended for GitOps. Rolling updates without downtime.

System Requirements

Workstation

Component | Minimum | Recommended
OS | Ubuntu 22.04+ / Windows Server 2022 | Ubuntu 22.04 LTS
CPU | 8 cores | 16+ cores
RAM | 32 GB | 64 GB
GPU | NVIDIA, 12 GB VRAM (RTX 3060+) | RTX 4090 / A6000 (24 GB+ VRAM)
Storage | 500 GB SSD | 1 TB+ NVMe
NVIDIA Driver | 565.57+ | Latest stable
CUDA | 12.4+ | 12.4+
Docker | 24.0+ | Latest stable

Enterprise

Component | Requirement
Orchestration | Kubernetes 1.27+ or OpenShift 4.x
GPU Operator | NVIDIA GPU Operator with MIG support
Storage | S3-compatible object storage (MinIO, Ceph, AWS S3)
GPU per worker | NVIDIA A100 or H100 recommended (MIG partitioning supported)
Event coordination | Redis (pub/sub for job coordination)
Monitoring | Prometheus + Grafana (metrics exported natively)
Deployment | Helm chart provided, ArgoCD recommended
NVIDIA Driver | 565.57+ / CUDA 12.4+

Performance

Throughput benchmarks at ~20 lines per page. Actual results depend on document complexity, page dimensions, and lines per page. Sparse pages run faster, dense pages slower — roughly linear with line count.

Workstation (single GPU, RTX 3090)

Workload | Standard HTR | Super Models
Single page (cold start) | ~10 s | ~13 s
Per page (warm, amortized) | ~3 s | ~5 s
Archive box (100 pages) | ~5 min | ~8 min
Archival run (500 pages) | ~25 min | ~42 min
Daily throughput (24 h) | ~27,000 pages | ~16,500 pages

Enterprise (per A100)

Workload | Standard HTR | Super Models
Per page (warm, amortized) | ~2 s | ~4 s
Archive box (100 pages) | ~3.5 min | ~7 min
Archival run (500 pages) | ~17 min | ~33 min
Daily per GPU (24 h) | ~42,000 pages | ~21,000 pages
8× A100 cluster (24 h) | ~300,000 pages | ~168,000 pages

Cold start adds 5–10 seconds for model loading. Subsequent pages in the same batch run at the warm throughput above. Throughput scales approximately linearly with GPU count (the 8× A100 figure for Standard HTR, ~300,000 pages, is somewhat below eight times the single-GPU figure). Add inference-server replicas with dedicated GPUs or MIG partitions to multiply capacity.
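For capacity planning, the relationships stated in this section (warm time per page, roughly linear scaling with line count and GPU count, a one-off cold start) can be combined into a back-of-the-envelope estimate. This is a sketch, not a benchmark tool; the function names and the idealized linear model are assumptions.

```python
REFERENCE_LINES_PER_PAGE = 20  # the benchmark tables assume ~20 lines per page
SECONDS_PER_DAY = 24 * 60 * 60


def estimate_batch_seconds(pages: int,
                           warm_s_per_page: float,
                           lines_per_page: float = REFERENCE_LINES_PER_PAGE,
                           cold_start_s: float = 0.0,
                           gpus: int = 1) -> float:
    """Idealized batch duration: cold start plus pages at the warm rate.

    Per-page time is scaled linearly with line count and divided evenly
    across GPUs, which ignores scheduling and I/O overhead.
    """
    if gpus < 1:
        raise ValueError("gpus must be at least 1")
    per_page = warm_s_per_page * lines_per_page / REFERENCE_LINES_PER_PAGE
    return cold_start_s + pages * per_page / gpus


def pages_per_day(warm_s_per_page: float, gpus: int = 1) -> int:
    """Idealized 24-hour page count at the warm rate for ~20-line pages."""
    return int(SECONDS_PER_DAY * gpus / warm_s_per_page)
```

With the workstation figure of ~3 s per page, `estimate_batch_seconds(100, 3.0)` gives 300 seconds, matching the ~5 min archive-box row. `pages_per_day(3.0)` gives 28,800, slightly above the ~27,000 in the table, so the idealized result is best read as an upper bound.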

API & Integration

Transkribus On-Prem exposes integration points for embedding recognition into your existing workflows, archive systems, and downstream pipelines.

  • REST API

    Submit jobs, query status, and retrieve results via HTTP. OpenAPI specification exposed at /openapi.json and /openapi.yaml — generate clients in any language. Available in the Enterprise edition.

  • S3 Ingestion

    Drop files into a designated S3/MinIO bucket and jobs start automatically. Results appear back in S3 as PageXML, ALTO, TXT, or PDF. Available in the Enterprise edition.

  • Streaming API

    Open live-streaming interface for real-time results. Results flow out line-by-line as pages are processed — embed into your own dashboards or downstream workflows.

  • Transkribus Compatibility

    Filenames, metadata, and PageXML output round-trip cleanly back into Transkribus. Compatible with existing Transkribus integrations — no workflow rewrites needed.
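Because the REST API publishes its OpenAPI specification at /openapi.json, a first integration step can be to download the spec and list the available operations before generating a client. The sketch below uses only the Python standard library. The base URL is a placeholder for your own installation, `fetch_spec` and `list_operations` are hypothetical helper names, and no endpoint names are assumed: whatever the spec declares is what gets listed.

```python
import json
from typing import List, Tuple
from urllib.request import urlopen

# HTTP methods that may appear as keys of an OpenAPI path item.
HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options", "trace"}


def fetch_spec(base_url: str, timeout: float = 10.0) -> dict:
    """Download the OpenAPI document from <base_url>/openapi.json."""
    url = f"{base_url.rstrip('/')}/openapi.json"
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def list_operations(spec: dict) -> List[Tuple[str, str, str]]:
    """Return (METHOD, path, operationId) for every operation in the spec."""
    operations = []
    for path, path_item in sorted(spec.get("paths", {}).items()):
        for method, operation in path_item.items():
            # Path items can also hold non-operation keys such as "parameters".
            if method in HTTP_METHODS:
                operations.append(
                    (method.upper(), path, operation.get("operationId", ""))
                )
    return operations
```

A typical use would be `list_operations(fetch_spec("https://onprem.example.org"))`, where the host name is a placeholder. The same downloaded document can be passed to an OpenAPI client generator of your choice.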