Technical Reference

Architecture, processing pipeline, system requirements, and performance benchmarks for Transkribus On-Prem.

Processing Pipeline

Image Input: TIFF, JPEG, PNG, PDF
→ Preprocessing: binarization, deskew
→ Layout Analysis: regions and baselines
→ Line Extraction: text segmentation
→ Recognition: HTR / OCR (GPU)
→ Output: PageXML, PDF, ALTO

Stages execute as a streaming pipeline: while one page is being recognized, the next is already having its layout detected. Pipeline steps are customisable. Steps that are irrelevant to your workflow can be omitted, and intermediate output can be routed to human reviewers (human-in-the-loop) before processing continues. Future AI models may provide end-to-end architectures that combine some or all of the steps (see Extensible Architecture). The result is a pipeline tailored to your material that runs only the stages you need.
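The overlap between stages can be illustrated with one worker thread per stage, connected by bounded queues. This is a minimal sketch of the general technique, not the product's implementation: `run_pipeline`, `run_stage`, and the stand-in stage functions `preprocess`, `detect_layout`, and `recognize` are hypothetical names, and pages are represented as plain strings.

```python
import queue
import threading

_DONE = object()  # sentinel that marks the end of the page stream


def run_stage(func, inbox, outbox):
    """Apply func to every item from inbox and forward the result to outbox."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)  # propagate shutdown to the next stage
            return
        outbox.put(func(item))


def run_pipeline(pages, stages):
    """Run each stage in its own thread so different pages occupy different stages."""
    # One queue in front of every stage plus one for the final results.
    queues = [queue.Queue(maxsize=2) for _ in range(len(stages) + 1)]
    workers = [
        threading.Thread(target=run_stage, args=(func, queues[i], queues[i + 1]))
        for i, func in enumerate(stages)
    ]

    def feed():
        # Feeding from a separate thread avoids blocking while results are collected.
        for page in pages:
            queues[0].put(page)
        queues[0].put(_DONE)

    workers.append(threading.Thread(target=feed))
    for worker in workers:
        worker.start()

    results = []
    while True:
        item = queues[-1].get()
        if item is _DONE:
            break
        results.append(item)
    for worker in workers:
        worker.join()
    return results


# Hypothetical stand-ins for real stages; each one just tags the page.
def preprocess(page):
    return page + " > preprocessed"


def detect_layout(page):
    return page + " > layout"


def recognize(page):
    return page + " > text"
```

Calling `run_pipeline(["p1", "p2"], [preprocess, detect_layout, recognize])` returns the pages in input order, because each stage is a single worker reading from a FIFO queue. Omitting a step amounts to passing a shorter `stages` list.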

Text-recognition Engines

Standard HTR

Encoder-decoder neural network for handwritten and printed text. Optimized for throughput. Supports custom model training on your own data and works with the full catalog of public and private Transkribus models. Language model support improves accuracy on domain-specific content.

Scripts: Latin, German (Kurrent, Fraktur), major European scripts
Accuracy: CER 2–5% on clean documents, 5–10% on challenging material
Throughput: ~2–3 s/page per GPU (warm, ~20 lines/page)
VRAM: ~4 GB per concurrent model

Best for: Large-scale batch processing, well-supported scripts, custom-trained models

Super Models

Larger architecture with broader script coverage and higher accuracy on difficult material. Access to the full Transkribus Super Models catalog — dozens of scripts and languages, including historical German, Latin, Greek, Cyrillic, Hebrew, Arabic, and East Asian scripts.

Scripts: 70+ scripts including Latin, Greek, Cyrillic, Hebrew, Arabic, East Asian
Accuracy: CER 1–3% on common scripts, 3–7% on rare material
Throughput: ~4–5 s/page per GPU (warm, ~20 lines/page)
VRAM: ~8 GB per concurrent model

Best for: Rare scripts, mixed-language documents, highest-accuracy requirements

Both engines can be available simultaneously on the same installation. The user selects per job. Use Standard HTR for high-volume batch processing of well-supported scripts. Use Super Models when working with rare scripts, mixed-language documents, or when minimizing CER is the primary concern.
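The selection guidance and VRAM figures above can be written down as a small planning helper. This is an illustrative sketch only: `pick_engine`, `max_concurrent_models`, and the `WELL_SUPPORTED` set are hypothetical names, the engine identifiers are made up for the example, and the VRAM values are the approximate per-model figures quoted in this section (framework overhead is ignored).

```python
# Approximate VRAM per concurrently loaded model in GB, taken from the figures above.
VRAM_PER_MODEL_GB = {"standard_htr": 4, "super_models": 8}

# Hypothetical shorthand for scripts that Standard HTR covers well.
WELL_SUPPORTED = {"latin", "kurrent", "fraktur"}


def pick_engine(script, mixed_language=False, minimize_cer=False):
    """Return an engine name following the per-job guidance in the text."""
    if mixed_language or minimize_cer or script.lower() not in WELL_SUPPORTED:
        return "super_models"
    return "standard_htr"


def max_concurrent_models(gpu_vram_gb, engine):
    """Estimate how many models of one engine fit into the given VRAM."""
    return int(gpu_vram_gb // VRAM_PER_MODEL_GB[engine])
```

Under these assumptions a 24 GB card holds about six Standard HTR models or three Super Models at once, while a 12 GB card holds a single Super Model.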

Layout Analysis

Automatic detection of page structure before recognition. The layout model identifies where text, tables, headers, and other content regions are located, establishes baselines within text regions, and determines reading order. Multiple layout models are available for different document types and historical periods.

Text regions
Baselines
Reading order
Tables
Headers & footers
Marginalia
Illustrations
Drop capitals

Tables & Fields

Dedicated model types for structured data extraction. Table models detect row and column structure within table regions identified during layout analysis. Field models extract values from forms and standardized documents with known layouts. Both produce structured output ready for database ingestion or downstream processing.

Table extraction with row and column structure
Cell content recognition within detected tables
Field extraction from forms and standardized document types
Structured output as part of PageXML or standalone export
Custom field models for domain-specific document layouts
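To show what "ready for database ingestion" can mean in practice, the sketch below turns recognized table cells into rows and CSV text. The input shape is an assumption made for the example (a list of `(row, column, text)` tuples), and `cells_to_rows` and `rows_to_csv` are hypothetical helpers, not part of the product.

```python
import csv
import io


def cells_to_rows(cells):
    """Arrange (row, col, text) tuples into a rectangular grid of strings."""
    if not cells:
        return []
    n_rows = max(row for row, _, _ in cells) + 1
    n_cols = max(col for _, col, _ in cells) + 1
    grid = [[""] * n_cols for _ in range(n_rows)]  # missing cells stay empty
    for row, col, text in cells:
        grid[row][col] = text
    return grid


def rows_to_csv(rows):
    """Serialize the grid as CSV text using the csv module's default dialect."""
    buffer = io.StringIO(newline="")
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()
```

Cells that the table model did not detect are left as empty strings, so every row keeps the same number of columns.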

Output Formats

Format	What's included	Typical use
PageXML	Baselines, polygons, text, per-character confidence, metadata	Round-trip with Transkribus, scholarly editing, preservation
ALTO XML	Library-standard OCR structure	METS containers, institutional repositories, Europeana
Searchable PDF	Invisible word-level text layer over original scan	End-user access, full-text search, citation
Plain Text	UTF-8 text, one file per page	Full-text indexing, NLP pipelines, corpus building
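PageXML output can be read with any XML parser. The sketch below uses Python's standard library to pull line text and baseline coordinates out of a PAGE document. The embedded sample is a hand-written minimal document for illustration, not actual product output, and `read_lines` is a hypothetical helper. The namespace is taken from the root element so that different PAGE schema versions are handled the same way.

```python
import xml.etree.ElementTree as ET

# Hand-written minimal PAGE document used only for this example.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="page_0001.jpg" imageWidth="2000" imageHeight="3000">
    <TextRegion id="r1">
      <TextLine id="r1l1">
        <Baseline points="100,200 900,205"/>
        <TextEquiv><Unicode>Anno Domini 1742</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="r1l2">
        <Baseline points="100,260 880,262"/>
        <TextEquiv><Unicode>den 3. Martii</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def read_lines(pagexml):
    """Return id, text, and baseline points for every TextLine in the document."""
    root = ET.fromstring(pagexml)
    # The root tag looks like "{namespace}PcGts"; reuse that namespace for lookups.
    ns = {"pc": root.tag[1:root.tag.index("}")]}
    lines = []
    for line in root.iterfind(".//pc:TextLine", ns):
        text = line.findtext("pc:TextEquiv/pc:Unicode", default="", namespaces=ns)
        baseline = line.find("pc:Baseline", ns)
        points = []
        if baseline is not None:
            # "x1,y1 x2,y2 ..." becomes [(x1, y1), (x2, y2), ...]
            points = [
                tuple(int(value) for value in pair.split(","))
                for pair in baseline.get("points", "").split()
            ]
        lines.append({"id": line.get("id"), "text": text, "baseline": points})
    return lines
```

Joining the `text` values with newlines gives roughly what the Plain Text export contains for the same page.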

Model Training

Train custom recognition models on your own documents. All training runs locally on your GPU — no data leaves your infrastructure. Enterprise deployments can train heavier architectures, including Super Model-class models, for maximum accuracy on institution-specific collections.

Prepare Ground Truth
Transcribe a sample of your documents — typically 50–100 pages for fine-tuning an existing base model. The web dashboard includes ground truth editing tools.
Train
Select a base model and start training on your GPU. Training time is typically 2–6 hours for a fine-tuning run, depending on dataset size and hardware.
Evaluate
The system reports CER (Character Error Rate) on a held-out validation set. Compare against the base model to measure improvement.
Deploy
Publish the trained model to your local model registry. It becomes available for recognition jobs immediately — no restart needed.

Fine-tuning typically takes hours, not days. A base model trained on similar material can be adapted to a specific hand or document collection with relatively little ground truth, typically the 50–100 pages mentioned under Prepare Ground Truth.
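CER is conventionally defined as the character-level edit distance between the recognized text and the ground truth, divided by the length of the ground truth. The sketch below computes it with a standard Levenshtein dynamic program. It illustrates the metric only; whether the system applies additional normalization (whitespace, Unicode forms) before scoring is not stated here, and `edit_distance` and `cer` are names chosen for the example.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def cer(reference, hypothesis):
    """Character Error Rate relative to the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

To measure improvement, compute `cer` for the base model and the fine-tuned model on the same held-out lines and compare the two values.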

Extensible Architecture

The processing pipeline is designed as a framework, not a fixed sequence. New model architectures and recognition tasks can be integrated over time as they become available — the system is not limited to the current set of HTR, layout, table, and field models. The containerized architecture allows new processing stages to be added without disrupting existing workflows.

Architecture

Workstation

Access: Browser (web dashboard)
Services: Web Server (nginx, port 443)
Processing: Recognition (GPU-accelerated), Training (optional)
Data: Database (PostgreSQL), Storage (local / NAS)

Single-server deployment with Docker Compose. All services run on one machine: web dashboard, recognition engine, training, database, and local storage. Setup takes an afternoon and requires no Kubernetes or cluster infrastructure. Models stay loaded on the GPU across jobs, so subsequent pages start in under a second.

Enterprise (Kubernetes / OpenShift)

Access: Ingress (API gateway / load balancer)
Services: REST API (recognition service), Dashboard (web UI)
Processing: GPU Worker 1 (A100 / H100), GPU Worker 2 (A100 / H100), GPU Worker N (scale out), Training Jobs (K8s Jobs)
Data: S3 Storage (MinIO / Ceph), Monitoring (Prometheus)

Kubernetes-native deployment with horizontal scaling. Each pipeline stage scales independently via the Horizontal Pod Autoscaler (HPA). GPU inference uses a server/client architecture in which a single GPU inference server serves multiple client watchers. Supports full NVIDIA GPUs and MIG partitions. Event coordination via Redis pub/sub. Storage via S3-compatible object storage (MinIO, Ceph, AWS S3). Deployed via Helm, with ArgoCD recommended for GitOps. Rolling updates without downtime.

System Requirements

Workstation

Component	Minimum	Recommended
OS	Ubuntu 22.04+ / Windows Server 2022	Ubuntu 22.04 LTS
CPU	8 cores	16+ cores
RAM	32 GB	64 GB
GPU	NVIDIA, 12 GB VRAM (RTX 3060+)	RTX 4090 / A6000 (24 GB VRAM)
Storage	500 GB SSD	1 TB+ NVMe
NVIDIA Driver	565.57+	Latest stable
CUDA	12.4+	12.4+
Docker	24.0+	Latest stable

Enterprise

Component	Requirement
Orchestration	Kubernetes 1.27+ or OpenShift 4.x
GPU Operator	NVIDIA GPU Operator with MIG support
Storage	S3-compatible object storage (MinIO, Ceph, AWS S3)
GPU per worker	NVIDIA A100 or H100 recommended (MIG partitioning supported)
Event coordination	Redis (pub/sub for job coordination)
Monitoring	Prometheus + Grafana (metrics exported natively)
Deployment	Helm chart provided, ArgoCD recommended
NVIDIA Driver	565.57+ / CUDA 12.4+

Performance

Throughput benchmarks assume ~20 lines per page. Actual results depend on document complexity, page dimensions, and lines per page. Sparse pages run faster and dense pages slower; processing time is roughly linear in line count.

Workstation (single GPU, RTX 3090)

Workload	Standard HTR	Super Models
Single page (cold start)	~10 s	~13 s
Per page (warm, amortized)	~3 s	~5 s
Archive box (100 pages)	~5 min	~8 min
Archival run (500 pages)	~25 min	~42 min
Daily throughput (24 h)	~27,000 pages	~16,500 pages

Enterprise (per A100)

Workload	Standard HTR	Super Models
Per page (warm, amortized)	~2 s	~4 s
Archive box (100 pages)	~3.5 min	~7 min
Archival run (500 pages)	~17 min	~33 min
Daily per GPU (24 h)	~42,000 pages	~21,000 pages
8× A100 cluster (24 h)	~300,000 pages	~168,000 pages

Cold start adds 5–10 seconds for model loading. Subsequent pages in the same batch use the warm throughput above. Throughput scales roughly linearly with GPU count (the 8× A100 figures above are about 7 to 8 times the single-GPU figures). Add inference-server replicas with dedicated GPUs or MIG partitions to increase capacity.
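The table values follow from simple arithmetic on the warm per-page time. The sketch below reproduces that arithmetic under the stated assumptions: time is linear in lines per page relative to the 20-line reference, and throughput is linear in GPU count. `batch_seconds` and `pages_per_day` are hypothetical helper names.

```python
SECONDS_PER_DAY = 24 * 60 * 60
REFERENCE_LINES = 20  # the benchmarks assume ~20 lines per page


def batch_seconds(pages, warm_s_per_page, cold_start_s=0.0, lines_per_page=REFERENCE_LINES):
    """Estimated wall-clock time for one batch on a single GPU."""
    scale = lines_per_page / REFERENCE_LINES  # denser pages take proportionally longer
    return cold_start_s + pages * warm_s_per_page * scale


def pages_per_day(warm_s_per_page, gpus=1, lines_per_page=REFERENCE_LINES):
    """Idealized 24-hour throughput, assuming perfectly linear GPU scaling."""
    scale = lines_per_page / REFERENCE_LINES
    return int(gpus * SECONDS_PER_DAY / (warm_s_per_page * scale))
```

For example, `batch_seconds(100, 3) / 60` gives 5 minutes, matching the workstation archive-box figure for Standard HTR. `pages_per_day(3)` gives 28,800, slightly above the ~27,000 quoted in the table, so treat the result as an upper bound rather than a guaranteed rate.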

API & Integration

Transkribus On-Prem exposes integration points for embedding recognition into your existing workflows, archive systems, and downstream pipelines.

REST API
Submit jobs, query status, and retrieve results via HTTP. OpenAPI specification exposed at /openapi.json and /openapi.yaml — generate clients in any language. Available in the Enterprise edition.
S3 Ingestion
Drop files into a designated S3/MinIO bucket and jobs start automatically. Results appear back in S3 as PageXML, ALTO, TXT, or PDF. Enterprise edition.
Streaming API
Open live-streaming interface for real-time results. Results flow out line-by-line as pages are processed — embed into your own dashboards or downstream workflows.
Transkribus Compatibility
Filenames, metadata, and PageXML output round-trip cleanly back into Transkribus. Compatible with existing Transkribus integrations — no workflow rewrites needed.
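For the REST API entry above, a reasonable first step is to read the published OpenAPI document and list the operations it declares, rather than guessing endpoint names. The sketch below does this with the standard library. The base URL is whatever your installation uses, `fetch_spec` and `list_operations` are hypothetical helpers, and the `/jobs` path in the usage note is invented for illustration; the real paths are whatever your instance's specification lists.

```python
import json
from urllib.request import urlopen

# HTTP methods that may appear as keys of an OpenAPI path item.
HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options", "trace"}


def fetch_spec(base_url):
    """Download and parse the OpenAPI document exposed at /openapi.json."""
    with urlopen(base_url.rstrip("/") + "/openapi.json") as response:
        return json.load(response)


def list_operations(spec):
    """Return (METHOD, path, summary) for every operation in an OpenAPI document."""
    operations = []
    for path, item in sorted(spec.get("paths", {}).items()):
        for method, operation in item.items():
            if method in HTTP_METHODS:  # skip non-operation keys such as "parameters"
                operations.append((method.upper(), path, operation.get("summary", "")))
    return operations
```

Given a specification containing a hypothetical `/jobs` path with a `post` operation, `list_operations` returns an entry such as `("POST", "/jobs", "Submit a job")`. The same document can be passed to an OpenAPI client generator to produce a typed client.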
