f.erhard · PyLaia · Published July 18, 2025

Tibetan Generic 0.1

Text Recognition

Description

First version of a generic Tibetan model that covers Uchan (dbu can) and Ume (dbu med) as well as some English and Chinese. The texts date from the 18th to the 20th century and include legal texts (Daniel Wojahn), modern books from the 1950s to 1980s (Divergent Discourses), and Tibetan-language newspapers from the 1950s and 1960s (Divergent Discourses). "Test model Chinese" was chosen as the base model to introduce some basic knowledge of Chinese, which often appears in Tibetan texts but is represented only to a limited extent in the training data. Word count: 161482 words; validation set: 153 pages; training set: 1380 pages. Training cycles: 250; early stopping: 20; lines tagged "unclear" or "gap" were omitted; binarization enabled.

Very low error rate: 3.58% CER

Character Error Rate (CER) measures the percentage of characters incorrectly recognised. Lower is better. This model scored 3.58% on its validation set. As a rule of thumb, a CER below 10% is considered good for most handwritten material. This is a larger model trained on diverse material, which generally makes it more robust across different handwriting styles. That said, larger training sets also make it harder to push the CER down further.
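To make the metric concrete, here is a minimal sketch of how a CER is commonly computed: the edit (Levenshtein) distance between the reference transcription and the recognised text, divided by the length of the reference. The function names `levenshtein` and `cer` are illustrative, and the sketch assumes characters are counted as Unicode code points; the platform's own evaluation may normalise or count Tibetan stacks differently.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn `ref` into `hyp`."""
    # prev[j] holds the distance between the processed prefix of ref
    # and the first j characters of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from ref
                curr[j - 1] + 1,     # insertion into ref
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate as a fraction of the reference length.
    Assumption: characters are Unicode code points."""
    if not ref:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Hypothetical example lines, not taken from the model's data.
    reference = "བཀྲ་ཤིས"
    hypothesis = "བཀྲ་ཤིམ"
    print(f"CER: {cer(reference, hypothesis):.2%}")
```

Because the denominator is the reference length, a page with many insertions can in principle exceed 100% under this definition.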

Measured on the model's own validation data. Results on your documents may differ depending on handwriting style, document condition, language, and how closely your material resembles the training data.

Words: 143,742
Lines: 97,979
Training Pages: 1,380
Model ID: 373545
Languages: Tibetan
Centuries: 18th c., 19th c., 20th c.