Franz Xaver Erhard (Leipzig University), Xiaoying 笑影 · PyLaia · Published March 12, 2024

Tibetan Modern U-chen Print 0.1

Text Recognition

Description

Tibetan Modern U-chen Print 0.1 (TMUP 0.1) is the first Transkribus HTR model for printed Tibetan-language publications in U-chen (དབུ་ཅན་ dbu can) script. It was trained on texts published in the PRC between the 1950s and 1980s. The data comprise 522 pages from 20 documents: 470 pages in the training set and 52 automatically selected pages (10%) in the validation set. No base model was used. The model was developed by Franz Xaver Erhard (Leipzig University) and Xiaoying 笑影 (Leipzig University) for the Divergent Discourses project (DFG/AHRC). Project website: https://research.uni-leipzig.de/diverge/
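The page counts above are consistent with a simple random 10% hold-out: 10% of 522 pages rounds to 52, leaving 470 for training. The sketch below only illustrates that arithmetic. How Transkribus actually selects validation pages is not stated here, so the function name `split_pages`, the seed, and the integer page IDs are all hypothetical.

```python
import random


def split_pages(page_ids, val_fraction=0.10, seed=0):
    """Randomly hold out a fraction of pages for validation.

    Assumption: a plain random sample; this is not Transkribus's
    documented selection procedure.
    """
    rng = random.Random(seed)
    n_val = round(len(page_ids) * val_fraction)
    held_out = set(rng.sample(page_ids, n_val))
    train = [p for p in page_ids if p not in held_out]
    validation = [p for p in page_ids if p in held_out]
    return train, validation


# 522 hypothetical page IDs, numbered 1..522
train, validation = split_pages(list(range(1, 523)))
print(len(train), len(validation))
```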

Very low error rate: 1.8% CER

Character Error Rate (CER) measures the percentage of characters incorrectly recognised. Lower is better. This model scored 1.8% on its validation set. As a rule of thumb, a CER below 10% is considered good for most handwritten material.
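As an illustration of the metric, CER is commonly computed as the character-level edit distance (insertions, deletions, substitutions) between the recognised text and the reference transcription, divided by the reference length. The sketch below assumes that definition and counts Unicode code points as characters. Whether Transkribus counts Tibetan stacks, tsheg marks, or whitespace the same way is not stated here, and the helper names `edit_distance` and `cer` are illustrative.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, one row at a time."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate relative to the reference length.

    Assumption: each Unicode code point counts as one character.
    """
    if not ref:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# One substituted character in a 7-character reference
print(f"{cer('dbu can', 'dbu cen'):.1%}")
```

Under this definition, a CER of 1.8% corresponds to roughly 18 character errors per 1,000 reference characters.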

Measured on the model's own validation data. Results on your documents may differ depending on handwriting style, document condition, language, and how closely your material resembles the training data.

Words: 9,822
Lines: 7,989
Training Pages: 432
Model ID: 60669
Languages: Tibetan
Centuries: 20th c.