f.erhard · PyLaia · Published September 6, 2024

TibNewsOne4All 0.2

Text Recognition

Description

The model TibNewsOne4All was trained on 500 pages (ca. 100,037 words) from 13 different Tibetan-language newspapers of the 1950s and 1960s, published in both India and the PRC. The model mainly transcribes Tibetan Uchen script, but can also handle cursive scripts and, to a very limited extent, Chinese and English. TibNewsOne4All was trained for Divergent Discourses, a collaborative research project led by Robert Barnett at SOAS and Franz Xaver Erhard at Leipzig University with funding from AHRC and DFG. For best results, it is recommended to perform text region and line polygon detection before HTR.

Settings:
- Training set: 500 pages
- Validation set: 27 pages
- Lines tagged "unclear" were excluded
- 250 epochs
- Early stopping: 20
- Existing line polygons were not used in training
- Base model: Tibetan language model TMUP 0.1

Very low error rate: 2.52% CER

Character Error Rate (CER) measures the percentage of characters incorrectly recognised. Lower is better. This model scored 2.52% on its validation set. As a rule of thumb, a CER below 10% is considered good for most handwritten material.

Measured on the model's own validation data. Results on your documents may differ depending on handwriting style, document condition, language, and how closely your material resembles the training data.
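
To make the metric concrete, here is a minimal sketch of how a CER can be computed, assuming the common definition: the Levenshtein (edit) distance between the reference transcription and the recognised text, divided by the length of the reference. The function names are illustrative, and the exact procedure Transkribus uses (for example, any normalisation before comparison) is not stated here, so the sketch is not a reproduction of it.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance between two strings, counted in Unicode code points."""
    # previous[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER in percent: edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


# One wrong character out of four gives a CER of 25%.
print(character_error_rate("abcd", "abxd"))
```

Python strings are compared by code point, so a Tibetan syllable with vowel signs or stacked consonants counts as several characters in this sketch. A tool that counts in different units would report a different figure for the same text.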

Words: 92,423
Lines: 67,093
Training Pages: 500
Model ID: 169581
Languages: Tibetan
Centuries: 20th c.