National Archives of Finland · PyLaia · Published January 22, 2020

NAF Court Records M10 v2

Text Recognition

Description

This model is based on Renovated District Court Records (Fi: Kihlakunnanoikeuksien renovoidut tuomiokirjat, Swe: Häradsrätternas renoverade domböcker) from the years 1809-1870. The model's training set consists of 2,841 double-pages and the validation set of 100 double-pages. Because the records were written by dozens of scribes, the material combines many different handwritings. The Ground Truth material was selected from 58 different court districts across Finland. Most of the Ground Truth is in Swedish, but there is also some Finnish, since some court districts started writing Court Records in Finnish from the 1850s onward. Renovated District Court Records are split into two series: Main Records and Notification Records. The training material consists mostly of Notification Records; nevertheless, the model also works well with Main Records. This model was created as part of the READ project at the National Archives of Finland (NAF). It has been used to transcribe the Notification Records from the years 1809-1870 (all districts). As a result, a search interface has been implemented where you can perform full-text searches and browse the automatically transcribed documents. The search interface and more information can be found at: https://tuomiokirjat.kansallisarkisto.fi/

Very low error rate: 2.5% CER

Character Error Rate (CER) measures the percentage of characters incorrectly recognised. Lower is better. This model scored 2.5% on its validation set. As a rule of thumb, a CER below 10% is considered good for most handwritten material. This is a larger model trained on diverse material, which generally makes it more robust across different handwriting styles. That said, larger training sets also make it harder to push the CER down further.
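To make the metric concrete, here is a minimal sketch of how a CER is commonly computed: the character-level edit distance (insertions, deletions, substitutions) between the recognised text and the Ground Truth, divided by the length of the Ground Truth. The function names `levenshtein` and `cer` are illustrative, and the page does not state the exact text normalisation Transkribus applies before scoring, so treat this as the textbook definition and not as the platform's implementation.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn `hypothesis` into `reference`."""
    # prev[j] holds the distance between the reference prefix processed so far
    # and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate as a fraction of the reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Hypothetical example: two diacritics lost in a ten-character word.
ground_truth = "häradsrätt"
recognised = "haradsratt"
print(f"CER: {cer(ground_truth, recognised):.1%}")
```

In this toy example two of ten characters are wrong, giving a CER of 20%. By the same definition, a CER of 2.5% corresponds to roughly one character error in every 40 characters of Ground Truth.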

Measured on the model's own validation data. Results on your documents may differ depending on handwriting style, document condition, language, and how closely your material resembles the training data.

Words: 1,226,202
Lines: 207,773
Training Pages: 2,841
Model ID: 20686
Languages: Swedish