20th Century Typewritten Letters - Diplomatics' Elements | Field Extraction Model

Description

This model was created by the Historical Archive of Mediobanca, an Italian investment bank, to identify and tag the elements of 20th century typewritten correspondence, in accordance with contemporary diplomatics. The training set consists of 251 pages. To train the model on a broad range of formats and structures, it contains outgoing letters written by the bank and letters received from a wide variety of senders belonging to large entities (banks and corporations, both Italian and international), as well as outgoing and incoming telegrams.

The model was trained to recognize the following diplomatics' elements and related tags:

1) "INTESTAZIONE_letterhead"
2) "DATA_date"
3) "RICEZIONE_date-received", usually a stamp
4) "MITTENTE_sender"
5) "DESTINATARIO_recipient"
6) "OGGETTO_subject"
7) "CORPO_textbody"
8) "FIRMA_signature"
9) "NOTE_notes", a tag used for typed notes
10) "NOTE-MS_handwritten-notes"
11) "RESPONSABILI_written-by", initials of those responsible for writing the letter
12) "VISTO_read-by", initials indicating who read the letter
13) "ALLEGATO_attachment", as body text

The tags are written in both Italian and English, with the structure "ITALIAN_english". The model achieves a Mean Average Precision of 59.43%. It was created alongside the "20th Century Typewritten Italian" Text Recognition Model as part of a larger project, and was trained by Silvia Carboni for Mediobanca's Historical Archive.
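Because every tag follows the "ITALIAN_english" structure, the two halves can be separated mechanically when post-processing exported results. The following is a minimal sketch under that assumption only; the names `TAGS` and `split_tag` are hypothetical helpers for illustration and are not part of Transkribus or of the model itself.

```python
# Tag names copied from the model description above.
TAGS = [
    "INTESTAZIONE_letterhead",
    "DATA_date",
    "RICEZIONE_date-received",
    "MITTENTE_sender",
    "DESTINATARIO_recipient",
    "OGGETTO_subject",
    "CORPO_textbody",
    "FIRMA_signature",
    "NOTE_notes",
    "NOTE-MS_handwritten-notes",
    "RESPONSABILI_written-by",
    "VISTO_read-by",
    "ALLEGATO_attachment",
]


def split_tag(tag: str) -> tuple[str, str]:
    """Split an "ITALIAN_english" tag at its first underscore (hypothetical helper)."""
    italian, english = tag.split("_", 1)
    return italian, english


# Map each full tag to its (Italian, English) label pair.
TAG_LABELS = {tag: split_tag(tag) for tag in TAGS}

if __name__ == "__main__":
    for tag, (italian, english) in TAG_LABELS.items():
        print(f"{tag}: it={italian} en={english}")
```

Splitting at the first underscore keeps hyphenated halves such as "NOTE-MS" and "date-received" intact.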

Good precision: 59.43% MaP

Mean Average Precision (MaP) measures how accurately the model detects field regions (higher is better). This model scored 59.43% on its validation set. MaP is harder to compare across models than the Character Error Rate (CER) reported for text recognition models, because the score depends heavily on how many distinct region types the model must distinguish. A model detecting a handful of simple fields will naturally score higher than one trained to recognise many fine-grained regions, even if both perform well in practice.

This score reflects performance on the model's own validation data. Your results will depend on how closely your documents match the training material and the complexity of the structures you need to detect.
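To make the dependence on the number of region types concrete, here is a minimal sketch of one common way to compute a mean average precision: a non-interpolated average precision per field type, averaged over all field types. This is an illustration under stated assumptions, not the evaluation Transkribus actually runs, which may differ (for example in how a detection is matched to a ground-truth region). The function names and the example data are hypothetical.

```python
def average_precision(hits: list[bool], n_ground_truth: int) -> float:
    """Non-interpolated AP for one field type.

    `hits` lists the detections for that type, sorted by descending
    confidence; an entry is True if the detection was matched to a
    ground-truth region. The matching itself is assumed to be done upstream.
    """
    if n_ground_truth == 0:
        return 0.0
    true_positives = 0
    precision_sum = 0.0
    for rank, is_hit in enumerate(hits, start=1):
        if is_hit:
            true_positives += 1
            # Precision measured at the rank of each correct detection.
            precision_sum += true_positives / rank
    return precision_sum / n_ground_truth


def mean_average_precision(per_type: dict[str, tuple[list[bool], int]]) -> float:
    """Unweighted mean of the per-type AP values."""
    scores = [average_precision(hits, n) for hits, n in per_type.values()]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Invented example data, not taken from the model's validation set.
    example = {
        "DATA_date": ([True, True], 2),
        "FIRMA_signature": ([True, False, True], 3),
    }
    print(f"{mean_average_precision(example):.4f}")
```

Because the final score is an unweighted mean over field types, a single rare or hard-to-detect type lowers the overall figure as much as a common one, which is one reason models with many fine-grained types tend to report lower values.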

Words: 32,155

Lines: 5,346

Training Pages: 251

Model ID: 421081

Related models

Field-model, 1700-tallets supplikprotokoller, supplik og svar

Danish Newspapers 1800-1900

Basic Book Fields II

Page Layout of printed books (around 1800)