+ What is a text? Starting to understand the theory behind Automated Text Recognition

Transkribus Team

November 19, 2017

What is a text? A simple question with a not-so-simple answer. Coming from the scholarly editing tradition, Patrick Sahle, Professor at Albertus Magnus University of Cologne, has demonstrated in detail how differently text can be perceived, or rather understood: from a string of signs on paper to a work by a literate individual that has to be (re)constructed from several versions and prints.

To systematically analyze different aspects of a text, Sahle started drawing the so-called ‘text-wheel’ (there’s a chapter about this in his third volume on scholarly digital editions, pp. 45-55; see also Sahle, Patrick: What is a Scholarly Digital Edition?, in: Matthew James Driscoll and Elena Pierazzo (eds.), Digital Scholarly Editing: Theories and Practices. Cambridge, UK: Open Book Publishers, 2016. OBP.0095, pp. 20-39).

The result is a range of different entities that a text can be understood as; some of the meanings oppose each other, others do not differ much.

In order to start understanding Automated Text Recognition from a theoretical standpoint, we began discussing with Professor Sahle how ‘text’ is recognized in Transkribus, and what form of it (the same question applies to recognition tools in general, such as OCR engines). The result is our own ‘text-wheel’, drawn by Julia Sorouri.

Most importantly, text in Transkribus is understood as signs on a surface: you will need facsimiles, or rather digitized images of documents, in order to perform Automated Text Recognition. Through interpretation via machine learning (or typing by a human), it’s possible to produce text as it exists as a document (separated into text regions and line regions, and possibly word regions too in the future). From this point you can go on to extract text as a linguistic entity or as a work (for example by using Document Understanding technology to identify titles or marginalia), or even build upon entities in the text, understanding text as a carrier of information.
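For readers who think in data structures, the layers described above can be sketched in a few lines of Python. This is only an illustration under our own assumptions, not how Transkribus stores or exports its data: the class names (`Page`, `TextRegion`, `Line`), the `kind` labels and the three helper functions are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Line:
    """One line region with its transcription (hypothetical model)."""
    text: str


@dataclass
class TextRegion:
    """A block of lines; `kind` is an assumed label such as 'title' or 'marginalia'."""
    kind: str
    lines: List[Line] = field(default_factory=list)


@dataclass
class Page:
    """A digitized image (the surface) plus the regions recognized on it."""
    image: str
    regions: List[TextRegion] = field(default_factory=list)


def document_text(page: Page) -> str:
    """Text as document: keep region and line breaks as found on the page."""
    return "\n\n".join(
        "\n".join(line.text for line in region.lines) for region in page.regions
    )


def running_text(page: Page, skip=("marginalia",)) -> str:
    """Text as linguistic entity: one running string, layout dropped."""
    return " ".join(
        line.text
        for region in page.regions
        if region.kind not in skip
        for line in region.lines
    )


def titles(page: Page) -> List[str]:
    """Text as work (very roughly): pull out the regions labelled as titles."""
    return [
        " ".join(line.text for line in region.lines)
        for region in page.regions
        if region.kind == "title"
    ]


# A made-up page with a title, a paragraph and a marginal note.
page = Page(
    image="page_001.jpg",
    regions=[
        TextRegion("title", [Line("What is a text?")]),
        TextRegion("paragraph", [Line("A simple question"), Line("with a not-so-simple answer.")]),
        TextRegion("marginalia", [Line("nota bene")]),
    ],
)
print(document_text(page))
print(running_text(page))
print(titles(page))
```

The same recognized lines yield three different ‘texts’ depending on which function is applied, which is the point the text-wheel makes: the understanding of text depends on the perspective taken on the material.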

The wheel demonstrates which aspects of a text can be identified and the direction we are aiming at with the READ project. We want to provide high-quality Automated Text Recognition, but we are also thinking about how to ensure the validity and plausibility of text.

Let’s start a discussion that goes beyond the quality of text recognition and aims at a theory of Automated Text Recognition.

——–

By Dr Tobias Hodel, University of Zurich and State Archives of Zurich.
