Description
A specialized Handwritten Text Recognition (HTR) model was developed using PyLaia in Transkribus to improve access to challenging plea rolls (CP40, KB27) from The National Archives, using images from the AALT website provided by Robert Palmer, Elspeth Rosbrook, and Susanne Brand. Initially focused on KB27/795, the model tackles dense, heavily abbreviated Court Hand script.
An iterative strategy combined HTR processing with refinement by an LLM (Anthropic's Claude 3.7 Sonnet) guided by paleographic rules and Vance Mead's index. Lines were tagged "unclear" when they showed a high Character Error Rate (CER) across multiple LLM transcriptions of the same line. Crucially, these "unclear" lines, often the result of manuscript damage or difficult script, were excluded from the ground truth used to retrain PyLaia. This created a "clean" training set restricted to high-confidence transcriptions, improving the model's accuracy on clearer text and achieving ~5% CER on the target roll.
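The "unclear" tagging step can be read as a disagreement check between several transcriptions of the same line. The sketch below is only an illustration of that idea, not the pipeline actually used: the function names, the 0.10 threshold, the choice to normalise by the longer string, and the sample lines are all assumptions made for the example.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def pairwise_cer(a: str, b: str) -> float:
    """CER-style disagreement between two transcriptions.

    Neither string is a trusted reference, so this sketch normalises by the
    longer of the two (an assumption, not the documented method).
    """
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def mean_disagreement(versions: list[str]) -> float:
    """Average pairwise CER over all transcriptions of one line."""
    pairs = list(combinations(versions, 2))
    if not pairs:
        return 0.0
    return sum(pairwise_cer(a, b) for a, b in pairs) / len(pairs)


def split_ground_truth(lines: dict[str, list[str]], threshold: float = 0.10):
    """Separate line ids into (clean, unclear) by a hypothetical CER threshold."""
    clean, unclear = [], []
    for line_id, versions in lines.items():
        target = unclear if mean_disagreement(versions) > threshold else clean
        target.append(line_id)
    return clean, unclear


# Made-up sample lines, for illustration only.
lines = {
    "line_01": ["Et p'd'cus Joh'es", "Et p'd'cus Joh'es", "Et p'd'cus Joh'es"],
    "line_02": ["ad dampn", "et dictus", "in m'ia"],
}
clean, unclear = split_ground_truth(lines)
print(clean, unclear)
```

Only the ids returned in `clean` would feed the retraining set under this reading; lines in `unclear` keep their tag and are held out.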
The transcription philosophy emphasizes manuscript fidelity: non-expansion of abbreviations, strict line integrity, and precise letterforms and capitalization. Because the model was trained on clean data from KB27/795, it is most accurate on that roll; it is expected to perform well on similar rolls, with graceful degradation. It provides visually faithful, non-expanded transcriptions, enhancing access to these vital historical records, especially their clearer sections.