Machine transcription pipeline for historical Irish death records
The analysis and interpretation of primary sources, including transcribed documents, form the bedrock of social and demographic historical research. The Death and Burial Data: Ireland 1864-1922 (DBDIrl) project, led by Dr. Ciara Breathnach, focuses on transcribing a large volume of civil registration death records to understand mortality patterns and historical events in Ireland during the late 19th and early 20th centuries. Accurate transcriptions are crucial because they allow historians to compare and cross-reference multiple sources, revealing connections and discrepancies that deepen our understanding of historical events. Through in-depth textual analysis, researchers can extract data for both quantitative and qualitative studies, shedding light on patterns, trends, and societal changes that shape our collective past. Computer scientists specializing in model-driven development, deep learning, and artificial intelligence have joined the DBDIrl team to contribute towards these objectives. Their expertise enables the use of advanced technologies and algorithms to streamline and speed up the transcription process.
The focus of this research is the development of an automated end-to-end machine transcription pipeline for the extensive civil registration data provided to the DBDIrl team by the General Register Office (GRO) in Ireland. To achieve this, the pipeline leverages the structured tabular templates found across all death record documents from this period. Accurately identifying and separating elements such as tables, images, and handwritten text allows researchers to understand how information is organized within the documents. Using semantic and instance segmentation techniques, we first segment tables into their constituent elements, ensuring accurate detection of the rows and cells within each table structure and achieving nearly comprehensive coverage. Following the segmentation stage, object detection networks extract the handwritten text from each individual column cell. To improve training and capture the essential characteristics of the handwritten text, data synthesis methods are used to generate synthetic data. This synthetic data increases the diversity and variability of the training dataset, enabling the network to learn robust representations of the handwritten text.
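As a rough illustration of how a segmented table structure can be turned into per-cell regions for later text extraction, the sketch below converts row and column boundary positions into cell bounding boxes. It is a minimal sketch under stated assumptions, not the thesis's implementation: the names `Cell`, `grid_to_cells`, and `column_cells` are hypothetical, the boundary positions are assumed to come from an upstream segmentation stage, and the table is assumed to be a regular, axis-aligned grid.

```python
from typing import List, NamedTuple, Tuple


class Cell(NamedTuple):
    """One table cell (hypothetical structure, for illustration only)."""
    row: int
    col: int
    box: Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def grid_to_cells(row_edges: List[int], col_edges: List[int]) -> List[Cell]:
    """Turn sorted row and column boundary positions into cell boxes.

    Assumes the boundaries were already detected by a segmentation stage
    and that the table forms a regular, axis-aligned grid.
    """
    cells = []
    # Each pair of consecutive row edges delimits one row.
    for r, (top, bottom) in enumerate(zip(row_edges, row_edges[1:])):
        # Each pair of consecutive column edges delimits one column.
        for c, (left, right) in enumerate(zip(col_edges, col_edges[1:])):
            cells.append(Cell(r, c, (left, top, right, bottom)))
    return cells


def column_cells(cells: List[Cell], col: int) -> List[Cell]:
    """Select every cell belonging to one column, e.g. a name column."""
    return [cell for cell in cells if cell.col == col]


# Example: a table with 2 rows and 3 columns.
cells = grid_to_cells(row_edges=[0, 100, 200], col_edges=[0, 50, 120, 300])
names = column_cells(cells, col=2)
```

Grouping cells by column reflects the column-by-column treatment described above, where each column of the record can be passed to its own extraction and classification step.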
Using word-level classification, this research achieves comprehensive classification of three of the eleven columns, along with partial classification of several others. These columns were selected specifically to establish linkage with the corresponding partially digitized GRO ledgers. This work forms the foundation for a large-scale image database encompassing tens of thousands of handwritten names, with the potential to expand this dataset into the millions. With the data segmented to its finest level of granularity and ready for processing, our ongoing efforts are centred on extending this work to the remaining eight columns. Through this extension, we aim to gain deeper insights and a more comprehensive understanding of the death records spanning 1864 to 1922. Building on this initial success, the scope of the project can be expanded to include birth and marriage records across the same period. This expansion will enable a broader examination of historical events and societal dynamics, offering a holistic view of demographic patterns and life events during this significant era.
Faculty
- Faculty of Science and Engineering
Degree
- Doctoral
First supervisor
Tiziana Margaria
Department or School
- Computer Science & Information Systems