University of Limerick

Towards automatic data cleansing and classification of valid historical data: an incremental approach based on MDD

conference contribution
posted on 2023-07-24, 11:25 authored by Enda O'Shea, Rafflesia Khan, Ciara Breathnach, Tiziana Margaria

The project Death and Burial Data: Ireland 1864-1922 (DBDIrl) examines the relationship between historical death registration data and burial data to explore the history of power in Ireland from 1864 to 1922. Its core Big Data arises from historical records from a variety of heterogeneous sources, some aspects of which are pre-digitized and machine readable. The huge data set (over 4 million records in each source) and its slow manual enrichment (ca. 7,000 records processed so far) pose issues of quality and scalability, and create the need for a quality assurance technology that is accessible to non-programmers. An important goal for the research community is to produce a reusable, high-level quality assurance tool for the ingested data that is domain specific (historic data) and highly portable across data sources, and thus independent of storage technology.

This paper outlines the step-wise design of the finer-grained digital format, intended for storage and digital archiving, and the design and testing of two generations of the techniques used in the first two data ingestion and cleaning phases.

The first, small-scale phase was exploratory, based on metadata enrichment transcription to Excel, and conducted in parallel with the design of the final digital format and the discovery of all the domain-specific rules and constraints for the syntactic and semantic validity of individual entries. Excel-embedded quality checks or database-specific techniques are not adequate because of the technology independence requirement. This first phase produced a Java parser with an embedded data cleaning and evaluation classifier, continuously improved and refined as insights grew.
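To make the idea of syntactic and semantic validity rules for individual entries concrete, here is a minimal Java sketch. It is not the project's parser: the record layout (`DeathEntry`), the field names, the ISO date format, and the 120-year age limit are all assumptions made for illustration. Only the 1864-1922 period is taken from the project description, and the sample values in `main` are invented.

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical record layout; field names are illustrative assumptions,
// not the DBDIrl digital format.
final class DeathEntry {
    final String name;
    final String dateOfDeath; // raw transcribed text, assumed ISO yyyy-MM-dd
    final String age;         // raw transcribed text, assumed whole years

    DeathEntry(String name, String dateOfDeath, String age) {
        this.name = name;
        this.dateOfDeath = dateOfDeath;
        this.age = age;
    }
}

final class EntryValidator {
    // Project period as stated in the abstract.
    private static final int FIRST_YEAR = 1864;
    private static final int LAST_YEAR = 1922;
    // Assumed plausibility limit, not a rule from the paper.
    private static final int MAX_AGE = 120;

    /** Returns every rule violation found; an empty list means the entry passed. */
    static List<String> validate(DeathEntry e) {
        List<String> problems = new ArrayList<>();

        // Completeness check: the name must be present.
        if (e.name == null || e.name.trim().isEmpty()) {
            problems.add("name is missing");
        }

        // Syntactic check (parses as a date), then semantic check (within the period).
        if (e.dateOfDeath == null) {
            problems.add("date of death is missing");
        } else {
            try {
                LocalDate d = LocalDate.parse(e.dateOfDeath.trim());
                if (d.getYear() < FIRST_YEAR || d.getYear() > LAST_YEAR) {
                    problems.add("date of death outside " + FIRST_YEAR + "-" + LAST_YEAR);
                }
            } catch (DateTimeParseException ex) {
                problems.add("date of death is not a valid date");
            }
        }

        // Syntactic check (whole number), then semantic check (plausible range).
        if (e.age == null) {
            problems.add("age is missing");
        } else {
            try {
                int years = Integer.parseInt(e.age.trim());
                if (years < 0 || years > MAX_AGE) {
                    problems.add("age out of range");
                }
            } catch (NumberFormatException ex) {
                problems.add("age is not a whole number");
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        // Invented sample values, for demonstration only.
        System.out.println(validate(new DeathEntry("Mary Byrne", "1901-03-14", "67")));
        System.out.println(validate(new DeathEntry("", "1930-01-01", "abc")));
    }
}
```

Collecting every violation, rather than stopping at the first, is a design choice that suits manual enrichment: a transcriber sees all the problems with an entry in one pass.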

The next, larger-scale phase uses a bespoke Historian Web Application that embeds the Java validator from the parser, as well as a new Boolean classifier for valid and complete data assurance, built using a Model-Driven Development technique that we also describe. This solution enforces property constraints directly at data capture time, removing the need for additional parsing and cleaning stages. The new classifier is built in an easy-to-use graphical technology, and the ADD-Lib tool it uses is a modern low-code development environment that auto-generates code in a large number of programming languages. It thus meets the technology independence requirement, and historians are now able to produce new classifiers themselves without needing to know how to program. We aim to infuse the project with computational and archival thinking in order to produce a robust data set that is FAIR compliant (Findable, Accessible, Interoperable, and Reusable).
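A Boolean classifier for valid and complete data can be pictured as a decision diagram: each inner node tests one property of a record and each leaf is a true or false verdict. The Java sketch below is hand-written to show that shape only. It does not use ADD-Lib's API and is not what ADD-Lib generates; the type names, the field names `name` and `yearOfDeath`, and the chosen predicates are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

// A node of a hand-written binary decision diagram over record predicates.
// All names here are hypothetical; this is not ADD-Lib's API or generated code.
interface DiagramNode {
    boolean classify(Map<String, String> record);
}

// Terminal node carrying the final Boolean verdict.
final class Leaf implements DiagramNode {
    private final boolean verdict;
    Leaf(boolean verdict) { this.verdict = verdict; }
    public boolean classify(Map<String, String> record) { return verdict; }
}

// Inner node: evaluates one predicate and follows the matching branch.
final class Decision implements DiagramNode {
    private final Predicate<Map<String, String>> test;
    private final DiagramNode ifTrue;
    private final DiagramNode ifFalse;

    Decision(Predicate<Map<String, String>> test, DiagramNode ifTrue, DiagramNode ifFalse) {
        this.test = test;
        this.ifTrue = ifTrue;
        this.ifFalse = ifFalse;
    }

    public boolean classify(Map<String, String> record) {
        return test.test(record) ? ifTrue.classify(record) : ifFalse.classify(record);
    }
}

final class ValidCompleteClassifier {
    private static final DiagramNode VALID = new Leaf(true);
    private static final DiagramNode INVALID = new Leaf(false);

    // Completeness predicate: the field exists and is not blank.
    static boolean present(Map<String, String> r, String field) {
        String v = r.get(field);
        return v != null && !v.trim().isEmpty();
    }

    // Builds the diagram: name present -> year present -> year within 1864-1922.
    static DiagramNode build() {
        DiagramNode yearInRange = new Decision(r -> {
            try {
                int y = Integer.parseInt(r.get("yearOfDeath").trim());
                return y >= 1864 && y <= 1922;
            } catch (NumberFormatException e) {
                return false; // not a number, so not valid
            }
        }, VALID, INVALID);
        DiagramNode hasYear = new Decision(r -> present(r, "yearOfDeath"), yearInRange, INVALID);
        return new Decision(r -> present(r, "name"), hasYear, INVALID);
    }

    // Convenience helper for building a record; either value may be null.
    static Map<String, String> record(String name, String yearOfDeath) {
        Map<String, String> r = new HashMap<>();
        r.put("name", name);
        r.put("yearOfDeath", yearOfDeath);
        return r;
    }

    public static void main(String[] args) {
        DiagramNode classifier = build();
        // Invented sample values, for demonstration only.
        System.out.println(classifier.classify(record("Mary Byrne", "1901")));
        System.out.println(classifier.classify(record("Mary Byrne", "1930")));
    }
}
```

Because the diagram is plain data (nodes and branches), the same structure could be drawn graphically and translated into other target languages, which is the kind of technology independence the text describes.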


Funding

Lero - the Irish Software Research Centre

Science Foundation Ireland

SFI Centre for Research Training in Artificial Intelligence

Science Foundation Ireland

History

Publication

2020 IEEE International Conference on Big Data (Big Data), Atlanta, GA, USA, 2020, pp. 1914-1923

Publisher

Institute of Electrical and Electronics Engineers

Other Funding information

We are grateful for the full cooperation of the Registrar General of Ireland for permission to use these data for research purposes. This research is funded by the Irish Research Council Laureate Award 2017/32 and by Science Foundation Ireland through the grants 13/RC/2094 to Lero - the Irish Software Research Centre (www.lero.ie) and 18/CRT/6223 to the Centre for Research Training in Artificial Intelligence.

Rights

© 2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Also affiliated with

  • LERO - The Irish Software Research Centre

Sustainable development goals

  • (4) Quality Education

Department or School

  • History
