University of Limerick
Automated anomaly detection for categorical data by repurposing a form filling recommender system

Data quality is crucial in modern software systems, like data-driven decision support systems. However, data quality is affected by data anomalies, which represent instances that deviate from most of the data. These anomalies affect the reliability and trustworthiness of software systems, and may propagate and cause more issues. Although many anomaly detection approaches have been proposed, they mainly focus on numerical data. Moreover, the few approaches targeting anomaly detection for categorical data do not yield consistent results across datasets.

In this article, we propose a novel anomaly detection approach for categorical data named LAFF-AD (LAFF-based Anomaly Detection), which takes advantage of the learning ability of a state-of-the-art form filling tool (LAFF) to perform value inference on suspicious data. LAFF-AD runs a variant of LAFF that predicts the possible values of a suspicious categorical field in the suspicious instance. LAFF-AD then compares the output of LAFF to the recorded values in the suspicious instance, and uses a heuristic-based strategy to detect categorical data anomalies.
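The predict-then-compare idea described above can be illustrated with a minimal sketch. It does not use LAFF itself: a plain co-occurrence frequency model stands in for the form filling tool, and the decision rule (flag the recorded value when its predicted probability falls below `min_prob`) is an assumed heuristic, since the abstract does not specify the one LAFF-AD uses. The function names, the field names, and the threshold are all hypothetical.

```python
from collections import Counter, defaultdict


def _context(row, target):
    """Values of every field except the target, as a hashable key."""
    return tuple(sorted((k, v) for k, v in row.items() if k != target))


def train(rows, target):
    """Count how often each target value co-occurs with each context.

    Stand-in for the learned model; NOT the LAFF algorithm.
    """
    counts = defaultdict(Counter)
    for row in rows:
        counts[_context(row, target)][row[target]] += 1
    return counts


def predict(counts, row, target):
    """Return candidate target values with their relative frequencies."""
    seen = counts.get(_context(row, target))
    if not seen:
        return {}
    total = sum(seen.values())
    return {value: n / total for value, n in seen.items()}


def is_anomalous(counts, row, target, min_prob=0.2):
    """Assumed heuristic: flag the recorded value if the model finds it unlikely."""
    candidates = predict(counts, row, target)
    if not candidates:
        return False  # no evidence for this context, so do not flag
    return candidates.get(row[target], 0.0) < min_prob


# Hypothetical historical data: nine consistent records and one odd one.
history = [{"country": "IE", "currency": "EUR"}] * 9
history.append({"country": "IE", "currency": "USD"})
model = train(history, "currency")

suspicious = {"country": "IE", "currency": "USD"}
usual = {"country": "IE", "currency": "EUR"}
flag_suspicious = is_anomalous(model, suspicious, "currency")
flag_usual = is_anomalous(model, usual, "currency")
```

With this toy data, the recorded value `USD` has a predicted frequency of 0.1 and is flagged, while `EUR` (0.9) is not. A context the model has never seen is left unflagged, which is one possible design choice among several.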

We evaluated LAFF-AD by assessing its effectiveness and efficiency on six datasets. Our experimental results show that LAFF-AD can accurately detect a wide range of data anomalies, with recall values between 0.6 and 1 and a precision value of at least 0.808. Furthermore, LAFF-AD is efficient, taking at most 7000 s for training and 735 ms for prediction.
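For readers unfamiliar with the two effectiveness metrics, the following sketch shows how precision and recall are conventionally computed for an anomaly detector. The record identifiers are invented for illustration and are not taken from the article's datasets; `precision_recall` is a hypothetical helper name.

```python
def precision_recall(flagged, true_anomalies):
    """Compute precision and recall from two sets of record identifiers.

    precision = correctly flagged / all flagged
    recall    = correctly flagged / all true anomalies
    """
    true_positives = len(flagged & true_anomalies)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(true_anomalies) if true_anomalies else 0.0
    return precision, recall


# Illustrative identifiers only: 5 records flagged, 6 truly anomalous, 4 in common.
flagged = {1, 2, 3, 4, 5}
true_anomalies = {1, 2, 3, 4, 6, 7}
precision, recall = precision_recall(flagged, true_anomalies)
```

Here four of the five flagged records are real anomalies (precision 0.8), and four of the six real anomalies were found (recall of about 0.67).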

Funding

Lero - the Irish Software Research Centre

Science Foundation Ireland

Publication

ACM Journal of Data and Information Quality, 2024, 16 (3), article 16, pp. 1-28

Publisher

Association for Computing Machinery

Other Funding information

Part of this work was financially supported by the Alphonse Weicker Foundation and by our industrial partner BGL BNP Paribas Luxembourg. We thank Anne Goujon, Clément Lefebvre Renard, and Andrey Boytsov for their feedback on LAFF-AD and earlier drafts of the article. We also thank Chunfeng Yuan for his help with the experiments. Lionel Briand was partly supported by the Natural Sciences and Engineering Research Council of Canada (NSERC), through the Canada Research Chairs and discovery programs, and the Science Foundation Ireland grant 13/RC/2094-2. This research was funded in whole, or in part, by the Luxembourg National Research Fund (FNR), grant reference C22/IS/17373407/LOGODOR. For the purpose of open access, and in fulfillment of the obligations arising from the grant agreement, the authors have applied a Creative Commons Attribution 4.0 International (CC BY 4.0) license to any Author Accepted Manuscript version arising from this submission.

Also affiliated with

  • LERO - The Science Foundation Ireland Research Centre for Software

Sustainable development goals

  • (3) Good Health and Well-being
  • (8) Decent Work and Economic Growth
  • (9) Industry, Innovation and Infrastructure
  • (16) Peace, Justice, and Strong Institutions
  • (17) Partnerships for the Goals
