University of Limerick

Recognising specific named entities in a new restricted domain using conditional random fields

thesis
posted on 2023-01-26, 10:00 authored by Igal Gabbay
Named-entity recognition (NER) plays a vital role in information extraction, question answering and text mining. Classic NER research has focused on tagging instances of PERSON, LOCATION and ORGANISATION in the newswire domain. The newer fine-grained NER (FG-NER) covers subtypes of the classic NEs. The goal of this study was to investigate an FG-NER scenario with a set of new specific NEs (SNEs) typical of a new restricted journalistic domain. Reports on births of animals in zoos were identified as such a productive domain. A 700-document corpus (241K tokens) named ZooBirth was compiled from a newspaper archive and annotated. It contained 2,811 instances of the ten most frequent numerical SNEs, shortlisted from 43 candidates. Using Conditional Random Fields (CRF) made it possible to test positional and order-within-document features, which were hypothesised to improve the tagging of SNEs. In support of positional features, analysis of the distribution of SNEs within documents yielded SNE-specific patterns. The token position feature produced a statistically significant but modest improvement for two SNEs (82.2 to 84.4 strict precision, and 59.5 to 61.1 F-measure). Order-effect features produced a statistically significant improvement in F-measure when tagging the weight at birth (from 68.4 to 71.1 strict, and from 75.5 to 80.6 lenient). In the final stage of the study, a novel technique named subtractive tagging was introduced to enrich negative examples when training the CRF. When tagging the newborn animal’s date of birth and the age of its mother, strict recall improved with statistical significance from 52.8 to 60.1 and from 65.5 to 68.9, respectively.
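
As an illustration of what a token position feature might look like, the following minimal Python sketch attaches a coarse relative-position value to each token's feature dictionary, in the per-token dictionary style that several CRF toolkits accept. The abstract does not say how the thesis encoded token position, so the bucketing scheme, the function names `position_bucket` and `token_features`, and the other features shown are assumptions made purely for illustration.

```python
from typing import Dict, List


def position_bucket(index: int, length: int, n_buckets: int = 10) -> int:
    """Map a token index to one of n_buckets equal-width document segments.

    Assumption: relative position is discretised into buckets; the thesis
    may have encoded token position differently.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    if not 0 <= index < length:
        raise ValueError("index must lie within the document")
    return min(index * n_buckets // length, n_buckets - 1)


def token_features(tokens: List[str], n_buckets: int = 10) -> List[Dict[str, object]]:
    """Build one feature dictionary per token (hypothetical feature set)."""
    n = len(tokens)
    features = []
    for i, tok in enumerate(tokens):
        features.append({
            "lower": tok.lower(),          # lexical identity
            "is_digit": tok.isdigit(),     # useful for numerical entities
            "pos_bucket": position_bucket(i, n, n_buckets),  # positional feature
        })
    return features


if __name__ == "__main__":
    doc = ["A", "calf", "weighing", "60", "kg", "was", "born", "on", "Monday", "."]
    for feats in token_features(doc, n_buckets=5):
        print(feats)
```

A feature of this kind lets the model learn that some entity types tend to occur early or late in a report, which is the sort of SNE-specific distribution pattern the abstract describes.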

Faculty

  • Faculty of Science and Engineering

Degree

  • Doctoral

First supervisor

Sutcliffe, Richard

Note

peer-reviewed

Language

English

Department or School

  • Computer Science & Information Systems
