Automatic subject classification of textual documents using limited or no training data

Joorabchi, Arash

thesis

posted on 2022-09-02, 13:24 authored by Arash Joorabchi

With the explosive growth in the number of electronic documents available on the internet, intranets, and digital libraries, there is a greater need than ever for automatic systems capable of indexing and organising such large volumes of data. Automatic Text Classification (ATC) has become one of the principal means for enhancing the performance of information retrieval systems and organising digital libraries and other textual collections. Within this context, the use of Machine Learning (ML) algorithms has been the dominant approach to ATC since the 1990s. However, one of the major obstacles to the deployment of ML-based ATC systems in practical real-world applications is the scarcity or absence of labelled datasets of sufficient quality and quantity for training the ML algorithms. The aim of this work is to address this problem by investigating two lines of research: (a) the development of new bootstrapping methods which automate the process of creating the labelled corpora required for training ML-based ATC systems; and (b) the development of a new breed of ATC algorithms which are unsupervised, and therefore do not require any training data. In order to achieve this aim, the project has mainly focused on utilising two knowledge sources whose potential application in ATC has yet to be fully explored, namely the conventional library organisation resources (e.g., library classification schemes, thesauri, and online public access catalogues) and the linkage among documents in the form of citation and reference networks. In relation to bootstrapping methods for ML-based ATC systems, our investigation has resulted in the development of two new methods. The developed methods greatly reduce the human involvement in the process of building training datasets by utilising the documents and textual contents that are abundantly available on the internet as training samples.
The other major contribution of this work is the development and evaluation of a new unsupervised ATC method which is capable of classifying a wide range of documents with high accuracy according to a library classification scheme without requiring any training data. This method, named Bibliography Based ATC (BB-ATC), is based on the hypothesis that citations and references in a document can be used as primary sources of information to determine the subject of the document with high accuracy. The proposed BB-ATC method automatically mines the citation and reference networks among the documents and uses the classification metadata of documents that have been manually classified to predict the subject/class of unlabelled documents. Finally, our further investigation into the application of citation networks in topical indexing of documents has resulted in the development of a new unsupervised keyword/keyphrase extraction method for scientific documents which is based on the same underlying hypothesis as BB-ATC. The developed keyphrase extraction method does not require any training data and yields an accuracy similar to that obtained by human indexers and state-of-the-art ML-based keyphrase extraction methods, whose accuracy is highly dependent on the quality and quantity of the manually labelled training data.
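
The idea behind BB-ATC, predicting a document's class from the classification metadata of the works it cites, can be illustrated with a minimal sketch. The abstract does not describe the scoring scheme used in the thesis, so the simple unweighted voting below is an assumption, and the catalogue contents, reference identifiers, and the `predict_classes` function are hypothetical names introduced only for illustration.

```python
from collections import Counter

# Hypothetical catalogue: cited work -> class numbers assigned by human cataloguers.
# The identifiers and Dewey-style numbers are illustrative only.
CATALOGUE = {
    "ref-a": ["006.31"],
    "ref-b": ["006.31", "025.04"],
    "ref-c": ["025.04"],
    "ref-d": ["006.31"],
}


def predict_classes(references, catalogue, top_n=1):
    """Rank candidate classes by how many catalogued references carry them.

    Unweighted voting is an assumption of this sketch, not the scoring
    scheme of the thesis.
    """
    votes = Counter()
    for ref in references:
        # References missing from the catalogue contribute no votes.
        for class_number in catalogue.get(ref, []):
            votes[class_number] += 1
    return [class_number for class_number, _ in votes.most_common(top_n)]


if __name__ == "__main__":
    bibliography = ["ref-a", "ref-b", "ref-d", "ref-unknown"]
    print(predict_classes(bibliography, CATALOGUE))
```

In this toy example no training data is involved: the unlabelled document inherits the class that occurs most often among its catalogued references, and references absent from the catalogue are simply ignored.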

History

Degree

Doctoral

First supervisor

Mahdi, Abdulhussain E.

Note

peer-reviewed

Language

English
