Posted on 2022-09-02, 13:24. Authored by Arash Joorabchi.
With the explosive growth in the number of electronic documents available on the
internet, intranets, and digital libraries, there is a greater need than ever for automatic systems
capable of indexing and organising such large volumes of data.
Automatic Text Classification (ATC) has become one of the principal means for
enhancing the performance of information retrieval systems and organising digital
libraries and other textual collections. Within this context, the use of Machine
Learning (ML) algorithms has been the dominant approach to ATC since the 1990s.
However, one of the major obstacles to deploying ML-based ATC systems in practical real-world applications is the lack of labelled datasets of sufficient quality and/or
quantity for training the ML algorithms. The aim of this work is to
address this problem by investigating two lines of research: (a) the development of
new bootstrapping methods which automate the process of creating labelled corpora
required for training ML-based ATC systems; and (b) the development of a new breed of ATC algorithms which are unsupervised and therefore do not require any training data. To achieve this aim, the project has mainly focused on utilising two knowledge sources whose potential application in ATC has yet to be fully explored:
conventional library organisation resources (e.g., library classification schemes, thesauri, and online public access catalogues) and the linkage among documents in the form of citation and reference networks. In relation to bootstrapping methods for ML-based ATC systems, our investigation has resulted in the development of two new methods. The developed methods greatly reduce the human
involvement in the process of building training datasets by utilising the documents
and textual contents that are abundantly available on the Internet as training samples.
The other major contribution of this work is the development and evaluation of a new
unsupervised ATC method which is capable of classifying a wide range of documents
with high accuracy according to a library classification scheme without requiring any
training data. This method, named Bibliography Based ATC (BB-ATC),
is based on the hypothesis that citations and references in a document can be
used as primary sources of information to determine the subject of the document with
high accuracy. The proposed BB-ATC method automatically mines the citation and
reference networks among the documents and uses the classification metadata of
documents which are manually classified to predict the subject/class of unlabelled
documents. Finally, our further investigation into the application of citation networks in topical indexing of documents has resulted in the development of a new unsupervised keyword/keyphrase extraction method for scientific documents, which is based on the same underlying hypothesis as BB-ATC. The developed keyphrase extraction method does not require any training data and achieves an accuracy similar to that of human indexers and state-of-the-art ML-based keyphrase extraction methods, whose accuracy is highly dependent on the quality and quantity of the
manually labelled training data.
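
The hypothesis behind BB-ATC, that the classes of a document's references indicate the document's own subject, can be illustrated with a minimal sketch. The code below is not the method developed in this work: it assumes a simple majority vote over the references that already carry a manually assigned class, and the function name `predict_class`, the reference identifiers, and the class labels are all hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional


def predict_class(references: List[str], catalogue: Dict[str, str]) -> Optional[str]:
    """Predict a class for a document from the classes of its references.

    `references` holds identifiers of the works the document cites.
    `catalogue` maps identifiers of manually classified works to a class label.
    Assumption: a plain majority vote; the actual BB-ATC scoring may differ.
    """
    # Count one vote per reference that has a known, manually assigned class.
    votes = Counter(catalogue[ref] for ref in references if ref in catalogue)
    if not votes:
        # None of the references is catalogued, so no prediction is possible.
        return None
    # Return the class label with the most votes.
    return votes.most_common(1)[0][0]


# Hypothetical catalogue of manually classified works (illustrative labels only).
catalogue = {
    "ref-a": "006.3",
    "ref-b": "006.3",
    "ref-c": "025.04",
}

# A document citing three catalogued works and one uncatalogued work.
predicted = predict_class(["ref-a", "ref-b", "ref-c", "ref-z"], catalogue)
print(predicted)
```

Uncatalogued references are simply ignored in this sketch, which reflects the unsupervised setting described above: no training data is used, only the classification metadata already attached to the cited works.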