University of Limerick

Automatic text classification using bag of words and bag of concepts based representations

thesis
posted on 2022-08-30, 14:06 authored by Alaa Alahmadi
Automatic Text Classification (ATC) is one of the most important tasks in data mining for organizing information and knowledge discovery. The goal of ATC is to alleviate the need to manually organize large collections of text documents by assigning one or more predefined categories to a given textual document using appropriate natural language processing techniques. Overall, the classification process involves three components: text pre-processing, text representation, and the classifier, which is built using a Machine Learning (ML) algorithm. In general, all existing text representations are based on the Bag-of-Words (BOW) and Bag-of-Concepts (BOC) models and their variations. The BOW representation model ignores the semantic connections between words: it breaks terms into their constituent words, and it treats synonymous words as independent words with no semantic association. The BOC model addresses these limitations by using concepts as features to represent text in ATC systems. The aim of this work is to investigate and assess the effect of commonly available text representation models on the performance of ATC systems, in terms of classification accuracy and implementation efficiency. To achieve that, both the BOW and BOC representation models are used with the ATC system, and Wikipedia is utilized as a knowledge base (KB) to provide concepts. In addition, different strategies that use both words and concepts to build combined models are reviewed and compared to the BOW and BOC representation models. Moreover, these representation models are evaluated in ATC systems for two languages, English and Arabic. For the Arabic ATC system, different variations of the BOW representation model, which result from the different methods used in the text pre-processing component, are compared. Furthermore, WordNet is used as a KB to provide concepts to represent Arabic text in the ATC system.
This is then followed by attempts to enrich text representation by combining the features of both the BOW and BOC models, in order to further enhance the performance of ATC. Our investigation has resulted in the development of two new strategies, namely Adding Unmapped Concepts (AUC) and Using Concepts for Terms which do not appear in the Document (CTD). Both strategies improve ATC system performance in comparison with the BOW and BOC representation models. They also bring text classification to a qualitatively new level of performance when compared to other strategies. In addition, the CTD strategy reduces the time and memory required compared to other strategies used to enrich text representation in ATC systems. The results of our experiments show that text representation is a key element affecting the performance of both English and Arabic ATC systems, and the developed strategies show improvement in both languages. Furthermore, using Wikipedia concepts to build the BOC model for Arabic ATC proves more efficient for representing text than the BOW model, which is not in line with what has been reported for English ATC. The reason is the complex nature of the Arabic language, which has rich morphology and a high degree of inflection and derivation. In addition, Arabic suffers from poor morphological tools, which makes Wikipedia concepts better features for representing text.
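
The contrast between the BOW and BOC representations described above can be illustrated with a minimal sketch. This is not the thesis implementation: the `CONCEPT_MAP` dictionary, the function names, and the greedy longest-match lookup are all assumptions made for illustration, and a real system would obtain concepts from a knowledge base such as Wikipedia or WordNet.

```python
from collections import Counter

# Hypothetical toy term-to-concept mapping (an assumption for illustration);
# a real system would derive it from a knowledge base such as Wikipedia.
CONCEPT_MAP = {
    "car": "Automobile",
    "automobile": "Automobile",
    "machine learning": "Machine_learning",
}


def bag_of_words(text):
    """Count individual lowercase tokens; multi-word terms are split apart."""
    return Counter(text.lower().split())


def bag_of_concepts(text, concept_map=CONCEPT_MAP, max_len=2):
    """Count concepts using a greedy longest-match lookup over word n-grams."""
    tokens = text.lower().split()
    concepts = Counter()
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(max_len, 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in concept_map:
                concepts[concept_map[phrase]] += 1
                i += n
                break
        else:
            i += 1  # no concept covers this token, so it is skipped
    return concepts


doc = "car automobile machine learning"
bow = bag_of_words(doc)      # four unrelated word features
boc = bag_of_concepts(doc)   # synonyms merged, multi-word term kept whole
```

In this toy document, BOW treats "car" and "automobile" as independent features and splits "machine learning" into two words, while BOC maps both synonyms to one concept and keeps the multi-word term as a single feature. Tokens with no matching concept are dropped by the BOC function here, which is the kind of information loss that combined word-and-concept strategies aim to avoid.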

History

Degree

  • Doctoral

First supervisor

Mahdi, Abdulhussain E.

Note

peer-reviewed

Language

English
