University of Limerick
Improving non-negative positive-unlabeled learning for news headline classification

journal contribution
posted on 2023-05-04, 11:31 authored by Zhanlin Ji, Chengyuan Du, Jiawen Jiang, Li Zhao, Haiyang Zhang, Ivan Ganchev

With the development of Internet technology, network platforms have gradually become a tool for people to obtain hot news. Filtering the current hot news out of a large news collection and pushing it to users therefore has important application value. In supervised learning scenarios, each piece of news needs to be labeled manually, which takes a lot of time and effort. Working from the perspective of semi-supervised learning and building on non-negative Positive-Unlabeled (nnPU) learning, this paper proposes a novel algorithm for news headline classification, called ‘Enhanced nnPU with Focal Loss’ (FLPU), which uses the Focal Loss in place of the way classical nnPU calculates the empirical risk of positive and negative samples. Then, by introducing the Virtual Adversarial Training (VAT) of the Adversarial training for large neural LangUage Models (ALUM) into FLPU, another (and better) algorithm, called ‘FLPU+ALUM’, is proposed for the same purpose, with the aim of labeling only a small number of positive samples. Experiments conducted on two datasets demonstrate that both algorithms outperform the state-of-the-art PU algorithms considered. Moreover, another set of experiments shows that, when enriched by the proposed algorithms, the RoBERTa-wwm-ext model achieves better classification performance than the state-of-the-art binary classification models included in the comparison. In addition, a ‘Ratio Batch’ method is elaborated and proposed as more stable for scenarios involving only a small number of labeled positive samples, which is also demonstrated experimentally.
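To make the core idea of the abstract concrete, the following is a minimal sketch of a non-negative PU risk estimate in which the per-sample loss is a focal loss. It is an illustration under stated assumptions, not the paper's exact formulation: the function names (`focal_loss`, `nnpu_focal_risk`), the class prior `prior`, the focusing parameter `gamma`, and the choice to omit any class-weighting term are all assumptions introduced here. The sketch uses the standard non-negative estimator, which clamps the estimated negative-class risk at zero.

```python
import math


def focal_loss(p, target, gamma=2.0, eps=1e-12):
    """Focal loss for one sample.

    p      -- predicted probability of the positive class, in [0, 1]
    target -- 1 if the sample is treated as positive, 0 if treated as negative
    gamma  -- focusing parameter (assumed value; gamma=0 gives cross-entropy)
    """
    p_t = p if target == 1 else 1.0 - p
    p_t = min(max(p_t, eps), 1.0)  # avoid log(0)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def _mean(values):
    return sum(values) / len(values)


def nnpu_focal_risk(p_pos, p_unl, prior, gamma=2.0):
    """Non-negative PU risk estimate with focal loss as the per-sample loss.

    p_pos -- predicted positive-class probabilities for labeled positives
    p_unl -- predicted positive-class probabilities for unlabeled samples
    prior -- assumed class prior, i.e. the fraction of positives in the data
    """
    # Risk of labeled positives when treated as positive.
    risk_pos_as_pos = _mean([focal_loss(p, 1, gamma) for p in p_pos])
    # Risk of labeled positives when treated as negative.
    risk_pos_as_neg = _mean([focal_loss(p, 0, gamma) for p in p_pos])
    # Risk of unlabeled samples when treated as negative.
    risk_unl_as_neg = _mean([focal_loss(p, 0, gamma) for p in p_unl])

    # Estimated negative-class risk; it can dip below zero on finite
    # samples, so the non-negative estimator clamps it at zero.
    negative_risk = risk_unl_as_neg - prior * risk_pos_as_neg
    return prior * risk_pos_as_pos + max(0.0, negative_risk)


if __name__ == "__main__":
    positives = [0.9, 0.8, 0.95]            # hypothetical model outputs
    unlabeled = [0.1, 0.7, 0.2, 0.05, 0.6]  # hypothetical model outputs
    print(nnpu_focal_risk(positives, unlabeled, prior=0.3))
```

The clamp is what keeps the estimate from going negative when the model fits the few labeled positives very confidently, which is the situation the abstract highlights (only a small number of labeled positive samples).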

Publication

IEEE Access, vol. 11, pp. 40192-40203

Publisher

IEEE Computer Society

Other Funding information

This work was supported in part by the National Key Research and Development Program of China under Grant 2017YFE0135700; in part by the Tsinghua Precision Medicine Foundation under Grant 2022TS003; in part by the Science and Education for Smart Growth Operational Program (2014-2020) and co-financed by the European Union through the European Structural and Investment Funds under Grant BG05M2OP001-1.001-0003; and in part by the Telecommunications Research Centre (TRC), University of Limerick, Ireland

Department or School

  • Electronic & Computer Engineering
