Posted on 2014-01-06, 16:57. Authored by John Noll, Dominik Seichter, Sarah Beecham.
Content analysis is a useful approach for analyzing
unstructured software project data, but it is labor-intensive and
slow. Can automated text classification (using supervised machine
learning) be used to reduce the labor or improve the speed of
content analysis?
We conducted a case study involving data from a previous
study that employed content analysis of an open source software
project. We used a human-coded data set of 3256 samples to
create training sets ranging in size from 100 to
3000 samples, and trained an “ensemble” text classifier to assign
one of five categories to a test set of samples.
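The paper does not specify the internals of its “ensemble” classifier, so the following is only a minimal sketch of the general idea: several simple text classifiers (here, pure-Python multinomial Naive Bayes members) are trained on bootstrap resamples of the labeled data and combined by majority vote. The category names and toy documents are hypothetical, not from the study's data set.

```python
import math
import random
from collections import Counter

class NaiveBayes:
    """Minimal multinomial Naive Bayes over bag-of-words features."""
    def fit(self, docs, labels):
        self.classes = set(labels)
        self.prior = Counter(labels)                      # class frequencies
        self.word_counts = {c: Counter() for c in self.classes}
        self.totals = Counter()                           # words per class
        self.vocab = set()
        for doc, lab in zip(docs, labels):
            for w in doc.split():
                self.word_counts[lab][w] += 1
                self.totals[lab] += 1
                self.vocab.add(w)
        return self

    def predict(self, doc):
        best, best_lp = None, float("-inf")
        n = sum(self.prior.values())
        v = len(self.vocab)
        for c in self.classes:
            lp = math.log(self.prior[c] / n)
            for w in doc.split():
                # Laplace smoothing handles words unseen in this class
                lp += math.log((self.word_counts[c][w] + 1) / (self.totals[c] + v))
            if lp > best_lp:
                best, best_lp = c, lp
        return best

class Ensemble:
    """Bagged ensemble: members trained on bootstrap resamples, majority vote."""
    def __init__(self, n_members=5, seed=0):
        self.n_members = n_members
        self.rng = random.Random(seed)

    def fit(self, docs, labels):
        idx = list(range(len(docs)))
        self.members = []
        for _ in range(self.n_members):
            sample = [self.rng.choice(idx) for _ in idx]  # bootstrap resample
            self.members.append(NaiveBayes().fit(
                [docs[i] for i in sample], [labels[i] for i in sample]))
        return self

    def predict(self, doc):
        votes = Counter(m.predict(doc) for m in self.members)
        return votes.most_common(1)[0][0]

# Hypothetical toy data; the real study used 3256 human-coded samples.
docs = ["fix crash bug", "bug in parser crash", "release notes update",
        "update docs release", "crash fix patch", "docs release announce"]
labels = ["defect", "defect", "release", "release", "defect", "release"]
clf = Ensemble(n_members=5, seed=1).fit(docs, labels)
print(clf.predict("crash bug report"))  # most members should vote "defect"
```

A real replication would also need the study's varying training-set sizes (100 to 3000 samples), which here would just mean slicing the labeled data before calling `fit`.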
The results show that the automated classifier could be trained
to recognize categories, but much less accurately than the human
classifiers. In particular, both precision and recall for low-frequency
categories were very low (less than 20%). Nevertheless,
we hypothesize that automated classifiers could be used to filter a
sample to identify common categories before human researchers
examine the remainder for more difficult categories.
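The reported result (precision and recall below 20% for rare categories) uses the standard per-category definitions, which can be computed directly from paired true/predicted label lists. This is a generic sketch, not the study's evaluation code; the labels below are hypothetical and chosen to show how a rare category yields unstable scores.

```python
def per_class_precision_recall(true_labels, pred_labels):
    """Return {category: (precision, recall)} from paired label lists."""
    classes = set(true_labels) | set(pred_labels)
    out = {}
    for c in classes:
        tp = sum(1 for t, p in zip(true_labels, pred_labels) if t == c and p == c)
        fp = sum(1 for t, p in zip(true_labels, pred_labels) if t != c and p == c)
        fn = sum(1 for t, p in zip(true_labels, pred_labels) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        out[c] = (precision, recall)
    return out

# Hypothetical labels with a rare category: one of its two true instances
# is missed, so recall drops to 0.5 even though precision stays at 1.0.
true = ["defect"] * 8 + ["governance"] * 2
pred = ["defect"] * 8 + ["defect", "governance"]
print(per_class_precision_recall(true, pred)["governance"])  # (1.0, 0.5)
```

With only a handful of true instances, a single misclassification moves recall by a large step, which is one reason low-frequency categories are hard for automated classifiers to score well on.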
History
Publication
The International Symposium on Empirical Software Engineering and Measurement (ESEM), pp. 300-303