Posted on 2014-01-06, 16:57. Authored by John Noll, Dominik Seichter, Sarah Beecham.
Content analysis is a useful approach for analyzing
unstructured software project data, but it is labor-intensive and
slow. Can automated text classification (using supervised machine
learning) be used to reduce the labor or improve the speed of
content analysis?
We conducted a case study involving data from a previous
study that employed content analysis of an open source software
project. We used a human-coded data set of 3256 samples to
create training sets ranging in size from 100 to
3000 samples, and trained an “ensemble” text classifier to assign
one of five categories to a test set of samples.
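The paper does not specify the internals of its “ensemble” classifier, so the following is only a minimal sketch of the general idea: several simple text classifiers (here, pure-Python multinomial Naive Bayes members) are trained on bootstrap resamples of the labeled data and combined by majority vote. The category names and toy documents are hypothetical, not from the study's data set.

```python
import math
import random
from collections import Counter

class NaiveBayes:
    """Minimal multinomial Naive Bayes over bag-of-words features."""
    def fit(self, docs, labels):
        self.classes = set(labels)
        self.prior = Counter(labels)                      # class frequencies
        self.word_counts = {c: Counter() for c in self.classes}
        self.totals = Counter()                           # words per class
        self.vocab = set()
        for doc, lab in zip(docs, labels):
            for w in doc.split():
                self.word_counts[lab][w] += 1
                self.totals[lab] += 1
                self.vocab.add(w)
        return self

    def predict(self, doc):
        best, best_lp = None, float("-inf")
        n = sum(self.prior.values())
        v = len(self.vocab)
        for c in self.classes:
            lp = math.log(self.prior[c] / n)
            for w in doc.split():
                # Laplace smoothing handles words unseen in this class
                lp += math.log((self.word_counts[c][w] + 1) / (self.totals[c] + v))
            if lp > best_lp:
                best, best_lp = c, lp
        return best

class Ensemble:
    """Bagged ensemble: members trained on bootstrap resamples, majority vote."""
    def __init__(self, n_members=5, seed=0):
        self.n_members = n_members
        self.rng = random.Random(seed)

    def fit(self, docs, labels):
        idx = list(range(len(docs)))
        self.members = []
        for _ in range(self.n_members):
            sample = [self.rng.choice(idx) for _ in idx]  # bootstrap resample
            self.members.append(NaiveBayes().fit(
                [docs[i] for i in sample], [labels[i] for i in sample]))
        return self

    def predict(self, doc):
        votes = Counter(m.predict(doc) for m in self.members)
        return votes.most_common(1)[0][0]

# Hypothetical toy data; the real study used 3256 human-coded samples.
docs = ["fix crash bug", "bug in parser crash", "release notes update",
        "update docs release", "crash fix patch", "docs release announce"]
labels = ["defect", "defect", "release", "release", "defect", "release"]
clf = Ensemble(n_members=5, seed=1).fit(docs, labels)
print(clf.predict("crash bug report"))  # most members should vote "defect"
```

A real replication would also need the study's varying training-set sizes (100 to 3000 samples), which here would just mean slicing the labeled data before calling `fit`.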
The results show that the automated classifier could be trained
to recognize categories, but much less accurately than the human
classifiers. In particular, both precision and recall for low-frequency
categories were very low (less than 20%). Nevertheless,
we hypothesize that automated classifiers could be used to filter a
sample to identify common categories before human researchers
examine the remainder for more difficult categories.
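The reported result (precision and recall below 20% for rare categories) uses the standard per-category definitions, which can be computed directly from paired true/predicted label lists. This is a generic sketch, not the study's evaluation code; the labels below are hypothetical and chosen to show how a rare category yields unstable scores.

```python
def per_class_precision_recall(true_labels, pred_labels):
    """Return {category: (precision, recall)} from paired label lists."""
    classes = set(true_labels) | set(pred_labels)
    out = {}
    for c in classes:
        tp = sum(1 for t, p in zip(true_labels, pred_labels) if t == c and p == c)
        fp = sum(1 for t, p in zip(true_labels, pred_labels) if t != c and p == c)
        fn = sum(1 for t, p in zip(true_labels, pred_labels) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        out[c] = (precision, recall)
    return out

# Hypothetical labels with a rare category: one of its two true instances
# is missed, so recall drops to 0.5 even though precision stays at 1.0.
true = ["defect"] * 8 + ["governance"] * 2
pred = ["defect"] * 8 + ["defect", "governance"]
print(per_class_precision_recall(true, pred)["governance"])  # (1.0, 0.5)
```

With only a handful of true instances, a single misclassification moves recall by a large step, which is one reason low-frequency categories are hard for automated classifiers to score well on.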
History
Publication
The International Symposium on Empirical Software Engineering and Measurement (ESEM), pp. 300-303