Posted on 2023-01-25, 11:23. Authored by Muslim Chochlov.
Feature location is the task of finding the source code that implements specific
functionality in software systems. It is a complex activity
and, when performed manually, may require significant developer
effort. Consequently, automated and semi-automated feature location techniques
have been proposed to assist developers. One common group
of such approaches utilizes textual information in source code and
applies information retrieval techniques. Since there is a paucity of
meaningful terms in source code, a reasonable research direction is to
combine various data sources to expand the set of meaningful
terms associated with source code entities for information retrieval. One such data
source is the set of change-set descriptions. Little work has been
done on meaningful term expansion using change-set descriptions,
and the extent to which such expansions are useful has not
been thoroughly studied in the literature.
This work proposes a technique which leverages change-set data sets
as a source of meaningful terms that can act as source code descriptors
(ACIR). It is the first work to study change-sets in such a role in isolation
and characterize their effectiveness as a data-set for information
retrieval based feature location. Specifically, it characterizes the performance
of ACIR in terms of granularity, recentness of change-sets,
aggregation of recent change-sets by change request, and filtering of
"management" change-sets using textual classification, by means of a
custom-built tool that implements ACIR. The evaluation is larger than
those of other works in this area, employing 8 different subject systems
with a total of 600 re-enactment samples.
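The core idea described above, using change-set descriptions as descriptors of the source code entities they touched, can be illustrated with a minimal sketch. This is not the ACIR implementation: the abstract does not specify the retrieval model, so TF-IDF weighting with cosine similarity is assumed here, and the function names, entity names, and change-set descriptions are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split into alphabetic terms (no stemming, no stop-word removal)."""
    return re.findall(r"[a-z]+", text.lower())


def build_descriptors(change_sets):
    """Attach the terms of each change-set description to every entity it touched."""
    descriptors = defaultdict(Counter)
    for description, touched_entities in change_sets:
        terms = tokenize(description)
        for entity in touched_entities:
            descriptors[entity].update(terms)
    return descriptors


def rank_entities(query, descriptors):
    """Rank entities by cosine similarity between TF-IDF vectors (assumed IR model)."""
    n = len(descriptors)
    doc_freq = Counter()
    for counts in descriptors.values():
        doc_freq.update(counts.keys())

    def tfidf(counts):
        # Smoothed IDF so that query terms unseen in any descriptor do not divide by zero.
        return {t: c * (math.log((1 + n) / (1 + doc_freq[t])) + 1.0)
                for t, c in counts.items()}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm = (math.sqrt(sum(w * w for w in a.values()))
                * math.sqrt(sum(w * w for w in b.values())))
        return dot / norm if norm else 0.0

    query_vec = tfidf(Counter(tokenize(query)))
    scored = [(entity, cosine(query_vec, tfidf(counts)))
              for entity, counts in descriptors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical change-sets: (description, entities modified by the change-set).
change_sets = [
    ("Fix crash when saving file to disk", ["Editor.save", "FileWriter.write"]),
    ("Add undo support for text edits", ["Editor.undo"]),
    ("Save file encoding preference", ["FileWriter.write"]),
]

ranking = rank_entities("save file", build_descriptors(change_sets))
for entity, score in ranking:
    print(f"{entity}: {score:.3f}")
```

In this sketch the entities are methods, which corresponds to the method level of granularity; keying the descriptors by file path instead would give the file-level variant. Restricting `change_sets` to the most recent entries would be one simple way to mimic the recentness configuration.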
It was found, for ACIR, that the effort required to locate entities is,
in general, lower at method level than at file level of granularity. Additionally,
using more recent change-sets improves the effectiveness
of ACIR. However, aggregation of recent change-sets by change request
decreases effectiveness. Surprisingly, it was also found that
text classification based filtering of "management" change-sets, based
on generic management terms, decreases the effectiveness of ACIR.
Further, the findings indicate that certain characteristics of subject
systems seem to affect the performance of ACIR: a strongly pronounced
dichotomy of subject systems emerged, where one set recorded
better feature location using ACIR and another recorded better feature location
using a more traditional baseline approach. Finally, it was found that
merging ACIR and a baseline approach significantly improves performance:
by 95% over the baseline approach and by 17% over ACIR
alone.
Apart from the more concrete findings on the effectiveness of the newly
proposed technique itself, the most fundamental finding is the importance
of rigorously characterizing proposed feature location techniques,
to identify their optimal configurations. The results also suggest
it is important to characterize the software systems under study
when selecting the appropriate feature location technique. In the
past, configuration of the techniques and characterization of subject
systems have not been considered first-class entities in research papers,
whereas the results presented here suggest these factors can
have a substantial impact.