DiffLDA: Topic Evolution in Software Projects (2010)
BibTeX
@MISC{Thomas10difflda:topic,
author = {Stephen W. Thomas and Bram Adams and Ahmed E. Hassan and Dorothea Blostein},
title = {DiffLDA: Topic Evolution in Software Projects},
year = {2010}
}
Abstract
Previous research has shown that topics can be automatically discovered in a software project’s source code. Topics are collections of words that co-occur frequently in a text collection and are discovered using topic models such as latent Dirichlet allocation (LDA). Tracking how topics evolve, i.e., grow and spread, over time is useful for supporting software maintenance, comprehension, and re-engineering activities. The evolution of topics is typically recovered by applying LDA to all versions of a project’s source code at once, followed by post-processing to map topics across versions. Although this technique works well in applications where each version of the data is completely different, for example in the analysis of conference proceedings, the technique does not work well with source code, which typically changes only incrementally and contains significant duplication across versions. In this paper, we present a new approach, called DiffLDA, for automatically mining topic evolution in source code. The approach addresses LDA’s sensitivity to document duplication by operating on the differences between versions of a source code document, resulting in a more accurate, finer-grained representation of topic evolution. We validate our approach through case studies on simulated data and two open-source projects.
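The idea of operating on differences between versions can be illustrated with a small sketch. This is not the authors' implementation: the function name `delta_documents`, the line-level granularity, and the use of Python's standard `difflib` module are all assumptions made for illustration. The sketch only shows how each version of a file might be reduced to its added and removed lines, so that unchanged text is not repeated in the input to a topic model.

```python
import difflib


def delta_documents(versions):
    """Reduce successive versions of one file to (added, removed) line lists.

    Hypothetical helper for illustration only. The first version is compared
    against an empty file, so all of its lines count as added.
    """
    deltas = []
    previous = []
    for text in versions:
        current = text.splitlines()
        added, removed = [], []
        # ndiff prefixes each output line with "+ ", "- ", "  " or "? ".
        for line in difflib.ndiff(previous, current):
            if line.startswith("+ "):
                added.append(line[2:])
            elif line.startswith("- "):
                removed.append(line[2:])
        deltas.append((added, removed))
        previous = current
    return deltas


if __name__ == "__main__":
    v1 = "int total = 0;\nprint(total);"
    v2 = "int total = 0;\nlog(total);"
    for number, (added, removed) in enumerate(delta_documents([v1, v2]), 1):
        print(number, added, removed)
```

In this sketch the unchanged line `int total = 0;` appears only in the delta for the first version, which is the kind of duplication across versions that the abstract identifies as a problem for LDA.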







