Results 1 -
8 of
8
Classifying Software Changes: Clean or Buggy?
, 2008
"... This paper introduces a new technique for predicting latent software bugs, called change classification. Change classification uses a machine learning classifier to determine whether a new software change is more similar to prior buggy changes or clean changes. In this manner, change classification ..."
Abstract
-
Cited by 14 (5 self)
- Add to MetaCart
This paper introduces a new technique for predicting latent software bugs, called change classification. Change classification uses a machine learning classifier to determine whether a new software change is more similar to prior buggy changes or clean changes. In this manner, change classification predicts the existence of bugs in software changes. The classifier is trained using features (in the machine learning sense) extracted from the revision history of a software project stored in its software configuration management repository. The trained classifier can classify changes as buggy or clean, with a 78 percent accuracy and a 60 percent buggy change recall on average. Change classification has several desirable qualities: 1) The prediction granularity is small (a change to a single file), 2) predictions do not require semantic information about the source code, 3) the technique works for a broad array of project types and programming languages, and 4) predictions can be made immediately upon the completion of a change. Contributions of this paper include a description of the change classification approach, techniques for extracting features from the source code and change histories, a characterization of the performance of change classification across 12 open source projects, and an evaluation of the predictive power of different groups of features.
Text is Software Too
, 2004
"... Software compiles and therefore is characterized by a parseable grammar. Natural language text rarely conforms to prescriptive grammars and therefore is much harder to parse. Mining parseable structures is easier than mining less structured entities. Therefore, most work on mining repositories focus ..."
Abstract
-
Cited by 11 (6 self)
- Add to MetaCart
Software compiles and therefore is characterized by a parseable grammar. Natural language text rarely conforms to prescriptive grammars and therefore is much harder to parse. Mining parseable structures is easier than mining less structured entities. Therefore, most work on mining repositories focuses on software, not natural language text. Here, we report experiments with mining natural language text (requirements documents) suggesting that: (a) mining natural language is not too difficult, so (b) software repositories should routinely be augmented with all the natural language text used to develop that software.
Assigning Bug Reports using a Vocabulary-Based Expertise Model of Developers ∗
, 2009
"... For popular software systems, the number of daily submitted bug reports is high. Triaging these incoming reports is a time consuming task. Part of the bug triage is the assignment of a report to a developer with the appropriate expertise. In this paper, we present an approach to automatically sugges ..."
Abstract
-
Cited by 6 (1 self)
- Add to MetaCart
For popular software systems, the number of daily submitted bug reports is high. Triaging these incoming reports is a time consuming task. Part of the bug triage is the assignment of a report to a developer with the appropriate expertise. In this paper, we present an approach to automatically suggest developers who have the appropriate expertise for handling a bug report. We model developer expertise using the vocabulary found in their source code contributions and compare this vocabulary to the vocabulary of bug reports. We evaluate our approach by comparing the suggested experts to the persons who eventually worked on the bug. Using eight years of Eclipse development as a case study, we achieve 33.6 % top-1 precision and 71.0 % top-10 recall.
Mining source code to automatically split identifiers for software analysis
- Mining Software Repositories, International Workshop on
, 2009
"... Automated software engineering tools (e.g., program search, concern location, code reuse, quality assessment, etc.) increasingly rely on natural language information from comments and identifiers in code. The first step in analyzing words from identifiers requires splitting identifiers into their co ..."
Abstract
-
Cited by 3 (0 self)
- Add to MetaCart
Automated software engineering tools (e.g., program search, concern location, code reuse, quality assessment, etc.) increasingly rely on natural language information from comments and identifiers in code. The first step in analyzing words from identifiers requires splitting identifiers into their constituent words. Unlike natural languages, where space and punctuation are used to delineate words, identifiers cannot contain spaces. One common way to split identifiers is to follow programming language naming conventions. For example, Java programmers often use camel case, where words are delineated by uppercase letters or non-alphabetic characters. However, programmers also create identifiers by concatenating sequences of words together with no discernible delineation, which poses challenges to automatic identifier splitting. In this paper, we present an algorithm to automatically split identifiers into sequences of words by mining word frequencies in source code. With these word frequencies, our identifier splitter uses a scoring technique to automatically select the most appropriate partitioning for an identifier. In an evaluation of over 8000 identifiers from open source Java programs, our Samurai approach outperforms the existing state of the art techniques. 1.
Recovering Traceability Links in Software Artifact Management Systems using Information Retrieval Methods
, 2007
"... The main drawback of existing software artifact management systems is the lack of automatic or semi-automatic traceability link generation and maintenance. We have improved an artifact management system with a traceability recovery tool based on Latent Semantic Indexing (LSI), an information retriev ..."
Abstract
-
Cited by 1 (1 self)
- Add to MetaCart
The main drawback of existing software artifact management systems is the lack of automatic or semi-automatic traceability link generation and maintenance. We have improved an artifact management system with a traceability recovery tool based on Latent Semantic Indexing (LSI), an information retrieval technique. We have assessed LSI to identify strengths and limitations of using information retrieval techniques for traceability recovery and devised the need for an incremental approach. The method and the tool have been evaluated during the development of seventeen software projects involving about 150 students. We observed that although tools based on information retrieval provide a useful support for the identification of traceability links during software development, they are still far to support a complete semi-automatic recovery of all links. The results of our experience have also shown that such tools can help to identify quality problems
Merging Duplicate Bug Reports by Sentence Clustering
"... Duplicate bug reports are often unfavorable because they tend to take many man hours for being identified as duplicates, marked so and eventually discarded. In this time, no progress occurs on the program in question, and is justifiably an overhead which should be minimized. Considerable research ha ..."
Abstract
- Add to MetaCart
Duplicate bug reports are often unfavorable because they tend to take many man hours for being identified as duplicates, marked so and eventually discarded. In this time, no progress occurs on the program in question, and is justifiably an overhead which should be minimized. Considerable research has been carried out to alleviate this problem. Many methods have been proposed for bug report categorization and duplicate bug report detection. However, it is often the case that a duplicate bug report can provide some additional information about a problem which could help in faster resolution of the bug. We propose that duplicate bug reports be merged when possible instead of being discarded, so that maximum information is captured. We propose a clustering-based algorithm to group together similar sentences and create a union of bug reports considered duplicates of each other. 1
TSE-0061-0306 1 Classifying Software Changes: Clean or Buggy?
"... Abstract — This paper introduces a new technique for finding latent software bugs called change classification. Change classification uses a machine learning classifier to determine whether a new software change is more similar to prior buggy changes, or clean changes. In this manner, change classif ..."
Abstract
- Add to MetaCart
Abstract — This paper introduces a new technique for finding latent software bugs called change classification. Change classification uses a machine learning classifier to determine whether a new software change is more similar to prior buggy changes, or clean changes. In this manner, change classification predicts the existence of bugs in software changes. The classifier is trained using features (in the machine learning sense) extracted from the revision history of a software project, as stored in its software configuration management repository. The trained classifier can classify changes as buggy or clean with 78 % accuracy and 65 % buggy change recall (on average). Change classification has several desirable qualities: (1) the prediction granularity is small (a change to a single file), (2) predictions do not require semantic information about the source code, (3) the technique works for a broad array of project types and programming languages, and (4) predictions can be made immediately upon completion of a change. Contributions of the paper include a description of the change classification approach, techniques for extracting features from source code and change histories, a characterization of the performance of change classification across 12 open source projects, and evaluation of the predictive power of different groups of features.

