Mining Business Topics in Source Code using Latent Dirichlet Allocation
Cited by 7 (0 self)
One of the difficulties in maintaining a large software system is the absence of documented business domain topics and of a correlation between these domain topics and the source code. Without such a correlation, people without any prior application knowledge would find it hard to comprehend the functionality of the system. Latent Dirichlet Allocation (LDA), a statistical model, has emerged as a popular technique for discovering topics in large corpora of text documents, but its applicability to extracting business domain topics from source code has not been explored so far. This paper investigates LDA in the context of comprehending large software systems and proposes a human-assisted approach based on LDA for extracting domain topics from source code. This method has been applied to a number of open source and proprietary systems. Preliminary results indicate that LDA is able to identify some of the domain topics and is a satisfactory starting point for further manual refinement of topics.
Assigning Bug Reports using a Vocabulary-Based Expertise Model of Developers
2009
Cited by 6 (1 self)
For popular software systems, the number of daily submitted bug reports is high. Triaging these incoming reports is a time-consuming task. Part of the bug triage is the assignment of a report to a developer with the appropriate expertise. In this paper, we present an approach to automatically suggest developers who have the appropriate expertise for handling a bug report. We model developer expertise using the vocabulary found in their source code contributions and compare this vocabulary to the vocabulary of bug reports. We evaluate our approach by comparing the suggested experts to the persons who eventually worked on the bug. Using eight years of Eclipse development as a case study, we achieve 33.6% top-1 precision and 71.0% top-10 recall.
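The vocabulary comparison described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the use of cosine similarity over raw term counts, the camelCase splitting, and the names `tokenize`, `cosine`, and `suggest_developers` are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split camelCase identifiers, lowercase, and keep alphabetic terms only."""
    spaced = re.sub(r"([a-z])([A-Z])", r"\1 \2", text)
    return [t for t in re.split(r"[^a-z]+", spaced.lower()) if t]


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def suggest_developers(bug_report, contributions, top_n=3):
    """Rank developers by similarity of their code vocabulary to the report.

    `contributions` maps a developer name to the text of that developer's
    source code contributions (a hypothetical input format).
    """
    report_vec = Counter(tokenize(bug_report))
    scores = {
        dev: cosine(report_vec, Counter(tokenize(code)))
        for dev, code in contributions.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


contributions = {
    "alice": "parseXmlDocument xmlReader parser",
    "bob": "renderButton widgetLayout paint",
}
print(suggest_developers("Crash in XML parser when reading document", contributions, top_n=1))
```

In this toy example only one developer shares terms with the report. A real expertise model would likely also need term weighting and some handling of how vocabulary ages over time, neither of which is attempted here.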
Interactive exploration of semantic clusters
In Proceedings of VISSOFT 2005 (3rd IEEE International Workshop on Visualizing Software for Understanding and Analysis), 2005
Cited by 4 (3 self)
Visualization and exploration tools can be of great use for understanding a software system when only its source code is available. However, understanding a large software system by visualizing only its lower level artifacts (e.g., classes, methods) and the relations between them does not scale for industrial-size systems. To address the scalability issue, higher level hierarchical abstractions (e.g., package structure, clustered decompositions of the system) should be used together with relations between them that are usually aggregated from the lower level relations. In this paper, we present the concepts behind Softwarenaut, a tool aimed at exploring any kind of hierarchical decomposition of a system, and then we look at a specific exploration of a system. In the experiment, the hierarchical decomposition of the system is the result of applying semantic clustering to group classes that use similar terms.
Automatic Labeling of Software Components and their Evolution using Log-Likelihood Ratio of Word Frequencies in Source Code
Cited by 3 (3 self)
As more and more open-source software components become available on the internet, we need automatic ways to label and compare them. For example, a developer who searches for reusable software must be able to quickly gain an understanding of retrieved components. This understanding cannot be gained at the level of source code due to the semantic gap between source code and the domain model. In this paper we present a lexical approach that uses the log-likelihood ratios of word frequencies to automatically provide labels for software components. We present a prototype implementation of our labeling/comparison algorithm and provide examples of its application. In particular, we apply the approach to detect trends in the evolution of a software system.
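As an illustration of labeling by log-likelihood ratio, the sketch below scores each word of a component against a reference corpus using the standard two-corpus G² statistic and keeps the words that are over-represented in the component. Whether the paper uses exactly this formulation is an assumption, and the names `log_likelihood` and `label_component` are hypothetical.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """G^2 for a word seen a times in a corpus of c words and b times in a corpus of d words."""
    e1 = c * (a + b) / (c + d)  # expected count in the first corpus
    e2 = d * (a + b) / (c + d)  # expected count in the second corpus
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2


def label_component(component_words, reference_words, top_n=3):
    """Return the words most characteristic of the component relative to the reference."""
    comp, ref = Counter(component_words), Counter(reference_words)
    c, d = sum(comp.values()), sum(ref.values())
    if c == 0 or d == 0:
        return []
    scored = []
    for word, a in comp.items():
        b = ref[word]
        # Keep only words that are relatively more frequent in the component.
        if a / c > b / d:
            scored.append((log_likelihood(a, b, c, d), word))
    scored.sort(reverse=True)
    return [word for _, word in scored[:top_n]]


component = ["parser"] * 5 + ["get"] * 5
reference = ["get"] * 50 + ["set"] * 40 + ["parser"]
print(label_component(component, reference, top_n=1))
```

Common words such as `get` score low because they are about as frequent in the reference as in the component. Applying the same comparison between two versions of a system, rather than between a component and a reference corpus, is one way to surface terms whose usage changed over time.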
Working Session: Information Retrieval Based Approaches in Software Evolution
Cited by 1 (0 self)
During software evolution a collection of related artifacts with different representations is created. Some of these are composed of structured data (e.g., analysis data), some contain semi-structured information (e.g., source code), and many include unstructured information (e.g., text). Research efforts exist that are trying to extract, represent, and analyze the unstructured information in software. Information retrieval (IR) techniques have been used quite successfully in recent years to represent and extract textual information from software artifacts, with application to many maintenance tasks. This working session will focus on the state of the art in the application of IR-based techniques to support software maintenance activities. The session aims to identify the main research and practical issues in the field, to determine future work directions, and to foster collaborations among the participants.
Recovering management information from source code
IT has become a production means for many organizations and an important element of business strategy. Even though its effective management is a must, reality shows that this area still remains in its infancy. IT management relies profoundly on relevant information which enables risk mitigation or cost control. However, the needed information is either missing or its gathering boils down to daunting tasks. We propose an approach to the recovery of management information from the essence of IT: the software's source code. In this paper we show how to employ source code analysis techniques and recover management information. In our approach we exploit the potential of the concealed data which resides in the source code statements, source comments, and also compiler listings. We show how to depart from the raw sources, extract data, organize it, and eventually utilize it so that the bit level data provides IT executives with support at the portfolio level. Our approach is pragmatic as we rely on real management questions, best practices in software engineering, and also IT market specifics. We enable, for instance, an assessment of the IT-portfolio market value, support for carrying out what-if scenarios, or identification and evaluation of the hidden risks for IT-portfolio maintainability. Our approach was deployed in an industrial setting. The study is based on a real-life IT-portfolio which supports business functions of an organization operating in the financial sector. The IT-portfolio comprises Cobol applications run on a mainframe with the total number of lines of code amounting to over 18 million. The approach we propose is suited for facilitation within a large organization. It provides fact-based support for strategic decision making at the portfolio level.
Keywords: IT-portfolio management; management information; source code analysis; lexical analysis; Latent Semantic Indexing; source code comments; compilers; obsolete language constructs; volatility; vendor locks; legacy systems; operational risk; technology risk; risk mitigation
Approaches for Categorization of Reusable Software Components
2007
A reuse repository manager organizes reusable software components into categories and therefore needs to find the category of each component. In this paper, we use different pure and hybrid approaches to find the relevancy of a component to a particular domain. The Probabilistic Latent Semantic Analysis (PLSA) approach, LSA, the Singular Value Decomposition (SVD) technique, the LSA Semi-Discrete Matrix Decomposition (SDD) technique, and the Naive Bayes approach, in pure as well as hybrid form, are evaluated to determine the Domain Relevancy (DR) of software components. The evaluation exploits the fact that Feature Vector (FV) codes can be seen as documents containing terms (the identifiers present in the components), so text modeling methods that capture co-occurrence information in low-dimensional spaces can be used. The FV code representation of clusters or domains is used to find the domain relevancy of the software components. PLSA has provided better results than LSA retrieval techniques in terms of Precision and Recall, but its time complexity is too high. The SVD transformation with the Naive Bayes scheme has outperformed all other approaches and shows better results than the existing approach (LSA) being used by some open source code repositories, e.g. Sourceforge. The DR-value determined is close to the result of the manual analysis formerly performed by programmers or repository managers. Hence, the tool can also be utilized for the automatic categorization of software components, and this kind of automation may improve the productivity and quality of software development.

