Recovering traceability links between code and documentation
IEEE Trans. Softw. Eng., 2002
Cited by 140 (15 self)
Abstract—Software system documentation is almost always expressed informally in natural language and free text. Examples include requirement specifications, design documents, manual pages, system development journals, error logs, and related maintenance reports. We propose a method based on information retrieval to recover traceability links between source code and free text documents. A premise of our work is that programmers use meaningful names for program items, such as functions, variables, types, classes, and methods. We believe that the application-domain knowledge that programmers process when writing the code is often captured by the mnemonics for identifiers; therefore, the analysis of these mnemonics can help to associate high-level concepts with program concepts and vice-versa. We apply both a probabilistic and a vector space information retrieval model in two case studies to trace C++ source code onto manual pages and Java code to functional requirements. We compare the results of applying the two models, discuss the benefits and limitations, and describe directions for improvements.
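The vector space idea in this abstract can be illustrated with a minimal sketch. It assumes raw term-frequency vectors and cosine similarity, with no stemming, stop-word removal, or tf-idf weighting; the paper's actual preprocessing and weighting are not given here. The names `split_identifier`, `cosine`, and `rank_documents` are hypothetical.

```python
import math
import re
from collections import Counter


def split_identifier(name):
    """Split a camelCase or snake_case identifier into lowercase words."""
    parts = re.findall(r"[A-Za-z][a-z]*", name.replace("_", " "))
    return [p.lower() for p in parts]


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_documents(identifiers, documents):
    """Rank free-text documents against the words found in code identifiers.

    identifiers: list of identifier strings taken from one code unit.
    documents:   dict mapping a document name to its text.
    Returns (name, score) pairs, best match first.
    """
    query = Counter(w for ident in identifiers for w in split_identifier(ident))
    scores = {
        name: cosine(query, Counter(re.findall(r"[a-z]+", text.lower())))
        for name, text in documents.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

A candidate traceability link would then be any document whose score exceeds a chosen threshold; the threshold is a tuning decision that this sketch leaves open.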
Language Modeling With Sentence-Level Mixtures
1994
Cited by 23 (1 self)
Language models play an important role in improving the accuracy of a continuous speech recognizer. In this thesis, we introduce a new statistical language model which captures long term topic dependencies of words within and across sentences. The model includes two main contributions. First, we develop a topic-dependent sentence-level mixture language model which takes advantage of the topic constraints in a sentence or a paragraph. Since this language model is not Markov and has a large search space, it is used only in the last stage of a multi-pass search strategy in the recognizer. Second, we introduce topic-dependent dynamic adaptation techniques in the framework of the mixture model. During the course of this thesis, we also investigate robust parameter estimation techniques, which are extremely important in light of the sparse data problems in language modeling. The model is implemented in the BU speech recognition system and provides a significant improvement in recognition accuracy. An important advantage of the framework of our model is that it is a simple extension of existing language modeling techniques that can easily be integrated with other language modeling advances.
Rational Interpolation Of Maximum Likelihood Predictors In Stochastic Language Modeling
1997
Cited by 14 (11 self)
In our paper, we address the problem of estimating stochastic language models based on n-gram statistics. We present a novel approach, rational interpolation, for the combination of a competing set of conditional n-gram word probability predictors, which consistently outperforms the traditional linear interpolation scheme. The superiority of rational interpolation is substantiated by experimental results from language modeling, speech recognition, dialog act classification, and language identification.
1. INTRODUCTION
In our paper, we address the problem of estimating stochastic language models $P(w)$ for sentences $w = w_1 \ldots w_T$ of words $w_t$ from a finite vocabulary $V$. The joint distribution $P(w)$ can be decomposed by the well-known chain rule
$$P(w) = \prod_{t=1}^{T} P(w_t \mid w_1^{t-1}) = \prod_{t=1}^{T} P(w_t \mid w_1 \ldots w_{t-1}) \qquad (1)$$
into a product of conditional word probabilities (by $w_s^t$ we denote the substring $w_s \ldots w_t$ of $w$). The latter, in turn, are usually approximate...
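As a point of reference for the chain-rule decomposition and the linear interpolation baseline named in this abstract, here is a minimal sketch. It does not implement rational interpolation, whose definition is not given in this excerpt. It assumes maximum likelihood unigram and bigram estimates, a fixed weight `lam`, and a start symbol `<s>`; all function names are hypothetical.

```python
import math
from collections import Counter


def train(sentences):
    """Collect unigram, bigram, and context counts from whitespace-tokenized text."""
    unigrams, bigrams, contexts = Counter(), Counter(), Counter()
    for sentence in sentences:
        words = sentence.split()
        padded = ["<s>"] + words
        unigrams.update(words)
        bigrams.update(zip(padded, words))  # (previous word, current word) pairs
        contexts.update(padded[:-1])        # how often each word occurs as a context
    return unigrams, bigrams, contexts


def interpolated_prob(word, prev, model, lam=0.7):
    """Linear interpolation of the bigram and unigram maximum likelihood predictors."""
    unigrams, bigrams, contexts = model
    p_uni = unigrams[word] / sum(unigrams.values())
    p_bi = bigrams[(prev, word)] / contexts[prev] if contexts[prev] else 0.0
    return lam * p_bi + (1.0 - lam) * p_uni


def sentence_log_prob(sentence, model, lam=0.7):
    """Chain rule: log P(w) is the sum over t of log P(w_t | w_{t-1})."""
    padded = ["<s>"] + sentence.split()
    total = 0.0
    for prev, word in zip(padded, padded[1:]):
        p = interpolated_prob(word, prev, model, lam)
        if p == 0.0:
            return float("-inf")  # a word never seen in training gets zero probability
        total += math.log(p)
    return total
```

In practice the weight `lam` would be estimated on held-out data rather than fixed, and unseen words would need additional smoothing.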
Automatic Acquisition of Language Models for Speech Recognition
1994
Cited by 14 (3 self)
This thesis focuses on the automatic acquisition of language structure and the subsequent use of the learned language structure to improve the performance of a speech recognition system. First, we develop a grammar inference process which is able to learn a grammar describing a large set of training sentences. The process of acquiring this grammar is one of generalization so that the resulting grammar predicts likely sentences beyond those contained in the training set. From the grammar we construct a novel probabilistic language model called the phrase class n-gram model (pcng), which is a natural generalization of the word class n-gram model [11] to phrase classes. This model utilizes the grammar in such a way that it maintains full coverage of any test set while at the same time reducing the complexity, or number of parameters, of the resulting predictive model. Positive results are shown in terms of perplexity of the acquired phrase class n-gram models and in terms of reduction of ...
Tracing object-oriented code into functional requirements
In Proceedings of the 8th International Workshop on Program Comprehension, 2000
Cited by 12 (3 self)
Software system documentation is almost always expressed informally, in natural language and free text. Examples include requirement specifications, design documents, manual pages, system development journals, error logs and related maintenance reports. We propose an approach to establish and maintain traceability links between source code and free text documents. A premise of our work is that programmers use meaningful names for program items, such as functions, variables, types, classes, and methods. We believe that the application-domain knowledge that programmers process when writing the code is often captured by the mnemonics for identifiers; therefore, the analysis of these mnemonics can help to associate high level concepts with program concepts, and vice-versa. In this paper, the approach is applied to software written in an object-oriented language, namely Java, to trace classes to functional requirements.
Ergodic Hidden Markov Models And Polygrams For Language Modeling
In Proc. Int. Conf. on Acoustics, Speech and Signal Processing, 1994
Cited by 11 (8 self)
In this paper we present two new techniques for language modeling in speech recognition. The first technique is based on ergodic discrete density Hidden Markov Models (HMM) which can be applied to bigrams based on word categories. This statistical approach of the so-called Markov bigrams enables an efficient unsupervised learning procedure for the bigram probabilities with the well-known Baum-Welch algorithm. Furthermore, maximizing the model-conditional probability is equivalent to minimizing the perplexity of the training corpus. The second technique is based on polygrams which are an extension of the bigram (n = 2) or trigram (n = 3) grammars to any possible value of n. According to the smoothing techniques for bigram or trigram models, the probabilities of the n-grams in the polygram model are interpolated using the relative frequencies of all n'-grams with n' ≤ n. Both techniques were evaluated on the ATIS corpus by computing the test set perplexity. Furthermore we integr...
An approach to classify software maintenance requests
In Proc., International Conference on Software Maintenance (ICSM), 2002
Cited by 11 (1 self)
When a software system critical for an organization exhibits a problem during its operation, it is important to fix it in a short period of time, to avoid serious economic losses. The problem is therefore reported to the organization in charge of maintenance, and it should be correctly and quickly dispatched to the right maintenance team. We propose to automatically classify incoming maintenance requests (also called tickets), routing them to specialized maintenance teams. The final goal is to develop a router, working around the clock, that, without human intervention, dispatches incoming tickets with the lowest misclassification error, measured with respect to a given routing policy. 6000 maintenance tickets from a large, multi-site software system, spanning about two years of system in-field operation, were used to compare and assess the accuracy of different classification approaches (i.e., Vector Space model, Bayesian model, support vectors, classification trees and k-nearest neighbor classification). The application and the tickets were divided into eight areas and pre-classified by human experts. Preliminary results were encouraging: up to 84% of the incoming tickets were correctly classified.
Improving And Predicting Performance Of Statistical Language Models In Sparse Domains
1998
Cited by 7 (1 self)
Standard statistical language models, or n-gram models, which represent the probability of word sequences, suffer from sparse-data problems in tasks where large amounts of domain-specific text are not available. This thesis focuses on improving the estimation of domain-dependent n-gram models by using out-of-domain text data. Previous approaches for estimating language models from multi-domain data have not accounted for the characteristic variations of style and content across domains. In contrast, this thesis introduces two approaches that compensate for multi-domain differences, both representing "style" by part-of-speech (POS) sequences and "content" by the particular choice of words. First, data from multiple domains is combined using similarity weighting schemes that discriminate for content and style relevance prior to pooling multi-domain text. Second, n-gram distributions from multiple domains are combined via a POS-dependent n-gram framework that separately compensates for word and POS usage differences. Two variations are explored: explicitly transforming the out-of-domain distribution before combining with an in-domain model, and separately estimating components of the POS-dependent n-gram model using multi-domain data. Finally, measures to analyze and predict recognition performance of language models are also investigated, resulting in an algorithm for predicting performance differences associated with localized changes in language models given a recognition system.
Parsing N Best Trees from a Word Lattice
In Advances in Artificial Intelligence. Proceedings of KI-97, number 1303 in LNAI, 1997
Cited by 5 (3 self)
This article describes a probabilistic context-free grammar approximation method for unification grammars. In order to produce good results, the method is combined with an N-best parsing extension to chart parsing. The first part of the paper introduces the grammar approximation method, while the second part describes details of an efficient N-best packing and unpacking scheme for chart parsing.
1 Introduction
Recently much attention has been paid to the integration of speech and language technology. The concentration on spontaneous speech understanding led to the definition of a robust interface known as the word graph or word lattice between recognition and understanding. Depending on the application, systems are built to provide a shallow stochastic analysis or a deep linguistic analysis of the word lattice. Using a shallow stochastic approach, a rough template-based analysis can be achieved which makes sense in those cases where a fine-grained reconstruction of meanings is...

