Finding Consensus in Speech Recognition: Word Error Minimization and Other Applications of Confusion Networks
2000
Cited by 115 (14 self)
We describe a new framework for distilling information from word lattices to improve the accuracy of speech recognition and obtain a more perspicuous representation of a set of alternative hypotheses. In the standard MAP decoding approach the recognizer outputs the string of words corresponding to the path with the highest posterior probability given the acoustics and a language model. However, even given optimal models, the MAP decoder does not necessarily minimize the commonly used performance metric, word error rate (WER). We describe a method for explicitly minimizing WER by extracting word hypotheses with the highest posterior probabilities from word lattices. We change the standard problem formulation by replacing global search over a large set of sentence hypotheses with local search over a small set of word candidates. In addition to improving the accuracy of the recognizer, our method produces a new representation of the set of candidate hypotheses that specifies ...
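The local search over word candidates described in this abstract can be illustrated with a minimal sketch. It assumes the lattice has already been reduced to a sequence of slots, each mapping candidate words to posterior probabilities; the slot layout, the `EPS` marker for an empty word, and the function name are illustrative assumptions, not the authors' data structures.

```python
EPS = "<eps>"  # hypothetical marker meaning "no word in this slot"


def consensus_hypothesis(confusion_network):
    """Pick the highest-posterior candidate in each slot.

    Slots whose best candidate is the empty word contribute nothing
    to the output word sequence.
    """
    words = []
    for slot in confusion_network:
        best = max(slot, key=slot.get)  # candidate with the largest posterior
        if best != EPS:
            words.append(best)
    return words


# Toy network with made-up posteriors (not taken from the paper).
network = [
    {"i": 0.9, "a": 0.1},
    {"veal": 0.4, "feel": 0.35, "fear": 0.25},
    {"fine": 0.6, EPS: 0.4},
]
hypothesis = consensus_hypothesis(network)
```

Each slot is decided independently, which is what replaces the global search over whole sentence hypotheses.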
Finding consensus among words: lattice-based word error minimisation
Computer Speech and Language, 2000
Cited by 89 (10 self)
We describe a new algorithm for finding the hypothesis in a recognition lattice that is expected to minimize the word error rate (WER). Our approach thus overcomes the mismatch between the word-based performance metric and the standard MAP scoring paradigm that is sentence-based, and that can lead to sub-optimal recognition results. To this end we first find a complete alignment of all words in the recognition lattice, identifying mutually supporting and competing word hypotheses. Finally, a new sentence hypothesis is formed by concatenating the words with maximal posterior probabilities. Experimentally, this approach leads to a significant WER reduction in a large vocabulary recognition task.
Sublinear Time Algorithms for Metric Space Problems
Cited by 68 (2 self)
In this paper we give approximation algorithms for the following problems on metric spaces: Furthest Pair, k-median, Minimum Routing Cost Spanning Tree, Multiple Sequence Alignment, Maximum Traveling Salesman Problem, Maximum Spanning Tree and Average Distance. The key property of our algorithms is that their running time is linear in the number of metric space points. As the full specification of an n-point metric space is of size $\Theta(n^2)$, the complexity of our algorithms is sublinear with respect to the input size. All previous algorithms (exact or approximate) for the problems we consider have running time $\Omega(n^2)$. We believe that our techniques can be applied to get similar bounds for other problems. 1 Introduction In recent years there has been a dramatic growth of interest in algorithms operating on massive data sets. This poses new challenges for algorithm design, as algorithms quite efficient on small inputs (for example, having quadratic running time) ...
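To make the "linear in the number of points" idea concrete, here is a textbook sketch for Furthest Pair, offered as an illustration and not necessarily the algorithm of this paper. It fixes an arbitrary point p and returns the point q farthest from it. By the triangle inequality, any pair (a, b) satisfies d(a, b) <= d(a, p) + d(p, b) <= 2 d(p, q), so the returned pair is within a factor 2 of the diameter while using only n - 1 distance queries. The function name and the use of Euclidean points are assumptions for the example.

```python
import math


def approx_furthest_pair(points, dist):
    """Return a pair (p, q) with dist(p, q) >= diameter / 2.

    Uses len(points) - 1 distance evaluations instead of all
    n * (n - 1) / 2 pairwise distances. Requires at least two points.
    """
    p = points[0]                                   # arbitrary anchor point
    q = max(points[1:], key=lambda x: dist(p, x))   # farthest point from the anchor
    return p, q


# Points on a line: the true furthest pair is (4, 0) and (-3, 0), at distance 7.
pts = [(0, 0), (1, 0), (4, 0), (-3, 0)]
p, q = approx_furthest_pair(pts, math.dist)
found = math.dist(p, q)
```

In this toy instance the sketch finds a pair at distance 4, which is at least half of the true diameter 7, as the bound guarantees.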
Finding Similar Regions In Many Strings
Journal of Computer and System Sciences, 1999
Cited by 45 (6 self)
Algorithms for finding similar, or highly conserved, regions in a group of sequences are at the core of many molecular biology problems. We solve three main open questions in this area. Assume that we are given n DNA sequences $s_1, \ldots, s_n$. The Consensus Patterns problem, which has been widely studied in bioinformatics research [26, 16, 12, 25, 4, 6, 15, 22, 24, 27], in its simplest form, asks for a region of length L in each $s_i$, and a median string s of length L so that the total Hamming distance from s to these regions is minimized. We show the problem is NP-hard and give a polynomial time approximation scheme (PTAS) for it. We also give a PTAS for the problem under the original measure of [26, 16, 12, 25]. As an interesting application of our analysis, we further obtain a PTAS for a restricted (but still NP-hard) version of the important star alignment problem allowing at most a constant number of gaps, each of arbitrary length, in each sequence. The Closest String problem [2, 3, 7, 9, 18] asks for the smallest d and a string s which is within Hamming distance d to each $s_i$. The problem is NP-hard [7, 18]. [3] gives a polynomial time algorithm for constant d. For super-logarithmic d, [2, 9] give efficient approximation algorithms using linear program relaxation techniques. The best polynomial time approximation has ratio 4/3 for all d, given by [18] ([9] also independently claimed the 4/3 ratio but only for super-logarithmic d). We settle the problem with a PTAS. We then give the first nontrivial better-than-2 approximation with ratio $2 - \frac{2}{2|\Sigma|+1}$ for the more elusive Closest
On The Closest String and Substring Problems
Journal of the ACM, 2002
Cited by 39 (6 self)
The problem of finding a center string that is 'close' to every given string arises in computational molecular biology and coding theory. This problem has two versions: the Closest String problem and the Closest Substring problem. Given a set of strings $S = \{s_1, s_2, \ldots, s_n\}$, each of length m, the Closest String problem is to find the smallest d and a string s of length m which is within Hamming distance d to each $s_i \in S$. This problem comes from coding theory when we are looking for a code not too far away from a given set of codes. The Closest Substring problem, with an additional input integer L, asks for the smallest d and a string s, of length L, which is within Hamming distance d away from a substring, of length L, of each $s_i$. This problem is much more elusive than the Closest String problem. The Closest Substring problem is formulated from applications in finding conserved regions, identifying genetic drug targets and generating genetic probes in molecular biology. Whether there are efficient approximation algorithms for both problems are major open questions in this area. We present two polynomial time approximation algorithms with approximation ratio $1 + \epsilon$ for any small $\epsilon$ to settle both questions.
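The Closest String objective defined in this abstract can be pinned down with an exhaustive sketch. It is only a statement of the problem for tiny inputs, since it enumerates all |alphabet|^m candidate centers; it is not the approximation scheme the abstract refers to. All function names are illustrative.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def radius(center, strings):
    """Largest Hamming distance from the center to any input string."""
    return max(hamming(center, s) for s in strings)


def closest_string_bruteforce(strings, alphabet):
    """Return (s, d): a center s of minimum radius d, by full enumeration.

    Exponential in the string length m; meant only to make the
    objective concrete on toy inputs.
    """
    m = len(strings[0])
    candidates = ("".join(c) for c in product(alphabet, repeat=m))
    best = min(candidates, key=lambda c: radius(c, strings))
    return best, radius(best, strings)


center, d = closest_string_bruteforce(["AAC", "ACA", "CAA"], "AC")
```

Here no center has radius 0 because the three inputs differ, and "AAA" is within distance 1 of each of them.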
A Polynomial Time Approximation Scheme for Minimum Routing Cost Spanning Trees
1998
Cited by 37 (5 self)
Given an undirected graph with nonnegative costs on the edges, the routing cost of any of its spanning trees is the sum over all pairs of vertices of the cost of the path between the pair in the tree. Finding a spanning tree of minimum routing cost is NP-hard, even when the costs obey the triangle inequality. We show that the general case is in fact reducible to the metric case and present a polynomial-time approximation scheme valid for both versions of the problem. In particular, we show how to build a spanning tree of an n-vertex weighted graph with routing cost within $(1 + \epsilon)$ from the minimum in time $O(n^{O(1/\epsilon)})$. Besides the obvious connection to network design, trees with small routing cost also find application in the construction of good multiple sequence alignments in computational biology. The communication cost spanning tree problem is a generalization of the minimum routing cost tree problem where the routing costs of different pairs are weighted by different r...
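The routing cost of a given spanning tree can be computed without enumerating paths, as the following sketch shows. It assumes the sum runs over unordered vertex pairs (counting ordered pairs would double the result) and that vertices are numbered 0 to n - 1; both are assumptions for the example, as is the function name. Removing an edge of weight w splits the tree into parts of sizes k and n - k, and exactly k (n - k) pairs route through that edge, so it contributes w k (n - k).

```python
from collections import defaultdict


def routing_cost(n, tree_edges):
    """Sum of tree-path costs over all unordered vertex pairs.

    tree_edges: list of (u, v, weight) forming a spanning tree on 0..n-1.
    """
    adj = defaultdict(list)
    for u, v, w in tree_edges:
        adj[u].append((v, w))
        adj[v].append((u, w))

    # Iterative DFS from vertex 0 records each vertex's parent and a preorder.
    parent = {0: None}
    parent_weight = {0: 0}
    order = []
    stack = [0]
    while stack:
        u = stack.pop()
        order.append(u)
        for v, w in adj[u]:
            if v not in parent:
                parent[v] = u
                parent_weight[v] = w
                stack.append(v)

    # Reverse preorder visits descendants before ancestors, so subtree
    # sizes are complete when each parent edge is charged.
    size = {u: 1 for u in order}
    total = 0
    for u in reversed(order):
        if parent[u] is not None:
            total += parent_weight[u] * size[u] * (n - size[u])
            size[parent[u]] += size[u]
    return total


path_cost = routing_cost(3, [(0, 1, 1), (1, 2, 2)])             # path 0-1-2
star_cost = routing_cost(4, [(0, 1, 1), (0, 2, 1), (0, 3, 1)])  # star centered at 0
```

For the path, the three pairs cost 1, 2 and 3, giving 6; for the unit-weight star, three center-leaf pairs cost 1 each and three leaf-leaf pairs cost 2 each, giving 9.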
The Parameterized Complexity of Sequence Alignment and Consensus
1994
Cited by 35 (13 self)
The Longest Common Subsequence problem is examined from the point of view of parameterized computational complexity. There are several different ways in which parameters enter the problem, such as the number of sequences to be analyzed, the length of the common subsequence, and the size of the alphabet. Lower bounds on the complexity of this basic problem imply lower bounds on a number of other sequence alignment and consensus problems. At issue in the theory of parameterized complexity is whether a problem which takes input $(x, k)$ can be solved in time $f(k) \cdot n^{\alpha}$ where $\alpha$ is independent of k (termed fixed-parameter tractability). It can be argued that this is the appropriate asymptotic model of feasible computability for problems for which a small range of parameter values covers important applications, a situation which certainly holds for many problems in biological sequence analysis. Our main results show that: (1) The Longest Common Subsequence (LCS) parameterized by t...
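A small sketch may help show why the number of sequences matters as a parameter. The standard dynamic program for the LCS of k sequences, written here with memoization over tuples of positions, has up to n^k states for sequences of length n, so k ends up in the exponent of n, which is the behaviour that the form $f(k) \cdot n^{\alpha}$ rules out. This is a generic textbook formulation under that reading, not code from the paper, and the function name is illustrative.

```python
from functools import lru_cache


def lcs_length(sequences):
    """Length of a longest common subsequence of all given sequences."""
    seqs = tuple(sequences)

    @lru_cache(maxsize=None)
    def solve(idx):
        # idx[j] is the current position in seqs[j].
        if any(i == len(s) for i, s in zip(idx, seqs)):
            return 0  # some sequence is exhausted
        heads = {s[i] for i, s in zip(idx, seqs)}
        if len(heads) == 1:
            # All current symbols agree: matching them is optimal.
            return 1 + solve(tuple(i + 1 for i in idx))
        # Otherwise at least one current symbol is unused: skip one of them.
        return max(
            solve(idx[:j] + (idx[j] + 1,) + idx[j + 1:])
            for j in range(len(idx))
        )

    return solve((0,) * len(seqs))


three_way = lcs_length(["abcd", "acbd", "abd"])
```

The three toy strings share the subsequence "abd", so the result is 3; adding more sequences grows the memo table multiplicatively.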
Multiple structural alignment by secondary structures: Algorithm and applications
PROTEIN SCI., 2003
Finding Similar Regions in Many Sequences
JOURNAL OF COMPUTER AND SYSTEM SCIENCES, 1999
Cited by 21 (5 self)
Algorithms for finding similar, or highly conserved, regions in a group of sequences are at the core of many molecular biology problems. Assume that we are given n DNA sequences $s_1, \ldots, s_n$. The Consensus Patterns problem, which has been widely studied in bioinformatics research [22, 10, 7, 21, 2, 3, 9, 18, 19, 27], in its simplest form, asks for a region of length L in each $s_i$, and a median string s of length L so that the total Hamming distance from s to these regions is minimized. We show that the problem is NP-hard and give a polynomial time approximation scheme (PTAS) for it. We then present an efficient approximation algorithm for the consensus pattern problem under the original relative entropy measure of [22, 10, 7, 21]. As an interesting application of our analysis, we further obtain a PTAS for a restricted (but still NP-hard) version of the important consensus alignment problem [6] allowing at most a constant number of gaps, each of arbitrary length, in each sequence.
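One piece of the Consensus Patterns objective is easy once the regions are fixed: the median string that minimizes the total Hamming distance is obtained by taking the most frequent symbol in each column, because each column contributes to the total independently. The sketch below covers only that step, under the assumption that the length-L regions have already been chosen; choosing the regions is the hard part and is not attempted here. Function names are illustrative.

```python
from collections import Counter


def median_string(regions):
    """Column-wise majority string for equal-length regions.

    Minimizes the total Hamming distance to the given regions;
    ties within a column are broken arbitrarily.
    """
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*regions)
    )


def total_hamming(center, regions):
    """Sum of Hamming distances from the center to every region."""
    return sum(sum(a != b for a, b in zip(center, r)) for r in regions)


regions = ["ACGT", "ACGA", "TCGT"]
s = median_string(regions)
cost = total_hamming(s, regions)
```

For these three toy regions the column majorities are A, C, G and T, so the median string is "ACGT" with total distance 2.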
Evaluation of Techniques for Classifying Biological Sequences
2001
Cited by 19 (0 self)
In recent years we have witnessed an exponential increase in the amount of biological information, either DNA or protein sequences, that has become available in public databases. This has been followed by an increased interest in developing computational techniques to automatically classify these large volumes of sequence data into various categories corresponding to either their role in the chromosomes, their structure, and/or their function. In this paper we evaluate some of the widely-used sequence classification algorithms and develop a framework for modeling sequences in a fashion so that traditional machine learning algorithms, such as support vector machines, can be applied easily.

