Results 1 - 10 of 105
Finding Consensus in Speech Recognition: Word Error Minimization and Other Applications of Confusion Networks
, 2000
Abstract

Cited by 238 (14 self)
We describe a new framework for distilling information from word lattices to improve the accuracy of speech recognition and obtain a more perspicuous representation of a set of alternative hypotheses. In the standard MAP decoding approach the recognizer outputs the string of words corresponding to the path with the highest posterior probability given the acoustics and a language model. However, even given optimal models, the MAP decoder does not necessarily minimize the commonly used performance metric, word error rate (WER). We describe a method for explicitly minimizing WER by extracting word hypotheses with the highest posterior probabilities from word lattices. We change the standard problem formulation by replacing global search over a large set of sentence hypotheses with local search over a small set of word candidates. In addition to improving the accuracy of the recognizer, our method produces a new representation of the set of candidate hypotheses that specifies ...
Finding consensus among words: lattice-based word error minimisation
 Computer Speech and Language
, 2000
Abstract

Cited by 120 (10 self)
We describe a new algorithm for finding the hypothesis in a recognition lattice that is expected to minimize the word error rate (WER). Our approach thus overcomes the mismatch between the word-based performance metric and the standard MAP scoring paradigm that is sentence-based, and that can lead to suboptimal recognition results. To this end we first find a complete alignment of all words in the recognition lattice, identifying mutually supporting and competing word hypotheses. Finally, a new sentence hypothesis is formed by concatenating the words with maximal posterior probabilities. Experimentally, this approach leads to a significant WER reduction in a large vocabulary recognition task.
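The final step described in this abstract (concatenating the words with maximal posterior probability) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the lattice has already been aligned into a sequence of slots, each mapping competing words to posteriors, and the `<eps>` deletion token and the function name are hypothetical.

```python
from typing import Dict, List


def consensus_hypothesis(slots: List[Dict[str, float]],
                         null_token: str = "<eps>") -> List[str]:
    """Pick the highest-posterior word in each aligned slot.

    `slots` is assumed to be the output of a prior alignment step;
    `null_token` (hypothetical) marks "no word here" and is dropped
    from the output when it wins a slot.
    """
    words = []
    for slot in slots:
        best = max(slot, key=slot.get)  # word with maximal posterior
        if best != null_token:
            words.append(best)
    return words


# Toy aligned slots (made-up posteriors, for illustration only).
slots = [
    {"i": 0.9, "a": 0.1},
    {"have": 0.55, "move": 0.45},
    {"it": 0.4, "<eps>": 0.6},
    {"very": 0.7, "veal": 0.3},
]
print(consensus_hypothesis(slots))
```

The search here is local: each slot is decided independently, in contrast to picking a single best path through the whole lattice.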
Sublinear Time Algorithms for Metric Space Problems
Abstract

Cited by 91 (2 self)
In this paper we give approximation algorithms for the following problems on metric spaces: Furthest Pair, k-median, Minimum Routing Cost Spanning Tree, Multiple Sequence Alignment, Maximum Traveling Salesman Problem, Maximum Spanning Tree and Average Distance. The key property of our algorithms is that their running time is linear in the number of metric space points. As the full specification of an n-point metric space is of size $\Theta(n^2)$, the complexity of our algorithms is sublinear with respect to the input size. All previous algorithms (exact or approximate) for the problems we consider have running time $\Omega(n^2)$. We believe that our techniques can be applied to get similar bounds for other problems. 1 Introduction In recent years there has been a dramatic growth of interest in algorithms operating on massive data sets. This poses new challenges for algorithm design, as algorithms quite efficient on small inputs (for example, having quadratic running time) ...
Finding Similar Regions In Many Strings
 Journal of Computer and System Sciences
, 1999
Abstract

Cited by 66 (8 self)
Algorithms for finding similar, or highly conserved, regions in a group of sequences are at the core of many molecular biology problems. We solve three main open questions in this area. Assume that we are given n DNA sequences $s_1, \ldots, s_n$. The Consensus Patterns problem, which has been widely studied in bioinformatics research [26, 16, 12, 25, 4, 6, 15, 22, 24, 27], in its simplest form, asks for a region of length L in each $s_i$, and a median string s of length L so that the total Hamming distance from s to these regions is minimized. We show the problem is NP-hard and give a polynomial time approximation scheme (PTAS) for it. We also give a PTAS for the problem under the original measure of [26, 16, 12, 25]. As an interesting application of our analysis, we further obtain a PTAS for a restricted (but still NP-hard) version of the important star alignment problem allowing at most constant number of gaps, each of arbitrary length, in each sequence. The Closest String problem [2, 3, 7, 9, 18] asks for the smallest d and a string s which is within Hamming distance d to each $s_i$. The problem is NP-hard [7, 18]. [3] gives a polynomial time algorithm for constant d. For superlogarithmic d, [2, 9] give efficient approximation algorithms using linear program relaxation techniques. The best polynomial time approximation has ratio 4/3 for all d, given by [18] ([9] also independently claimed the 4/3 ratio but only for superlogarithmic d). We settle the problem with a PTAS. We then give the first nontrivial better-than-2 approximation with ratio $2 - \frac{2}{2|\Sigma|+1}$ for the more elusive Closest
On The Closest String and Substring Problems
 Journal of the ACM
, 2002
Abstract

Cited by 65 (15 self)
The problem of finding a center string that is 'close' to every given string arises in computational molecular biology and coding theory. This problem has two versions: the Closest String problem and the Closest Substring problem. Given a set of strings $S = \{s_1, s_2, \ldots, s_n\}$, each of length m, the Closest String problem is to find the smallest d and a string s of length m which is within Hamming distance d to each $s_i \in S$. This problem comes from coding theory when we are looking for a code not too far away from a given set of codes. Closest Substring problem, with an additional input integer L, asks for the smallest d and a string s, of length L, which is within Hamming distance d away from a substring, of length L, of each $s_i$. This problem is much more elusive than the Closest String problem. The Closest Substring problem is formulated from applications in finding conserved regions, identifying genetic drug targets and generating genetic probes in molecular biology. Whether there are efficient approximation algorithms for both problems are major open questions in this area. We present two polynomial time approximation algorithms with approximation ratio $1 + \epsilon$ for any small $\epsilon$ to settle both questions.
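To make the Closest String definition concrete, here is a small exhaustive-search sketch. It is not the approximation scheme from the paper: it simply tries every candidate center over the alphabet, so it is only usable on toy inputs. All function names are illustrative.

```python
from itertools import product
from typing import Sequence, Tuple


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def radius(center: str, strings: Sequence[str]) -> int:
    """Largest Hamming distance from `center` to any input string."""
    return max(hamming(center, s) for s in strings)


def closest_string_bruteforce(strings: Sequence[str],
                              alphabet: str) -> Tuple[int, str]:
    """Return (d, s) with the smallest radius d; exponential in the length m."""
    m = len(strings[0])
    best = None
    for cand in product(alphabet, repeat=m):
        center = "".join(cand)
        d = radius(center, strings)
        if best is None or d < best[0]:
            best = (d, center)
    return best


print(closest_string_bruteforce(["ACGT", "ACGA", "TCGT"], "ACGT"))
```

The search visits $|\Sigma|^m$ candidates, which is why approximation algorithms matter for realistic string lengths.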
Multiple structural alignment by secondary structures: Algorithm and applications
 PROTEIN SCI.
, 2003
The Parameterized Complexity of Sequence Alignment and Consensus
, 1994
Abstract

Cited by 47 (12 self)
The Longest common subsequence problem is examined from the point of view of parameterized computational complexity. There are several different ways in which parameters enter the problem, such as the number of sequences to be analyzed, the length of the common subsequence, and the size of the alphabet. Lower bounds on the complexity of this basic problem imply lower bounds on a number of other sequence alignment and consensus problems. At issue in the theory of parameterized complexity is whether a problem which takes input $(x, k)$ can be solved in time $f(k) \cdot n^{\alpha}$ where $\alpha$ is independent of k (termed fixed-parameter tractability). It can be argued that this is the appropriate asymptotic model of feasible computability for problems for which a small range of parameter values covers important applications, a situation which certainly holds for many problems in biological sequence analysis. Our main results show that: (1) The Longest Common Subsequence (LCS) parameterized by t...
A Polynomial Time Approximation Scheme for Minimum Routing Cost Spanning Trees
, 1998
Abstract

Cited by 47 (7 self)
Given an undirected graph with nonnegative costs on the edges, the routing cost of any of its spanning trees is the sum over all pairs of vertices of the cost of the path between the pair in the tree. Finding a spanning tree of minimum routing cost is NP-hard, even when the costs obey the triangle inequality. We show that the general case is in fact reducible to the metric case and present a polynomial-time approximation scheme valid for both versions of the problem. In particular, we show how to build a spanning tree of an n-vertex weighted graph with routing cost within $(1 + \epsilon)$ from the minimum in time $O(n^{O(1/\epsilon)})$. Besides the obvious connection to network design, trees with small routing cost also find application in the construction of good multiple sequence alignments in computational biology. The communication cost spanning tree problem is a generalization of the minimum routing cost tree problem where the routing costs of different pairs are weighted by different r...
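The routing cost defined at the start of this abstract can be computed for a given tree without enumerating paths. The sketch below only evaluates a tree; it does not search for a minimum-cost one. It assumes vertices are labeled 0..n-1, counts each unordered pair once, and uses the fact that an edge of cost w splitting the tree into parts of size k and n-k lies on exactly k(n-k) paths. The function name is illustrative.

```python
from collections import defaultdict
from typing import List, Tuple


def routing_cost(n: int, edges: List[Tuple[int, int, float]]) -> float:
    """Sum of tree-path costs over all unordered vertex pairs.

    `edges` holds (u, v, cost) triples forming a spanning tree on
    vertices 0..n-1 (an assumption of this sketch).
    """
    if len(edges) != n - 1:
        raise ValueError("a tree on n vertices has n - 1 edges")
    adj = defaultdict(list)
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))

    # Iterative DFS from vertex 0, recording a preorder and parent links.
    parent = {0: None}
    parent_cost = {}
    order = []
    stack = [0]
    while stack:
        u = stack.pop()
        order.append(u)
        for v, w in adj[u]:
            if v not in parent:
                parent[v] = u
                parent_cost[v] = w
                stack.append(v)
    if len(order) != n:
        raise ValueError("edges do not connect all vertices")

    # Process children before parents to accumulate subtree sizes.
    size = {u: 1 for u in order}
    total = 0
    for u in reversed(order):
        p = parent[u]
        if p is not None:
            total += parent_cost[u] * size[u] * (n - size[u])
            size[p] += size[u]
    return total


print(routing_cost(3, [(0, 1, 1), (1, 2, 2)]))
```

For the path 0-1-2 with costs 1 and 2, the pairwise path costs are 1, 2 and 3, which sum to 6.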
Finding Similar Regions in Many Sequences
 JOURNAL OF COMPUTER AND SYSTEM SCIENCES
, 1999
Abstract

Cited by 36 (9 self)
Algorithms for finding similar, or highly conserved, regions in a group of sequences are at the core of many molecular biology problems. Assume that we are given n DNA sequences $s_1, \ldots, s_n$. The Consensus Patterns problem, which has been widely studied in bioinformatics research [22, 10, 7, 21, 2, 3, 9, 18, 19, 27], in its simplest form, asks for a region of length L in each $s_i$, and a median string s of length L so that the total Hamming distance from s to these regions is minimized. We show that the problem is NP-hard and give a polynomial time approximation scheme (PTAS) for it. We then present an efficient approximation algorithm for the consensus pattern problem under the original relative entropy measure of [22, 10, 7, 21]. As an interesting application of our analysis, we further obtain a PTAS for a restricted (but still NP-hard) version of the important consensus alignment problem [6] allowing at most constant number of gaps, each of arbitrary length, in each sequence.
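One piece of the Consensus Patterns objective is easy to illustrate: once a length-L region has been fixed in every sequence, a median string minimizing the total Hamming distance is obtained by taking the most frequent symbol in each column. The hard part of the problem, choosing the regions, is not addressed here. This is a sketch with illustrative names, not the paper's PTAS.

```python
from collections import Counter
from typing import Sequence


def median_string(regions: Sequence[str]) -> str:
    """Column-wise majority symbol over equal-length regions."""
    if len({len(r) for r in regions}) != 1:
        raise ValueError("regions must all have the same length L")
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*regions))


def total_hamming(center: str, regions: Sequence[str]) -> int:
    """Sum of Hamming distances from `center` to every region."""
    return sum(sum(a != b for a, b in zip(center, r)) for r in regions)


regions = ["ACGT", "ACGA", "TCGA"]  # toy regions, already selected
s = median_string(regions)
print(s, total_hamming(s, regions))
```

Each column contributes independently to the total distance, so the per-column majority is optimal for fixed regions.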
The complexity of multiple sequence alignment with SP-score that is a metric
 TCS
, 2001
Abstract

Cited by 34 (0 self)
This paper analyzes the computational complexity of computing the optimal alignment of a set of sequences under the SP (sum of all pairs) score scheme. We solve an open question by showing that the problem is NP-complete in the very restricted case in which the sequences are over a binary alphabet and the score is a metric. This result establishes the intractability of multiple sequence alignment under a score function of mathematical interest, which has indeed received much attention in biological sequence comparison.
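For readers unfamiliar with the SP score, the sketch below evaluates it for a given alignment: the column-by-column cost is summed over every pair of rows. The unit cost used here (0 for identical symbols, 1 otherwise, with the gap character `-` treated as an ordinary symbol) is an assumption chosen for illustration because it is a metric on symbols; it is not necessarily the scoring function analyzed in the paper. Finding the alignment that minimizes this score is the hard problem and is not attempted.

```python
from itertools import combinations
from typing import Sequence


def pair_cost(a: str, b: str) -> int:
    """Assumed unit cost: 0 if the symbols match, 1 otherwise ('-' is a gap)."""
    return 0 if a == b else 1


def sp_score(rows: Sequence[str]) -> int:
    """Sum-of-pairs score of an alignment given as equal-length gapped rows."""
    if len({len(r) for r in rows}) > 1:
        raise ValueError("all alignment rows must have the same length")
    return sum(
        pair_cost(x, y)
        for r1, r2 in combinations(rows, 2)
        for x, y in zip(r1, r2)
    )


# Toy alignment over a binary alphabet with gaps.
print(sp_score(["01-1", "0111", "0-11"]))
```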