Compressed suffix arrays and suffix trees with applications to text indexing and string matching
, 2005
Cited by 239 (19 self)
The proliferation of online text, such as found on the World Wide Web and in online databases, motivates the need for space-efficient text indexing methods that support fast string searching. We model this scenario as follows: Consider a text T consisting of n symbols drawn from a fixed alphabet Σ. The text T can be represented in n lg |Σ| bits by encoding each symbol with lg |Σ| bits. The goal is to support fast online queries for searching any string pattern P of m symbols, with T being fully scanned only once, namely, when the index is created at preprocessing time. The text indexing schemes published in the literature are greedy in terms of space usage: they require Ω(n lg n) additional bits of space in the worst case. For example, in the standard unit cost RAM, suffix trees and suffix arrays need Ω(n) memory words, each of Ω(lg n) bits. These indexes are larger than the text itself by a multiplicative factor of Ω(lg_|Σ| n), which is significant when Σ is of constant size, such as in ASCII or Unicode. On the other hand, these indexes support fast searching, either in O(m lg |Σ|) time or in O(m + lg n) time, plus an output-sensitive cost O(occ) for listing the occ pattern occurrences. We present a new text index that is based upon compressed representations of suffix arrays and suffix trees. It achieves a fast O(m / lg_|Σ| n + lg^ε_|Σ| n) search time in the worst case, for any constant
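As a point of reference for the indexes this abstract compares against, here is a minimal sketch of pattern search on a plain, uncompressed suffix array. It is not the compressed index the paper proposes. The function names are hypothetical, the construction is deliberately naive, and the search is a simple binary search costing O(m lg n) character comparisons rather than the O(m + lg n) bound quoted above, which needs extra longest-common-prefix information.

```python
def build_suffix_array(text):
    # Naive construction for illustration only: sort the suffix start
    # positions by the suffix that begins there.
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return the sorted start positions of pattern in text using sa."""
    m = len(pattern)

    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # Every suffix in sa[start:lo] begins with pattern (the occ occurrences).
    return sorted(sa[start:lo])


text = "banana"
sa = build_suffix_array(text)
print(find_occurrences(text, sa, "ana"))
```

The array stores one integer per text position, which is the Ω(n lg n)-bit overhead the abstract describes.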
Comparative Experiments on Learning Information Extractors for Proteins and their Interactions
, 2004
Cited by 105 (7 self)
Automatically extracting information from biomedical text holds the promise of easily consolidating large amounts of biological knowledge in computer-accessible form. This strategy is particularly attractive for extracting data relevant to genes of the human genome from the 11 million abstracts in Medline. However, extraction efforts have been frustrated by the lack of conventions for describing human genes and proteins. We have developed and evaluated a variety of learned information extraction systems for identifying human protein names in Medline abstracts and subsequently extracting information on interactions between the proteins. We demonstrate that machine learning approaches using support vector machines and maximum entropy are able to identify human proteins with higher accuracy than several previous approaches. We also demonstrate that various rule induction methods are able to identify protein interactions with higher precision than manually developed rules.
Probabilistic and Statistical Properties of Words: An Overview
 Journal of Computational Biology
, 2000
Cited by 105 (4 self)
In the following, an overview is given on statistical and probabilistic properties of words, as occurring in the analysis of biological sequences. Counts of occurrence, counts of clumps, and renewal counts are distinguished, and exact distributions as well as normal approximations, Poisson process approximations, and compound Poisson approximations are derived. Here, a sequence is modelled as a stationary ergodic Markov chain; a test for determining the appropriate order of the Markov chain is described. The convergence results take the error made by estimating the Markovian transition probabilities into account. The main tools involved are moment generating functions, martingales, Stein’s method, and the Chen-Stein method. Similar results are given for occurrences of multiple patterns, and, as an example, the problem of unique recoverability of a sequence from SBH chip data is discussed. Special emphasis lies on disentangling the complicated dependence structure between word occurrences, due to self-overlap as well as due to overlap between words. The results can be used to derive approximate, and conservative, confidence intervals for tests. Key words: word counts, renewal counts, Markov model, exact distribution, normal approximation, Poisson process approximation, compound Poisson approximation, occurrences of multiple words, sequencing by hybridization, martingales, moment generating functions, Stein’s method, Chen-Stein method.
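The three counts distinguished in this abstract can be illustrated with a short sketch. The definitions used here are assumptions based on the usual meaning of the terms, not quoted from the paper: an occurrence count includes overlapping matches, a clump is a maximal run of mutually overlapping occurrences, and a renewal count scans left to right and counts an occurrence only if it does not overlap the last counted one. All function names are hypothetical.

```python
def occurrence_positions(seq, word):
    # Start positions of all occurrences, overlapping ones included.
    m = len(word)
    return [i for i in range(len(seq) - m + 1) if seq[i:i + m] == word]


def count_clumps(positions, m):
    # A new clump starts whenever an occurrence does not overlap
    # the occurrence immediately before it.
    clumps = 0
    prev = None
    for p in positions:
        if prev is None or p - prev >= m:
            clumps += 1
        prev = p
    return clumps


def count_renewals(positions, m):
    # Count an occurrence only if it does not overlap the most
    # recently counted occurrence.
    renewals = 0
    last_counted = None
    for p in positions:
        if last_counted is None or p - last_counted >= m:
            renewals += 1
            last_counted = p
    return renewals


seq, word = "AAAAA", "AA"
pos = occurrence_positions(seq, word)
print(len(pos), count_clumps(pos, len(word)), count_renewals(pos, len(word)))
```

For a self-overlapping word such as AA the three counts differ, which is the dependence structure the abstract refers to.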
Nearest Common Ancestors: A survey and a new distributed algorithm
, 2002
Cited by 90 (12 self)
Several papers describe linear time algorithms to preprocess a tree, such that one can answer subsequent nearest common ancestor queries in constant time. Here, we survey these algorithms and related results. A common idea used by all the algorithms for the problem is that a solution for complete balanced binary trees is straightforward. Furthermore, for complete balanced binary trees we can easily solve the problem in a distributed way by labeling the nodes of the tree such that from the labels of two nodes alone one can compute the label of their nearest common ancestor. Whether it is possible to distribute the data structure into short labels associated with the nodes is important for several applications such as routing. Therefore, related labeling problems have received a lot of attention recently.
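The labeling idea for complete balanced binary trees can be sketched as follows. The scheme shown is an assumption chosen for illustration (heap-order numbering, with the root labeled 1 and the children of node v labeled 2v and 2v + 1); the survey may use a different labeling. The function name is hypothetical. With this numbering, the binary representation of a label spells the path from the root, so the nearest common ancestor is the longest common prefix of the two labels.

```python
def nca_label(u, v):
    """Label of the nearest common ancestor of nodes labeled u and v.

    Assumes heap-order labels: root = 1, children of x are 2x and 2x + 1.
    Uses only the two labels, with no access to the tree itself.
    """
    # Bring the deeper node up to the depth of the shallower one.
    du, dv = u.bit_length(), v.bit_length()
    if du > dv:
        u >>= du - dv
    else:
        v >>= dv - du

    # Both labels now have the same length; drop the low bits from the
    # first position where they differ, keeping the common prefix.
    diff = u ^ v
    return u >> diff.bit_length()


print(nca_label(4, 5), nca_label(4, 7), nca_label(2, 5))
```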
A General Edit Distance between RNA Structures
, 2001
Cited by 90 (0 self)
Arc-annotated sequences are useful in representing the structural information of RNA sequences.
A Subquadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices
, 2002
Cited by 74 (4 self)
The classical algorithm for computing the similarity between two sequences [36, 39] uses a dynamic programming matrix, and compares two strings of size n in O(n^2) time. We address the challenge of computing the similarity of two strings in subquadratic time, for metrics which use a scoring matrix of unrestricted weights. Our algorithm applies to both local and global alignment computations. The speedup is achieved by dividing the dynamic programming matrix into variable sized blocks, as induced by Lempel-Ziv parsing of both strings, and utilizing the inherent periodic nature of both strings. This leads to an O(n^2 / log n) algorithm for an input of constant alphabet size. For most texts, the time complexity is actually O(hn^2 / log n) where h ≤ 1 is the entropy of the text.
Institut Gaspard-Monge, Université de Marne-la-Vallée, Cité Descartes, Champs-sur-Marne, 77454 Marne-la-Vallée Cedex 2, France, email: mac@univmlv.fr.
Department of Computer Science, Haifa University, Haifa 31905, Israel, phone: (9724) 8240103, FAX: (9724) 8249331; Department of Computer and Information Science, Polytechnic University, Six MetroTech Center, Brooklyn, NY 112013840; email: landau@poly.edu; partially supported by NSF grant CCR0104307, by NATO Science Programme grant PST.CLG.977017, by the Israel Science Foundation (grants 173/98 and 282/01), by the FIRST Foundation of the Israel Academy of Science and Humanities, and by IBM Faculty Partnership Award.
Department of Computer Science, Haifa University, Haifa 31905, Israel; On Education Leave from the IBM T.J.W. Research Center; email: michal@cs.haifa.il; partially supported by the Israel Science Foundation (grants 173/98 and 282/01), and by the FIRST Foundation of the Israel Academy of Science ...
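For context, the classical quadratic dynamic program that this abstract takes as its baseline can be sketched as below. This is the textbook global alignment recurrence, not the subquadratic Lempel-Ziv block algorithm the paper introduces. The function name, the callable `score`, and the linear `gap` penalty are illustrative assumptions; `score` stands in for a scoring matrix of unrestricted weights.

```python
def global_alignment_score(a, b, score, gap):
    """Best global alignment score of strings a and b.

    score(x, y) returns the weight of aligning symbols x and y;
    gap is the (usually negative) weight of aligning a symbol to a gap.
    Runs in O(len(a) * len(b)) time and keeps two rows of the matrix.
    """
    n, m = len(a), len(b)
    # Row 0: the empty prefix of a aligned against prefixes of b.
    prev = [j * gap for j in range(m + 1)]
    for i in range(1, n + 1):
        curr = [i * gap] + [0] * m
        for j in range(1, m + 1):
            curr[j] = max(
                prev[j - 1] + score(a[i - 1], b[j - 1]),  # align a[i-1] with b[j-1]
                prev[j] + gap,                            # a[i-1] against a gap
                curr[j - 1] + gap,                        # b[j-1] against a gap
            )
        prev = curr
    return prev[m]


def match_mismatch(x, y):
    # Simple example weights; any scoring function could be used instead.
    return 1 if x == y else -1


print(global_alignment_score("GATTACA", "GATTACA", match_mismatch, -1))
```

Each of the n^2 cells is filled in constant time, which is where the O(n^2) bound in the abstract comes from.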
A General Model for Authenticated Data Structures
 Algorithmica
, 2001
Cited by 70 (1 self)
Query answers from online databases can easily be corrupted by hackers or malicious database publishers. Thus it is important to provide mechanisms which allow clients to trust the results from online queries. Authentic publication is a novel approach which allows untrusted publishers to securely answer queries from clients on behalf of trusted offline data owners. Publishers validate answers using compact, hard-to-forge verification objects (VOs), which clients can check efficiently. This approach provides greater scalability (by adding more publishers) and better security (online publishers don't need to be trusted).
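The abstract does not say how verification objects are built. As one possible illustration, the sketch below uses a Merkle hash tree, which is an assumption on my part and not necessarily the paper's model. The data owner publishes only the root digest, the untrusted publisher returns a value together with the sibling hashes on its path, and the client recomputes the root. All function names, the domain-separation prefixes, and the choice of SHA-256 are hypothetical.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_levels(leaves):
    # Owner side: hash the leaves, then hash pairs upward to a single root.
    # Requires at least one leaf. Odd levels are padded by repeating the
    # last digest (a simplification for this sketch).
    levels = [[_h(b"leaf:" + x) for x in leaves]]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        if len(cur) % 2:
            cur = cur + [cur[-1]]
            levels[-1] = cur
        levels.append([_h(b"node:" + cur[i] + cur[i + 1])
                       for i in range(0, len(cur), 2)])
    return levels


def make_proof(levels, index):
    # Publisher side: collect the sibling digest at every level below the root.
    proof = []
    for level in levels[:-1]:
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))
        index //= 2
    return proof


def verify(value, proof, root):
    # Client side: recompute the root from the value and the sibling digests.
    digest = _h(b"leaf:" + value)
    for sibling, sibling_is_left in proof:
        pair = sibling + digest if sibling_is_left else digest + sibling
        digest = _h(b"node:" + pair)
    return digest == root


records = [b"alice", b"bob", b"carol"]
levels = build_levels(records)
root = levels[-1][0]            # the only thing the client must trust
proof = make_proof(levels, 2)   # verification object for b"carol"
print(verify(b"carol", proof, root), verify(b"mallory", proof, root))
```

The proof holds one digest per tree level, so it stays compact as the data set grows.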
Algorithms for Phylogenetic Footprinting
 Journal of Computational Biology
, 2001
Cited by 64 (3 self)
Phylogenetic footprinting is a technique that identifies regulatory elements by finding unusually well conserved regions in a set of orthologous noncoding DNA sequences from multiple species. In an earlier paper, we presented an exact algorithm that identifies the most conserved region of a set of sequences. Here, we present a number of algorithmic improvements that produce a 1000-fold speedup over the original algorithm. We also show how prior knowledge can be used to identify weaker motifs, and how to handle data sets in which only an unknown subset of the sequences contains the regulatory element. Each technique is implemented and successfully identifies a large number of known binding sites, as well as several highly conserved but uncharacterized regions.
Dynamic LCA queries on trees
 SIAM Journal on Computing
, 1999
Cited by 59 (0 self)
We show how to maintain a data structure on trees which allows for the following operations, all in worst-case constant time: 1. insertion of leaves and internal nodes, 2. deletion of leaves, 3. deletion of internal nodes with only one child, 4. determining the least common ancestor of any two nodes. We also generalize the Dietz–Sleator “cup-filling” scheduling methodology, which may be of independent interest.
Matching Planar Maps
, 2003
Cited by 53 (15 self)
This paper studies algorithms for measuring the similarity of patterns of line segments in the plane, a standard problem in, e.g., computer vision and geographic information systems. More precisely, we define feasible distance measures that reflect how close a given pattern H is to some part of a larger pattern G. These distance measures are generalizations of the well-known Fréchet distance for curves. We first give an efficient algorithm for the case that H is a polygonal curve and G is a geometric graph. Then, slightly relaxing the definition of distance measure, we give an algorithm for the general case where both H and G are geometric graphs.