Results 1-10 of 45
Probabilistic and Statistical Properties of Words: An Overview
 Journal of Computational Biology
, 2000
Abstract

Cited by 83 (1 self)
In the following, an overview is given on statistical and probabilistic properties of words, as occurring in the analysis of biological sequences. Counts of occurrence, counts of clumps, and renewal counts are distinguished, and exact distributions as well as normal approximations, Poisson process approximations, and compound Poisson approximations are derived. Here, a sequence is modelled as a stationary ergodic Markov chain; a test for determining the appropriate order of the Markov chain is described. The convergence results take the error made by estimating the Markovian transition probabilities into account. The main tools involved are moment generating functions, martingales, Stein’s method, and the Chen-Stein method. Similar results are given for occurrences of multiple patterns, and, as an example, the problem of unique recoverability of a sequence from SBH chip data is discussed. Special emphasis lies on disentangling the complicated dependence structure between word occurrences, due to self-overlap as well as due to overlap between words. The results can be used to derive approximate, and conservative, confidence intervals for tests. Key words: word counts, renewal counts, Markov model, exact distribution, normal approximation, Poisson process approximation, compound Poisson approximation, occurrences of multiple words, sequencing by hybridization, martingales, moment generating functions, Stein’s method, Chen-Stein method.
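The three kinds of counts named in the abstract above can be illustrated with a small sketch. It assumes the usual definitions: every (possibly overlapping) match is an occurrence, a clump is a maximal run of mutually overlapping occurrences, and renewals are the occurrences picked greedily from left to right so that none overlap. The function names are hypothetical and are not taken from the paper.

```python
def occurrences(seq: str, word: str) -> list[int]:
    """Start positions of all (possibly overlapping) occurrences of word."""
    last_start = len(seq) - len(word)
    return [i for i in range(last_start + 1) if seq.startswith(word, i)]


def count_clumps(positions: list[int], m: int) -> int:
    """Number of maximal runs of overlapping occurrences of a word of length m."""
    clumps = 0
    prev = None
    for p in positions:
        # A new clump starts when this occurrence does not overlap the previous one.
        if prev is None or p - prev >= m:
            clumps += 1
        prev = p
    return clumps


def count_renewals(positions: list[int], m: int) -> int:
    """Number of non-overlapping occurrences, chosen greedily left to right."""
    count = 0
    next_free = 0
    for p in positions:
        if p >= next_free:
            count += 1
            next_free = p + m
    return count


pos = occurrences("AAAAA", "AA")
print(len(pos), count_clumps(pos, 2), count_renewals(pos, 2))
```

For the self-overlapping word AA in AAAAA the three counts differ, which is the dependence between occurrences that the abstract refers to.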
Comparative Experiments on Learning Information Extractors for Proteins and their Interactions
, 2004
Abstract

Cited by 79 (7 self)
Automatically extracting information from biomedical text holds the promise of easily consolidating large amounts of biological knowledge in computer-accessible form. This strategy is particularly attractive for extracting data relevant to genes of the human genome from the 11 million abstracts in Medline. However, extraction efforts have been frustrated by the lack of conventions for describing human genes and proteins. We have developed and evaluated a variety of learned information extraction systems for identifying human protein names in Medline abstracts and subsequently extracting information on interactions between the proteins. We demonstrate that machine learning approaches using support vector machines and maximum entropy are able to identify human proteins with higher accuracy than several previous approaches. We also demonstrate that various rule induction methods are able to identify protein interactions with higher precision than manually developed rules.
Nearest Common Ancestors: A survey and a new distributed algorithm
, 2002
Abstract

Cited by 76 (11 self)
Several papers describe linear time algorithms to preprocess a tree, such that one can answer subsequent nearest common ancestor queries in constant time. Here, we survey these algorithms and related results. A common idea used by all the algorithms for the problem is that a solution for complete balanced binary trees is straightforward. Furthermore, for complete balanced binary trees we can easily solve the problem in a distributed way by labeling the nodes of the tree such that from the labels of two nodes alone one can compute the label of their nearest common ancestor. Whether it is possible to distribute the data structure into short labels associated with the nodes is important for several applications such as routing. Therefore, related labeling problems have received a lot of attention recently.
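As a minimal sketch of the labeling idea for complete balanced binary trees, assume heap-style labels: the root is 1 and the children of node i are 2i and 2i+1. This labeling is an assumption chosen for illustration; the survey may use a different scheme. Under it, the label of the nearest common ancestor is the longest common binary prefix of the two labels, so it can be computed from the labels alone.

```python
def nca_label(u: int, v: int) -> int:
    """Nearest common ancestor of nodes u and v in a complete binary tree
    with heap-style labels (root = 1, children of i are 2*i and 2*i + 1)."""
    if u < 1 or v < 1:
        raise ValueError("labels must be positive integers")
    # The bit length of a label is its depth + 1; lift the deeper node
    # until both labels sit at the same depth.
    du, dv = u.bit_length(), v.bit_length()
    if du > dv:
        u >>= du - dv
    else:
        v >>= dv - du
    # At equal depth, the highest differing bit tells how many more
    # levels both nodes must climb before their labels agree.
    return u >> (u ^ v).bit_length()


print(nca_label(4, 5), nca_label(4, 6), nca_label(5, 3))
```

No tree traversal or stored structure is needed, which is what makes such labels attractive in the distributed setting the abstract describes.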
A General Edit Distance between RNA Structures
, 2001
Abstract

Cited by 69 (0 self)
Arc-annotated sequences are useful in representing the structural information of RNA sequences.
Algorithms for Phylogenetic Footprinting
 Journal of Computational Biology
, 2001
Abstract

Cited by 57 (3 self)
Phylogenetic footprinting is a technique that identifies regulatory elements by finding unusually well conserved regions in a set of orthologous noncoding DNA sequences from multiple species. In an earlier paper, we presented an exact algorithm that identifies the most conserved region of a set of sequences. Here, we present a number of algorithmic improvements that produce a 1000-fold speedup over the original algorithm. We also show how prior knowledge can be used to identify weaker motifs, and how to handle data sets in which only an unknown subset of the sequences contains the regulatory element. Each technique is implemented and successfully identifies a large number of known binding sites, as well as several highly conserved but uncharacterized regions.
A Subquadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices
, 2002
Abstract

Cited by 56 (4 self)
The classical algorithm for computing the similarity between two sequences [36, 39] uses a dynamic programming matrix, and compares two strings of size n in O(n^2) time. We address the challenge of computing the similarity of two strings in subquadratic time, for metrics which use a scoring matrix of unrestricted weights. Our algorithm applies to both local and global alignment computations. The speedup is achieved by dividing the dynamic programming matrix into variable sized blocks, as induced by Lempel-Ziv parsing of both strings, and utilizing the inherent periodic nature of both strings. This leads to an O(n^2 / log n) algorithm for an input of constant alphabet size. For most texts, the time complexity is actually O(hn^2 / log n) where h ≤ 1 is the entropy of the text.
Institut Gaspard-Monge, Universite de Marne-la-Vallee, Cite Descartes, Champs-sur-Marne, 77454 Marne-la-Vallee Cedex 2, France, email: mac@univmlv.fr.
Department of Computer Science, Haifa University, Haifa 31905, Israel, phone: (9724) 8240103, FAX: (9724) 8249331; Department of Computer and Information Science, Polytechnic University, Six MetroTech Center, Brooklyn, NY 112013840; email: landau@poly.edu; partially supported by NSF grant CCR0104307, by NATO Science Programme grant PST.CLG.977017, by the Israel Science Foundation (grants 173/98 and 282/01), by the FIRST Foundation of the Israel Academy of Science and Humanities, and by IBM Faculty Partnership Award.
Department of Computer Science, Haifa University, Haifa 31905, Israel; on education leave from the IBM T.J.W. Research Center; email: michal@cs.haifa.il; partially supported by the Israel Science Foundation (grants 173/98 and 282/01), and by the FIRST Foundation of the Israel Academy of Science ...
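For reference, the classical quadratic dynamic program that the abstract above takes as its baseline can be sketched as follows. This is not the paper's subquadratic algorithm: it is the standard global alignment recurrence, with an arbitrary scoring function and a linear gap penalty as assumptions. The function and parameter names are hypothetical.

```python
from typing import Callable


def global_alignment_score(
    a: str,
    b: str,
    score: Callable[[str, str], float],
    gap: float,
) -> float:
    """Best global alignment score of a and b in O(len(a) * len(b)) time.

    score(x, y) may be any weight for aligning characters x and y;
    gap is the (usually negative) score of aligning a character to a gap.
    """
    # prev[j] holds the best score of aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap] + [0.0] * len(b)
        for j in range(1, len(b) + 1):
            cur[j] = max(
                prev[j - 1] + score(a[i - 1], b[j - 1]),  # align a[i-1] with b[j-1]
                prev[j] + gap,                            # a[i-1] against a gap
                cur[j - 1] + gap,                         # b[j-1] against a gap
            )
        prev = cur
    return prev[len(b)]


def unit(x: str, y: str) -> float:
    """Example scoring function: +1 for a match, -1 for a mismatch."""
    return 1.0 if x == y else -1.0


print(global_alignment_score("GATTACA", "GATTACA", unit, -1.0))
```

Every cell of the n by n table is filled once, which is the O(n^2) cost that the block decomposition described in the abstract is designed to reduce.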
A General Model for Authenticated Data Structures
 Algorithmica
, 2001
Abstract

Cited by 47 (1 self)
Query answers from online databases can easily be corrupted by hackers or malicious database publishers. Thus it is important to provide mechanisms which allow clients to trust the results from online queries. Authentic publication is a novel approach which allows untrusted publishers to securely answer queries from clients on behalf of trusted offline data owners. Publishers validate answers using compact, hard-to-forge verification objects (VOs), which clients can check efficiently. This approach provides greater scalability (by adding more publishers) and better security (online publishers don't need to be trusted).
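The abstract does not say which data structure backs the verification objects. As one common illustration of the idea, and not necessarily the paper's construction, the sketch below uses a Merkle hash tree: the trusted owner publishes only the root digest, the untrusted publisher returns an item together with its sibling hashes (the VO), and the client recomputes the root. All function names, and the choice of SHA-256 with "leaf:" and "node:" prefixes, are assumptions.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_levels(items: list[bytes]) -> list[list[bytes]]:
    """Hash levels of a Merkle tree over a non-empty list, leaves first."""
    level = [_h(b"leaf:" + x) for x in items]
    levels = [level]
    while len(level) > 1:
        # An odd-sized level pairs its last node with itself.
        padded = level if len(level) % 2 == 0 else level + [level[-1]]
        level = [
            _h(b"node:" + padded[k] + padded[k + 1])
            for k in range(0, len(padded), 2)
        ]
        levels.append(level)
    return levels


def make_vo(levels: list[list[bytes]], index: int) -> list[tuple[bytes, bool]]:
    """Publisher side: sibling hashes on the path from leaf `index` to the root."""
    vo = []
    for level in levels[:-1]:
        sibling = index ^ 1
        if sibling >= len(level):
            sibling = index  # node was paired with itself
        vo.append((level[sibling], index % 2 == 0))  # True: sibling is on the right
        index //= 2
    return vo


def verify(item: bytes, vo: list[tuple[bytes, bool]], root: bytes) -> bool:
    """Client side: recompute the root from the item and the VO."""
    digest = _h(b"leaf:" + item)
    for sibling, sibling_on_right in vo:
        pair = digest + sibling if sibling_on_right else sibling + digest
        digest = _h(b"node:" + pair)
    return digest == root


records = [b"alice", b"bob", b"carol"]
levels = build_levels(records)
root = levels[-1][0]  # the only value the client must obtain from the owner
print(verify(b"carol", make_vo(levels, 2), root))
```

The VO grows with the logarithm of the number of records, which matches the abstract's requirement that verification objects be compact and efficiently checkable.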
Dynamic LCA queries on trees
 SIAM Journal on Computing
, 1999
Abstract

Cited by 44 (0 self)
We show how to maintain a data structure on trees which allows for the following operations, all in worst-case constant time: 1. insertion of leaves and internal nodes, 2. deletion of leaves, 3. deletion of internal nodes with only one child, 4. determining the least common ancestor of any two nodes. We also generalize the Dietz–Sleator “cup-filling” scheduling methodology, which may be of independent interest.
Approximate Distance Oracles for Geometric Graphs
, 2002
Abstract

Cited by 34 (10 self)
Given a geometric t-spanner graph G in E^d with n points and m edges, whose edge lengths lie within a polynomial (in n) factor of each other, we present an approximation scheme that, after O(m + n log n) preprocessing, answers (1+ε)-approximate shortest path queries in O(1) time. The data structure uses O(n log n) space.
Finding maximal pairs with bounded gap
 Proceedings of the 10th Annual Symposium on Combinatorial Pattern Matching (CPM), volume 1645 of Lecture Notes in Computer Science
, 1999
Abstract

Cited by 26 (6 self)
A pair in a string is the occurrence of the same substring twice. A pair is maximal if the two occurrences of the substring cannot be extended to the left and right without making them different. The gap of a pair is the number of characters between the two occurrences of the substring. In this paper we present methods for finding all maximal pairs under various constraints on the gap. In a string of length n we can find all maximal pairs with gap in an upper and lower bounded interval in time O(n log n + z) where z is the number of reported pairs. If the upper bound is removed the time reduces to O(n+z). Since a tandem repeat is a pair where the gap is zero, our methods can be seen as a generalization of finding tandem repeats. The running time of our methods equals the running time of well known methods for finding tandem repeats.
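The definitions in the abstract above can be made concrete with a brute-force sketch. It is far slower than the O(n log n + z) and O(n + z) methods the paper reports, and is meant only to pin down what a maximal pair with a bounded gap is. It assumes the two occurrences do not overlap (gap at least 0 by default); the function name and the (i, j, length) output format are hypothetical.

```python
from typing import Optional


def maximal_pairs(
    s: str, lower: int = 0, upper: Optional[int] = None
) -> list[tuple[int, int, int]]:
    """All maximal pairs (i, j, length) in s with lower <= gap <= upper.

    s[i:i+length] == s[j:j+length] with i < j, and the gap is j - (i + length).
    Brute force, for illustration only.
    """
    n = len(s)
    pairs = []
    for i in range(n):
        for j in range(i + 1, n):
            # Left-maximal: the characters preceding both occurrences differ
            # (or the first occurrence starts the string).
            if i > 0 and s[i - 1] == s[j - 1]:
                continue
            # Extend to the right as far as possible; stopping at the first
            # mismatch (or the end of s) makes the pair right-maximal.
            length = 0
            while j + length < n and s[i + length] == s[j + length]:
                length += 1
            if length == 0:
                continue
            gap = j - (i + length)
            if gap >= lower and (upper is None or gap <= upper):
                pairs.append((i, j, length))
    return pairs


print(maximal_pairs("abab", upper=0))
```

Setting both bounds to zero restricts the output to maximal pairs whose two occurrences are adjacent, which is the tandem-repeat special case mentioned at the end of the abstract.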