Results 1–10 of 59
Probabilistic and Statistical Properties of Words: An Overview
 Journal of Computational Biology
, 2000
Abstract

Cited by 99 (3 self)
In the following, an overview is given on statistical and probabilistic properties of words, as occurring in the analysis of biological sequences. Counts of occurrence, counts of clumps, and renewal counts are distinguished, and exact distributions as well as normal approximations, Poisson process approximations, and compound Poisson approximations are derived. Here, a sequence is modelled as a stationary ergodic Markov chain; a test for determining the appropriate order of the Markov chain is described. The convergence results take the error made by estimating the Markovian transition probabilities into account. The main tools involved are moment generating functions, martingales, Stein's method, and the Chen-Stein method. Similar results are given for occurrences of multiple patterns, and, as an example, the problem of unique recoverability of a sequence from SBH chip data is discussed. Special emphasis lies on disentangling the complicated dependence structure between word occurrences, due to self-overlap as well as due to overlap between words. The results can be used to derive approximate, and conservative, confidence intervals for tests. Key words: word counts, renewal counts, Markov model, exact distribution, normal approximation, Poisson process approximation, compound Poisson approximation, occurrences of multiple words, sequencing by hybridization, martingales, moment generating functions, Stein's method, Chen-Stein method.
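The distinction the abstract draws between occurrence counts and clump counts (a clump being a maximal run of overlapping occurrences of the same word) can be illustrated with a small sketch; the helper names are hypothetical, not from the paper:

```python
def occurrences(text, word):
    """All (possibly overlapping) start positions of word in text."""
    return [i for i in range(len(text) - len(word) + 1)
            if text[i:i + len(word)] == word]

def clumps(text, word):
    """Number of clumps: maximal runs of overlapping occurrences."""
    count = 0
    prev_end = -1
    for s in occurrences(text, word):
        if s > prev_end:        # does not overlap the previous occurrence
            count += 1
        prev_end = s + len(word) - 1
    return count

# "AA" in "AAAA": three overlapping occurrences, but a single clump.
print(len(occurrences("AAAA", "AA")), clumps("AAAA", "AA"))  # 3 1
```

Self-overlapping words like "AA" are exactly the case where occurrence counts and clump counts diverge, which is the dependence structure the paper analyzes.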
Comparative Experiments on Learning Information Extractors for Proteins and their Interactions
, 2004
Abstract

Cited by 98 (7 self)
Automatically extracting information from biomedical text holds the promise of easily consolidating large amounts of biological knowledge in computer-accessible form. This strategy is particularly attractive for extracting data relevant to genes of the human genome from the 11 million abstracts in Medline. However, extraction efforts have been frustrated by the lack of conventions for describing human genes and proteins. We have developed and evaluated a variety of learned information extraction systems for identifying human protein names in Medline abstracts and subsequently extracting information on interactions between the proteins. We demonstrate that machine learning approaches using support vector machines and maximum entropy are able to identify human proteins with higher accuracy than several previous approaches. We also demonstrate that various rule induction methods are able to identify protein interactions with higher precision than manually-developed rules.
Nearest Common Ancestors: A survey and a new distributed algorithm
, 2002
Abstract

Cited by 90 (12 self)
Several papers describe linear time algorithms to preprocess a tree, such that one can answer subsequent nearest common ancestor queries in constant time. Here, we survey these algorithms and related results. A common idea used by all the algorithms for the problem is that a solution for complete balanced binary trees is straightforward. Furthermore, for complete balanced binary trees we can easily solve the problem in a distributed way by labeling the nodes of the tree such that from the labels of two nodes alone one can compute the label of their nearest common ancestor. Whether it is possible to distribute the data structure into short labels associated with the nodes is important for several applications such as routing. Therefore, related labeling problems have received a lot of attention recently.
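The "straightforward" distributed solution for complete balanced binary trees that the abstract alludes to can be sketched with heap-style labels (root 1, children of node v labeled 2v and 2v + 1): the label of the nearest common ancestor is computable from the two labels alone. A sketch, not the survey's own presentation:

```python
def nca(u, v):
    """Nearest common ancestor in a complete binary tree under heap
    labeling: root = 1, children of v are 2v and 2v + 1. Since a node
    at depth d has a label in [2^d, 2^(d+1)), the larger label is
    never shallower, so we repeatedly move it to its parent."""
    while u != v:
        if u > v:
            u //= 2     # move the deeper (larger-label) node up
        else:
            v //= 2
    return u

print(nca(4, 5))   # both children of node 2 -> 2
print(nca(4, 3))   # -> 1 (the root)
```

This is exactly the kind of short-label scheme that makes the distributed setting (e.g. routing) tractable: no global data structure is consulted, only the two labels.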
A General Edit Distance between RNA Structures
, 2001
Abstract

Cited by 86 (0 self)
Arc-annotated sequences are useful in representing the structural information of RNA sequences.
A General Model for Authenticated Data Structures
 Algorithmica
, 2001
Abstract

Cited by 70 (1 self)
Query answers from online databases can easily be corrupted by hackers or malicious database publishers. Thus it is important to provide mechanisms which allow clients to trust the results from online queries. Authentic publication is a novel approach which allows untrusted publishers to securely answer queries from clients on behalf of trusted offline data owners. Publishers validate answers using compact, hard-to-forge verification objects (VOs), which clients can check efficiently. This approach provides greater scalability (by adding more publishers) and better security (online publishers don't need to be trusted).
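A minimal Merkle-tree sketch of the VO idea, assuming SHA-256 and a power-of-two number of records; the owner signs only the root, the publisher returns the answer plus sibling hashes, and the client recomputes the root. The paper's general model covers far more than this toy:

```python
import hashlib

def h(b):
    return hashlib.sha256(b).digest()

def build_tree(leaves):
    """Merkle tree over the records; levels[0] = leaf hashes,
    levels[-1] = [root]. Assumes len(leaves) is a power of two."""
    level = [h(x) for x in leaves]
    levels = [level]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def make_vo(levels, idx):
    """Verification object: the sibling hash at each level on the
    path from leaf idx to the root."""
    vo = []
    for level in levels[:-1]:
        sib = idx ^ 1
        vo.append((sib < idx, level[sib]))   # (sibling is on the left?, hash)
        idx //= 2
    return vo

def verify(leaf, vo, root):
    """Client-side check: recompute the root from answer + VO."""
    d = h(leaf)
    for sib_is_left, sib in vo:
        d = h(sib + d) if sib_is_left else h(d + sib)
    return d == root

leaves = [b"a", b"b", b"c", b"d"]
levels = build_tree(leaves)
root = levels[-1][0]                 # this is what the owner signs
vo = make_vo(levels, 2)
print(verify(b"c", vo, root))        # True: the answer checks out
print(verify(b"x", vo, root))        # False: a forged answer fails
```

The VO has logarithmic size in the number of records, which is why the publisher can stay untrusted without the client paying much verification cost.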
A Subquadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices
, 2002
Abstract

Cited by 70 (4 self)
The classical algorithm for computing the similarity between two sequences [36, 39] uses a dynamic programming matrix, and compares two strings of size n in O(n²) time. We address the challenge of computing the similarity of two strings in subquadratic time, for metrics which use a scoring matrix of unrestricted weights. Our algorithm applies to both local and global alignment computations. The speedup is achieved by dividing the dynamic programming matrix into variable-sized blocks, as induced by Lempel-Ziv parsing of both strings, and utilizing the inherent periodic nature of both strings. This leads to an O(n²/log n) algorithm for an input of constant alphabet size. For most texts, the time complexity is actually O(hn²/log n), where h ≤ 1 is the entropy of the text.
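The classical O(n²) dynamic-programming baseline that the paper accelerates can be sketched as follows (Needleman-Wunsch-style global alignment with an arbitrary substitution matrix; the unit match/mismatch scores are illustrative, not the paper's choice):

```python
def global_alignment_score(a, b, score, gap):
    """Fill the full (n+1) x (m+1) dynamic programming matrix.
    score[x][y] is an unrestricted substitution weight; gap is the
    per-character gap penalty. Runs in O(n*m) time -- the quadratic
    cost the paper's Lempel-Ziv block decomposition improves on."""
    n, m = len(a), len(b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + gap
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = max(D[i - 1][j - 1] + score[a[i - 1]][b[j - 1]],
                          D[i - 1][j] + gap,     # gap in b
                          D[i][j - 1] + gap)     # gap in a
    return D[n][m]

# Unit scores just for illustration: +1 match, -1 mismatch, -1 gap.
sc = {x: {y: (1 if x == y else -1) for y in "ACGT"} for x in "ACGT"}
print(global_alignment_score("ACGT", "AGT", sc, -1))  # 2 (A-GT vs ACGT)
```

Because each cell depends only on its three neighbors, identical blocks of the matrix (as exposed by Lempel-Ziv parsing) can share work, which is the source of the log n speedup.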
Algorithms for Phylogenetic Footprinting
 JOURNAL OF COMPUTATIONAL BIOLOGY
, 2001
Abstract

Cited by 63 (3 self)
Phylogenetic footprinting is a technique that identifies regulatory elements by finding unusually well conserved regions in a set of orthologous noncoding DNA sequences from multiple species. In an earlier paper, we presented an exact algorithm that identifies the most conserved region of a set of sequences. Here, we present a number of algorithmic improvements that produce a 1000-fold speedup over the original algorithm. We also show how prior knowledge can be used to identify weaker motifs, and how to handle data sets in which only an unknown subset of the sequences contain the regulatory element. Each technique is implemented and successfully identifies a large number of known binding sites, as well as several highly conserved but uncharacterized regions.
Dynamic LCA queries on trees
 SIAM Journal on Computing
, 1999
Abstract

Cited by 59 (0 self)
We show how to maintain a data structure on trees which allows for the following operations, all in worst-case constant time: 1. insertion of leaves and internal nodes, 2. deletion of leaves, 3. deletion of internal nodes with only one child, 4. determining the least common ancestor of any two nodes. We also generalize the Dietz–Sleator "cup-filling" scheduling methodology, which may be of independent interest.
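For contrast, here is a naive parent-pointer baseline for the query side of the problem; LCA here costs O(depth) per query rather than the worst-case constant time the paper achieves, and deletions are not supported. A sketch with hypothetical names, not the paper's structure:

```python
class Node:
    """Tree node with a parent pointer and an explicit depth.
    Inserting a leaf is trivial; the hard part the paper solves is
    keeping LCA constant-time under insertions and deletions."""
    def __init__(self, parent=None):
        self.parent = parent
        self.depth = 0 if parent is None else parent.depth + 1

def lca(u, v):
    """O(depth) baseline: lift the deeper node, then walk up in lockstep."""
    while u.depth > v.depth:
        u = u.parent
    while v.depth > u.depth:
        v = v.parent
    while u is not v:
        u, v = u.parent, v.parent
    return u

root = Node()
a, b = Node(root), Node(root)
c = Node(a)
print(lca(c, b) is root)   # True
```

Comparing this O(depth) walk with the paper's constant-time bound makes the difficulty of the dynamic setting concrete.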
Efficient implementation of lazy suffix trees
, 1999
Abstract

Cited by 51 (5 self)
We present an efficient implementation of a write-only top-down construction for suffix trees. Our implementation is based on a new, space-efficient representation of suffix trees which requires only 12 bytes per input character in the worst case, and 8.5 bytes per input character on average for a collection of files of different types. We show how to efficiently implement the lazy evaluation of suffix trees such that a subtree is not evaluated until it is traversed for the first time. Our experiments show that for the problem of searching many exact patterns in a fixed input string, the lazy top-down construction is often faster and more space efficient than other methods.
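The lazy-evaluation idea can be sketched with an uncompacted suffix trie whose children are built only when a node is first traversed; a toy illustration of the on-demand principle, not the paper's write-only, 12-bytes-per-character representation:

```python
class LazyNode:
    """A node records the suffix start positions passing through it;
    its children stay unevaluated (None) until first traversal."""
    def __init__(self, suffixes):
        self.suffixes = suffixes
        self.children = None

def children(node, text, depth):
    """Evaluate a node's children on demand, grouping its suffixes
    by their character at the given depth."""
    if node.children is None:
        groups = {}
        for s in node.suffixes:
            if s + depth < len(text):
                groups.setdefault(text[s + depth], []).append(s)
        node.children = {c: LazyNode(ss) for c, ss in groups.items()}
    return node.children

def find(text, pattern):
    """Start positions of pattern in text, expanding only the trie
    nodes on the pattern's path -- the rest of the tree is never built."""
    node = LazyNode(list(range(len(text))))
    for depth, ch in enumerate(pattern):
        kids = children(node, text, depth)
        if ch not in kids:
            return []
        node = kids[ch]
    return sorted(node.suffixes)

print(find("banana", "an"))   # [1, 3]
```

For the paper's use case of searching many patterns against one fixed string, this is exactly why laziness pays: only the subtrees some query actually touches are ever constructed.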
Matching Planar Maps
, 2003
Abstract

Cited by 50 (15 self)
The subject of this paper is algorithms for measuring the similarity of patterns of line segments in the plane, a standard problem in, e.g., computer vision and geographic information systems. More precisely, we define feasible distance measures that reflect how close a given pattern H is to some part of a larger pattern G. These distance measures are generalizations of the well-known Fréchet distance for curves. We first give an efficient algorithm for the case that H is a polygonal curve and G is a geometric graph. Then, slightly relaxing the definition of distance measure, we give an algorithm for the general case where both H and G are geometric graphs.
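A common simplification of the Fréchet distance for curves, which the paper's measures generalize, is the discrete Fréchet distance between two point sequences (the Eiter-Mannila recurrence); a sketch for polygonal curves only, not the paper's graph-matching algorithms:

```python
from functools import lru_cache
from math import dist

def discrete_frechet(P, Q):
    """Discrete Frechet distance between point sequences P and Q:
    the smallest leash length for two walkers stepping forward along
    their curves. c(i, j) is the optimal leash when the walkers stand
    at P[i] and Q[j]."""
    @lru_cache(maxsize=None)
    def c(i, j):
        d = dist(P[i], Q[j])
        if i == 0 and j == 0:
            return d
        if i == 0:
            return max(c(0, j - 1), d)
        if j == 0:
            return max(c(i - 1, 0), d)
        # Either walker (or both) advanced to reach (i, j).
        return max(min(c(i - 1, j), c(i - 1, j - 1), c(i, j - 1)), d)
    return c(len(P) - 1, len(Q) - 1)

print(discrete_frechet(((0, 0), (1, 0)), ((0, 1), (1, 1))))  # 1.0
```

The continuous Fréchet distance replaces the discrete steps with monotone reparametrizations of the curves; extending such measures from a curve H to a graph G is the step the paper takes.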