Results 1  10
of
144
Nearoptimal hashing algorithms for approximate nearest neighbor in high dimensions
, 2008
"... In this article, we give an overview of efficient algorithms for the approximate and exact nearest neighbor problem. The goal is to preprocess a dataset of objects (e.g., images) so that later, given a new query object, one can quickly return the dataset object that is most similar to the query. The ..."
Abstract

Cited by 237 (4 self)
 Add to MetaCart
In this article, we give an overview of efficient algorithms for the approximate and exact nearest neighbor problem. The goal is to preprocess a dataset of objects (e.g., images) so that later, given a new query object, one can quickly return the dataset object that is most similar to the query. The problem is of significant interest in a wide variety of areas.
Probabilistic discovery of time series motifs
, 2003
"... Several important time series data mining problems reduce to the core task of finding approximately repeated subsequences in a longer time series. In an earlier work, we formalized the idea of approximately repeated subsequences by introducing the notion of time series motifs. Two limitations of thi ..."
Abstract

Cited by 119 (21 self)
 Add to MetaCart
Several important time series data mining problems reduce to the core task of finding approximately repeated subsequences in a longer time series. In an earlier work, we formalized the idea of approximately repeated subsequences by introducing the notion of time series motifs. Two limitations of this work were the poor scalability of the motif discovery algorithm, and the inability to discover motifs in the presence of noise. Here we address these limitations by introducing a novel algorithm inspired by recent advances in the problem of pattern discovery in biosequences. Our algorithm is probabilistic in nature, but as we show empirically and theoretically, it can find time series motifs with very high probability even in the presence of noise or “don’t care ” symbols. Not only is the algorithm fast, but it is an anytime algorithm, producing likely candidate motifs almost immediately, and gradually improving the quality of results over time.
Modeling Dependencies in ProteinDNA Binding Sites
, 2003
"... The availability of whole genome sequences and highthroughput genomic assays opens the door for in silico analysis of transcription regulation. This includes methods for discovering and characterizing the binding sites of DNAbinding proteins, such as transcription factors. A common representation ..."
Abstract

Cited by 79 (2 self)
 Add to MetaCart
The availability of whole genome sequences and highthroughput genomic assays opens the door for in silico analysis of transcription regulation. This includes methods for discovering and characterizing the binding sites of DNAbinding proteins, such as transcription factors. A common representation of transcription factor binding sites is aposition specific score matrix (PSSM). This representation makes the strong assumption that binding site positions are independent of each other. In this work, we explore Bayesian network representations of binding sites that provide different tradeoffs between complexity (number of parameters) and the richness of dependencies between positions. We develop the formal machinery for learning such models from data and for estimating the statistical significance of putative binding sites. We then evaluate the ramifications of these richer representations in characterizing binding site motifs and predicting their genomic locations. We show that these richer representations improve over the PSSM model in both tasks.
Additivity in protein–DNA interactions: how good an approximation is it
 Nucleic Acids Res
, 2002
"... Man and Stormo and Bulyk et al. recently presented their results on the study of the DNA binding af®nity of proteins. In both of these studies the main conclusion is that the additivity assumption, usually applied in methods to search for binding sites, is not true. In the ®rst study, the analysis o ..."
Abstract

Cited by 79 (11 self)
 Add to MetaCart
Man and Stormo and Bulyk et al. recently presented their results on the study of the DNA binding af®nity of proteins. In both of these studies the main conclusion is that the additivity assumption, usually applied in methods to search for binding sites, is not true. In the ®rst study, the analysis of binding af®nity data from the Mnt repressor protein bound to all possible DNA (sub)targets at positions 16 and 17 of the binding site, showed that those positions are not independent. In the second study, the authors analysed DNA binding af®nity data of the wildtype mouse EGR1 protein and four variants differing on the middle ®nger. The binding af®nity of these proteins was measured to all 64 possible trinucleotide (sub)targets of the middle ®nger using microarray technology. The analysis of the measurements also showed interdependence among the positions in the DNA target. In the present report, we review the data of both studies and we reanalyse them using various statistical methods, including a comparison with a multiple regression approach. We conclude that despite the fact that the additivity assumption does not ®t the data perfectly, in most cases it provides a very good approximation of the true nature of the speci®c protein±DNA interactions. Therefore, additive models can be very useful for the discovery and prediction of binding sites in genomic DNA.
Nearest Neighbors In HighDimensional Spaces
, 2004
"... In this chapter we consider the following problem: given a set P of points in a highdimensional space, construct a data structure which given any query point q nds the point in P closest to q. This problem, called nearest neighbor search is of significant importance to several areas of computer sci ..."
Abstract

Cited by 76 (2 self)
 Add to MetaCart
In this chapter we consider the following problem: given a set P of points in a highdimensional space, construct a data structure which given any query point q nds the point in P closest to q. This problem, called nearest neighbor search is of significant importance to several areas of computer science, including pattern recognition, searching in multimedial data, vector compression [GG91], computational statistics [DW82], and data mining. Many of these applications involve data sets which are very large (e.g., a database containing Web documents could contain over one billion documents). Moreover, the dimensionality of the points is usually large as well (e.g., in the order of a few hundred). Therefore, it is crucial to design algorithms which scale well with the database size as well as with the dimension. The nearestneighbor problem is an example of a large class of proximity problems, which, roughly speaking, are problems whose definitions involve the notion of...
Efficient LargeScale Sequence Comparison by LocalitySensitive Hashing
 Bioinformatics
, 2001
"... Motivation: Comparison of multimegabase genomic DNA sequences is a popular technique for finding and annotating conserved genome features. Performing such comparisons entails finding many short local alignments between sequences up to tens of megabases in length. To process such long sequences e#cie ..."
Abstract

Cited by 73 (6 self)
 Add to MetaCart
Motivation: Comparison of multimegabase genomic DNA sequences is a popular technique for finding and annotating conserved genome features. Performing such comparisons entails finding many short local alignments between sequences up to tens of megabases in length. To process such long sequences e#ciently, existing algorithms find alignments by expanding around short runs of matching bases with no substitutions or other di#erences. Unfortunately, exact matches that are short enough to occur often in significant alignments also occur frequently by chance in the background sequence. Thus, these algorithms must trade o# between e#ciency and sensitivity to features without long exact matches. Results: We introduce a new algorithm, lshallpairs, to find ungapped local alignments in genomic sequence with up to a specified fraction of substitutions. The length and substitution rate of these alignments can be chosen so that they appear frequently in significant similarities yet still remain rare in the background sequence. The algorithm finds ungapped alignments e#ciently using a randomized search technique, localitysensitive hashing. We have found lshallpairs to be both e#cient and sensitive for finding local similarities with as little as 63% identity in mammalian genomic sequences up to tens of megabases in length. Availability: Contact the author at the address below. Contact: jbuhler@cs.washington.edu Supplementary Information: the sequences and local alignment data described in this work are available at http://bio.cs.washington.edu/jbuhlerbioinformatics2001/. Keywords: local alignment, genome annotation, localitysensitive hashing Sequence Comparison by LocalitySensitive Hashing 1
Finding Motifs in Time Series
, 2002
"... The problem of efficiently locating previously known patterns in a time series database (i.e., query by content) has received much attention and may now largely be regarded as a solved problem. However, from a knowledge discovery viewpoint, a more interesting problem is the enumeration of previously ..."
Abstract

Cited by 72 (15 self)
 Add to MetaCart
The problem of efficiently locating previously known patterns in a time series database (i.e., query by content) has received much attention and may now largely be regarded as a solved problem. However, from a knowledge discovery viewpoint, a more interesting problem is the enumeration of previously unknown, frequently occurring patterns. We call such patterns "motifs," because of their close analogy to their discrete counterparts in computation biology. An efficient motif discovery algorithm for time series would be useful as a tool for summarizing and visualizing massive time series databases. In addition, it could be used as a subroutine in various other data mining tasks, including the discovery of association rules, clustering and classification. In this work we carefully motivate, then introduce, a nontrivial definition of time series motifs. We propose an efficient algorithm to discover them, and we demonstrate the utility and efficiency of our approach on several real world datasets.
Finding composite regulatory patterns in DNA sequences
 Bioinformatics
, 2002
"... Pattern discovery in unaligned DNA sequences is a fundamental problem in computational biology with important applications in finding regulatory signals. Current approaches to pattern discovery focus on monad patterns that correspond to relatively short contiguous strings. However, many of the actua ..."
Abstract

Cited by 71 (3 self)
 Add to MetaCart
Pattern discovery in unaligned DNA sequences is a fundamental problem in computational biology with important applications in finding regulatory signals. Current approaches to pattern discovery focus on monad patterns that correspond to relatively short contiguous strings. However, many of the actual regulatory signals are composite patterns that are groups of monad patterns that occur near each other. A difficulty in discovering composite patterns is that one or both of the component monad patterns in the group may be “too weak”. Since the traditional monadbased motif finding algorithms usually output one (or a few) high scoring patterns, they often fail to find composite regulatory signals consisting of weak monad parts. In this paper, we present a MITRA (MIsmatch TRee Algorithm) approach for discovering composite signals. We demonstrate that MITRA performs well for both monad and composite patterns by presenting experiments over biological and synthetic data. Availability: MITRA is available at
From Promoter Sequence to Expression: A Probabilistic Framework
, 2002
"... We present a probabilistic framework that models the process by which transcriptional binding explains the mRNA expression of different genes. Our joint probabilistic model unifies the two key components of this process: the prediction of gene regulation events from sequence motifs in the gene's pro ..."
Abstract

Cited by 55 (5 self)
 Add to MetaCart
We present a probabilistic framework that models the process by which transcriptional binding explains the mRNA expression of different genes. Our joint probabilistic model unifies the two key components of this process: the prediction of gene regulation events from sequence motifs in the gene's promoter region, and the prediction of mRNA expression from combinations of gene regulation events in different settings. Our approach has several advantages. By learning promoter sequence motifs that are directly predictive of expression data, it can improve the identification of binding site patterns. It is also able to identify combinatorial regulation via interactions of different transcription factors. Finally, the general framework allows us to integrate additional data sources, including data from the recent binding localization assays. We demonstrate our approach on the cell cycle data of Spellman et al., combined with the binding localization information of Simon et al. We show that the learned model predicts expression from sequence, and that it identifies coherent coregulated groups with significant transcription factor motifs. It also provides valuable biological insight into the domain via these coregulated "modules" and the combinatorial regulation effects that govern their behavior.
Experiencing SAX: A Novel Symbolic Representation of Time Series. Data Mining and Knowledge Discovery Journal
, 2007
"... Abstract Many high level representations of time series have been proposed for data mining, including Fourier transforms, wavelets, eigenwaves, piecewise polynomial models, etc. Many researchers have also considered symbolic representations of time series, noting that such representations would pote ..."
Abstract

Cited by 51 (13 self)
 Add to MetaCart
Abstract Many high level representations of time series have been proposed for data mining, including Fourier transforms, wavelets, eigenwaves, piecewise polynomial models, etc. Many researchers have also considered symbolic representations of time series, noting that such representations would potentiality allow researchers to avail of the wealth of data structures and algorithms from the text processing and bioinformatics communities. While many symbolic representations of time series have been introduced over the past decades, they all suffer from two fatal flaws. First, the dimensionality of the symbolic representation is the same as the original data, and virtually all data mining algorithms scale poorly with dimensionality. Second, although distance measures can be defined on the symbolic approaches, these distance measures have little correlation with distance measures defined on the original time series. In this work we formulate a new symbolic representation of time series. Our representation is unique in that it allows dimensionality/numerosity reduction,