Mining Sequential Patterns
, 1995
We are given a large database of customer transactions, where each transaction consists of customer-id, transaction time, and the items bought in the transaction. We introduce the problem of mining sequential patterns over such databases. We present three algorithms to solve this problem, and empirically evaluate their performance using synthetic data. Two of the proposed algorithms, AprioriSome and AprioriAll, have comparable performance, although AprioriSome performs a little better when the minimum number of customers that must support a sequential pattern is low. Scale-up experiments show that both AprioriSome and AprioriAll scale linearly with the number of customer transactions. They also have excellent scale-up properties with respect to the number of transactions per customer and the number of items in a transaction. 1 Introduction. Database mining is motivated by the decision support problem faced by most large retail organizations. Progress in bar-code technology has made it po...
Mining Sequential Patterns: Generalizations and Performance Improvements
 Research Report RJ 9994, IBM Almaden Research
, 1995
Abstract. The problem of mining sequential patterns was recently introduced in [3]. We are given a database of sequences, where each sequence is a list of transactions ordered by transaction-time, and each transaction is a set of items. The problem is to discover all sequential patterns with a user-specified minimum support, where the support of a pattern is the number of data-sequences that contain the pattern. An example of a sequential pattern is “5% of customers bought ‘Foundation’ and ‘Ringworld’ in one transaction, followed by ‘Second Foundation’ in a later transaction”. We generalize the problem as follows. First, we add time constraints that specify a minimum and/or maximum time period between adjacent elements in a pattern. Second, we relax the restriction that the items in an element of a sequential pattern must come from the same transaction, instead allowing the items to be present in a set of transactions whose transaction-times are within a user-specified time window. Third, given a user-defined taxonomy (is-a hierarchy) on items, we allow sequential patterns to include items across all levels of the taxonomy. We present GSP, a new algorithm that discovers these generalized sequential patterns. Empirical evaluation using synthetic and real-life data indicates that GSP is much faster than the AprioriAll algorithm presented in [3]. GSP scales linearly with the number of data-sequences, and has very good scale-up properties with respect to the average data-sequence size.
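The definition of support in the abstract above can be illustrated with a minimal Python sketch. It covers only the base case (each pattern element must be contained in a single transaction, and elements must occur in order); the time constraints, sliding window, and taxonomy generalizations are not modeled, and it is not the GSP algorithm. The names `contains` and `support` and the sample data are assumptions made for illustration.

```python
from typing import FrozenSet, Sequence

Element = FrozenSet[str]  # one transaction, or one element of a pattern


def contains(data_sequence: Sequence[Element], pattern: Sequence[Element]) -> bool:
    """True if each pattern element is a subset of a distinct, later transaction."""
    pos = 0
    for element in pattern:
        # Greedily take the earliest transaction that covers this element.
        while pos < len(data_sequence) and not element <= data_sequence[pos]:
            pos += 1
        if pos == len(data_sequence):
            return False
        pos += 1  # the next element must come from a strictly later transaction
    return True


def support(database: Sequence[Sequence[Element]], pattern: Sequence[Element]) -> int:
    """Number of data-sequences in the database that contain the pattern."""
    return sum(1 for seq in database if contains(seq, pattern))


# Hypothetical data: three customers, transactions already ordered by time.
database = [
    [frozenset({"Foundation", "Ringworld"}), frozenset({"Second Foundation"})],
    [frozenset({"Foundation"}), frozenset({"Ringworld"}), frozenset({"Second Foundation"})],
    [frozenset({"Second Foundation"}), frozenset({"Foundation", "Ringworld"})],
]
pattern = [frozenset({"Foundation", "Ringworld"}), frozenset({"Second Foundation"})]
print(support(database, pattern))  # prints 1
```

Only the first customer supports the pattern: the second bought the two books in different transactions, and the third bought them in the wrong order.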
Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions
, 2008
In this article, we give an overview of efficient algorithms for the approximate and exact nearest neighbor problem. The goal is to preprocess a dataset of objects (e.g., images) so that later, given a new query object, one can quickly return the dataset object that is most similar to the query. The problem is of significant interest in a wide variety of areas.
Finding motifs using random projections
, 2001
Pevzner and Sze [23] considered a precise version of the motif discovery problem and simultaneously issued an algorithmic challenge: find a motif M of length 15, where each planted instance differs from M in 4 positions. Whereas previous algorithms all failed to solve this (15,4)-motif problem, Pevzner and Sze introduced algorithms that succeeded. However, their algorithms failed to solve the considerably more difficult (14,4)-, (16,5)-, and (18,6)-motif problems. We introduce a novel motif discovery algorithm based on the use of random projections of the input’s substrings. Experiments on simulated data demonstrate that this algorithm performs better than existing algorithms and, in particular, typically solves the difficult (14,4)-, (16,5)-, and (18,6)-motif problems quite efficiently. A probabilistic estimate shows that the small values of d for which the algorithm fails to recover the planted (l,d)-motif are in all likelihood inherently impossible to solve. We also present experimental results on realistic biological data by identifying ribosome binding sites in prokaryotes as well as a number of known transcriptional regulatory motifs in eukaryotes. 1. CHALLENGING MOTIF PROBLEMS. Pevzner and Sze [23] considered a very precise version of the motif discovery problem of computational biology, which had also been considered by Sagot [26]. Based on this formulation, they issued an algorithmic challenge: Planted (l,d)-Motif Problem: Suppose there is a fixed but unknown nucleotide sequence M (the motif) of length l. The problem is to determine M, given t nucleotide sequences each of length n, and each containing a planted variant of M. More precisely, each such planted variant is a substring that is M with exactly d point substitutions. One instantiation that they labeled “The Challenge Problem” was parameterized as finding a planted (15,4)-motif in t = 20 sequences each of length n = 600. These values of n, t, and l are
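As a rough illustration of what "random projections of the input's substrings" can mean, the Python sketch below projects every length-l substring onto k chosen positions and groups substrings by the projected string. Planted variants whose substitutions all fall outside the chosen positions land in the same bucket. This is an assumed reading of the abstract, not the paper's full algorithm (no bucket thresholds, refinement, or repeated trials), and all function names are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def choose_positions(l: int, k: int, rng: random.Random) -> Tuple[int, ...]:
    """Pick k distinct positions out of 0..l-1 uniformly at random."""
    return tuple(sorted(rng.sample(range(l), k)))


def project(lmer: str, positions: Sequence[int]) -> str:
    """Keep only the characters of lmer at the chosen positions."""
    return "".join(lmer[p] for p in positions)


def bucket_lmers(
    sequences: Sequence[str], l: int, positions: Sequence[int]
) -> Dict[str, List[Tuple[int, int]]]:
    """Map each projected key to the (sequence index, offset) of its l-mers."""
    buckets: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for s_idx, seq in enumerate(sequences):
        for off in range(len(seq) - l + 1):
            buckets[project(seq[off:off + l], positions)].append((s_idx, off))
    return buckets


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


# Toy input: both sequences contain a variant of ACGTACGT at offset 2.
sequences = ["TTACGTACGTTT", "GGACCTACGAGG"]
positions = (0, 1, 4, 5)  # fixed here for reproducibility
buckets = bucket_lmers(sequences, 8, positions)
print(buckets["ACAC"])  # prints [(0, 2), (1, 2)]
```

In practice the positions would come from `choose_positions` with a fresh random draw per trial; they are fixed above so the output is reproducible.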
Fast Similarity Search in the Presence of Noise, Scaling, and Translation in Time-Series Databases
 In VLDB
, 1995
We introduce a new model of similarity of time sequences that captures the intuitive notion that two sequences should be considered similar if they have enough non-overlapping time-ordered pairs of subsequences that are similar. The model allows the amplitude of one of the two sequences to be scaled by any suitable amount and its offset adjusted appropriately. Two subsequences are considered similar if one can be enclosed within an envelope of a specified width drawn around the other. The model also allows non-matching gaps in the matching subsequences. The matching subsequences need not be aligned along the time axis. Given this model of similarity, we present fast search techniques for discovering all similar sequences in a set of sequences. These techniques can also be used to find all (sub)sequences similar to a given sequence. We applied this matching system to the U.S. mutual funds data and discovered interesting matches.
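The envelope test for a single pair of equal-length subsequences can be sketched in Python as follows. Min-max normalization is used here as a simple stand-in for "scaling the amplitude and adjusting the offset"; that choice, and the names `normalize` and `within_envelope`, are assumptions rather than the paper's procedure. Gaps and the stitching of several matching subsequences are not modeled.

```python
from typing import List, Sequence


def normalize(window: Sequence[float]) -> List[float]:
    """Rescale a window to the range [0, 1]; a constant window maps to zeros."""
    lo, hi = min(window), max(window)
    if hi == lo:
        return [0.0] * len(window)
    return [(x - lo) / (hi - lo) for x in window]


def within_envelope(a: Sequence[float], b: Sequence[float], width: float) -> bool:
    """True if, after normalization, b stays within +/- width of a at every point."""
    if len(a) != len(b):
        raise ValueError("windows must have the same length")
    na, nb = normalize(a), normalize(b)
    return all(abs(x - y) <= width for x, y in zip(na, nb))


# Same shape at a different scale and offset: similar.
print(within_envelope([1, 2, 3, 2], [10, 30, 50, 30], width=0.05))  # prints True
# Different shape: not similar at this envelope width.
print(within_envelope([1, 2, 3, 2], [1, 3, 2, 1], width=0.1))  # prints False
```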
PatternHunter II: Highly Sensitive and Fast Homology Search
, 2003
Extending the single optimized spaced seed of PatternHunter [20] to multiple ones, PatternHunter II simultaneously remedies the lack of sensitivity of Blastn and the lack of speed of Smith-Waterman, for homology search. At Blastn speed, PatternHunter II approaches Smith-Waterman sensitivity, bringing homology search technology back to a full circle.
Designing seeds for similarity search in genomic DNA
 Journal of Computer and System Sciences
, 2003
Abstract: Large-scale comparisons of genomic DNA are of fundamental importance in annotating functional elements in genomes. To perform large comparisons efficiently, BLAST [3, 2] and other widely used tools use seeded alignment, which compares only sequences that can be shown to share a common pattern or “seed” of matching bases. The literature suggests that the choice of seed substantially affects the sensitivity of seeded alignment, but designing and evaluating seeds is computationally challenging. This work addresses problems arising in seed design. We give the fastest known algorithm for evaluating the sensitivity of a seed in a Markov model of ungapped alignments, as well as theoretical results on which seeds are good choices. We also describe Mandala, a software tool for seed design, and show that it can be used to improve the sensitivity of alignment in practice.
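To make "sensitivity of a seed" concrete, the following Python sketch computes it by brute force for short ungapped alignments. It assumes the simplest model, in which every alignment column is a match independently with probability p (a zero-order special case of a Markov model), and enumerates all 2^length alignments. It is an exponential-time illustration, not the fast algorithm the abstract refers to. A seed is written as a string of `1` (position must match) and `0` (don't care), which is an assumed notation.

```python
from itertools import product


def seed_hits(alignment: str, seed: str) -> bool:
    """True if the seed fits somewhere in the alignment ('1' = match column)."""
    span = len(seed)
    for start in range(len(alignment) - span + 1):
        if all(alignment[start + i] == "1" for i, c in enumerate(seed) if c == "1"):
            return True
    return False


def sensitivity(seed: str, length: int, p: float) -> float:
    """Probability that a random alignment of the given length is hit by the seed."""
    total = 0.0
    for bits in product("01", repeat=length):
        alignment = "".join(bits)
        if seed_hits(alignment, seed):
            matches = alignment.count("1")
            total += p ** matches * (1 - p) ** (length - matches)
    return total


# Compare a contiguous seed and a spaced seed of the same weight.
print(sensitivity("111", 10, 0.7))
print(sensitivity("1101", 10, 0.7))
```

The enumeration is only practical for small lengths, which is why seed evaluation at realistic alignment lengths needs a more efficient method.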
q-gram based database searching using a suffix array (QUASAR)
 Proceedings of the Third Annual International Conference on Computational Molecular Biology (RECOMB 99)
, 1999
With the increasing amount of DNA sequence information deposited in public databases, searching for similarity to a query sequence has become a basic operation in molecular biology. But even today’s fast algorithms reach their limits when applied to all-versus-all comparisons of large databases. Here we present a new database searching algorithm called QUASAR (Q-gram Alignment based on Suffix ARrays) which was designed to quickly detect sequences with strong similarity to the query in a context where many searches are conducted on one database. Our algorithm applies a modification of q-tuple filtering implemented on top of a suffix array. Two versions were developed, one for a RAM-resident suffix array and one for access to the suffix array on disk. We compared our implementation with BLAST and found that our approach is an order of magnitude faster. It is, however, restricted to the search for strongly similar DNA sequences as is typically required, e.g., in the context of clustering expressed sequence tags (ESTs).
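The building block behind q-gram filtering on a suffix array is a lookup of all database positions where a given q-gram occurs. The Python sketch below shows that lookup with a naively constructed suffix array and two binary searches. It illustrates the data structure only: the construction is quadratic in the worst case, and the block-counting filter of QUASAR itself is not reproduced. Function names are hypothetical.

```python
from typing import List


def build_suffix_array(text: str) -> List[int]:
    """Start positions of all suffixes, in lexicographic order (naive build)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def qgram_positions(text: str, sa: List[int], qgram: str) -> List[int]:
    """All positions in text where qgram occurs, found by binary search on sa."""
    q = len(qgram)

    def first_index(strictly_greater: bool) -> int:
        # Smallest index whose length-q prefix is >= qgram (or > qgram).
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = text[sa[mid]:sa[mid] + q]
            if prefix < qgram or (strictly_greater and prefix == qgram):
                lo = mid + 1
            else:
                hi = mid
        return lo

    return sorted(sa[first_index(False):first_index(True)])


text = "ACGTACGA"
sa = build_suffix_array(text)
print(qgram_positions(text, sa, "ACG"))  # prints [0, 4]
```

Because all suffixes sharing a prefix are contiguous in the suffix array, the occurrences of any q-gram form one interval, so the same index serves every value of q.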
Better Filtering with Gapped q-Grams
, 2001
A popular and well-studied class of filters for approximate string matching compares substrings of length q, the q-grams, in the pattern and the text to identify text areas that contain potential matches. A generalization of the method that uses gapped q-grams instead of contiguous substrings is mentioned a few times in the literature but has never been analyzed in any depth. In this paper, we report the first results of a study on gapped q-grams. We show that gapped q-grams can provide orders of magnitude faster and/or more efficient filtering than contiguous q-grams. To achieve these results the arrangement of the gaps in the q-gram and a filter parameter called threshold have to be optimized. Both of these tasks are nontrivial combinatorial optimization problems for which we present efficient solutions. We concentrate on the k mismatches problem, i.e., approximate string matching with the Hamming distance.
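A small Python sketch of the idea for the k mismatches setting: a shape such as `##-#` selects which positions of a window form a gapped q-gram (`#` = used, `-` = gap), and a text window passes the filter if enough gapped q-grams agree with the pattern at the same offsets. The shape notation, the function names, and the example threshold are assumptions for illustration; choosing an optimal shape and threshold is the subject of the paper and is not attempted here.

```python
def gapped_qgram(s: str, start: int, shape: str) -> str:
    """Characters of s at the '#' positions of shape, placed at offset start."""
    return "".join(s[start + i] for i, c in enumerate(shape) if c == "#")


def matching_positions(pattern: str, window: str, shape: str) -> int:
    """Count offsets where pattern and window share the same gapped q-gram."""
    if len(pattern) != len(window):
        raise ValueError("pattern and window must have the same length")
    span = len(shape)
    return sum(
        gapped_qgram(pattern, i, shape) == gapped_qgram(window, i, shape)
        for i in range(len(pattern) - span + 1)
    )


def passes_filter(pattern: str, window: str, shape: str, threshold: int) -> bool:
    """A window is kept as a potential match if enough q-grams agree."""
    return matching_positions(pattern, window, shape) >= threshold


pattern = "ACGTACGT"
window = "ACGAACGT"  # one mismatch, at position 3
print(matching_positions(pattern, window, "###"))   # prints 3
print(matching_positions(pattern, window, "##-#"))  # prints 2
```

A single mismatch destroys every q-gram that covers it, so the number of surviving q-grams under k mismatches depends on the shape; that dependence is what makes the choice of gaps and threshold an optimization problem.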
YASS: enhancing the sensitivity of DNA similarity search
 NUCLEIC ACIDS RES
, 2005
YASS is a DNA local alignment tool based on an efficient and sensitive filtering algorithm. It applies transition-constrained seeds to specify the most probable conserved motifs between homologous sequences, combined with a flexible hit criterion used to identify groups of seeds that are likely to exhibit significant alignments. A web interface (http://www.loria.fr/projects/YASS/) is available to upload input sequences in FASTA format, query the program and visualize the results obtained in several forms (dotplot, tabular output and others). A standalone version is available for download from the web page.
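A transition-constrained seed can be sketched in Python as a pattern with three kinds of positions: `#` requires a match, `@` accepts a match or a transition (A/G or C/T), and `-` accepts anything. This symbol set is an assumed notation for illustration, and the code checks a single seed at a single aligned position; the multi-seed hit criterion mentioned in the abstract is not modeled.

```python
# The two transition pairs: purine <-> purine and pyrimidine <-> pyrimidine.
TRANSITIONS = {frozenset("AG"), frozenset("CT")}


def position_ok(symbol: str, x: str, y: str) -> bool:
    """Check one aligned pair of bases against one seed symbol."""
    if symbol == "#":
        return x == y
    if symbol == "@":
        return x == y or frozenset((x, y)) in TRANSITIONS
    if symbol == "-":
        return True
    raise ValueError(f"unknown seed symbol: {symbol!r}")


def seed_matches(seed: str, a: str, b: str) -> bool:
    """True if the aligned words a and b satisfy the seed at every position."""
    if not (len(seed) == len(a) == len(b)):
        raise ValueError("seed and both words must have the same length")
    return all(position_ok(s, x, y) for s, x, y in zip(seed, a, b))


print(seed_matches("#@-#", "ACGT", "ATAT"))  # prints True  (C/T is a transition)
print(seed_matches("#@-#", "ACGT", "AGAT"))  # prints False (C/G is a transversion)
```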