Results 1  10
of
21
Mining Sequential Patterns
, 1995
"... We are given a large database of customer transactions, where each transaction consists of customerid, transaction time, and the items bought in the transaction. We introduce the problem of mining sequential patterns over such databases. We present three algorithms to solve this problem, and empiri ..."
Abstract

Cited by 1173 (5 self)
 Add to MetaCart
We are given a large database of customer transactions, where each transaction consists of customerid, transaction time, and the items bought in the transaction. We introduce the problem of mining sequential patterns over such databases. We present three algorithms to solve this problem, and empirically evaluate their performance using synthetic data. Two of the proposed algorithms, AprioriSome and AprioriAll, have comparable performance, albeit AprioriSome performs a little better when the minimum number of customers that must support a sequential pattern is low. Scaleup experiments show that both AprioriSome and AprioriAll scale linearly with the number of customer transactions. They also have excellent scaleup properties with respect to the number of transactions per customer and the number of items in a transaction. 1 Introduction Database mining is motivated by the decision support problem faced by most large retail organizations. Progress in barcode technology has made it po...
Mining Sequential Patterns: Generalizations and Performance Improvements
 Research Report RJ 9994, IBM Almaden Research
, 1995
"... Abstract. The problem of mining sequential patterns was recently introduced in [3]. We are given a database of sequences, where each sequence is a list of transactions ordered by transactiontime, and each transaction is a set of items. The problem is to discover all sequential patterns with a user ..."
Abstract

Cited by 548 (4 self)
 Add to MetaCart
Abstract. The problem of mining sequential patterns was recently introduced in [3]. We are given a database of sequences, where each sequence is a list of transactions ordered by transactiontime, and each transaction is a set of items. The problem is to discover all sequential patterns with a userspeci ed minimum support, where the support of a pattern is the number of datasequences that contain the pattern. An example of a sequential pattern is \5 % of customers bought `Foundation' and `Ringworld ' in one transaction, followed by `Second Foundation ' in a later transaction". We generalize the problem as follows. First, we add time constraints that specify a minimum and/or maximum time period between adjacent elements in a pattern. Second, we relax the restriction that the items in an element of a sequential pattern must come from the same transaction, instead allowing the items to be present in a set of transactions whose transactiontimes are within a userspeci ed time window. Third, given a userde ned taxonomy (isa hierarchy) on items, we allow sequential patterns to include items across all levels of the taxonomy. We present GSP, a new algorithm that discovers these generalized sequential patterns. Empirical evaluation using synthetic and reallife data indicates that GSP is much faster than the AprioriAll algorithm presented in [3]. GSP scales linearly with the number of datasequences, and has very good scaleup properties with respect to the average datasequence size. 1
Fast Similarity Search in the Presence of Noise, Scaling, and Translation in TimeSeries Databases
 In VLDB
, 1995
"... We introduce a new model of similarity of time sequences that captures the intuitive notion that two sequences should be considered similar if they have enough nonoverlapping timeordered pairs of subsequences thar are similar. The model allows the amplitude of one of the two sequences to be scaled ..."
Abstract

Cited by 198 (6 self)
 Add to MetaCart
We introduce a new model of similarity of time sequences that captures the intuitive notion that two sequences should be considered similar if they have enough nonoverlapping timeordered pairs of subsequences thar are similar. The model allows the amplitude of one of the two sequences to be scaled by any suitable amount and its offset adjusted appropriately. Two subsequences are considered similar if one can be enclosed within an envelope of a specified width drawn around the other. The model also allows nonmatching gaps in the matching subsequences. The matching subsequences need not be aligned along the time axis. Given this model of similarity,we present fast search techniques for discovering all similar sequences in a set of sequences. These techniques can also be used to find all (sub)sequences similar to a given sequence. We applied this matching system to the U.S. mutual funds data and discovered interesting matches.
Combinatorial pattern discovery in biological sequences: the TEIRESIAS algorithm
 BIOINFORMATICS
, 1998
"... Motivation: The discovery of motifs in biological sequences is an important problem. Results: This paper presents a new algorithm for the discovery of rigid patterns (motifs) in biological sequences. Our method is combinatorial in nature and able to produce all patterns that appear in at least a (us ..."
Abstract

Cited by 165 (11 self)
 Add to MetaCart
Motivation: The discovery of motifs in biological sequences is an important problem. Results: This paper presents a new algorithm for the discovery of rigid patterns (motifs) in biological sequences. Our method is combinatorial in nature and able to produce all patterns that appear in at least a (userdefined) minimum number of sequences, yet it manages to be very efficient by avoiding the enumeration of the entire pattern space. Furthermore, the reported patterns are maximal: any reported pattern cannot be made more specific and still keep on appearing at the exact same positions within the input sequences. The effectiveness of the proposed approach is showcased on a number of test cases which aim to: (i) validate the approach through the discovery of previously reported patterns; (ii) demonstrate the capability to identify automatically highly selective patterns particular to the sequences under consideration. Finally, experimental analysis indicates that the algorithm is output sensitive, i.e. its running time is quasilinear to the size of the generated output.
Approaches to the Automatic Discovery of Patterns in Biosequences
, 1995
"... This paper is a survey of approaches and algorithms used for the automatic discovery of patterns in biosequences. Patterns with the expressive power in the class of regular languages are considered, and a classification of pattern languages in this class is developed, covering those patterns which a ..."
Abstract

Cited by 138 (21 self)
 Add to MetaCart
This paper is a survey of approaches and algorithms used for the automatic discovery of patterns in biosequences. Patterns with the expressive power in the class of regular languages are considered, and a classification of pattern languages in this class is developed, covering those patterns which are the most frequently used in molecular bioinformatics. A formulation is given of the problem of the automatic discovery of such patterns from a set of sequences, and an analysis presented of the ways in which an assessment can be made of the significance and usefulness of the discovered patterns. It is shown that this problem is related to problems studied in the field of machine learning. The largest part of this paper comprises a review of a number of existing methods developed to solve this problem and how these relate to each other, focusing on the algorithms underlying the approaches. A comparison is given of the algorithms, and examples are given of patterns that have been discovered...
Efficient Discovery of Conserved Patterns Using a Pattern Graph
 Comput. Appl. Biosci
, 1997
"... Motivation: We have previously reported an algorithm for discovering patterns conserved in sets of related unaligned protein sequences. The algorithm was implemented in a program called Pratt. Pratt allows the user to define a class of patterns (e.g. the degree of ambiguity allowed and the length an ..."
Abstract

Cited by 66 (8 self)
 Add to MetaCart
Motivation: We have previously reported an algorithm for discovering patterns conserved in sets of related unaligned protein sequences. The algorithm was implemented in a program called Pratt. Pratt allows the user to define a class of patterns (e.g. the degree of ambiguity allowed and the length and number of gaps), and is then guaranteed to find the consen>ed patterns in this class scoring highest according to a defined fitness measure. In many cases, this version of Pratt was very efficient, but in other cases it was too time consuming to be applied. Hence, a more efficient algorithm was needed. Results: In this paper, we describe a new and improved searching strategy that has two main advantages over the old strategy. First, it allows for easier integration with programs for multiple sequence alignment and data base search. Secondly, it makes it possible to use branchandbound search, and heuristics, to speed up the search. The new search strategy has been implemented in a new version of the Pratt program. Availability: The source code for the Pratt programs can be
Finding Similar Regions In Many Strings
 Journal of Computer and System Sciences
, 1999
"... Algorithms for finding similar, or highly conserved, regions in a group of sequences are at the core of many molecular biology problems. We solve three main open questions in this area. Assume that we are given n DNA sequences s1 ; : : : ; sn . The Consensus Patterns problem, which has been widely ..."
Abstract

Cited by 53 (8 self)
 Add to MetaCart
Algorithms for finding similar, or highly conserved, regions in a group of sequences are at the core of many molecular biology problems. We solve three main open questions in this area. Assume that we are given n DNA sequences s1 ; : : : ; sn . The Consensus Patterns problem, which has been widely studied in bioinformatics research [26, 16, 12, 25, 4, 6, 15, 22, 24, 27], in its simplest form, asks for a region of length L in each s i , and a median string s of length L so that the total Hamming distance from s to these regions is minimized. We show the problem is NPhard and give a polynomial time approximation scheme (PTAS) for it. We also give a PTAS for the problem under the original measure of [26, 16, 12, 25]. As an interesting application of our analysis, we further obtain a PTAS for a restricted (but still NPhard) version of the important star alignment problem allowing at most constant number of gaps, each of arbitrary length, in each sequence. The Closest String problem [2, 3, 7, 9, 18] asks for the smallest d and a string s which is within Hamming distance d to each s i . The problem is NPhard [7, 18]. [3] gives a polynomial time algorithm for constant d. For superlogarithmic d, [2, 9] give efficient approximation algorithms using linear program ralaxation techniques. The best polynomial time approximation has ratio 4 3 for all d, given by [18] ([9] also independently claimed the 4 3 ratio but only for superlogarithmic d). We settle the problem with a PTAS. We then give the first nontrivial betterthan2 approximation with ratio 2 \Gamma 2 2j\Sigmaj+1 for the more elusive Closest
The emergence of pattern discovery techniques in computational biology
 Metabolic Engineering
, 2000
"... In the past few years, pattern discovery has been emerging as a generic tool of choice for tackling problems from the computational biology domain. In this presentation, and after defining the problem in its generality, we review some of the algorithms that have appeared in the literature and descri ..."
Abstract

Cited by 28 (4 self)
 Add to MetaCart
In the past few years, pattern discovery has been emerging as a generic tool of choice for tackling problems from the computational biology domain. In this presentation, and after defining the problem in its generality, we review some of the algorithms that have appeared in the literature and describe several applications of pattern discovery to problems from computational biology. 2000 Academic Press 1.
Finding Similar Regions in Many Sequences
 JOURNAL OF COMPUTER AND SYSTEM SCIENCES
, 1999
"... Algorithms for finding similar, or highly conserved, regions in a group of sequences are at the core of many molecular biology problems. Assume that we are given n DNA sequences s 1 ; : : : ; s n . The Consensus Patterns problem, which has been widely studied in bioinformatics research [22, 10, 7 ..."
Abstract

Cited by 28 (9 self)
 Add to MetaCart
Algorithms for finding similar, or highly conserved, regions in a group of sequences are at the core of many molecular biology problems. Assume that we are given n DNA sequences s 1 ; : : : ; s n . The Consensus Patterns problem, which has been widely studied in bioinformatics research [22, 10, 7, 21, 2, 3, 9, 18, 19, 27], in its simplest form, asks for a region of length L in each s i , and a median string s of length L so that the total Hamming distance from s to these regions is minimized. We show that the problem is NPhard and give a polynomial time approximation scheme (PTAS) for it. We then present an efficient approximation algorithm for the consensus pattern problem under the original relative entropy measure of [22, 10, 7, 21]. As an interesting application of our analysis, we further obtain a PTAS for a restricted (but still NPhard) version of the important consensus alignment problem [6] allowing at most constant number of gaps, each of arbitrary length, in each sequence.
'Computing' as information compression by multiple alignment, unification and search
 Journal of Universal Computer Science
, 1999
"... This paper argues that the operations of a `Universal Turing Machine' (UTM) and equivalent mechanisms such as the `Post Canonical System' (PCS)  which are widely accepted as definitions of the concept of `computing'  may be interpreted as information compression by multiple alignment, unificat ..."
Abstract

Cited by 28 (14 self)
 Add to MetaCart
This paper argues that the operations of a `Universal Turing Machine' (UTM) and equivalent mechanisms such as the `Post Canonical System' (PCS)  which are widely accepted as definitions of the concept of `computing'  may be interpreted as information compression by multiple alignment, unification and search (ICMAUS). The motivation for this interpretation is that it suggests ways in which the UTM/PCS model may be augmented in a proposed new computing system designed to exploit the ICMAUS principles as fully as possible. The provision of a relatively sophisticated search mechanism in the proposed `SP' system appears to open the door to the integration and simplification of a range of functions including unsupervised inductive learning, bestmatch pattern recognition and information retrieval, probabilistic reasoning, planning and problem solving, and others. Detailed consideration of how the ICMAUS principles may be applied to these functions is outside the scope of this article but relevant sources are cited in this article.