Space-Efficient Preprocessing Schemes for Range Minimum Queries on Static Arrays
, 2009
Cited by 47 (3 self)
Given a static array of n totally ordered objects, the range minimum query problem is to build an additional data structure that allows answering subsequent on-line queries of the form “what is the position of a minimum element in the sub-array ranging from i to j?” efficiently. We focus on two settings, where (1) the input array is available at query time, and (2) the input array is only available at construction time. In setting (1), we show new data structures (a) of size n/c(n) · (2 + o(1)) bits and query time O(c(n)), or (b) with O(nH_k) + o(n) bits and O(1) query time, where H_k denotes the empirical entropy of k-th order of the input array. In setting (2), we give a data structure of optimal size 2n + o(n) bits and query time O(1). All data structures can be constructed in linear time and almost in-place.
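To make the query in this abstract concrete, here is a minimal sketch of a range minimum query structure. It uses the textbook sparse-table technique, which needs O(n log n) words of space and is therefore not the succinct scheme the abstract describes; it only illustrates the interface "preprocess once, then return the position of a minimum in [i, j] in O(1) time". The names `build_sparse_table` and `rmq` are hypothetical.

```python
def build_sparse_table(a):
    """table[k][i] holds the position of a minimum of a[i : i + 2**k].

    Assumption: plain sparse table (O(n log n) words), not the
    2n + o(n)-bit structure from the abstract.
    """
    n = len(a)
    table = [list(range(n))]  # level 0: each element is its own minimum
    k = 1
    while (1 << k) <= n:
        prev = table[k - 1]
        half = 1 << (k - 1)
        row = []
        for i in range(n - (1 << k) + 1):
            left, right = prev[i], prev[i + half]
            # On ties, keep the leftmost position.
            row.append(left if a[left] <= a[right] else right)
        table.append(row)
        k += 1
    return table


def rmq(a, table, i, j):
    """Position of a minimum element of a[i..j] (inclusive, 0-based)."""
    k = (j - i + 1).bit_length() - 1  # largest k with 2**k <= j - i + 1
    left = table[k][i]
    right = table[k][j - (1 << k) + 1]
    return left if a[left] <= a[right] else right


a = [5, 2, 7, 2, 9, 1, 4]
table = build_sparse_table(a)
print(rmq(a, table, 0, 4))  # position of a minimum in a[0..4]
```

Note that this sketch accesses the input array `a` at query time, which corresponds to setting (1) of the abstract; setting (2) requires a structure that answers queries without `a`.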
A new representation for protein secondary structure prediction based on frequent patterns
- BIOINFORMATICS
, 2006
Cited by 20 (0 self)
Motivation: A new representation for protein secondary structure prediction based on frequent amino acid patterns is described and evaluated. We discuss in detail how to identify frequent patterns in a protein sequence database using a level-wise search technique, how to define a set of features from those patterns, and how to use those features in the prediction of the secondary structure of a protein sequence using Support Vector Machines (SVMs). Results: Three different sets of features based on frequent patterns are evaluated in a blind testing setup using 150 targets from the EVA contest and compared to predictions of PSI-PRED, PHD and PROFsec. Even though it was trained on only 940 proteins, a simple SVM classifier based on this new representation yields results comparable to PSI-PRED and PROFsec. Finally, we show that the method contributes significant information to consensus predictions. Availability: The method is available from the authors upon request. Contact:
Space-Efficient String Mining under Frequency Constraints
Cited by 7 (1 self)
Let D1 and D2 be two databases (i.e. multisets) of d strings, over an alphabet Σ, with overall length n. We study the problem of mining discriminative patterns between D1 and D2, e.g., patterns that are frequent in one database but not in the other, emerging patterns, or patterns satisfying other frequency-related constraints. Using the algorithmic framework by Hui (CPM 1992), one can solve several variants of this problem in optimal linear time with the aid of suffix trees or suffix arrays. This stands in stark contrast to other pattern domains such as itemsets or subgraphs, where super-linear lower bounds are known. However, the space requirement of existing solutions is O(n log n) bits, which is not optimal for |Σ| << n (in particular for constant |Σ|), as the databases themselves occupy only n log |Σ| bits. Because in many real-life applications space is a more critical resource than time, the aim of this article is to reduce the space, at the cost of an increased running time. In particular, we give a solution for the above problems that uses O(n log |Σ| + d log n) bits, while the time requirement is increased from the optimal linear time to O(n log n). Our new method is tested extensively on biologically relevant datasets and shown to be usable even on genome-scale data.
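The problem statement can be illustrated with a deliberately naive sketch. It enumerates every substring explicitly, so it runs in quadratic time and space per string and is not the suffix-array method of the abstract. Two assumptions are made here that the abstract does not state: the frequency of a pattern is taken to be the number of strings in a database that contain it, and the constraint is "at least `min_d1` strings in D1 and at most `max_d2` strings in D2". All function and parameter names are hypothetical.

```python
from collections import Counter


def substring_support(db):
    """Map each substring to the number of strings in db that contain it."""
    support = Counter()
    for s in db:
        # A set ensures each string contributes at most once per substring.
        seen = {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}
        support.update(seen)
    return support


def discriminative_patterns(d1, d2, min_d1, max_d2):
    """Substrings frequent in d1 (>= min_d1) and infrequent in d2 (<= max_d2)."""
    s1 = substring_support(d1)
    s2 = substring_support(d2)
    # Counter returns 0 for substrings that never occur in d2.
    return sorted(p for p, c in s1.items() if c >= min_d1 and s2[p] <= max_d2)


d1 = ["abca", "bcab"]
d2 = ["aaa", "cc"]
print(discriminative_patterns(d1, d2, min_d1=2, max_d2=0))
# → ['ab', 'b', 'bc', 'bca', 'ca']
```

The explicit substring sets are exactly what a space-efficient approach avoids; the sketch is only meant to pin down which patterns count as discriminative under the stated assumptions.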
Efficient string mining under constraints via the deferred frequency index
- In Proc. 8th Industrial Conf. on Data Mining (ICDM), volume 5077 of LNCS
, 2008
Cited by 4 (1 self)
We propose a general approach for frequency-based string mining, which has many applications, e.g. in contrast data mining. Our contribution is a novel algorithm based on a deferred data structure. Despite its simplicity, our approach is up to 4 times faster and uses about half the memory of the best-known algorithm of Fischer et al. Applications in various string domains, e.g. natural language, DNA or protein sequences, demonstrate the improvement of our algorithm.
Contrast Data Mining: Methods and Applications
Cited by 2 (0 self)
Contrast: “To compare or appraise in respect to differences” (Merriam Webster Dictionary). Contrast data mining: the mining of patterns and models contrasting two or more classes/conditions.
Indices and applications in . . .
, 2012
In recent years, sequencing throughput has increased dramatically with the introduction of so-called high-throughput sequencing. It allows the production of billions of base pairs (bp) per day in the form of reads of length 100 bp and more, and current developments promise the personal $1,000 genome in a couple of years. These advances in sequencing technology call for novel approaches and efficient data structures specifically designed for the analysis of mass data. One such data structure is the substring index, which represents all substrings, or all substrings up to a certain length, contained in a given text. In this thesis, we present three different substring indices and their applications in the analysis of high-throughput sequencing data. Our contribution is threefold: first, we extend the indices, which were originally designed to index a single sequence, to be applicable to datasets consisting of millions of strings. Further, we implement algorithms for the internal memory construction of each index and devise efficient external memory algorithms for indexing large datasets, e.g. multiple mammal genomes. To make
A Fast Cluster Motif Finding Algorithm for ChIP-Seq Data Sets
A new high-throughput technique, ChIP-seq, which couples chromatin immunoprecipitation experiments with high-throughput sequencing technologies, has extended the identification of transcription factor binding locations to genome-wide regions. However, most existing motif discovery algorithms are time-consuming and limited in their ability to identify binding motifs in ChIP-seq data, which is typically large in scale. To improve efficiency, we propose a fast cluster motif finding algorithm, named FCmotif, to identify the ( , ) motifs in large-scale ChIP-seq data sets. It is inspired by the emerging substrings mining strategy: it finds the enriched substrings and then searches the neighborhood instances to construct PWMs and cluster motifs of different lengths. FCmotif does not follow the OOPS model constraint and can find long motifs. The effectiveness of the proposed algorithm has been demonstrated by experiments on ChIP-seq data sets from mouse ES cells. The whole detection of the real binding motifs and the processing of the full-size data of several megabytes finished in a few minutes. The experimental results show that FCmotif is advantageous for ( , ) motif finding in ChIP-seq data; it also demonstrates better performance than other widely used algorithms such as MEME, Weeder, ChIPMunk, and DREME.
Space-Efficient String Mining under Frequency Constraints
- 2008 Eighth IEEE International Conference on Data Mining
Let D1 and D2 be two databases (i.e. multisets) of d strings, over an alphabet Σ, with overall length n. We study the problem of mining discriminative patterns between D1 and D2, e.g., patterns that are frequent in one database but not in the other, emerging patterns, or patterns satisfying other frequency-related constraints. Using the algorithmic framework by Hui (CPM 1992), one can solve several variants of this problem in optimal linear time with the aid of suffix trees or suffix arrays. This stands in stark contrast to other pattern domains such as itemsets or subgraphs, where super-linear lower bounds are known. However, the space requirement of existing solutions is O(n log n) bits, which is not optimal for |Σ| << n (in particular for constant |Σ|), as the databases themselves occupy only n log |Σ| bits. Because in many real-life applications space is a more critical resource than time, the aim of this article is to reduce the space, at the cost of an increased running time. In particular, we give a solution for the above problems that uses O(n log |Σ| + d log n) bits, while the time requirement is increased from the optimal linear time to O(n log n). Our new method is tested extensively on biologically relevant datasets and shown to be usable even on genome-scale data.
Sequential Pattern Mining from Trajectory Data
In this paper, we study the problem of mining for frequent trajectories, which is crucial in many application scenarios, such as vehicle traffic management, hand-off in cellular networks, and supply chain management. We approach this problem as that of mining for frequent sequential patterns. Our approach consists of a partitioning strategy for incoming streams of trajectories in order to reduce the trajectory size and represent trajectories as strings. We mine frequent trajectories using a sliding-window approach combined with a counting algorithm that allows us to promptly update the frequency of patterns. To make counting efficient, we represent frequent trajectories by prime numbers, whereby the Chinese remainder theorem can then be used to expedite the computation.
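The abstract does not spell out how primes and the Chinese remainder theorem are combined, so the following is only one plausible sketch, not the paper's scheme. It assumes that each cell of the partitioned space is assigned a distinct prime, that a trajectory visits each cell at most once, and that every prime is larger than the trajectory length. A trajectory is then stored as two integers: the product of its primes (for membership tests by divisibility) and a CRT code whose residue modulo a cell's prime is that cell's position. The names `crt`, `encode`, `position` and `prime_of` are hypothetical.

```python
from math import prod


def crt(residues, moduli):
    """Solve x ≡ residues[i] (mod moduli[i]) for pairwise coprime moduli."""
    total = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        partial = total // m
        # pow(partial, -1, m) is the modular inverse (Python 3.8+).
        x += r * partial * pow(partial, -1, m)
    return x % total


def encode(trajectory, prime_of):
    """Return (code, product) for a trajectory given as a list of cell labels.

    Assumption: cells are distinct and every prime exceeds len(trajectory).
    """
    primes = [prime_of[cell] for cell in trajectory]
    positions = list(range(1, len(trajectory) + 1))
    return crt(positions, primes), prod(primes)


def position(code, product, cell_prime):
    """1-based position of a cell in the trajectory, or None if absent."""
    if product % cell_prime != 0:
        return None
    return code % cell_prime


prime_of = {"A": 5, "B": 7, "C": 11, "D": 13}  # illustrative assignment
code, product = encode(["C", "A", "D"], prime_of)
print(position(code, product, prime_of["A"]))  # → 2
print(position(code, product, prime_of["B"]))  # → None
```

Under these assumptions, checking whether cell X precedes cell Y reduces to two modulo operations on the stored code, which is the kind of shortcut the abstract alludes to.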
Contrast Data Mining: Methods and Applications
respect to differences” (Merriam Webster Dictionary). Contrast data mining: the mining of patterns and models contrasting two or more classes/conditions.
Contrast Data Mining: What is it? (Cont.) “Sometimes it’s good to contrast what you like with something else. It makes you appreciate it even more.” (Darby Conley, Get Fuzzy, 2001)
What can be contrasted?
- Objects at different time periods
- “Compare ICDM papers published in 2006-2007 versus those in 2004-2005”
- Objects for different spatial locations