Results 1 - 10 of 13
Space-Efficient Preprocessing Schemes for Range Minimum Queries on Static Arrays
, 2009
Abstract

Cited by 47 (3 self)
Given a static array of n totally ordered objects, the range minimum query problem is to build an additional data structure that allows answering subsequent online queries of the form “what is the position of a minimum element in the subarray ranging from i to j?” efficiently. We focus on two settings, where (1) the input array is available at query time, and (2) the input array is only available at construction time. In setting (1), we show new data structures (a) of (n / c(n)) (2 + o(1)) bits and query time O(c(n)), or (b) with O(nH_k) + o(n) bits and O(1) query time, where H_k denotes the k-th order empirical entropy of the input array. In setting (2), we give a data structure of optimal size 2n + o(n) bits and query time O(1). All data structures can be constructed in linear time and almost in-place.
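To make the query in this abstract concrete, here is a minimal Python sketch of a range minimum query answered with the classic sparse-table technique. This is not the succinct structure the paper proposes: it keeps the input array at query time (setting (1)) and uses O(n log n) words rather than O(n) bits. The function names `build_sparse_table` and `rmq` are illustrative assumptions.

```python
from typing import List, Sequence


def build_sparse_table(a: Sequence[int]) -> List[List[int]]:
    """table[k][i] holds the position of the leftmost minimum of a[i : i + 2**k]."""
    n = len(a)
    table = [list(range(n))]  # level 0: every element is its own minimum
    k = 1
    while (1 << k) <= n:
        prev = table[k - 1]
        half = 1 << (k - 1)
        row = []
        for i in range(n - (1 << k) + 1):
            left, right = prev[i], prev[i + half]
            # "<=" keeps the leftmost position when values tie
            row.append(left if a[left] <= a[right] else right)
        table.append(row)
        k += 1
    return table


def rmq(a: Sequence[int], table: List[List[int]], i: int, j: int) -> int:
    """Position of a minimum element in a[i..j] (inclusive, 0-based)."""
    if not 0 <= i <= j < len(a):
        raise IndexError("query range out of bounds")
    k = (j - i + 1).bit_length() - 1  # largest power of two fitting in the range
    left = table[k][i]
    right = table[k][j - (1 << k) + 1]
    return left if a[left] <= a[right] else right


values = [3, 1, 4, 1, 5, 9, 2, 6]
sparse = build_sparse_table(values)
answer = rmq(values, sparse, 4, 7)  # position of the minimum among 5, 9, 2, 6
```

Two overlapping power-of-two blocks cover any query range, which is why each query takes constant time once the table has been built.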
A new representation for protein secondary structure prediction based on frequent patterns
 BIOINFORMATICS
, 2006
Abstract

Cited by 20 (0 self)
Motivation: A new representation for protein secondary structure prediction based on frequent amino acid patterns is described and evaluated. We discuss in detail how to identify frequent patterns in a protein sequence database using a level-wise search technique, how to define a set of features from those patterns, and how to use those features in the prediction of the secondary structure of a protein sequence using Support Vector Machines (SVMs). Results: Three different sets of features based on frequent patterns are evaluated in a blind testing setup using 150 targets from the EVA contest and compared to predictions of PSIPRED, PHD and PROFsec. Despite being trained on only 940 proteins, a simple SVM classifier based on this new representation yields results comparable to PSIPRED and PROFsec. Finally, we show that the method contributes significant information to consensus predictions. Availability: The method is available from the authors upon request. Contact:
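The level-wise search mentioned in this abstract can be illustrated with a short Python sketch. It rests on assumptions the abstract does not confirm: patterns are taken to be contiguous substrings, support is the number of sequences containing the pattern, and the function name `frequent_patterns` is hypothetical. The pattern definition and feature construction in the paper may differ.

```python
from collections import Counter
from typing import Dict, List


def frequent_patterns(sequences: List[str], min_support: int, max_len: int) -> Dict[str, int]:
    """Level-wise search: length-k candidates are kept only if both of their
    length-(k-1) sub-patterns were frequent at the previous level."""
    frequent: Dict[str, int] = {}
    previous = None  # set of frequent patterns found at level k-1
    for k in range(1, max_len + 1):
        counts: Counter = Counter()
        for seq in sequences:
            seen = set()  # count each pattern at most once per sequence
            for i in range(len(seq) - k + 1):
                cand = seq[i:i + k]
                if previous is not None and (
                    cand[:-1] not in previous or cand[1:] not in previous
                ):
                    continue  # pruned: a sub-pattern is already infrequent
                seen.add(cand)
            counts.update(seen)
        level = {p: c for p, c in counts.items() if c >= min_support}
        if not level:
            break  # no frequent pattern of length k, so none of length k+1 either
        frequent.update(level)
        previous = set(level)
    return frequent


toy_db = ["ACDE", "ACDF", "GACD"]
patterns = frequent_patterns(toy_db, min_support=3, max_len=4)
```

The pruning step relies on anti-monotonicity: a pattern cannot be frequent unless all of its sub-patterns are.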
Space-Efficient String Mining under Frequency Constraints
Abstract

Cited by 7 (1 self)
Let D1 and D2 be two databases (i.e. multisets) of d strings, over an alphabet Σ, with overall length n. We study the problem of mining discriminative patterns between D1 and D2, e.g., patterns that are frequent in one database but not in the other, emerging patterns, or patterns satisfying other frequency-related constraints. Using the algorithmic framework by Hui (CPM 1992), one can solve several variants of this problem in the optimal linear time with the aid of suffix trees or suffix arrays. This stands in sharp contrast to other pattern domains such as itemsets or subgraphs, where superlinear lower bounds are known. However, the space requirement of existing solutions is O(n log n) bits, which is not optimal for |Σ| << n (in particular for constant |Σ|), as the databases themselves occupy only n log |Σ| bits. Because in many real-life applications space is a more critical resource than time, the aim of this article is to reduce the space, at the cost of an increased running time. In particular, we give a solution for the above problems that uses O(n log |Σ| + d log n) bits, while the time requirement is increased from the optimal linear time to O(n log n). Our new method is tested extensively on biologically relevant datasets and shown to be usable even on genome-scale data.
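The problem statement above (patterns frequent in one database but not in the other) can be made concrete with a brute-force Python sketch. It is deliberately naive and is not the suffix-array-based, space-efficient method of the paper: it enumerates substrings up to an assumed length bound `max_len` and stores them in a hash table. The names `support_counts` and `discriminative_patterns` and both thresholds are illustrative assumptions.

```python
from collections import Counter
from typing import List


def support_counts(db: List[str], max_len: int) -> Counter:
    """Number of strings in db that contain each substring of length <= max_len."""
    counts: Counter = Counter()
    for s in db:
        substrings = {
            s[i:j]
            for i in range(len(s))
            for j in range(i + 1, min(len(s), i + max_len) + 1)
        }
        counts.update(substrings)  # a set, so each string counts once per pattern
    return counts


def discriminative_patterns(
    d1: List[str], d2: List[str], min_support_d1: int, max_support_d2: int, max_len: int = 8
) -> List[str]:
    """Patterns with support >= min_support_d1 in d1 and <= max_support_d2 in d2."""
    c1 = support_counts(d1, max_len)
    c2 = support_counts(d2, max_len)
    return sorted(
        p for p, c in c1.items() if c >= min_support_d1 and c2[p] <= max_support_d2
    )


d1 = ["abcab", "xabc", "cabd"]
d2 = ["bcd", "xyz", "ab"]
result = discriminative_patterns(d1, d2, min_support_d1=2, max_support_d2=0, max_len=3)
```

This sketch trades space for simplicity, which is the opposite of the trade-off the abstract argues for; it only shows what the output of such a mining task looks like.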
Efficient string mining under constraints via the deferred frequency index
 In Proc. 8th Industrial Conf. on Data Mining (ICDM), volume 5077 of LNCS
, 2008
Abstract

Cited by 4 (1 self)
Abstract. We propose a general approach for frequency-based string mining, which has many applications, e.g. in contrast data mining. Our contribution is a novel algorithm based on a deferred data structure. Despite its simplicity, our approach is up to 4 times faster and uses about half the memory compared to the best-known algorithm of Fischer et al. Applications in various string domains, e.g. natural language, DNA or protein sequences, demonstrate the improvement of our algorithm.
Contrast Data Mining: Methods and Applications
Abstract

Cited by 2 (0 self)
Contrast: “To compare or appraise in respect to differences” (Merriam-Webster Dictionary). Contrast data mining: the mining of patterns and models contrasting two or more classes/conditions.
Indices and applications in . . .
, 2012
Abstract
In recent years, sequencing throughput has increased dramatically with the introduction of so-called high-throughput sequencing. It allows the production of billions of base pairs (bp) per day in the form of reads of length 100 bp and more, and current developments promise the personal $1,000 genome in a couple of years. These advances in sequencing technology demand novel approaches and efficient data structures specifically designed for the analysis of mass data. One such data structure is the substring index, which represents all substrings, or all substrings up to a certain length, contained in a given text. In this thesis, we present three different substring indices and their applications in the analysis of high-throughput sequencing data. Our contribution is threefold: first, we extend the indices, which were originally designed to index a single sequence, to be applicable to datasets consisting of millions of strings. Further, we implement algorithms for the internal memory construction of each index and devise efficient external memory algorithms for indexing large datasets, e.g. multiple mammal genomes. To make
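As a rough illustration of a substring index over many strings, the following Python sketch builds a simple q-gram index that maps every length-q substring to its (string id, offset) occurrences. This is a hypothetical stand-in chosen for brevity: the excerpt does not name the three indices developed in the thesis, and `build_qgram_index` and `find` are assumed names.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Occurrence = Tuple[int, int]  # (string id, offset within that string)


def build_qgram_index(strings: List[str], q: int) -> Dict[str, List[Occurrence]]:
    """Map each substring of length q to all positions where it occurs."""
    index: Dict[str, List[Occurrence]] = defaultdict(list)
    for sid, s in enumerate(strings):
        for pos in range(len(s) - q + 1):
            index[s[pos:pos + q]].append((sid, pos))
    return dict(index)


def find(index: Dict[str, List[Occurrence]], strings: List[str], q: int, pattern: str) -> List[Occurrence]:
    """All occurrences of pattern (len(pattern) >= q): look up the first
    q-gram in the index, then verify the remainder against the text."""
    if len(pattern) < q:
        raise ValueError("pattern must be at least q characters long")
    candidates = index.get(pattern[:q], [])
    return [(sid, pos) for sid, pos in candidates if strings[sid].startswith(pattern, pos)]


reads = ["ACGTAC", "TACGT", "GGG"]
qgram_index = build_qgram_index(reads, q=3)
hits = find(qgram_index, reads, 3, "ACGT")
```

Storing (string id, offset) pairs rather than plain offsets is what lets one index serve a whole collection of reads instead of a single sequence.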
A Fast Cluster Motif Finding Algorithm for ChIP-Seq Data Sets
Abstract
ChIP-seq, a new high-throughput technique that couples chromatin immunoprecipitation experiments with high-throughput sequencing technologies, has extended the identification of binding locations of a transcription factor to genome-wide regions. However, most existing motif discovery algorithms are time-consuming and limited in their ability to identify binding motifs in ChIP-seq data, which is typically large in scale. To improve efficiency, we propose a fast cluster motif finding algorithm, named FCmotif, to identify the (l, d) motifs in large-scale ChIP-seq data sets. It is inspired by the emerging substrings mining strategy: it first finds the enriched substrings and then searches the neighborhood instances to construct PWMs and cluster motifs of different lengths. FCmotif does not follow the OOPS model constraint and can find long motifs. The effectiveness of the proposed algorithm has been demonstrated by experiments on ChIP-seq data sets from mouse ES cells. The whole detection of the real binding motifs and processing of the full-size data of several megabytes finished in a few minutes. The experimental results show that FCmotif is well suited to (l, d) motif finding in ChIP-seq data; it also demonstrates better performance than other widely used algorithms such as MEME, Weeder, ChIPMunk, and DREME.
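The step of constructing a PWM from a set of motif instances can be sketched in Python as follows. The abstract does not say how FCmotif builds or scores its PWMs, so everything here is an assumption made for illustration: equal-length DNA instances, a pseudocount of 1, log-odds scores against a uniform background, and the names `build_pwm` and `score`.

```python
import math
from typing import Dict, List

ALPHABET = "ACGT"


def build_pwm(instances: List[str], pseudocount: float = 1.0, background: float = 0.25) -> List[Dict[str, float]]:
    """Column-wise log2-odds matrix built from equal-length motif instances."""
    length = len(instances[0])
    if any(len(s) != length for s in instances):
        raise ValueError("all instances must have the same length")
    pwm = []
    for col in range(length):
        counts = {base: pseudocount for base in ALPHABET}
        for s in instances:
            counts[s[col]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2((counts[b] / total) / background) for b in ALPHABET})
    return pwm


def score(pwm: List[Dict[str, float]], site: str) -> float:
    """Sum of per-position log-odds for a candidate site of the motif's length."""
    return sum(column[base] for column, base in zip(pwm, site))


sites = ["ACGT", "ACGA", "ACGT", "TCGT"]
pwm = build_pwm(sites)
consensus_score = score(pwm, "ACGT")
```

The pseudocount keeps bases that never appear in a column from producing a log of zero, at the cost of slightly flattening the matrix.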
2008 Eighth IEEE International Conference on Data Mining Space-Efficient String Mining under Frequency Constraints
Abstract
Let D1 and D2 be two databases (i.e. multisets) of d strings, over an alphabet Σ, with overall length n. We study the problem of mining discriminative patterns between D1 and D2, e.g., patterns that are frequent in one database but not in the other, emerging patterns, or patterns satisfying other frequency-related constraints. Using the algorithmic framework by Hui (CPM 1992), one can solve several variants of this problem in the optimal linear time with the aid of suffix trees or suffix arrays. This stands in sharp contrast to other pattern domains such as itemsets or subgraphs, where superlinear lower bounds are known. However, the space requirement of existing solutions is O(n log n) bits, which is not optimal for |Σ| << n (in particular for constant |Σ|), as the databases themselves occupy only n log |Σ| bits. Because in many real-life applications space is a more critical resource than time, the aim of this article is to reduce the space, at the cost of an increased running time. In particular, we give a solution for the above problems that uses O(n log |Σ| + d log n) bits, while the time requirement is increased from the optimal linear time to O(n log n). Our new method is tested extensively on biologically relevant datasets and shown to be usable even on genome-scale data.
Sequential Pattern Mining from Trajectory Data
Abstract
In this paper, we study the problem of mining for frequent trajectories, which is crucial in many application scenarios, such as vehicle traffic management, handoff in cellular networks, and supply chain management. We approach this problem as that of mining for frequent sequential patterns. Our approach consists of a partitioning strategy for incoming streams of trajectories in order to reduce the trajectory size and represent trajectories as strings. We mine frequent trajectories using a sliding-window approach combined with a counting algorithm that allows us to promptly update the frequency of patterns. To make counting efficient, we represent frequent trajectories by prime numbers, whereby the Chinese remainder theorem can then be used to expedite the computation.
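The sliding-window counting idea in this abstract can be sketched in Python as below. The sketch covers only the incremental update of pattern frequencies as trajectories enter and leave the window. It assumes that each trajectory has already been encoded as a string of region symbols and that patterns are fixed-length substrings, and it does not attempt the prime-number encoding. The class name `SlidingWindowCounter` and its parameters are hypothetical.

```python
from collections import Counter, deque
from typing import Deque, Set


class SlidingWindowCounter:
    """Keeps pattern supports for the most recent `window_size` trajectories."""

    def __init__(self, window_size: int, pattern_len: int) -> None:
        self.window_size = window_size
        self.pattern_len = pattern_len
        self.window: Deque[Set[str]] = deque()
        self.counts: Counter = Counter()

    def _patterns(self, trajectory: str) -> Set[str]:
        k = self.pattern_len
        return {trajectory[i:i + k] for i in range(len(trajectory) - k + 1)}

    def add(self, trajectory: str) -> None:
        """Insert a new trajectory and expire the oldest one if the window is full."""
        patterns = self._patterns(trajectory)
        self.window.append(patterns)
        self.counts.update(patterns)
        if len(self.window) > self.window_size:
            expired = self.window.popleft()
            self.counts.subtract(expired)
            for p in expired:
                if self.counts[p] <= 0:
                    del self.counts[p]  # drop patterns no longer in the window

    def frequent(self, min_support: int) -> Set[str]:
        return {p for p, c in self.counts.items() if c >= min_support}


counter = SlidingWindowCounter(window_size=2, pattern_len=2)
for trajectory in ["ABC", "ABD"]:
    counter.add(trajectory)
before = counter.frequent(min_support=2)
counter.add("XBD")  # "ABC" leaves the window
after = counter.frequent(min_support=2)
```

Each update touches only the patterns of the arriving and the expiring trajectory, so the supports never have to be recomputed over the whole window.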
Contrast Data Mining: Methods and Applications
Abstract
respect to differences” (Merriam-Webster Dictionary). Contrast data mining: the mining of patterns and models contrasting two or more classes/conditions. Contrast Data Mining: What is it? (cont.) “Sometimes it’s good to contrast what you like with something else. It makes you appreciate it even more” (Darby Conley, Get Fuzzy, 2001). What can be contrasted?
• Objects at different time periods: “Compare ICDM papers published in 2006-2007 versus those in 2004-2005”
• Objects for different spatial locations