Results

**11 - 20** of **20**

### Algorithms for Molecular Biology Research SMOTIF: efficient structured pattern and profile motif search

2006

Background: A structured motif allows variable length gaps between several components, where each component is a simple motif, which allows either no gaps or only fixed length gaps. The motif can either be represented as a pattern or a profile (also called positional weight matrix). We propose an efficient algorithm, called SMOTIF, to solve the structured motif search problem, i.e., given one or more sequences and a structured motif, SMOTIF searches the sequences for all occurrences of the motif. Potential applications include searching for long terminal repeat (LTR) retrotransposons and composite regulatory binding sites in DNA sequences. Results: SMOTIF can search for both pattern and profile motifs, and it is efficient in terms of both time and space; it outperforms SMARTFINDER, a state-of-the-art algorithm for structured motif search. Experimental results show that SMOTIF is about 7 times faster and consumes 100 times less memory than SMARTFINDER. It can effectively search for LTR retrotransposons and is well suited to searching for motifs with long range gaps. It is also successful in finding potential composite transcription factor binding sites. Conclusion: SMOTIF is a useful and efficient tool in searching for structured pattern and profile motifs. The algorithm is available as open-source at:
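The search problem stated in this abstract can be illustrated with a naive sketch. This is not the SMOTIF algorithm; it is a brute-force recursive scan, and the function name `find_structured_motif`, the exact-string components, and the `(min_gap, max_gap)` representation are assumptions made only for illustration.

```python
from typing import List, Tuple


def find_structured_motif(
    seq: str,
    components: List[str],
    gaps: List[Tuple[int, int]],
) -> List[Tuple[int, ...]]:
    """Return every tuple of component start positions that matches the motif.

    components: simple motifs, modeled here as exact strings (an assumption).
    gaps[i]: (min_gap, max_gap) allowed between component i and component i + 1.
    """
    if len(gaps) != len(components) - 1:
        raise ValueError("need exactly one gap range between consecutive components")

    results: List[Tuple[int, ...]] = []

    def extend(idx: int, min_start: int, max_start: int, starts: List[int]) -> None:
        # All components placed: record one full occurrence.
        if idx == len(components):
            results.append(tuple(starts))
            return
        comp = components[idx]
        last = min(max_start, len(seq) - len(comp))
        for pos in range(min_start, last + 1):
            if seq.startswith(comp, pos):
                end = pos + len(comp)
                if idx + 1 < len(components):
                    lo, hi = gaps[idx]
                    # The next component must start within the allowed gap window.
                    extend(idx + 1, end + lo, end + hi, starts + [pos])
                else:
                    extend(idx + 1, 0, 0, starts + [pos])

    extend(0, 0, len(seq), [])
    return results


occurrences = find_structured_motif("ACGTTTACGAAACG", ["ACG", "ACG"], [(2, 4)])
print(occurrences)
```

In the worst case the scan revisits the sequence once per partial match, which is the kind of cost a dedicated algorithm is designed to avoid.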

### A NOVEL APPROACH FOR STRUCTURED CONSENSUS MOTIF INFERENCE UNDER SPECIFICITY AND QUORUM CONSTRAINTS

We address the issue of structured motif inference. This problem is stated as follows: given a set of n DNA sequences and a quorum q (%), find the optimal structured consensus motif described as gaps alternating with specific regions and shared by at least q × n sequences. Our proposal is in the domain of metaheuristics: it runs solutions to convergence through a cooperation between a sampling strategy of the search space and a quick detection of local similarities in small sequence samples. The contributions of this paper are: (1) the design of a stochastic method whose genuine novelty rests on driving the search with a threshold frequency f discriminating between specific regions and gaps; (2) an original way of justifying the specially designed operations; (3) the implementation of a mining tool well adapted to biologists' requirements: few input parameters are required (quorum q, minimal threshold frequency f, maximal gap length g). Our approach proves efficient on simulated data, promoter sites in dicot plants, and transcription factor binding sites in the E. coli genome. Our algorithm, Kaos, compares favorably with MEME and STARS in terms of accuracy.

### IDENTIFICATION OF SPACED REGULATORY SITES VIA SUBMOTIF MODELING

In this paper we propose a novel approach for identification of generic motifs in an integrated manner by introducing the notion of submotifs. We formulate the motif finding problem as a constrained submotif pattern mining problem and present an algorithm called SPACE for identifying motifs that may contain spacers. When spacers are present, we show that the algorithm can identify motifs where 1) the spacers may be of varying lengths, 2) the number of motif segments may be unknown, and 3) the lengths of motif segments may be unknown. We perform rigorous experiments with the Motif Assessment Benchmarks by Tompa et al., and observe that our algorithm overall is able to outperform all popular algorithms tested so far, with significant improvements in sensitivity and specificity.

### Composite Pattern Discovery for PCR Application

We consider the problem of finding pairs of short patterns such that, in a given input sequence of length n, the distance between each pair’s patterns is at least α. The problem was introduced in [1] and is motivated by the optimization of multiplexed nested PCR. We study algorithms for the following two cases: the special case when the two patterns in the pair are required to have the same length, and the more general case when the patterns can have different lengths. For the first case we present an O(αn log log n) time and O(n) space algorithm, and for the general case we give an O(αn log n) time and O(n) space algorithm. The algorithms work for any alphabet size and use asymptotically less space than the algorithms presented in [1]. For alphabets of constant size we also give an O(n√n log² n) time algorithm for the general case. We demonstrate that the algorithms perform well in practice and present our findings for the human genome. In addition, we study an extended version of the problem where patterns in the pair occur at certain positions at a distance at most α, but do not occur α-close anywhere else in the input sequence.

### Author manuscript, published in "International Conference on Language and Automata Theory and Applications, Tarragona:

2008

Application of suffix trees for the acquisition of common motifs with gaps in a set of strings

### PMS6MC: A Multicore Algorithm for Motif Discovery

We develop an efficient multicore algorithm, PMS6MC, for the (l, d)-motif discovery problem, in which we are to find all strings of length l that appear in every string of a given set of strings with at most d mismatches. PMS6MC is based on PMS6, which is currently the fastest single-core algorithm for motif discovery in large instances. The speedup, relative to PMS6, attained by our multicore algorithm ranges from a high of 6.62 for the (17,6) challenging instances to a low of 2.75 for the (13,4) challenging instances on an Intel 6-core system. We estimate that PMS6MC is 2 to 4 times faster than other parallel algorithms for motif search on large instances. Keywords: planted motif search, parallel string algorithms, multi-core algorithms.
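To make the (l, d)-motif definition above concrete, here is a minimal brute-force sketch. It is not PMS6 or PMS6MC; it simply enumerates every string of length l over the alphabet, so it is only usable for very small l. The helper names `hamming`, `occurs_within`, and `planted_motifs` are hypothetical.

```python
from itertools import product
from typing import List


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def occurs_within(candidate: str, seq: str, d: int) -> bool:
    """True if some window of seq differs from candidate in at most d positions."""
    l = len(candidate)
    return any(
        hamming(candidate, seq[i:i + l]) <= d
        for i in range(len(seq) - l + 1)
    )


def planted_motifs(seqs: List[str], l: int, d: int, alphabet: str = "ACGT") -> List[str]:
    """All length-l strings that appear in every sequence with at most d mismatches."""
    candidates = ("".join(p) for p in product(alphabet, repeat=l))
    return [m for m in candidates if all(occurs_within(m, s, d) for s in seqs)]


print(planted_motifs(["AAAA", "AAAT"], 3, 0))
```

The candidate space grows as |alphabet|^l, which is why instances such as (17,6) call for the specialized algorithms the abstract describes.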

### Outlier Analysis Using Frequent Pattern Mining (LOF Algorithm)

An outlier in a dataset is an observation or a point that is considerably dissimilar to or inconsistent with the remainder of the data. Detection of such outliers is important for many applications and has recently attracted much attention in the data mining research community. In this paper, we present a new method to detect outliers by discovering frequent patterns (or frequent item sets) from the data set. The outliers are defined as the data transactions that contain fewer frequent patterns in their item sets. We define a measure called FPOF (Frequent Pattern Outlier Factor) to detect the outlier transactions and propose the Find FPOF algorithm to discover outliers. The experimental results have shown that our approach outperformed the existing methods on identifying interesting outliers.
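The abstract does not give the FPOF formula, so the sketch below rests on an assumption: the score of a transaction is taken as the sum of the supports of the frequent patterns it contains, divided by the total number of frequent patterns, so lower scores suggest outliers. The frequent patterns are found by exhaustive enumeration rather than by a real mining algorithm, and the names `frequent_patterns` and `fpof` are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set


def frequent_patterns(
    transactions: List[Set[str]], min_support: float
) -> Dict[FrozenSet[str], float]:
    """Exhaustively collect item sets whose support is at least min_support."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    freq: Dict[FrozenSet[str], float] = {}
    for size in range(1, len(items) + 1):
        found = False
        for combo in combinations(items, size):
            pattern = frozenset(combo)
            support = sum(pattern <= t for t in transactions) / n
            if support >= min_support:
                freq[pattern] = support
                found = True
        # No frequent set of this size means no larger frequent set exists.
        if not found:
            break
    return freq


def fpof(transaction: Set[str], freq: Dict[FrozenSet[str], float]) -> float:
    """Assumed FPOF: summed support of contained frequent patterns / pattern count."""
    if not freq:
        return 0.0
    contained = sum(sup for pat, sup in freq.items() if pat <= transaction)
    return contained / len(freq)


data = [{"a", "b"}, {"a", "b"}, {"a", "b"}, {"c"}]
patterns = frequent_patterns(data, min_support=0.5)
for t in data:
    print(sorted(t), fpof(t, patterns))
```

Under this assumed definition, the transaction `{"c"}` contains no frequent pattern and receives the lowest score, which marks it as the outlier candidate.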

### Article PMS6MC: A Multicore Algorithm for Motif Discovery

2013

"... algorithms ..."

### Solving Planted Motif Problem on GPU

The (l, d) planted motif problem is defined as follows: given a set of n DNA sequences, each of length L, find M, the set of sequences (or motifs) of length l which have at least one d-neighbor in each of the n sequences. The planted motif problem is an important and well-studied problem in computational biology. Motif finding is useful for identifying transcription factor binding sites, for sequence classification, and for building phylogenetic trees. The planted motif problem is difficult to solve, especially for the challenging instance sizes (15,5), (17,6), (19,7), and (21,8). The challenging instances are computationally intensive and require a large amount of memory. Several serial implementations have been proposed for solving this problem. The time required by these methods for solving large challenge instances is prohibitive. In this paper, we propose a parallel implementation on GPU that solves the challenge instance (21,8) in 1.1 hours. We are not aware of any sequential or parallel method that solves this challenge instance in better time. Additionally, to the best of our knowledge, there is no previous implementation of a parallel method to solve the planted motif problem on GPU.

### Outlier Analysis Using Frequent Pattern Mining – A Review

An outlier in a dataset is an observation or a point that is considerably dissimilar to or inconsistent with the remainder of the data. Detection of such outliers is important for many applications and has recently attracted much attention in the data mining research community. In this paper, we present a new method to detect outliers by discovering frequent patterns (or frequent item sets) from the data set. The outliers are defined as the data transactions that contain fewer frequent patterns in their item sets. We define a measure called FPOF (Frequent Pattern Outlier Factor) to detect the outlier transactions and propose the Find FPOF algorithm to discover outliers. The experimental results have shown that our approach outperformed the existing methods on identifying interesting outliers.