Results 1 -
7 of
7
Efficient algorithms for sequence segmentation
"... The sequence segmentation problem asks for a partition of the sequence into k non-overlapping segments that cover all data points such that each segment is as homogeneous as possible. This problem can be solved optimally using dynamic programming in O(n² k) time, where n is the length of the sequen ..."
Abstract
-
Cited by 17 (4 self)
- Add to MetaCart
The sequence segmentation problem asks for a partition of the sequence into k non-overlapping segments that cover all data points such that each segment is as homogeneous as possible. This problem can be solved optimally using dynamic programming in O(n² k) time, where n is the length of the sequence. Given that sequences in practice are too long, a quadratic algorithm is not an adequately fast solution. Here, we present an alternative constantfactor approximation algorithm with running time O(n 4/3 k 5/3). We call this algorithm the DNS algorithm. We also consider the recursive application of the DNS algorithm, that results in a faster algorithm (O(n log log n) running time) with O(log n) approximation factor, and study the accuracy/efficiency tradeoff. Extensive experimental results show that these algorithms outperform other widely-used heuristics. The same algorithms can speed up solutions for other variants of the basic segmentation problem while maintaining constant their approximation factors. Our techniques can also be used in a streaming setting, with sublinear memory requirements.
Segmentation and dimensionality reduction
- in SDM 2006
, 2006
"... Sequence segmentation and dimensionality reduction have been used as methods for studying high-dimensional sequences — they both reduce the complexity of the representation of the original data. In this paper we study the interplay of these two techniques. We formulate the problem of segmenting a se ..."
Abstract
-
Cited by 5 (0 self)
- Add to MetaCart
Sequence segmentation and dimensionality reduction have been used as methods for studying high-dimensional sequences — they both reduce the complexity of the representation of the original data. In this paper we study the interplay of these two techniques. We formulate the problem of segmenting a sequence while modeling it with a basis of small size, thus essentially reducing the dimension of the input sequence. We give three different algorithms for this problem: all combine existing methods for sequence segmentation and dimensionality reduction. For two of the proposed algorithms we prove guarantees for the quality of the solutions obtained. We describe experimental results on synthetic and real datasets, including data on exchange rates and genomic sequences. Our experiments show that the algorithms indeed discover underlying structure in the data, including both segmental structure and interdependencies between the dimensions. Keywords: segmentation, multidimensional data, PCA, time series 1
An Optimal DNA Segmentation Based on the MDL Principle
- Int. J. Bioinformatics Research and Applications
, 2003
"... The biological world is highly stochastic and inhomogeneous in its behavior. There are regions in DNA with high concentration of G or C bases; stretches of sequences with an abundance of CG dinucleotide (CpG islands); coding regions with strong periodicityof -three pattern, and so forth. The tran ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
The biological world is highly stochastic and inhomogeneous in its behavior. There are regions in DNA with high concentration of G or C bases; stretches of sequences with an abundance of CG dinucleotide (CpG islands); coding regions with strong periodicityof -three pattern, and so forth. The transition between homogeneous and inhomogeneous regions of DNA, known also as change points, carry important biological information. Computational methods used to identify these homogeneous regions are called segmentations.
Aggregating Time Partitions
"... Partitions of sequential data exist either per se or as a result of sequence segmentation algorithms. It is often the case that the same timeline is partitioned in many different ways. For example, different segmentation algorithms produce different partitions of the same underlying data points. In ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
Partitions of sequential data exist either per se or as a result of sequence segmentation algorithms. It is often the case that the same timeline is partitioned in many different ways. For example, different segmentation algorithms produce different partitions of the same underlying data points. In such cases, we are interested in producing an aggregate partition, i.e., a segmentation that agrees as much as possible with the input segmentations. Each partition is defined as a set of continuous non-overlapping segments of the timeline. We show that this problem can be solved optimally in polynomial time using dynamic programming. We also propose faster greedy heuristics that work well in practice. We experiment with our algorithms and we demonstrate their utility in clustering the behavior of mobile-phone users and combining the results of different segmentation algorithms on genomic sequences.
www.elsevier.com/locate/gene Are isochore sequences homogeneous?
"... Three statistical/mathematical analyses are carried out on isochore sequences: spectral analysis, analysis of variance, and segmentation analysis. Spectral analysis shows that there are GC content fluctuations at different length scales in isochore sequences. The analysis of variance shows that the ..."
Abstract
- Add to MetaCart
Three statistical/mathematical analyses are carried out on isochore sequences: spectral analysis, analysis of variance, and segmentation analysis. Spectral analysis shows that there are GC content fluctuations at different length scales in isochore sequences. The analysis of variance shows that the null hypothesis (the mean value of a group of GC contents remains the same along the sequence) may or may not be rejected for an isochore sequence, depending on the subwindow sizes at which GC contents are sampled, and the window size within which group members are defined. The segmentation analysis shows that there are stronger indications of GC content changes at isochore borders than within an isochore. These analyses support the notion of isochore sequences, but reject the assumption that isochore sequences are homogeneous at the base level. An isochore sequence may pass a homogeneity test when GC content fluctuations at smaller length scales are
Research Track Paper Aggregating Time Partitions
"... Partitions of sequential data exist either per se or as a result of sequence segmentation algorithms. It is often the case that the same timeline is partitioned in many different ways. For example, different segmentation algorithms produce different partitions of the same underlying data points. In ..."
Abstract
- Add to MetaCart
Partitions of sequential data exist either per se or as a result of sequence segmentation algorithms. It is often the case that the same timeline is partitioned in many different ways. For example, different segmentation algorithms produce different partitions of the same underlying data points. In such cases, we are interested in producing an aggregate partition, i.e., a segmentation that agrees as much as possible with the input segmentations. Each partition is defined as a set of continuous non-overlapping segments of the timeline. We show that this problem can be solved optimally in polynomial time using dynamic programming. We also propose faster greedy heuristics that work well in practice. We experiment with our algorithms and we demonstrate their utility in clustering the behavior of mobile-phone users and combining the results of different segmentation algorithms on genomic sequences.

