Results 1–10 of 62
“Random projection in dimensionality reduction: applications to image and text data,” in Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’01), 2001
"... ABSTRACT Random projections have recently emerged as a powerful method for dimensionality reduction. Theoretical results indicate that the method preserves distances quite nicely; however, empirical results are sparse. We present experimental results on using random projection as a dimensionality r ..."
Abstract

Cited by 245 (0 self)
 Add to MetaCart
(Show Context)
Random projections have recently emerged as a powerful method for dimensionality reduction. Theoretical results indicate that the method preserves distances quite nicely; however, empirical results are sparse. We present experimental results on using random projection as a dimensionality reduction tool in a number of cases where the high dimensionality of the data would otherwise lead to burdensome computations. Our application areas are the processing of both noisy and noiseless images, and information retrieval in text documents. We show that projecting the data onto a random lower-dimensional subspace yields results comparable to conventional dimensionality reduction methods such as principal component analysis: the similarity of data vectors is preserved well under random projection. However, using random projections is computationally significantly less expensive than using, e.g., principal component analysis. We also show experimentally that using a sparse random matrix gives additional computational savings in random projection.
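The technique the abstract describes is cheap to sketch. A minimal version using an Achlioptas-style sparse random matrix (entries +1/0/-1); the dimensions and density here are illustrative choices, not the paper's exact experimental setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def sparse_random_projection(X, k):
    # Sparse random matrix: entries +1, 0, -1 with probabilities
    # 1/6, 2/3, 1/6, scaled by sqrt(3/k) so that pairwise Euclidean
    # distances are preserved in expectation. Two thirds of the
    # entries are zero, which is where the computational savings
    # over a dense Gaussian matrix come from.
    d = X.shape[1]
    R = rng.choice([1.0, 0.0, -1.0], size=(d, k), p=[1 / 6, 2 / 3, 1 / 6])
    return X @ (np.sqrt(3.0 / k) * R)

# 20 points in 1000 dimensions, projected down to 50.
X = rng.standard_normal((20, 1000))
Y = sparse_random_projection(X, 50)
orig = float(np.linalg.norm(X[0] - X[1]))
proj = float(np.linalg.norm(Y[0] - Y[1]))
ratio = proj / orig   # typically close to 1
```

No training pass over the data is needed, which is the contrast with PCA the abstract draws.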
Scaling up Dynamic Time Warping for Datamining Applications, in Proc. 6th Int. Conf. on Knowledge Discovery and Data Mining, 2000
"... There has been much recent interest in adapting data mining algorithms to time series databases. Most of these algorithms need to compare time series. Typically some variation of Euclidean distance is used. However, as we demonstrate in this paper, Euclidean distance can be an extremely brittle dist ..."
Abstract

Cited by 84 (3 self)
 Add to MetaCart
There has been much recent interest in adapting data mining algorithms to time series databases. Most of these algorithms need to compare time series. Typically some variation of Euclidean distance is used. However, as we demonstrate in this paper, Euclidean distance can be an extremely brittle distance measure. Dynamic time warping (DTW) has been suggested as a technique to allow more robust distance calculations; however, it is computationally expensive. In this paper we introduce a modification of DTW which operates on a higher-level abstraction of the data, in particular a Piecewise Aggregate Approximation (PAA). Our approach allows us to outperform DTW by one to two orders of magnitude, with no loss of accuracy.
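PAA itself is simple to sketch; a minimal version (the abstraction step only, not the paper's full warping algorithm, and assuming the segment count divides the series length):

```python
import numpy as np

def paa(x, m):
    # Piecewise Aggregate Approximation: split the series into m
    # equal frames and replace each frame with its mean, so DTW can
    # run on m points instead of len(x).
    x = np.asarray(x, dtype=float)
    return x.reshape(m, len(x) // m).mean(axis=1)

x = np.sin(np.linspace(0, 4 * np.pi, 128))
approx = paa(x, 16)   # 128 points -> 16 segment means
```

Warping two length-m PAA series costs O(m^2) instead of O(n^2), which is where the one-to-two-orders-of-magnitude speedup comes from.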
Indexing Spatio-Temporal Trajectories with Chebyshev Polynomials, in Proc. 2004 ACM SIGMOD, to appear
"... In this thesis, we investigate the subject of indexing large collections of spatiotemporal trajectories for similarity matching. Our proposed technique is to first mitigate the dimensionality curse problem by approximating each trajectory with a low order polynomiallike curve, and then incorporate ..."
Abstract

Cited by 83 (0 self)
 Add to MetaCart
(Show Context)
In this thesis, we investigate the subject of indexing large collections of spatio-temporal trajectories for similarity matching. Our proposed technique is to first mitigate the dimensionality-curse problem by approximating each trajectory with a low-order polynomial-like curve, and then incorporate a multidimensional index over the reduced space of polynomial coefficients. There are many possible ways to choose the polynomial, including Fourier transforms, splines, non-linear regressions, etc. Some of these possibilities have indeed been studied before. We hypothesize that one of the best approaches is the polynomial that minimizes the maximum deviation from the true value, which is called the minimax polynomial. Minimax approximation is particularly meaningful for indexing because in a branch-and-bound search (i.e., for finding nearest neighbours), the smaller the maximum deviation, the more pruning opportunities there exist. In general, among all the polynomials of the same degree, the optimal minimax polynomial is very hard to compute. However, it has been shown that the Chebyshev approximation is almost identical to the optimal minimax polynomial, and is easy to compute [32]. Thus, we shall explore how to use …
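The coefficient-vector idea can be sketched with NumPy's Chebyshev module. One caveat: the paper interpolates at Chebyshev nodes, while this sketch uses a least-squares Chebyshev fit over the trajectory's own samples, which is close in spirit but not identical:

```python
import numpy as np
from numpy.polynomial import chebyshev as C

def cheb_signature(values, degree):
    # Fit a Chebyshev series to the sampled trajectory, with sample
    # positions mapped onto the standard domain [-1, 1]; the small
    # coefficient vector serves as the index key in the reduced space.
    t = np.linspace(-1.0, 1.0, len(values))
    return C.chebfit(t, values, degree)

rng = np.random.default_rng(1)
traj = np.cumsum(rng.standard_normal(256))       # a 1-d trajectory
coeffs = cheb_signature(traj, 8)                 # 9 numbers vs 256 points
recon = C.chebval(np.linspace(-1.0, 1.0, 256), coeffs)
max_dev = float(np.abs(traj - recon).max())      # near-minimax deviation
```

The smaller `max_dev` is, the tighter the bounding information stored in the index, and the more candidates a branch-and-bound search can prune.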
A Survey of Recent Methods for Efficient Retrieval of Similar Time Sequences, 2001
"... Time sequences occur in many applications, ranging from science and technology to business and entertainment. In many of these applications, an analysis of time series data, and searching through large, unstructured databases based on sample sequences, is often desirable. Such similaritybased retri ..."
Abstract

Cited by 22 (3 self)
 Add to MetaCart
Time sequences occur in many applications, ranging from science and technology to business and entertainment. In many of these applications, analyzing time series data and searching through large, unstructured databases based on sample sequences are often desirable. Such similarity-based retrieval has attracted a lot of attention in recent years. Although several different approaches have appeared, most are based on the common premise of dimensionality reduction and spatial access methods. This paper gives an overview of recent research and shows how the methods fit into a general context of signature extraction.
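The common premise the survey names (extract a short signature, index it, rely on a lower bound to avoid false dismissals) can be illustrated with the classic truncated-DFT signature; the function name here is mine, not the survey's:

```python
import numpy as np

def dft_signature(x, k):
    # Keep the first k DFT coefficients under orthonormal scaling.
    # The DFT is an orthonormal linear map, so the distance between
    # two signatures (a coordinate subset) lower-bounds the Euclidean
    # distance between the full sequences: filtering with an index
    # over signatures can never dismiss a true match.
    return np.fft.rfft(np.asarray(x, dtype=float))[:k] / np.sqrt(len(x))

rng = np.random.default_rng(2)
a, b = rng.standard_normal(128), rng.standard_normal(128)
sig_dist = float(np.linalg.norm(dft_signature(a, 4) - dft_signature(b, 4)))
true_dist = float(np.linalg.norm(a - b))
# sig_dist <= true_dist always holds (the lower-bounding property)
```

Candidates that survive the signature filter are then verified against the raw sequences, so the index only affects speed, never correctness.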
GAMPS: Compressing Multi-sensor Data by Grouping and Amplitude Scaling, in ACM SIGMOD, 2009
"... We consider the problem of collectively approximating a set of sensor signals using the least amount of space so that any individual signal can be efficiently reconstructed within a given maximum (L∞) error ε. The problem arises naturally in applications that need to collect large amounts of data fr ..."
Abstract

Cited by 20 (0 self)
 Add to MetaCart
(Show Context)
We consider the problem of collectively approximating a set of sensor signals using the least amount of space so that any individual signal can be efficiently reconstructed within a given maximum (L∞) error ε. The problem arises naturally in applications that need to collect large amounts of data from multiple concurrent sources, such as sensors, servers and network routers, and archive them over a long period of time for offline data mining. We present GAMPS, a general framework that addresses this problem by combining several novel techniques. First, it dynamically groups multiple signals together so that signals within each group are correlated and can be maximally compressed jointly. Second, it appropriately scales the amplitudes of different signals within a group and compresses them within the maximum allowed reconstruction error bound. Our schemes are polynomial-time (α, β)-approximation schemes, meaning that the maximum (L∞) error is at most αε and the space used is at most β times the optimum. Finally, GAMPS maintains an index so that various queries can be issued directly on compressed data. Our experiments on several real-world sensor datasets show that GAMPS significantly reduces space without compromising the quality of search and query.
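The single-signal building block, piecewise-constant compression under a hard L∞ bound, can be sketched greedily. This is only the per-signal step; GAMPS's actual contribution, grouping correlated signals and amplitude scaling across a group, is not shown:

```python
def compress_linf(signal, eps):
    # Grow each segment while (max - min) <= 2*eps, then store
    # (length, midrange); every reconstructed value is then within
    # eps of the original, a hard L-infinity guarantee.
    segments, i, n = [], 0, len(signal)
    while i < n:
        lo = hi = signal[i]
        j = i + 1
        while j < n and max(hi, signal[j]) - min(lo, signal[j]) <= 2 * eps:
            lo, hi = min(lo, signal[j]), max(hi, signal[j])
            j += 1
        segments.append((j - i, (lo + hi) / 2.0))
        i = j
    return segments

def reconstruct(segments):
    out = []
    for length, value in segments:
        out.extend([value] * length)
    return out

signal = [1.0, 1.1, 0.9, 5.0, 5.2]
segs = compress_linf(signal, eps=0.2)   # two segments instead of five values
```

Grouping pays off on top of this: when several signals share one segment sequence (after per-signal amplitude scaling), the group stores the breakpoints once.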
A Multiresolution Symbolic Representation of Time Series, in Proc. IEEE Int. Conf. on Data Engineering (ICDE ’05), 2005
"... Efficiently and accurately searching for similarities among time series and discovering interesting patterns is an important and nontrivial problem. In this paper, we introduce a new representation of time series, the Multiresolution Vector Quantized (MVQ) approximation, along with a new distance f ..."
Abstract

Cited by 18 (4 self)
 Add to MetaCart
(Show Context)
Efficiently and accurately searching for similarities among time series and discovering interesting patterns is an important and non-trivial problem. In this paper, we introduce a new representation of time series, the Multiresolution Vector Quantized (MVQ) approximation, along with a new distance function. The novelty of MVQ is that it keeps both local and global information about the original time series in a hierarchical mechanism, processing the original time series at multiple resolutions. Moreover, the proposed representation is symbolic, employing key subsequences, and potentially allows the application of text-based retrieval techniques to the similarity analysis of time series. The proposed method is fast and scales linearly with the size of …
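The vector-quantized, symbolic part of the idea can be sketched at a single resolution. The codebook here is hand-written for illustration; MVQ learns its key subsequences from the data and repeats the encoding at several resolutions:

```python
import numpy as np

def encode_vq(x, codebook, w):
    # Split the series into length-w subsequences and replace each
    # with the index of its nearest codebook entry (Euclidean),
    # turning the series into a short symbol string on which
    # text-style retrieval techniques can operate.
    x = np.asarray(x, dtype=float)
    symbols = []
    for i in range(0, len(x) - w + 1, w):
        d = np.linalg.norm(codebook - x[i:i + w], axis=1)
        symbols.append(int(d.argmin()))
    return symbols

codebook = np.array([[0., 0., 0., 0.],    # symbol 0: flat
                     [0., 1., 2., 3.],    # symbol 1: rising
                     [3., 2., 1., 0.]])   # symbol 2: falling
symbols = encode_vq([0, 1, 2, 3, 3, 2, 1, 0], codebook, w=4)  # -> [1, 2]
```

Comparing symbol frequencies at each resolution gives a distance that mixes local shape with global structure.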
F4: Large-Scale Automated Forecasting Using Fractals, in Proc. of CIKM ’02, 2002, pp. 2–9
"... Forecasting has attracted a lot of research interest, with very successful methods for periodic time series. Here, we propose a fast, automated method to do nonlinear forecasting, for both periodic as well as chaotic time series. We use the technique of delay coordinate embedding, which needs sever ..."
Abstract

Cited by 11 (3 self)
 Add to MetaCart
(Show Context)
Forecasting has attracted a lot of research interest, with very successful methods for periodic time series. Here, we propose a fast, automated method to do non-linear forecasting, for both periodic as well as chaotic time series. We use the technique of delay-coordinate embedding, which needs several parameters; our contribution is the automated way of setting these parameters, using the concept of ‘intrinsic dimensionality’. Our operational system has fast and scalable algorithms for preprocessing and, using R-trees, also has fast methods for forecasting. The result of this work is a black box which, given a time series as input, finds the best parameter settings and generates a prediction system. Tests on real and synthetic data show that our system achieves low error, while it can handle arbitrarily large datasets.
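Delay-coordinate embedding itself is a few lines. F4's contribution is choosing the embedding dimension and lag automatically from the fractal dimension of the data; in this sketch both are supplied by hand:

```python
import numpy as np

def delay_embed(x, dim, lag):
    # Each row is the state vector [x[t], x[t-lag], ..., x[t-(dim-1)*lag]].
    x = np.asarray(x, dtype=float)
    start = (dim - 1) * lag
    return np.stack([x[start - k * lag: len(x) - k * lag]
                     for k in range(dim)], axis=1)

E = delay_embed(np.arange(10.0), dim=3, lag=2)
# E[0] is [x[4], x[2], x[0]] = [4, 2, 0], E[1] is [5, 3, 1], and so on.
# To forecast x[t+1], find the nearest embedded states to the current
# one (F4 uses an R-tree for this lookup) and combine their successors.
```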
A Bit Level Representation for Time Series Data Mining with Shape Based Similarity, 2006
"... Clipping is the process of transforming a real valued series into a sequence of bits representing whether each data is above or below the average. In this paper we argue that clipping is a useful and flexible transformation for the exploratory analysis of large time dependent data sets. We demonst ..."
Abstract

Cited by 10 (0 self)
 Add to MetaCart
(Show Context)
Clipping is the process of transforming a real-valued series into a sequence of bits representing whether each data point is above or below the average. In this paper we argue that clipping is a useful and flexible transformation for the exploratory analysis of large time-dependent data sets. We demonstrate how time series stored as bits can be very efficiently compressed and manipulated and that, under some assumptions, the discriminatory power with clipped series is asymptotically equivalent to that achieved with the raw data. Unlike other transformations, clipped series can be compared directly to the raw data series. We show that this means we can form a tight lower-bounding metric for Euclidean and Dynamic Time Warping distance and hence efficiently query by content. Clipped data can be used in conjunction with a host of algorithms and statistical tests that naturally follow from the binary nature of the data. A series of experiments illustrate how clipped series can be used in increasingly complex ways to achieve better results than with other popular techniques. The usefulness of the representation is demonstrated by the fact that the results with clipped data are consistently better than those achieved with a Wavelet or Discrete Fourier Transformation at the same compression ratio for both clustering and query by content. The flexibility of the representation is shown by the fact that we can take advantage of a variable run-length encoding of clipped series to define an approximation of the Kolmogorov complexity and hence perform …
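Clipping and the run-length encoding it enables are both a few lines; a minimal sketch of the representation (not the paper's lower-bounding distance):

```python
import numpy as np

def clip(x):
    # Clip a real-valued series to bits: 1 where the value is above
    # the series mean, 0 otherwise.
    x = np.asarray(x, dtype=float)
    return (x > x.mean()).astype(np.uint8)

def run_length_encode(bits):
    # Variable run-length encoding of the clipped series as
    # (bit, count) pairs; its length is the compressibility proxy
    # the paper uses to approximate Kolmogorov complexity.
    runs = []
    for b in bits:
        if runs and runs[-1][0] == int(b):
            runs[-1][1] += 1
        else:
            runs.append([int(b), 1])
    return [(b, c) for b, c in runs]

x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]   # mean = 6.5
bits = clip(x)                           # [0, 0, 0, 1, 1, 1]
runs = run_length_encode(bits)           # [(0, 3), (1, 3)]
```

Because each value costs one bit, 64 clipped points fit in the space of a single double, and bitwise operations (XOR, popcount) compare whole words of the series at once.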
SciQL, Bridging the Gap between Science and Relational DBMS, in Proceedings of the 15th International Database Engineering & Applications Symposium
"... Scientific discoveries increasingly rely on the ability to efficiently grind massive amounts of experimental data using database technologies. To bridge the gap between the needs of the DataIntensive Research fields and the current DBMS technologies, we propose SciQL (pronounced as ‘cycle’), the f ..."
Abstract

Cited by 10 (3 self)
 Add to MetaCart
(Show Context)
Scientific discoveries increasingly rely on the ability to efficiently grind massive amounts of experimental data using database technologies. To bridge the gap between the needs of the Data-Intensive Research fields and the current DBMS technologies, we propose SciQL (pronounced as ‘cycle’), the first SQL-based query language for scientific applications with both tables and arrays as first-class citizens. It provides a seamless symbiosis of array, set and sequence interpretations. A key innovation is the extension of value-based grouping of SQL:2003 with structural grouping, i.e., fixed-sized and unbounded groups based on explicit relationships between elements' positions. This leads to a generalisation of window-based query processing with wide applicability in science domains. This paper describes the main language features of SciQL and illustrates it using time-series concepts.
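The contrast between value-based and structural grouping is easy to state outside SQL. A plain-Python analogy (deliberately not SciQL syntax, which this entry does not show):

```python
def structural_group_mean(cells, size):
    # "Structural" grouping: groups are defined by element position
    # (consecutive windows of `size` cells), not by cell value as in
    # SQL's ordinary GROUP BY. Fixed-size, non-overlapping groups.
    return [sum(cells[i:i + size]) / size
            for i in range(0, len(cells) - size + 1, size)]

# An ordinary GROUP BY would key the six cells on their values;
# structural grouping instead keys them on where they sit in the array:
means = structural_group_mean([1, 2, 3, 4, 5, 6], 3)   # [2.0, 5.0]
```

Sliding (overlapping) windows are the unbounded-group variant of the same idea, which is why the abstract calls this a generalisation of window-based query processing.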
Temporal Data Clustering via Weighted Clustering Ensemble with Different Representations, IEEE Transactions on Knowledge and Data Engineering, 2011
"... Abstract—Temporal data clustering provides underpinning techniques for discovering the intrinsic structure and condensing information over temporal data. In this paper, we present a temporal data clustering framework via a weighted clustering ensemble of multiple partitions produced by initial clust ..."
Abstract

Cited by 9 (0 self)
 Add to MetaCart
(Show Context)
Temporal data clustering provides underpinning techniques for discovering the intrinsic structure of, and condensing information over, temporal data. In this paper, we present a temporal data clustering framework via a weighted clustering ensemble of multiple partitions produced by initial clustering analysis on different temporal data representations. In our approach, we propose a novel weighted consensus function guided by clustering validation criteria to reconcile initial partitions to candidate consensus partitions from different perspectives, and then introduce an agreement function to further reconcile those candidate consensus partitions to a final partition. As a result, the proposed weighted clustering ensemble algorithm provides an effective enabling technique for the joint use of different representations, which cuts the information loss in a single representation and exploits various information sources underlying temporal data. In addition, our approach tends to capture the intrinsic structure of a data set, e.g., the number of clusters. Our approach has been evaluated with benchmark time series, motion trajectory, and time-series data stream clustering tasks. Simulation results demonstrate that our approach yields favorable results for a variety of temporal data clustering tasks. As our weighted cluster ensemble algorithm can combine any input partitions to generate a clustering ensemble, we also investigate its limitation by formal analysis and empirical studies.
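A standard building block for this kind of consensus function is the weighted co-association matrix; a minimal sketch, with the caveat that the paper derives its weights from clustering-validation criteria whereas here they are supplied by hand:

```python
import numpy as np

def weighted_coassociation(partitions, weights):
    # Entry (i, j) is the weighted fraction of input partitions that
    # place items i and j in the same cluster; a final partition can
    # then be read off by clustering this matrix.
    n = len(partitions[0])
    M = np.zeros((n, n))
    for labels, w in zip(partitions, weights):
        labels = np.asarray(labels)
        M += w * (labels[:, None] == labels[None, :])
    return M / sum(weights)

# Two partitions of four items (e.g. from two different time-series
# representations); the second partition is trusted twice as much.
M = weighted_coassociation([[0, 0, 1, 1], [0, 0, 0, 1]], [1.0, 2.0])
# M[0, 1] == 1.0: items 0 and 1 co-cluster in every input partition.
```

Because the matrix only records pairwise agreement, it combines partitions with different numbers of clusters, which is what lets the ensemble estimate the intrinsic cluster count.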