Results 1  10
of
141
Patterns of temporal variation in online media
, 2010
"... Online content exhibits rich temporal dynamics, and diverse realtime user generated content further intensifies this process. However, temporal patterns by which online content grows and fades over time, and by which different pieces of content compete for attention remain largely unexplored. We stu ..."
Abstract

Cited by 144 (5 self)
 Add to MetaCart
(Show Context)
Online content exhibits rich temporal dynamics, and diverse realtime user generated content further intensifies this process. However, temporal patterns by which online content grows and fades over time, and by which different pieces of content compete for attention remain largely unexplored. We study temporal patterns associated with online content and how the content’s popularity grows and fades over time. The attention that content receives on the Web varies depending on many factors and occurs on very different time scales and at different resolutions. In order to uncover the temporal dynamics of online content we formulate a time series clustering problem using a similarity metric that is invariant to scaling and shifting. We develop the KSpectral Centroid (KSC) clustering algorithm that effectively finds cluster centroids with our similarity measure. By applying an adaptive waveletbased incremental approach to clustering, we scale KSC to large data sets. We demonstrate our approach on two massive datasets: a set of 580 million Tweets, and a set of 170 million blog posts and news media articles. We find that KSC outperforms the Kmeans clustering algorithm in finding distinct shapes of time series. Our analysis shows that there are six main temporal shapes of attention of online content. We also present a simple model that reliably predicts the shape of attention by using information about only a small number of participants. Our analyses offer insight into common temporal patterns of the content on the Web and broaden the understanding of the dynamics of human attention.
Time Series Shapelets: A New Primitive for Data Mining
"... Classification of time series has been attracting great interest over the past decade. Recent empirical evidence has strongly suggested that the simple nearest neighbor algorithm is very difficult to beat for most time series problems. While this may be considered good news, given the simplicity of ..."
Abstract

Cited by 49 (7 self)
 Add to MetaCart
(Show Context)
Classification of time series has been attracting great interest over the past decade. Recent empirical evidence has strongly suggested that the simple nearest neighbor algorithm is very difficult to beat for most time series problems. While this may be considered good news, given the simplicity of implementing the nearest neighbor algorithm, there are some negative consequences of this. First, the nearest neighbor algorithm requires storing and searching the entire dataset, resulting in a time and space complexity that limits its applicability, especially on resourcelimited sensors. Second, beyond mere classification accuracy, we often wish to gain some insight into the data. In this work we introduce a new time series primitive, time series shapelets, which addresses these limitations. Informally, shapelets are time series subsequences which are in some sense maximally representative of a class. As we shall show with extensive empirical evaluations in diverse domains, algorithms based on the time series shapelet primitives can be interpretable, more accurate and significantly faster than stateoftheart classifiers.
Searching and mining trillions of time series subsequences under dynamic time warping
 In SIGKDD
, 2012
"... Most time series data mining algorithms use similarity search as a core subroutine, and thus the time taken for similarity search is the bottleneck for virtually all time series data mining algorithms. The difficulty of scaling search to large datasets largely explains why most academic work on time ..."
Abstract

Cited by 43 (3 self)
 Add to MetaCart
(Show Context)
Most time series data mining algorithms use similarity search as a core subroutine, and thus the time taken for similarity search is the bottleneck for virtually all time series data mining algorithms. The difficulty of scaling search to large datasets largely explains why most academic work on time series data mining has plateaued at considering a few millions of time series objects, while much of industry and science sits on billions of time series objects waiting to be explored. In this work we show that by using a combination of four novel ideas we can search and mine truly massive time series for the first time. We demonstrate the following extremely unintuitive fact; in large datasets we can exactly search under DTW much more quickly than the current stateoftheart Euclidean distance search algorithms. We demonstrate our work on the largest set of time series experiments ever attempted. In particular, the largest dataset we consider is larger than the combined size of all of the time series datasets considered in all data mining papers ever published. We show that our ideas allow us to solve higherlevel time series data mining problem such as motif discovery and clustering at scales that would otherwise be untenable. In addition to mining massive datasets, we will show that our ideas also have implications for realtime monitoring of data streams, allowing us to handle much faster arrival rates and/or use cheaper and lower powered devices than are currently possible.
A Brief Survey on Sequence Classification
"... Sequence classification has a broad range of applications such as genomic analysis, information retrieval, health informatics, finance, and abnormal detection. Different from the classification task on feature vectors, sequences do not have explicit features. Even with sophisticated feature selectio ..."
Abstract

Cited by 31 (1 self)
 Add to MetaCart
(Show Context)
Sequence classification has a broad range of applications such as genomic analysis, information retrieval, health informatics, finance, and abnormal detection. Different from the classification task on feature vectors, sequences do not have explicit features. Even with sophisticated feature selection techniques, the dimensionality of potential features may still be very high and the sequential nature of features is difficult to capture. This makes sequence classification a more challenging task than classification on feature vectors. In this paper, we present a brief review of the existing work on sequence classification. We summarize the sequence classification in terms of methodologies and application domains. We also provide a review on several extensions of the sequence classification problem, such as early classification on sequences and semisupervised learning on sequences. 1.
Searching Trajectories by Locations – An Efficiency Study
"... Trajectory search has long been an attractive and challenging topic which blooms various interesting applications in spatialtemporal databases. In this work, we study a new problem of searching trajectories by locations, in which context the query is only a small set of locations with or without an ..."
Abstract

Cited by 29 (11 self)
 Add to MetaCart
(Show Context)
Trajectory search has long been an attractive and challenging topic which blooms various interesting applications in spatialtemporal databases. In this work, we study a new problem of searching trajectories by locations, in which context the query is only a small set of locations with or without an order specified, while the target is to find the k BestConnected Trajectories (kBCT) from a database such that the kBCT best connect the designated locations geographically. Different from the conventional trajectory search that looks for similar trajectories w.r.t. shape or other criteria by using a sample query trajectory, we focus on the goodness of connection provided by a trajectory to the specified query locations. This new query can benefit users in many novel applications such as trip planning. In our work, we firstly define a new similarity function for measuring how well a trajectory connects the query locations, with both spatial distance and order constraint being considered. Upon the observation that the number of query locations is normally small (e.g. 10 or less) since it is impractical for a user to input too many locations, we analyze the feasibility of using a generalpurpose spatial index to achieve efficient kBCT search, based on a simple Incremental kNN based Algorithm (IKNN). The IKNN effectively prunes and refines trajectories by using the devised lower bound and upper bound of similarity. Our contributions mainly lie in adapting the bestfirst and depthfirst kNN algorithms to the basic IKNN properly, and more importantly ensuring the efficiency in both search effort and memory usage. An indepth study on the adaption and its efficiency is provided. Further optimization is also presented to accelerate the IKNN algorithm. Finally, we verify the efficiency of the algorithm by extensive experiments.
Online Discovery and Maintenance of Time Series Motifs
"... The detection of repeated subsequences, time series motifs, is a problem which has been shown to have great utility for several higherlevel data mining algorithms, including classification, clustering, segmentation, forecasting, and rule discovery. In recent years there has been significant researc ..."
Abstract

Cited by 23 (4 self)
 Add to MetaCart
(Show Context)
The detection of repeated subsequences, time series motifs, is a problem which has been shown to have great utility for several higherlevel data mining algorithms, including classification, clustering, segmentation, forecasting, and rule discovery. In recent years there has been significant research effort spent on efficiently discovering these motifs in static offline databases. However, for many domains, the inherent streaming nature of time series demands online discovery and maintenance of time series motifs. In this paper, we develop the first online motif discovery algorithm which monitors and maintains motifs exactly in real time over the most recent history of a stream. Our algorithm has a worstcase update time which is linear to the window size and is extendible to maintain more complex pattern structures. In contrast, the current offline algorithms either need significant update time or require very costly preprocessing steps which online algorithms simply cannot afford. Our core ideas allow useful extensions of our algorithm to deal with arbitrary data rates and discovering multidimensional motifs. We demonstrate the utility of our algorithms with a variety of case studies in the domains of robotics, acoustic monitoring and online compression.
Finding Comparable Temporal Categorical Records: A Similarity Measure with an Interactive Visualization
"... An increasing number of temporal categorical databases are being collected: Electronic Health Records in healthcare organizations, traffic incident logs in transportation systems, or student records in universities. Finding similar records within these large databases requires effective similarity m ..."
Abstract

Cited by 22 (9 self)
 Add to MetaCart
(Show Context)
An increasing number of temporal categorical databases are being collected: Electronic Health Records in healthcare organizations, traffic incident logs in transportation systems, or student records in universities. Finding similar records within these large databases requires effective similarity measures that capture the searcher’s intent. Many similarity measures exist for numerical time series, but temporal categorical records are different. We propose a temporal categorical similarity measure, the M&M (Match & Mismatch) measure, which is based on the concept of aligning records by sentinel events, then matching events between the target and the compared records. The M&M measure combines the time differences between pairs of events and the number of mismatches. To accommodate customization of parameters in the M&M measure and results interpretation, we implemented Similan, an interactive search and visualization tool for temporal categorical records. A usability study with 8 participants demonstrated that Similan was easy to learn and enabled them to find similar records, but users had difficulty understanding the M&M measure. The usability study feedback, led to an improved version with a continuous timeline, which was tested in a pilot study with 5 participants.
A ComplexityInvariant Distance Measure for Time Series Gustavo E.A.P.A. Batista 1,2
"... The ubiquity of time series data across almost all human endeavors has produced a great interest in time series data mining in the last decade. While there is a plethora of classification algorithms that can be applied to time series, all of the current empirical evidence suggests that simple neares ..."
Abstract

Cited by 21 (3 self)
 Add to MetaCart
(Show Context)
The ubiquity of time series data across almost all human endeavors has produced a great interest in time series data mining in the last decade. While there is a plethora of classification algorithms that can be applied to time series, all of the current empirical evidence suggests that simple nearest neighbor classification is exceptionally difficult to beat. The choice of distance measure used by the nearest neighbor algorithm depends on the invariances required by the domain. For example, motion capture data typically requires invariance to warping. In this work we make a surprising claim. There is an invariance that the community has missed, complexity invariance. Intuitively, the problem is that in many domains the different classes may have different complexities, and pairs of complex objects, even those which subjectively may seem very similar to the human eye, tend to be further apart under current distance measures than pairs of simple objects. This fact introduces errors in nearest neighbor classification, where complex objects are incorrectly assigned to a simpler class. We introduce the first complexityinvariant distance measure for time series, and show that it generally produces significant improvements in classification accuracy. We further show that this improvement does not compromise efficiency, since we can lower bound the measure and use a modification of triangular inequality, thus making use of most existing indexing and data mining algorithms. We evaluate our ideas with the largest and most comprehensive set of time series classification experiments ever attempted, and show that complexityinvariant distance measures can produce improvements in accuracy in the vast majority of cases.
LogicalShapelets: An Expressive Primitive for Time Series Classification
"... Time series shapelets are small, local patterns in a time series that are highly predictive of a class and are thus very useful features for building classifiers and for certain visualization and summarization tasks. While shapelets were introduced only recently, they have already seen significant a ..."
Abstract

Cited by 19 (3 self)
 Add to MetaCart
(Show Context)
Time series shapelets are small, local patterns in a time series that are highly predictive of a class and are thus very useful features for building classifiers and for certain visualization and summarization tasks. While shapelets were introduced only recently, they have already seen significant adoption and extension in the community. Despite their immense potential as a data mining primitive, there are two important limitations of shapelets. First, their expressiveness is limited to simple binary presence/absence questions. Second, even though shapelets are computed offline, the time taken to compute them is significant. In this work, we address the latter problem by introducing a novel algorithm that finds shapelets in less time than current methods by an order of magnitude. Our algorithm is based on intelligent caching and reuse of computations, and the admissible pruning of the search space. Because our algorithm is so fast, it creates an opportunity to consider more expressive shapelet queries. In particular, we show for the first time an augmented shapelet representation that distinguishes the data based on conjunctions or disjunctions of shapelets. We call our novel representation LogicalShapelets. We demonstrate the efficiency of our approach on the classic benchmark datasets used for these problems, and show several case studies where logical shapelets significantly outperform the original shapelet representation and other time series classification techniques. We demonstrate the utility of our ideas in domains as diverse as gesture recognition, robotics, and biometrics.