On the Need for Time Series Data Mining Benchmarks: A Survey and Empirical Demonstration
- SIGKDD'02
, 2002
Cited by 169 (41 self)
... mining time series data. Literally hundreds of papers have introduced new algorithms to index, classify, cluster and segment time series. In this work we make the following claim. Much of this work has very little utility because the contribution made (speed in the case of indexing, accuracy in the case of classification and clustering, model accuracy in the case of segmentation) offers an amount of "improvement" that would have been completely dwarfed by the variance that would have been observed by testing on many real world datasets, or the variance that would have been observed by changing minor (unstated) implementation details. To illustrate our point
Learning to Classify Documents According to Genre
- In IJCAI-03 Workshop on Computational Approaches to Style Analysis and Synthesis
, 2003
Cited by 56 (0 self)
Genre or style analysis can be used to improve results achieved using standard IR techniques. A genre class is a group of documents that are written in a similar style. Genre classification can identify documents that are written in a style most likely to satisfy a user's information need.
From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series
, 2010
Cited by 34 (4 self)
We connect measures of public opinion measured from polls with sentiment measured from text. We analyze several surveys on consumer confidence and political opinion over the 2008 to 2009 period, and find they correlate to sentiment word frequencies in contemporaneous Twitter messages. While our results vary across datasets, in several cases the correlations are as high as 80%, and capture important large-scale trends. The results highlight the potential of text streams as a substitute and supplement for traditional polling.
Efficient algorithms for sequence segmentation
Cited by 17 (4 self)
The sequence segmentation problem asks for a partition of the sequence into k non-overlapping segments that cover all data points such that each segment is as homogeneous as possible. This problem can be solved optimally using dynamic programming in O(n² k) time, where n is the length of the sequence. Given that sequences in practice are too long, a quadratic algorithm is not an adequately fast solution. Here, we present an alternative constant-factor approximation algorithm with running time O(n^(4/3) k^(5/3)). We call this algorithm the DNS algorithm. We also consider the recursive application of the DNS algorithm, which results in a faster algorithm (O(n log log n) running time) with an O(log n) approximation factor, and study the accuracy/efficiency tradeoff. Extensive experimental results show that these algorithms outperform other widely used heuristics. The same algorithms can speed up solutions for other variants of the basic segmentation problem while keeping their approximation factors constant. Our techniques can also be used in a streaming setting, with sublinear memory requirements.
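For orientation, the optimal O(n² k) dynamic program mentioned in this abstract can be sketched as follows. This is a minimal illustration, not the paper's code and not the DNS algorithm: it assumes that "homogeneous" means minimal squared error around each segment's mean, and the function name `segment_optimal` is hypothetical.

```python
from typing import List, Tuple


def segment_optimal(seq: List[float], k: int) -> Tuple[float, List[int]]:
    """Return (total squared error, segment start indices) of an optimal k-segmentation.

    Assumption: segment cost is the squared error around the segment mean.
    """
    n = len(seq)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(seq)")

    # Prefix sums of values and of squares give any segment's error in O(1).
    s = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, x in enumerate(seq):
        s[i + 1] = s[i] + x
        s2[i + 1] = s2[i] + x * x

    def err(i: int, j: int) -> float:
        """Squared error of representing seq[i:j] by its mean."""
        total = s[j] - s[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # cost[m][j]: best error for splitting seq[:j] into exactly m segments.
    cost = [[inf] * (n + 1) for _ in range(k + 1)]
    # back[m][j]: start index of the last segment in that best split.
    back = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                c = cost[m - 1][i] + err(i, j)
                if c < cost[m][j]:
                    cost[m][j] = c
                    back[m][j] = i

    # Follow the back pointers from the end to recover the segment starts.
    starts: List[int] = []
    j = n
    for m in range(k, 0, -1):
        j = back[m][j]
        starts.append(j)
    starts.reverse()
    return cost[k][n], starts
```

The three nested loops over m, j and i are what give the O(n² k) running time that the abstract describes as too slow for long sequences.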
Predicting risk from financial reports with regression
- In Proc. NAACL Human Language Technologies Conf
, 2009
Cited by 15 (6 self)
We address a text regression problem: given a piece of text, predict a real-world continuous quantity associated with the text’s meaning. In this work, the text is an SEC-mandated financial report published annually by a publicly traded company, and the quantity to be predicted is volatility of stock returns, an empirical measure of financial risk. We apply well-known regression techniques to a large corpus of freely available financial reports, constructing regression models of volatility for the period following a report. Our models rival past volatility (a strong baseline) in predicting the target variable, and a single model that uses both can significantly outperform past volatility. Interestingly, our approach is more accurate for reports after the passage of the Sarbanes-Oxley Act of 2002, giving some evidence for the success of that legislation in making financial reports more informative.
Reading the markets: Forecasting public opinion of political candidates by news analysis
- In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)
, 2008
Cited by 9 (0 self)
Media reporting shapes public opinion, which can in turn influence events, particularly in political elections, in which candidates both respond to and shape public perception of their campaigns. We use computational linguistics to automatically predict the impact of news on public perception of political candidates. Our system uses daily newspaper articles to predict shifts in public opinion as reflected in prediction markets. We discuss various types of features designed for this problem. The news system improves market prediction over baseline market systems.
The Predicting Power of Textual Information on Financial Markets
, 2005
Cited by 2 (0 self)
Mining textual documents and time series concurrently, such as predicting the movements of stock prices based on the contents of news stories, is an emerging topic in the data mining community. Previous research has shown that there is a strong relationship between the time when news stories are released and the time when stock prices fluctuate. In this paper, we propose a systematic framework for predicting the tertiary movements of stock prices by analyzing the impacts of news stories on the stocks. To be more specific, we investigate the immediate impacts of news stories on the stocks based on the Efficient Markets Hypothesis. Several data mining and text mining techniques are used in a novel way. Extensive experiments using real-life data are conducted, and encouraging results are obtained.
Correlating Financial Time Series with Micro-Blogging Activity
Cited by 2 (0 self)
We study the problem of correlating micro-blogging activity with stock-market events, defined as changes in the price and traded volume of stocks. Specifically, we collect messages related to a number of companies, and we search for correlations between stock-market events for those companies and features extracted from the micro-blogging messages. The features we extract can be categorized in two groups. Features in the first group measure the overall activity in the micro-blogging platform, such as number of posts, number of re-posts, and so on. Features in the second group measure properties of an induced interaction graph, for instance, the number of connected components, statistics on the degree distribution, and other graph-based properties. We present detailed experimental results measuring the correlation of the stock-market events with these features, using Twitter as a data source. Our results show that the most correlated features are the number of connected components and the number of nodes of the interaction graph. The correlation is stronger with the traded volume than with the price of the stock. However, by using a simulator we show that even relatively small correlations between price and micro-blogging features can be exploited to drive a stock trading strategy that outperforms other baseline strategies.
An Improved Feature Extraction Technique for High Volume Time Series Data
Cited by 1 (0 self)
The field of time series data mining has seen an explosion of interest in recent years. This interest has spilled over into many application areas, including fiber manufacturing systems. The volume of time series data generated by a fiber monitoring system can be huge. This limits the applicability of data mining algorithms to this problem domain. A widely used solution is to reduce the data size through feature extraction. Four of the most commonly used feature extraction techniques are Fourier transforms, Wavelets, Piecewise Aggregate Approximation, and Piecewise Linear Approximation (PLA). In this paper, we first empirically demonstrate that PLA techniques produce the highest quality features for this problem domain. We then introduce a novel PLA algorithm that is shown to produce higher quality features than any other currently available techniques.
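As a rough illustration of feature extraction by data reduction, the sketch below implements Piecewise Aggregate Approximation, one of the four techniques the abstract lists. It is not the paper's novel PLA algorithm. The function name `paa` and the rule for splitting the series into near-equal windows are assumptions made for this example.

```python
from typing import List


def paa(series: List[float], n_segments: int) -> List[float]:
    """Reduce a series to n_segments features, each the mean of one window.

    Assumption: windows are contiguous and as close to equal length as
    integer boundaries allow.
    """
    n = len(series)
    if not 1 <= n_segments <= n:
        raise ValueError("n_segments must be between 1 and len(series)")
    features: List[float] = []
    for s in range(n_segments):
        # Integer boundaries keep every window non-empty when n_segments <= n.
        start = s * n // n_segments
        end = (s + 1) * n // n_segments
        window = series[start:end]
        features.append(sum(window) / len(window))
    return features
```

For example, `paa([1, 2, 3, 4, 5, 6], 3)` averages the windows `[1, 2]`, `[3, 4]` and `[5, 6]`, so six points are reduced to three features.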
Identifying and Following Expert Investors in Stock Microblogs, Roy Bar-Haim
Cited by 1 (0 self)
Information published in online stock investment message boards, and more recently in stock microblogs, is considered highly valuable by many investors. Previous work focused on aggregation of sentiment from all users. However, in this work we show that it is beneficial to distinguish expert users from non-experts. We propose a general framework for identifying expert investors, and use it as a basis for several models that predict stock rise from stock microblogging messages (stock tweets). In particular, we present two methods that combine expert identification and per-user unsupervised learning. These methods were shown to achieve relatively high precision in predicting stock rise, and significantly outperform our baseline. In addition, our work provides an in-depth analysis of the content and potential usefulness of stock tweets.

