Results 1 - 10
of
72
On-line New Event Detection and Tracking
, 1998
"... We define and describe the related problems of new event detection and event tracking within a stream of broadcast news stories. We focus on a strict on-line setting-i.e., the system must make decisions about one story before looking at any subsequent stories. Our approach to detection uses a singl ..."
Abstract
-
Cited by 106 (4 self)
- Add to MetaCart
We define and describe the related problems of new event detection and event tracking within a stream of broadcast news stories. We focus on a strict on-line setting-i.e., the system must make decisions about one story before looking at any subsequent stories. Our approach to detection uses a single pass clustering algorithm and a novel thresholding model that incorporates the properties of events as a major component. Our ap-proach to tracking is similar to typical information filtering methods. We discuss the value of “surprising” features that have unusual occurrence characteristics, and briefly explore on-line adaptive filtering to handle evolving events in the news. New event detection and event tracking are part of the Topic Detection and Tracking (TDT) initiative.
Novelty Detection: A Review - Part 1: Statistical Approaches
- Signal Processing
, 2003
"... Novelty detection is the identification of new or unknown data or signal that a machine learning system is not aware of during training. Novelty detection is one of the fundamental requirements of a good classification or identification system since sometimes the test data contains information abou ..."
Abstract
-
Cited by 67 (0 self)
- Add to MetaCart
Novelty detection is the identification of new or unknown data or signal that a machine learning system is not aware of during training. Novelty detection is one of the fundamental requirements of a good classification or identification system since sometimes the test data contains information about objects that were not known at the time of training the model. In this paper we provide stateof -the-art review in the area of novelty detection based on statistical approaches. The second part paper details novelty detection using neural networks. As discussed, there are a multitude of applications where novelty detection is extremely important including signal processing, computer vision, pattern recognition, data mining, and robotics.
Learning approaches for Detecting and Tracking News Events
- IEEE Intelligent Systems
, 1999
"... This paper studies the effective use of information retrieval and machine learning techniques in a new task, event detection and tracking. The objective is to automatically detect novel events from chronologically-ordered streams of news stories, and track events of interest over time. We extended e ..."
Abstract
-
Cited by 58 (6 self)
- Add to MetaCart
This paper studies the effective use of information retrieval and machine learning techniques in a new task, event detection and tracking. The objective is to automatically detect novel events from chronologically-ordered streams of news stories, and track events of interest over time. We extended existing supervised learning and unsupervised clustering algorithms to allow document classification based on both information content and temporal aspects of events. A task-oriented evaluation was conducted using Reuters and CNN news stories. We found agglomerative document clustering highly effective (82% in the F 1 measure) for retrospective event detection, and single-pass clustering with time windowing a better choice for on-line alerting of novel events. We also observed robust learning behavior for k-nearest neighbor (kNN) classification and a decision-tree approach in event tracking, under the difficult condition when the number of positive training examples is extremely small. 1 Intr...
Towards Multidocument Summarization by Reformulation: Progress and Prospects
- IN PROCEEDINGS OF AAAI-99
, 1999
"... By synthesizing information common to retrieved documents, multi-document summarization can help users of information retrieval systems to find relevant documents with a minimal amount of reading. We are developing a multidocument summarization system to automatically generate a concise summary ..."
Abstract
-
Cited by 57 (14 self)
- Add to MetaCart
By synthesizing information common to retrieved documents, multi-document summarization can help users of information retrieval systems to find relevant documents with a minimal amount of reading. We are developing a multidocument summarization system to automatically generate a concise summary by identifying and synthesizing similarities across a set of related documents. Our approach is unique in its integration of machine learning and statistical techniques to identify similar paragraphs, intersection of similar phrases within paragraphs, and language generation to reformulate the wording of the summary. Our evaluation of system components shows that learning over multiple extracted linguistic features is more effective than information retrieval approaches at identifying similar text units for summarization and that it is possible to generate a fluent summary that conveys similarities among documents even when full semantic interpretations of the input text are not available.
Creating and Evaluating Multi-Document Sentence Extract Summaries
- In International Conference on Information and Knowledge Management (CIKM
, 2000
"... This paper discusses passage extraction approaches to multidocument summarization that use available information about the document set as a whole and the relationships between the documents to build on single document summarization methodology. Multi-document summarization differs from single in t ..."
Abstract
-
Cited by 39 (1 self)
- Add to MetaCart
This paper discusses passage extraction approaches to multidocument summarization that use available information about the document set as a whole and the relationships between the documents to build on single document summarization methodology. Multi-document summarization differs from single in that the issues of compression, speed, redundancy and passage selection are critical in the formation of useful summaries, as well as the user's goals in creating the summary. Our approach addresses these issues by using domain-independent techniques based mainly on fast, statistical processing, a metric for reducing redundancy and maximizing diversity in the selected passages, and a modular framework to allow easy parameterization for different genres, corpora characteristics and user requirements. We examined howhumans create multi-document summaries as well as the characteristics of such summaries and use these summaries to evaluate the performance of various multidocument summarization algorithms. 1.
An Investigation of Linguistic Features and Clustering Algorithms for Topical Document Clustering
- In Proceedings of the 23rd ACM SIGIR Conference on Research and Development in Information Retrieval
, 2000
"... We investigate four hierarchical clustering methods (single-link, complete-link, groupwise-average, and single-pass) and two linguistically motivated text features (noun phrase heads and proper names) in the context of document clustering. A statistical model for combining similarity information fro ..."
Abstract
-
Cited by 33 (4 self)
- Add to MetaCart
We investigate four hierarchical clustering methods (single-link, complete-link, groupwise-average, and single-pass) and two linguistically motivated text features (noun phrase heads and proper names) in the context of document clustering. A statistical model for combining similarity information from multiple sources is described and applied to DARPA's Topic Detection and Tracking phase 2 (TDT2) data. This model, based on log-linear regression, alleviates the need for extensive search in order to determine optimal weights for combining input features. Through an extensive series of experiments with more than 40,000 documents from multiple news sources and modalities, we establish that both the choice of clustering algorithm and the introduction of the additional features have an impact on clustering performance. We apply our optimal combination of features to the TDT2 test data, obtaining partitions of the documents that compare favorably with the results obtained by participants in th...
Improving Text Categorization Methods for Event Tracking
- In Proceedings of ACM SIGIR Conference on Research and Development in Information Retrieval
, 2000
"... Automated tracking of events from chronologically ordered document streams is a new challenge for statistical text classification. Existing learning techniques must be adapted or improved in order to effectively handle difficult situations where the number of positive training instances per event is ..."
Abstract
-
Cited by 30 (5 self)
- Add to MetaCart
Automated tracking of events from chronologically ordered document streams is a new challenge for statistical text classification. Existing learning techniques must be adapted or improved in order to effectively handle difficult situations where the number of positive training instances per event is extremely small, the majority of training documents are unlabelled, and most of the events have a short duration in time. We adapted several supervised text categorization methods, specifically several new variants of the k-Nearest Neighbor (kNN) algorithm and a Rocchio approach, to track events. All of these methods showed significant improvement (up to 71% reduction in weighted error rates) over the performance of the original kNN algorithm on TDT benchmark collections, making kNN among the top-performing systems in the recent TDT3 official evaluation. Furthermore, by combining these methods, we significantly reduced the variance in performance of our event tracking system over different ...
Learning User Interest Dynamics with a Three-Descriptor Representation
, 2000
"... Learning users' interest categories is challenging in a dynamic environment like the Web because they change over time. This paper describes a novel scheme to represent a user's interest categories, and an adaptive algorithm to learn the dynamics of the user's interests through positive and negative ..."
Abstract
-
Cited by 23 (2 self)
- Add to MetaCart
Learning users' interest categories is challenging in a dynamic environment like the Web because they change over time. This paper describes a novel scheme to represent a user's interest categories, and an adaptive algorithm to learn the dynamics of the user's interests through positive and negative relevance feedback. We propose a three-descriptor model to represent a user's interests. The proposed model maintains a long-term interest descriptor to capture the user's general interests and a short-term interest descriptor to keep track of the user's more recent, faster-changing interests. An algorithm based on the three-descriptor representation is developed to acquire high accuracy of recognition for long-term interests, and to adapt quickly to changing interests in the short-term. The model is also extended to multiple three-descriptor representations to capture a broader range of interests. Empirical studies confirm the effectiveness of this scheme to accurately model a user's inter...
Mining Correlated Bursty Topic Patterns from Coordinated Text Streams
- KDD'07
, 2007
"... Previous work on text mining has almost exclusively focused on a single stream. However, we often have available multiple text streams indexed by the same set of time points (called coordinated text streams), which offer new opportunities for text mining. For example, when a major event happens, all ..."
Abstract
-
Cited by 19 (4 self)
- Add to MetaCart
Previous work on text mining has almost exclusively focused on a single stream. However, we often have available multiple text streams indexed by the same set of time points (called coordinated text streams), which offer new opportunities for text mining. For example, when a major event happens, all the news articles published by different agencies in different languages tend to cover the same event for a certain period, exhibiting a correlated bursty topic pattern in all the news article streams. In general, mining correlated bursty topic patterns from coordinated text streams can reveal interesting latent associations or events behind these streams. In this paper, we define and study this novel text mining problem. We propose a general probabilistic algorithm which can effectively discover correlated bursty patterns and their bursty periods across text streams even if the streams have completely different vocabularies (e.g., English vs Chinese). Evaluation of the proposed method on a news data set and a literature data set shows that it can effectively discover quite meaningful topic patterns from both data sets: the patterns discovered from the news data set accurately reveal the major common events covered in the two streams of news articles (in English and Chinese, respectively), while the patterns discovered from two database publication streams match well with the major research paradigm shifts in database research. Since the proposed method is general and does not require the streams to share vocabulary, it can be applied to any coordinated text streams to discover correlated topic patterns that burst in multiple streams in the same period.
Streaming first story detection with application to Twitter
- IN PROCEEDINGS OF THE 11TH ANNUAL CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (NAACL HLT 2010
, 2010
"... With the recent rise in popularity and size of social media, there is a growing need for systems that can extract useful information from this amount of data. We address the problem of detecting new events from a stream of Twitter posts. To make event detection feasible on web-scale corpora, we pres ..."
Abstract
-
Cited by 17 (1 self)
- Add to MetaCart
With the recent rise in popularity and size of social media, there is a growing need for systems that can extract useful information from this amount of data. We address the problem of detecting new events from a stream of Twitter posts. To make event detection feasible on web-scale corpora, we present an algorithm based on locality-sensitive hashing which is able overcome the limitations of traditional approaches, while maintaining competitive results. In particular, a comparison with a stateof-the-art system on the first story detection task shows that we achieve over an order of magnitude speedup in processing time, while retaining comparable performance. Event detection experiments on a collection of 160 million Twitter posts show that celebrity deaths are the fastest spreading news on Twitter. 1

