Results 1 - 10
of
41
Efficient phrase-based document indexing for Web document clustering
- IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING
, 2004
"... Document clustering techniques mostly rely on single term analysis of the document data set, such as the Vector Space Model. To achieve more accurate document clustering, more informative features including phrases and their weights are particularly important in such scenarios. Document clustering ..."
Abstract
-
Cited by 31 (1 self)
- Add to MetaCart
Document clustering techniques mostly rely on single term analysis of the document data set, such as the Vector Space Model. To achieve more accurate document clustering, more informative features including phrases and their weights are particularly important in such scenarios. Document clustering is particularly useful in many applications such as automatic categorization of documents, grouping search engine results, building a taxonomy of documents, and others. This paper presents two key parts of successful document clustering. The first part is a novel phrase-based document index model, the Document Index Graph, which allows for incremental construction of a phrase-based index of the document set with an emphasis on efficiency, rather than relying on single-term indexes only. It provides efficient phrase matching that is used to judge the similarity between documents. The model is flexible in that it could revert to a compact representation of the vector space model if we choose not to index phrases. The second part is an incremental document clustering algorithm based on maximizing the tightness of clusters by carefully watching the pair-wise document similarity distribution inside clusters. The combination of these two components creates an underlying model for robust and accurate document similarity calculation that leads to much improved results in Web document clustering over traditional methods.
Improving Text Categorization Methods for Event Tracking
- In Proceedings of ACM SIGIR Conference on Research and Development in Information Retrieval
, 2000
"... Automated tracking of events from chronologically ordered document streams is a new challenge for statistical text classification. Existing learning techniques must be adapted or improved in order to effectively handle difficult situations where the number of positive training instances per event is ..."
Abstract
-
Cited by 30 (5 self)
- Add to MetaCart
Automated tracking of events from chronologically ordered document streams is a new challenge for statistical text classification. Existing learning techniques must be adapted or improved in order to effectively handle difficult situations where the number of positive training instances per event is extremely small, the majority of training documents are unlabelled, and most of the events have a short duration in time. We adapted several supervised text categorization methods, specifically several new variants of the k-Nearest Neighbor (kNN) algorithm and a Rocchio approach, to track events. All of these methods showed significant improvement (up to 71% reduction in weighted error rates) over the performance of the original kNN algorithm on TDT benchmark collections, making kNN among the top-performing systems in the recent TDT3 official evaluation. Furthermore, by combining these methods, we significantly reduced the variance in performance of our event tracking system over different ...
Textquest: Document Clustering Of Medline Abstracts For Concept Discovery In Molecular Biology
, 2001
"... Introduction The vast accumulation of electronically available textual information has raised new challenges for information retrieval technology. The problem of content analysis was first introduced in the late 60's [1]. Since then, a number of approaches have emerged in order to exploit free-text ..."
Abstract
-
Cited by 25 (0 self)
- Add to MetaCart
Introduction The vast accumulation of electronically available textual information has raised new challenges for information retrieval technology. The problem of content analysis was first introduced in the late 60's [1]. Since then, a number of approaches have emerged in order to exploit free-text information from a variety of sources [2, 3]. In the fields of Biology and Medicine, abstracts are collected and maintained in Medline, a project supported by the U.S. National Library of Medicine (NLM) 1 . Medline constitutes a valuable resource that allows scientists to retrieve articles of interest, based on keyword searches. This query-based information retrieval is extremely useful but it only allows a limited exploitation of the knowledge available in biological abstracts. Query-based retrieval is useful for content-focused querying [4], where searches pre-suppose that the end-user is familiar with the subject at hand or that they would
Data Cleansing: Beyond Integrity Analysis
, 2000
"... The paper analyzes the problem of data cleansing and automatically identifying potential errors in data sets. An overview of the diminutive amount of existing literature concerning data cleansing is given. Methods for error detection that go beyond integrity analysis are reviewed and presented. The ..."
Abstract
-
Cited by 25 (0 self)
- Add to MetaCart
The paper analyzes the problem of data cleansing and automatically identifying potential errors in data sets. An overview of the diminutive amount of existing literature concerning data cleansing is given. Methods for error detection that go beyond integrity analysis are reviewed and presented. The applicable methods include: statistical outlier detection, pattern matching, clustering, and data mining techniques. Some brief results supporting the use of such methods are given. The future research directions necessary to address the data cleansing problem are discussed. Keywords: data cleansing, data cleaning, data quality, error detection.
Event threading within news topics
- In Proceedings of the ACM Thirteenth International Conference on Information and Knowledge Management
, 2004
"... With the overwhelming volume of online news available today, there is an increasing need for automatic techniques to analyze and present news to the user in a meaningful and efficient manner. Previous research focused only on organizing news stories by their topics into a flat hierarchy. We believe ..."
Abstract
-
Cited by 22 (2 self)
- Add to MetaCart
With the overwhelming volume of online news available today, there is an increasing need for automatic techniques to analyze and present news to the user in a meaningful and efficient manner. Previous research focused only on organizing news stories by their topics into a flat hierarchy. We believe viewing a news topic as a flat collection of stories is too restrictive and inefficient for a user to understand the topic quickly. In this work, we attempt to capture the rich structure of events and their dependencies in a news topic through our event models. We call the process of recognizing events and their dependencies event threading. We believe our perspective of modeling the structure of a topic is more effective in capturing its semantics than a flat list of on-topic stories. We formally define the novel problem, suggest evaluation metrics and present a few techniques for solving the problem. Besides the standard word based features, our approaches take into account novel features such as temporal locality of stories for event recognition and time-ordering for capturing dependencies. Our experiments on a manually labeled data sets show that our models effectively identify the events and capture dependencies among them.
Text Mining with Information Extraction
- AAAI 2002 Spring Symposium on Mining Answers from Texts and Knowledge Bases
, 2002
"... The popularity of the Web and the large number of documents available in electronic form has motivated the search for hidden knowledge in text collections. Consequently, there is growing research interest in the general topic of text mining. In this paper, we develop a text-mining system by integrat ..."
Abstract
-
Cited by 21 (0 self)
- Add to MetaCart
The popularity of the Web and the large number of documents available in electronic form has motivated the search for hidden knowledge in text collections. Consequently, there is growing research interest in the general topic of text mining. In this paper, we develop a text-mining system by integrating methods from Information Extraction (IE) and Data Mining (Knowledge Discovery from Databases or KDD). By utilizing existing IE and KDD techniques, text-mining systems can be developed relatively rapidly and evaluated on existing text corpora for testing IE systems. We present a general text-mining framework called DiscoTEX which employs an IE module for transforming natural-language documents into structured data and a KDD module for discovering prediction rules from the extracted data. When discovering patterns in extracted text, strict matching of strings is inadequate because textual database entries generally exhibit variations due to typographical errors, misspellings, abbreviations, and other
Intelligent Information Triage
- In The 24th Annoual Internationl Conference on Research and Development in Information Retrieval (SIGIR
, 2001
"... In many applications, large volumes of time-sensitive textual information require triage: rapid, approximate prioritization for subsequent action. In this paper, we explore the use of prospective indications of the importance of a time-sensitive document, for the purpose of producing better documen ..."
Abstract
-
Cited by 16 (6 self)
- Add to MetaCart
In many applications, large volumes of time-sensitive textual information require triage: rapid, approximate prioritization for subsequent action. In this paper, we explore the use of prospective indications of the importance of a time-sensitive document, for the purpose of producing better document filtering or ranking. By prospective, we mean importance that could be assessed by actions that occur in the future. For example, a news story may be assessed (retrospectively) as being important, based on events that occurred after the story appeared, such as a stock price plummeting or the issuance of many follow-up stories. If a system could anticipate (prospectively) such occurrences, it could provide a timely indication of importance. Clearly, perfect prescience is impossible. However, sometimes there is su#cient correlation between the content of an information item and the events that occur subsequently. We describe a process for creating and evaluating approximate information-triage procedures that are based on prospective indications. Unlike many informationretrieval applications for which document labeling is a laborious, manual process, for many prospective criteria it is possible to build very large, labeled, training corpora automatically. Such corpora can be used to train text classification procedures that will predict the (prospective) importance of each document. This paper illustrates the process with two case studies, demonstrating the ability to predict whether a news story will be followed by many, very similar news stories, and also whether the stock price of one or more companies associated with a news story will move significantly following the appearance of that story. We conclude by discussing how the comprehensibility of the learned classifiers ca...
Learning Similarity Metrics for Event Identification in Social Media
"... Social media sites (e.g., Flickr, YouTube, and Facebook) are a popular distribution outlet for users looking to share their experiences and interests on the Web. These sites host substantial amounts of user-contributed materials (e.g., photographs, videos, and textual content) for a wide variety of ..."
Abstract
-
Cited by 16 (7 self)
- Add to MetaCart
Social media sites (e.g., Flickr, YouTube, and Facebook) are a popular distribution outlet for users looking to share their experiences and interests on the Web. These sites host substantial amounts of user-contributed materials (e.g., photographs, videos, and textual content) for a wide variety of real-world events of different type and scale. By automatically identifying these events and their associated user-contributed social media documents, which is the focus of this paper, we can enable event browsing and search in state-of-the-art search engines. To address this problem, we exploit the rich “context ” associated with social media content, including user-provided annotations (e.g., title, tags) and automatically generated information (e.g., content creation time). Using this rich context, which includes both textual and non-textual features, we can define appropriate document similarity metrics to enable online clustering of media to events. As a key contribution of this paper, we explore a variety of techniques for learning multi-feature similarity metrics for social media documents in a principled manner. We evaluate our techniques on large-scale, realworld datasets of event images from Flickr. Our evaluation results suggest that our approach identifies events, and their associated social media documents, more effectively than the state-of-the-art strategies on which we build.
Combining Multiple Learning Strategies for Effective Cross Validation
- In Proceedings of ICML-00, 17th International Conference on Machine Learning
, 2000
"... Parameter tuning through cross-validation becomes very difficult when the validation set contains no or only a few examples of the classes in the evaluation set. We address this open challenge by using a combination of classifiers with different performance characteristics to effectively reduc ..."
Abstract
-
Cited by 16 (0 self)
- Add to MetaCart
Parameter tuning through cross-validation becomes very difficult when the validation set contains no or only a few examples of the classes in the evaluation set. We address this open challenge by using a combination of classifiers with different performance characteristics to effectively reduce the performance variance on average of the overall system across all classes, including those not seen before. This approach allows us to tune the combination system on available but less-representative validation data and obtain smaller performance degradation of this system on the evaluation data than using a single-method classifier alone. We tested this approach by applying k-Nearest Neighbor, Rocchio and Language Modeling classifiers and their combination to the event tracking problem in the Topic Detection and Tracking (TDT) domain, where new classes (events) are created constantly over time, and representative validation sets for new classes are often difficult to ob...

