Bursty and Hierarchical Structure in Streams
, 2002
Cited by 196 (2 self)
A fundamental problem in text data mining is to extract meaningful structure from document streams that arrive continuously over time. E-mail and news articles are two natural examples of such streams, each characterized by topics that appear, grow in intensity for a period of time, and then fade away. The published literature in a particular research field can be seen to exhibit similar phenomena over a much longer time scale. Underlying much of the text mining work in this area is the following intuitive premise --- that the appearance of a topic in a document stream is signaled by a "burst of activity," with certain features rising sharply in frequency as the topic emerges.
SwiftFile: An Intelligent Assistant for Organizing E-Mail
- In Proceedings of the Third International Conference on Autonomous Agents
, 2000
Cited by 92 (2 self)
While most e-mail clients allow users to file messages into folders, the process they must go through to file each message is often tedious and slow. For each message, the user must first decide which folder is most appropriate. Then, the user must inform the e-mail reader of that choice by selecting the appropriate icon or menu item from among what is typically a set of several dozen choices. The combined effort of choosing a folder and conveying that choice to the application often discourages users from filing their mail, resulting in unmanageable inboxes that contain hundreds or even thousands of unfiled messages. SwiftFile encourages users to file their mail by simplifying the task. Using an adaptive classifier, it predicts the three folders that are most likely to be appropriate for a given message and provides shortcut buttons that permit the user to effortlessly file it into a predicted folder. For typical users, SwiftFile's predictions are accurate over 80% to...
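The top-three prediction idea can be sketched as follows. This is a minimal illustration, not SwiftFile's actual classifier: the abstract only says "adaptive classifier", so the bag-of-words profiles, the cosine similarity, and the names `FolderSuggester`, `file_message`, and `suggest` are all assumptions made for this example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a message and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class FolderSuggester:
    """Hypothetical top-k folder predictor (an assumption, not SwiftFile's method)."""

    def __init__(self):
        # One running bag-of-words profile per folder.
        self.profiles = defaultdict(Counter)

    def file_message(self, text, folder):
        """Adapt to the user: add the message's words to the chosen folder's profile."""
        self.profiles[folder].update(tokenize(text))

    @staticmethod
    def _cosine(a, b):
        """Cosine similarity between two word-count vectors."""
        dot = sum(count * b[word] for word, count in a.items())
        norm_a = math.sqrt(sum(c * c for c in a.values()))
        norm_b = math.sqrt(sum(c * c for c in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    def suggest(self, text, k=3):
        """Return up to k folder names, most similar first."""
        message = Counter(tokenize(text))
        ranked = sorted(
            self.profiles,
            key=lambda folder: self._cosine(message, self.profiles[folder]),
            reverse=True,
        )
        return ranked[:k]


suggester = FolderSuggester()
suggester.file_message("quarterly budget report finance", "Finance")
suggester.file_message("team lunch friday pizza", "Social")
suggester.file_message("compiler bug stack trace", "Bugs")
suggester.file_message("vacation photos beach", "Personal")
top_folders = suggester.suggest("budget report for the quarter")
```

Because every filing action updates a profile immediately, the suggestions follow the user's habits without a separate training phase; the three returned names would back the shortcut buttons.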
An Experimental Comparison of Naive Bayesian and Keyword-Based Anti-Spam Filtering with Personal E-mail Messages
- In Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval
, 2000
Cited by 74 (2 self)
The growing problem of unsolicited bulk e-mail, also known as "spam", has generated a need for reliable anti-spam e-mail filters. Filters of this type have so far been based mostly on manually constructed keyword patterns. An alternative approach has recently been proposed, whereby a Naive Bayesian classifier is trained automatically to detect spam messages. We test this approach on a large collection of personal e-mail messages, which we make publicly available in "encrypted" form, contributing towards standard benchmarks. We introduce appropriate cost-sensitive measures, investigating at the same time the effect of attribute-set size, training-corpus size, lemmatization, and stop lists, issues that have not been explored in previous experiments. Finally, the Naive Bayesian filter is compared, in terms of performance, to a filter that uses keyword patterns, and which is part of a widely used e-mail reader. Keywords: filtering/routing; text categorization; machine learning and IR; evaluation (general); test collections.
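A Naive Bayesian filter with a cost-sensitive decision rule might look like the sketch below. The abstract does not give the exact rule, so the log-odds threshold, the add-one smoothing, and the names `NaiveBayesSpamFilter` and `cost_ratio` are assumptions for illustration only.

```python
import math
from collections import Counter


class NaiveBayesSpamFilter:
    """Toy multinomial Naive Bayes filter; train on both classes before classifying."""

    def __init__(self, cost_ratio=1.0):
        # cost_ratio (hypothetical parameter): how many times worse it is to
        # block a legitimate message than to let a spam message through.
        self.cost_ratio = cost_ratio
        self.word_counts = {"spam": Counter(), "legit": Counter()}
        self.doc_counts = {"spam": 0, "legit": 0}

    def train(self, words, label):
        """Add one tokenized message with label 'spam' or 'legit'."""
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1

    def _log_score(self, words, label):
        """Log prior plus log likelihood, with add-one smoothing."""
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["legit"])
        total = sum(self.word_counts[label].values())
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for word in words:
            if word in vocab:  # words never seen in training are ignored
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total + len(vocab)))
        return score

    def is_spam(self, words):
        """Flag as spam only if the odds of spam exceed the cost ratio."""
        log_odds = self._log_score(words, "spam") - self._log_score(words, "legit")
        return log_odds > math.log(self.cost_ratio)


TRAINING = [
    (["free", "money", "now"], "spam"),
    (["win", "money", "free"], "spam"),
    (["meeting", "agenda", "monday"], "legit"),
    (["project", "report", "monday"], "legit"),
]

balanced = NaiveBayesSpamFilter(cost_ratio=1.0)
cautious = NaiveBayesSpamFilter(cost_ratio=999.0)
for words, label in TRAINING:
    balanced.train(words, label)
    cautious.train(words, label)
```

Raising `cost_ratio` makes the filter more reluctant to block mail: on this toy data the balanced filter flags `["free", "money"]` as spam, while the cautious one lets it through.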
Learning to Filter Spam E-Mail: A Comparison of a Naive Bayesian and a Memory-Based Approach
, 2000
Cited by 58 (3 self)
We investigate the performance of two machine learning algorithms in the context of anti-spam filtering. The increasing volume of unsolicited bulk e-mail (spam) has generated a need for reliable anti-spam filters. Filters of this type have so far been based mostly on keyword patterns that are constructed by hand and perform poorly. The Naive Bayesian classifier has recently been suggested as an effective method to automatically construct anti-spam filters with superior performance. We thoroughly investigate the performance of the Naive Bayesian filter on a publicly available corpus, contributing towards standard benchmarks. At the same time, we compare the performance of the Naive Bayesian filter to an alternative memory-based learning approach, after introducing suitable cost-sensitive evaluation measures. Both methods achieve very accurate spam filtering, clearly outperforming the keyword-based filter of a widely used e-mail reader.
A Memory-Based Approach to Anti-Spam Filtering for Mailing Lists
- Information Retrieval
, 2003
Cited by 34 (2 self)
This paper presents an extensive empirical evaluation of memory-based learning in the context of anti-spam filtering, a novel cost-sensitive application of text categorization that attempts to identify automatically unsolicited commercial messages that flood mailboxes. Focusing on anti-spam filtering for mailing lists, a thorough investigation of the effectiveness of a memory-based anti-spam filter is performed using a publicly available corpus. The investigation includes different attribute and distance-weighting schemes, and studies on the effect of the neighborhood size, the size of the attribute set, and the size of the training corpus. Three different cost scenarios are identified, and suitable cost-sensitive evaluation functions are employed. We conclude that memory-based anti-spam filtering for mailing lists is practically feasible, especially when combined with additional safety nets. Compared to a previously tested Naive Bayes filter, the memory-based filter performs on average better, particularly when the misclassification cost for non-spam messages is high.
Incremental Learning in SwiftFile
- In Proceedings of the Seventeenth International Conference on Machine Learning
, 2000
Cited by 32 (0 self)
SwiftFile is an intelligent assistant that helps users organize their e-mail into folders. SwiftFile uses a text classifier to predict where each new message is likely to be filed by the user and provides shortcut buttons to quickly file messages into one of its predicted folders. One of the challenges faced by SwiftFile is that the user's mail-filing habits are constantly changing -- users are frequently creating, deleting and rearranging folders to meet their current filing needs. In this paper, we discuss the importance of incremental learning in SwiftFile. We present several criteria for judging how well incremental learning algorithms adapt to quickly changing data and evaluate SwiftFile's classifier using these criteria. We find that SwiftFile's classifier is surprisingly responsive and does not require the extensive training that is often assumed in most learning systems.
Athena: Mining-based interactive management of text databases
- International Conference on Extending Database Technology
, 2000
Cited by 27 (2 self)
We describe Athena: a system for creating, exploiting, and maintaining a hierarchy of textual documents through interactive mining-based operations. Requirements of any such system include speed and minimal end-user effort. Athena satisfies these requirements through linear-time classification and clustering engines which are applied interactively to speed the development of accurate models. Naive Bayes classifiers are recognized to be among the best for classifying text. We show that our specialization of the Naive Bayes classifier is considerably more accurate (7 to 29% absolute increase in accuracy) than a standard implementation. Our enhancements include using Lidstone's law of succession instead of Laplace's law, under-weighting long documents, and over-weighting author and subject. We also present a new interactive clustering algorithm, C-Evolve, for topic discovery. C-Evolve first finds highly accurate cluster digests (partial clusters), gets user feedback to merge and correct these digests, and then uses the classification algorithm to complete the partitioning of the data. By allowing this interactivity in the clustering process, C-Evolve achieves considerably higher clustering accuracy (10 to 20% absolute increase in our experiments) than the popular K-Means and agglomerative clustering methods.
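The difference between Lidstone's and Laplace's laws of succession can be shown in a few lines. This is a sketch of the general smoothing formula, not Athena's implementation: the function name and the example value of `lam` are assumptions, and the abstract does not state which value the authors used.

```python
from collections import Counter


def lidstone_probabilities(tokens, vocabulary, lam=0.1):
    """Estimate P(word | class) as (count + lam) / (N + lam * |V|).

    lam = 1.0 reproduces Laplace's law (add-one smoothing); 0 < lam < 1
    reserves less probability mass for unseen words. The default of 0.1
    is illustrative only.
    """
    if lam <= 0:
        raise ValueError("lam must be positive")
    # Only count tokens that belong to the vocabulary so the result sums to 1.
    counts = Counter(token for token in tokens if token in vocabulary)
    denominator = sum(counts.values()) + lam * len(vocabulary)
    return {word: (counts[word] + lam) / denominator for word in vocabulary}


vocabulary = {"a", "b", "c", "d"}
class_tokens = ["a", "a", "b"]
laplace = lidstone_probabilities(class_tokens, vocabulary, lam=1.0)
lidstone = lidstone_probabilities(class_tokens, vocabulary, lam=0.5)
```

With three observed tokens and a four-word vocabulary, Laplace's law assigns the unseen word "c" a probability of 1/7, while Lidstone's law with `lam=0.5` assigns it 0.1, leaving more mass for the words that were actually observed.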
Automatic Categorization of Email into Folders: Benchmark Experiments on Enron and SRI Corpora
- In Technical Report, Computer Science department, IR-418
Cited by 26 (2 self)
Office workers everywhere are drowning in email, not only spam but also large quantities of legitimate email to be read and organized for browsing. Although there have been extensive investigations of automatic document categorization, email gives rise to a number of unique challenges, and there has been relatively little study of classifying email into folders. This paper presents an extensive benchmark study of email foldering using two large corpora of real-world email messages and foldering schemes: one from former Enron employees, another from participants in an SRI research project. We discuss the challenges that arise from differences between email foldering and traditional document classification. We show experimental results from an array of automated classification methods and evaluation methodologies, including a new evaluation method of foldering results based on the email timeline, and including enhancements to the exponential gradient method Winnow, providing top-tier accuracy with a fraction of the training time of alternative methods. We also establish that classification accuracy in many cases is relatively low, confirming the challenges of email data, and pointing toward email foldering as an important area for further research.
Hyperlink Ensembles: A Case Study in Hypertext Classification
- Information Fusion
, 2001
Cited by 20 (4 self)
In this paper, we introduce hyperlink ensembles, a novel type of ensemble classifier for classifying hypertext documents. Instead of using the text on a page for deriving features that can be used for training a classifier, we suggest using portions of text from all pages that point to the target page. A hyperlink ensemble is formed by obtaining one prediction for each hyperlink that points to a page.
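The ensemble idea, one prediction per incoming hyperlink, can be sketched as below. The abstract does not say how the individual predictions are combined, so majority voting is an assumption here, and `classify_page` and `toy_classifier` are hypothetical names introduced for the example.

```python
from collections import Counter


def toy_classifier(link_text):
    """Stand-in base classifier (hypothetical): labels text by a single keyword."""
    return "course" if "lecture" in link_text.lower() else "other"


def classify_page(incoming_link_texts, classify_text):
    """Predict a page's class from the text around each hyperlink pointing to it.

    One prediction is made per incoming link; the predictions are then
    combined by majority vote (an assumed combination scheme).
    Returns None when the page has no incoming links.
    """
    votes = Counter(classify_text(text) for text in incoming_link_texts)
    if not votes:
        return None
    return votes.most_common(1)[0][0]


predicted = classify_page(
    ["Lecture notes for CS101", "my favourite lecture", "home"],
    toy_classifier,
)
```

The page's own text never enters the decision: the target is classified purely from what other pages say when they link to it.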
Experience with Rule Induction and k-Nearest Neighbour Methods for Interface Agents that Learn
- In ML95 Workshop on Agents that Learn from Other Agents
, 1995
Cited by 20 (8 self)
this paper use the same feature extraction mechanism, which extracts words according to word frequency. The underlying assumption here is that words which act as good classifiers for identifying message topics appear frequently. Whilst this model appears to work for Magi, where the task is primarily that of grouping together related messages, it is unsuitable for UNA, where articles have already been sorted into topics, or newsgroups. Features identified by the current feature extraction module are not ideal for determining the user's interest in an article. The performance of UNA degrades significantly when multiple narrow classifications are used. We are currently studying this phenomenon; however, as the number of classes increases, there is a greater chance of features appearing in more than one class. Algorithms such as CN2 and MBR consider each classification as distinct from the others; as a result, such features will be considered poor classifiers. An important difference between the two algorithms is the time taken to induce and apply user profiles to new articles. The instance-based approach builds a sub-symbolic representation in the form of weights and distance metrics. Unlike rule induction in CN2, these calculations do not involve searching through a large space of possible solutions. The search performed by CN2 is compounded by the large number of features generated by the article body. It was found that tests involving CN2 took significantly (30 to 40 times) longer than tests involving MBR. Considerations such as speed of profile induction and classification are important. In order to induce a user profile based on observations, many examples are needed, and large log files are generated. As agent technology is applied to commercial tools such as web ...

