Results 1 - 10
of
88
NewsWeeder: Learning to Filter Netnews
- in Proceedings of the 12th International Machine Learning Conference (ML95
, 1995
"... A significant problem in many information filtering systems is the dependence on the user for the creation and maintenance of a user profile, which describes the user's interests. NewsWeeder is a netnews-filtering system that addresses this problem by letting the user rate his or her interest level ..."
Abstract
-
Cited by 353 (0 self)
- Add to MetaCart
A significant problem in many information filtering systems is the dependence on the user for the creation and maintenance of a user profile, which describes the user's interests. NewsWeeder is a netnews-filtering system that addresses this problem by letting the user rate his or her interest level for each article being read (1-5), and then learning a user profile based on these ratings. This paper describes how NewsWeeder accomplishes this task, and examines the alternative learning methods used. The results show that a learning algorithm based on the Minimum Description Length (MDL) principle was able to raise the percentage of interesting articles to be shown to users from 14% to 52% on average. Further, this performance significantly outperformed (by 21%) one of the most successful techniques in Information Retrieval (IR), termfrequency /inverse-document-frequency (tf-idf) weighting. 1
Approaches to the Automatic Discovery of Patterns in Biosequences
, 1995
"... This paper is a survey of approaches and algorithms used for the automatic discovery of patterns in biosequences. Patterns with the expressive power in the class of regular languages are considered, and a classification of pattern languages in this class is developed, covering those patterns which a ..."
Abstract
-
Cited by 125 (21 self)
- Add to MetaCart
This paper is a survey of approaches and algorithms used for the automatic discovery of patterns in biosequences. Patterns with the expressive power in the class of regular languages are considered, and a classification of pattern languages in this class is developed, covering those patterns which are the most frequently used in molecular bioinformatics. A formulation is given of the problem of the automatic discovery of such patterns from a set of sequences, and an analysis presented of the ways in which an assessment can be made of the significance and usefulness of the discovered patterns. It is shown that this problem is related to problems studied in the field of machine learning. The largest part of this paper comprises a review of a number of existing methods developed to solve this problem and how these relate to each other, focusing on the algorithms underlying the approaches. A comparison is given of the algorithms, and examples are given of patterns that have been discovered...
Generalizing Case Frames Using a Thesaurus and the MDL Principle
- Computational Linguistics
, 1998
"... this paper, we confine ourselves to the former issue, and refer the interested reader to Li and Abe (1996), which deals with the latter issue ..."
Abstract
-
Cited by 95 (4 self)
- Add to MetaCart
this paper, we confine ourselves to the former issue, and refer the interested reader to Li and Abe (1996), which deals with the latter issue
Minimum Message Length and Kolmogorov Complexity
- Computer Journal
, 1999
"... this paper is to describe some of the relationships among the different streams and to try to clarify some of the important differences in their assumptions and development. Other studies mentioning the relationships appear in [1, Section IV, pp. 1038--1039], [2, sections 5.2, 5.5] and [3, p. 465] ..."
Abstract
-
Cited by 86 (20 self)
- Add to MetaCart
this paper is to describe some of the relationships among the different streams and to try to clarify some of the important differences in their assumptions and development. Other studies mentioning the relationships appear in [1, Section IV, pp. 1038--1039], [2, sections 5.2, 5.5] and [3, p. 465]
Bayesian grammar induction for language modeling
- In Proceedings of ACL
, 1995
"... We describe a corpus-based induction algorithm for probabilistic context-free grammars. The algorithm employs a greedy heuristic search within a Bayesian framework, and a post-pass using the Inside-Outside algorithm. We compare the performance of our algorithm to n-gram models and the Inside-Outside ..."
Abstract
-
Cited by 49 (2 self)
- Add to MetaCart
We describe a corpus-based induction algorithm for probabilistic context-free grammars. The algorithm employs a greedy heuristic search within a Bayesian framework, and a post-pass using the Inside-Outside algorithm. We compare the performance of our algorithm to n-gram models and the Inside-Outside algorithm in three language modeling tasks. In two of the tasks, the training data is generated by a probabilistic context-free grammar and in both tasks our algorithm outperforms the other techniques. The third task involves naturally-occurring data, and in this task our algorithm does not perform as well as n-gram models but vastly outperforms the Inside-Outside algorithm. 1
Structured Audio: Creation, Transmission, and Rendering of Parametric Sound Representations
- PROC. IEEE
, 1998
"... ..."
A Statistical Perspective on Knowledge Discovery in Databases
, 1996
"... The quest to find models usefully characterizing data is a process central to the scientific method, and has been carried out on many fronts. Researchers from an expanding number of fields have designed algorithms to discover rules or equations that capture key relationships between variables in a d ..."
Abstract
-
Cited by 40 (0 self)
- Add to MetaCart
The quest to find models usefully characterizing data is a process central to the scientific method, and has been carried out on many fronts. Researchers from an expanding number of fields have designed algorithms to discover rules or equations that capture key relationships between variables in a database. The task of this chapter is to provide a perspective on statistical techniques applicable to KDD; accordingly, we review below some major advances in statistics in the last few decades. We next highlight some distinctives of what may be called a "statistical viewpoint." Finally we overview some influential classical and modern statistical methods for practical model induction.
Statistical Themes and Lessons for Data Mining
, 1997
"... Data mining is on the interface of Computer Science and Statistics, utilizing advances in both disciplines to make progress in extracting information from large databases. It is an emerging field that has attracted much attention in a very short period of time. This article highlights some statist ..."
Abstract
-
Cited by 30 (3 self)
- Add to MetaCart
Data mining is on the interface of Computer Science and Statistics, utilizing advances in both disciplines to make progress in extracting information from large databases. It is an emerging field that has attracted much attention in a very short period of time. This article highlights some statistical themes and lessons that are directly relevant to data mining and attempts to identify opportunities where close cooperation between the statistical and computational communities might reasonably provide synergy for further progress in data analysis.

