Abstract:
This paper examines the use of inductive learning to categorize natural language documents into predefined content categories. Categorization of text is of increasing importance in information retrieval and natural language processing systems. Previous research on automated text categorization has mixed machine learning and knowledge engineering methods, making it difficult to draw conclusions about the performance of particular methods. In this paper we present empirical results on the performance of a Bayesian classifier and a decision tree learning algorithm on two text categorization data sets. We find that both algorithms achieve reasonable performance and allow controlled tradeoffs between false positives and false negatives. The stepwise feature selection in the decision tree algorithm is particularly effective in dealing with the large feature sets common in text categorization. However, even this algorithm is aided by an initial prefiltering of features, confirming the results...
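As an illustration of the controlled tradeoff between false positives and false negatives mentioned above, here is a minimal sketch of a binary Bayesian text classifier whose decision threshold can be varied. It is not the paper's implementation: the Bernoulli (word present/absent) model, the add-one smoothing, the toy documents, and every name (`train`, `log_odds`, `errors`) are assumptions chosen for illustration.

```python
import math

# Toy corpus (hypothetical): each document is a set of words, with a binary
# label saying whether it belongs to the category of interest.
docs = [
    {"wheat", "export"}, {"wheat", "price"}, {"grain", "wheat"},
    {"bank", "rate"}, {"bank", "price"}, {"export", "rate"},
]
labels = [True, True, True, False, False, False]
vocab = sorted(set().union(*docs))


def train(docs, labels, vocab):
    """Estimate smoothed class priors and P(word present | class)."""
    model = {}
    for c in (True, False):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        n = len(class_docs)
        model[c] = {
            "prior": (n + 1) / (len(docs) + 2),
            # Add-one smoothing keeps every probability strictly in (0, 1).
            "p": {w: (sum(w in d for d in class_docs) + 1) / (n + 2)
                  for w in vocab},
        }
    return model


def log_odds(model, doc, vocab):
    """Log of P(category | doc) / P(not category | doc) under the model."""
    score = math.log(model[True]["prior"]) - math.log(model[False]["prior"])
    for w in vocab:
        pt, pf = model[True]["p"][w], model[False]["p"][w]
        if w in doc:
            score += math.log(pt) - math.log(pf)
        else:
            score += math.log(1 - pt) - math.log(1 - pf)
    return score


def errors(model, docs, labels, vocab, threshold):
    """Count (false positives, false negatives) at a given threshold."""
    fp = fn = 0
    for d, y in zip(docs, labels):
        pred = log_odds(model, d, vocab) >= threshold
        if pred and not y:
            fp += 1
        elif y and not pred:
            fn += 1
    return fp, fn


model = train(docs, labels, vocab)
for t in (-2.0, 0.0, 2.0):
    print(t, errors(model, docs, labels, vocab, t))
```

Raising the threshold can only reduce the number of documents assigned to the category, so false positives never increase and false negatives never decrease. That monotonic relationship is what makes the tradeoff controllable.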