Information Extraction as a Basis for High-Precision Text Classification
- ACM Transactions on Information Systems, 1994
Cited by 102 (5 self)
... this article. For the purpose of text classification, the answer keys serve only as a set of correct classifications for each text. If a text has instantiated key templates associated with it in the corpus, then it should be classified as a relevant text. If a text has no instantiated key templates associated with it (i.e., only a dummy template), then it should be classified as an irrelevant text. This is a binary classification problem: a text is either relevant to the terrorism domain or irrelevant. The texts were selected by keyword search from a database of newswire articles because they contained words associated with terrorism. However, many of them did not mention any relevant terrorist incidents. Of the 1700 texts in the MUC-4 corpus, only 53% described a relevant terrorist event. Because many of the texts in the corpus were irrelevant, the MUC-4 systems had to distinguish the relevant from the irrelevant texts. Although the MUC-4 task was information extraction, information detection (i.e., text classification) was an implicit subtask. To be successful in MUC-4, the information extraction systems also had to be good at detection. Our MUC-4 system did not use a separate text classification module. Instead, we extracted information from every text and relied on a discourse analysis module to discard irrelevant templates. This strategy was very effective, but it was expensive. A reliable text classification module could have filtered out irrelevant ...
Footnote 1: MUC-3 was the Third Message Understanding Conference, held in 1991 [MUC-3 Proceedings 1991].
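The labeling rule in this excerpt (a text is relevant if it has at least one instantiated key template, irrelevant if it has only a dummy template) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `label_from_answer_keys`, the dictionary layout, and the choice to model a dummy template as an empty dict are all hypothetical and are not taken from the article or from the MUC-4 data format.

```python
from typing import Dict, List


def label_from_answer_keys(key_templates: Dict[str, List[dict]]) -> Dict[str, bool]:
    """Map each text id to True (relevant) or False (irrelevant).

    Assumption: a dummy template is represented as an empty dict, and an
    instantiated template is any non-empty dict of slot fills.
    """
    labels: Dict[str, bool] = {}
    for text_id, templates in key_templates.items():
        # Relevant if at least one template carries any filled slot.
        labels[text_id] = any(bool(t) for t in templates)
    return labels


def relevant_fraction(labels: Dict[str, bool]) -> float:
    """Share of texts labeled relevant; 0.0 for an empty corpus."""
    if not labels:
        return 0.0
    return sum(labels.values()) / len(labels)
```

Applied to a real corpus, `relevant_fraction` would give the kind of figure quoted in the excerpt for the share of relevant texts, but only if the answer keys were first converted into the hypothetical layout assumed here.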
Hierarchical Document Clustering Using Frequent Itemsets
- In Proc. SIAM International Conference on Data Mining 2003 (SDM 2003), 2003
Cited by 55 (1 self)
A major challenge in document clustering is the extremely high dimensionality. For example, the vocabulary for a document set can easily be thousands of words. On the other hand, each document often contains only a small fraction of the words in the vocabulary. These features require special handling. Another requirement is hierarchical clustering, where clustered documents can be browsed according to the increasing specificity of topics. In this paper, we propose to use the notion of frequent itemsets, which comes from association rule mining, for document clustering. The intuition of our clustering criterion is that each cluster is identified by some common words, called frequent itemsets, for the documents in the cluster. Frequent itemsets are also used to produce a hierarchical topic tree for clusters. By focusing on frequent items, the dimensionality of the document set is drastically reduced. We show that this method outperforms the best existing methods in terms of both clustering accuracy and scalability.
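The idea of identifying candidate clusters by frequent itemsets can be illustrated with a small Python sketch. It is not the paper's algorithm: it only finds word sets that occur in at least a given fraction of documents, using a simple level-wise (Apriori-style) search, and records which documents contain each set. The function name `frequent_itemsets` and the parameters `min_support` and `max_size` are assumptions made for this example.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set


def frequent_itemsets(
    docs: List[List[str]], min_support: float, max_size: int = 2
) -> Dict[FrozenSet[str], Set[int]]:
    """Return each frequent word set with the indices of documents covering it.

    A word set is frequent if the fraction of documents containing all of
    its words is at least min_support. Sets larger than max_size are ignored.
    """
    doc_sets = [set(d) for d in docs]
    n = len(doc_sets)
    result: Dict[FrozenSet[str], Set[int]] = {}

    # Level 1 candidates: every distinct word in the collection.
    vocab = sorted(set().union(*doc_sets))
    candidates = [frozenset([w]) for w in vocab]

    for size in range(1, max_size + 1):
        frequent_here = []
        for itemset in candidates:
            cover = {i for i, d in enumerate(doc_sets) if itemset <= d}
            if len(cover) / n >= min_support:
                result[itemset] = cover
                frequent_here.append(itemset)
        # Next level: unions of two frequent sets that add exactly one word.
        candidates = list(
            {a | b for a, b in combinations(frequent_here, 2) if len(a | b) == size + 1}
        )
    return result
```

Each returned word set can be read as a label for a candidate cluster, and the associated document indices as its members. Because only frequent words survive the first level, the search works over a much smaller vocabulary than the full document set, which is the dimensionality reduction the abstract refers to. Resolving overlaps between candidate clusters and building the topic tree are outside the scope of this sketch.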

