Categorization and Keyword Identification of Unlabeled Documents
- In Proceedings of the Fifth IEEE International Conference on Data Mining, 2005
Abstract - Cited by 4 (2 self)
In this paper we first propose a global unsupervised feature selection approach for text, based on frequent itemset mining. As a result, each document is represented as a set of words that co-occur frequently in the given corpus of documents. We then introduce a locally adaptive clustering algorithm, designed to estimate (local) word relevance and, simultaneously, to group the documents. We present experimental results to demonstrate the feasibility of our approach. Furthermore, the analysis of the weights credited to terms provides evidence that the identified keywords can guide the process of label assignment to clusters. We take into consideration both spam email filtering and general classification datasets. Our analysis of the distribution of weights in the two cases provides insights into how the spam problem differs from the general classification case.
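The feature selection step described in this abstract can be illustrated with a minimal sketch. It assumes the simplest possible case, itemsets of size two (word pairs), and a hypothetical `min_support` threshold counted in documents; the abstract does not specify the mining algorithm, the itemset sizes, or the threshold, so the function names and toy data below are illustrative only.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(docs, min_support):
    """Return word pairs that co-occur in at least min_support documents."""
    counts = Counter()
    for words in docs:
        # Count each unordered pair at most once per document.
        for pair in combinations(sorted(set(words)), 2):
            counts[pair] += 1
    return {pair for pair, n in counts.items() if n >= min_support}


def reduce_documents(docs, min_support):
    """Represent each document by the words of the frequent pairs it contains."""
    pairs = frequent_pairs(docs, min_support)
    reduced = []
    for words in docs:
        present = set(words)
        kept = set()
        for a, b in pairs:
            if a in present and b in present:
                kept.update((a, b))
        reduced.append(kept)
    return reduced


# Toy corpus (hypothetical data, not from the paper).
docs = [
    ["cheap", "pills", "offer", "now"],
    ["cheap", "pills", "meeting"],
    ["meeting", "agenda", "offer"],
]
reduced = reduce_documents(docs, min_support=2)
```

With this toy corpus only the pair ("cheap", "pills") reaches the support threshold, so the third document is left with an empty representation. A real implementation would likely mine larger itemsets with an Apriori-style or FP-growth procedure.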
Local Semantic Kernels for Text Document Clustering
- In Workshop on Text Mining, SIAM International Conference on Data Mining, 2007
Abstract - Cited by 1 (1 self)
Document clustering is a fundamental task of text mining, by which efficient organization, navigation, summarization and retrieval of documents can be achieved. The clustering of documents presents difficult challenges due to the sparsity and the high dimensionality of text data, and to the complex semantics of natural language. Subspace clustering is an extension of traditional clustering that is designed to capture local feature relevance, and to group documents with respect to the features (or words) that matter the most. This paper presents a subspace clustering technique based on a Locally Adaptive Clustering (LAC) algorithm. To improve the subspace clustering of documents and the identification of keywords achieved by LAC, kernel methods and semantic distances are deployed. The basic idea is to define a local kernel for each cluster by which semantic distances between pairs of words are computed to derive the clustering and the local term weightings. The proposed approach, called Semantic LAC, is evaluated using benchmark datasets. Our experiments show that Semantic LAC is capable of improving the clustering quality.
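The notion of local term weighting mentioned in this abstract can be illustrated with a small sketch. It assumes one plausible formulation, in which a cluster gives more weight to dimensions along which its points lie close to the centroid, using an exponential weighting with a bandwidth parameter `h`. The exponential form, the parameter `h`, and the function name are assumptions made for illustration; the abstract does not state the formula, and the semantic kernel itself is not modeled here.

```python
import math


def local_weights(points, centroid, h=1.0):
    """Per-dimension weights for one cluster; tighter dimensions weigh more."""
    n = len(points)
    dims = len(centroid)
    # Average squared distance from the centroid along each dimension.
    spread = [
        sum((p[i] - centroid[i]) ** 2 for p in points) / n
        for i in range(dims)
    ]
    # Exponential weighting (assumed form), normalized to sum to 1.
    scores = [math.exp(-s / h) for s in spread]
    total = sum(scores)
    return [s / total for s in scores]


# Toy cluster (hypothetical data): tight along dimension 0, spread along 1.
cluster = [[1.0, 0.0], [1.0, 4.0], [1.0, 2.0]]
centroid = [1.0, 2.0]
weights = local_weights(cluster, centroid)
```

In this toy cluster the first dimension has zero spread and therefore receives most of the weight, which mirrors the idea of identifying the words that matter most for a given cluster.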

