Results 1–10 of 45
A Unified Framework for Model-based Clustering
Journal of Machine Learning Research, 2003
Abstract

Cited by 69 (7 self)
Model-based clustering techniques have been widely used and have shown promising results in many applications involving complex data. This paper presents a unified framework for probabilistic model-based clustering based on a bipartite graph view of data and models that highlights the commonalities and differences among existing model-based clustering algorithms. In this view, clusters are represented as probabilistic models in a model space that is conceptually separate from the data space. For partitional clustering, the view is conceptually similar to the Expectation-Maximization (EM) algorithm. For hierarchical clustering, the graph-based view helps to visualize important distinctions between similarity-based approaches and model-based approaches.
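The EM view of partitional model-based clustering mentioned above can be illustrated with a minimal multinomial-mixture sketch. This is not the paper's algorithm; the function name, the add-one smoothing, and the fixed iteration count are illustrative assumptions:

```python
import numpy as np

def em_multinomial_mixture(X, k, n_iter=50, rng=None):
    """Minimal EM for a k-component multinomial mixture: clusters live
    as probabilistic models (theta) in a model space separate from the
    document-count data space X."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    pi = np.full(k, 1.0 / k)                   # mixing weights
    theta = rng.dirichlet(np.ones(d), size=k)  # one word-distribution per cluster
    for _ in range(n_iter):
        # E-step: soft responsibility of each model for each document
        log_r = np.log(pi) + X @ np.log(theta).T
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-fit each model from its softly assigned documents
        pi = r.mean(axis=0)
        theta = (r.T @ X) + 1.0                # add-one smoothing (assumption)
        theta /= theta.sum(axis=1, keepdims=True)
    return r.argmax(axis=1)                    # hard partition at convergence
```

The bipartite-graph reading is visible in the two alternating steps: the E-step connects data points to models, the M-step re-estimates models from their connected data.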
Clustering documents with an exponential-family approximation of the Dirichlet compound multinomial distribution
In ICML, 2006
Abstract

Cited by 51 (2 self)
The Dirichlet compound multinomial (DCM) distribution, also called the multivariate Pólya distribution, is a model for text documents that takes into account burstiness: the fact that if a word occurs once in a document, it is likely to occur repeatedly. We derive a new family of distributions that are approximations to DCM distributions and constitute an exponential family, unlike DCM distributions. We use these so-called EDCM distributions to obtain insights into the properties of DCM distributions, and then derive an algorithm for EDCM maximum-likelihood training that is many times faster than the corresponding method for DCM distributions. Next, we investigate expectation-maximization with EDCM components and deterministic annealing as a new clustering algorithm for documents. Experiments show that the new algorithm is competitive with the best methods in the literature, and superior from the point of view of finding models with low perplexity.
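The burstiness the abstract describes can be seen by sampling a DCM document through its Pólya-urn scheme: each observed word raises its own future probability. This is a sketch for intuition only; the function name and parameter values are illustrative, not from the paper:

```python
import numpy as np

def sample_dcm_document(alpha, doc_len, rng=None):
    """Draw one bag-of-words count vector from a DCM via the Polya urn.
    Small alpha values yield bursty counts: once a word is drawn, it
    tends to be drawn again, unlike an i.i.d. multinomial."""
    rng = np.random.default_rng(rng)
    alpha = np.asarray(alpha, dtype=float)
    counts = np.zeros_like(alpha)
    for i in range(doc_len):
        # Urn update: prior pseudo-counts plus words seen so far
        p = (alpha + counts) / (alpha.sum() + i)
        counts[rng.choice(len(alpha), p=p)] += 1
    return counts
```

With a small, symmetric `alpha` the sampled counts concentrate on a few words per document, which is the behavior the EDCM family approximates in exponential-family form.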
Document clustering via adaptive subspace iteration
In SIGIR, 2004
Abstract

Cited by 36 (7 self)
Document clustering has long been an important problem in information retrieval. In this paper, we present a new clustering algorithm, ASI (Adaptive Subspace Iteration), which explicitly models the subspace structure associated with each cluster. ASI simultaneously performs data reduction and subspace identification via an iterative alternating optimization procedure. Motivated by the optimization procedure, we then provide a novel method to determine the number of clusters. We also discuss the connections of ASI with various existing clustering approaches. Finally, extensive experimental results on real data sets show the effectiveness of the ASI algorithm.
Topic models over text streams: a study of batch and online unsupervised learning
 In Proc. 7th SIAM Int’l. Conf. on Data Mining
Abstract

Cited by 33 (1 self)
Topic modeling techniques have widespread use in text data mining applications. Some applications use batch models, which perform clustering on the document collection in aggregate. In this paper, we analyze and compare the performance of three recently proposed batch topic models: Latent Dirichlet Allocation (LDA), Dirichlet Compound Multinomial (DCM) mixtures and von Mises-Fisher (vMF) mixture models. In cases where offline clustering on complete document collections is infeasible due to resource and response-rate constraints, online unsupervised clustering methods that process incoming data incrementally are necessary. To this end, we propose online variants of vMF, EDCM and LDA. Experiments on large real-world document collections, in both the offline and online settings, demonstrate that though LDA is a good model for finding word-level topics, vMF finds better document-level topic clusters more efficiently, which is often important in text mining applications. Finally, we propose a practical heuristic for hybrid topic modeling, which learns online topic models on streaming text and intermittently runs batch topic models on aggregated documents offline. Such a hybrid model is useful for several applications (e.g., dynamic topic-based aggregation of user-generated content in social networks) that need a good trade-off between the performance of batch offline algorithms and the efficiency of incremental online algorithms.
Resource Bundles: Using Aggregation for Statistical Wide-Area Resource Discovery and Allocation
In 28th Int. Conference on Distributed Computing Systems, 760-768
A Comprehensive Comparison Study of Document Clustering for a Biomedical Digital Library MEDLINE
Accepted in ACM/IEEE Joint Conference on Digital Libraries, Chapel Hill, NC, 2006
Cited by 15 (3 self)
Integration of Semantic-based Bipartite Graph Representation and Mutual Refinement Strategy for Biomedical Literature Clustering
In the 12th SIGKDD International Conference on Knowledge Discovery and Data Mining, 2006
Abstract

Cited by 8 (2 self)
We introduce a novel document clustering approach that overcomes those problems by combining a semantic-based bipartite graph representation and a mutual refinement strategy. The primary contributions of this paper are the following. First, we introduce a new representation of documents using a bipartite graph between documents and co-occurring concepts in the documents. Second, we show how to enhance clustering quality by applying the mutual refinement strategy to the initial clustering results. Third, through experiments on MEDLINE documents, we show that our integrated method significantly enhances cluster quality and clustering reliability compared to existing clustering methods. Our approach improves cluster quality by 29.5% and clustering reliability by 26.3% on average, in terms of misclassification index, over Bisecting K-means with the best parameters.
Generative Oversampling for Mining Imbalanced Datasets
Abstract

Cited by 8 (0 self)
One way to handle data mining problems where class prior probabilities and/or misclassification costs between classes are highly unequal is to resample the data until a new, desired class distribution in the training data is achieved. Many resampling techniques have been proposed in the past, and the relationship between resampling and cost-sensitive learning has been well studied. Surprisingly, however, few resampling techniques attempt to create new, artificial data points which generalize the known, labeled data. In this paper, we introduce an easily implementable resampling technique (generative oversampling) which creates new data points by learning from available training data. Empirically, we demonstrate that generative oversampling outperforms other well-known resampling methods on several datasets in the example domain of text classification.
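A minimal sketch of the generative-oversampling idea for text data, assuming a multinomial (naive-Bayes-style) generator fit to the minority class. The function name, the add-one smoothing, and the use of the average document length are illustrative assumptions, not details from the paper:

```python
import numpy as np

def generative_oversample(X_minority, n_new, rng=None):
    """Fit a multinomial model to minority-class word counts, then
    sample n_new synthetic documents from it, so the new points
    generalize the labeled data rather than duplicating it."""
    rng = np.random.default_rng(rng)
    # MLE of the word distribution with add-one smoothing
    theta = X_minority.sum(axis=0) + 1.0
    theta = theta / theta.sum()
    # Use the average observed document length for synthetic documents
    doc_len = int(X_minority.sum(axis=1).mean())
    # Each synthetic point is an independent multinomial draw
    return rng.multinomial(doc_len, theta, size=n_new)
```

Contrast with duplication-based oversampling: every sampled row is a fresh draw from the learned model, so repeated calls enlarge the minority class without exact copies.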
Clustering genes using gene expression and text literature data
In CSB ’05: Proceedings of the 4th IEEE Computational Systems Bioinformatics Conference, 2005
Abstract

Cited by 4 (1 self)
Clustering of gene expression data is a standard technique used to identify closely related genes. In this paper, we develop a new clustering algorithm, MSC (Multi-Source Clustering), to perform exploratory analysis using two or more diverse sources of data. In particular, we investigate the problem of improving the clustering by integrating information obtained from gene expression data with knowledge extracted from biomedical text literature. In each iteration of algorithm MSC, an EM-type procedure is employed to bootstrap the model obtained from one data source by starting with the cluster assignments obtained in the previous iteration using the other data sources. Upon convergence, the two individual models are used to construct the final cluster assignment. We compare the results of algorithm MSC for two data sources with the results obtained when the clustering is applied on the two sources of data separately. We also compare it with that obtained using the feature-level integration method that performs the clustering after simply concatenating the features obtained from the two data sources. We show that the z-scores of the clustering results from MSC are better than those from the other methods. To evaluate our clusters better, function enrichment results are presented using terms from the Gene Ontology database. Finally, by investigating the success of motif detection programs that use the clusters, we show that our approach integrating gene expression data and text data reveals clusters that are biologically more meaningful than those identified using gene expression data alone.
Utilizing Phrase-Similarity Measures for Detecting and Clustering Informative RSS News Articles
Abstract

Cited by 3 (1 self)
As the number of RSS news feeds continues to increase over the Internet, it becomes necessary to minimize the workload of the user, who is otherwise required to scan through a huge number of news articles to find related articles of interest, which is a tedious and often an impossible task. In order to solve this problem, we present a novel approach, called InFRSS, which consists of a correlation-based phrase matching (CPM) model and a fuzzy compatibility clustering (FCC) model. CPM can detect RSS news articles containing phrases that are the same as well as semantically alike, and dictate the degrees of similarity of any two articles. FCC identifies and clusters non-redundant, closely related RSS news articles based on their degrees of similarity and a fuzzy compatibility relation. Experimental results show that (i) our CPM model on matching bigrams and trigrams in RSS news articles outperforms other phrase/keyword-matching approaches and (ii) our FCC model generates high-quality clusters and outperforms other well-known clustering techniques.