Generative model-based document clustering: a comparative study
- Knowledge and Information Systems
, 2005
Cited by 16 (0 self)
Semi-supervised learning has become an attractive methodology for improving classification models and is often viewed as using unlabeled data to aid supervised learning. However, it can also be viewed as using labeled data to help clustering, namely, semi-supervised clustering. Viewing semi-supervised learning from a clustering angle is useful in practical situations when the set of labels available in labeled data is not complete, i.e., unlabeled data contain new classes that are not present in labeled data. This paper analyzes several multinomial model-based semi-supervised document clustering methods under a principled model-based clustering framework. The framework naturally leads to a deterministic annealing extension of existing semi-supervised clustering approaches. We compare three (slightly) different semi-supervised approaches for clustering documents: Seeded damnl, Constrained damnl, and Feedback-based damnl, where damnl stands for multinomial model-based deterministic annealing algorithm. The first two are extensions of the seeded k-means and constrained k-means algorithms studied by Basu et al. (2002); the last one is motivated by Cohn et al. (2003). Through empirical experiments on text datasets, we show that: (a) deterministic annealing can often significantly improve the performance of semi-supervised clustering; (b) the constrained approach is the best when available labels are complete whereas the feedback-based approach excels when available labels are incomplete.
C.F.: MOSAIC: A proximity graph approach for agglomerative clustering
- In: The 9th Intl. Conf. on Data Warehousing and Knowledge Discovery
, 2007
Cited by 12 (11 self)
Representative-based clustering algorithms are quite popular due to their relatively high speed and because of their sound theoretical foundation. On the other hand, the clusters they can obtain are limited to convex shapes and clustering results are also highly sensitive to initializations. In this paper, a novel agglomerative clustering algorithm called MOSAIC is proposed which greedily merges neighboring clusters maximizing a given fitness function. MOSAIC uses Gabriel graphs to determine which clusters are neighboring and approximates non-convex shapes as the unions of small clusters that have been computed using a representative-based clustering algorithm. The experimental results show that this technique leads to clusters of higher quality compared to running a representative clustering algorithm standalone. Given a suitable fitness function, MOSAIC is able to detect arbitrary shape clusters. In addition, MOSAIC is capable of dealing with high dimensional data. Keywords: Post-processing, hybrid clustering, finding clusters of arbitrary shape, agglomerative clustering, using proximity graphs for clustering.
Clustering Time Series from Mixture Polynomial Models with Discretised Data
- In Proceedings of the second Australasian Data Mining Workshop
, 2003
Cited by 7 (2 self)
Clustering time series is an active research area with applications in many fields. One common feature of time series is the likely presence of outliers. These uncharacteristic data can significantly affect the quality of clusters formed. This paper evaluates a method of overcoming the detrimental effects of outliers. We describe some of the alternative approaches to clustering time series, then specify a particular class of model for experimentation with k-means clustering and a correlation based distance metric. For data derived from this class of model we demonstrate that discretising the data into a binary series of above and below the median improves the clustering when the data has outliers.
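The median discretisation mentioned in this abstract can be illustrated with a minimal sketch. The function name `binarise_by_median` and the sample series are hypothetical, and treating values equal to the median as "below" is an assumption; the paper's own procedure may differ.

```python
from statistics import median

def binarise_by_median(series):
    """Map each value to 1 if it lies strictly above the series median, else 0."""
    m = median(series)
    return [1 if x > m else 0 for x in series]

# Hypothetical data: the second series replaces the last value with an outlier.
clean = [1.0, 2.0, 3.0, 4.0, 5.0]
noisy = [1.0, 2.0, 3.0, 4.0, 500.0]

clean_bits = binarise_by_median(clean)
noisy_bits = binarise_by_median(noisy)
print(clean_bits, noisy_bits)
```

In this toy case the outlier changes neither the median nor the binary series, which suggests why the magnitude of an outlier matters less after discretisation.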
Learning using the Born rule
, 2006
Cited by 5 (0 self)
In Quantum Mechanics the transition from a deterministic description to a probabilistic one is done using a simple rule termed the Born rule. This rule states that the probability of an outcome (a) given a state (Ψ) is the square of their inner product, (a⊤Ψ)². In this paper, we will explore the use of the Born-rule-based probabilities for clustering, feature selection, classification, and for comparison between sets. We show how these probabilities lead to existing and new algebraic algorithms for which no other complete probabilistic justification is known.
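As a small numerical illustration of the rule stated in this abstract, the sketch below computes the squared inner product for real unit vectors. The function name `born_probability`, the restriction to real vectors, and the example state are assumptions made for illustration only.

```python
import math

def born_probability(a, psi):
    """Squared inner product of outcome vector a and state vector psi (real case)."""
    dot = sum(x * y for x, y in zip(a, psi))
    return dot ** 2

# Hypothetical unit-norm state in two dimensions.
psi = [math.cos(0.3), math.sin(0.3)]

# For an orthonormal set of outcomes the probabilities sum to one.
basis = [[1.0, 0.0], [0.0, 1.0]]
probs = [born_probability(a, psi) for a in basis]
print(probs, sum(probs))
```

The probabilities sum to one here only because the state has unit norm and the outcomes form an orthonormal basis.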
Parallel Spectral Clustering in Distributed Systems
Cited by 5 (0 self)
Spectral clustering algorithms have been shown to be more effective in finding clusters than some traditional algorithms such as k-means. However, spectral clustering suffers from a scalability problem in both memory use and computational time when the size of a data set is large. To perform clustering on large data sets, we investigate two representative ways of approximating the dense similarity matrix. We compare one approach by sparsifying the matrix with another by the Nyström method. We then pick the strategy of sparsifying the matrix via retaining nearest neighbors and investigate its parallelization. We parallelize both memory use and computation on distributed computers. Through
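The idea of sparsifying a dense similarity matrix by retaining nearest neighbors can be sketched as follows. This is a single-machine illustration, not the paper's parallel implementation; the function name `sparsify_knn`, the symmetrisation rule (keep an entry if either point lists the other among its neighbors), and the toy matrix are assumptions.

```python
def sparsify_knn(S, k):
    """Keep S[i][j] only if j is among the k most similar points to i, or vice versa.

    Diagonal entries are dropped, so the result holds neighbor similarities only.
    """
    n = len(S)
    keep = set()
    for i in range(n):
        # Rank the other points by similarity to i, highest first.
        ranked = sorted((j for j in range(n) if j != i),
                        key=lambda j: S[i][j], reverse=True)
        for j in ranked[:k]:
            keep.add((i, j))
            keep.add((j, i))  # symmetrise so the result stays a valid affinity matrix
    return [[S[i][j] if (i, j) in keep else 0.0 for j in range(n)]
            for i in range(n)]

# Hypothetical 3-point similarity matrix.
S = [[1.0, 0.9, 0.1],
     [0.9, 1.0, 0.2],
     [0.1, 0.2, 1.0]]
sparse = sparsify_knn(S, k=1)
print(sparse)
```

With n points and k neighbors per point, at most 2nk entries are nonzero, compared with n² in the dense matrix.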
A probabilistic validation algorithm for Web users’ clusters
- In Proceedings of the IEEE international conference on systems, man and cybernetics (SMC
, 2004
Cited by 3 (1 self)
Cluster analysis is one of the most important aspects in the data mining process for discovering groups and identifying interesting distributions or patterns over the considered data sets. In the context of Web data mining, model-based clustering algorithms are often used to cluster similar users' sessions in order to determine Website access behaviors. An important issue in cluster analysis is the evaluation of clustering results to find the partitioning that best fits the underlying data. In this paper, we present a novel validation technique for model-based clustering approaches.
Unsupervised learning for expert-based software quality estimation
- In The Eighth IEEE International Symposium on High Assurance Systems Engineering (HASE 2004
, 2004
Cited by 3 (0 self)
Current software quality estimation models often involve using supervised learning methods to train a software quality classifier or a software fault prediction model. In such models, the dependent variable is a software quality measurement indicating the quality of a software module by either a risk-based class membership (e.g., whether it is fault-prone or not fault-prone) or the number of faults. In reality, such a measurement may be inaccurate, or even unavailable. In such situations, this paper advocates the use of unsupervised learning (i.e., clustering) techniques to build a software quality estimation system, with the help of a software engineering human expert. The system first clusters hundreds of software modules into a small number of coherent groups and presents the representative of each group to a software quality expert, who labels each cluster as either fault-prone or not fault-prone based on his domain knowledge as well as some data statistics (without any knowledge of the dependent variable, i.e., the software quality measurement). Our preliminary empirical results show the promising potential of this methodology in both predicting software quality and detecting potential noise in a software measurement and quality dataset.
Intuitive Clustering of Biological Data
Cited by 2 (1 self)
K-means clustering combines a variety of striking properties because of which it is widely used in applications: training is intuitive and simple, the final classifier represents classes by geometrically meaningful prototypes, and the algorithm is quite powerful compared to more complex alternative clustering algorithms. In this contribution, we focus on extensions which incorporate additional information into the clustering algorithm to achieve a better accuracy: neighborhood cooperation from neural gas, (possibly fuzzy) label information of input data, and general problem-adapted distances instead of the standard Euclidean metric. These extensions can be formulated in a simple general framework by means of a cost function. We demonstrate the ability of these variants on several representative clustering problems from computational biology.
Clustering Algorithms Optimizer: A Framework for Large Datasets
Cited by 1 (0 self)
Clustering algorithms are employed in many bioinformatics tasks, including categorization of protein sequences and analysis of gene-expression data. Although these algorithms are routinely applied, many of them suffer from the following limitations: (i) relying on predetermined parameter tuning, such as a-priori knowledge regarding the number of clusters; (ii) involving nondeterministic procedures that yield inconsistent outcomes. Thus, a framework that addresses these shortcomings is desirable. We provide a data-driven framework that includes two interrelated steps. The first one is SVD-based dimension reduction and the second is an automated tuning of the algorithm's parameter(s). The dimension reduction step is efficiently adjusted for very large datasets. The optimal parameter setting is identified according to the internal evaluation criterion known as Bayesian Information Criterion (BIC). This framework can incorporate most clustering algorithms and improve their performance. In this study we illustrate the effectiveness of this platform by incorporating the standard K-Means and the Quantum Clustering algorithms. The implementations are applied to several gene-expression benchmarks with significant success.

