Results 1–10 of 43
Generative model-based document clustering: a comparative study
Knowledge and Information Systems, 2005
Cited by 24 (0 self)
Abstract
Semi-supervised learning has become an attractive methodology for improving classification models and is often viewed as using unlabeled data to aid supervised learning. However, it can also be viewed as using labeled data to help clustering, namely, semi-supervised clustering. Viewing semi-supervised learning from a clustering angle is useful in practical situations where the set of labels available in the labeled data is not complete, i.e., the unlabeled data contain new classes that are not present in the labeled data. This paper analyzes several multinomial model-based semi-supervised document clustering methods under a principled model-based clustering framework. The framework naturally leads to a deterministic annealing extension of existing semi-supervised clustering approaches. We compare three (slightly) different semi-supervised approaches for clustering documents: Seeded damnl, Constrained damnl, and Feedback-based damnl, where damnl stands for the multinomial model-based deterministic annealing algorithm. The first two are extensions of the seeded k-means and constrained k-means algorithms studied by Basu et al. (2002); the last one is motivated by Cohn et al. (2003). Through empirical experiments on text datasets, we show that: (a) deterministic annealing can often significantly improve the performance of semi-supervised clustering; (b) the constrained approach is the best when the available labels are complete, whereas the feedback-based approach excels when the available labels are incomplete.
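The seeded variant underlying Seeded damnl can be illustrated with plain seeded k-means in the spirit of Basu et al. (2002): initialise each centroid from the labeled seed points of one class, then iterate standard k-means. A minimal sketch, without the deterministic-annealing extension; the function name and the seed format are illustrative assumptions:

```python
import numpy as np

def seeded_kmeans(X, seeds, n_iter=20):
    """Seeded k-means sketch: centroids are initialised from labeled seed
    points (the mean of each seed class) instead of at random, then the
    standard k-means loop runs on all of X.

    X     : (n, d) data matrix
    seeds : dict mapping class label -> array of seed rows for that class
    """
    centroids = np.array([s.mean(axis=0) for s in seeds.values()])
    for _ in range(n_iter):
        # assign every point to its nearest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # re-estimate centroids; keep the old one if a cluster empties
        for k in range(len(centroids)):
            if (labels == k).any():
                centroids[k] = X[labels == k].mean(axis=0)
    return labels, centroids
```

With only a few seeds per class, the labeled data fix the correspondence between clusters and classes, which is the point of seeding.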
Parallel Spectral Clustering in Distributed Systems
Cited by 22 (0 self)
Abstract
Spectral clustering algorithms have been shown to be more effective at finding clusters than some traditional algorithms such as k-means. However, spectral clustering suffers from a scalability problem in both memory use and computational time when the data set is large. To perform clustering on large data sets, we investigate two representative ways of approximating the dense similarity matrix: sparsifying the matrix versus the Nyström method. We then pick the strategy of sparsifying the matrix by retaining nearest neighbors and investigate its parallelization, parallelizing both memory use and computation on distributed computers.
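The sparsification route can be sketched on a single machine: keep only each point's t nearest neighbours in the similarity matrix, form the normalised Laplacian, and run k-means on its bottom-k eigenvectors. A minimal illustrative sketch under those assumptions (the paper's distributed parallelisation and the Nyström alternative are omitted; parameter names are my own):

```python
import numpy as np

def sparse_spectral_clustering(X, k, t=8, sigma=1.0):
    """Spectral clustering with a t-nearest-neighbour sparsified Gaussian
    similarity matrix, normalised Laplacian embedding, and plain k-means
    (farthest-point initialisation) on the embedding rows."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    S = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(S, 0.0)
    # sparsify: zero everything outside each row's t largest similarities,
    # then symmetrise so the graph stays undirected
    keep = np.argsort(-S, axis=1)[:, :t]
    mask = np.zeros_like(S, dtype=bool)
    mask[np.arange(n)[:, None], keep] = True
    S = np.where(mask | mask.T, S, 0.0)
    deg = S.sum(axis=1)
    D = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    L = np.eye(n) - D @ S @ D                      # normalised Laplacian
    _, vecs = np.linalg.eigh(L)
    U = vecs[:, :k]                                # bottom-k eigenvectors
    U = U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)
    # k-means on the spectral embedding, deterministic farthest-point init
    cent = [U[0]]
    for _ in range(k - 1):
        dist = np.min([np.linalg.norm(U - c, axis=1) for c in cent], axis=0)
        cent.append(U[dist.argmax()])
    cent = np.array(cent)
    for _ in range(50):
        lab = np.linalg.norm(U[:, None] - cent[None], axis=2).argmin(axis=1)
        cent = np.array([U[lab == j].mean(axis=0) if (lab == j).any() else cent[j]
                         for j in range(k)])
    return lab
```

The dense n-by-n matrix here is exactly what the paper avoids at scale; the sketch only shows where sparsification slots into the pipeline.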
MOSAIC: A proximity graph approach for agglomerative clustering
In: The 9th Intl. Conf. on Data Warehousing and Knowledge Discovery, 2007
Cited by 15 (12 self)
Abstract
Representative-based clustering algorithms are quite popular due to their relatively high speed and their sound theoretical foundation. On the other hand, the clusters they can obtain are limited to convex shapes, and clustering results are also highly sensitive to initialization. In this paper, a novel agglomerative clustering algorithm called MOSAIC is proposed that greedily merges neighboring clusters, maximizing a given fitness function. MOSAIC uses Gabriel graphs to determine which clusters are neighbors and approximates non-convex shapes as unions of small clusters that have been computed using a representative-based clustering algorithm. The experimental results show that this technique leads to clusters of higher quality compared to running a representative-based clustering algorithm standalone. Given a suitable fitness function, MOSAIC is able to detect clusters of arbitrary shape. In addition, MOSAIC is capable of dealing with high-dimensional data.
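MOSAIC's neighbourhood test rests on the Gabriel graph: two points p and q are Gabriel neighbours iff no third point lies strictly inside the ball whose diameter is the segment pq. A small self-contained sketch of just that test (the merge loop and the fitness function are not reproduced; the function name is illustrative):

```python
def gabriel_neighbors(p, q, points):
    """True iff p and q are Gabriel neighbours: no other point r satisfies
    |r - m|^2 < |p - q|^2 / 4, where m is the midpoint of segment pq."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    r2 = sum((a - b) ** 2 for a, b in zip(p, q)) / 4   # squared radius
    for r in points:
        if r == p or r == q:
            continue
        if sum((a - b) ** 2 for a, b in zip(r, m)) < r2:
            return False
    return True
```

In MOSAIC the test is applied to cluster representatives rather than raw points, so only clusters adjacent in the Gabriel graph are candidates for merging.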
Integrating recommendation models for improved web page prediction accuracy
Thirty-First Australasian Computer Science Conference (ACSC'08), 2008
Cited by 9 (1 self)
Abstract
Recent research initiatives have addressed the need for improved Web page prediction accuracy, which would benefit many applications, e-business in particular. Different Web usage mining frameworks have been implemented for this purpose, specifically association rules, clustering, and Markov models. Each of these frameworks has its own strengths and weaknesses, and it has been shown that using each of them individually does not provide a suitable solution for today's Web page prediction needs. This paper endeavors to improve Web page prediction accuracy with a novel approach that integrates clustering, association rules, and Markov models according to certain constraints. Experimental results show that this integration provides better prediction accuracy than using each technique individually.
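The Markov-model ingredient on its own reduces to a first-order transition table over page visits. A minimal illustrative sketch (the paper's integration with clustering and association rules is not shown; names are assumptions):

```python
from collections import Counter, defaultdict

def train_markov(sessions):
    """Count first-order transitions page -> next page across user sessions."""
    trans = defaultdict(Counter)
    for session in sessions:
        for cur, nxt in zip(session, session[1:]):
            trans[cur][nxt] += 1
    return trans

def predict_next(trans, page):
    """Predict the most frequently observed successor of `page`, or None
    if the page was never seen with a successor."""
    if page not in trans or not trans[page]:
        return None
    return trans[page].most_common(1)[0][0]
```

A pure first-order model like this is fast but short-sighted, which is one reason the paper combines it with clustering and association rules.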
Clustering Time Series from Mixture Polynomial Models with Discretised Data
In Proceedings of the Second Australasian Data Mining Workshop, 2003
Cited by 8 (3 self)
Abstract
Clustering time series is an active research area with applications in many fields. One common feature of time series is the likely presence of outliers. These uncharacteristic data can significantly affect the quality of the clusters formed. This paper evaluates a method of overcoming the detrimental effects of outliers. We describe some of the alternative approaches to clustering time series, then specify a particular class of model for experimentation with k-means clustering and a correlation-based distance metric. For data derived from this class of model, we demonstrate that discretising the data into a binary series of above and below the median improves the clustering when the data contain outliers.
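The discretisation step is simple to state: map each series to 0/1 according to whether each value lies above that series' own median, then compare series with a correlation-based distance. A minimal sketch under those assumptions (the mixture-polynomial generator and the k-means loop are omitted; names are illustrative):

```python
from statistics import median

def discretise(series):
    """Binary series: 1 where the value is above the series' median, else 0.
    An outlier can move its own point's bit, but barely shifts the median,
    so it cannot distort the rest of the series."""
    m = median(series)
    return [1 if x > m else 0 for x in series]

def corr_distance(x, y):
    """1 - Pearson correlation: a shape-based distance in [0, 2]."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    if sx == 0 or sy == 0:
        return 1.0
    return 1 - cov / (sx * sy)
```

The robustness comes from the median: a single extreme value changes one bit of the discretised series instead of dominating the correlation.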
Learning using the Born rule
2006
Cited by 7 (0 self)
Abstract
In quantum mechanics, the transition from a deterministic description to a probabilistic one is made using a simple rule termed the Born rule. This rule states that the probability of an outcome (a) given a state (Ψ) is the square of their inner product ((a⊤Ψ)²). In this paper, we explore the use of Born-rule-based probabilities for clustering, feature selection, classification, and comparison between sets. We show how these probabilities lead to existing and new algebraic algorithms for which no other complete probabilistic justification is known.
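The rule itself is one line of linear algebra: for unit vectors, p(a | Ψ) = (a⊤Ψ)², and over an orthonormal basis these squares sum to one, which is what lets them serve as probabilities. A tiny sketch:

```python
def born_probability(a, psi):
    """Born rule: p(a | psi) = (a . psi)^2 for unit vectors a and psi."""
    return sum(x * y for x, y in zip(a, psi)) ** 2
```

Summing over an orthonormal basis recovers |Ψ|² = 1, so the outcome probabilities form a proper distribution.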
Clustering processes
Cited by 6 (6 self)
Abstract
The problem of clustering is considered for the case when each data point is a sample generated by a stationary ergodic process. We propose a very natural asymptotic notion of consistency and show that simple consistent algorithms exist under the most general non-parametric assumptions. The notion of consistency is as follows: two samples should be put into the same cluster if and only if they were generated by the same distribution. With this notion of consistency, clustering generalizes such classical statistical problems as homogeneity testing and process classification. We show that, for the case of a known number of clusters, consistency can be achieved under the sole assumption that the joint distribution of the data is stationary ergodic (no parametric or Markovian assumptions, and no assumptions of independence, either between or within the samples). If the number of clusters is unknown, consistency can be achieved under appropriate assumptions on the mixing rates of the processes. In both cases we give examples of simple (at most quadratic in each argument) algorithms which are consistent.
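The consistency notion above compares samples through their empirical distributions. One simplified, hypothetical stand-in for such a distributional distance sums total-variation distances between empirical word frequencies at increasing word lengths, with geometrically decaying weights (this is a sketch of the idea, not the paper's exact definition):

```python
from collections import Counter

def empirical_distance(x, y, max_len=3):
    """For each word length l up to max_len, compute the total-variation
    distance between the empirical frequencies of length-l words in the
    discrete-valued sequences x and y, and sum with weights 2^-l."""
    d = 0.0
    for l in range(1, max_len + 1):
        fx = Counter(tuple(x[i:i + l]) for i in range(len(x) - l + 1))
        fy = Counter(tuple(y[i:i + l]) for i in range(len(y) - l + 1))
        nx, ny = sum(fx.values()), sum(fy.values())
        words = set(fx) | set(fy)
        tv = 0.5 * sum(abs(fx[w] / nx - fy[w] / ny) for w in words)
        d += 2.0 ** -l * tv
    return d
```

For stationary ergodic processes such empirical frequencies converge, so samples from the same distribution drift together while samples from different distributions stay apart, which is what a consistent clustering rule needs.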
Online Clustering of Processes
Cited by 5 (5 self)
Abstract
The problem of online clustering is considered in the case where each data point is a sequence generated by a stationary ergodic process. Data arrive in an online fashion, so that the sample received at every time step is either a continuation of some previously received sequence or a new sequence. The dependence between the sequences can be arbitrary. No parametric or independence assumptions are made; the only assumption is that the marginal distribution of each sequence is stationary and ergodic. A novel, computationally efficient algorithm is proposed and shown to be asymptotically consistent (under a natural notion of consistency). The performance of the proposed algorithm is evaluated on simulated data, as well as on real datasets (motion classification).
Unsupervised learning for expertbased software quality estimation
In The Eighth IEEE International Symposium on High Assurance Systems Engineering (HASE 2004), 2004
Cited by 4 (0 self)
Abstract
Current software quality estimation models often involve using supervised learning methods to train a software quality classifier or a software fault prediction model. In such models, the dependent variable is a software quality measurement indicating the quality of a software module, either by a risk-based class membership (e.g., whether it is fault-prone or not fault-prone) or by the number of faults. In reality, such a measurement may be inaccurate, or even unavailable. In such situations, this paper advocates the use of unsupervised learning (i.e., clustering) techniques to build a software quality estimation system, with the help of a software engineering human expert. The system first clusters hundreds of software modules into a small number of coherent groups and presents the representative of each group to a software quality expert, who labels each cluster as either fault-prone or not fault-prone based on his domain knowledge as well as some data statistics (without any knowledge of the dependent variable, i.e., the software quality measurement). Our preliminary empirical results show the promising potential of this methodology both in predicting software quality and in detecting potential noise in a software measurement and quality dataset.
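The representative-selection step can be sketched as: after clustering the module metrics, show the expert the module closest to each cluster centroid. A minimal illustrative sketch (the clustering itself and the expert-labeling loop are outside the snippet; names are assumptions):

```python
def nearest_to_centroids(X, labels, k):
    """For each of the k clusters, return the index of the member closest
    to the cluster centroid -- the 'representative' a human expert would
    inspect and label as fault-prone or not fault-prone."""
    reps = []
    d = len(X[0])
    for j in range(k):
        members = [i for i, l in enumerate(labels) if l == j]
        cent = [sum(X[i][t] for i in members) / len(members) for t in range(d)]
        reps.append(min(members,
                        key=lambda i: sum((X[i][t] - cent[t]) ** 2
                                          for t in range(d))))
    return reps
```

Labeling one representative per cluster is what makes the expert's workload constant in the number of modules rather than linear.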