Results 11-20 of 110
A Fast and Robust General Purpose Clustering Algorithm
In Pacific Rim International Conference on Artificial Intelligence, 2000
Cited by 16 (2 self)
General purpose and highly applicable clustering methods are usually required during the early stages of knowledge discovery exercises. k-Means has been adopted as the prototype of iterative model-based clustering because of its speed, simplicity and capability to work within the format of very large databases. However, k-Means has several disadvantages derived from its statistical simplicity. We propose an algorithm that remains very efficient, generally applicable and multidimensional, but is more robust to noise and outliers. We achieve this by using the discrete median rather than the mean as the estimator of the center of a cluster. Comparison with k-Means, Expectation Maximization and Gibbs sampling demonstrates the advantages of our algorithm.
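To illustrate why a discrete median can be more robust than a mean, here is a minimal sketch. It assumes the discrete median of a cluster is the cluster member with the smallest total Euclidean distance to the other members; the function names and the sample data are hypothetical and are not taken from the paper.

```python
import math

def discrete_median(points):
    """Return the member of `points` with the smallest total distance to all members."""
    def total_distance(p):
        return sum(math.dist(p, q) for q in points)
    return min(points, key=total_distance)

def mean(points):
    """Return the coordinate-wise arithmetic mean of `points`."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))

# Four tightly grouped points plus one outlier (illustrative data only).
cluster = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.0), (50.0, 50.0)]
center_median = discrete_median(cluster)
center_mean = mean(cluster)
```

In this toy data the mean is pulled far toward the outlier, while the discrete median remains one of the four grouped points. The sketch is quadratic in the cluster size and is meant only to show the estimator, not the efficient algorithm the abstract describes.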
An alternative extension of the k-means algorithm for clustering categorical data
Int. J. Appl. Math. Comput. Sci., 2004
Cited by 14 (0 self)
Most earlier work on clustering has focused on numerical data, whose inherent geometric properties can be exploited to naturally define distance functions between data points. Recently, the problem of clustering categorical data has started drawing interest. However, the computational cost makes most of the previous algorithms unacceptable for clustering very large databases. The k-means algorithm is well known for its efficiency in this respect. At the same time, because it works only on numerical data, it cannot be used for clustering categorical data. The main contribution of this paper is to show how to apply the notion of “cluster centers” to a dataset of categorical objects and how to use this notion to formulate the clustering of categorical objects as a partitioning problem. Finally, a k-means-like algorithm for clustering categorical data is introduced. The clustering performance of the algorithm is demonstrated with two well-known data sets, namely the soybean disease and nursery databases.
On Data Clustering Analysis: Scalability, Constraints and Validation
In Proceedings of the 6th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), 2002
Cited by 13 (2 self)
In this paper we discuss some very recent clustering approaches and recount our experience with some of these algorithms. We also present the problem of clustering in the presence of constraints and discuss the issue of clustering validation.
A Cube Model for Web Access Sessions and Cluster Analysis
In Proceedings of the 3rd International Workshop on Mining Web Data (WEBKDD 2001), 2001
Cited by 10 (0 self)
Identification of the navigational patterns of casual visitors is an important step in online recommendation to convert casual visitors to customers in e-commerce. Clustering and sequential analysis are two primary techniques for mining navigational patterns from Web and application server logs. The characteristics of the log data and mining tasks require new data representation methods and analysis algorithms to be tested in the e-commerce environment. In this paper we present a cube model to represent Web access sessions for data mining. The cube model organizes session data into three dimensions. The COMPONENT dimension represents a session as a set of ordered components $\{c_1, c_2, \ldots, c_P\}$, in which each component $c_i$ indexes the $i$th visited page in the session. Each component is associated with a set of attributes describing the page indexed by it, such as page ID, page category and view time spent at a page. The attributes associated with each component are defined in the ATTRIBUTE dimension. The SESSION dimension indexes individual sessions. In the model, irregular sessions are converted to a regular data structure to which existing data mining algorithms can be applied while the order of the page sequences is maintained. A rich set of page attributes is embedded in the model for different analysis purposes. We also present some experimental results of using the k-modes algorithm to cluster sessions. Because the sessions are essentially sequences of categories, the k-modes algorithm designed for clustering categorical data proves effective and efficient. Furthermore, we present a new approach to using the first-order Markov transition frequency (or probability) matrix to analyze clustering results for categorical sequences. Some initial results are given.
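The first-order Markov transition frequency (or probability) matrix mentioned in this abstract can be sketched as follows. This is only an illustration under assumptions: each session is taken to be a list of page-category labels, the matrix is stored sparsely as a dictionary keyed by (current, next) pairs, and the function names and sample sessions are hypothetical rather than taken from the paper.

```python
from collections import Counter

def transition_counts(sessions):
    """Count how often each category is immediately followed by another."""
    counts = Counter()
    for session in sessions:
        for current, nxt in zip(session, session[1:]):
            counts[(current, nxt)] += 1
    return counts

def transition_probabilities(counts):
    """Normalize each row of the frequency matrix so that it sums to 1."""
    row_totals = Counter()
    for (current, _), n in counts.items():
        row_totals[current] += n
    return {pair: n / row_totals[pair[0]] for pair, n in counts.items()}

# Illustrative sessions: each is an ordered sequence of page categories.
sessions = [
    ["home", "news", "sports"],
    ["home", "news", "news"],
    ["home", "shop"],
]
counts = transition_counts(sessions)
probs = transition_probabilities(counts)
```

Computing one such matrix per cluster of sessions would let the dominant transitions of different clusters be compared, which is one plausible way to read "analyze clustering results" here.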
ECCLAT: a New Approach of Clusters Discovery in Categorical Data
In the 22nd Int. Conf. on Knowledge Based Systems and Applied Artificial Intelligence (ES'02), 2002
Cited by 8 (5 self)
In this paper we present a new approach for the discovery of meaningful clusters from large categorical data (which is a usual situation, e.g., web data analysis). Our method, called Ecclat (for Extraction of Clusters from Concepts LATtice), extracts a subset of concepts from the frequent closed itemsets lattice, using an evaluation measure. Ecclat is generic because it allows building approximate clusterings and discovering meaningful clusters with slight overlapping. The approach is illustrated on a classical data set and on web data analysis.
A unified view on clustering binary data
Machine Learning
Cited by 7 (1 self)
Clustering is the problem of identifying the distribution of patterns and intrinsic correlations in large data sets by partitioning the data points into similarity classes. This paper studies the problem of clustering binary data. Binary data have been occupying a special place in the domain of data analysis. A unified view of binary data clustering is presented by examining the connections among various clustering criteria. Experimental studies are conducted to empirically verify the relationships.
A data cube model for prediction-based Web prefetching
Journal of Intelligent Information Systems, 2003
Cited by 7 (1 self)
Reducing web latency is one of the primary concerns of Internet research. Web caching and web prefetching are two effective techniques for latency reduction. A primary method for intelligent prefetching is to rank potential web documents based on prediction models that are trained on past web server and proxy server log data, and to prefetch the highly ranked objects. For this method to work well, the prediction model must be updated constantly, and different queries must be answered efficiently. In this paper we present a data-cube model to represent Web access sessions for data mining, in support of prediction model construction. The cube model organizes session data into three dimensions. With the data cube in place, we apply efficient data mining algorithms for clustering and correlation analysis. As a result of the analysis, the web page clusters can then be used to guide the prefetching system. In this paper, we propose an integrated web-caching and web-prefetching model, where the issues of prefetching aggressiveness, replacement policy and increased network traffic are addressed together in an integrated framework. The core of our integrated solution is a prediction model based on statistical correlation between web objects. This model can be frequently updated by querying the data cube of web server logs. This integrated data cube and prediction-based prefetching framework represents the first such effort to our knowledge.
Improving K-modes Algorithm Considering Frequencies of Attribute Values
IN MODE. LECTURE NOTES IN ARTIFICIAL INTELLIGENCE, 2005
Cited by 7 (4 self)
The original k-means algorithm is designed to work primarily on numeric data sets. This prohibits the algorithm from being applied to categorical data clustering, which is an integral part of data mining and has attracted much attention recently. The k-modes algorithm extended the k-means paradigm to cluster categorical data by using a frequency-based method to update the cluster modes, in contrast to the k-means fashion of minimizing a numerically valued cost. However, the dissimilarity measure used in k-modes does not consider the relative frequencies of attribute values in each cluster mode. This results in weaker intra-cluster similarity, because less similar objects are allocated to the cluster. In this paper, we present an experimental study on applying a new dissimilarity measure to k-modes clustering to improve its clustering accuracy. The measure is based on the idea that the similarity between a data object and a cluster mode is directly proportional to the sum of the relative frequencies of the common values in the mode. Experimental results on real-life datasets show that the modified algorithm is superior to the original k-modes algorithm with respect to clustering accuracy.
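The contrast between simple matching and a frequency-aware dissimilarity can be sketched as below. The weighted form is an assumption made for illustration (a matching attribute contributes one minus the relative frequency of the mode's value in the cluster, and a mismatch contributes 1); it is one plausible reading of the abstract and not necessarily the paper's exact formula. All names and data are hypothetical.

```python
from collections import Counter

def cluster_mode(cluster):
    """Return the most frequent value of each attribute in the cluster."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*cluster))

def simple_matching(obj, mode):
    """Count the attributes on which the object and the mode differ."""
    return sum(a != m for a, m in zip(obj, mode))

def frequency_weighted(obj, mode, cluster):
    """Mismatch costs 1; a match costs 1 minus the mode value's relative frequency."""
    total = 0.0
    for j, (a, m) in enumerate(zip(obj, mode)):
        if a != m:
            total += 1.0
        else:
            rel_freq = sum(row[j] == m for row in cluster) / len(cluster)
            total += 1.0 - rel_freq
    return total

# Illustrative cluster of categorical objects (color, size).
cluster = [("red", "small"), ("red", "large"), ("red", "small"), ("blue", "small")]
mode = cluster_mode(cluster)
d_simple = simple_matching(("red", "large"), mode)
d_weighted = frequency_weighted(("red", "large"), mode, cluster)
d_weighted_exact = frequency_weighted(("red", "small"), mode, cluster)
```

Under simple matching, an object equal to the mode always has dissimilarity 0. Under the weighted variant it stays positive unless every cluster member shares the mode's values, so a match on a strongly dominant value counts for more than a match on a weakly dominant one.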
Clustering orders
In Proc. of the 6th Int’l Conf. on Discovery Science, 2003
Cited by 7 (4 self)
Today I’d like to talk about a dimension reduction for supervised ordering.
Improved K-Modes for Categorical Clustering using Weighted Dissimilarity Measure
International Journal of Information and Mathematical Sciences
Cited by 4 (1 self)
K-Modes is an extension of the K-Means clustering algorithm, developed to cluster categorical data, in which the mean is replaced by the mode. The similarity measure proposed by Huang is the simple matching (or mismatching) measure. The weights of attribute values contribute much to clustering; thus in this paper we propose a new weighted dissimilarity measure for K-Modes, based on the ratio of the frequencies of attribute values in the cluster and in the data set. The new weighted measure is tested on data sets obtained from the UCI data repository. The results are compared with those of K-Modes and K-representative and show that the new measure generates clusters with high purity. Keywords: clustering, categorical data, K-Modes, weighted dissimilarity measure.