X.: Improving K-modes Algorithm Considering Frequencies of Attribute Values in Mode
- Lecture Notes in Artificial Intelligence
, 2005
Cited by 6 (4 self)
Abstract. The original k-means algorithm is designed to work primarily on numeric data sets. This prohibits the algorithm from being applied to categorical data clustering, which is an integral part of data mining and has attracted much attention recently. The k-modes algorithm extended the k-means paradigm to cluster categorical data by using a frequency-based method to update the cluster modes, in place of the k-means approach of minimizing a numerically valued cost. However, the dissimilarity measure used in k-modes does not consider the relative frequencies of attribute values in each cluster mode; this results in weaker intra-cluster similarity, because less similar objects are allocated to the cluster. In this paper, we present an experimental study on applying a new dissimilarity measure to k-modes clustering to improve its clustering accuracy. The measure is based on the idea that the similarity between a data object and a cluster mode is directly proportional to the sum of the relative frequencies of the common values in the mode. Experimental results on real-life datasets show that the modified algorithm is superior to the original k-modes algorithm with respect to clustering accuracy.
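The idea in the abstract above can be illustrated with a minimal Python sketch. It assumes one plausible reading of the measure: a mismatching attribute costs 1, and a matching attribute costs 1 minus the relative frequency of the mode value within the cluster. The function names and the toy cluster are hypothetical and are not taken from the paper.

```python
from collections import Counter

def cluster_mode_and_freqs(cluster):
    """Return, per attribute, the mode value and its relative frequency in the cluster."""
    n = len(cluster)
    modes, freqs = [], []
    for column in zip(*cluster):  # iterate attribute by attribute
        value, count = Counter(column).most_common(1)[0]
        modes.append(value)
        freqs.append(count / n)
    return modes, freqs

def simple_matching(obj, modes):
    """Original k-modes dissimilarity: number of mismatching attributes."""
    return sum(1 for x, z in zip(obj, modes) if x != z)

def frequency_weighted(obj, modes, freqs):
    """Assumed variant: a match costs 1 - f, where f is the mode's relative frequency."""
    return sum(1.0 if x != z else 1.0 - f for x, z, f in zip(obj, modes, freqs))

# Toy cluster with two categorical attributes (hypothetical data).
cluster = [("a", "x"), ("a", "y"), ("a", "x"), ("b", "x")]
modes, freqs = cluster_mode_and_freqs(cluster)
print(modes, freqs)
print(simple_matching(("a", "x"), modes), frequency_weighted(("a", "x"), modes, freqs))
```

Under simple matching, an object that equals the mode always has dissimilarity 0, however weakly the mode represents the cluster. The weighted variant keeps a residual cost when the mode values are shared by only part of the cluster.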
A Robust and Efficient Clustering Algorithm based on Cohesion Self-Merging
- Int. Conf. 8th ACM SIGKDD on Knowledge Discovery and Data Mining
, 2002
Cited by 6 (1 self)
Data clustering has attracted a lot of research attention in the field of computational statistics and data mining. In most related studies, the dissimilarity between two clusters is defined as the distance between their centroids, or the distance between the two closest (or farthest) data points. However, all of these measurements are vulnerable to outliers, and removing the outliers precisely is yet another difficult task. In view of this, we propose a new similarity measurement, referred to as cohesion, to measure the inter-cluster distances. Using this new measurement of cohesion, we design a two-phase clustering algorithm, called cohesion-based self-merging (abbreviated as CSM), which runs in time linear in the size of the input data set. Combining the features of partitional and hierarchical clustering methods, algorithm CSM partitions the input data set into several small subclusters in the first phase, and then continuously merges the subclusters based on cohesion in a hierarchical manner in the second phase. As shown by our performance studies, cohesion-based clustering is very robust and highly tolerant of outliers in various workloads. More importantly, algorithm CSM is shown to be able to cluster data sets of arbitrary shapes very efficiently, and to provide better clustering results than prior methods.
TCSOM: clustering transactions using self-organizing map
- Neural Processing Letters
, 2005
Cited by 3 (2 self)
Abstract. Self-Organizing Map (SOM) networks have been successfully applied as a clustering method to numeric datasets. However, it is not feasible to directly apply SOM to clustering transactional data. This paper proposes the TCSOM (Transactions Clustering using SOM) algorithm for clustering binary transactional data. In the TCSOM algorithm, a normalized dot product norm is used to measure the distance between the input vector and each output neuron, and a modified weight adaptation function is employed to adjust the weights of the winner and its neighbors. More importantly, TCSOM is a one-pass algorithm, which makes it well suited to data mining applications. Experimental results on real datasets show that the TCSOM algorithm is superior to state-of-the-art transactional data clustering algorithms with respect to clustering accuracy.
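A rough Python sketch of a one-pass SOM over binary transactions may help make the abstract above concrete. It is not the TCSOM algorithm itself: the winner is chosen by a normalized dot product (cosine similarity), but the weight update is the standard SOM rule, used here as a stand-in because the abstract does not specify the modified adaptation function. The 1-D neuron layout, the parameters `lr` and `radius`, and all function names are assumptions.

```python
import math
import random

def normalized_dot(x, w):
    """Dot product of x and w divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(x, w))
    nx = math.sqrt(sum(a * a for a in x))
    nw = math.sqrt(sum(b * b for b in w))
    if nx == 0 or nw == 0:
        return 0.0
    return dot / (nx * nw)

def one_pass_som(transactions, n_neurons, lr=0.5, radius=0, seed=0):
    """Assign each binary transaction to a neuron in a single pass over the data."""
    rng = random.Random(seed)
    dim = len(transactions[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_neurons)]
    labels = []
    for x in transactions:
        # The winner is the neuron with the largest normalized dot product.
        winner = max(range(n_neurons), key=lambda k: normalized_dot(x, weights[k]))
        labels.append(winner)
        # Standard SOM update (stand-in): move the winner and its 1-D neighbors toward x.
        for k in range(max(0, winner - radius), min(n_neurons, winner + radius + 1)):
            weights[k] = [w + lr * (a - w) for a, w in zip(x, weights[k])]
    return labels, weights

# Hypothetical binary transactions over four items.
data = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
labels, weights = one_pass_som(data, n_neurons=2)
print(labels)
```

Each transaction is read once and labelled immediately, which is the sense in which the sketch is one-pass. The resulting assignment depends on the random initial weights.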
A New Feature Selection Scheme Using Data Distribution Factor for Transactional Data
Cited by 2 (0 self)
Abstract. A new efficient unsupervised feature selection method is proposed to handle transactional data. The proposed method introduces a new Data Distribution Factor (DDF) to select appropriate clusters. It combines compactness and separation with a newly introduced concept of singleton item. The method is computationally inexpensive and delivers very promising results. Four datasets from the UCI machine learning repository are used in this study. The results show that the proposed method is very efficient and delivers very reliable results.
1.1.1.1.1.1.1 front was passing through accident death of 1st person death of 2nd person damage to AC
- University of Southern California
, 2004
Cited by 1 (0 self)
Current directory-based hierarchical file systems have many limitations as the amount of unstructured data held by individual users keeps increasing. One of the most significant problems is that users usually have difficulty searching, navigating, and organizing their files, since useful semantic information describing a file is not used in the current directory-based system. To solve this problem, several research groups have suggested attribute-based file naming systems. However, their approaches have not been widely used because of a lack of semantic information. In this paper, we describe an ontology-based semantic file naming approach that employs a hierarchical conceptual clustering technique to capture more complex semantic information from the set of file attributes. Ontologies, which play a major role on the Semantic Web, describe the semantics of data by organizing data into taxonomies of concepts and describing the relationships between concepts. To generate the ontology from the set of attribute-value pairs for files, we first extend one of the standard incremental hierarchical clustering techniques, COBWEB, and suggest a new clustering evaluation measure to guide the search through the space of clusterings. From the clustering result, we then generate the ontology and represent it in RDF Schema. Our experimental results show that our extended clustering approach can produce a concept hierarchy of good quality, and is computationally efficient and well suited to building the ontology-based semantic file system.
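For context on the evaluation measure that guides COBWEB, here is a minimal Python sketch of the standard category utility score over attribute-value data. This is the classic COBWEB measure, not the new measure proposed in the abstract above, whose definition is not given there. The function names and the toy file attributes are hypothetical.

```python
from collections import Counter

def sum_sq_probs(objects):
    """Sum over attributes and values of P(attribute = value)^2 within the given objects."""
    n = len(objects)
    total = 0.0
    for column in zip(*objects):  # one column per attribute
        total += sum((count / n) ** 2 for count in Counter(column).values())
    return total

def category_utility(partition):
    """Classic category utility of a partition (a list of clusters of attribute tuples)."""
    all_objects = [obj for cluster in partition for obj in cluster]
    n = len(all_objects)
    base = sum_sq_probs(all_objects)
    gain = sum(len(c) / n * (sum_sq_probs(c) - base) for c in partition)
    return gain / len(partition)

# Hypothetical file attributes: (type, owner).
good = [[("pdf", "alice"), ("pdf", "alice")], [("mp3", "bob"), ("mp3", "bob")]]
bad = [[("pdf", "alice"), ("mp3", "bob")], [("pdf", "alice"), ("mp3", "bob")]]
print(category_utility(good), category_utility(bad))
```

A partition whose clusters make attribute values more predictable than they are in the whole data set scores higher, which is what lets an incremental algorithm compare candidate placements of a new object.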
Bregman Bubble Clustering: A Robust Framework for Mining Dense Clusters
In classical clustering, each data point is assigned to at least one cluster. However, in many applications only a small subset of the available data is relevant for the problem and the rest needs to be ignored in order to obtain good clusters. Certain nonparametric density-based clustering methods find the most relevant data as multiple dense regions, but such methods are generally limited to low-dimensional data and do not scale well to large, high-dimensional datasets. Also, they use a specific notion of “distance”, typically Euclidean or Mahalanobis distance, which further limits their applicability. On the other hand, the recent One Class Information Bottleneck (OC-IB) method is fast and works on a large class of distortion measures known as Bregman Divergences, but can only find a single dense region. This article presents a broad framework for finding k dense clusters while ignoring the rest of the data. It includes a seeding algorithm that can automatically determine a suitable value for k. When k is forced to 1, our method gives rise to an improved version of OC-IB with optimality guarantees. We provide a generative model that yields the proposed iterative algorithm for finding k dense regions as a special case. Our analysis reveals an interesting and novel connection between the problem of finding dense regions and exponential mixture models; a hard model corresponding to k exponential mixtures with a uniform background results
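The Bregman divergences mentioned in the abstract above can be illustrated with a small Python sketch. It uses the textbook definition d_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y> for a convex function phi, and shows two familiar instances. The helper names are hypothetical, and the sketch covers only the divergence, not the clustering framework.

```python
import math

def bregman(phi, grad_phi, x, y):
    """Bregman divergence d_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>."""
    g = grad_phi(y)
    return phi(x) - phi(y) - sum(gi * (xi - yi) for gi, xi, yi in zip(g, x, y))

def sq_norm(v):
    """phi(v) = ||v||^2, which generates squared Euclidean distance."""
    return sum(a * a for a in v)

def sq_norm_grad(v):
    return [2 * a for a in v]

def neg_entropy(v):
    """phi(v) = sum v_i log v_i, which generates KL divergence on probability vectors."""
    return sum(a * math.log(a) for a in v)

def neg_entropy_grad(v):
    return [math.log(a) + 1 for a in v]

print(bregman(sq_norm, sq_norm_grad, [1, 2], [3, 5]))  # squared Euclidean distance: 13
p, q = [0.5, 0.5], [0.25, 0.75]
print(bregman(neg_entropy, neg_entropy_grad, p, q))    # KL(p || q)
```

Swapping the pair (`phi`, `grad_phi`) changes the distortion measure without changing the calling code, which is why an algorithm written against Bregman divergences applies to a whole family of distances.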

