ECCLAT: a New Approach of Clusters Discovery in Categorical Data
- In the 22nd Int. Conf. on Knowledge Based Systems and Applied Artificial Intelligence (ES'02), 2002
Cited by 8 (5 self)
In this paper we present a new approach for the discovery of meaningful clusters from large categorical data (a common situation, e.g., in web data analysis). Our method, called Ecclat (for Extraction of Clusters from Concepts LATtice), extracts a subset of concepts from the lattice of frequent closed itemsets, using an evaluation measure. Ecclat is generic because it can build approximate clusterings and discover meaningful clusters with slight overlap. The approach is illustrated on a classical data set and on web data analysis.
A Cube Model for Web Access Sessions and Cluster Analysis
- In Proceedings of the 3rd International Workshop on Mining Web Data (WEBKDD 2001), 2001
Cited by 8 (0 self)
Identification of the navigational patterns of casual visitors is an important step in online recommendation to convert casual visitors to customers in e-commerce. Clustering and sequential analysis are two primary techniques for mining navigational patterns from Web and application server logs. The characteristics of the log data and mining tasks require new data representation methods and analysis algorithms to be tested in the e-commerce environment. In this paper we present a cube model to represent Web access sessions for data mining. The cube model organizes session data into three dimensions. The COMPONENT dimension represents a session as a set of ordered components {c_1, c_2, ..., c_P}, in which each component c_i indexes the ith visited page in the session. Each component is associated with a set of attributes describing the page it indexes, such as page ID, page category and view time spent at the page. The attributes associated with each component are defined in the ATTRIBUTE dimension. The SESSION dimension indexes individual sessions. In the model, irregular sessions are converted to a regular data structure to which existing data mining algorithms can be applied while the order of the page sequences is maintained. A rich set of page attributes is embedded in the model for different analysis purposes. We also present some experimental results of using the k-modes algorithm to cluster sessions. Because the sessions are essentially sequences of categories, the k-modes algorithm, designed for clustering categorical data, proves effective and efficient. Furthermore, we present a new approach to using the first-order Markov transition frequency (or probability) matrix to analyze clustering results for categorical sequences. Some initial results are given.
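The first-order Markov transition frequency (or probability) matrix mentioned in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `transition_matrix`, the sparse dictionary representation, and the sample category sequences are assumptions made for illustration only.

```python
from collections import Counter

def transition_matrix(sessions, normalize=True):
    """Count first-order transitions between consecutive page categories.

    `sessions` is a list of category sequences (hypothetical data layout).
    Returns a sparse matrix as {(from_category, to_category): value}.
    """
    counts = Counter()
    for session in sessions:
        # Pair each visited category with the one that follows it.
        for current, following in zip(session, session[1:]):
            counts[(current, following)] += 1
    if not normalize:
        return dict(counts)  # transition frequencies
    # Divide each count by its row total to get transition probabilities.
    row_totals = Counter()
    for (current, _), n in counts.items():
        row_totals[current] += n
    return {pair: n / row_totals[pair[0]] for pair, n in counts.items()}

# Illustrative sessions assumed to belong to one cluster.
cluster_sessions = [
    ["home", "catalog", "product"],
    ["home", "catalog", "cart"],
    ["home", "search", "product"],
]
freq = transition_matrix(cluster_sessions, normalize=False)
prob = transition_matrix(cluster_sessions)
```

Computing one such matrix per cluster and comparing the matrices is one plausible way to inspect how navigation differs between clusters.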
Improving K-modes Algorithm Considering Frequencies of Attribute Values in Mode
- Lecture Notes in Artificial Intelligence, 2005
Cited by 6 (4 self)
The original k-means algorithm is designed to work primarily on numeric data sets. This prohibits the algorithm from being applied to categorical data clustering, which is an integral part of data mining and has attracted much attention recently. The k-modes algorithm extended the k-means paradigm to cluster categorical data by using a frequency-based method to update the cluster modes, in contrast to the k-means approach of minimizing a numerically valued cost. However, the dissimilarity measure used in k-modes does not consider the relative frequencies of attribute values in each cluster mode. This results in weaker intra-cluster similarity, because less similar objects are allocated to the cluster. In this paper, we present an experimental study on applying a new dissimilarity measure to k-modes clustering to improve its clustering accuracy. The measure is based on the idea that the similarity between a data object and a cluster mode is directly proportional to the sum of the relative frequencies of the common values in the mode. Experimental results on real-life datasets show that the modified algorithm is superior to the original k-modes algorithm with respect to clustering accuracy.
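The contrast between the two dissimilarity measures can be sketched in Python as follows. The abstract does not give the exact formula, so the weighting used here (a matching attribute contributes one minus the relative frequency of the mode's value within the cluster, and a mismatch contributes one) is an assumption, as are the function names and the toy data.

```python
def simple_matching(obj, mode):
    """Original k-modes dissimilarity: number of mismatching attributes."""
    return sum(a != b for a, b in zip(obj, mode))

def frequency_weighted(obj, mode, cluster):
    """Assumed frequency-aware variant; `cluster` must be non-empty.

    A mismatch costs 1. A match costs 1 minus the share of cluster
    members that carry the mode's value for that attribute.
    """
    total = 0.0
    for j, (a, b) in enumerate(zip(obj, mode)):
        if a != b:
            total += 1.0
        else:
            share = sum(1 for member in cluster if member[j] == b) / len(cluster)
            total += 1.0 - share
    return total

# Toy cluster of (colour, size) objects with mode ("red", "small").
cluster = [("red", "small"), ("red", "large"), ("red", "small"), ("blue", "small")]
mode = ("red", "small")
obj = ("red", "large")

d_simple = simple_matching(obj, mode)
d_weighted = frequency_weighted(obj, mode, cluster)
```

Under this assumed weighting, a match on a value that only part of the cluster shares still adds a small cost, so objects agreeing with the mode on its most frequent values are treated as closer.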
Improving the Accuracy and Efficiency of the k-means Clustering Algorithm
Cited by 6 (0 self)
The emergence of modern techniques for scientific data collection has resulted in large-scale accumulation of data pertaining to diverse fields. Conventional database querying methods are inadequate for extracting useful information from huge data banks. Cluster analysis is one of the major data analysis methods, and the k-means clustering algorithm is widely used in many practical applications. However, the original k-means algorithm is computationally expensive, and the quality of the resulting clusters depends heavily on the selection of initial centroids. Several methods have been proposed in the literature for improving the performance of the k-means clustering algorithm. This paper proposes a method for making the algorithm more effective and efficient, so as to obtain better clustering with reduced complexity.
Clustering orders
- In Proc. of the 6th Int'l Conf. on Discovery Science, 2003
Cited by 5 (3 self)
Today I’d like to talk about a dimension reduction for supervised ordering.
A data cube model for prediction-based Web prefetching
- Journal of Intelligent Information Systems, 2003
Cited by 5 (1 self)
Reducing web latency is one of the primary concerns of Internet research. Web caching and web prefetching are two effective techniques for latency reduction. A primary method for intelligent prefetching is to rank potential web documents based on prediction models that are trained on past web server and proxy server log data, and to prefetch the highly ranked objects. For this method to work well, the prediction model must be updated constantly, and different queries must be answered efficiently. In this paper we present a data-cube model to represent Web access sessions for data mining, in support of prediction model construction. The cube model organizes session data into three dimensions. With the data cube in place, we apply efficient data mining algorithms for clustering and correlation analysis. As a result of the analysis, the web page clusters can then be used to guide the prefetching system. We propose an integrated web-caching and web-prefetching model, where the issues of prefetching aggressiveness, replacement policy and increased network traffic are addressed together in an integrated framework. The core of our integrated solution is a prediction model based on statistical correlation between web objects. This model can be frequently updated by querying the data cube of web server logs. To our knowledge, this integrated data cube and prediction-based prefetching framework represents the first such effort.
A unified view on clustering binary data
- Machine Learning
Cited by 4 (0 self)
Clustering is the problem of identifying the distribution of patterns and intrinsic correlations in large data sets by partitioning the data points into similarity classes. This paper studies the problem of clustering binary data. Binary data have been occupying a special place in the domain of data analysis. A unified view of binary data clustering is presented by examining the connections among various clustering criteria. Experimental studies are conducted to empirically verify the relationships.
TCSOM: clustering transactions using self-organizing map
- Neural Processing Letters, 2005
Cited by 3 (2 self)
Self-Organizing Map (SOM) networks have been successfully applied as a clustering method to numeric datasets. However, it is not feasible to directly apply SOM to clustering transactional data. This paper proposes the TCSOM (Transactions Clustering using SOM) algorithm for clustering binary transactional data. In the TCSOM algorithm, a normalized dot product is used to measure the distance between the input vector and an output neuron, and a modified weight adaptation function is employed for adjusting the weights of the winner and its neighbors. More importantly, TCSOM is a one-pass algorithm, which makes it well suited to data mining applications. Experimental results on real datasets show that the TCSOM algorithm is superior to state-of-the-art transactional data clustering algorithms with respect to clustering accuracy.
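As a rough illustration of winner selection with a normalized dot product, the Python sketch below reads "normalized dot product" as cosine similarity. That reading is an assumption, since the abstract does not define the measure, and the sketch omits the modified weight adaptation function entirely. The names `normalized_dot` and `winner` and the sample vectors are hypothetical.

```python
import math

def normalized_dot(x, w):
    """Dot product divided by the product of vector lengths (cosine similarity)."""
    dot = sum(a * b for a, b in zip(x, w))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in w))
    # Treat an all-zero vector as having no similarity to anything.
    return dot / norm if norm else 0.0

def winner(x, neurons):
    """Index of the neuron whose weight vector is most similar to x."""
    return max(range(len(neurons)), key=lambda i: normalized_dot(x, neurons[i]))

# A binary transaction over four items and two illustrative weight vectors.
transaction = [1, 1, 0, 0]
neurons = [
    [0.9, 0.8, 0.1, 0.0],
    [0.1, 0.0, 0.9, 0.7],
]
best = winner(transaction, neurons)
```

In a one-pass setting, each transaction would be presented once, the winner located this way, and the weights of the winner and its neighbors adjusted before moving on.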
Multiple Layer Clustering of Large Software Systems
- In Proceedings of the Twelfth Working Conference on Reverse Engineering (WCRE 2005)
Cited by 3 (0 self)
Software clustering algorithms presented in the literature rarely incorporate dynamic information, such as the number of function invocations during runtime, in the clustering process. Moreover, the structure of a software system is often multi-layered, while existing clustering algorithms often create flat system decompositions. This paper presents a software clustering algorithm called MULICsoft that incorporates both static and dynamic information in the clustering process. MULICsoft produces layered clusters with the core elements of each cluster assigned to the top layer. We present experimental results of applying MULICsoft to a large open-source system. Comparison with existing software clustering algorithms indicates that MULICsoft is able to produce decompositions that are close to those created by system experts.
Categorical data visualization and clustering using subjective factors
- Data & Knowledge Engineering, 2005
Cited by 3 (0 self)
A common issue in cluster analysis is that there is no single correct answer to the number of clusters, since cluster analysis involves subjective human judgement. Interactive visualization is one method by which users can decide on proper clustering parameters. In this paper, a new clustering approach called CDCS (Categorical Data Clustering with Subjective factors) is introduced, in which a visualization tool for clustered categorical data is developed such that the result of adjusting parameters is instantly reflected. The experiment shows that CDCS generates high-quality clusters compared to other typical algorithms.

