The WEKA Data Mining Software: An Update
Cited by 175 (6 self)
More than twelve years have elapsed since the first public release of WEKA. In that time, the software has been rewritten entirely from scratch, evolved substantially and now accompanies a text on data mining [35]. These days, WEKA enjoys widespread acceptance in both academia and business, has an active community, and has been downloaded more than 1.4 million times since being placed on SourceForge in April 2000. This paper provides an introduction to the WEKA workbench, reviews the history of the project, and, in light of the recent 3.6 stable release, briefly discusses what has been added since the last stable version (Weka 3.4) released in 2003.
TCSOM: clustering transactions using self-organizing map
Neural Processing Letters, 2005
Cited by 3 (2 self)
Self-Organizing Map (SOM) networks have been successfully applied as a clustering method to numeric datasets. However, it is not feasible to apply SOM directly to clustering transactional data. This paper proposes the TCSOM (Transactions Clustering using SOM) algorithm for clustering binary transactional data. In the TCSOM algorithm, the normalized dot-product norm is used to measure the distance between the input vector and the output neuron, and a modified weight adaptation function is employed to adjust the weights of the winner and its neighbors. More importantly, TCSOM is a one-pass algorithm, which makes it well suited to data mining applications. Experimental results on real datasets show that TCSOM outperforms state-of-the-art transactional data clustering algorithms with respect to clustering accuracy.
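The winner-selection step described in this abstract can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not define the normalized dot-product norm, so the sketch assumes it behaves like cosine similarity, and the names `normalized_dot` and `find_winner` are hypothetical. The paper's modified weight adaptation function is not specified in the abstract and is left out.

```python
import math

def normalized_dot(x, w):
    """Dot product divided by the product of vector lengths (assumed form)."""
    dot = sum(a * b for a, b in zip(x, w))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in w))
    return dot / norm if norm else 0.0

def find_winner(transaction, neurons):
    """Return the index of the neuron most similar to the binary transaction."""
    scores = [normalized_dot(transaction, w) for w in neurons]
    return max(range(len(neurons)), key=scores.__getitem__)

# A binary transaction over four items and two hypothetical neuron weight vectors.
transaction = [1, 0, 1, 1]
neurons = [[0.9, 0.1, 0.8, 0.7], [0.1, 0.9, 0.2, 0.1]]
winner = find_winner(transaction, neurons)
```

Here the first neuron wins because its weights are concentrated on the items present in the transaction.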
A New Feature Selection Scheme Using Data Distribution Factor for Transactional Data
Cited by 2 (0 self)
A new efficient unsupervised feature selection method is proposed to handle transactional data. The proposed feature selection method introduces a new Data Distribution Factor (DDF) to select appropriate clusters. This method combines compactness and separation with a newly introduced concept of singleton item. The new feature selection method is computationally inexpensive and is able to deliver very promising results. Four datasets from the UCI machine learning repository are used in this study. The obtained results show that the proposed method is very efficient and able to deliver very reliable results.
A Spectroscopy of Texts for Effective Clustering
In: Proc. 8th PKDD, 2004
Cited by 1 (1 self)
For many clustering algorithms, such as k-means, EM, and CLOPE, there is usually a requirement to set some parameters. Often, these parameters directly or indirectly control the number of clusters to return. In the presence of different data characteristics and analysis contexts, it is often difficult for the user to estimate the number of clusters in the data set. This is especially true in text collections such as Web documents, images or biological data. The fundamental question this paper addresses is: "How can we effectively estimate the natural number of clusters in a given text collection?". We propose to use spectral analysis, which analyzes the eigenvalues (not eigenvectors) of the collection, as the solution to the above. We first present the relationship between a text collection and its underlying spectra. We then show how the answer to this question enhances the clustering process. Finally, we conclude with empirical results and related work.
SCLOPE: An Algorithm for Clustering Data Streams of Categorical Attributes
2004
Cited by 1 (0 self)
Clustering is a difficult problem, especially when we consider the task in the context of a data stream of categorical attributes. In this paper, we propose SCLOPE, a novel algorithm based on CLOPE's intuitive observation about cluster histograms. Unlike CLOPE, however, our algorithm is very fast and operates within the constraints of a data stream environment. In particular, we designed SCLOPE according to the recent CluStream framework. Our evaluation of SCLOPE shows very promising results. It consistently outperforms CLOPE in speed and scalability tests on our data sets while maintaining high cluster purity; it also supports cluster analysis that other algorithms in its class do not.
SCALE: A Scalable Framework for Efficiently Clustering Transactional Data
2009
Cited by 1 (0 self)
This paper presents SCALE, a fully automated transactional clustering framework. The SCALE design highlights three unique features. First, we introduce the concept of Weighted Coverage Density as a categorical similarity measure for efficient clustering of transactional datasets. The concept of weighted coverage density is intuitive and it allows the weight of each item in a cluster to be changed dynamically according to the occurrences of items. Second, we develop the weighted coverage density measure-based clustering algorithm, a fast, memory-efficient, and scalable clustering algorithm for analyzing transactional data. Third, we introduce two clustering validation metrics and show that these domain-specific clustering evaluation metrics are critical to capture the transactional semantics in clustering analysis. Our SCALE framework combines the weighted coverage density measure for clustering over a sample dataset with self-configuring methods. These self-configuring methods can automatically tune the two important parameters of our clustering algorithms: (1) the candidates of the best number K of clusters; and (2) the application of two domain-specific cluster validity measures to find the best result from the set of clustering results. We have conducted extensive experimental evaluation using both synthetic and real datasets and our results show that the weighted coverage density approach powered by the SCALE framework can efficiently generate high-quality clustering results in a fully automated manner.
Key words: transactional data clustering, cluster assessment, cluster validation, frequent itemset mining, weighted coverage density
A Framework for Exploring Categorical Data
Cited by 1 (1 self)
In this paper, we present a framework for categorical data analysis which allows such data sets to be explored using a rich set of techniques that are only applicable to continuous data sets. We introduce the concept of separability statistics in the context of exploratory categorical data analysis. We show how these statistics can be used as a way to map categorical data to continuous space given a labeled reference data set. This mapping enables visualization of categorical data using techniques that are applicable to continuous data. We show that in the transformed continuous space, the performance of the standard k-nn based outlier detection technique is comparable to the performance of the k-nn based outlier detection technique using the best of the similarity measures designed for categorical data. The proposed framework can also be used to devise similarity measures best suited for a particular type of data set.
Color Mining of Images Based on Clustering (© 2007 PIPS)
The increasing size of multimedia databases and the ease of accessing them by a large number of users through the Internet raise the problem of efficient and semantically adequate querying of such content. A metadatabase may be used to shorten query resolution time by trying to limit the number of images being thoroughly analyzed to a smaller subset, having a high probability of finding the query image. In this article we propose a simple but fast and effective method of indexing such image metadatabases. The index is created by describing the images according to their color characteristics, with compact feature vectors that represent typical color distributions. We present experimental results of typical search schemes by querying the metadatabase index created using a few different approaches.
Determining the Best K for Clustering Transactional Datasets: A Coverage Density-based Approach
2008
The problem of determining the optimal number of clusters is important but mysterious in cluster analysis. In this paper, we propose a novel method to find a set of candidate optimal numbers of clusters (Ks) in transactional datasets. Concretely, we propose Transactional-cluster-modes Dissimilarity, based on the concept of coverage density, as an intuitive transactional inter-cluster dissimilarity measure. Based on this measure, an agglomerative hierarchical clustering algorithm is developed, and the Merge Dissimilarity Indexes, which are generated in the hierarchical cluster merging process, are used to find the candidate optimal numbers of clusters (Ks) of transactional data. Our experimental results on both synthetic and real data show that the new method often effectively estimates the number of clusters of transactional data.
iTree: Efficiently Discovering High-Coverage Configurations Using Interaction Trees
Software configurability has many benefits, but it also makes programs much harder to test, as in the worst case the program must be tested under every possible configuration. One potential remedy to this problem is combinatorial interaction testing (CIT), in which typically the developer selects a strength t and then computes a covering array containing all t-way configuration option combinations. However, in a prior study we showed that several programs have important high-strength interactions (combinations of a subset of configuration options) that CIT is highly unlikely to generate in practice. In this paper, we propose a new algorithm called interaction tree discovery (iTree) that aims to identify sets of configurations to test that are smaller than those generated by CIT, while also including important high-strength interactions missed by practical applications of CIT. On each iteration of iTree, we first use low-strength CIT to test the program under a set of configurations, and then apply machine learning techniques to discover new interactions that are potentially responsible for any new coverage seen. By repeating this process, iTree builds up a set of configurations likely to contain key high-strength interactions. We evaluated iTree by comparing the coverage it achieves versus covering arrays and randomly generated configuration sets. Our results strongly suggest that iTree can identify high-coverage sets of configurations more effectively than traditional CIT or random sampling.
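To make the notion of a t-way covering array concrete, the following minimal sketch checks which t-way option-value combinations a given set of configurations fails to cover. It illustrates the covering-array property only, not the iTree algorithm itself; the function name `uncovered_t_way` and the three-option example are assumptions made for illustration.

```python
from itertools import combinations, product

def uncovered_t_way(configs, domains, t):
    """Return the t-way option-value combinations that no configuration covers."""
    missing = []
    for options in combinations(range(len(domains)), t):
        # Value tuples actually exercised for this group of t options.
        seen = {tuple(cfg[i] for i in options) for cfg in configs}
        for values in product(*(domains[i] for i in options)):
            if values not in seen:
                missing.append((options, values))
    return missing

# Three boolean options; four configurations instead of the full eight.
domains = [(0, 1), (0, 1), (0, 1)]
pairwise = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]

missing_pairs = uncovered_t_way(pairwise, domains, 2)
missing_triples = uncovered_t_way(pairwise, domains, 3)
```

In this example the four configurations cover every 2-way combination, so `missing_pairs` is empty, while `missing_triples` lists the 3-way combinations they skip. That gap is the kind of high-strength interaction the abstract says low-strength CIT can miss.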

