Results 1 - 10 of 115
A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining
- In Research Issues on Data Mining and Knowledge Discovery
, 1997
Cited by 70 (2 self)
Partitioning a large set of objects into homogeneous clusters is a fundamental operation in data mining. The k-means algorithm is best suited for implementing this operation because of its efficiency in clustering large data sets. However, working only on numeric values limits its use in data mining because data sets in data mining often contain categorical values. In this paper we present an algorithm, called k-modes, to extend the k-means paradigm to categorical domains. We introduce new dissimilarity measures to deal with categorical objects, replace means of clusters with modes, and use a frequency-based method to update modes in the clustering process to minimise the clustering cost function. Tested with the well-known soybean disease data set, the algorithm has demonstrated a very good classification performance. Experiments on a very large health insurance data set consisting of half a million records and 34 categorical attributes show that the algorithm is scalable in terms of ...
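The k-modes idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes simple matching (the number of differing attributes) as the dissimilarity measure, random records as initial modes, and a batch update of modes after each full assignment pass. The function names are hypothetical.

```python
from collections import Counter
import random


def matching_dissimilarity(a, b):
    """Count the attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(records):
    """Return the most frequent category of each attribute."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))


def k_modes(records, k, max_iter=20, seed=0):
    """Batch k-modes sketch: assign records to the nearest mode, then update modes."""
    rng = random.Random(seed)
    modes = rng.sample(records, k)  # assumption: initial modes are random records
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: matching_dissimilarity(r, modes[j]))
            for r in records
        ]
        if new_labels == labels:  # assignments are stable, so stop
            break
        labels = new_labels
        for j in range(k):
            members = [r for r, lab in zip(records, labels) if lab == j]
            if members:  # keep the old mode if a cluster is empty
                modes[j] = column_modes(members)
    return labels, modes


records = [("a", "x"), ("a", "x"), ("a", "y"), ("b", "z"), ("b", "z"), ("c", "z")]
labels, modes = k_modes(records, k=2)
```

Replacing the mean with a per-attribute mode is what lets the k-means loop run on categorical data; the frequency count in `column_modes` plays the role of the averaging step.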
A Framework for Robust Subspace Learning
- International Journal of Computer Vision
, 2003
Cited by 61 (5 self)
Many computer vision, signal processing and statistical problems can be posed as problems of learning low dimensional linear or multi-linear models. These models have been widely used for the representation of shape, appearance, motion, etc., in computer vision applications.
Using Correspondence Analysis to Combine Classifiers
- Machine Learning
, 1998
Cited by 44 (0 self)
Several effective methods have been developed recently for improving predictive performance by generating and combining multiple learned models. The general approach is to create a set of learned models either by applying an algorithm repeatedly to different versions of the training data, or by applying different learning algorithms to the same data. The predictions of the models are then combined according to a voting scheme. This paper focuses on the task of combining the predictions of a set of learned models. The method described uses the strategies of stacking and Correspondence Analysis to model the relationship between the learning examples and their classification by a collection of learned models. A nearest neighbor method is then applied within the resulting representation to classify previously unseen examples. The new algorithm does not perform worse than, and frequently performs significantly better than other combining techniques on a suite of data sets. Keywords: Clas...
Recognizing Subjectivity: A Case Study of Manual Tagging
- Natural Language Engineering
, 1999
Cited by 34 (7 self)
In this paper, we describe a case study of a sentence-level categorization in which tagging instructions are developed and used by four judges to classify clauses from the Wall Street Journal as either subjective or objective. Agreement among the four judges is analyzed, and, based on that analysis, each clause is given a final classification. To provide empirical support for the classifications, correlations are assessed in the data between the subjective category and a basic semantic class posited by Quirk et al. (1985).
Pursuing failure: The distribution of program failures in a profile space
- In Proceedings of the 8th European Software Engineering Conference held jointly with 9th ACM SIGSOFT International Symposium on Foundations of Software Engineering
, 2001
Cited by 33 (4 self)
Observation-based testing calls for analyzing profiles of executions induced by potential test cases, in order to select a subset of executions to be checked for conformance to requirements. A family of techniques for selecting such a subset is evaluated experimentally. These techniques employ automatic cluster analysis to partition executions, and they use various sampling techniques to select executions from clusters. The experimental results support the hypothesis that with appropriate profiling, failures often have unusual profiles that are revealed by cluster analysis. The results also suggest that failures often form small clusters or chains in sparsely-populated areas of the profile space. A form of adaptive sampling called failure-pursuit sampling is proposed for revealing failures in such regions, and this sampling method is evaluated experimentally. The results suggest that failure-pursuit sampling is effective.
Euclidean embedding of co-occurrence data
- Advances in Neural Information Processing Systems 17
, 2005
Cited by 26 (2 self)
Embedding algorithms search for low dimensional structure in complex data, but most algorithms only handle objects of a single type for which pairwise distances are specified. This paper describes a method for embedding objects of different types, such as images and text, into a single common Euclidean space based on their co-occurrence statistics. The joint distributions are modeled as exponentials of Euclidean distances in the low-dimensional embedding space, which links the problem to convex optimization over positive semidefinite matrices. The local structure of our embedding corresponds to the statistical correlations via random walks in the Euclidean space. We quantify the performance of our method on two text datasets, and show that it consistently and significantly outperforms standard methods of statistical correspondence modeling, such as multidimensional scaling and correspondence analysis.
1 Introduction
Embeddings of objects in a low-dimensional space are an important tool in unsupervised learning and in preprocessing data for supervised learning algorithms. They are especially valuable for exploratory data analysis and visualization by providing easily interpretable representations of the relationships among objects. Most current embedding techniques build low dimensional mappings that preserve certain relationships among objects and differ in the relationships they choose to preserve, which range from pairwise distances in multidimensional scaling (MDS) [4] to neighborhood structure in locally linear embedding [12]. All these methods operate on objects of a single type endowed with a measure of similarity or dissimilarity. However, real-world data often involve objects of several very different types without a natural measure of similarity. For example, typical web pages or scientific papers contain
Multivariate visualization in observation-based testing
- In Proceedings of the 22nd International Conference on Software Engineering (ICSE’00)
, 2000
Cited by 25 (7 self)
We explore the use of multivariate visualization techniques to support a new approach to test data selection, called observation-based testing. Applications of multivariate visualization are described, including: evaluating and improving synthetic tests; filtering regression test suites; filtering captured operational executions; comparing test suites; and assessing bug reports. These applications are illustrated by the use of correspondence analysis to analyze test inputs for the GNU GCC compiler. Keywords: Software testing, observation-based testing, multivariate visualization, multivariate data analysis, data visualization, correspondence analysis.
Kernel methods for measuring independence
- Journal of Machine Learning Research
, 2005
Cited by 25 (13 self)
We introduce two new functionals, the constrained covariance and the kernel mutual information, to measure the degree of independence of random variables. These quantities are both based on the covariance between functions of the random variables in reproducing kernel Hilbert spaces (RKHSs). We prove that when the RKHSs are universal, both functionals are zero if and only if the random variables are pairwise independent. We also show that the kernel mutual information is an upper bound near independence on the Parzen window estimate of the mutual information. Analogous results apply for two correlation-based dependence functionals introduced earlier: we show the kernel canonical correlation and the kernel generalised variance to be independence measures for universal kernels, and prove the latter to be an upper bound on the mutual information near independence. The performance of the kernel dependence functionals in measuring independence is verified in the context of independent component analysis.
Distributed Weighted-Multidimensional Scaling for Node Localization in Sensor Networks
- ACM Transactions on Sensor Networks
, 2005
Cited by 24 (0 self)
Accurate, distributed localization algorithms are needed for a wide variety of wireless sensor network applications. This paper introduces a scalable, distributed weighted-multidimensional scaling (dwMDS) algorithm that adaptively emphasizes the most accurate range measurements and naturally accounts for communication constraints within the sensor network. Each node adaptively chooses a neighborhood of sensors, updates its position estimate by minimizing a local cost function and then passes this update to neighboring sensors. Derived bounds on communication requirements provide insight on the energy efficiency of the proposed distributed method versus a centralized approach. For received signal-strength (RSS) based range measurements, we demonstrate via simulation that location estimates are nearly unbiased with variance close to the Cramér-Rao lower bound. Further, RSS and time-of-arrival (TOA) channel measurements are used to demonstrate performance as good as the centralized maximum-likelihood estimator (MLE) in a real-world sensor network.
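The per-node step described in this abstract (a node updating its own position by minimizing a local cost over its neighborhood) can be sketched as follows. This is an illustration under stated assumptions, not the dwMDS update itself: the local cost is assumed to be a weighted sum of squared differences between measured ranges and current inter-node distances, and plain gradient descent is swapped in for whatever minimization the paper uses. The names `local_cost` and `local_update`, the step size, and the iteration count are hypothetical.

```python
import numpy as np


def local_cost(x_i, neighbors, ranges, weights):
    """Weighted squared mismatch between measured ranges and current distances."""
    dists = np.linalg.norm(neighbors - x_i, axis=1)
    return float(np.sum(weights * (ranges - dists) ** 2))


def local_update(x_i, neighbors, ranges, weights, step=0.05, iters=200):
    """Refine one node's position by gradient descent on its local cost."""
    x = np.asarray(x_i, dtype=float).copy()
    for _ in range(iters):
        diff = x - neighbors                      # shape (n_neighbors, dim)
        dists = np.maximum(np.linalg.norm(diff, axis=1), 1e-12)
        # Gradient of sum_j w_j (r_j - d_j)^2 with respect to x.
        coeff = -2.0 * weights * (ranges - dists) / dists
        grad = np.sum(coeff[:, None] * diff, axis=0)
        x -= step * grad
    return x


# Toy usage: three neighbors with noise-free ranges to a node at (1, 1).
neighbors = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])
true_pos = np.array([1.0, 1.0])
ranges = np.linalg.norm(neighbors - true_pos, axis=1)
weights = np.ones(3)  # assumption: equal confidence in every range measurement
start = np.array([1.3, 0.7])
estimate = local_update(start, neighbors, ranges, weights)
```

In a distributed setting each node would run such an update using only its neighbors' latest position estimates and then broadcast its new position; the weights are where less accurate range measurements would be de-emphasized.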

