Results 1–10 of 32
Clustering with Bregman Divergences
 Journal of Machine Learning Research, 2005
Abstract

Cited by 441 (59 self)
A wide variety of distortion functions are used for clustering, e.g., squared Euclidean distance, Mahalanobis distance and relative entropy. In this paper, we propose and analyze parametric hard and soft clustering algorithms based on a large class of distortion functions known as Bregman divergences. The proposed algorithms unify centroid-based parametric clustering approaches, such as classical k-means and information-theoretic clustering, which arise by special choices of the Bregman divergence. The algorithms maintain the simplicity and scalability of the classical k-means algorithm, while generalizing the basic idea to a very large class of clustering loss functions. There are two main contributions in this paper. First, we pose the hard clustering problem in terms of minimizing the loss in Bregman information, a quantity motivated by rate-distortion theory, and present an algorithm to minimize this loss. Secondly, we show an explicit bijection between Bregman divergences and exponential families. The bijection enables the development of an alternative interpretation of an efficient EM scheme for learning models involving mixtures of exponential distributions. This leads to a simple soft clustering algorithm for all Bregman divergences.
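A key structural property behind this generalization is that, for every Bregman divergence, the optimal cluster representative in the update step is the plain arithmetic mean. A minimal 1-D sketch of Bregman hard clustering (not the authors' implementation; the naive initialization and the toy data are our own):

```python
def sq_euclidean(x, mu):
    # Bregman divergence generated by phi(x) = x^2: squared Euclidean
    return (x - mu) ** 2

def bregman_hard_cluster(points, k, divergence, iters=20):
    # Naive init for illustration only: first k points as representatives
    centroids = list(points[:k])
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest representative under the divergence
        assign = [min(range(k), key=lambda j: divergence(x, centroids[j]))
                  for x in points]
        # Update step: the arithmetic mean is optimal for EVERY Bregman
        # divergence, which is what keeps the algorithm k-means-simple
        for j in range(k):
            members = [x for x, a in zip(points, assign) if a == j]
            if members:
                centroids[j] = sum(members) / len(members)
    return centroids, assign

points = [1.0, 1.2, 0.9, 5.0, 5.3, 4.8]
centroids, labels = bregman_hard_cluster(points, 2, sq_euclidean)
```

Swapping `sq_euclidean` for another Bregman divergence (e.g. generalized KL) changes only the assignment step; the mean update stays the same.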
SCHISM: A New Approach to Interesting Subspace Mining
 International Journal of Business Intelligence and Data Mining, Vol. 1, No. 2, pp. 137–160, 2005
Abstract

Cited by 28 (2 self)
High-dimensional data pose challenges to traditional clustering algorithms due to their inherent sparsity, and data tend to cluster in different and possibly overlapping subspaces of the entire feature space. Finding such subspaces is called subspace mining. We present SCHISM, a new algorithm for mining interesting subspaces, using the notions of support and Chernoff-Hoeffding bounds. We use a vertical representation of the dataset, and use a depth-first search with backtracking to find maximal interesting subspaces. We evaluate our algorithm on a number of high-dimensional synthetic and real datasets to demonstrate its effectiveness.
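The vertical representation and support counting that SCHISM builds on can be sketched for discretized data as follows (a simplified illustration with our own function names; the Chernoff-Hoeffding interestingness threshold and the depth-first search are omitted):

```python
def vertical(db):
    # Vertical representation: map each (dimension, discretized value)
    # pair to the set of row ids that contain it
    index = {}
    for rid, row in enumerate(db):
        for dim, val in enumerate(row):
            index.setdefault((dim, val), set()).add(rid)
    return index

def support(index, subspace, n):
    # subspace: list of (dimension, value) constraints; its support is
    # the fraction of rows satisfying all constraints, computed by
    # intersecting the row-id sets (no pass over the raw data needed)
    rows = set.intersection(*(index[c] for c in subspace))
    return len(rows) / n

db = [(0, 1), (0, 1), (0, 2), (1, 1)]   # 4 rows, 2 discretized dims
idx = vertical(db)
```

A subspace is then "interesting" when its support exceeds what the bound predicts for random data; here we only show the counting primitive.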
Metric Space Similarity Joins
Abstract

Cited by 19 (2 self)
Similarity join algorithms find pairs of objects that lie within a certain distance ɛ of each other. Algorithms that are adapted from spatial join techniques are designed primarily for data in a vector space and often employ some form of a multidimensional index. For these algorithms, when the data lies in a metric space, the usual solution is to embed the data in a vector space and then make use of a multidimensional index. Such an approach has a number of drawbacks when the data is high-dimensional, as we must eventually find the most discriminating dimensions, which is not trivial. In addition, although the maximum distance between objects increases with dimension, the ability to discriminate between objects in each dimension does not. These drawbacks are overcome via the introduction of a new method called Quickjoin, which does not require a multidimensional index and instead adapts techniques used in distance-based indexing for use in a method that is conceptually similar to the Quicksort algorithm. A formal analysis is provided of the Quickjoin method. Experiments show that Quickjoin significantly outperforms two existing techniques.
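The Quicksort analogy can be illustrated in one dimension: partition the objects by distance to a pivot, recurse on each side, and brute-force only a narrow window of objects within ɛ of the partition boundary, since only those can form cross-partition pairs. This is a toy sketch under our own simplifications, not the paper's full algorithm:

```python
def nested_loop_join(A, B, eps, out):
    # Brute-force join: collect every cross pair within eps
    for a in A:
        for b in B:
            if a is not b and abs(a - b) <= eps:
                out.add((min(a, b), max(a, b)))

def quickjoin(objs, eps, out, small=4):
    if len(objs) <= small:                  # base case: brute force
        nested_loop_join(objs, objs, eps, out)
        return
    pivot = objs[0]
    r = abs(objs[len(objs) // 2] - pivot)   # partitioning radius
    inside = [o for o in objs if abs(o - pivot) < r]
    outside = [o for o in objs if abs(o - pivot) >= r]
    if not inside or not outside:           # degenerate split: fall back
        nested_loop_join(objs, objs, eps, out)
        return
    quickjoin(inside, eps, out)
    quickjoin(outside, eps, out)
    # Cross pairs can only straddle the boundary within eps of it,
    # so join just the two boundary windows
    win_in = [o for o in inside if abs(o - pivot) >= r - eps]
    win_out = [o for o in outside if abs(o - pivot) <= r + eps]
    nested_loop_join(win_in, win_out, eps, out)
```

In a general metric space the same recursion works with any distance function satisfying the triangle inequality, which is what makes the boundary-window pruning sound.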
Formulating contextdependent similarity functions
 In ACM International Conference on Multimedia (MM), 2005
Abstract

Cited by 17 (0 self)
Tasks of information retrieval depend on a good distance function for measuring similarity between data instances. The most effective distance function must be formulated in a context-dependent (also application-, data-, and user-dependent) way. In this paper, we present a novel method which learns a distance function by capturing the nonlinear relationships among contextual information provided by the application, data, or user. We show that through a process called the “kernel trick,” such nonlinear relationships can be learned efficiently in a projected space. In addition to using the kernel trick, we propose two algorithms to further enhance the efficiency and effectiveness of function learning. For efficiency, we propose an SMO-like solver to achieve O(N^2) learning performance. For effectiveness, we propose using unsupervised learning in an innovative way to address the challenge of a lack of labeled data (contextual information). Theoretically, we substantiate that our method is both sound and optimal. Empirically, we demonstrate that our method is effective and useful.
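The identity that makes the kernel trick work for distances is worth spelling out: squared distances in the (possibly infinite-dimensional) projected space reduce to three kernel evaluations, d(x, y)^2 = K(x, x) - 2 K(x, y) + K(y, y), without ever forming the projection explicitly. A minimal sketch (the learned, context-dependent kernel of the paper is replaced here by a fixed RBF kernel for illustration):

```python
import math

def rbf_kernel(x, y, gamma=1.0):
    # Gaussian (RBF) kernel between two equal-length tuples of floats
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def kernel_distance(x, y, kernel):
    # Distance in the implicit feature space, via kernel values only
    d2 = kernel(x, x) - 2 * kernel(x, y) + kernel(y, y)
    return math.sqrt(max(d2, 0.0))  # clamp tiny negative rounding error
```

Note that with an RBF kernel K(x, x) = 1, so the distance is bounded by sqrt(2) no matter how far apart the inputs are, which is one reason such distances behave differently from raw Euclidean distance.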
Formulating Distance Functions via the Kernel Trick
 In Conf. on Knowledge Discovery and Data Mining (KDD), 2005
Abstract

Cited by 14 (2 self)
Tasks of data mining and information retrieval depend on a good distance function for measuring similarity between data instances. The most effective distance function must be formulated in a context-dependent (also application-, data-, and user-dependent) way. In this paper, we propose to learn a distance function by capturing the nonlinear relationships among contextual information provided by the application, data, or user. We show that through a process called the “kernel trick,” such nonlinear relationships can be learned efficiently in a projected space. Theoretically, we substantiate that our method is both sound and optimal. Empirically, using several datasets and applications, we demonstrate that our method is effective and useful.
Advanced visualization of self-organizing maps with vector fields
 2006
Abstract

Cited by 12 (3 self)
Self-Organizing Maps have been applied in various industrial applications and have proven to be a valuable data mining tool. In order to fully benefit from their potential, advanced visualization techniques assist the user in analyzing and interpreting the maps. We propose two new methods for depicting the SOM based on vector fields, namely the Gradient Field and Borderline visualization techniques, to show the clustering structure at various levels of detail. We explain how these methods can be used on aggregated parts of the SOM, showing which factors contribute to the clustering structure, and show how to use them for finding correlations and dependencies in the underlying data. We provide examples on several artificial and real-world data sets to point out the strengths of our technique, specifically as a means of combining different types of visualizations, offering effective multidimensional information visualization of SOMs.
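As a much-simplified illustration of the vector-field idea (the actual Gradient Field technique smooths similarities over a larger neighborhood with a kernel; the construction below is entirely our own toy version): give each map unit an arrow pointing toward the adjacent unit whose prototype is most similar, so arrows converge inside clusters and diverge at cluster borders.

```python
def gradient_arrows(grid):
    # grid[i][j] is the prototype vector (tuple of floats) of unit (i, j);
    # returns, per unit, the offset to its most similar grid neighbor
    rows, cols = len(grid), len(grid[0])

    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    arrows = {}
    for i in range(rows):
        for j in range(cols):
            neigh = [(di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)
                     if (di, dj) != (0, 0)
                     and 0 <= i + di < rows and 0 <= j + dj < cols]
            arrows[(i, j)] = min(
                neigh,
                key=lambda d: dist(grid[i][j], grid[i + d[0]][j + d[1]]))
    return arrows
```

On a 1x3 map with prototypes (0.0), (0.1), (5.0), the first two units point at each other while the third points back toward them, marking the border between the two "clusters".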
Decision support systems for police: Lessons from the application of data mining techniques to 'soft' forensic evidence
 Applications and Innovations in Intelligent Systems XII: Proceedings of AI-2004, the Twenty-fourth SGAI International Conference on Knowledge Based Systems and Applications of Artificial Intelligence, 2004
Abstract

Cited by 9 (2 self)
Computer science technology that can support police activities is wide-ranging, from the well-known geographical information systems displays ('pins in maps'), clustering, and link analysis algorithms, to the more complex use of data mining technology for profiling single and series of crimes or offenders, and matching and predicting crimes. This paper presents a discussion of data mining and decision support technologies for police, considering the range of computer science technologies that are available to assist police activities. The discussion is very practical, with examples taken from the authors' own work with three United Kingdom police forces. The lessons learned are presented, along with their relevance to future work. We describe significant aspects of the knowledge discovery from databases process, starting with an examination of the data that police collect and the reasons for storing such data, and progressing to the development of crime matching and predictive knowledge which are operationalised in decision support software. Discussion and experimentation include decision support techniques based around spatial statistics, and a wide range of data mining technologies, including case-based reasoning, logic programming and ontologies, survival analysis, Bayesian networks, and the comparison of models that use either behavioural features, spatio-temporal features, or a combination of both. The paper concludes with a discussion of the operational lessons relevant to future work.
Multi-source contingency clustering
 Master’s thesis, EECS, MIT, 2004
Abstract

Cited by 3 (0 self)
This thesis examines the problem of clustering multiple, related sets of data simultaneously. Given datasets which are in some way connected (e.g. temporally) but which do not necessarily share label compatibility, we exploit co-occurrence information in the form of normalized multidimensional contingency tables in order to recover robust mappings between data points and clusters for each of the individual data sources. We outline a unifying formalism by which one might approach cross-channel clustering problems, and begin by defining an information-theoretic objective function that is small when the clustering can be expected to be good. We then propose and explore several multi-source algorithms for optimizing this and other relevant objective functions, borrowing ideas from both continuous and discrete optimization methods. More specifically, we adapt gradient-based techniques, simulated annealing, and spectral clustering to the multi-source clustering problem. Finally, we apply the proposed algorithms to a multi-source human identification ...
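The kind of information-theoretic score computed from a normalized contingency table can be illustrated with mutual information: given co-occurrence counts between the cluster labels of two data sources, aligned labelings score high and independent labelings score zero. (This is a generic illustration of the quantity, not the thesis's exact objective function.)

```python
import math

def mutual_information(counts):
    # counts: 2-D contingency table of co-occurrence counts between the
    # cluster labels of two sources; returns their mutual information (nats)
    total = sum(sum(row) for row in counts)
    p = [[c / total for c in row] for row in counts]       # joint p(i, j)
    prow = [sum(row) for row in p]                         # marginal p(i)
    pcol = [sum(col) for col in zip(*p)]                   # marginal p(j)
    mi = 0.0
    for i, row in enumerate(p):
        for j, pij in enumerate(row):
            if pij > 0:
                mi += pij * math.log(pij / (prow[i] * pcol[j]))
    return mi
```

For the perfectly aligned table [[5, 0], [0, 5]] this yields log 2, the maximum for two balanced binary labelings; for the independent table [[1, 1], [1, 1]] it yields 0.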
Random Forest Based Imbalanced Data Cleaning and Classification
Abstract

Cited by 3 (0 self)
The given task of the PAKDD 2007 data mining competition is a typical problem of learning from an extremely imbalanced data set. In this paper, we propose a combination of random forest based techniques and sampling methods to identify the potential buyers. Our method is mainly composed of two phases, data cleaning and classification, both based on random forests. First, the data set is cleaned by the elimination of dangerous negative instances. The data cleaning process is supervised by a negative-biased random forest, in which negative instances make up a major proportion of the training data for each tree in the forest. Second, we train a variant of random forest in which each tree is biased towards the positive class to classify the data set, where a majority vote is made for prediction. We compared our method with many other existing methods and showed a favorable performance improvement in terms of the area under the ROC curve. Finally, we discuss what business insights can be interpreted from the scoring model results.
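The class-biased training the abstract describes comes down to drawing each tree's bootstrap sample with a deliberately skewed class ratio. A minimal sketch of that sampling step (the 80/20 ratio and the function name are our own illustrative choices; the paper's exact proportions are not given here):

```python
import random

def biased_bootstrap(pos, neg, neg_ratio=0.8, size=None, rng=None):
    # Draw a bootstrap sample (with replacement) in which the negative
    # class makes up neg_ratio of the sample, regardless of the raw
    # class balance; one such sample would feed one tree of the forest.
    rng = rng or random.Random(0)   # fixed seed for reproducibility
    size = size or (len(pos) + len(neg))
    n_neg = int(size * neg_ratio)
    sample = ([rng.choice(neg) for _ in range(n_neg)] +
              [rng.choice(pos) for _ in range(size - n_neg)])
    rng.shuffle(sample)
    return sample
```

Flipping `neg_ratio` below 0.5 gives the opposite bias used in the classification phase, where each tree over-represents the positive class instead.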