Results 1 - 10
of
12
Multi-way clustering on relation graphs
- In Proc. of the 7th SIAM Intl. Conf. on Data Mining
, 2006
"... A number of real-world domains such as social networks and e-commerce involve heterogeneous data that describes relations between multiple classes of entities. Understanding the natural structure of this type of heterogeneous relational data is essential both for exploratory analysis and for perform ..."
Abstract
-
Cited by 15 (3 self)
- Add to MetaCart
A number of real-world domains such as social networks and e-commerce involve heterogeneous data that describes relations between multiple classes of entities. Understanding the natural structure of this type of heterogeneous relational data is essential both for exploratory analysis and for performing various predictive modeling tasks. In this paper, we propose a principled multi-way clustering framework for relational data, wherein different types of entities are simultaneously clustered based not only on their intrinsic attribute values, but also on the multiple relations between the entities. To achieve this, we introduce a relation graph model that describes all the known relations between the different entity classes, in which each relation between a given set of entity classes is represented in the form of multi-modal tensor over an appropriate domain. Our multi-way clustering formulation is driven by the objective of capturing the maximal “information ” in the original relation graph, i.e., accurately approximating the set of tensors corresponding to the various relations. This formulation is applicable to all Bregman divergences (a broad family of loss functions that includes squared Euclidean distance, KL-divergence), and also permits analysis of mixed data types using convex combinations of appropriate Bregman loss functions. Furthermore, we present a large family of structurally different multi-way clustering schemes that preserve various linear summary statistics of the original data. We accomplish the above generalizations by extending a recently proposed key theoretical result, namely the minimum Bregman information principle [1], to the relation graph setting. We also describe an efficient multi-way clustering algorithm based on alternate minimization that generalizes a number of other recently proposed clustering methods. Empirical results on datasets obtained from real-world domains (e.g., movie recommendations, newsgroup articles) demonstrate the generality and efficacy of our framework. 1
R.: Towards internet-scale multiview stereo
- In: Proceedings of IEEE CVPR
, 2010
"... This paper introduces an approach for enabling existing multi-view stereo methods to operate on extremely large unstructured photo collections. The main idea is to decompose the collection into a set of overlapping sets of photos that can be processed in parallel, and to merge the resulting reconstr ..."
Abstract
-
Cited by 11 (1 self)
- Add to MetaCart
This paper introduces an approach for enabling existing multi-view stereo methods to operate on extremely large unstructured photo collections. The main idea is to decompose the collection into a set of overlapping sets of photos that can be processed in parallel, and to merge the resulting reconstructions. This overlapping clustering problem is formulated as a constrained optimization and solved iteratively. The merging algorithm, designed to be parallel and out-of-core, incorporates robust filtering steps to eliminate low-quality reconstructions and enforce global visibility constraints. The approach has been tested on several large datasets downloaded from Flickr.com, including one with over ten thousand images, yielding a 3D reconstruction with nearly thirty million points. 1.
Learnable Similarity Functions and Their Applications to Clustering and Record Linkage
, 2004
"... rship (Xing et al. 2003), and relative comparisons (Schultz & Joachims 2004). These approaches have shown improvements over traditional similarity functions for different data types such as vectors in Euclidean space, strings, and database records composed of multiple text fields. While these initia ..."
Abstract
-
Cited by 6 (0 self)
- Add to MetaCart
rship (Xing et al. 2003), and relative comparisons (Schultz & Joachims 2004). These approaches have shown improvements over traditional similarity functions for different data types such as vectors in Euclidean space, strings, and database records composed of multiple text fields. While these initial results are encouraging, there still remains a large number of similarity functions that are currently unable to adapt to a particular domain. In our research, we attempt to bridge this gap by developing both new learnable similarity functions and methods for their application to particular problems in machine learning and data mining. In preliminary work, we proposed two learnable similarity functions for strings that adapt distance computations given training pairs of equivalent and non-equivalent strings (Bilenko & Mooney 2003a). The first function is based on a probabilistic model of edit distance with affine gaps (Gus- Copyright c # 2004, American Association for Artificial Intelli
Cluster ranking with an application to mining mailbox networks
- In ICDM ’06: Proceedings of the Sixth International Conference on Data Mining
, 2006
"... We initiate the study of a new clustering framework, called cluster ranking. Rather than simply partitioning a network into clusters, a cluster ranking algorithm also orders the clusters by their strength. To this end, we introduce a novel strength measure for clusters—the integrated cohesion—which ..."
Abstract
-
Cited by 4 (0 self)
- Add to MetaCart
We initiate the study of a new clustering framework, called cluster ranking. Rather than simply partitioning a network into clusters, a cluster ranking algorithm also orders the clusters by their strength. To this end, we introduce a novel strength measure for clusters—the integrated cohesion—which is applicable to arbitrary weighted networks. We then present C-Rank: a new cluster ranking algorithm. Given a network with arbitrary pairwise similarity weights, C-Rank creates a list of overlapping clusters and ranks them by their integrated cohesion. We provide extensive theoretical and empirical analysis of C-Rank and show that it is likely to have high precision and recall. A main component of C-Rank is a heuristic algorithm for finding sparse vertex separators. At the core of this algorithm is a new connection between the well known measure of vertex betweenness and multicommodity flow. Our experiments focus on mining mailbox networks. A mailbox network is an egocentric social network, consisting of contacts with whom an individual exchanges email. Ties among contacts are represented by the frequency of their co–occurrence on message headers. C-Rank is well suited to mine such networks, since they are abundant with overlapping communities of highly variable strengths. We demonstrate the effectiveness of C-Rank on the Enron data set, consisting of 130 mailbox networks. 1
Pragmatic text mining: minimizing human effort to quantify many issues in call logs
- In Proc. of the 12 th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD
, 2006
"... text mining, log processing, supervised machine learning, quantification, text classification, applications, pattern recognition We discuss our experiences in analyzing customer-support issues from the unstructured free-text fields of technical-support call logs. The identification of frequent issue ..."
Abstract
-
Cited by 2 (2 self)
- Add to MetaCart
text mining, log processing, supervised machine learning, quantification, text classification, applications, pattern recognition We discuss our experiences in analyzing customer-support issues from the unstructured free-text fields of technical-support call logs. The identification of frequent issues and their accurate quantification is essential in order to track aggregate costs broken down by issue type, to appropriately target engineering resources, and to provide the best diagnosis, support and documentation for most common issues. We present a new set of techniques for doing this efficiently on an industrial scale, without requiring manual coding of calls in the call center. Our approach involves (1) a new text clustering method to identify common and emerging issues; (2) a method to rapidly train large numbers of categorizers in a practical, interactive manner; and (3) a method to accurately quantify categories, even in the face of inaccurate classifications and training sets that necessarily cannot match the class distribution of each new month’s data. We present our methodology and a tool we developed and deployed that uses these methods for tracking ongoing support issues and discovering emerging issues at HP.
A segment-based approach to clustering multitopic documents
- in Text Mining Workshop, SIAM Datamining Conference
"... Document clustering has been recognized as a central problem in text data management, and it becomes particularly challenging when documents have multiple topics. In this paper we address the problem of multi-topic document clustering by leveraging the natural composition of documents in text segmen ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
Document clustering has been recognized as a central problem in text data management, and it becomes particularly challenging when documents have multiple topics. In this paper we address the problem of multi-topic document clustering by leveraging the natural composition of documents in text segments, which bear one or more topics on their own. We propose a segment-based document clustering framework, which is designed to induce a classification of documents starting from the identification of cohesive groups of segment-based portions of the original documents. We empirically give evidence of the significance of our approach on different, large collections of multi-topic documents. 1
An extended version of the k-means method for overlapping clustering
"... This paper deals with overlapping clustering, a trade off between crisp and fuzzy clustering. It has been motivated by recent applications in various domains such as information retrieval or biology. We show that the problem of finding a suitable coverage of data by overlapping clusters is not a tri ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
This paper deals with overlapping clustering, a trade off between crisp and fuzzy clustering. It has been motivated by recent applications in various domains such as information retrieval or biology. We show that the problem of finding a suitable coverage of data by overlapping clusters is not a trivial task. We propose a new objective criterion and the associated algorithm OKM that generalizes the k-means algorithm. Experiments show that overlapping clustering is a good alternative and indicate that OKM outperforms other existing methods. 1.
What is a Cluster? Perspectives from Game Theory
"... “Since no paradigm ever solves all the problems it defines and since no two paradigms leave all the same problems unsolved, paradigm debates always involve the question: Which problems is it more significant to have solved?” Thomas S. Kuhn, The Structure of Scientific Revolutions (1962) There is no ..."
Abstract
- Add to MetaCart
“Since no paradigm ever solves all the problems it defines and since no two paradigms leave all the same problems unsolved, paradigm debates always involve the question: Which problems is it more significant to have solved?” Thomas S. Kuhn, The Structure of Scientific Revolutions (1962) There is no shortage of clustering algorithms, and recently a new wave of excitement has spread across the machine learning community mainly because of the important development of spectral methods. At the same time, there is also growing interest around fundamental questions pertaining to the very nature of the clustering problem (see, e.g., [17, 1, 28]). Yet, despite the tremendous progress in the field, the clustering problem remains elusive and a satisfactory answer even to the most basic questions is still to come. Upon scrutinizing the relevant literature on the subject, it becomes apparent that the vast majority of the existing approaches deal with a very specific version of the problem, which asks for partitioning the input data into coherent classes. In fact, almost invariably, the problem of clustering is defined as a partitioning problem, and even the classical distinction between hierarchical and partitional algorithms
Bregman Bubble Clustering: A Robust Framework for Mining Dense Clusters
"... In classical clustering, each data point is assigned to at least one cluster. However, in many applications only a small subset of the available data is relevant for the problem and the rest needs to be ignored in order to obtain good clusters. Certain nonparametric density-based clustering methods ..."
Abstract
- Add to MetaCart
In classical clustering, each data point is assigned to at least one cluster. However, in many applications only a small subset of the available data is relevant for the problem and the rest needs to be ignored in order to obtain good clusters. Certain nonparametric density-based clustering methods find the most relevant data as multiple dense regions, but such methods are generally limited to low-dimensional data and do not scale well to large, high-dimensional datasets. Also, they use a specific notion of “distance”, typically Euclidean or Mahalanobis distance, which further limits their applicability. On the other hand, the recent One Class Information Bottleneck (OC-IB) method is fast and works on a large class of distortion measures known as Bregman Divergences, but can only find a single dense region. This article presents a broad framework for finding k dense clusters while ignoring the rest of the data. It includes a seeding algorithm that can automatically determine a suitable value for k. When k is forced to 1, our method gives rise to an improved version of OC-IB with optimality guarantees. We provide a generative model that yields the proposed iterative algorithm for finding k dense regions as a special case. Our analysis reveals an interesting and novel connection between the problem of finding dense regions and exponential mixture models; a hard model corresponding to k exponential mixtures with a uniform background results
Dominating set · Overlapping clustering · Set cover · Star clustering
, 2010
"... SimClus: an effective algorithm for clustering with a lower bound on similarity ..."
Abstract
- Add to MetaCart
SimClus: an effective algorithm for clustering with a lower bound on similarity

