Results 1-10 of 10
A Discriminative Framework for Clustering via Similarity Functions
Cited by 38 (10 self)
Problems of clustering data from pairwise similarity information are ubiquitous in Computer Science. Theoretical treatments typically view the similarity information as ground truth and then design algorithms to (approximately) optimize various graph-based objective functions. However, in most applications, this similarity information is merely based on some heuristic; the ground truth is really the unknown correct clustering of the data points and the real goal is to achieve low error on the data. In this work, we develop a theoretical approach to clustering from this perspective. In particular, motivated by recent work in learning theory that asks “what natural properties of a similarity (or kernel) function are sufficient to be able to learn well?” we ask “what natural properties of a similarity function are sufficient to be able to cluster well?” To study this question we develop a theoretical framework that
Clustering with Interactive Feedback
Cited by 11 (1 self)
Abstract. In this paper, we initiate a theoretical study of the problem of clustering data under interactive feedback. We introduce a query-based model in which users can provide feedback to a clustering algorithm in a natural way via split and merge requests. We then analyze the “clusterability” of different concept classes in this framework (the ability to cluster correctly with a bounded number of requests under only the assumption that each cluster can be described by a concept in the class) and provide efficient algorithms as well as information-theoretic upper and lower bounds.
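The split and merge feedback described in this abstract can be illustrated with a small simulated user. This is only a sketch under stated assumptions, not the paper's algorithm: it assumes a split request flags a proposed cluster that mixes points from more than one target cluster, and a merge request flags two proposed clusters that both lie inside the same target cluster. The abstract does not spell out these semantics, and the name `user_feedback` and its return format are hypothetical.

```python
def user_feedback(proposed, target):
    """Return one split or merge request for `proposed`, or None if it matches `target`.

    proposed: list of non-empty sets of point ids (the algorithm's current clustering)
    target:   dict mapping point id -> target cluster label (known only to the user)
    """
    # Assumed split rule: a proposed cluster mixes several target clusters.
    for i, cluster in enumerate(proposed):
        if len({target[p] for p in cluster}) > 1:
            return ("split", i)
    # Assumed merge rule: two pure proposed clusters share one target label.
    first_index_for_label = {}
    for i, cluster in enumerate(proposed):
        label = target[next(iter(cluster))]
        if label in first_index_for_label:
            return ("merge", first_index_for_label[label], i)
        first_index_for_label[label] = i
    # Every proposed cluster is pure and no two share a label: nothing to fix.
    return None
```

A clustering algorithm in this model would repeatedly propose a clustering, receive one such request, and revise its proposal; the number of rounds before `None` is returned is the quantity the abstract calls the number of requests.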
Distributed user profiling via spectral methods. SIGMETRICS Perform. Eval
 EPFL ICLCA2 BC256 (BC Building) Station 14 1015 Lausanne Switzerland
, 2010
Cited by 10 (1 self)
User profiling is a useful primitive for constructing personalised services, such as content recommendation. In the present paper we investigate the feasibility of user profiling in a distributed setting, with no central authority and only local information exchanges between users. We compute a profile vector for each user (i.e., a low-dimensional vector that characterises her taste) via spectral transformation of observed user-produced ratings for items. Our two main contributions follow: (i) We consider a low-rank probabilistic model of user taste. More specifically, we consider that users and items are partitioned in a constant number of classes, such that users and items within the same class are statistically identical. We prove that without prior knowledge of the compositions of the classes, based solely on few random observed ratings (namely O(N log N) such ratings for N users), we can predict user preference with high probability for unrated items by running a local vote among users with similar profile vectors. In addition, we provide empirical evaluations characterising the way in which spectral profiling performance depends on the dimension of the profile space. Such evaluations are performed on a data set of real user ratings provided by Netflix. (ii) We develop distributed algorithms which provably achieve an embedding of users into a low-dimensional space, based on spectral transformation. These involve simple message passing among users, and provably converge to the desired embedding. Our method essentially relies on a novel combination of gossiping and the algorithm proposed by Oja and Karhunen.
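The two steps named in this abstract, a spectral profile vector per user followed by a local vote among users with similar profiles, can be sketched in a centralized form. This is an illustration under assumptions, not the paper's method: it uses a plain truncated SVD from NumPy instead of the distributed gossip scheme, treats unobserved ratings as zero, and assumes ratings of +1 or -1. The names `profile_vectors`, `predict_by_local_vote`, `mask`, and `n_neighbors` are hypothetical.

```python
import numpy as np

def profile_vectors(ratings, mask, dim):
    """Embed each user with the top `dim` singular directions of the observed ratings."""
    observed = np.where(mask, ratings, 0.0)  # assumption: unobserved entries count as zero
    u, s, _ = np.linalg.svd(observed, full_matrices=False)
    return u[:, :dim] * s[:dim]  # one row (profile vector) per user

def predict_by_local_vote(ratings, mask, profiles, user, item, n_neighbors):
    """Predict the sign of a missing rating by a vote among the nearest raters of `item`."""
    raters = np.flatnonzero(mask[:, item])
    raters = raters[raters != user]
    if raters.size == 0:
        return 0.0  # nobody rated this item, so there is no vote to run
    distances = np.linalg.norm(profiles[raters] - profiles[user], axis=1)
    nearest = raters[np.argsort(distances)[:n_neighbors]]
    return float(np.sign(ratings[nearest, item].sum()))
```

With two user classes whose members rate identically, users of the same class receive nearby profile vectors, so the vote for a missing rating is decided by members of the user's own class.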
New Theoretical Frameworks for Machine Learning
, 2007
Cited by 5 (0 self)
This thesis develops and analyzes theoretical frameworks for new emerging paradigms of Machine Learning including Semi-supervised, Active, and Similarity-based Learning. These are areas of significant practical importance and significant activity in Machine Learning, and a number of different algorithmic approaches have been developed for each of them. Standard Learning Theory frameworks such as PAC or Statistical Learning Theory models tend to not capture these learning approaches, hence developing sound and rigorous models that provide a thorough understanding of these new paradigms is desirable. The purpose of this thesis is to propose and to study new theoretical frameworks and algorithms for better understanding and extending some of these learning approaches. In addition, this dissertation also presents new applications of techniques from Machine Learning Theory to new emerging areas of Computer Science at large, such as Auction and Mechanism Design. In Machine Learning, there has been growing interest in using unlabeled data together with labeled data due to the availability of large amounts of unlabeled data in many applications. As a result, a number of different algorithmic approaches have been developed for this
A theory of similarity functions for clustering
, 2007
Cited by 3 (3 self)
Problems of clustering data from pairwise similarity information are ubiquitous in Computer Science. Theoretical treatments typically view the similarity information as ground truth and then design algorithms to (approximately) optimize various graph-based objective functions. However, in most applications, this similarity information is merely based on some heuristic: the true goal is to cluster the points correctly rather than to optimize any specific graph property. In this work, we initiate a theoretical study of the design of similarity functions for clustering from this perspective. In particular, motivated by recent work in learning theory that asks “what natural properties of a similarity function are sufficient to be able to learn well?” we ask “what natural properties of a similarity function are sufficient to be able to cluster well?” We develop a notion of the clustering complexity of a given property (analogous to notions of capacity in learning theory), that characterizes its information-theoretic usefulness for clustering. We then analyze this complexity for several natural game-theoretic and learning-theoretic properties, as well as design efficient algorithms that are able to take advantage of them. We consider two natural clustering objectives: (a) list clustering: analogous to the notion of list-decoding, the algorithm can produce a small list of clusterings (which a user can select from) and (b) hierarchical clustering: the desired clustering is some
Interactive clustering
, 2009
Cited by 1 (0 self)
We consider the problem of clustering with feedback. We study a recently proposed framework for the problem and present new results on clustering geometric concept classes in that model. In this model the clustering algorithm interacts with the user via “split” and “merge” requests to figure out the target clustering. We also give a simple generic algorithm to cluster any concept class in the model. Our algorithm is query-efficient in the sense that it involves only a small amount of interaction with the user. We also present and study two natural generalizations of the original model. The original model assumes that the user response to the algorithm is perfect. We eliminate this limitation by proposing a noisy model for interactive clustering and give an algorithm for learning the class of intervals in that model. We also propose a dynamic model considering the fact that the user might see a random subset of the space of all points at every step. Finally, for datasets satisfying a spectrum of weak to strong properties, we give query bounds, and show that a class of clustering functions containing Single-Linkage will find the target clustering under the strongest property.
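The abstract names Single-Linkage without defining it. A minimal sketch of the standard single-linkage procedure follows, assuming a pairwise similarity function and a known number of clusters `k`; the names `single_linkage`, `similarity`, and `k` are hypothetical, and the property under which the paper proves correctness is not reproduced here.

```python
def single_linkage(points, similarity, k):
    """Merge the two clusters with the most similar cross pair until k clusters remain."""
    clusters = [{p} for p in points]  # start with every point in its own cluster
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage scores two clusters by their most similar pair.
                link = max(similarity(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or link > best[0]:
                    best = (link, i, j)
        _, i, j = best
        clusters[i] |= clusters[j]
        del clusters[j]  # j > i, so index i is unaffected by the deletion
    return clusters
```

When every point is more similar to all points in its own target cluster than to any point outside it, each merge joins points of the same target cluster until only `k` clusters remain, which is the kind of behaviour the abstract attributes to the strongest property.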
Clustering via Similarity Functions: Theoretical Foundations and Algorithms
Problems of clustering data from pairwise similarity information arise in many different fields. Yet the question of which algorithm is best to use under what conditions, and how good a notion of similarity one needs in order to cluster accurately, remains poorly understood. In this work we propose a new general framework for analyzing clustering from similarity information that directly addresses this question of what properties of a similarity measure are sufficient to cluster accurately and by what kinds of algorithms. We show that in our framework a wide variety of interesting learning-theoretic and game-theoretic properties, including properties motivated by mathematical biology, can be used to cluster well, and we design new efficient algorithms that are able to take advantage of them. We consider two natural clustering objectives: (a) list clustering, where the algorithm’s goal is to produce a small list of clusterings such that at least one of them is approximately correct, and (b) hierarchical clustering, where the algorithm’s goal is to produce a hierarchy such that the desired clustering is some pruning of this tree (which a user could navigate). We develop a notion of the clustering complexity of a given property, analogous to notions of capacity in learning theory, that characterizes information-theoretic usefulness for clustering. We analyze this quantity for a wide range of properties, giving tight upper and lower
Using Spectral Clustering for Finding Students’ Patterns of Behavior in Social Networks
Abstract. The high dimensionality of the data generated by social networks has been a big challenge for researchers. In order to solve the problems associated with this phenomenon, a number of methods and techniques were developed. Spectral clustering is a data mining method used in many applications; in this paper we used this method to find students’ behavioral patterns in an e-learning system. In addition, a software tool was introduced to allow the user (tutor or researcher) to define the data dimensions and input values to obtain appropriate graphs with behavioral patterns that meet his/her needs. Behavioral patterns were compared with students’ study performance and evaluation with relation to their possible usage in collaborative learning.
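The abstract does not say which variant of spectral clustering was used. As a generic illustration only, the sketch below performs a two-way spectral split of a weighted graph using the sign of the second eigenvector of the unnormalized graph Laplacian; the function name `spectral_bipartition` and the choice of Laplacian are assumptions, not details from the paper.

```python
import numpy as np

def spectral_bipartition(weights):
    """Split a connected weighted graph in two using the sign of the Fiedler vector."""
    weights = np.asarray(weights, dtype=float)  # symmetric matrix of edge weights
    laplacian = np.diag(weights.sum(axis=1)) - weights  # unnormalized Laplacian D - W
    _, eigenvectors = np.linalg.eigh(laplacian)  # eigenvalues come back in ascending order
    fiedler = eigenvectors[:, 1]  # eigenvector of the second-smallest eigenvalue
    return fiedler > 0  # boolean group label per node
```

For a graph made of two tightly connected groups joined by a weak edge, the two groups receive opposite labels; which group is labelled `True` depends on the arbitrary sign of the eigenvector.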
Streaming, Memory Limited Algorithms for Community Detection
In this paper, we consider sparse networks consisting of a finite number of non-overlapping communities, i.e. disjoint clusters, so that there is higher density within clusters than across clusters. Both the intra- and inter-cluster edge densities vanish when the size of the graph grows large, making the cluster reconstruction problem noisier and hence difficult to solve. We are interested in scenarios where the network size is very large, so that the adjacency matrix of the graph is hard to manipulate and store. The data stream model in which columns of the adjacency matrix are revealed sequentially constitutes a natural framework in this setting. For this model, we develop two novel clustering algorithms that extract the clusters asymptotically accurately. The first algorithm is offline, as it needs to store and keep the assignments of nodes to clusters, and requires a memory that scales linearly with the network size. The second algorithm is online, as it may classify a node when the corresponding column is revealed and then discard this information. This algorithm requires a memory growing sublinearly with the network size. To construct these efficient streaming memory-limited clustering algorithms, we first address the problem of clustering with partial information, where only a small proportion of the columns of the adjacency matrix is observed and develop, for this setting, a new spectral algorithm which is of independent interest.
New Theoretical Frameworks for Machine Learning
, 2008