Results 1–10 of 38
Aiding the detection of fake accounts in large scale social online services.
In NSDI, 2012
Cited by 36 (3 self)
Abstract: Users increasingly rely on the trustworthiness of the information exposed on Online Social Networks (OSNs). In addition, OSN providers base their business models on the marketability of this information. However, OSNs suffer from abuse in the form of the creation of fake accounts, which do not correspond to real humans. Fakes can introduce spam, manipulate online ratings, or exploit knowledge extracted from the network. OSN operators currently expend significant resources to detect, manually verify, and shut down fake accounts. Tuenti, the largest OSN in Spain, dedicates 14 full-time employees to that task alone, incurring a significant monetary cost. The task has yet to be successfully automated because of the difficulty of reliably capturing the diverse behavior of fake and real OSN profiles. We introduce a new tool in the hands of OSN operators, which we call SybilRank. It relies on social graph properties to rank users according to their perceived likelihood of being fake (Sybils). SybilRank is computationally efficient and can scale to graphs with hundreds of millions of nodes, as demonstrated by our Hadoop prototype. We deployed SybilRank in Tuenti's operation center. We found that ∼90% of the 200K accounts that SybilRank designated as most likely to be fake actually warranted suspension. In contrast, with Tuenti's current user-report-based approach, only ∼5% of the inspected accounts are indeed fake.
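The ranking idea described in the abstract can be sketched as a few rounds of degree-normalized trust propagation from trusted seeds. The following is an illustrative toy version, not the paper's production algorithm; the graph, seed choice, and iteration count are our own assumptions.

```python
# Hedged sketch of SybilRank-style trust propagation: early-terminated,
# degree-normalized power iteration from trusted seeds, then ranking
# users by degree-normalized trust. Toy data, illustrative only.
import math

def sybilrank(adj, seeds, total_trust=1000.0):
    """adj: dict node -> list of neighbors; seeds: set of trusted nodes."""
    n = len(adj)
    trust = {v: (total_trust / len(seeds) if v in seeds else 0.0) for v in adj}
    # O(log n) iterations, so trust stays mostly within the honest region
    for _ in range(max(1, math.ceil(math.log2(n)))):
        new = {}
        for v in adj:
            new[v] = sum(trust[u] / len(adj[u]) for u in adj[v])
        trust = new
    # degree-normalize so high-degree nodes are not unduly favored
    return sorted(adj, key=lambda v: trust[v] / len(adj[v]), reverse=True)

# Toy graph: a well-connected "honest" region {a,b,c,d} attached to a
# sparsely connected "sybil" region {x,y} by a single edge.
g = {
    "a": ["b", "c", "d"], "b": ["a", "c"], "c": ["a", "b", "d"],
    "d": ["a", "c", "x"], "x": ["d", "y"], "y": ["x"],
}
ranking = sybilrank(g, seeds={"a"})
```

Nodes behind the narrow cut receive little trust and sink to the bottom of the ranking, which is the intuition behind flagging the lowest-ranked accounts for inspection.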
A local algorithm for finding well-connected clusters
CoRR, 2013
Cited by 7 (2 self)
Motivated by applications of large-scale graph clustering, we study random-walk-based local algorithms whose running times depend only on the size of the output cluster, rather than the entire graph. In particular, we develop a method with a better theoretical guarantee than all previous work, both in terms of the clustering accuracy and the conductance of the output set. We also prove that our analysis is tight, and perform empirical evaluations to support our theory on both synthetic and real data. More specifically, our method outperforms prior work when the cluster is well-connected. In fact, the better connected the cluster is internally, the more significant the improvement we obtain. Our results shed light on why, in practice, some random-walk-based algorithms perform better than their previous theory predicts, and help guide future research on local clustering.
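The family of algorithms the abstract refers to can be illustrated by a PageRank-Nibble-style procedure (our illustration, not this paper's exact method): approximate a personalized PageRank vector from a seed node, then take the sweep cut of lowest conductance. All function names and parameter values below are our own choices.

```python
# Hedged sketch of random-walk-based local clustering: personalized
# PageRank from a seed, followed by a conductance-minimizing sweep cut.

def personalized_pagerank(adj, seed, alpha=0.15, iters=60):
    """Power iteration for p = alpha * e_seed + (1 - alpha) * W^T p."""
    p = {v: 0.0 for v in adj}
    p[seed] = 1.0
    for _ in range(iters):
        new = {v: (alpha if v == seed else 0.0) for v in adj}
        for v in adj:
            share = (1 - alpha) * p[v] / len(adj[v])
            for u in adj[v]:
                new[u] += share
        p = new
    return p

def sweep_cut(adj, p):
    """Order nodes by p[v]/deg(v); return the prefix set of lowest conductance."""
    order = sorted(adj, key=lambda v: p[v] / len(adj[v]), reverse=True)
    vol_total = sum(len(adj[v]) for v in adj)
    best, best_phi = None, float("inf")
    in_set, vol, cut = set(), 0, 0
    for v in order[:-1]:                      # skip the trivial all-nodes cut
        in_set.add(v)
        vol += len(adj[v])
        # neighbors already inside remove a cut edge; outside ones add one
        cut += sum(-1 if u in in_set else 1 for u in adj[v])
        phi = cut / min(vol, vol_total - vol)
        if phi < best_phi:
            best, best_phi = set(in_set), phi
    return best, best_phi

# Two triangles joined by a single edge; the local cluster around node 0
# should be its own triangle {0, 1, 2}, with conductance 1/7.
g = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
cluster, phi = sweep_cut(g, personalized_pagerank(g, seed=0))
```

Note that the work per sweep step depends only on the degrees of nodes examined, which is the sense in which such algorithms are "local".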
A very fast method for clustering big text datasets
In ECAI, 2010
Cited by 7 (4 self)
Abstract: Large-scale text datasets have long eluded a family of particularly elegant and effective clustering methods that exploits the power of pairwise similarities between data points, due to the prohibitive cost, time- and space-wise, of operating on a similarity matrix, where the state of the art is at best quadratic in time and in space. We present an extremely fast and simple method that also uses the power of all pairwise similarities between data points, and show through experiments that it does as well as previous methods in clustering accuracy, and that it does so in linear time and space, without sampling data points or sparsifying the similarity matrix.
Efficient spectral neighborhood blocking for entity resolution
In ICDE, 2011
Cited by 6 (1 self)
Abstract: In many telecom and web applications, there is a need to identify whether data objects in the same source or different sources represent the same entity in the real world. This problem arises for subscribers in multiple services, customers in supply chain management, and users in social networks when a unique identifier representing a real-world entity is lacking across multiple data sources. Entity resolution is the task of identifying and discovering objects in the data sets that refer to the same entity in the real world. We investigate the entity resolution problem for large data sets, where efficient and scalable solutions are needed. We propose a novel unsupervised blocking algorithm, SPectrAl Neighborhood (SPAN), which constructs a fast bipartition tree for the records based on spectral clustering, such that real entities can be identified accurately by neighborhood records in the tree. There are two major novel aspects in our approach: 1) we develop a fast algorithm that performs spectral clustering without computing pairwise similarities explicitly, which dramatically improves the scalability of the standard spectral clustering algorithm; 2) we utilize a stopping criterion specified by Newman-Girvan modularity in the bipartition process. Our experimental results with both synthetic and real-world data demonstrate that SPAN is robust and outperforms other blocking algorithms in terms of accuracy, while remaining efficient and scalable for large data sets.
Diffusion processes for retrieval revisited
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013, 1320–1327
Cited by 5 (1 self)
In this paper we revisit diffusion processes on affinity graphs for capturing the intrinsic manifold structure defined by pairwise affinity matrices. Such diffusion processes have already proven able to significantly improve subsequent applications like retrieval. We give a thorough overview of the state of the art in this field and discuss obvious similarities and differences. Based on our observations, we derive a generic framework for diffusion processes in the scope of retrieval applications, where the related work represents specific instances of our generic formulation. We evaluate our framework on several retrieval tasks and are able to derive algorithms that, e.g., achieve a 100% bull's-eye score on the popular MPEG-7 shape retrieval data set.
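One common instance of the diffusion processes this line of work surveys is a regularized random-walk diffusion of query scores over the affinity graph. The sketch below is our illustration of that general idea, not the paper's framework; the affinity values and the damping factor alpha are toy assumptions.

```python
# Minimal sketch of a regularized random-walk diffusion on an affinity
# graph: repeatedly push query scores along the graph's transition
# matrix, with a restart term that anchors them to the initial query.
import numpy as np

def diffuse(A, f0, alpha=0.8, iters=50):
    """A: symmetric pairwise affinity matrix; f0: initial query affinities."""
    P = A / A.sum(axis=1, keepdims=True)     # row-stochastic transition matrix
    f = f0.astype(float).copy()
    for _ in range(iters):
        f = alpha * (P.T @ f) + (1 - alpha) * f0
    return f                                  # diffused, manifold-aware scores

# Items 0-2 form one tight cluster, items 3-4 another; the query matches
# item 0 only, but diffusion spreads relevance across item 0's cluster.
A = np.array([[1.0, 0.9, 0.9, 0.1, 0.1],
              [0.9, 1.0, 0.9, 0.1, 0.1],
              [0.9, 0.9, 1.0, 0.1, 0.1],
              [0.1, 0.1, 0.1, 1.0, 0.9],
              [0.1, 0.1, 0.1, 0.9, 1.0]])
f0 = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
scores = diffuse(A, f0)
```

After diffusion, items in the query's manifold neighborhood outrank items that are merely globally dissimilar, which is the effect that improves retrieval.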
Node Clustering in Graphs: An Empirical Study
Cited by 4 (4 self)
Modeling networks is an active area of research and is used for many applications ranging from bioinformatics to social network analysis. An important operation that is often performed in the course of graph analysis is node clustering. Popular methods for node clustering, such as the normalized cut method, have their roots in graph partition optimization and spectral graph theory. Recently, there has been increasing interest in modeling graphs probabilistically using stochastic block models and other approaches that extend them. In this paper, we present an empirical study that compares the node clustering performance of state-of-the-art algorithms from both the probabilistic and spectral families on undirected graphs. Our experiments show that neither family dominates the other and that network characteristics play a significant role in determining the best model to use.
Stochastic Data Clustering, 2012
Cited by 3 (2 self)
... published the theory behind the long-term behavior of a dynamical system that can be described by a nearly uncoupled matrix. Over the past fifty years this theory has been used in a variety of contexts, including queueing theory, brain organization, and ecology. In all of these applications, the structure of the system is known, and the point of interest is the various stages the system passes through on its way to some long-term equilibrium. This paper looks at the problem from the other direction. That is, we develop a technique for using the evolution of the system to tell us about its initial structure, and then use this technique to develop an algorithm that takes the varied solutions from multiple data clustering algorithms and arrives at a single data clustering solution.
Determining the Number of Clusters via Iterative Consensus Clustering, 2013
Cited by 2 (2 self)
We use a cluster ensemble to determine the number of clusters, k, in a group of data. A consensus similarity matrix is formed from the ensemble using multiple algorithms and several values of k. A random walk is induced on the graph defined by the consensus matrix, and the eigenvalues of the associated transition probability matrix are used to determine the number of clusters. For noisy or high-dimensional data, an iterative technique is presented to refine this consensus matrix in a way that encourages a block-diagonal form. It is shown that the resulting consensus matrix is generally superior to existing similarity matrices for this type of spectral analysis.
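The eigenvalue criterion described in the abstract can be sketched as follows. This is our illustration using a simple largest-eigengap rule on a toy consensus matrix; the paper's exact procedure may differ.

```python
# Rough sketch: row-normalize a consensus similarity matrix into a
# random-walk transition matrix, then estimate k from the largest gap
# in its sorted eigenvalue magnitudes (the cluster of eigenvalues near
# 1 corresponds to nearly decoupled blocks).
import numpy as np

def estimate_k(consensus):
    """Place k at the largest gap in the transition matrix's spectrum."""
    P = consensus / consensus.sum(axis=1, keepdims=True)
    eigvals = np.sort(np.abs(np.linalg.eigvals(P)))[::-1]
    gaps = eigvals[:-1] - eigvals[1:]
    return int(np.argmax(gaps)) + 1

# Nearly block-diagonal consensus matrix with two clear blocks
C = np.array([[1.0, 0.9, 0.05, 0.05],
              [0.9, 1.0, 0.05, 0.05],
              [0.05, 0.05, 1.0, 0.9],
              [0.05, 0.05, 0.9, 1.0]])
k = estimate_k(C)
```

For this matrix the transition spectrum is approximately {1, 0.9, 0.05, 0.05}, so the largest gap falls after the second eigenvalue and the estimate is k = 2.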
Approximate spectral clustering via randomized sketching
Preprint, arXiv:1311.2854, 2013
Cited by 2 (1 self)
Spectral clustering is arguably one of the most important algorithms in data mining and machine intelligence; however, its computational complexity makes it a challenge to use for large-scale data analysis. Recently, several approximation algorithms for spectral clustering have been developed to alleviate these costs, but theoretical results are lacking. In this paper, we present a novel approximation algorithm for spectral clustering with strong theoretical evidence of its performance. Our algorithm is based on approximating the eigenvectors of the Laplacian matrix using random projections, a.k.a. randomized sketching. Our experimental results demonstrate that the proposed approximation algorithm compares remarkably well to the exact algorithm.
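A loose sketch of the random-projection idea follows, using a standard randomized range finder with power iterations; the authors' exact construction may differ. The similarity matrix and all parameter values are toy assumptions.

```python
# Sketch the normalized adjacency with a Gaussian test matrix, then
# recover approximate leading eigenvectors from the small projected
# eigenproblem, instead of a full eigendecomposition.
import numpy as np

def sketched_spectral_embedding(A, k, oversample=4, power=2, seed=0):
    rng = np.random.default_rng(seed)
    d = A.sum(axis=1)
    N = A / np.sqrt(np.outer(d, d))          # D^{-1/2} A D^{-1/2}
    omega = rng.standard_normal((A.shape[0], k + oversample))
    Y = N @ omega                            # random sketch of the range
    for _ in range(power):                   # power iterations sharpen it
        Y = N @ (N @ Y)
    Q, _ = np.linalg.qr(Y)
    B = Q.T @ N @ Q                          # small projected eigenproblem
    w, V = np.linalg.eigh(B)
    return (Q @ V)[:, ::-1][:, :k]           # approximate top-k eigenvectors

# Similarity matrix with two clear clusters {0,1,2} and {3,4,5}
A = np.array([[1.0, 0.9, 0.9, 0.05, 0.05, 0.05],
              [0.9, 1.0, 0.9, 0.05, 0.05, 0.05],
              [0.9, 0.9, 1.0, 0.05, 0.05, 0.05],
              [0.05, 0.05, 0.05, 1.0, 0.9, 0.9],
              [0.05, 0.05, 0.05, 0.9, 1.0, 0.9],
              [0.05, 0.05, 0.05, 0.9, 0.9, 1.0]])
emb = sketched_spectral_embedding(A, k=2)
u = np.sign(emb[:, 1])   # sign of 2nd coordinate separates the clusters
```

The expensive part of exact spectral clustering (the full eigendecomposition) is replaced by matrix-vector products against a thin random matrix plus a small dense eigenproblem, which is where the scalability gain comes from.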
Coinciding walk kernels: Parallel absorbing random walks for learning with graphs and few labels
In Asian Conference on Machine Learning, 2013
Cited by 2 (1 self)
Exploiting autocorrelation for node-label prediction in networked data has led to great success. However, when dealing with sparsely labeled networks, common in present-day tasks, the autocorrelation assumption is difficult to exploit. Taking a step beyond it, we propose the coinciding walk kernel (cwk), a novel kernel leveraging label-structure similarity – the idea that nodes with similarly arranged labels in their local neighbourhoods are likely to have the same label – for learning problems on partially labeled graphs. Inspired by the success of random-walk-based schemes for the construction of graph kernels, the cwk is defined in terms of the probability that the labels encountered during parallel random walks coincide. In addition to its intuitive probabilistic interpretation, coinciding walk kernels outperform existing kernel- and walk-based methods on the task of node-label prediction in sparsely labeled graphs with high label-structure similarity. We also show that computing cwks is faster than many state-of-the-art graph kernels. We evaluate cwks on several real-world networks, including co-citation and co-author graphs, as well as a graph of interlinked populated places extracted from the DBpedia knowledge base.
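The coinciding-walk idea can be sketched as follows. This is our own simplified reading (it omits, e.g., the paper's treatment of labeled nodes as absorbing states); the graph, labels, and step count are toy assumptions.

```python
# Simplified sketch: propagate walk distributions P^t from two nodes and
# average, over steps, the probability that two independent walks observe
# the same label at step t.
import numpy as np

def coinciding_walk_kernel(P, labels, u, v, steps=5):
    """P: row-stochastic transition matrix; labels: per-node label ids."""
    labels = np.asarray(labels)
    pu = np.eye(P.shape[0])[u]
    pv = np.eye(P.shape[0])[v]
    total = 0.0
    for _ in range(steps):
        pu, pv = pu @ P, pv @ P
        for l in np.unique(labels):          # P[both walks see label l now]
            mask = labels == l
            total += pu[mask].sum() * pv[mask].sum()
    return total / steps

# Two weakly linked communities carrying labels 0 and 1
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
P = A / A.sum(axis=1, keepdims=True)
labels = [0, 0, 0, 1, 1, 1]
same = coinciding_walk_kernel(P, labels, 0, 1)    # same community
cross = coinciding_walk_kernel(P, labels, 0, 3)   # across communities
```

Nodes whose neighbourhoods carry similarly arranged labels yield walks that coincide often, so the kernel value between them is high, which is exactly the label-structure similarity the abstract describes.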