Results 1-10 of 65
Anomaly Detection: A Survey
, 2007
Cited by 415 (5 self)
Anomaly detection is an important problem that has been researched within diverse research areas and application domains. Many anomaly detection techniques have been specifically developed for certain application domains, while others are more generic. This survey tries to provide a structured and comprehensive overview of the research on anomaly detection. We have grouped existing techniques into different categories based on the underlying approach adopted by each technique. For each category we have identified key assumptions, which are used by the techniques to differentiate between normal and anomalous behavior. When applying a given technique to a particular domain, these assumptions can be used as guidelines to assess the effectiveness of the technique in that domain. For each category, we provide a basic anomaly detection technique, and then show how the different existing techniques in that category are variants of the basic technique. This template provides an easier and succinct understanding of the techniques belonging to each category. Further, for each category, we identify the advantages and disadvantages of the techniques in that category. We also provide a discussion on the computational complexity of the techniques since it is an important issue in real application domains. We hope that this survey will provide a better understanding of the different directions in which research has been done on this topic, and how techniques developed in one area can be applied in domains for which they were not intended to begin with.
Fast random walk with restart and its applications
 In ICDM ’06: Proceedings of the 6th IEEE International Conference on Data Mining
, 2006
Cited by 151 (18 self)
How closely related are two nodes in a graph? How to compute this score quickly, on huge, disk-resident, real graphs? Random walk with restart (RWR) provides a good relevance score between two nodes in a weighted graph, and it has been successfully used in numerous settings, like automatic captioning of images, generalizations to the “connection subgraphs”, personalized PageRank, and many more. However, the straightforward implementations of RWR do not scale for large graphs, requiring either quadratic space and cubic precomputation time, or slow response time on queries. We propose fast solutions to this problem. The heart of our approach is to exploit two important properties shared by many real graphs: (a) linear correlations and (b) block-wise, community-like structure. We exploit the linearity by using low-rank matrix approximation, and the community structure by graph partitioning, followed by the Sherman-Morrison lemma for matrix inversion. Experimental results on the Corel image and the DBLP datasets demonstrate that our proposed methods achieve significant savings over the straightforward implementations: they can save several orders of magnitude in precomputation and storage cost, and they achieve up to 150x speedup with 90%+ quality preservation.
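For readers unfamiliar with RWR, the following is a minimal sketch of the straightforward iterative formulation that the abstract describes as slow on queries, not the paper's fast low-rank or partition-based method. It assumes the usual fixed-point form r = (1 - c) W r + c e, where W is the column-normalized adjacency matrix, e is the indicator vector of the query node, and c is the restart probability. The function name, the `restart` parameter, and the dense list-of-lists graph representation are illustrative assumptions.

```python
def random_walk_with_restart(adj, seed, restart=0.15, tol=1e-10, max_iter=1000):
    """Relevance scores of every node with respect to `seed` (plain power iteration).

    adj[i][j] is the weight of the edge from node j to node i (symmetric for
    undirected graphs). `restart` is the probability of jumping back to `seed`.
    """
    n = len(adj)
    # Column sums, used to turn edge weights into transition probabilities.
    col_sums = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    r = [0.0] * n
    r[seed] = 1.0
    for _ in range(max_iter):
        new_r = [0.0] * n
        for i in range(n):
            acc = 0.0
            for j in range(n):
                if adj[i][j] and col_sums[j] > 0:
                    acc += adj[i][j] / col_sums[j] * r[j]
            new_r[i] = (1.0 - restart) * acc
        new_r[seed] += restart  # restart mass always returns to the query node
        diff = sum(abs(a - b) for a, b in zip(new_r, r))
        r = new_r
        if diff < tol:
            break
    return r


if __name__ == "__main__":
    # Undirected path graph 0 - 1 - 2 - 3, queried from node 0.
    path = [
        [0, 1, 0, 0],
        [1, 0, 1, 0],
        [0, 1, 0, 1],
        [0, 0, 1, 0],
    ]
    print(random_walk_with_restart(path, seed=0))
```

Each query costs one full iteration to convergence over the whole graph, which illustrates why precomputation or approximation becomes attractive on large graphs.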
Centerpiece subgraphs: Problem definition and fast solutions
 In KDD
, 2006
Cited by 67 (22 self)
Given Q nodes in a social network (say, authorship network), how can we find the node/author that is the centerpiece, and has direct or indirect connections to all, or most of them? For example, this node could be the common advisor, or someone who started the research area that the Q nodes belong to. Isomorphic scenarios appear in law enforcement (find the mastermind criminal, connected to all current suspects), gene regulatory networks (find the protein that participates in pathways with all or most of the given Q proteins), viral marketing and many more. Connection subgraphs is an important first step, handling the case of Q=2 query nodes. Then, the connection subgraph algorithm finds the b intermediate nodes that provide a good connection between the two original query nodes. Here we generalize the challenge in multiple dimensions: First, we allow more than two query nodes. Second, we allow a whole family of queries, ranging from 'OR' to 'AND', with 'softAND' in between. Finally, we design and compare a fast approximation, and study the quality/speed tradeoff. We also present experiments on the DBLP dataset. The experiments confirm that our proposed method naturally deals with multi-source queries and that the resulting subgraphs agree with our intuition. Wall-clock timing results on the DBLP dataset show that our proposed approximation achieves good accuracy for about 6:1 speedup.
Fast Direction-Aware Proximity for Graph Mining
, 2007
Cited by 43 (9 self)
In this paper we study asymmetric proximity measures on directed graphs, which quantify the relationships between two nodes or two groups of nodes. The measures are useful in several graph mining tasks, including clustering, link prediction and connection subgraph discovery. Our proximity measure is based on the concept of escape probability. This way, we strive to summarize the multiple facets of node proximity, while avoiding some of the pitfalls to which alternative proximity measures are susceptible. A unique feature of the measures is accounting for the underlying directional information. We put a special emphasis on computational efficiency, and develop fast solutions that are applicable in several settings. Our experimental study shows the usefulness of our proposed direction-aware proximity method for several applications, and that our algorithms achieve a significant speedup (up to 50,000x) over straightforward implementations.
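The abstract does not define escape probability, so the sketch below assumes the textbook definition: the probability that a random walk leaving node i reaches node j before returning to i. It is a plain iterative solver for illustration, not the paper's fast algorithm, and the function name, parameters, and adjacency-matrix representation are hypothetical.

```python
def escape_probability(adj, src, dst, sweeps=10000, tol=1e-12):
    """Probability that a walk leaving `src` hits `dst` before returning to `src`.

    adj[k][t] is the weight of the directed edge k -> t.
    """
    n = len(adj)
    out = [sum(row) for row in adj]  # total outgoing weight per node
    # v[k] = probability that a walk at k reaches dst before src.
    v = [0.0] * n
    v[dst] = 1.0  # boundary conditions: v[dst] = 1, v[src] = 0
    for _ in range(sweeps):
        delta = 0.0
        for k in range(n):
            if k == src or k == dst or out[k] == 0:
                continue  # boundary nodes stay fixed; sinks never reach dst
            new = sum(adj[k][t] / out[k] * v[t] for t in range(n))
            delta = max(delta, abs(new - v[k]))
            v[k] = new
        if delta < tol:
            break
    if out[src] == 0:
        return 0.0
    # Take one step out of src, then ask whether dst is reached first.
    return sum(adj[src][k] / out[src] * v[k] for k in range(n))


if __name__ == "__main__":
    # Directed graph with edges 0->1, 0->2, 1->2, 2->0.
    g = [
        [0, 1, 1],
        [0, 0, 1],
        [1, 0, 0],
    ]
    print(escape_probability(g, 0, 1), escape_probability(g, 1, 0))
```

On the small directed example the two directions give different values, which is the asymmetry the abstract refers to.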
Audience selection for online brand advertising: privacy-friendly social network targeting
 In KDD ’09: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
, 2009
Cited by 32 (2 self)
This paper describes and evaluates privacy-friendly methods for extracting quasi-social networks from browser behavior on user-generated content sites, for the purpose of finding good audiences for brand advertising (as opposed to click maximizing, for example). Targeting social-network neighbors resonates well with advertisers, and online browsing behavior data counterintuitively can allow the identification of good audiences anonymously. Besides being one of the first papers to our knowledge on data mining for online brand advertising, this paper makes several important contributions. We introduce a framework for evaluating brand audiences, in analogy to predictive-modeling holdout evaluation. We introduce methods for extracting quasi-social networks from data on visitations to social networking pages, without collecting any information on the identities of the browsers or the content of the social-network pages. We introduce measures of brand proximity in the network, and show that audiences with high brand proximity indeed show substantially higher brand affinity. Finally, we provide evidence that the quasi-social network embeds a true social network, which along with results from social theory offers one explanation for the increases in audience brand affinity.
On Community Outliers and their Efficient Detection in Information Networks
 KDD'10
, 2010
Cited by 30 (9 self)
Linked or networked data are ubiquitous in many applications. Examples include web data or hypertext documents connected via hyperlinks, social networks or user profiles connected via friend links, co-authorship and citation information, blog data, movie reviews and so on. In these datasets (called “information networks”), closely related objects that share the same properties or interests form a community. For example, a community in the blogosphere could be users mostly interested in cell phone reviews and news. Outlier detection in information networks can reveal important anomalous and interesting behaviors that are not obvious if community information is ignored. An example could be a low-income person being friends with many rich people even though his income is not anomalously low when considered over the entire population. This paper first introduces the concept of community outliers (interesting points or rising stars for a more positive sense), and then shows that well-known baseline approaches without considering links or community information cannot find these community outliers. We propose an efficient solution by modeling networked data as a mixture model composed of multiple normal communities and a set of randomly generated outliers. The probabilistic model characterizes both data and links simultaneously by defining their joint distribution based on hidden Markov random fields (HMRF). Maximizing the data likelihood and the posterior of the model gives the solution to the outlier inference problem. We apply the model on both
Proximity Tracking on Time-Evolving Bipartite Graphs
Cited by 29 (5 self)
Given an author-conference network that evolves over time, which are the conferences that a given author is most closely related with, and how do they change over time? Large time-evolving bipartite graphs appear in many settings, such as social networks, co-citations, market-basket analysis, and collaborative filtering. Our goal is to monitor (i) the centrality of an individual node (e.g., who are the most important authors?); and (ii) the proximity of two nodes or sets of nodes (e.g., who are the most important authors with respect to a particular conference?). Moreover, we want to do this efficiently and incrementally, and to provide “anytime” answers. We propose pTrack and cTrack, which are based on random walk with restart, and use powerful matrix tools. Experiments on real data show that our methods are effective and efficient: the mining results agree with intuition; and we achieve up to 15∼176 times speedup, without any quality loss.
On the Vulnerability of Large Graphs
Cited by 24 (11 self)
Given a large graph, like a computer network, which k nodes should we immunize (or monitor, or remove), to make it as robust as possible against a computer virus attack? We need (a) a measure of the ‘Vulnerability’ of a given network, (b) a measure of the ‘Shield-value’ of a specific set of k nodes and (c) a fast algorithm to choose the best such k nodes. We answer all three of these questions: we give the justification behind our choices, and we show that they agree with intuition as well as recent results in immunology. Moreover, we propose NetShield, a fast and scalable algorithm. Finally, we give experiments on large real graphs, where NetShield achieves tremendous speed savings exceeding 7 orders of magnitude, against straightforward competitors.
Colibri: Fast Mining of Large Static and Dynamic Graphs
Cited by 22 (5 self)
Low-rank approximations of the adjacency matrix of a graph are essential in finding patterns (such as communities) and detecting anomalies. Additionally, it is desirable to track the low-rank structure as the graph evolves over time, efficiently and within limited storage. Real graphs typically have thousands or millions of nodes, but are usually very sparse. However, standard decompositions such as SVD do not preserve sparsity. This has led to the development of methods such as CUR and CMD, which seek a non-orthogonal basis by sampling the columns and/or rows of the sparse matrix. However, these approaches will typically produce over-complete bases, which wastes both space and time. In this paper we propose the family of Colibri methods to deal with these challenges. Our version for static graphs, Colibri-S, iteratively finds a non-redundant basis and we prove that it has no loss of accuracy compared to the best competitors (CUR and CMD), while achieving significant savings in space and time: on real data, Colibri-S requires much less space and is orders of magnitude faster (in proportion to the square of the number of non-redundant columns). Additionally, we propose an efficient update algorithm for dynamic, time-evolving graphs, Colibri-D. Our evaluation on a large, real network traffic dataset shows that Colibri-D is over 100 times faster than the best published competitor (CMD).