Community Distribution Outlier Detection in Heterogeneous Information Networks
Cited by 5 (2 self)
Abstract. Heterogeneous networks are ubiquitous. For example, bibliographic data, social data, medical records, movie data and many more can be modeled as heterogeneous networks. Rich information associated with multi-typed nodes in heterogeneous networks motivates us to propose a new definition of outliers, which is different from those defined for homogeneous networks. In this paper, we propose the novel concept of Community Distribution Outliers (CDOutliers) for heterogeneous information networks, which are defined as objects whose community distribution does not follow any of the popular community distribution patterns. We extract such outliers using a type-aware joint analysis of multiple types of objects. Given community membership matrices for all types of objects, we follow an iterative two-stage approach which performs pattern discovery and outlier detection in a tightly integrated manner. We first propose a novel outlier-aware approach based on joint nonnegative matrix factorization to discover popular community distribution patterns for all the object types in a holistic manner, and then detect outliers based on such patterns. Experimental results on both synthetic and real datasets show that the proposed approach is highly effective in discovering interesting community distribution outliers.
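The definition of a community distribution outlier can be illustrated with a minimal sketch. It is not the paper's method: it assumes the popular patterns are already given (the paper learns them with joint nonnegative matrix factorization), and the L1 distance, the function names, and the toy data are all hypothetical choices made for illustration.

```python
def l1_distance(p, q):
    """Sum of absolute differences between two community distributions."""
    return sum(abs(a - b) for a, b in zip(p, q))


def outlier_score(distribution, patterns):
    """Distance from an object's distribution to its closest popular pattern."""
    return min(l1_distance(distribution, pattern) for pattern in patterns)


# Hypothetical popular patterns over three communities (assumed given, not learned).
patterns = [
    [0.8, 0.1, 0.1],
    [0.1, 0.1, 0.8],
]

# Hypothetical objects with their community membership distributions.
objects = {
    "a": [0.75, 0.15, 0.10],
    "b": [0.10, 0.80, 0.10],
    "c": [0.15, 0.05, 0.80],
}

scores = {name: outlier_score(dist, patterns) for name, dist in objects.items()}
ranked = sorted(scores, key=scores.get, reverse=True)  # most outlying first
```

Object "b" sits far from both patterns and is ranked first, while "a" and "c" each lie close to one pattern.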
Local Learning for Mining Outlier Subgraphs from Network Datasets
 In: SDM
Cited by 3 (1 self)
In the real world, various systems can be modeled using entity-relationship graphs. Given such a graph, one may be interested in identifying suspicious or anomalous subgraphs. Specifically, a user may want to identify suspicious subgraphs matching a query template. A subgraph can be defined as anomalous based on the connectivity structure within itself as well as with its neighborhood. For example, in a coauthorship network, given a subgraph containing three authors, one expects all three authors to be, say, data mining authors. One also expects the neighborhood to consist mostly of data mining authors. But a 3-author clique of data mining authors with all theory authors in the neighborhood clearly seems interesting. Similarly, having one of the authors in the clique be a theory author when all other authors (both in the clique and in the neighborhood) are data mining authors is also suspicious. Thus, the existence of low-probability links and the absence of high-probability links can be a good indicator of subgraph outlierness. The probability of an edge can in turn be modeled based on the weighted similarity between the attribute values of the nodes linked by the edge. We claim that the attribute weights must be learned locally for accurate link existence probability computations. In this paper, we design a system that finds subgraph outliers given a graph and a query by modeling the problem as a linear optimization. Experimental results on several synthetic and real datasets show the effectiveness of the proposed approach in computing interesting outliers.
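The idea that unlikely present links and likely absent links signal outlierness can be sketched as follows. This is only an illustration under stated assumptions: the attribute weights are fixed by hand rather than learned locally, no linear optimization is performed, and the scoring rule, function names, and data are hypothetical.

```python
from itertools import combinations


def edge_probability(attrs_u, attrs_v, weights):
    """Weighted fraction of attributes on which two nodes agree (in [0, 1])."""
    total = sum(weights.values())
    agree = sum(w for attr, w in weights.items() if attrs_u[attr] == attrs_v[attr])
    return agree / total


def subgraph_outlier_score(nodes, edges, attrs, weights):
    """Sum surprise over node pairs: present-but-unlikely or absent-but-likely links."""
    score = 0.0
    for u, v in combinations(sorted(nodes), 2):
        p = edge_probability(attrs[u], attrs[v], weights)
        present = (u, v) in edges or (v, u) in edges
        score += (1.0 - p) if present else p
    return score


# Hypothetical authors and attribute weights (assumed fixed, not learned).
attrs = {
    "ann": {"area": "data mining", "country": "US"},
    "bob": {"area": "data mining", "country": "US"},
    "cat": {"area": "theory", "country": "US"},
}
weights = {"area": 3.0, "country": 1.0}

typical = subgraph_outlier_score({"ann", "bob"}, {("ann", "bob")}, attrs, weights)
unusual = subgraph_outlier_score({"ann", "cat"}, {("ann", "cat")}, attrs, weights)
```

A link between two data mining authors scores 0, while a link between a data mining author and a theory author scores higher because the weighted similarity of their attributes is low.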
Graph-Based User Behavior Modeling: From Prediction to Fraud Detection
Abstract. How can we model users' preferences? How do anomalies, fraud, and spam affect our models of normal users? How can we modify our models to catch fraudsters? In this tutorial we will answer these questions, connecting graph analysis tools for user behavior modeling to anomaly and fraud detection. In particular, we will focus on the application of subgraph analysis, label propagation, and latent factor models to static, evolving, and attributed graphs. For each of these techniques we will give a brief explanation of the algorithms and the intuition behind them. We will then give examples of recent research using the techniques to model, understand and predict normal behavior. With this intuition for how these methods are applied to graphs and user behavior, we will focus on state-of-the-art research showing how the outcomes of these methods are affected by fraud, and how they have been used to catch fraudsters.
Perspective: In this tutorial we focus on understanding anomaly and fraud detection through the lens of normal user behavior modeling. The data mining and machine learning communities have developed a plethora of models and methods for understanding user behavior. However, these methods generally assume that the behavior is that of real, honest people. On the other hand, fraud detection systems frequently use techniques similar to those used in modeling "normal" behavior, but fraud detection is often framed as an independent problem. By focusing on the relations and intersections of the two perspectives, we can gain a more complete understanding of the methods and hopefully inspire new research joining these two communities.
Target Audience: This tutorial is aimed at anyone interested in modeling and understanding user behavior, from data mining and machine learning researchers to practitioners from industry and government. For those new to the field, the tutorial will cover the necessary background material to understand these systems and will offer a concise, intuitive overview of the state of the art. Additionally, the tutorial aims to offer a new perspective that will be valuable and interesting even for researchers with more experience in these domains. For those who have worked in classic user behavior modeling, we will demonstrate how fraud can affect commonly used models that expect normal behavior, with the hope that future models will directly account for fraud. For those who have worked on fraud detection systems, we hope to inspire new research directions through connecting with recent developments in modeling "normal" behavior.
Query-based Graph Cuboid Outlier Detection
Abstract. Various projections or views of a heterogeneous information network can be modeled using the graph OLAP (Online Analytical Processing) framework for effective decision making. Detecting anomalous projections of the network can help analysts identify regions of interest in the graph specific to the projection attribute. While most previous studies on outlier detection in graphs deal with outlier nodes, edges or subgraphs, we are the first to propose detection of graph cuboid outliers. Further, we perform this detection in a query-sensitive way. Given a general subgraph query on a heterogeneous network, we study the problem of finding outlier cuboids from the graph OLAP lattice. A Graph Cuboid Outlier (GCOutlier) is a cuboid with an exceptionally high density of matches for the query. The GCOutlier detection task is clearly challenging because: (1) finding matches for the query (subgraph isomorphism) is NP-hard; (2) the number of matches for the query can be very high; and (3) the number of cuboids can be large. We provide an approximate solution to the problem by computing only a fraction of the total matches originating from a select set of candidate nodes and including a select set of edges, chosen smartly. We perform extensive experiments on synthetic datasets to showcase the execution time versus accuracy tradeoff. Experiments on real datasets like Four Area and Delicious containing thousands of nodes reveal interesting GCOutliers.
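A rough sketch of ranking cuboids by match density is given below. The abstract does not define density precisely, so the definition here (query matches falling entirely inside a cuboid, divided by the number of nodes in it) is an assumption, as are the function name and the toy data. The matches are taken as given, so the NP-hard matching step and the paper's approximation are not shown.

```python
from collections import Counter


def cuboid_densities(node_cuboid, matches):
    """Matches per node for each cuboid; a match counts only if all its nodes share a cuboid."""
    sizes = Counter(node_cuboid.values())
    hits = Counter()
    for match in matches:
        cuboids = {node_cuboid[node] for node in match}
        if len(cuboids) == 1:
            hits[cuboids.pop()] += 1
    return {cuboid: hits[cuboid] / sizes[cuboid] for cuboid in sizes}


# Hypothetical projection of nodes onto a single attribute (year).
node_cuboid = {
    "n1": "2010", "n2": "2010",
    "n3": "2011", "n4": "2011", "n5": "2011", "n6": "2011",
}
# Hypothetical query matches, each listed as a tuple of matched nodes.
matches = [("n1", "n2"), ("n2", "n1"), ("n3", "n4"), ("n2", "n5")]

densities = cuboid_densities(node_cuboid, matches)
densest = max(densities, key=densities.get)
```

Here the "2010" cuboid holds two matches on two nodes and the "2011" cuboid holds one match on four nodes, so "2010" is the candidate outlier; the match spanning both cuboids is ignored.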
Top-K Interesting Subgraph Discovery in Information Networks
Abstract. In the real world, various systems can be modeled using heterogeneous networks which consist of entities of different types. Many problems on such networks can be mapped to an underlying critical problem of discovering top-K subgraphs of entities with rare and surprising associations. Answering such subgraph queries efficiently involves two main challenges: (1) computing all matching subgraphs which satisfy the query and (2) ranking such results based on the rarity and the interestingness of the associations among entities in the subgraphs. Previous work on the matching problem can be harnessed for a naïve ranking-after-matching solution. However, for large graphs, subgraph queries may have an enormous number of matches, and so it is inefficient to compute all matches when only the top-K matches are desired. In this paper, we address the two challenges of matching and ranking in top-K subgraph discovery as follows. First, we introduce two index structures for the network: a topology index and a graph maximum metapath weight index, which are both computed offline. Second, we propose novel top-K mechanisms to exploit these indexes for answering interesting subgraph queries online efficiently. Experimental results on several synthetic datasets and the DBLP and Wikipedia datasets containing thousands of entities show the efficiency and the effectiveness of the proposed approach in computing interesting subgraphs.
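The reason top-K retrieval need not score every match can be sketched with a generic upper-bound pruning loop. This is not the paper's algorithm and it does not model the topology or metapath weight indexes; it assumes each candidate comes with a cheap upper bound on its interestingness score, and the function name, bounds, and scores are hypothetical.

```python
import heapq


def top_k(candidates, exact_score, k):
    """Return the k best (score, item) pairs and the number of exact evaluations.

    candidates: iterable of (upper_bound, item), where upper_bound >= exact score.
    """
    best = []  # min-heap holding the current k best (score, item) pairs
    evaluated = 0
    for bound, item in sorted(candidates, key=lambda c: c[0], reverse=True):
        if len(best) == k and bound <= best[0][0]:
            break  # no remaining candidate can beat the current k-th best
        score = exact_score(item)
        evaluated += 1
        if len(best) < k:
            heapq.heappush(best, (score, item))
        elif score > best[0][0]:
            heapq.heapreplace(best, (score, item))
    return sorted(best, reverse=True), evaluated


# Hypothetical exact scores (expensive to compute) and cheap upper bounds.
exact = {"s1": 9.0, "s2": 7.5, "s3": 4.0, "s4": 3.0, "s5": 1.0}
candidates = [(10.0, "s1"), (8.0, "s2"), (7.0, "s3"), (3.5, "s4"), (2.0, "s5")]

result, evaluated = top_k(candidates, exact.get, k=2)
```

With these inputs only two of the five candidates are scored exactly, because the third candidate's upper bound already falls below the second-best score found so far.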
A Proposal for Statistical Outlier Detection in Relational Structures
This paper extends unsupervised statistical outlier detection to the case of relational data. For non-relational data, where each individual is characterized by a feature vector, a common approach starts with learning a generative statistical model for the population. The model assigns a likelihood measure to the feature vector that characterizes the individual; the lower the feature vector likelihood, the more anomalous the individual. A difference between relational and non-relational data is that an individual is characterized not only by a list of attributes, but also by its links and by attributes of the individuals linked to it. We refer to a relational structure that specifies this information for a specific individual as the individual's database. Our proposal is to use the likelihood assigned by a generative model to the individual's database as the anomaly score for the individual; the lower the model likelihood, the more anomalous the individual. As a novel validation method, we compare the model likelihood with metrics of individual success. An empirical evaluation reveals a surprising finding in soccer and movie data: we observe a strong correlation between the likelihood and success metrics.
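The non-relational baseline described above, where low likelihood under a generative model means high anomaly, can be sketched as follows. The model here is an assumption chosen for brevity (independent categorical features with add-alpha smoothing), not the model used in the paper, and it does not cover the relational extension. The feature values and function names are hypothetical.

```python
import math
from collections import Counter


def fit(rows, alpha=1.0):
    """Fit one smoothed categorical distribution per feature column."""
    model = []
    for j in range(len(rows[0])):
        counts = Counter(row[j] for row in rows)
        total = len(rows) + alpha * len(counts)
        model.append({value: (c + alpha) / total for value, c in counts.items()})
    return model


def log_likelihood(row, model):
    """Log-likelihood of a feature vector; values must have been seen during fit."""
    return sum(math.log(model[j][value]) for j, value in enumerate(row))


# Hypothetical population of (position, preferred foot) feature vectors.
rows = [
    ("forward", "right"),
    ("forward", "right"),
    ("forward", "right"),
    ("forward", "left"),
    ("keeper", "left"),
]

model = fit(rows)
scores = [log_likelihood(row, model) for row in rows]
# The individual with the lowest likelihood is the most anomalous.
most_anomalous = min(range(len(rows)), key=scores.__getitem__)
```

The last individual combines the two rarest feature values, so it receives the lowest log-likelihood and is flagged as the most anomalous.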