## Link Mining: A Survey (2005)

Venue: SigKDD Explorations Special Issue on Link Mining

Citations: 47 (0 self)

### BibTeX

@ARTICLE{Getoor05linkmining:,

author = {Lise Getoor and Christopher P. Diehl},

title = {Link Mining: A Survey},

journal = {SigKDD Explorations Special Issue on Link Mining},

year = {2005}

}

### Abstract

Many datasets of interest today are best described as a linked collection of interrelated objects. These may represent homogeneous networks, in which there is a single object type and link type, or richer, heterogeneous networks, in which there may be multiple object and link types (and possibly other semantic information). Examples of homogeneous networks include single-mode social networks, such as people connected by friendship links, or the WWW, a collection of linked web pages. Examples of heterogeneous networks include those in medical domains describing patients, diseases, treatments and contacts, or in bibliographic domains describing publications, authors, and venues. Link mining refers to data mining techniques that explicitly consider these links when building predictive or descriptive models of the linked data. Commonly addressed link mining tasks include object ranking, group detection, collective classification, link prediction and subgraph discovery. While network analysis has been studied in depth in particular areas such as social network analysis, hypertext mining, and web analysis, only recently has there been a cross-fertilization of ideas among these different communities. This is an exciting, rapidly expanding area. In this article, we review some of the common emerging themes.
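The distinction the abstract draws between homogeneous and heterogeneous networks can be made concrete with a small data-structure sketch. This is only an illustration under stated assumptions: the `Network` and `Link` classes, the `is_homogeneous` check, and the node and link names are hypothetical and do not come from the survey.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Link:
    """A typed, directed link between two node ids (hypothetical helper)."""
    source: str
    target: str
    link_type: str


@dataclass
class Network:
    """A linked collection of typed objects (hypothetical helper)."""
    node_types: dict = field(default_factory=dict)  # node id -> object type
    links: list = field(default_factory=list)

    def add_node(self, node_id, node_type):
        self.node_types[node_id] = node_type

    def add_link(self, source, target, link_type):
        # Refuse links whose endpoints were never declared.
        for endpoint in (source, target):
            if endpoint not in self.node_types:
                raise KeyError(f"unknown node: {endpoint}")
        self.links.append(Link(source, target, link_type))

    def is_homogeneous(self):
        # Homogeneous: at most one object type and at most one link type.
        object_types = set(self.node_types.values())
        link_types = {link.link_type for link in self.links}
        return len(object_types) <= 1 and len(link_types) <= 1


# Single-mode social network: people connected by friendship links.
friends = Network()
for person in ("ann", "bob", "cat"):
    friends.add_node(person, "person")
friends.add_link("ann", "bob", "friend")
friends.add_link("bob", "cat", "friend")

# Bibliographic network: publications, authors, and venues.
biblio = Network()
biblio.add_node("p1", "publication")
biblio.add_node("getoor", "author")
biblio.add_node("sigkdd", "venue")
biblio.add_link("getoor", "p1", "wrote")
biblio.add_link("p1", "sigkdd", "published_in")
```

Here `friends.is_homogeneous()` holds while `biblio.is_homogeneous()` does not, which mirrors the two example families named in the abstract.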

### Citations

2704 | Authoritative sources in a hyperlinked environment
- Kleinberg
- 1999
Citation Context ... the set of objects within the graph. Much of this research focuses on graphs with a single object type and a single link type. In the context of web information retrieval, the PageRank [91] and HITS [64] algorithms are the most notable approaches to LBR. PageRank models web surfing as a random walk where the surfer randomly selects and follows links and occasionally jumps to a new web page to start a...

2675 | Fast Algorithms for Mining Association Rules
- Agrawal, Srikant
- 1994
Citation Context ... or the discovered patterns may be used for graph classification (Section 9). One line of work attempts to find frequent subgraphs [54; 70; 116]. Many of these approaches exploit the Apriori property [4] from frequent item set mining. Typically, there is a candidate generation phase followed by a matching phase. Naive matching requires a subgraph isomorphism test, so efficient algorithms are needed h...

2310 | Conditional random fields: probabilistic models for segmenting and labeling sequence data
- Lafferty, McCallum, et al.
- 2001
Citation Context ...llection of encyclopedia articles: simply incorporating words from neighboring documents was not helpful, while making use of the predicted class of neighboring documents was helpful. Lafferty et al. [71] introduce conditional random fields (CRF), which extend traditional maximum entropy models for LBC in the restricted case where the data graphs are chains. Taskar et al. [107] extend Lafferty et al.’...

2137 | The PageRank citation ranking: Bringing order to the Web
- Page, Brin, et al.
- 1998
Citation Context ... or prioritize the set of objects within the graph. Much of this research focuses on graphs with a single object type and a single link type. In the context of web information retrieval, the PageRank [91] and HITS [64] algorithms are the most notable approaches to LBR. PageRank models web surfing as a random walk where the surfer randomly selects and follows links and occasionally jumps to a new web p...

2089 | Emergence of scaling in random networks
- Barabási, Albert
- 1999
Citation Context ...ivated the search for general principles governing such networks [15]. Airoldi et al. [5] in this issue review sampling algorithms for a number of the common network types such as scale free networks [7], small-world networks [113], core-periphery [13], and cellular networks [42] that exhibit such attributes. In contrast to the random process models from the social network analysis literature, many o...

1904 | Collective dynamics of 'small-world' networks
- Watts, Strogatz
- 1998
Citation Context ...ral principles governing such networks [15]. Airoldi et al. [5] in this issue review sampling algorithms for a number of the common network types such as scale free networks [7], small-world networks [113], core-periphery [13], and cellular networks [42] that exhibit such attributes. In contrast to the random process models from the social network analysis literature, many of these generative models ar...

1652 | Social Network Analysis: Methods and Applications
- Wasserman, Faust
- 1994
Citation Context ... more complex. Consider a simple example from Singh et al. [101] of a social network describing actors and their participation in events. Such social networks are commonly called affiliation networks [112], and are easily represented by three tables representing the actors, the events, and the participation relationships. Even this simple structure can be represented as several distinct graphs. The mos...

1055 | Inductive logic programming
- Muggleton
- 1991
Citation Context ...e and lexicographically ordering these codes, then performing DFS on the search tree defined by this lexicographic ordering. Other approaches come from the inductive logic programming (ILP) community [79; 72]. One early success was the work of Dehaspe et al. [27], who applied techniques from inductive logic programming to finding frequent patterns in a toxicology domain. Another line of work focuses on ef...

483 | Centrality in social networks: Conceptual clarification
- Freeman
- 1979
Citation Context ...res characterize some aspect of the local or global network structure as seen from a given individual’s position in the network. They range in complexity from local measures such as degree centrality [43], which is simply the vertex degree, to global measures such as eigenvector/power centrality [12], which use spectral methods to characterize the importance of individuals based on their connectedness...

479 | The link prediction problem for social networks
- Liben-Nowell, Kleinberg
- 2003
Citation Context ...r any two potentially linked objects o_i and o_j, predict whether l_ij is 1 or 0. One approach is to make this prediction entirely based on structural properties of the network. Liben-Nowell and Kleinberg [75] present a survey of predictors based on different graph proximity measures. Other approaches make use of attribute information for link prediction. Popescul et al. [93] introduce a structured logisti...

470 | Modularity and Community Structure in Networks
- Newman
- 2006
Citation Context ...he positions. Spectral graph partitioning methods address the group detection problem by identifying an approximately minimal set of links to remove from the graph to achieve a given number of groups [82; 30]. In a related vein, Gibson et al. [50] have shown that the dominant eigenvectors of the HITS authority matrix provide a natural decomposition of web community structure. Other recent approaches for g...

443 | gSpan: Graph-Based Substructure Pattern Mining
- Yan, Han
- 2002
Citation Context .... Discovery of these patterns may be the sole purpose of the systems, or the discovered patterns may be used for graph classification (Section 9). One line of work attempts to find frequent subgraphs [54; 70; 116]. Many of these approaches exploit the Apriori property [4] from frequent item set mining. Typically, there is a candidate generation phase followed by a matching phase. Naive matching requires a subg...

415 | Topic-sensitive PageRank
- Haveliwala
- 2002
Citation Context ... on relevance. Ng et al. [83; 84] analyze the stability of PageRank and HITS to small perturbations in the link structure and present modifications to HITS that yield more stable rankings. Haveliwala [51] and Jeh and Widom [56] propose topic-sensitive PageRank algorithms that identify topically authoritative web pages efficiently at query time. Ding et al. [29] proposes a unified framework encompassin...

403 | Improved algorithms for topic distillation in a hyperlinked environment
- Henzinger, Bharat
Citation Context ...istributions of the respective random processes. Since the introduction of PageRank and HITS, a number of algorithms have been proposed that are variations on these basic themes. Bharat and Henzinger [8] and Chakrabarti et al. [17] propose modifications to HITS that exploit web page content to weight pages and links based on relevance. Ng et al. [83; 84] analyze the stability of PageRank and HITS to ...

384 | Enhanced Hypertext Categorization Using Hyperlinks
- Chakrabarti, Dom, et al.
- 1998
Citation Context ...e classification that exploit such correlations and jointly infer the categorical values associated with the objects in the graph. LBC has received considerable attention recently. Chakrabarti et al. [18] consider the problem of classifying related news items in the Reuters dataset. They were among the first to notice that exploiting class labels of related objects aids classification, whereas exploit...

349 | Scene labeling by relaxation operations
- Rosenfeld, Hummel, et al.
- 1976
Citation Context ... Markov blanket of the object to be classified. In addition to the machine learning community, the computer vision and natural language communities have also studied the LBC problem. Rosenfeld et al. [99] proposed relaxation labeling, an inference algorithm later used by Chakrabarti et al. [18] to perform link-based classification. Hummel and Zucker [53] present one of many approaches for exploring re...

347 | Discriminative Probabilistic Models for Relational Data
- Taskar, Abbeel, et al.
Citation Context ...helpful. Lafferty et al. [71] introduce conditional random fields (CRF), which extend traditional maximum entropy models for LBC in the restricted case where the data graphs are chains. Taskar et al. [107] extend Lafferty et al.’s approach [71] to the case where the data graphs are arbitrary graphs. Neville and Jensen [80] propose simple LBC algorithms to classify corporate datasets with rich schemas t...

339 | Inferring Web Communities from Link Topology
- Gibson, Kleinberg, et al.
- 1998
Citation Context ...ethods address the group detection problem by identifying an approximately minimal set of links to remove from the graph to achieve a given number of groups [82; 30]. In a related vein, Gibson et al. [50] have shown that the dominant eigenvectors of the HITS authority matrix provide a natural decomposition of web community structure. Other recent approaches for group detection use a measure of edge be...

306 | Frequent Subgraph Discovery
- Kuramochi, Karypis
- 2001
Citation Context .... Discovery of these patterns may be the sole purpose of the systems, or the discovered patterns may be used for graph classification (Section 9). One line of work attempts to find frequent subgraphs [54; 70; 116]. Many of these approaches exploit the Apriori property [4] from frequent item set mining. Typically, there is a candidate generation phase followed by a matching phase. Naive matching requires a subg...

295 | Scaling personalized web search
- Jeh, Widom
- 2003
Citation Context .... [83; 84] analyze the stability of PageRank and HITS to small perturbations in the link structure and present modifications to HITS that yield more stable rankings. Haveliwala [51] and Jeh and Widom [56] propose topic-sensitive PageRank algorithms that identify topically authoritative web pages efficiently at query time. Ding et al. [29] proposes a unified framework encompassing both PageRank and HIT...

265 | Learning to Map between Ontologies on the Semantic Web
- Doan, Madhavan, et al.
Citation Context ...e methods for discovering interesting subgraphs based on semantic information associated with the edges. There has been some other work in this area, for example Madche and Staab [77] and Doan et al. [32], but there is much more to be done. As information extraction techniques continue to improve, one area for future research is combining information extraction with techniques from link mining to help...

263 | Power and Centrality: A Family of Measures
- Bonacich
- 1987
Citation Context ...dual’s position in the network. They range in complexity from local measures such as degree centrality [43], which is simply the vertex degree, to global measures such as eigenvector/power centrality [12], which use spectral methods to characterize the importance of individuals based on their connectedness to other important individuals. In the above work, the common goal is a global ranking of object...

252 | On the Foundations of Relaxation Labeling Processes
- Hummel, Zucker
- 1983
Citation Context ... also studied the LBC problem. Rosenfeld et al. [99] proposed relaxation labeling, an inference algorithm later used by Chakrabarti et al. [18] to perform link-based classification. Hummel and Zucker [53] present one of many approaches for exploring relaxation labeling theoretically. Lafferty et al. [71] proposed CRFs for use in part-of-speech tagging, a task in natural language processing. 5. GROUP D...

235 | An Apriori-Based Algorithm for Mining Frequent Substructures from Graph Data
- Inokuchi, Washio, et al.
Citation Context .... Discovery of these patterns may be the sole purpose of the systems, or the discovered patterns may be used for graph classification (Section 9). One line of work attempts to find frequent subgraphs [54; 70; 116]. Many of these approaches exploit the Apriori property [4] from frequent item set mining. Typically, there is a candidate generation phase followed by a matching phase. Naive matching requires a subg...

230 | SimRank: A Measure of Structural-Context Similarity
- Jeh, Widom
Citation Context ... ranking of objects in a static graph produced using a specified measure. Notable variations from this theme include approaches that rank objects relative to one or more relevant objects in the graph [55; 114; 105] and methods that rank objects over time in dynamic graphs [89; 88]. Jeh and Widom [55] propose a metric for assessing the similarity of two objects based on the degree to which they link to similar o...

221 | SALSA: the stochastic approach for link-structure analysis
- Lempel, Moran
Citation Context ...rhood. This approach bears a relation to PageRank with two separate random walks—one with hub transitions and one with authority transitions—on a corresponding bipartite graph of hubs and authorities [73; 95; 84]. The hub and authority scores are the steady-state distributions of the respective random processes. Since the introduction of PageRank and HITS, a number of algorithms have been proposed that are va...

202 | Learning to Construct Knowledge Bases from
- Craven, DiPasquo, et al.
- 2000
Citation Context ...predicting the participation of actors in events [88], such as email, telephone calls and co-authorship; and predicting semantic relationships such as “advisor-of” based on web page links and content [24; 108]. Most often, some links are observed, and one is attempting to predict unobserved links, or there is a temporal aspect: a snapshot of the set of links at time t is given and the goal is to predict th...

186 | The missing link: A probabilistic model of document content and hypertext connectivity
- Cohn
Citation Context ...ce a probabilistic analogue to HITS based on probabilistic latent semantic indexing, where the model attempts to explain the link structure in terms of a small set of latent factors. Cohn and Hofmann [21] and Richardson and Domingos [98] present probabilistic models inspired by HITS and PageRank, respectively, that incorporate both content and link structure. In the domain of social network analysis (...

153 | Identity uncertainty and citation matching
- Pasula, Marthi, et al.
- 2003
Citation Context ...w the flow of reasoning between linked pair-wise decisions over multiple entity types. In addition, models have been proposed that explicitly consider links among references for collective resolution [92; 11; 25]. Pasula et al. [92] propose a generic probabilistic relational model framework for the citation matching problem. Culotta and McCallum [25] construct a conditional random field model of deduplication...

152 | Substructure discovery using minimum description length and background knowledge
- Cook, Holder
- 1994
Citation Context ...d techniques from inductive logic programming to finding frequent patterns in a toxicology domain. Another line of work focuses on efficient subgraph generation and compression-based heuristic search [22; 78]. Subdue [22], the earliest work in this area, uses an MDL-based heuristic to guide the search for subgraphs. Subdue has been used for both subgraph discovery and graph classification [23]. As another...

150 | The intelligent surfer: Probabilistic combination of link and content information
- Richardson, Domingos
Citation Context ...TS based on probabilistic latent semantic indexing, where the model attempts to explain the link structure in terms of a small set of latent factors. Cohn and Hofmann [21] and Richardson and Domingos [98] present probabilistic models inspired by HITS and PageRank, respectively, that incorporate both content and link structure. In the domain of social network analysis (SNA), LBR is a core analysis task...

150 | Email as Spectroscopy: Automated Discovery of Community Structure within Organizations
- Tyler, Wilkinson, et al.
Citation Context ...b community structure. Other recent approaches for group detection use a measure of edge betweenness, derived from Freeman’s notion of betweenness centrality [43], to identify links connecting groups [109]. Links with high edge betweenness are incrementally removed to partition the graph. In contrast to the above methods, where group assignments are deterministic, a number of approaches for group detec...

140 | Graph-Based Data Mining
- Cook, Holder
- 2000
Citation Context ...wly emerging research area that is at the intersection of the work in link analysis [58; 40], hypertext and web mining [16], relational learning and inductive logic programming [38], and graph mining [23]. We use the term link mining to put a special emphasis on the links—moving them up to first-class citizens in the data analysis endeavor. In recent years, there have been several workshop series devo...

137 | Link-based classification
- Lu, Getoor
- 2003
Citation Context ...nd Jensen [80] propose simple LBC algorithms to classify corporate datasets with rich schemas that produce graphs with heterogeneous objects, each with its own distinct set of features. Lu and Getoor [76] extend simple machine learning classifiers to perform LBC by introducing new features that measure the distribution of class labels in the Markov blanket of the object to be classified. In addition t...

127 | Iterative Classification in Relational Data
- Neville, Jensen
Citation Context ... for LBC in the restricted case where the data graphs are chains. Taskar et al. [107] extend Lafferty et al.’s approach [71] to the case where the data graphs are arbitrary graphs. Neville and Jensen [80] propose simple LBC algorithms to classify corporate datasets with rich schemas that produce graphs with heterogeneous objects, each with its own distinct set of features. Lu and Getoor [76] extend si...

126 | Learning to probabilistically identify authoritative documents
- Cohn, Chang
- 2000
Citation Context ...e. Ding et al. [29] proposes a unified framework encompassing both PageRank and HITS and presents several new ranking algorithms within this algorithm class with closed-form solutions. Cohn and Chang [20] introduce a probabilistic analogue to HITS based on probabilistic latent semantic indexing, where the model attempts to explain the link structure in terms of a small set of latent factors. Cohn and ...

119 | Reference reconciliation in complex information spaces
- Dong, Halevy, et al.
- 2005
Citation Context ...ributes of linked references are considered and different resolution decisions are still taken independently. In contrast, collective entity resolution approaches have also been proposed in databases [9; 34], where one resolution decision affects another if they are linked. Bhattacharya and Getoor [9; 10] propose different measures for linkage similarity in graphs and show how these can be combined with ...

116 | Finding frequent substructures in chemical compounds
- Dehaspe, Toivonen, et al.
- 1998
Citation Context ...ng DFS on the search tree defined by this lexicographic ordering. Other approaches come from the inductive logic programming (ILP) community [79; 72]. One early success was the work of Dehaspe et al. [27], who applied techniques from inductive logic programming to finding frequent patterns in a toxicology domain. Another line of work focuses on efficient subgraph generation and compression-based heuri...

116 | Estimation and prediction for stochastic blockstructures
- Nowicki, Snijders
- 2001
Citation Context ...lockmodeling from SNA. In stochastic blockmodeling, the observed social network is assumed to be a realization from a pair-dependent stochastic blockmodel [112; 86]. Positions for the individuals in the network are treated as IID random variables, and relational links of a given type between two individuals are random variables dependent solely on the positions ...

113 | A survey of kernels for structured data
- Gärtner
- 2003
Citation Context ...the walks on the graphs [44; 60]. Gärtner [44] counts walks with equal initial and terminal labels, whereas Kashima [60] looks at the probability of random walks with equal label sequences. Gärtner [45] surveys kernel methods for structured data. 10. GENERATIVE MODELS FOR GRAPHS Generative models for a range of graph and dependency types have been studied extensively in the social network analysis c...

111 | Eliminating fuzzy duplicates in data warehouses
- Ananthakrishna, Chaudhuri, et al.
- 2002
Citation Context ...al references in geo-spatial data, or co-occurrence links between name references in natural language documents. The use of links for resolution was first explored in databases. Ananthakrishna et al. [6] introduce a method for deduplication using links in data warehouse applications where there is a dimensional hierarchy over the link relations. More recently, Kalashnikov et al. [59] enhance feature-...

111 | Markov Random Fields: Theory and Application
- Chellappa, Jain, et al.
- 1993
Citation Context ... number of approaches define a single probabilistic model over the entire link graph, labels, and edges. These joint models of network structure are often based on models such as Markov random fields [19]. In the simplest case, where there is a set of objects O, with attributes X, and edges E among the objects, the MRF models a joint distribution over the set of edges E, P(E), or a distribution condi...

107 | Link Prediction in Relational Data
- Taskar, Wong, et al.
- 2003
Citation Context ...predicting the participation of actors in events [88], such as email, telephone calls and co-authorship; and predicting semantic relationships such as “advisor-of” based on web page links and content [24; 108]. Most often, some links are observed, and one is attempting to predict unobserved links, or there is a temporal aspect: a snapshot of the set of links at time t is given and the goal is to predict th...

106 | Stable algorithms for link analysis
- Ng, Zheng, et al.
- 2001
Citation Context ...rhood. This approach bears a relation to PageRank with two separate random walks—one with hub transitions and one with authority transitions—on a corresponding bipartite graph of hubs and authorities [73; 95; 84]. The hub and authority scores are the steady-state distributions of the respective random processes. Since the introduction of PageRank and HITS, a number of algorithms have been proposed that are va...

102 | Learning probabilistic models of link structure
- Getoor, Friedman, et al.
- 2002
Citation Context ...onal representations, are possible, such as Relational Markov Networks [108] and, more recently, Markov Logic Networks [33]. Models based on directed graphical models are also possible. Getoor et al. [47] describe several approaches for handling link uncertainty in probabilistic relational models. A discerning feature of these latter approaches is that they perform probabilistic inference to make infe...

98 | Algorithms for estimating relative importance in networks
- White
- 2003
Citation Context ... ranking of objects in a static graph produced using a specified measure. Notable variations from this theme include approaches that rank objects relative to one or more relevant objects in the graph [55; 114; 105] and methods that rank objects over time in dynamic graphs [89; 88]. Jeh and Widom [55] propose a metric for assessing the similarity of two objects based on the degree to which they link to similar o...

81 | Structure-activity relationships derived by machine learning: the use of atoms and their bond connectives to predict mutagenicity by inductive logic programming
- King, Muggleton, et al.
- 1996
Citation Context ...re used for transforming the graph data into data represented as a single table, and then traditional classifiers are used for classifying the instances. As an example of an ILP approach, King et al. [63] first map the graph data describing mutagenesis into a relational representation. Their logical representation uses relations such as vertex(graphId,VertexId,VertexLabel, VertexAttributes) and edge(g...

78 | Mining the Web
- Chakrabarti
- 2003
Citation Context ... such as degree and connectivity can be important indicators. Link mining is a newly emerging research area that is at the intersection of the work in link analysis [58; 40], hypertext and web mining [16], relational learning and inductive logic programming [38], and graph mining [23]. We use the term link mining to put a special emphasis on the links—moving them up to first-class citizens in the data...

74 | Models and Methods in Social Network Analysis
- Carrington, Scott, et al.
- 2005
Citation Context ...es that are more general than Markov graphs have been introduced as well, along with models for multiple object and link types and dynamic networks with a varying link structure and number of objects [14; 52]. In recent years, significant attention has focused on studying the structural properties of networks such as the World Wide Web, online social networks, communication networks, citation networks, an...

74 | Markov logic: A unifying framework for statistical relational learning
- Domingos, Richardson
Citation Context ...nditioned on the attributes of the nodes, P(E|X). Richer models, based on relational representations, are possible, such as Relational Markov Networks [108] and, more recently, Markov Logic Networks [33]. Models based on directed graphical models are also possible. Getoor et al. [47] describe several approaches for handling link uncertainty in probabilistic relational models. A discerning feature of ...
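Several of the citation contexts above describe PageRank as a random walk in which the surfer follows links and occasionally jumps to a new web page. The following is a minimal power-iteration sketch of that idea, not the formulation from the cited papers. The function name `pagerank`, the damping value of 0.85, the fixed iteration count, the uniform handling of pages without out-links, and the toy graph are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate the stationary distribution of a random surfer.

    links maps each page to the list of pages it links to.
    damping is the assumed probability of following a link;
    with probability 1 - damping the surfer jumps to a random page.
    """
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Every page receives the random-jump share first.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Split this page's rank evenly over its out-links.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no out-links jumps uniformly.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: "c" is linked to by every other page.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
```

In this toy graph the page `c` ends up with the largest score and `d`, which no page links to, with the smallest. The scores sum to 1 because each iteration redistributes the full probability mass.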