Random-walk computation of similarities between nodes of a graph, with application to collaborative recommendation
IEEE Transactions on Knowledge and Data Engineering
"... ABSTRACT This work presents a new perspective on characterizing the similarity between elements of a database or, more generally, nodes of a weighted, undirected, graph. It is based on a Markov-chain model of random walk through the database. More precisely, we compute quantities (the average commu ..."
Abstract
-
Cited by 194 (19 self)
- Add to MetaCart
(Show Context)
This work presents a new perspective on characterizing the similarity between elements of a database or, more generally, nodes of a weighted, undirected graph. It is based on a Markov-chain model of a random walk through the database. More precisely, we compute quantities (the average commute time, the pseudoinverse of the Laplacian matrix of the graph, etc.) that provide similarities between any pair of nodes and have the nice property of increasing when the number of paths connecting those elements increases and when the "length" of those paths decreases. It turns out that the square root of the average commute time is a Euclidean distance and that the pseudoinverse of the Laplacian matrix is a kernel (it contains inner products closely related to commute times). A procedure for computing the subspace projection of the node vectors of the graph that preserves as much variance as possible in terms of the commute-time distance, a principal components analysis (PCA) of the graph, is also introduced. This graph PCA provides a nice interpretation of the "Fiedler vector", widely used for graph partitioning. The model is evaluated on a collaborative-recommendation task where suggestions are made about which movies people should watch based on what they watched in the past. Experimental results on the MovieLens database show that the Laplacian-based similarities perform well in comparison with other methods. The model, which fits nicely into the so-called "statistical relational learning" framework, could also be used to compute document or word similarities and, more generally, could be applied to machine-learning and pattern-recognition tasks involving a database.
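The commute-time quantities above have a simple closed form: writing l+_ij for the entries of the Laplacian pseudoinverse and V_G for the volume of the graph (the sum of all degrees), the average commute time is c(i, j) = V_G (l+_ii + l+_jj - 2 l+_ij). The following NumPy sketch of that computation is ours, not the authors' code; the three-node path graph is a toy illustration.

```python
import numpy as np

def commute_time_distances(A):
    """Average commute times for a weighted, undirected graph.

    A is an (n, n) symmetric nonnegative adjacency matrix. Returns C with
    C[i, j] = average commute time between nodes i and j, computed from
    the Moore-Penrose pseudoinverse of the graph Laplacian.
    """
    degrees = A.sum(axis=1)
    L = np.diag(degrees) - A        # graph Laplacian L = D - A
    L_plus = np.linalg.pinv(L)      # pseudoinverse of the Laplacian
    volume = degrees.sum()          # V_G: sum of all node degrees
    d = np.diag(L_plus)
    # c(i, j) = V_G * (l+_ii + l+_jj - 2 * l+_ij)
    return volume * (d[:, None] + d[None, :] - 2.0 * L_plus)

# Toy path graph 0 - 1 - 2: adjacent nodes get commute time 4, the ends 8.
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
C = commute_time_distances(A)
print(np.sqrt(C))  # square roots of commute times are Euclidean distances
```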
Probabilistic Author-Topic Models for Information Discovery
The Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2004
"... We propose a new unsupervised learning technique for extracting information from large text collections. We model documents as if they were generated by a two-stage stochastic process. Each author is represented by a probability distribution over topics, and each topic is represented as a probabilit ..."
Abstract
-
Cited by 173 (11 self)
- Add to MetaCart
We propose a new unsupervised learning technique for extracting information from large text collections. We model documents as if they were generated by a two-stage stochastic process. Each author is represented by a probability distribution over topics, and each topic is represented as a probability distribution over words. The words in a multi-author paper are assumed to be the result of a mixture of the authors' topic mixtures. The topic-word and author-topic distributions are learned from data in an unsupervised manner using a Markov chain Monte Carlo algorithm. We apply the methodology to a large corpus of 160,000 abstracts and 85,000 authors from the well-known CiteSeer digital library and learn a model with 300 topics. We discuss in detail the interpretation of the results discovered by the system, including specific topic and author models, rankings of authors by topic and of topics by author, significant trends in the computer science literature between 1990 and 2002, parsing of abstracts by topics and authors, and detection of unusual papers by specific authors. An online query interface to the model, which allows interactive exploration of author-topic models for corpora such as CiteSeer, is also discussed.
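A minimal sketch of the two-stage generative process described above, in Python. The toy dimensions, the random placeholder distributions theta and phi, and the per-word uniform choice of a responsible author are our illustrative assumptions, not values or code from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions (not from the paper): 3 authors, 4 topics, 10 words.
n_authors, n_topics, n_words = 3, 4, 10
theta = rng.dirichlet(np.ones(n_topics), size=n_authors)  # author -> topic dist.
phi = rng.dirichlet(np.ones(n_words), size=n_topics)      # topic -> word dist.

def generate_document(author_ids, doc_length):
    """Two-stage process sketched in the abstract: for each word, pick one
    of the paper's authors, draw a topic from that author's topic
    distribution, then draw the word from that topic's word distribution."""
    words = []
    for _ in range(doc_length):
        a = rng.choice(author_ids)            # responsible author
        z = rng.choice(n_topics, p=theta[a])  # topic from the author's mixture
        w = rng.choice(n_words, p=phi[z])     # word from the topic
        words.append(w)
    return words

print(generate_document(author_ids=[0, 2], doc_length=15))
```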
Closure-Tree: An Index Structure for Graph Queries
2006
Cited by 92 (1 self)
Graphs have become popular for modeling structured data. As a result, graph queries are becoming common, and graph indexing has come to play an essential role in query processing. We introduce the concept of a graph closure, a generalized graph that represents a number of graphs. Our indexing technique, called Closure-tree, organizes graphs hierarchically, where each node summarizes its descendants by a graph closure. Closure-tree can efficiently support both subgraph queries and similarity queries. Subgraph queries find graphs that contain a specific subgraph, whereas similarity queries find graphs that are similar to a query graph. For subgraph queries, we propose a technique called pseudo subgraph isomorphism, which approximates subgraph isomorphism with high accuracy. For similarity queries, we measure graph similarity through edit distance using heuristic graph-mapping methods. We implement two kinds of similarity queries: K-NN queries and range queries. Our experiments on chemical compounds and synthetic graphs show that for subgraph queries, Closure-tree outperforms existing techniques by up to two orders of magnitude in terms of candidate answer-set size and index size. For similarity queries, our experiments validate the quality and efficiency of the presented algorithms.
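A hedged sketch of the graph-closure idea: under a fixed vertex mapping, the closure records at each vertex and edge the set of labels seen in either graph, with None standing in for a missing counterpart. The representation (label dictionaries, the mapping argument) is our illustrative choice, not the paper's data structure.

```python
def graph_closure(g1, g2, mapping):
    """g1, g2: {'vertices': {v: label}, 'edges': {(u, v): label}}.
    mapping: dict from g1 vertices to g2 vertices (or None if unmatched).
    Returns a closure graph whose labels are sets of possible values.
    Unmatched g2 vertices would be handled analogously (omitted for brevity)."""
    vclo = {}
    for v, lbl in g1['vertices'].items():
        w = mapping.get(v)
        other = g2['vertices'].get(w) if w is not None else None
        vclo[v] = {lbl, other}                    # labels seen in either graph
    eclo = {}
    for (u, v), lbl in g1['edges'].items():
        mu, mv = mapping.get(u), mapping.get(v)
        other = g2['edges'].get((mu, mv)) or g2['edges'].get((mv, mu))
        eclo[(u, v)] = {lbl, other}
    # g2 edges with no counterpart in g1 appear with a None alternative.
    matched = {(mapping.get(u), mapping.get(v)) for (u, v) in g1['edges']}
    for (u, v), lbl in g2['edges'].items():
        if (u, v) not in matched and (v, u) not in matched:
            eclo[('g2', u, v)] = {lbl, None}
    return {'vertices': vclo, 'edges': eclo}

# Toy example with chemistry-flavored labels (purely illustrative).
g1 = {'vertices': {1: 'C', 2: 'O'}, 'edges': {(1, 2): 'single'}}
g2 = {'vertices': {'a': 'C', 'b': 'N'}, 'edges': {('a', 'b'): 'double'}}
print(graph_closure(g1, g2, {1: 'a', 2: 'b'}))
```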
Analysis and Visualization of Network Data using JUNG
"... The JUNG (Java Universal Network/Graph) Framework is a free, open-source software library that provides a common and extendible language for the manipulation, analysis, and visualization of data that can be represented as a graph or network. It is written in the Java programming language, allowing J ..."
Abstract
-
Cited by 86 (0 self)
- Add to MetaCart
The JUNG (Java Universal Network/Graph) Framework is a free, open-source software library that provides a common and extendible language for the manipulation, analysis, and visualization of data that can be represented as a graph or network. It is written in the Java programming language, allowing JUNG-based applications to make use of the extensive built-in capabilities of the Java Application Programming Interface (API), as well as those of other existing third-party Java libraries. We describe the design, and some details of the implementation, of the JUNG architecture, and provide illustrative examples of its use.
The slashdot zoo: Mining a social network with negative edges
In WWW, 2009
Link Mining: A Survey
SIGKDD Explorations, Special Issue on Link Mining, 2005
"... Many datasets of interest today are best described as a linked collection of interrelated objects. These may represent homogeneous networks, in which there is a single-object type and link type, or richer, heterogeneous networks, in which there may be multiple object and link types (and possibly oth ..."
Abstract
-
Cited by 84 (0 self)
- Add to MetaCart
(Show Context)
Many datasets of interest today are best described as a linked collection of interrelated objects. These may represent homogeneous networks, in which there is a single object type and a single link type, or richer, heterogeneous networks, in which there may be multiple object and link types (and possibly other semantic information). Examples of homogeneous networks include single-mode social networks, such as people connected by friendship links, or the WWW, a collection of linked web pages. Examples of heterogeneous networks include those in medical domains describing patients, diseases, treatments, and contacts, or in bibliographic domains describing publications, authors, and venues. Link mining refers to data mining techniques that explicitly consider these links when building predictive or descriptive models of the linked data. Commonly addressed link mining tasks include object ranking, group detection, collective classification, link prediction, and subgraph discovery. While network analysis has been studied in depth in particular areas such as social network analysis, hypertext mining, and web analysis, only recently has there been a cross-fertilization of ideas among these different communities. This is an exciting, rapidly expanding area. In this article, we review some of the common emerging themes.
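As an illustration of one task on that list, link prediction, here is a classic common-neighbors baseline in Python. It is a standard technique in the area, not a method prescribed by the survey; the toy friendship graph is invented.

```python
import itertools

def common_neighbor_scores(adj):
    """adj: {node: set of neighbors} for an undirected graph.
    Returns {(u, v): score} for node pairs not already linked,
    scoring each candidate pair by its number of common neighbors."""
    scores = {}
    for u, v in itertools.combinations(sorted(adj), 2):
        if v not in adj[u]:                        # only unlinked pairs
            scores[(u, v)] = len(adj[u] & adj[v])
    return scores

friends = {
    'ann': {'bob', 'cara'},
    'bob': {'ann', 'cara', 'dan'},
    'cara': {'ann', 'bob'},
    'dan': {'bob'},
}
# Rank missing links: (ann, dan) and (cara, dan) each share one neighbor, bob.
print(sorted(common_neighbor_scores(friends).items(), key=lambda kv: -kv[1]))
```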
Exploiting relationships for domain-independent data cleaning
2005
Cited by 81 (24 self)
In this paper, we address the problem of reference disambiguation. Specifically, we consider a situation where entities in the database are referred to using descriptions (e.g., a set of instantiated attributes). The objective of reference disambiguation is to identify the unique entity to which each description corresponds. The key difference between the approach we propose (called RelDC) and traditional techniques is that RelDC analyzes not only object features but also inter-object relationships to improve disambiguation quality. Our extensive experiments over two real datasets as well as synthetic datasets show that the analysis of relationships significantly improves the quality of the results.
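A loose illustration of the idea, not RelDC itself: when a description matches several candidate entities, prefer the candidate that is better connected, in the entity-relationship graph, to the other entities mentioned in the same context. The toy graph and the use of average shortest-path distance as a stand-in for RelDC's connection-strength measure are our assumptions.

```python
from collections import deque

def distance(graph, src, dst):
    """Breadth-first shortest-path length in {node: set(neighbors)}."""
    seen, frontier, d = {src}, deque([src]), {src: 0}
    while frontier:
        u = frontier.popleft()
        if u == dst:
            return d[u]
        for v in graph[u] - seen:
            seen.add(v)
            d[v] = d[u] + 1
            frontier.append(v)
    return float('inf')

def disambiguate(graph, candidates, context_entities):
    """Pick the candidate closest (on average) to the context entities.
    A crude proxy for relationship analysis, not RelDC's actual measure."""
    def avg_dist(c):
        return sum(distance(graph, c, e) for e in context_entities) / len(context_entities)
    return min(candidates, key=avg_dist)

# Hypothetical graph: a reference 'J. Smith' could be john_smith or
# jane_smith; a co-author in the context (alice) points to john_smith.
graph = {
    'john_smith': {'mit', 'paper1'},
    'jane_smith': {'ucla'},
    'mit': {'john_smith', 'alice'},
    'ucla': {'jane_smith'},
    'paper1': {'john_smith', 'alice'},
    'alice': {'mit', 'paper1'},
}
print(disambiguate(graph, ['john_smith', 'jane_smith'], ['alice']))
```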
Domain-independent data cleaning via analysis of entity-relationship graph
ACM Transactions on Database Systems (TODS), 2006
"... In this article, we address the problem of reference disambiguation. Specifically, we consider a situation where entities in the database are referred to using descriptions (e.g., a set of instantiated attributes). The objective of reference disambiguation is to identify the unique entity to which e ..."
Abstract
-
Cited by 64 (23 self)
- Add to MetaCart
In this article, we address the problem of reference disambiguation. Specifically, we consider a situation where entities in the database are referred to using descriptions (e.g., a set of instantiated attributes). The objective of reference disambiguation is to identify the unique entity to which each description corresponds. The key difference between the approach we propose (called RELDC) and traditional techniques is that RELDC analyzes not only object features but also inter-object relationships to improve disambiguation quality. Our extensive experiments over two real datasets as well as synthetic datasets show that the analysis of relationships significantly improves the quality of the results.
ToppGene Suite for gene list enrichment analysis and candidate gene prioritization
Nucleic Acids Research, 2009
"... doi:10.1093/nar/gkp427 ..."
(Show Context)
Co-ranking Authors and Documents in a Heterogeneous Network
In ICDM, 2007
"... Abstract ..."
(Show Context)