RankClus: Integrating clustering with ranking for heterogeneous information network analysis
- In EDBT’09
- Cited by 17 (13 self)
As information networks become ubiquitous, extracting knowledge from information networks has become an important task. Both ranking and clustering can provide overall views on information network data, and each has been a hot topic by itself. However, ranking objects globally without considering which clusters they belong to often leads to dumb results, e.g., ranking database and computer architecture conferences together may not make much sense. Similarly, clustering a huge number of objects (e.g., thousands of authors) in one huge cluster without distinction is dull as well. In this paper, we address the problem of generating clusters for a specified type of objects, as well as ranking information for all types of objects based on these clusters in a multityped (i.e., heterogeneous) information network. A novel
A Probabilistic Framework for Relational Clustering
- KDD'07
- Cited by 14 (0 self)
Relational clustering has attracted more and more attention due to its phenomenal impact in various important applications which involve multi-type interrelated data objects, such as Web mining, search marketing, bioinformatics, citation analysis, and epidemiology. In this paper, we propose a probabilistic model for relational clustering, which also provides a principled framework to unify various important clustering tasks including traditional attribute-based clustering, semi-supervised clustering, co-clustering and graph clustering. The proposed model seeks to identify cluster structures for each type of data object and interaction patterns between different types of objects. Under this model, we propose parametric hard and soft relational clustering algorithms under a large number of exponential family distributions. The algorithms are applicable to relational data of various structures and at the same time unify a number of state-of-the-art clustering algorithms: co-clustering algorithms, k-partite graph clustering, and semi-supervised clustering based on hidden Markov random fields.
Exploring the power of heuristics and links in multi-relational data mining
- In Foundations of Intelligent Systems (ISMIS), 2008
- Cited by 3 (0 self)
Relational databases are the most popular repository for structured data, and are thus one of the richest sources of knowledge in the world. Because of the complexity of relational data, it is a challenging task to design efficient and scalable data mining approaches in relational databases. In this paper we discuss two methodologies to address this issue. The first methodology is to use heuristics to guide the data mining procedure, in order to avoid aimless, exhaustive search in relational databases. The second methodology is to assign a certain property to each object in the database, and let different objects interact with each other along the links. Experiments show that both approaches achieve high efficiency and accuracy in real applications.
Fast single-pair simrank computation
- In Proc. of the SIAM Intl. Conf. on Data Mining (SDM2010), 2010
- Cited by 2 (1 self)
SimRank is an intuitive and effective measure for link-based similarity that scores similarity between two nodes as the first-meeting probability of two random surfers, based on the random surfer model. However, when a user queries the similarity of a given node-pair based on SimRank, the existing approaches need to compute the similarities of other node-pairs beforehand, which we call an all-pair style. In this paper, we propose a Single-Pair SimRank approach. Without accuracy loss, this approach performs an iterative computation to obtain the similarity of a single node-pair. The time cost of our Single-Pair SimRank is always less than that of All-Pair SimRank, and the approach is clearly more efficient when we only need to assess the similarity of one or a few node-pairs. We confirm the accuracy and efficiency of our approach in extensive experimental studies over synthetic and real datasets.
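For orientation, the all-pair style that the abstract contrasts itself with can be sketched as below. This is a minimal illustration of the standard SimRank recurrence (similarity of two distinct nodes is a decayed average of the similarities of their in-neighbors), not the single-pair algorithm the paper proposes. The function name `simrank`, the decay value 0.8, the iteration count, and the toy graph are all assumptions made for the example.

```python
from itertools import product


def simrank(in_neighbors, c=0.8, iterations=10):
    """Naive all-pair SimRank (baseline sketch, not the paper's method).

    in_neighbors maps each node to the list of nodes that link to it.
    c is the decay factor; 0.8 is an assumed, commonly used value.
    """
    nodes = list(in_neighbors)
    # Start from the identity: a node is fully similar only to itself.
    sim = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, repeat=2)}
    for _ in range(iterations):
        new_sim = {}
        for a, b in product(nodes, repeat=2):
            if a == b:
                new_sim[(a, b)] = 1.0
                continue
            ia, ib = in_neighbors[a], in_neighbors[b]
            if not ia or not ib:
                # A node without in-links has similarity 0 to other nodes.
                new_sim[(a, b)] = 0.0
                continue
            total = sum(sim[(i, j)] for i in ia for j in ib)
            new_sim[(a, b)] = c * total / (len(ia) * len(ib))
        sim = new_sim
    return sim


# Toy graph (hypothetical): node "u" links to both "a" and "b".
graph = {"u": [], "a": ["u"], "b": ["u"]}
scores = simrank(graph)
print(scores[("a", "b")])
```

Note that every node pair is updated in every iteration even if only `scores[("a", "b")]` is wanted, which is the cost the single-pair approach aims to avoid.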
Towards Scalable Real-Time Entity Resolution using a Similarity-Aware Inverted Index Approach
- 2008
- Cited by 2 (1 self)
Most research into entity resolution (also known as record linkage or data matching) has concentrated on the quality of the matching results. In this paper, we focus on matching time and scalability, with the aim of achieving large-scale real-time entity resolution. Traditional entity resolution techniques have assumed the matching of two static databases. In our networked and online world, however, it is becoming increasingly important for many organisations to be able to conduct entity resolution between a collection of often very large databases and a stream of query or update records. The matching should be done in (near) real-time, and be as automatic and accurate as possible, returning a ranked list of matched records for each given query record. This task therefore becomes similar to querying large document collections, as done for example by Web search engines, but based on a different type of document: structured database records that, for example, contain personal information such as names and addresses. In this paper, we investigate inverted indexing techniques, as commonly used in Web search engines, and employ them for real-time entity resolution. We present two variations of the traditional inverted index approach, aimed at facilitating fast approximate matching. We show encouraging initial results on large real-world data sets, with the inverted index approaches being up to one hundred times faster than the traditionally used standard blocking approach. However, this improved matching speed currently comes at a cost, in that matching quality for larger data sets can be lower than when standard blocking is used, and thus more work is required.
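The traditional inverted index that the abstract takes as its starting point can be sketched roughly as follows. This is a simplified illustration with exact value matching only. It does not implement the paper's two similarity-aware variations, and the names `build_index`, `query`, and the sample records are hypothetical.

```python
from collections import Counter, defaultdict


def build_index(records):
    """Map each lower-cased attribute value to the ids of records containing it."""
    index = defaultdict(set)
    for rec_id, values in records.items():
        for value in values:
            index[value.lower()].add(rec_id)
    return index


def query(index, values):
    """Return (record id, shared-value count) pairs, best match first."""
    hits = Counter()
    for value in values:
        # .get avoids inserting empty entries into the defaultdict.
        for rec_id in index.get(value.lower(), ()):
            hits[rec_id] += 1
    return hits.most_common()


# Hypothetical records: id -> attribute values (given name, surname, city).
records = {
    1: ["john", "smith", "sydney"],
    2: ["jon", "smith", "perth"],
    3: ["mary", "jones", "sydney"],
}
index = build_index(records)
ranked = query(index, ["John", "Smith", "Canberra"])
print(ranked)
```

Because lookups are exact, the variant spelling "jon" is not matched to "john" here. Handling such approximate matches quickly is the gap the abstract's variations are aimed at.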
Research Challenges for Data Mining in Science and Engineering
With the rapid development of computer and information technology in the last several decades, an enormous amount of data in science and engineering has been and will continue to be generated on a massive scale, either stored in gigantic storage devices or flowing into and out of systems in the form of data streams. Moreover, such data has been made widely available, e.g., via the Internet. Such a tremendous amount of data, on the order of tera- to petabytes, has fundamentally changed science and engineering, transforming many disciplines from data-poor to increasingly data-rich, and calling for new, data-intensive methods to conduct research in science and engineering. In this paper, we discuss the research challenges in science and engineering, from the data mining perspective, with a focus on the following issues: (1) information network analysis, (2) discovery, usage, and understanding of patterns and knowledge, (3) stream data mining, (4) mining moving object data, RFID data, and data from sensor networks, (5) spatiotemporal and multimedia data mining, (6) mining text, Web, and other unstructured data, (7) data cube-oriented multidimensional online analytical mining, (8) visual data mining, and (9) data mining by integration of sophisticated scientific and engineering domain knowledge.
Mining Research Communities in Bibliographical Data
Extracting information from very large collections of structured, semistructured or even unstructured data can be a considerable challenge when much of the hidden information is implicit within relationships among entities in the data. Social networks are such data collections in which relationships play a vital role in the knowledge these networks can convey. A bibliographic database is an essential tool for the research community, yet finding and making use of relationships contained within such a social network is difficult. In this paper we introduce DBconnect, a prototype that exploits the social network coded within the DBLP database by drawing on a new random walk approach to reveal interesting knowledge about the research community and even recommend collaborations.
BibNetMiner: Mining Bibliographic Information Networks
Online bibliographic databases, such as DBLP in computer science and PubMed in medical sciences, contain abundant information about research publications in different fields. Each such database forms a gigantic information network (hence called BibNet), connecting in complex ways research papers, authors, conferences/journals, and possibly citation information as well, and provides fertile ground for information network analysis. Our BibNetMiner is designed for sophisticated information network mining on such bibliographic databases. In this demo, we will take the DBLP database as an example and demonstrate several attractive functions of BibNetMiner, including clustering, ranking and profiling of conferences and authors based on research subfields. A user-friendly, visualization-enhanced interface will be provided to facilitate interactive exploration of a bibliographic database. This project will serve as an example to demonstrate the power of links in information network mining. Since the dataset is large and the network is heterogeneous, such a study will benefit research on the analysis of massive heterogeneous information networks.

