Results 1 - 10 of 32
Collective entity resolution in relational data. ACM Transactions on Knowledge Discovery from Data (TKDD), 2006.
"... Many databases contain uncertain and imprecise references to real-world entities. The absence of identifiers for the underlying entities often results in a database which contains multiple references to the same entity. This can lead not only to data redundancy, but also inaccuracies in query proces ..."
Cited by 146 (12 self)
Many databases contain uncertain and imprecise references to real-world entities. The absence of identifiers for the underlying entities often results in a database that contains multiple references to the same entity. This can lead not only to data redundancy, but also to inaccuracies in query processing and knowledge extraction. These problems can be alleviated through the use of entity resolution. Entity resolution involves discovering the underlying entities and mapping each database reference to these entities. Traditionally, entities are resolved using pairwise similarity over the attributes of references. However, there is often additional relational information in the data. Specifically, references to different entities may co-occur. In these cases, collective entity resolution, in which entities for co-occurring references are determined jointly rather than independently, can improve entity resolution accuracy. We propose a novel relational clustering algorithm that uses both attribute and relational information for determining the underlying domain entities, and we give an efficient implementation. We investigate the impact that different relational similarity measures have on entity resolution quality. We evaluate our collective entity resolution algorithm on multiple real-world databases and show that it improves entity resolution performance over both attribute-based baselines and over algorithms that consider relational information but do not resolve entities collectively. In addition, we perform detailed experiments on synthetically generated data to identify data characteristics that favor collective relational resolution over purely attribute-based algorithms.
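The core idea, merging reference clusters by a similarity that mixes attribute agreement with overlap among the clusters of co-occurring references, can be illustrated with a small greedy sketch. This is not the paper's algorithm: the toy data, the string-similarity measure, the Jaccard neighborhood score, the weight alpha, and the merge threshold are all illustrative assumptions.

from difflib import SequenceMatcher

# Toy references: (name attribute, ids of co-occurring references, e.g. co-authors on a paper).
refs = {
    0: ("J. Smith", {1}),
    1: ("A. Kumar", {0}),
    2: ("John Smith", {3}),
    3: ("A. Kumar", {2}),
}

clusters = {i: {i} for i in refs}      # each reference starts in its own cluster
assign = {i: i for i in refs}          # reference id -> current cluster id

def attr_sim(c1, c2):
    """Average pairwise name similarity between two clusters."""
    pairs = [(refs[a][0], refs[b][0]) for a in clusters[c1] for b in clusters[c2]]
    return sum(SequenceMatcher(None, x, y).ratio() for x, y in pairs) / len(pairs)

def rel_sim(c1, c2):
    """Jaccard overlap of the clusters that co-occurring references currently belong to."""
    n1 = {assign[r] for a in clusters[c1] for r in refs[a][1]}
    n2 = {assign[r] for b in clusters[c2] for r in refs[b][1]}
    return len(n1 & n2) / len(n1 | n2) if n1 | n2 else 0.0

def combined_sim(c1, c2, alpha=0.5):
    """Collective similarity: attribute similarity blended with relational similarity."""
    return (1 - alpha) * attr_sim(c1, c2) + alpha * rel_sim(c1, c2)

THRESHOLD = 0.5                        # illustrative merge threshold
while True:
    live = list(clusters)
    best = max(((combined_sim(a, b), a, b) for i, a in enumerate(live) for b in live[i + 1:]),
               default=(0.0, None, None))
    score, a, b = best
    if a is None or score < THRESHOLD:
        break
    clusters[a] |= clusters[b]         # merge the most similar pair of clusters
    for r in clusters[b]:
        assign[r] = a
    del clusters[b]

print(clusters)

In this toy run the two identical "A. Kumar" references merge first on attributes alone, after which the relational term pushes the two differently spelled Smith references over the threshold.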
Syntactic identifier conciseness and consistency. In: SCAM ’06, IEEE CS Press, 2006.
"... Readers of programs have two main sources of domain information: identifier names and comments. It is therefore important for the identifier names (as well as comments) to communicate clearly the concepts that they are meant to represent. Deißenböck and Pizka recently introduced rules for concise an ..."
Cited by 16 (2 self)
Readers of programs have two main sources of domain information: identifier names and comments. It is therefore important for identifier names (as well as comments) to communicate clearly the concepts they are meant to represent. Deißenböck and Pizka recently introduced rules for concise and consistent variable naming. One requirement of their approach is an expert-provided mapping from identifiers to concepts. An approach to the concise and consistent naming of variables that does not require any additional information (e.g., a mapping) is presented. Experiments with the resulting syntactic rules, run over a pool of 48 million lines of code, illustrate that violations of the syntactic pattern exist. Two case studies show that three-quarters of the violations uncovered are “real”; that is, they would be identified using a concept mapping. Techniques for reducing the number of false positives are also presented. Finally, two related studies show that evolution does not introduce rule violations and that programmers tend to use a rather limited vocabulary.
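The flavor of a purely syntactic check can be sketched without any concept mapping: split identifiers into soft words and flag pairs where one identifier's word sequence is a proper subsequence of the other's, a hint that two names may denote the same concept at different levels of conciseness. The splitting rule and the subsequence criterion below are illustrative assumptions, not the rule set from the paper.

import re
from itertools import combinations

def split_words(identifier):
    """Split an identifier into lowercase soft words (underscores and camelCase)."""
    words = []
    for part in re.split(r"_+", identifier):
        words += re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", part)
    return [w.lower() for w in words if w]

def is_subsequence(short, long):
    """True if `short` appears in `long` in order (words may be dropped in between)."""
    it = iter(long)
    return all(any(w == x for x in it) for w in short)

def consistency_violations(identifiers):
    """Flag identifier pairs where one word sequence is a proper subsequence of the other,
    a purely syntactic hint that two names may refer to the same concept."""
    pairs = []
    for a, b in combinations(identifiers, 2):
        wa, wb = split_words(a), split_words(b)
        if wa != wb and (is_subsequence(wa, wb) or is_subsequence(wb, wa)):
            pairs.append((a, b))
    return pairs

print(consistency_violations(["fileName", "file_name_buffer", "maxFileNameLength", "userId"]))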
Graph-based Relational Learning with Application to Security, 2005.
"... We describe an approach to learning patterns in relational data represented as a graph. The approach, implemented in the Subdue system, searches for patterns that maximally compress the input graph. Subdue can be used for supervised learning, as well as unsupervised pattern discovery and clusterin ..."
Cited by 12 (6 self)
We describe an approach to learning patterns in relational data represented as a graph. The approach, implemented in the Subdue system, searches for patterns that maximally compress the input graph. Subdue can be used for supervised learning, as well as unsupervised pattern discovery and clustering. We apply Subdue in domains related to homeland security and social network analysis.
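A crude stand-in for the compression-based scoring that guides this kind of search, counting nodes and edges rather than a full minimum description length encoding and taking the pattern's instances as given rather than discovering them, might look like the following sketch; the toy graph and pattern are assumptions for illustration.

def graph_size(nodes, edges):
    """Very rough 'description length': one unit per node plus one per edge."""
    return len(nodes) + len(edges)

def compression_value(nodes, edges, pattern_nodes, pattern_edges, instances):
    """Value of a substructure = size(G) / (size(S) + size(G compressed by S)).

    Each non-overlapping instance of the pattern is replaced by a single node,
    so the compressed graph loses the instance's nodes and internal edges and
    gains one node per instance."""
    size_g = graph_size(nodes, edges)
    size_s = graph_size(pattern_nodes, pattern_edges)
    removed = sum(len(inst_nodes) + len(inst_edges) for inst_nodes, inst_edges in instances)
    size_compressed = size_g - removed + len(instances)
    return size_g / (size_s + size_compressed)

# Toy graph: a triangle pattern occurring twice, plus one extra node and edge.
nodes = list(range(7))
edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (5, 6)]
pattern_nodes, pattern_edges = [0, 1, 2], [(0, 1), (1, 2), (2, 0)]
instances = [([0, 1, 2], [(0, 1), (1, 2), (2, 0)]),
             ([3, 4, 5], [(3, 4), (4, 5), (5, 3)])]
print(round(compression_value(nodes, edges, pattern_nodes, pattern_edges, instances), 3))

Values above 1 indicate that replacing each instance of the pattern with a single node shrinks the graph's description.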
Gaussian Kernel Width Generator for Support Vector Clustering. International Conference on Bioinformatics and its Applications, 2005.
"... Clustering data into natural groupings has important applications in fields such as Bioinformatics. Support Vector Clustering (SVC) does not require prior knowledge of a dataset and it can identify irregularly shaped cluster boundaries. A major SVC challenge is the choice of an important parameter v ..."
Cited by 8 (2 self)
Clustering data into natural groupings has important applications in fields such as Bioinformatics. Support Vector Clustering (SVC) does not require prior knowledge of a dataset and can identify irregularly shaped cluster boundaries. A major challenge for SVC is the choice of one key parameter value: the width of the kernel function that determines the nonlinear transformation of the input data. Since evaluating the result of a clustering algorithm is a highly subjective process, a collection of different parameter values must typically be examined. However, no algorithm has been proposed to specify these parameter values. This paper presents a secant-like numerical algorithm that generates an increasing sequence of SVC kernel width values. An estimate of the sequence length depends on spatial characteristics of the data but not on the number of data points or the data’s dimensionality. The algorithm relies on a function that relates the kernel width value to the radius of the minimal sphere enclosing the images of the data points in a high-dimensional feature space. Experimental results with 2D and higher-dimensional datasets suggest that the algorithm yields useful data clusterings.
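A generic secant-style generator is sketched below. The squared-radius function would in practice come from solving the minimal-enclosing-sphere (SVDD) problem for each width, which is not shown; the stand-in curve, the target value, and the stopping rules are assumptions for illustration, and the paper's exact update may differ.

import math

def secant_width_sequence(radius_sq, q0, q1, target, tol=1e-4, max_steps=20):
    """Secant-style iteration generating an increasing sequence of kernel widths q.

    radius_sq(q) is assumed to return the squared radius of the minimal sphere
    enclosing the feature-space images of the data for Gaussian width q."""
    qs = [q0, q1]
    for _ in range(max_steps):
        f0 = radius_sq(qs[-2]) - target
        f1 = radius_sq(qs[-1]) - target
        if abs(f1) < tol or f1 == f0:
            break
        q_next = qs[-1] - f1 * (qs[-1] - qs[-2]) / (f1 - f0)
        if q_next <= qs[-1]:           # keep the sequence monotonically increasing
            break
        qs.append(q_next)
    return qs

def toy_radius_sq(q):
    """Toy monotone stand-in for the true radius curve (NOT the SVDD solution)."""
    return 1.0 - math.exp(-0.5 * q)

print(secant_width_sequence(toy_radius_sq, q0=0.1, q1=0.5, target=0.9))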
A Fuzzy FCA-based Approach to Conceptual Clustering for Automatic Generation of Concept Hierarchy on Uncertainty Data
"... Abstract. This paper proposes a new fuzzy FCA-based approach to conceptual clustering for automatic generation of concept hierarchy on uncertainty data. The proposed approach first incorporates fuzzy logic into Formal Concept Analysis (FCA) to form a fuzzy concept lattice. Next, a fuzzy conceptual c ..."
Cited by 8 (0 self)
This paper proposes a new fuzzy FCA-based approach to conceptual clustering for the automatic generation of a concept hierarchy on uncertainty data. The proposed approach first incorporates fuzzy logic into Formal Concept Analysis (FCA) to form a fuzzy concept lattice. Next, a fuzzy conceptual clustering technique is proposed to cluster the fuzzy concept lattice into conceptual clusters. Then, hierarchical relations are generated among the conceptual clusters to construct the concept hierarchy. We also apply the proposed approach to generate a concept hierarchy of research areas from a citation database, and we discuss its performance.
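One way to make the first step concrete is a threshold cut of a fuzzy formal context: objects and attributes carry membership degrees, and ordinary formal concepts are read off the context induced by a confidence threshold. The toy context, the threshold, and the brute-force enumeration below are illustrative assumptions rather than the paper's construction.

from itertools import combinations

# Toy fuzzy formal context: membership of each object in each attribute, in [0, 1].
context = {
    "paper1": {"clustering": 0.9, "fuzzy": 0.8, "lattice": 0.2},
    "paper2": {"clustering": 0.7, "fuzzy": 0.1, "lattice": 0.9},
    "paper3": {"clustering": 0.8, "fuzzy": 0.9, "lattice": 0.7},
}
attributes = {"clustering", "fuzzy", "lattice"}

def extent(attrs, theta):
    """Objects whose membership in every attribute of `attrs` is at least theta."""
    return {o for o, m in context.items() if all(m.get(a, 0.0) >= theta for a in attrs)}

def intent(objs, theta):
    """Attributes held (with membership >= theta) by every object in `objs`."""
    return {a for a in attributes if all(context[o].get(a, 0.0) >= theta for o in objs)}

def concepts(theta):
    """All (extent, intent) pairs of the theta-cut context; only closed pairs survive."""
    found = set()
    for r in range(len(attributes) + 1):
        for attrs in combinations(sorted(attributes), r):
            objs = extent(set(attrs), theta)
            found.add((frozenset(objs), frozenset(intent(objs, theta))))
    return found

for ext, itt in sorted(concepts(theta=0.6), key=lambda c: -len(c[0])):
    print(sorted(ext), "<->", sorted(itt))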
Scalable semantic analytics on social networks for addressing the problem of conflict of interest detection. ACM Trans. Web, 2008.
"... In this paper, we demonstrate the applicability of semantic techniques for detection of Conflict of Interest (COI). We explain the common challenges involved in building scalable Semantic Web applications, in particular those addressing connecting-the-dots problems. We describe in detail the challen ..."
Cited by 7 (0 self)
In this paper, we demonstrate the applicability of semantic techniques for detection of Conflict of Interest (COI). We explain the common challenges involved in building scalable Semantic Web applications, in particular those addressing connecting-the-dots problems. We describe in detail the challenges involved in two important aspects of building Semantic Web applications, namely data acquisition and entity disambiguation (or reference reconciliation). We extend our previous work, in which we integrated the collaborative network of a subset of DBLP researchers with persons in a Friend-of-a-Friend (FOAF) social network. Our method finds the connections between people, measures collaboration strength, and includes heuristics that use friendship/affiliation information to provide an estimate of potential COI in a peer-review scenario. Evaluations are presented by measuring what could have been the COI between accepted papers in various conference tracks and their respective program committee members. The experimental results demonstrate that scalability can be achieved by using a dataset of over 3 million entities (all bibliographic data from DBLP and a large
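A bare-bones version of the collaboration-strength heuristic (ignoring the semantic integration, FOAF linkage, and entity disambiguation that the paper actually addresses) can be sketched over a plain co-authorship list; the toy records and the threshold are made-up illustrations.

from collections import Counter
from itertools import combinations

# Toy publication records: each entry is the set of author names on one paper.
papers = [
    {"alice", "bob", "carol"},
    {"alice", "bob"},
    {"dave", "erin"},
]

# Collaboration strength: number of joint papers for each unordered author pair.
strength = Counter()
for authors in papers:
    for pair in combinations(sorted(authors), 2):
        strength[pair] += 1

def potential_coi(reviewer, author, min_joint_papers=2):
    """Flag a potential conflict of interest if reviewer and author co-authored enough papers."""
    return strength[tuple(sorted((reviewer, author)))] >= min_joint_papers

print(potential_coi("alice", "bob"))   # True: two joint papers
print(potential_coi("alice", "erin"))  # False: no joint papers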
Identifying inhabitants of an intelligent environment using a graph-based data mining system. In Proceedings of the Florida Artificial Intelligence Research Symposium, 2003.
"... The goal of the MavHome smart home project is to build an intelligent home environment that is aware of its inhabitants and their activities. Such a home is designed to provide maximum comfort to inhabitants at minimum cost. This can be done by learning the activities of the inhabitants and to autom ..."
Cited by 5 (1 self)
The goal of the MavHome smart home project is to build an intelligent home environment that is aware of its inhabitants and their activities. Such a home is designed to provide maximum comfort to inhabitants at minimum cost, which can be achieved by learning the inhabitants' activities and automating them. For this, it is necessary to identify which of multiple inhabitants is currently present in the home. Subdue is a graph-based data mining algorithm that discovers patterns in structural data. By representing each inhabitant's activity patterns as graphs, Subdue can be used for inhabitant identification. We introduce a multiple-class learning version of Subdue and show some preliminary results on synthetic smart home activity data for multiple inhabitants.
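A toy version of the graph representation (without any substructure discovery) is sketched below: each inhabitant's sensor-event sequences are summarized as a multiset of consecutive transitions, and a new sequence is attributed to the inhabitant whose transition profile it overlaps most. The sensor names and the overlap score are assumptions.

from collections import Counter

def sequence_to_edges(events):
    """Represent an activity sequence as the multiset of consecutive sensor transitions."""
    return Counter(zip(events, events[1:]))

# Toy training data: sensor-event sequences observed for each known inhabitant.
training = {
    "inhabitant_1": [["door", "hall", "kitchen", "coffee"], ["hall", "kitchen", "coffee"]],
    "inhabitant_2": [["door", "hall", "office", "desk_lamp"], ["hall", "office", "desk_lamp"]],
}

# Aggregate each inhabitant's sequences into one transition profile (a small labeled graph).
profiles = {who: sum((sequence_to_edges(s) for s in seqs), Counter())
            for who, seqs in training.items()}

def identify(events):
    """Attribute a new sequence to the inhabitant whose transition profile overlaps it most."""
    observed = sequence_to_edges(events)
    def overlap(profile):
        return sum(min(observed[e], profile[e]) for e in observed)
    return max(profiles, key=lambda who: overlap(profiles[who]))

print(identify(["door", "hall", "kitchen", "coffee"]))   # inhabitant_1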
Hierarchical Comments-Based Clustering
"... Information resources on the Web like videos, images, and documents are increasingly becoming more “social ” through user engagement via commenting systems. These commenting systems provide a forum for users to discuss the resources but have the side effect of providing valuable editorial and contex ..."
Cited by 4 (0 self)
Information resources on the Web such as videos, images, and documents are increasingly becoming more “social” through user engagement via commenting systems. These commenting systems provide a forum for users to discuss the resources, but have the side effect of providing valuable editorial and contextual information about them. In this paper, we explore a comments-driven clustering framework for organizing Web resources according to this user-based perspective. Concretely, we propose a hierarchical comment clustering approach that relies on two key features: (i) comment term normalization and key term extraction for distilling noisy comments into a form suitable for clustering; and (ii) a real-time insertion component for incrementally updating the comments-based hierarchy, so that resources can be placed efficiently as comments arrive without regenerating the (potentially expensive) hierarchy. We study the clustering approach over the popular video-sharing site YouTube, a challenging environment notorious for its extremely short, ill-formed, and often unintelligible user-contributed comments. Through extensive experimental study, we find that the proposed approach can lead to effective and efficient comments-based video organization even in a YouTube-like environment.
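The incremental-insertion idea can be sketched as a greedy descent through an existing hierarchy: a new resource's comments are normalized into a term-frequency vector and a leaf is attached under the most similar node, with no rebuild of the tree. The stopword list, cosine threshold, and node structure below are assumptions for illustration, not the paper's components.

import math
import re
from collections import Counter

STOPWORDS = {"the", "a", "is", "this", "so", "lol"}

def term_vector(comments):
    """Normalize comments into a lowercase term-frequency vector, dropping stopwords."""
    words = re.findall(r"[a-z]+", " ".join(comments).lower())
    return Counter(w for w in words if w not in STOPWORDS)

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

class Node:
    def __init__(self, name, vector):
        self.name, self.vector, self.children = name, vector, []

def insert(root, name, comments, threshold=0.2):
    """Greedily descend to the most similar child; attach a new leaf there."""
    vec, node = term_vector(comments), root
    while node.children:
        best = max(node.children, key=lambda c: cosine(vec, c.vector))
        if cosine(vec, best.vector) < threshold:
            break                      # no child is similar enough; attach here
        node = best
    node.children.append(Node(name, vec))
    node.vector += vec                 # cheap centroid update instead of rebuilding the tree

# Toy hierarchy with two topic nodes, then an incremental insertion.
root = Node("root", Counter())
root.children = [Node("music", term_vector(["great song", "love that guitar solo"])),
                 Node("cooking", term_vector(["nice recipe", "too much salt"]))]
insert(root, "new_video", ["best guitar cover of that song"])
print([c.name for c in root.children[0].children])   # ['new_video']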