Results 1 - 10 of 22
Adaptive blocking: Learning to scale up record linkage
- In Proceedings of the 6th IEEE International Conference on Data Mining (ICDM-2006), 2006
Cited by 17 (1 self)
Many data mining tasks require computing similarity between pairs of objects. Pairwise similarity computations are particularly important in record linkage systems, as well as in clustering and schema mapping algorithms. Because the number of object pairs grows quadratically with the size of the dataset, computing similarity between all pairs is impractical and becomes prohibitive for large datasets and complex similarity functions. Blocking methods alleviate this problem by efficiently selecting approximately similar object pairs for subsequent distance computations, leaving out the remaining pairs as dissimilar. Previously proposed blocking methods require manually constructing an index-based similarity function or selecting a set of predicates, followed by hand-tuning of parameters. In this paper, we introduce an adaptive framework for automatically learning blocking functions that are efficient and accurate. We describe two predicate-based formulations of learnable blocking functions and provide learning algorithms for training them. The effectiveness of the proposed techniques is demonstrated on real and simulated datasets, on which they prove to be more accurate than non-adaptive blocking methods.
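As a rough illustration of predicate-based blocking in general (not the learned blocking functions from the paper), the sketch below groups records by the key each predicate produces and keeps only pairs that share a key under at least one predicate. The predicates `first_three_chars` and `same_zip`, the record fields, and the sample data are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, predicates):
    """Return index pairs (i, j), i < j, that share a key under any predicate."""
    pairs = set()
    for predicate in predicates:
        # Bucket record indices by the blocking key this predicate assigns.
        blocks = defaultdict(list)
        for idx, record in enumerate(records):
            blocks[predicate(record)].append(idx)
        # Only records inside the same bucket become candidate pairs.
        for members in blocks.values():
            pairs.update(combinations(members, 2))
    return pairs


# Hypothetical hand-written predicates; a learned method would select these.
def first_three_chars(record):
    return record["name"][:3].lower()


def same_zip(record):
    return record["zip"]


records = [
    {"name": "Jonathan Smith", "zip": "78701"},
    {"name": "Jon Smith", "zip": "78701"},
    {"name": "Mary Jones", "zip": "10001"},
    {"name": "Maria Jones", "zip": "10002"},
]

pairs = candidate_pairs(records, [first_three_chars, same_zip])
all_pairs = len(records) * (len(records) - 1) // 2
print(sorted(pairs), "out of", all_pairs, "possible pairs")
```

Only the surviving pairs would be passed to the expensive similarity function; the quadratic all-pairs comparison is avoided at the risk of missing true matches that no predicate groups together.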
Approximation algorithms for the Label-CoverMAX and Red-Blue Set Cover problems
- J. of Discrete Algorithms
Cited by 12 (0 self)
This paper presents approximation algorithms for two extensions of the set cover problem: a graph-based extension known as the Max-Rep or Label-CoverMAX problem, and a color-based extension known as the Red-Blue Set Cover problem. First, a randomized algorithm guaranteeing approximation ratio √n with high probability is proposed for the Max-Rep (or Label-CoverMAX) problem, where n is the number of vertices in the graph. This algorithm is then generalized into a 4√n-ratio algorithm for the nonuniform version of the problem. Secondly, it is shown that the Red-Blue Set Cover problem can be approximated with ratio 2√(n log β), where n is the number of sets and β is the number of blue elements. Both algorithms can be adapted to the weighted variants of the respective problems, yielding the same approximation ratios. © 2006 Elsevier B.V. All rights reserved.
The Labeled perfect matching in bipartite graphs
- Information Processing Letters 96, 2005
Cited by 8 (5 self)
In this paper, we deal with both the complexity and the approximability of the labeled perfect matching problem in bipartite graphs. Given a simple graph G = (V, E) with |V| = 2n vertices such that E contains a perfect matching (of size n), together with a color (or label) function L: E → {c_1, ..., c_q}, the labeled perfect matching problem consists of finding a perfect matching on G that uses a minimum or a maximum number of colors. Keywords: labeled matching; bipartite graphs; NP-complete; approximate algorithms.
Learnable Similarity Functions and Their Applications to Clustering and Record Linkage
- 2004
Cited by 6 (0 self)
rship (Xing et al. 2003), and relative comparisons (Schultz & Joachims 2004). These approaches have shown improvements over traditional similarity functions for different data types such as vectors in Euclidean space, strings, and database records composed of multiple text fields. While these initial results are encouraging, there still remain a large number of similarity functions that are currently unable to adapt to a particular domain. In our research, we attempt to bridge this gap by developing both new learnable similarity functions and methods for their application to particular problems in machine learning and data mining. In preliminary work, we proposed two learnable similarity functions for strings that adapt distance computations given training pairs of equivalent and non-equivalent strings (Bilenko & Mooney 2003a). The first function is based on a probabilistic model of edit distance with affine gaps (Gus-
Approximation algorithms and hardness results for labeled connectivity problems
- In 31st MFCS, 2006
Cited by 6 (3 self)
Let G = (V, E) be a connected multigraph, whose edges are associated with labels specified by an integer-valued function L: E → N. In addition, each label ℓ ∈ N to which at least one edge is mapped has a non-negative cost c(ℓ). The minimum label spanning tree problem (MinLST) asks to find a spanning tree in G that minimizes the overall cost of the labels used by its edges. Equivalently, we aim at finding a minimum cost subset of labels I ⊆ N such that the edge set {e ∈ E: L(e) ∈ I} forms a connected subgraph spanning all vertices. Similarly, in the minimum label s-t path problem (MinLP) the goal is to identify an s-t path minimizing the combined cost of its labels, where s and t are provided as part of the input. The main contributions of this paper are improved approximation algorithms and hardness results for MinLST and MinLP. As a secondary objective, we make a concentrated effort to relate the algorithmic methods utilized in approximating these problems to a number of well-known techniques, originally studied in the context of integer covering.
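To make the label-subset formulation of MinLST concrete, here is a minimal brute-force sketch that tries every subset of labels and keeps the cheapest one whose edges connect all vertices. It is exponential in the number of labels and is meant only to illustrate the problem definition, not the approximation algorithms of the paper; the function names and the toy instance are assumptions.

```python
from itertools import combinations


def spans_all_vertices(vertices, edges, chosen_labels):
    """True if edges whose label is in chosen_labels connect every vertex."""
    parent = {v: v for v in vertices}

    def find(v):
        # Union-find lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for u, v, label in edges:
        if label in chosen_labels:
            parent[find(u)] = find(v)
    return len({find(v) for v in vertices}) == 1


def min_label_spanning_cost(vertices, edges, cost):
    """Exhaustively search label subsets; return (total_cost, labels) or None."""
    labels = sorted({label for _, _, label in edges})
    best = None
    for size in range(1, len(labels) + 1):
        for subset in combinations(labels, size):
            if spans_all_vertices(vertices, edges, set(subset)):
                total = sum(cost[label] for label in subset)
                if best is None or total < best[0]:
                    best = (total, set(subset))
    return best


# Toy instance (hypothetical): edges are (u, v, label), cost maps label -> c(label).
vertices = ["a", "b", "c", "d"]
edges = [("a", "b", 1), ("b", "c", 1), ("c", "d", 2), ("a", "d", 3), ("b", "d", 3)]
cost = {1: 1, 2: 5, 3: 2}

result = min_label_spanning_cost(vertices, edges, cost)
print(result)
```

In this instance no single label connects the graph, and labels 1 and 3 together do so at total cost 3, which is cheaper than any subset containing label 2.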
Turning clusters into patterns: Rectangle-based discriminative data description
- IEEE International Conference on Data Mining, 2006
Cited by 5 (3 self)
The ultimate goal of data mining is to extract knowledge from massive data. Knowledge is ideally represented as human-comprehensible patterns from which end-users can gain intuitions and insights. Yet not all data mining methods produce such readily understandable knowledge, e.g., most clustering algorithms output sets of points as clusters. In this paper, we perform a systematic study of cluster description that generates interpretable patterns from clusters. We introduce and analyze novel description formats leading to more expressive power, and we motivate and define novel description problems specifying different trade-offs between interpretability and accuracy. We also present effective heuristic algorithms together with their empirical evaluations.
Algorithmic Aspects of the Consecutive-Ones Property
- 2009
Cited by 4 (1 self)
We survey the consecutive-ones property of binary matrices. Herein, a binary matrix has the consecutive-ones property (C1P) if there is a permutation of its columns that places the 1s consecutively in every row. We provide an overview of connections to graph theory, characterizations, recognition algorithms, and applications such as integer linear programming and solving Set Cover.
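The C1P definition can be checked directly on small matrices. The sketch below simply tries every column permutation, so it runs in factorial time and is not one of the efficient recognition algorithms the survey covers; the function names and example matrices are assumptions made for illustration.

```python
from itertools import permutations


def ones_are_consecutive(row):
    """True if all 1s in the row form one contiguous run (or there are none)."""
    positions = [i for i, bit in enumerate(row) if bit == 1]
    return not positions or positions[-1] - positions[0] + 1 == len(positions)


def find_c1p_ordering(matrix):
    """Return a column order giving consecutive 1s in every row, or None."""
    n_cols = len(matrix[0]) if matrix else 0
    for order in permutations(range(n_cols)):
        if all(ones_are_consecutive([row[j] for j in order]) for row in matrix):
            return order
    return None


# Swapping the last two columns makes both rows consecutive.
has_c1p = [[1, 0, 1],
           [0, 1, 1]]

# Every pair of columns must be adjacent, which is impossible for three columns.
no_c1p = [[1, 1, 0],
          [0, 1, 1],
          [1, 0, 1]]

print(find_c1p_ordering(has_c1p))
print(find_c1p_ordering(no_c1p))
```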
Hyper-rectangle-based discriminative data generalization and applications in data mining
- 2007
Cited by 3 (2 self)
The ultimate goal of data mining is to extract knowledge from massive data. Knowledge is ideally represented as human-comprehensible patterns from which end-users can gain intuitions and insights. Axis-parallel hyper-rectangles provide interpretable generalizations for multi-dimensional data points with numerical attributes. In this dissertation, we study the fundamental problem of rectangle-based discriminative data generalization in the context of several useful data mining applications: cluster description, rule learning, and Nearest Rectangle classification. Clustering is one of the most important data mining tasks. However, most clustering methods output sets of points as clusters and do not generalize them into interpretable patterns. We perform a systematic study of cluster description, where we propose novel description formats leading to enhanced expressive power and introduce novel description problems specifying different trade-offs between interpretability and accuracy. We also present efficient heuristic algorithms for the introduced problems in the proposed formats. If-then rules are
Topical Query Decomposition
Cited by 3 (0 self)
We introduce the problem of query decomposition, where we are given a query and a document retrieval system, and we want to produce a small set of queries whose union of resulting documents corresponds approximately to that of the original query. Ideally, these queries should represent coherent, conceptually well-separated topics. We provide an abstract formulation of the query decomposition problem, and we tackle it from two different perspectives. We first show how the problem can be instantiated as a specific variant of a set cover problem, for which we provide an efficient greedy algorithm. Next, we show how the same problem can be seen as a constrained clustering problem, with a very particular kind of constraint, i.e., clustering with predefined clusters. We develop a two-phase algorithm based on hierarchical agglomerative clustering followed by dynamic programming. Our experiments, conducted on a set of actual queries in a Web scale search engine, confirm the effectiveness of the proposed solutions.
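For orientation, the textbook greedy heuristic for plain set cover can be sketched as follows. The paper uses a specific variant of set cover whose details are not given in the abstract, so this is only the classic version under assumed inputs: `universe` stands for the documents returned by the original query and `candidates` maps hypothetical sub-query names to the documents each one retrieves.

```python
def greedy_set_cover(universe, candidates):
    """Repeatedly pick the candidate covering the most still-uncovered items.

    Returns (chosen_names, uncovered_items); uncovered_items is non-empty
    when the candidates cannot cover the whole universe.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered and candidates:
        best = max(candidates, key=lambda name: len(candidates[name] & uncovered))
        gained = candidates[best] & uncovered
        if not gained:
            break  # No candidate covers anything new; stop early.
        chosen.append(best)
        uncovered -= gained
    return chosen, uncovered


# Hypothetical document IDs and sub-queries.
universe = {1, 2, 3, 4, 5, 6}
candidates = {
    "q_a": {1, 2, 3, 4},
    "q_b": {3, 4, 5},
    "q_c": {5, 6},
    "q_d": {1, 6},
}

chosen, uncovered = greedy_set_cover(universe, candidates)
print(chosen, uncovered)
```

Here the heuristic first takes `q_a` (four new documents) and then `q_c` (the remaining two), giving a two-query cover.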
Approximation and Hardness Results for Label Cut and Related Problems
Cited by 3 (0 self)
We investigate a natural combinatorial optimization problem called the Label Cut problem. Given an input graph G with a source s and a sink t, the edges of G are classified into different categories, represented by a set of labels. The labels may also have weights. We want to pick a subset of labels of minimum cardinality (or minimum total weight), such that the removal of all edges with these labels disconnects s and t. We give the first non-trivial approximation and hardness results for the Label Cut problem. Firstly, we present an O(√m)-approximation algorithm for the Label Cut problem, where m is the number of edges in the input graph. Secondly, we show that it is NP-hard to approximate Label Cut within $2^{\log^{1 - 1/\log\log^{c} n} n}$ for any constant c < 1/2, where n is the input length of the problem. Thirdly, our techniques can be applied to other previously considered optimization problems. In particular we show that the Minimum Label Path problem has the same approximation hardness as that of Label Cut, simultaneously improving and unifying two known hardness results for this problem which were previously the best (but incomparable due to different complexity assumptions).

