Adaptive blocking: Learning to scale up record linkage
 In Proceedings of the 6th IEEE International Conference on Data Mining (ICDM-2006)
, 2006
Cited by 23 (1 self)
Many data mining tasks require computing similarity between pairs of objects. Pairwise similarity computations are particularly important in record linkage systems, as well as in clustering and schema mapping algorithms. Because the number of object pairs grows quadratically with the size of the dataset, computing similarity between all pairs is impractical and becomes prohibitive for large datasets and complex similarity functions. Blocking methods alleviate this problem by efficiently selecting approximately similar object pairs for subsequent distance computations, leaving out the remaining pairs as dissimilar. Previously proposed blocking methods require manually constructing an index-based similarity function or selecting a set of predicates, followed by hand-tuning of parameters. In this paper, we introduce an adaptive framework for automatically learning blocking functions that are efficient and accurate. We describe two predicate-based formulations of learnable blocking functions and provide learning algorithms for training them. The effectiveness of the proposed techniques is demonstrated on real and simulated datasets, on which they prove to be more accurate than non-adaptive blocking methods.
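To make the idea of predicate-based blocking concrete, here is a minimal sketch of a disjunctive blocking function: two records become a candidate pair if they share a key under at least one predicate. It does not reproduce the learning algorithms of the paper; the predicates `same_zip` and `name_prefix`, the record fields, and the sample data are all hypothetical and hand-written for illustration.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, predicates):
    """Return index pairs that share a blocking key under at least one predicate."""
    pairs = set()
    for predicate in predicates:
        # Group record indices by the key this predicate assigns to them.
        blocks = defaultdict(list)
        for idx, record in enumerate(records):
            blocks[predicate(record)].append(idx)
        # Only records inside the same block are compared later.
        for members in blocks.values():
            pairs.update(combinations(members, 2))
    return pairs


# Hypothetical blocking predicates (assumptions, not taken from the paper).
def same_zip(record):
    return record["zip"]


def name_prefix(record):
    return record["name"][:3].lower()


records = [
    {"name": "Jonathan Smith", "zip": "78701"},
    {"name": "Jon Smith", "zip": "78701"},
    {"name": "Maria Garcia", "zip": "10001"},
    {"name": "Mario Garza", "zip": "94105"},
]

pairs = candidate_pairs(records, [same_zip, name_prefix])
all_pairs = len(records) * (len(records) - 1) // 2
print(f"{len(pairs)} candidate pairs instead of {all_pairs}")
```

Only the pairs returned by `candidate_pairs` would be passed to the expensive similarity function; an adaptive method would choose the predicates from training data rather than by hand.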
Approximation algorithms for the Label-CoverMAX and Red-Blue Set Cover problems
 J. of Discrete Algorithms
Cited by 15 (0 self)
This paper presents approximation algorithms for two extensions of the set cover problem: a graph-based extension known as the Max-Rep or Label-CoverMAX problem, and a color-based extension known as the Red-Blue Set Cover problem. First, a randomized algorithm guaranteeing approximation ratio √n with high probability is proposed for the Max-Rep (or Label-CoverMAX) problem, where n is the number of vertices in the graph. This algorithm is then generalized into a 4√n-ratio algorithm for the nonuniform version of the problem. Secondly, it is shown that the Red-Blue Set Cover problem can be approximated with ratio 2√(n log β), where n is the number of sets and β is the number of blue elements. Both algorithms can be adapted to the weighted variants of the respective problems, yielding the same approximation ratios. © 2006 Elsevier B.V. All rights reserved.
Approximation algorithms and hardness results for labeled connectivity problems
 In 31st MFCS
, 2006
Cited by 13 (5 self)
Let G = (V, E) be a connected multigraph, whose edges are associated with labels specified by an integer-valued function L: E → N. In addition, each label ℓ ∈ N to which at least one edge is mapped has a non-negative cost c(ℓ). The minimum label spanning tree problem (MinLST) asks to find a spanning tree in G that minimizes the overall cost of the labels used by its edges. Equivalently, we aim at finding a minimum cost subset of labels I ⊆ N such that the edge set {e ∈ E: L(e) ∈ I} forms a connected subgraph spanning all vertices. Similarly, in the minimum label s-t path problem (MinLP) the goal is to identify an s-t path minimizing the combined cost of its labels, where s and t are provided as part of the input. The main contributions of this paper are improved approximation algorithms and hardness results for MinLST and MinLP. As a secondary objective, we make a concentrated effort to relate the algorithmic methods utilized in approximating these problems to a number of well-known techniques, originally studied in the context of integer covering.
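The label-subset formulation of MinLST stated above can be illustrated with a small exhaustive search. This is only a sketch of the problem definition, assuming a tiny instance: it enumerates every label subset, so it takes exponential time and is not one of the approximation algorithms from the paper. The function names, the sample graph, and the costs are hypothetical.

```python
from itertools import combinations


def spans_all_vertices(vertices, edges):
    """Union-find check that the given edges connect every vertex."""
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for u, v in edges:
        parent[find(u)] = find(v)
    return len({find(v) for v in vertices}) <= 1


def min_label_spanning_set(vertices, labeled_edges, cost):
    """Exhaustively find a cheapest label set whose edges span all vertices."""
    labels = sorted({label for _, _, label in labeled_edges})
    best, best_cost = None, float("inf")
    for size in range(len(labels) + 1):
        for chosen in combinations(labels, size):
            total = sum(cost[label] for label in chosen)
            if total >= best_cost:
                continue  # cannot improve on the best set found so far
            kept = [(u, v) for u, v, label in labeled_edges if label in chosen]
            if spans_all_vertices(vertices, kept):
                best, best_cost = set(chosen), total
    return best, best_cost


# Hypothetical instance: edges are (u, v, label) triples.
vertices = [1, 2, 3, 4]
labeled_edges = [(1, 2, "a"), (2, 3, "a"), (3, 4, "b"), (1, 4, "c"), (2, 4, "c")]
cost = {"a": 2, "b": 3, "c": 1}

best_labels, best_total = min_label_spanning_set(vertices, labeled_edges, cost)
print(best_labels, best_total)
```

Any spanning tree built from the edges carrying the returned labels uses labels of that total cost or less, which is why the two formulations in the abstract are equivalent.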
The Labeled perfect matching in bipartite graphs
 Information Processing Letters 96
, 2005
Cited by 11 (6 self)
In this paper, we deal with both the complexity and the approximability of the labeled perfect matching problem in bipartite graphs. Given a simple graph G = (V, E) with |V| = 2n vertices such that E contains a perfect matching (of size n), together with a color (or label) function L: E → {c_1, ..., c_q}, the labeled perfect matching problem consists in finding a perfect matching on G that uses a minimum or a maximum number of colors. Keywords: labeled matching; bipartite graphs; NP-complete; approximate algorithms.
Algorithmic Aspects of the Consecutive-Ones Property
, 2009
Cited by 8 (1 self)
We survey the consecutive-ones property of binary matrices. Herein, a binary matrix has the consecutive-ones property (C1P) if there is a permutation of its columns that places the 1s consecutively in every row. We provide an overview over connections to graph theory, characterizations, recognition algorithms, and applications such as integer linear programming and solving Set Cover.
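The definition of the C1P can be checked directly on small matrices by trying every column order. The sketch below is a brute-force illustration of the definition only, assuming a matrix given as a list of 0/1 rows; it runs in factorial time and is not one of the recognition algorithms covered by the survey. The function names are hypothetical.

```python
from itertools import permutations


def ones_are_consecutive(row):
    """True if the 1s in the row form a single contiguous run (or there are none)."""
    positions = [i for i, value in enumerate(row) if value == 1]
    return not positions or positions[-1] - positions[0] + 1 == len(positions)


def find_c1p_column_order(matrix):
    """Return a column order that makes every row's 1s consecutive, or None."""
    n_cols = len(matrix[0]) if matrix else 0
    for order in permutations(range(n_cols)):
        if all(ones_are_consecutive([row[c] for c in order]) for row in matrix):
            return order
    return None


has_c1p = [[1, 0, 1],
           [0, 1, 1]]
no_c1p = [[1, 1, 0],
          [0, 1, 1],
          [1, 0, 1]]

print(find_c1p_column_order(has_c1p))
print(find_c1p_column_order(no_c1p))
```

The second matrix requires every pair of its three columns to be adjacent, which no linear order can satisfy, so it does not have the C1P.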
Learnable Similarity Functions and Their Applications to Clustering and Record Linkage
, 2004
Cited by 7 (0 self)
rship (Xing et al. 2003), and relative comparisons (Schultz & Joachims 2004). These approaches have shown improvements over traditional similarity functions for different data types such as vectors in Euclidean space, strings, and database records composed of multiple text fields. While these initial results are encouraging, there still remains a large number of similarity functions that are currently unable to adapt to a particular domain. In our research, we attempt to bridge this gap by developing both new learnable similarity functions and methods for their application to particular problems in machine learning and data mining. In preliminary work, we proposed two learnable similarity functions for strings that adapt distance computations given training pairs of equivalent and non-equivalent strings (Bilenko & Mooney 2003a). The first function is based on a probabilistic model of edit distance with affine gaps (Gus ... [Copyright © 2004, American Association for Artificial Intelli...]
Turning clusters into patterns: Rectangle-based discriminative data description
 IEEE International Conference on Data Mining
, 2006
Cited by 7 (3 self)
The ultimate goal of data mining is to extract knowledge from massive data. Knowledge is ideally represented as human-comprehensible patterns from which end-users can gain intuitions and insights. Yet not all data mining methods produce such readily understandable knowledge, e.g., most clustering algorithms output sets of points as clusters. In this paper, we perform a systematic study of cluster description that generates interpretable patterns from clusters. We introduce and analyze novel description formats leading to more expressive power, motivate and define novel description problems specifying different trade-offs between interpretability and accuracy. We also present effective heuristic algorithms together with their empirical evaluations.
Hyperrectangle-based discriminative data generalization and applications in data mining
, 2007
Cited by 5 (2 self)
The ultimate goal of data mining is to extract knowledge from massive data. Knowledge is ideally represented as human-comprehensible patterns from which end-users can gain intuitions and insights. Axis-parallel hyper-rectangles provide interpretable generalizations for multi-dimensional data points with numerical attributes. In this dissertation, we study the fundamental problem of rectangle-based discriminative data generalization in the context of several useful data mining applications: cluster description, rule learning, and Nearest Rectangle classification. Clustering is one of the most important data mining tasks. However, most clustering methods output sets of points as clusters and do not generalize them into interpretable patterns. We perform a systematic study of cluster description, where we propose novel description formats leading to enhanced expressive power and introduce novel description problems specifying different trade-offs between interpretability and accuracy. We also present efficient heuristic algorithms for the introduced problems in the proposed formats. If-then rules are ...
Approximation and Hardness Results for Label Cut and Related Problems
Cited by 5 (0 self)
We investigate a natural combinatorial optimization problem called the Label Cut problem. Given an input graph G with a source s and a sink t, the edges of G are classified into different categories, represented by a set of labels. The labels may also have weights. We want to pick a subset of labels of minimum cardinality (or minimum total weight), such that the removal of all edges with these labels disconnects s and t. We give the first non-trivial approximation and hardness results for the Label Cut problem. Firstly, we present an O(√m)-approximation algorithm for the Label Cut problem, where m is the number of edges in the input graph. Secondly, we show that it is NP-hard to approximate Label Cut within 2^(log^(1 − 1/log log^c n) n) for any constant c < 1/2, where n is the input length of the problem. Thirdly, our techniques can be applied to other previously considered optimization problems. In particular we show that the Minimum Label Path problem has the same approximation hardness as that of Label Cut, simultaneously improving and unifying two known hardness results for this problem which were previously the best (but incomparable due to different complexity assumptions).
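As an illustration of the unweighted Label Cut problem described above, the following sketch tries label subsets in order of increasing size and returns the first one whose removal disconnects s from t. It assumes an undirected graph (the abstract does not say which) and uses exhaustive search, so it is exponential in the number of labels and is not the O(√m)-approximation algorithm of the paper. The function names and the sample graph are hypothetical.

```python
from itertools import combinations


def connected(edges, s, t):
    """Depth-first search over undirected (u, v, label) edges from s towards t."""
    adjacency = {}
    for u, v, _ in edges:
        adjacency.setdefault(u, []).append(v)
        adjacency.setdefault(v, []).append(u)
    stack, seen = [s], {s}
    while stack:
        node = stack.pop()
        if node == t:
            return True
        for neighbour in adjacency.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return False


def min_label_cut(edges, s, t):
    """Smallest set of labels whose edges, once removed, separate s from t."""
    labels = sorted({label for _, _, label in edges})
    for size in range(len(labels) + 1):
        for chosen in combinations(labels, size):
            remaining = [edge for edge in edges if edge[2] not in chosen]
            if not connected(remaining, s, t):
                return set(chosen)
    return None  # only reached when s == t


# Hypothetical instance: both edges leaving s carry the label "red".
edges = [
    ("s", "a", "red"),
    ("s", "b", "red"),
    ("a", "t", "blue"),
    ("b", "t", "green"),
]

cut = min_label_cut(edges, "s", "t")
print(cut)
```

In this instance any ordinary edge cut needs two edges, while a single label suffices, which shows how the label-based objective differs from the classical minimum cut.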
On the positive–negative partial set cover problem
 Inf. Process. Lett
Cited by 4 (2 self)
The Positive-Negative Partial Set Cover problem is introduced and its complexity, especially the hardness-of-approximation, is studied. The problem generalizes the Set Cover problem, and it naturally arises in certain data mining applications.