Approximation algorithms for the Label-Cover_MAX and Red-Blue Set Cover problems

by David Peleg
Venue: J. of Discrete Algorithms

Results 1 - 9 of 9

Adaptive blocking: Learning to scale up record linkage

by Mikhail Bilenko - In Proceedings of the 6th IEEE International Conference on Data Mining (ICDM-2006), 2006
Cited by 17 (1 self)
Many data mining tasks require computing similarity between pairs of objects. Pairwise similarity computations are particularly important in record linkage systems, as well as in clustering and schema mapping algorithms. Because the number of object pairs grows quadratically with the size of the dataset, computing similarity between all pairs is impractical and becomes prohibitive for large datasets and complex similarity functions. Blocking methods alleviate this problem by efficiently selecting approximately similar object pairs for subsequent distance computations, leaving out the remaining pairs as dissimilar. Previously proposed blocking methods require manually constructing an index-based similarity function or selecting a set of predicates, followed by hand-tuning of parameters. In this paper, we introduce an adaptive framework for automatically learning blocking functions that are efficient and accurate. We describe two predicate-based formulations of learnable blocking functions and provide learning algorithms for training them. The effectiveness of the proposed techniques is demonstrated on real and simulated datasets, on which they prove to be more accurate than non-adaptive blocking methods.
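
As a rough illustration of the blocking idea described in this abstract, the sketch below groups records under hand-written blocking predicates and emits only the pairs that share a key, instead of all quadratically many pairs. The predicates `first_token` and `zip_prefix`, the record fields, and the helper `candidate_pairs` are hypothetical names chosen for this example. The paper learns which predicates to use; this sketch does not attempt that step.

```python
from collections import defaultdict
from itertools import combinations

def first_token(record):
    # Hypothetical predicate: first word of the name, lowercased.
    return record["name"].split()[0].lower()

def zip_prefix(record):
    # Hypothetical predicate: first three characters of the postal code.
    return record["zip"][:3]

def candidate_pairs(records, predicates):
    """Return index pairs of records sharing a key under at least one predicate."""
    pairs = set()
    for predicate in predicates:
        blocks = defaultdict(list)
        for idx, record in enumerate(records):
            blocks[predicate(record)].append(idx)
        # Only records inside the same block are compared later.
        for members in blocks.values():
            pairs.update(combinations(members, 2))
    return pairs

# Made-up records for illustration only.
records = [
    {"name": "John Smith", "zip": "78701"},
    {"name": "john smyth", "zip": "78705"},
    {"name": "Mary Jones", "zip": "10001"},
    {"name": "M. Jones", "zip": "10002"},
]
pairs = candidate_pairs(records, [first_token, zip_prefix])
```

With these four made-up records, only two of the six possible pairs survive blocking and would be passed on to the expensive similarity function.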

Learnable Similarity Functions and Their Applications to Clustering and Record Linkage

by Mikhail Bilenko, 2004
Cited by 6 (0 self)
rship (Xing et al. 2003), and relative comparisons (Schultz & Joachims 2004). These approaches have shown improvements over traditional similarity functions for different data types such as vectors in Euclidean space, strings, and database records composed of multiple text fields. While these initial results are encouraging, there still remains a large number of similarity functions that are currently unable to adapt to a particular domain. In our research, we attempt to bridge this gap by developing both new learnable similarity functions and methods for their application to particular problems in machine learning and data mining. In preliminary work, we proposed two learnable similarity functions for strings that adapt distance computations given training pairs of equivalent and non-equivalent strings (Bilenko & Mooney 2003a). The first function is based on a probabilistic model of edit distance with affine gaps (Gus- Copyright © 2004, American Association for Artificial Intelli

Improved approximation algorithms for label cover problems

by Moses Charikar, Mohammadtaghi Hajiaghayi, Howard Karloff - In ESA, 2009
Cited by 6 (1 self)
In this paper we consider both the maximization variant Max Rep and the minimization variant Min Rep of the famous Label Cover problem, for which, till now, the best approximation ratios known were $O(\sqrt{n})$. In fact, several recent papers reduced Label Cover to other problems, arguing that if better approximation algorithms for their problems existed, then an $o(\sqrt{n})$-approximation algorithm for Label Cover would exist. We show, in fact, that there are an $O(n^{1/3})$-approximation algorithm for Max Rep and an $O(n^{1/3} \log^{2/3} n)$-approximation algorithm for Min Rep. In addition, we also exhibit a randomized reduction from Densest k-Subgraph to Max Rep, showing that any approximation factor for Max Rep implies the same factor (up to a constant) for Densest k-Subgraph.

Improved approximation algorithms for directed Steiner forest

by Moran Feldman, Guy Kortsarz, Zeev Nutov, 2008
Cited by 4 (1 self)
Abstract not found

Topical Query Decomposition

by Francesco Bonchi, Debora Donato, Carlos Castillo, Aristides Gionis
Cited by 3 (0 self)
We introduce the problem of query decomposition, where we are given a query and a document retrieval system, and we want to produce a small set of queries whose union of resulting documents corresponds approximately to that of the original query. Ideally, these queries should represent coherent, conceptually well-separated topics. We provide an abstract formulation of the query decomposition problem, and we tackle it from two different perspectives. We first show how the problem can be instantiated as a specific variant of a set cover problem, for which we provide an efficient greedy algorithm. Next, we show how the same problem can be seen as a constrained clustering problem, with a very particular kind of constraint, i.e., clustering with predefined clusters. We develop a two-phase algorithm based on hierarchical agglomerative clustering followed by dynamic programming. Our experiments, conducted on a set of actual queries in a Web scale search engine, confirm the effectiveness of the proposed solutions.
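
To make the set cover view in this abstract concrete, here is a minimal sketch of the classic unweighted greedy heuristic: repeatedly pick the candidate query that retrieves the most still-uncovered documents. This is the textbook greedy, not necessarily the specific variant the paper develops. The function `greedy_set_cover`, the query names, and the document ids are all assumptions made for illustration.

```python
def greedy_set_cover(universe, subsets):
    """Classic greedy: repeatedly pick the subset covering the most uncovered elements."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Candidate with the largest overlap with what is still uncovered.
        best = max(subsets, key=lambda name: len(subsets[name] & uncovered))
        gained = subsets[best] & uncovered
        if not gained:
            raise ValueError("universe cannot be covered by the given subsets")
        chosen.append(best)
        uncovered -= gained
    return chosen

# Hypothetical example: documents 1..6 returned by the original query,
# and candidate sub-queries with the documents each one retrieves.
docs = {1, 2, 3, 4, 5, 6}
sub_queries = {
    "q_a": {1, 2, 3, 4},
    "q_b": {3, 4, 5},
    "q_c": {5, 6},
    "q_d": {1, 6},
}
cover = greedy_set_cover(docs, sub_queries)
```

On this toy input the heuristic first takes `q_a` (four new documents) and then `q_c` (the remaining two), so two sub-queries reproduce the original result set.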

Improved Guarantees for Agnostic Learning of Disjunctions

by Pranjal Awasthi, Avrim Blum, Or Sheffet
Given some arbitrary distribution $D$ over $\{0,1\}^n$ and arbitrary target function $c^*$, the problem of agnostic learning of disjunctions is to achieve an error rate comparable to the error $\mathrm{OPT}_{disj}$ of the best disjunction with respect to $(D, c^*)$. Achieving error $O(n \cdot \mathrm{OPT}_{disj}) + \epsilon$ is trivial, and Winnow [13] achieves error $O(r \cdot \mathrm{OPT}_{disj}) + \epsilon$, where $r$ is the number of relevant variables in the best disjunction. In recent work, Peleg [14] shows how to achieve a bound of $\tilde{O}(\sqrt{n} \cdot \mathrm{OPT}_{disj}) + \epsilon$ in polynomial time. In this paper we improve on Peleg's bound, giving a polynomial-time algorithm achieving a bound of $O(n^{1/3+\alpha} \cdot \mathrm{OPT}_{disj}) + \epsilon$ for any constant $\alpha > 0$. The heart of the algorithm is a method for weak-learning when $\mathrm{OPT}_{disj} = O(1/n^{1/3+\alpha})$, which can then be fed into existing agnostic boosting procedures to achieve the desired guarantee.

Distance Metric Learning under Covariate Shift

by Bin Cao, Xiaochuan Ni, Jian-tao Sun, Gang Wang, Qiang Yang - In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence
Learning distance metrics is a fundamental problem in machine learning. Previous distance-metric learning research assumes that the training and test data are drawn from the same distribution, which may be violated in practical applications. When the distributions differ, a situation referred to as covariate shift, the metric learned from training data may not work well on the test data. In this case the metric is said to be inconsistent. In this paper, we address this problem by proposing a novel metric learning framework known as consistent distance metric learning (CDML), which solves the problem under covariate shift situations. We theoretically analyze the conditions when the metrics learned under covariate shift are consistent. Based on the analysis, a convex optimization problem is proposed to deal with the CDML problem. An importance sampling method is proposed for metric learning and two importance weighting strategies are proposed and compared in this work. Experiments are carried out on synthetic and real-world datasets to show the effectiveness of the proposed method.

Thesis Proposal: Approximation Algorithms and New Models for Clustering and Learning

by Pranjal Awasthi
This thesis concerns two fundamental problems in clustering and learning: (a) the k-median and the k-means clustering problems, and (b) the problem of learning under adversarial noise, also known as agnostic learning. For k-median and k-means clustering we design efficient algorithms which provide arbitrarily good approximation guarantees on a wide class of datasets. These are datasets which satisfy a natural notion of stability called weak-deletion stability. In addition to giving good approximation algorithms, the notion of stability studied in this thesis seems quite promising in approaching the task of transfer clustering. We also make progress on the problem of agnostically learning the class of Boolean disjunctions and improve on the best known approximation guarantee. In addition we study two new interactive models for clustering and learning which are well

Topical Query Decomposition

by Francesco Bonchi, Debora Donato, Carlos Castillo, Aristides Gionis
We introduce the problem of query decomposition, where we are given a query and a document retrieval system, and we want to produce a small set of queries whose union of resulting documents corresponds approximately to that of the original query. Ideally, these queries should represent coherent, conceptually well-separated topics. We provide an abstract formulation of the query decomposition problem, and we tackle it from two different perspectives. We first show how the problem can be instantiated as a specific variant of a set cover problem, for which we provide an efficient greedy algorithm. Next, we show how the same problem can be seen as a constrained clustering problem, with a very particular kind of constraint, i.e., clustering with predefined clusters. We develop a two-phase algorithm based on hierarchical agglomerative clustering followed by dynamic programming. Our experiments, conducted on a set of actual queries in a Web scale search engine, confirm the effectiveness of the proposed solutions.