Results 1 - 10 of 25
A sample-and-clean framework for fast and accurate query processing on dirty data
- In SIGMOD, 2014
"... In emerging Big Data scenarios, obtaining timely, high-quality answers to aggregate queries is difficult due to the challenges of processing and cleaning large, dirty data sets. To increase the speed of query processing, there has been a resurgence of interest in sampling-based approximate query pro ..."
Abstract
-
Cited by 12 (6 self)
- Add to MetaCart
(Show Context)
Abstract: In emerging Big Data scenarios, obtaining timely, high-quality answers to aggregate queries is difficult due to the challenges of processing and cleaning large, dirty data sets. To increase the speed of query processing, there has been a resurgence of interest in sampling-based approximate query processing (SAQP). In its usual formulation, however, SAQP does not address data cleaning at all, and in fact exacerbates answer quality problems by introducing sampling error. In this paper, we explore an intriguing opportunity: using sampling to actually improve answer quality. We introduce the Sample-and-Clean framework, which applies data cleaning to a relatively small subset of the data and uses the results of the cleaning process to lessen the impact of dirty data on aggregate query answers. We derive confidence intervals as a function of sample size and show how our approach addresses error bias. We evaluate the Sample-and-Clean framework using data from three sources: the TPC-H benchmark with synthetic noise, a subset of the Microsoft academic citation index, and a sensor data set. Our results are consistent with the theoretical confidence intervals and suggest that the Sample-and-Clean framework can produce significant improvements in accuracy compared to query processing without data cleaning, and in speed compared to data cleaning without sampling.
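As a concrete illustration of the correction-based flavor of this idea, here is a minimal Python sketch: run the cheap aggregate over all dirty rows, clean only a small random sample, and subtract the bias estimated from that sample. The `clean_fn` callback, the normal-approximation interval, and the toy noise model are illustrative assumptions, not the paper's actual estimators.

```python
import random
import statistics

def sample_and_clean_mean(dirty, clean_fn, sample_size, seed=0):
    """Estimate the clean mean: cheap dirty aggregate minus sampled bias.

    A sketch of the correction-based estimator, assuming clean_fn(i)
    returns the cleaned value of row i (standing in for an expensive
    cleaning routine that is run on only sample_size rows).
    """
    rng = random.Random(seed)
    idx = rng.sample(range(len(dirty)), sample_size)
    diffs = [dirty[i] - clean_fn(i) for i in idx]      # per-row dirtiness
    estimate = statistics.fmean(dirty) - statistics.fmean(diffs)
    se = statistics.stdev(diffs) / sample_size ** 0.5  # std error of the correction
    return estimate, (estimate - 1.96 * se, estimate + 1.96 * se)

# Toy data: roughly 30% of rows are inflated by 10.
rng = random.Random(1)
truth = [float(x) for x in range(1000)]
dirty = [t + 10 if rng.random() < 0.3 else t for t in truth]
est, ci = sample_and_clean_mean(dirty, lambda i: truth[i], sample_size=100)
print(f"estimate {est:.1f}, 95% CI ({ci[0]:.1f}, {ci[1]:.1f}); "
      f"true mean {statistics.fmean(truth):.1f}")
```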
A Framework for Protecting Worker Location Privacy in Spatial Crowdsourcing
"... Spatial Crowdsourcing (SC) is a transformative platform that engages individuals, groups and communities in the act of collecting, analyzing, and disseminating environmental, social and other spatio-temporal information. The objective of SC is to outsource a set of spatio-temporal tasks to a set of ..."
Abstract
-
Cited by 10 (6 self)
- Add to MetaCart
(Show Context)
Abstract: Spatial Crowdsourcing (SC) is a transformative platform that engages individuals, groups and communities in the act of collecting, analyzing, and disseminating environmental, social and other spatio-temporal information. The objective of SC is to outsource a set of spatio-temporal tasks to a set of workers, i.e., individuals with mobile devices that perform the tasks by physically traveling to specified locations of interest. However, current solutions require the workers, who in many cases are simply volunteering for a cause, to disclose their locations to untrustworthy entities. In this paper, we introduce a framework for protecting location privacy of workers participating in SC tasks. We argue that existing location privacy techniques are not sufficient for SC, and we propose a mechanism based on differential privacy and geocasting that achieves effective SC services while offering privacy guarantees to workers. We investigate analytical models and task assignment strategies that balance multiple crucial aspects of SC functionality, such as task completion rate, worker travel distance and system overhead. Extensive experimental results on real-world datasets show that the proposed technique protects workers' location privacy without incurring significant performance penalties.
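To make the combination of differential privacy and geocasting concrete, here is a hypothetical Python sketch in the spirit of the abstract: a trusted party releases a grid of worker counts perturbed with Laplace noise (the standard differential-privacy mechanism), and the server geocasts a task to the cells nearest to it until the noisy counts suggest enough workers are covered. The square grid, cell size, and coverage rule are illustrative assumptions, not the paper's actual analytical models.

```python
import random

def laplace(scale, rng):
    """Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_worker_grid(workers, area, cell, epsilon, rng):
    """Release epsilon-DP worker counts over an (area x area) grid.

    Every cell (including empty ones) gets Laplace(1/epsilon) noise,
    since each worker affects exactly one cell count (sensitivity 1).
    """
    n = int(area // cell)
    counts = {(i, j): 0 for i in range(n) for j in range(n)}
    for x, y in workers:
        counts[(int(x // cell), int(y // cell))] += 1
    return {k: c + laplace(1.0 / epsilon, rng) for k, c in counts.items()}

def geocast_cells(noisy, task_xy, cell, needed):
    """Expand outward from the task's cell until noisy counts cover `needed`."""
    tx, ty = int(task_xy[0] // cell), int(task_xy[1] // cell)
    order = sorted(noisy, key=lambda k: (k[0] - tx) ** 2 + (k[1] - ty) ** 2)
    chosen, covered = [], 0.0
    for k in order:
        if covered >= needed:
            break
        chosen.append(k)
        covered += max(noisy[k], 0.0)   # clamp negative noisy counts
    return chosen

rng = random.Random(7)
workers = [(rng.uniform(0, 100), rng.uniform(0, 100)) for _ in range(500)]
grid = noisy_worker_grid(workers, area=100.0, cell=10.0, epsilon=0.5, rng=rng)
print(geocast_cells(grid, task_xy=(42.0, 17.0), cell=10.0, needed=3))
```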
Corleone: Hands-off crowdsourcing for entity matching
2014
"... Recent approaches to crowdsourcing entity matching (EM) are limited in that they crowdsource only parts of the EM workflow, requiring a developer to execute the remaining parts. Consequently, these approaches do not scale to the growing EM need at enterprises and crowdsourcing startups, and cannot h ..."
Abstract
-
Cited by 10 (1 self)
- Add to MetaCart
(Show Context)
Abstract: Recent approaches to crowdsourcing entity matching (EM) are limited in that they crowdsource only parts of the EM workflow, requiring a developer to execute the remaining parts. Consequently, these approaches do not scale to the growing EM need at enterprises and crowdsourcing startups, and cannot handle scenarios where ordinary users (i.e., the masses) want to leverage crowdsourcing to match entities. In response, we propose the notion of hands-off crowdsourcing (HOC), which crowdsources the entire workflow of a task, thus requiring no developers. We show how HOC can represent a next logical direction for crowdsourcing research, scale up EM at enterprises and crowdsourcing startups, and open up crowdsourcing for the masses. We describe Corleone, a HOC solution for EM, which uses the crowd in all major steps of the EM process. Finally, we discuss the implications of our work for executing crowdsourced RDBMS joins, cleaning learning models, and soliciting complex information types from crowd workers.
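The "crowd trains the matcher" step can be pictured with a toy uncertainty-sampling loop. This is only a sketch under strong simplifications: Corleone's actual learner is a random-forest matcher over many features, whereas here a single similarity score and a 1-D threshold stand in for it, and `crowd_label` stands in for posing a pair to real workers.

```python
def hands_off_match(pairs, sim, crowd_label, rounds=3, batch=2):
    """Crowd-in-the-loop matcher training (toy sketch).

    Each round sends the pairs whose similarity is closest to the current
    decision threshold (the most uncertain ones) to the crowd, then refits
    the threshold midway between labeled matches and non-matches.
    """
    threshold, labeled = 0.5, {}
    for _ in range(rounds):
        unlabeled = [p for p in pairs if p not in labeled]
        for p in sorted(unlabeled, key=lambda q: abs(sim[q] - threshold))[:batch]:
            labeled[p] = crowd_label(p)
        match = [sim[p] for p, y in labeled.items() if y]
        nonmatch = [sim[p] for p, y in labeled.items() if not y]
        if match and nonmatch:
            threshold = (min(match) + max(nonmatch)) / 2  # assumes separable scores
    return {p: sim[p] >= threshold for p in pairs}

# Toy usage: similarities for five candidate pairs; the crowd is an oracle.
sim = {("a1", "b1"): 0.92, ("a2", "b2"): 0.55, ("a3", "b3"): 0.48,
       ("a4", "b4"): 0.30, ("a5", "b5"): 0.60}
oracle = {p: s > 0.5 for p, s in sim.items()}
print(hands_off_match(list(sim), sim, lambda p: oracle[p]))
```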
Crowdsourcing Algorithms for Entity Resolution
"... In this paper, we study a hybrid human-machine approach for solving the problem of Entity Resolution (ER). The goal of ER is to identify all records in a database that refer to the same underlying entity, and are therefore duplicates of each other. Our input is a graph over all the records in a data ..."
Abstract
-
Cited by 5 (0 self)
- Add to MetaCart
(Show Context)
Abstract: In this paper, we study a hybrid human-machine approach for solving the problem of Entity Resolution (ER). The goal of ER is to identify all records in a database that refer to the same underlying entity, and are therefore duplicates of each other. Our input is a graph over all the records in a database, where each edge has a probability denoting our prior belief (based on Machine Learning models) that the pair of records represented by the given edge are duplicates. Our objective is to resolve all the duplicates by asking humans to verify the equality of a subset of edges, leveraging the transitivity of the equality relation to infer the remaining edges (e.g., a = c can be inferred given a = b and b = c). We consider the problem of designing optimal strategies for asking questions to humans that minimize the expected number of questions asked. Using our theoretical framework, we analyze several strategies, and show that a strategy claimed as "optimal" for this problem in recent work can perform arbitrarily badly in theory. We propose alternate strategies with theoretical guarantees. Using both public datasets as well as the production system at Facebook, we show that our techniques are effective in practice.
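The transitivity mechanics are easy to demonstrate. Below is a minimal Python sketch (an illustrative baseline, not the optimal strategies the paper designs): candidate pairs are processed in descending match probability, a union-find tracks confirmed clusters, and a question is skipped whenever its answer is already implied positively (same cluster) or negatively (clusters known to differ). `ask(a, b)` stands in for one crowd question.

```python
class UnionFind:
    """Union-find with path halving; no ranks, fine for a sketch."""
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, a):
        while self.parent[a] != a:
            self.parent[a] = self.parent[self.parent[a]]
            a = self.parent[a]
        return a

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def resolve(n, edges, ask):
    """Ask about pairs in descending match probability, skipping any
    pair whose answer transitivity already implies."""
    uf = UnionFind(n)
    non_match = set()          # frozensets of cluster roots known distinct
    asked = 0
    for _, a, b in sorted(edges, reverse=True):
        ra, rb = uf.find(a), uf.find(b)
        if ra == rb or frozenset((ra, rb)) in non_match:
            continue           # positive or negative answer is inferred
        asked += 1
        if ask(a, b):
            uf.union(ra, rb)
            # remap stored roots: one of them may no longer be a root
            non_match = {frozenset(uf.find(x) for x in p) for p in non_match}
        else:
            non_match.add(frozenset((ra, rb)))
    return uf, asked

# Toy run: records 0, 1, 2 are one entity; record 3 is another.
same = lambda a, b: (a < 3) == (b < 3)
edges = [(0.9, 0, 1), (0.8, 1, 2), (0.7, 0, 2), (0.3, 2, 3)]
uf, asked = resolve(4, edges, same)
print("crowd questions asked:", asked)   # 3 of 4 edges; 0 = 2 was inferred
```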
OASSIS: query driven crowd mining
- In SIGMOD, 2014
"... Crowd data sourcing is increasingly used to gather infor-mation from the crowd and to obtain recommendations. In this paper, we explore a novel approach that broadens crowd data sourcing by enabling users to pose general questions, to mine the crowd for potentially relevant data, and to re-ceive con ..."
Abstract
-
Cited by 5 (3 self)
- Add to MetaCart
(Show Context)
Abstract: Crowd data sourcing is increasingly used to gather information from the crowd and to obtain recommendations. In this paper, we explore a novel approach that broadens crowd data sourcing by enabling users to pose general questions, to mine the crowd for potentially relevant data, and to receive concise, relevant answers that represent frequent, significant data patterns. Our approach is based on (1) a simple generic model that captures both ontological knowledge and the individual history or habits of crowd members from which frequent patterns are mined; (2) a query language in which users can declaratively specify their information needs and the data patterns of interest; (3) an efficient query evaluation algorithm, which enables mining semantically concise answers while minimizing the number of questions posed to the crowd; and (4) an implementation of these ideas that mines the crowd through an interactive user interface. Experimental results with both real-life crowd and synthetic data demonstrate the feasibility and effectiveness of the approach.
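One way to picture "minimizing the number of questions posed to the crowd" is an Apriori-style levelwise search, in which a pattern is only asked about if all of its sub-patterns were already found frequent. The sketch below is a generic illustration under that assumption, not the paper's semantic query-evaluation algorithm; `ask_support` stands in for polling crowd members.

```python
from itertools import combinations

def crowd_frequent_patterns(items, ask_support, min_support, max_size=3):
    """Levelwise (Apriori-style) mining that prunes crowd questions.

    A candidate pattern is posed to the crowd only if every sub-pattern
    one item smaller was already found frequent, so infrequent branches
    cost no further questions.
    """
    frequent, level = [], [frozenset([i]) for i in items]
    while level and len(next(iter(level))) <= max_size:
        kept = {p for p in level if ask_support(p) >= min_support}
        frequent.extend(kept)
        candidates = {a | b for a in kept for b in kept if len(a | b) == len(a) + 1}
        level = [c for c in candidates
                 if all(frozenset(s) in kept for s in combinations(c, len(c) - 1))]
    return frequent

# Toy crowd: each member's "habits" are a set of items.
crowd = [{"hiking", "camping"}, {"hiking", "camping", "kayaking"},
         {"hiking"}, {"camping"}, {"hiking", "camping"}]
support = lambda p: sum(p <= member for member in crowd) / len(crowd)
print(crowd_frequent_patterns({"hiking", "camping", "kayaking"}, support, 0.5))
```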
User-Driven Refinement of Imprecise Queries
"... Abstract—We propose techniques for exploratory search in large databases. The goal is to provide new functionality that aids users in homing in on the right query conditions to find what they are looking for. Query refinement proceeds interactively by repeatedly consulting the user to manage query c ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
(Show Context)
Abstract: We propose techniques for exploratory search in large databases. The goal is to provide new functionality that aids users in homing in on the right query conditions to find what they are looking for. Query refinement proceeds interactively by repeatedly consulting the user to manage query conditions. This process is characterized by three key challenges: (1) dealing with incomplete and imprecise user input, (2) keeping user effort low, and (3) guaranteeing interactive system response time. We address the first two challenges with a probability-based framework that guides the user to the most important query conditions. To recover from input errors, we introduce the notion of sensitivity and propose efficient algorithms for identifying the most sensitive user input, i.e., those inputs that had the greatest influence on the query results. For the third challenge, we develop techniques that can deliver estimates of the required probabilities within a given hard real-time limit and are able to adapt automatically as the interactive query refinement proceeds.
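A concrete (assumed) way to read "sensitivity" is shown below: drop each user-supplied condition in turn, re-run the query, and score the condition by how many result rows it alone changes. The metric and the toy schema are illustrative guesses, not the paper's exact definition or algorithms.

```python
def sensitivity(rows, conditions):
    """Score each user condition by how many result rows it alone changes:
    re-run the query without that condition and count the rows that
    appear or disappear (the symmetric difference of the result sets)."""
    def run(conds):
        return {i for i, r in enumerate(rows) if all(c(r) for c in conds.values())}
    full = run(conditions)
    return {name: len(run({k: v for k, v in conditions.items() if k != name}) ^ full)
            for name in conditions}

# Toy apartment search: which condition most constrains the result?
rows = [{"price": p, "beds": b}
        for p, b in [(900, 1), (1500, 2), (2100, 3), (3000, 4), (1200, 2)]]
conds = {"cheap": lambda r: r["price"] < 2000,
         "roomy": lambda r: r["beds"] >= 2}
print(sensitivity(rows, conds))   # {'cheap': 2, 'roomy': 1}: "cheap" matters more
```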
On Optimality of Jury Selection in Crowdsourcing
"... Recent advances in crowdsourcing technologies enable computa-tionally challenging tasks (e.g., sentiment analysis and entity reso-lution) to be performed by Internet workers, driven mainly by mon-etary incentives. A fundamental question is: how should work-ers be selected, so that the tasks in hand ..."
Abstract
-
Cited by 1 (1 self)
- Add to MetaCart
(Show Context)
Abstract: Recent advances in crowdsourcing technologies enable computationally challenging tasks (e.g., sentiment analysis and entity resolution) to be performed by Internet workers, driven mainly by monetary incentives. A fundamental question is: how should workers be selected, so that the tasks at hand can be accomplished successfully and economically? In this paper, we study the Jury Selection Problem (JSP): given a monetary budget and a set of decision-making tasks (e.g., "Is Bill Gates still the CEO of Microsoft now?"), return the set of workers (called a jury) such that their answers yield the highest "Jury Quality" (JQ). Existing JSP solutions make use of the Majority Voting (MV) strategy, which uses the answer chosen by the largest number of workers. We show that MV does not yield the best solution for JSP. We further prove that among all voting strategies (including deterministic and randomized strategies), Bayesian Voting (BV) can optimally solve JSP. We then examine how to solve JSP based on BV. This is technically challenging, since computing the JQ with BV is NP-hard. We solve this problem by proposing an approximate algorithm that is computationally efficient. Our approximate JQ computation algorithm is also highly accurate, and its error is proved to be bounded within 1%. We extend our solution by considering the task owner's "belief" (or prior) on the answers of the tasks. Experiments on synthetic and real datasets show that our new approach is consistently better than the best JSP solution known.
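The gap between Majority Voting and Bayesian Voting is easy to reproduce on a toy binary task. The sketch below computes the Jury Quality of a strategy exactly by enumerating all answer vectors, which is feasible only for tiny juries; the paper's point is precisely that exact JQ computation under BV is NP-hard in general, so it develops an efficient approximation instead. Worker error rates here are assumed known.

```python
from itertools import product

def jury_quality(error_rates, vote_fn, prior=0.5):
    """Exact JQ: probability the strategy outputs the true answer,
    enumerating every vote vector of a binary task."""
    jq = 0.0
    for truth in (0, 1):
        p_truth = prior if truth == 1 else 1.0 - prior
        for votes in product((0, 1), repeat=len(error_rates)):
            p = p_truth
            for v, e in zip(votes, error_rates):
                p *= (1.0 - e) if v == truth else e
            if vote_fn(votes, error_rates) == truth:
                jq += p
    return jq

def majority(votes, _rates):
    """Majority Voting: the label chosen by more workers wins."""
    return int(2 * sum(votes) > len(votes))

def bayesian(votes, rates):
    """Bayesian Voting: the label with the higher likelihood wins."""
    like = {0: 1.0, 1: 1.0}
    for v, e in zip(votes, rates):
        for t in (0, 1):
            like[t] *= (1.0 - e) if v == t else e
    return max(like, key=like.get)

jurors = [0.1, 0.4, 0.45]      # one expert, two near-random workers
print("MV JQ:", jury_quality(jurors, majority))   # ~0.77
print("BV JQ:", jury_quality(jurors, bayesian))   # 0.90: BV follows the expert
```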
Optimization in knowledge-intensive crowdsourcing
- CoRR
"... We present SmartCrowd, a framework for optimizing col-laborative knowledge-intensive crowdsourcing. SmartCrowd distinguishes itself by accounting for human factors in the process of assigning tasks to workers. Human factors des-ignate workers ’ expertise in different skills, their expected minimum w ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
(Show Context)
Abstract: We present SmartCrowd, a framework for optimizing collaborative knowledge-intensive crowdsourcing. SmartCrowd distinguishes itself by accounting for human factors in the process of assigning tasks to workers. Human factors designate workers' expertise in different skills, their expected minimum wage, and their availability. In SmartCrowd, we formulate task assignment as an optimization problem, and rely on pre-indexing workers and maintaining the indexes adaptively, in such a way that the task assignment process is optimized both in quality and in computation time. We present rigorous theoretical analyses of the optimization problem and propose optimal and approximation algorithms. We finally perform extensive performance and quality experiments using real and synthetic data to demonstrate that adaptive indexing in SmartCrowd is necessary to achieve efficient, high-quality task assignment.
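The human-factor inputs named above (skill, minimum wage, availability) map naturally onto a small assignment routine. The greedy heuristic below is only an assumed stand-in for intuition; SmartCrowd's actual contribution is the adaptive index and the optimal/approximation algorithms, which this sketch does not implement.

```python
def assign_team(required_skill, budget, workers):
    """Greedy sketch: among available workers meeting the skill bar,
    take the best wage-per-skill candidates until the budget runs out.
    Each worker is (name, skill in [0, 1], minimum wage, available)."""
    qualified = [w for w in workers if w[3] and w[1] >= required_skill]
    team, cost = [], 0.0
    for name, skill, wage, _ in sorted(qualified, key=lambda w: w[2] / w[1]):
        if cost + wage <= budget:
            team.append(name)
            cost += wage
    return team, cost

workers = [("ann", 0.90, 12.0, True), ("bob", 0.70, 5.0, True),
           ("cat", 0.80, 6.0, False),  # unavailable: filtered out
           ("dan", 0.75, 7.0, True)]
print(assign_team(required_skill=0.7, budget=15.0, workers=workers))
# (['bob', 'dan'], 12.0) under this toy data
```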
The Expected Optimal Labeling Order Problem for Crowdsourced Joins and Entity Resolution
"... extending our earlier work on crowdsourced entity resolu-tion to improve crowdsourced join processing by exploiting transitive relationships. The VLDB 2014 conference has a paper [1] that follows up on our previous work, which points out and corrects a mistake we made in our SIGMOD paper. Specifical ..."
Abstract
-
Cited by 1 (1 self)
- Add to MetaCart
(Show Context)
Abstract: … extending our earlier work on crowdsourced entity resolution to improve crowdsourced join processing by exploiting transitive relationships. The VLDB 2014 conference has a paper [1] that follows up on our previous work, which points out and corrects a mistake we made in our SIGMOD paper. Specifically, in Section 4.2 of our SIGMOD paper, we defined the "Expected Optimal Labeling Order" (EOLO) problem, and proposed an algorithm for solving it. We incorrectly claimed that our algorithm is optimal. In their paper, Vesdapunt et al. show that the problem is actually NP-hard, and based on that observation, propose a new algorithm to solve it. In this note, we would like to put the Vesdapunt et al. results in context, something we believe their paper does not adequately do.