Results 1 - 10 of 46
Programmatic gold: Targeted and scalable quality assurance in crowdsourcing
- Proc. HComp, 2011
"... Crowdsourcing is an effective tool for scalable data annota-tion in both research and enterprise contexts. Due to crowd-sourcing’s open participation model, quality assurance is crit-ical to the success of any project. Present methods rely on EM-style post-processing or manual annotation of large go ..."
Abstract - Cited by 25 (0 self)
Crowdsourcing is an effective tool for scalable data annotation in both research and enterprise contexts. Due to crowdsourcing’s open participation model, quality assurance is critical to the success of any project. Present methods rely on EM-style post-processing or manual annotation of large gold standard sets. In this paper we present an automated quality assurance process that is inexpensive and scalable. Our novel process relies on programmatic gold creation to provide targeted training feedback to workers and to prevent common scamming scenarios. We find that it decreases the amount of manual work required to manage crowdsourced labor while improving the overall quality of the results.
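To make the gold-based quality-control idea concrete, here is a minimal Python sketch of scoring each worker against injected gold items and flagging workers below an accuracy threshold. The function name, data layout, and thresholds are hypothetical illustrations, not the programmatic gold pipeline described in the paper.

# Hypothetical sketch: score workers against injected gold items and flag
# likely scammers. Thresholds and data layout are illustrative only.
def score_workers(judgments, gold, min_gold_seen=5, accuracy_threshold=0.7):
    """judgments: {worker_id: {item_id: label}}, gold: {item_id: label}."""
    report = {}
    for worker, labels in judgments.items():
        seen = [item for item in labels if item in gold]
        if len(seen) < min_gold_seen:
            continue  # not enough gold overlap yet to judge this worker
        accuracy = sum(labels[i] == gold[i] for i in seen) / len(seen)
        report[worker] = {"gold_accuracy": accuracy,
                          "rejected": accuracy < accuracy_threshold}
    return report

workers = {"w1": {"q1": "A", "q2": "B", "q3": "A", "q4": "A", "q5": "B"},
           "w2": {"q1": "B", "q2": "B", "q3": "B", "q4": "B", "q5": "B"}}
gold = {"q1": "A", "q2": "B", "q3": "A", "q4": "A", "q5": "B"}
print(score_workers(workers, gold))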
Crowdsourcing for Book Search Evaluation: Impact of HIT Design on Comparative System Ranking
"... The evaluation of information retrieval (IR) systems over special collections, such as large book repositories, is out of reach of traditional methods that rely upon editorial relevance judgments. Increasingly, the use of crowdsourcing to collect relevance labels has been regarded as a viable altern ..."
Abstract - Cited by 25 (9 self)
The evaluation of information retrieval (IR) systems over special collections, such as large book repositories, is out of reach of traditional methods that rely upon editorial relevance judgments. Increasingly, the use of crowdsourcing to collect relevance labels has been regarded as a viable alternative that scales with modest costs. However, crowdsourcing suffers from undesirable worker practices and low quality contributions. In this paper we investigate the design and implementation of effective crowdsourcing tasks in the context of book search evaluation. We observe the impact of aspects of the Human Intelligence Task (HIT) design on the quality of relevance labels provided by the crowd. We assess the output in terms of label agreement with a gold standard data set and observe the effect of the crowdsourced relevance judgments on the resulting system rankings. This enables us to observe the effect of crowdsourcing on the entire IR evaluation process. Using the test set and experimental runs from the INEX 2010 Book Track, we find that varying the HIT design, the pooling strategy, and the document ordering strategy leads to considerable differences in agreement with the gold set labels. We then observe the impact of the crowdsourced relevance label sets on the relative system rankings using four IR performance metrics. System rankings based on MAP and Bpref remain less affected by different label sets, while Precision@10 and nDCG@10 lead to dramatically different system rankings, especially for labels acquired from HITs with weaker quality controls. Overall, we find that crowdsourcing can be an effective tool for the evaluation of IR systems, provided that care is taken when designing the HITs.
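As a reference point for the shallow metrics this comparison highlights, below is a short Python sketch of the textbook Precision@10 and nDCG@10 computations over a ranked list and a set of (possibly crowdsourced) graded labels. The data layout and example values are invented; this is the standard formulation, not the track's evaluation tooling.

import math

def precision_at_k(ranked_ids, relevant, k=10):
    """Fraction of the top-k results that appear in the relevant set."""
    return sum(1 for d in ranked_ids[:k] if d in relevant) / k

def ndcg_at_k(ranked_ids, grades, k=10):
    """grades: {doc_id: graded relevance}; standard log2 discount."""
    dcg = sum(grades.get(d, 0) / math.log2(i + 2)
              for i, d in enumerate(ranked_ids[:k]))
    ideal = sorted(grades.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

ranked = ["d3", "d1", "d7", "d2", "d9", "d4", "d8", "d5", "d6", "d0"]
crowd_grades = {"d1": 2, "d3": 1, "d4": 1, "d9": 2}   # hypothetical labels
print(precision_at_k(ranked, set(crowd_grades)), round(ndcg_at_k(ranked, crowd_grades), 3))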
Social Turing Tests: Crowdsourcing Sybil Detection
"... As popular tools for spreading spam and malware, Sybils (or fake accounts) pose a serious threat to online communities such as Online Social Networks (OSNs). Today, sophisticated attackers are creating realistic Sybils that effectively befriend legitimate users, rendering most automated Sybil detect ..."
Abstract - Cited by 20 (8 self)
As popular tools for spreading spam and malware, Sybils (or fake accounts) pose a serious threat to online communities such as Online Social Networks (OSNs). Today, sophisticated attackers are creating realistic Sybils that effectively befriend legitimate users, rendering most automated Sybil detection techniques ineffective. In this paper, we explore the feasibility of a crowdsourced Sybil detection system for OSNs. We conduct a large user study on the ability of humans to detect today’s Sybil accounts, using a large corpus of ground-truth Sybil accounts from the Facebook and Renren networks. We analyze detection accuracy by both “experts” and “turkers” under a variety of conditions, and find that while turkers vary significantly in their effectiveness, experts consistently produce near-optimal results. We use these results to drive the design of a multi-tier crowdsourcing Sybil detection system. Using our user study data, we show that this system is scalable, and can be highly effective either as a standalone system or as a complementary technique to current tools.
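To illustrate the multi-tier idea in miniature, the Python sketch below runs a cheap turker vote first and escalates to a smaller expert tier only when that vote is not decisive. The tiers, vote counts, and threshold are hypothetical and simplified, not the system designed in the paper.

# Illustrative two-tier crowd filter; parameters are placeholders.
def classify_profile(turker_votes, expert_votes, decisive=0.8):
    """Each vote list holds True (Sybil) / False (legitimate) votes."""
    sybil_fraction = sum(turker_votes) / len(turker_votes)
    if sybil_fraction >= decisive:
        return "sybil"
    if sybil_fraction <= 1 - decisive:
        return "legitimate"
    # Ambiguous turker vote: escalate to the expert tier's majority.
    return "sybil" if sum(expert_votes) > len(expert_votes) / 2 else "legitimate"

print(classify_profile([True, True, False, True, False], [True, False, True]))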
ViewSer: Enabling Large-Scale Remote User Studies of Web Search Examination and Interaction
- In SIGIR ’11, 2011
"... Web search behaviour studies, including eye-tracking stud-ies of search result examination, have resulted in numerous insights to improve search result quality and presentation. Yet, eye tracking studies have been restricted in scale, due to the expense and the effort required. Furthermore, as the r ..."
Abstract - Cited by 13 (3 self)
Web search behaviour studies, including eye-tracking studies of search result examination, have resulted in numerous insights to improve search result quality and presentation. Yet, eye tracking studies have been restricted in scale, due to the expense and the effort required. Furthermore, as the reach of the Web expands, it becomes increasingly important to understand how searchers around the world see and interact with the search results. To address both challenges, we introduce ViewSer, a novel methodology for performing web search examination studies remotely, at scale, and without requiring eye-tracking equipment. ViewSer operates by automatically modifying the appearance of a search engine result page, to clearly show one search result at a time as if through a “viewport”, while partially blurring the rest and allowing the participant to move the viewport naturally with a computer mouse or trackpad. Remarkably, the resulting result-viewing and clickthrough patterns agree closely with unrestricted viewing of results, as measured by eye-tracking equipment, validated by a study with over 100 participants. We also explore applications of ViewSer to practical search tasks, such as analyzing search result summary (snippet) attractiveness, result re-ranking, and evaluating snippet quality. These experiments could previously only be done by tracking the eye movements of a small number of subjects in the lab. In contrast, our study was performed with over 100 participants, allowing us to reproduce and extend previous findings, establishing ViewSer as a valuable tool for large-scale search behavior experiments.
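One simple way to check this kind of agreement is to correlate per-rank attention measured with the viewport against per-rank fixation time from eye tracking, as in the Python sketch below. The numbers and the use of Pearson correlation are illustrative assumptions, not the analysis reported in the paper.

from statistics import mean

def pearson(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Mean seconds spent on each result rank 1..10 (made-up numbers).
viewport_dwell = [4.1, 2.8, 2.0, 1.5, 1.2, 1.0, 0.8, 0.7, 0.6, 0.5]
eyetrack_fixation = [4.3, 2.6, 1.9, 1.6, 1.1, 0.9, 0.9, 0.6, 0.6, 0.4]
print(round(pearson(viewport_dwell, eyetrack_fixation), 3))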
Worker Types and Personality Traits in Crowdsourcing Relevance Labels
"... Crowdsourcing platforms offer unprecedented opportunities for creating evaluation benchmarks, but suffer from varied output quality from crowd workers who possess different levels of competence and aspiration. This raises new challenges for quality control and requires an in-depth understanding of h ..."
Abstract - Cited by 13 (1 self)
Crowdsourcing platforms offer unprecedented opportunities for creating evaluation benchmarks, but suffer from varied output quality from crowd workers who possess different levels of competence and aspiration. This raises new challenges for quality control and requires an in-depth understanding of how workers’ characteristics relate to the quality of their work. In this paper, we use behavioral observations (HIT completion time, fraction of useful labels, label accuracy) to define five worker types: Spammer, Sloppy, Incompetent, Competent, Diligent. Using data collected from workers engaged in the crowdsourced evaluation of the INEX 2010 Book Track Prove It task, we relate the worker types to label accuracy and personality trait information along the ‘Big Five’ personality dimensions. We expect that these new insights about the types of crowd workers and the quality of their work will inform how to design HITs to attract the best workers to a task and explain why certain HIT designs are more effective than others.
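A rule-based assignment over the three behavioral signals named above might look like the Python sketch below. The thresholds are invented for illustration; the paper derives its worker-type boundaries from the observed data rather than fixed cutoffs.

# Illustrative worker typing from behavioral signals; thresholds are hypothetical.
def worker_type(completion_time_s, useful_fraction, accuracy):
    if useful_fraction < 0.2:
        return "Spammer"       # barely any usable labels
    if completion_time_s < 15 and accuracy < 0.5:
        return "Sloppy"        # fast and inaccurate
    if accuracy < 0.5:
        return "Incompetent"   # careful pace but still inaccurate
    if accuracy < 0.8:
        return "Competent"
    return "Diligent"          # high accuracy, mostly useful labels

print(worker_type(completion_time_s=10, useful_fraction=0.9, accuracy=0.4))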
Overview of the INEX 2010 book track: Scaling up the evaluation using crowdsourcing
- In Comparative Evaluation of Focused Retrieval: 9th International Workshop of the Initiative for the Evaluation of XML Retrieval (INEX 2010), LNCS, 2011
"... Abstract. The goal of the INEX Book Track is to evaluate approaches for supporting users in searching, navigating and reading the full texts of digitized books. The investigation is focused around four tasks: 1) Best Books to Reference, 2) Prove It, 3) Structure Extraction, and 4) Active Reading. In ..."
Abstract - Cited by 12 (5 self)
The goal of the INEX Book Track is to evaluate approaches for supporting users in searching, navigating and reading the full texts of digitized books. The investigation is focused around four tasks: 1) Best Books to Reference, 2) Prove It, 3) Structure Extraction, and 4) Active Reading. In this paper, we report on the setup and the results of these tasks in 2010. The main outcome of the track lies in the changes to the methodology for constructing the test collection for the evaluation of the Best Books and Prove It search tasks. In an effort to scale up the evaluation, we explored the use of crowdsourcing both to create the test topics and then to gather the relevance labels for the topics over a corpus of 50k digitized books. The resulting test collection construction methodology combines editorial judgments contributed by INEX participants with crowdsourced relevance labels. We provide an analysis of the crowdsourced data and conclude that – with appropriate task design – crowdsourcing does provide a suitable framework for the evaluation of book search approaches.
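As a minimal sketch of how editorial and crowdsourced labels could be combined into one qrels-style set, the Python fragment below lets an editorial judgment win when present and otherwise falls back to a majority vote over crowd labels. This illustrates the general idea only; the track's actual pooling and aggregation procedure may differ.

from collections import Counter

def merge_labels(editorial, crowd):
    """editorial: {(topic, doc): label}; crowd: {(topic, doc): [labels]}."""
    merged = dict(editorial)                    # editorial judgments take priority
    for key, votes in crowd.items():
        if key not in merged and votes:
            merged[key] = Counter(votes).most_common(1)[0][0]   # crowd majority
    return merged

print(merge_labels({("t1", "b1"): 1}, {("t1", "b2"): [1, 0, 1], ("t1", "b1"): [0, 0]}))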
Social Book Search: Comparing Topical Relevance Judgements and Book Suggestions for Evaluation
"... The Web and social media give us access to a wealth of information, not only different in quantity but also in character—traditional descriptions from professionals are now supplemented with user generated content. This challenges modern search systems based on the classical model of topical relevan ..."
Abstract - Cited by 10 (3 self)
The Web and social media give us access to a wealth of information, not only different in quantity but also in character—traditional descriptions from professionals are now supplemented with user generated content. This challenges modern search systems based on the classical model of topical relevance and ad hoc search: How does their effectiveness transfer to the changing nature of information and to the changing types of information needs and search tasks? We use the INEX 2011 Books and Social Search Track’s collection of book descriptions from Amazon and social cataloguing site LibraryThing. We compare classical IR with social book search in the context of the LibraryThing discussion forums where members ask for book suggestions. Specifically, we compare book suggestions on the forum with Mechanical Turk judgements on topical relevance and recommendation, examining both the judgements themselves and the system evaluations they produce. First, the book suggestions on the forum are a complete enough set of relevance judgements for system evaluation. Second, topical relevance judgements result in a different system ranking from evaluation based on the forum suggestions. Although it is an important aspect for social book search, topical relevance is not sufficient for evaluation. Third, professional metadata alone is often not enough to determine the topical relevance of a book. User reviews provide a better signal for topical relevance. Fourth, user-generated content is more effective for social book search than professional metadata. Based on our findings, we propose an experimental evaluation that better reflects the complexities of social book search.
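One common way to quantify how much two judgment sets change a system ranking is a rank correlation such as Kendall's tau, sketched below in Python. The system names, ranks, and the choice of tau are illustrative; they are not the comparison procedure or data from the paper.

from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """rank_a, rank_b: {system: rank position} over the same set of systems."""
    concordant = discordant = 0
    for s1, s2 in combinations(rank_a, 2):
        d1 = rank_a[s1] - rank_a[s2]
        d2 = rank_b[s1] - rank_b[s2]
        if d1 * d2 > 0:
            concordant += 1
        elif d1 * d2 < 0:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

forum_rank = {"sysA": 1, "sysB": 2, "sysC": 3, "sysD": 4}   # made-up rankings
turk_rank = {"sysA": 2, "sysB": 1, "sysC": 3, "sysD": 4}
print(kendall_tau(forum_rank, turk_rank))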
Corleone: Hands-off crowdsourcing for entity matching
2014
"... Recent approaches to crowdsourcing entity matching (EM) are limited in that they crowdsource only parts of the EM workflow, requiring a developer to execute the remaining parts. Consequently, these approaches do not scale to the growing EM need at enterprises and crowdsourcing startups, and cannot h ..."
Abstract - Cited by 10 (1 self)
Recent approaches to crowdsourcing entity matching (EM) are limited in that they crowdsource only parts of the EM workflow, requiring a developer to execute the remaining parts. Consequently, these approaches do not scale to the growing EM need at enterprises and crowdsourcing startups, and cannot handle scenarios where ordinary users (i.e., the masses) want to leverage crowdsourcing to match entities. In response, we propose the notion of hands-off crowdsourcing (HOC), which crowdsources the entire workflow of a task, thus requiring no developers. We show how HOC can represent a next logical direction for crowdsourcing research, scale up EM at enterprises and crowdsourcing startups, and open up crowdsourcing for the masses. We describe Corleone, a HOC solution for EM, which uses the crowd in all major steps of the EM process. Finally, we discuss the implications of our work for executing crowdsourced RDBMS joins, cleaning learning models, and soliciting complex information types from crowd workers.
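A heavily simplified Python sketch of a crowd-in-the-loop matcher in this spirit appears below: train on a small seed set of labeled pairs, send the most uncertain candidate pairs to the crowd, and retrain. The features, the ask_crowd stub, the classifier choice, and the batch sizes are placeholders, not Corleone itself.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def ask_crowd(pair_features):
    # Placeholder for a real crowdsourcing call (e.g., posting pairwise HITs).
    return int(pair_features.sum() > 2.0)

rng = np.random.default_rng(0)
unlabeled = rng.random((200, 4))                       # similarity features per candidate pair
X = np.vstack([rng.random((10, 4)) * 0.4,              # seed non-matches (low similarity)
               0.6 + rng.random((10, 4)) * 0.4])       # seed matches (high similarity)
y = np.array([0] * 10 + [1] * 10)

model = RandomForestClassifier(n_estimators=50, random_state=0)
for _ in range(3):                                     # a few active-learning rounds
    model.fit(X, y)
    proba = model.predict_proba(unlabeled)[:, 1]
    uncertain = np.argsort(np.abs(proba - 0.5))[:10]   # pairs closest to the decision boundary
    new_y = np.array([ask_crowd(unlabeled[i]) for i in uncertain])
    X = np.vstack([X, unlabeled[uncertain]])
    y = np.concatenate([y, new_y])
    unlabeled = np.delete(unlabeled, uncertain, axis=0)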
Beyond independent agreement: A tournament selection approach for quality assurance of human computation tasks
- In Proc. of Human Computation, 2011
"... Quality assurance remains a key topic in human computation research field. Prior work indicates independent agreement is effective for low difficulty tasks, but has limitations. This paper addresses this problem by proposing a tournament selection based quality control process. The experimental resu ..."
Abstract - Cited by 8 (1 self)
Quality assurance remains a key topic in the human computation research field. Prior work indicates that independent agreement is effective for low-difficulty tasks, but has limitations. This paper addresses this problem by proposing a tournament selection based quality control process. The experimental results in this paper show that humans are better at identifying correct answers than at producing them themselves.
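The core idea lends itself to a short sketch: workers are only ever asked to pick the better of two candidate answers, and a single-elimination tournament reduces the candidate pool to one winner. The Python below is an illustration with a stubbed pairwise vote, not the paper's actual protocol.

import random

def crowd_prefers_first(answer_a, answer_b):
    # Placeholder for a pairwise-comparison HIT; here just a coin flip.
    return random.random() < 0.5

def tournament_winner(candidates):
    pool = list(candidates)
    while len(pool) > 1:
        next_round = [a if crowd_prefers_first(a, b) else b
                      for a, b in zip(pool[0::2], pool[1::2])]
        if len(pool) % 2 == 1:          # an odd candidate gets a bye
            next_round.append(pool[-1])
        pool = next_round
    return pool[0]

print(tournament_winner(["answer1", "answer2", "answer3", "answer4"]))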
The Effect of Threshold Priming and Need for Cognition on Relevance Calibration and Assessment
"... Human assessments of document relevance are needed for the construction of test collections, for ad-hoc evaluation, and for training text classifiers. Showing documents to assessors in different orderings, however, may lead to different assessment outcomes. We examine the effect that threshold primi ..."
Abstract - Cited by 7 (1 self)
Human assessments of document relevance are needed for the construction of test collections, for ad-hoc evaluation, and for training text classifiers. Showing documents to assessors in different orderings, however, may lead to different assessment outcomes. We examine the effect that threshold priming, i.e., seeing documents of varying degrees of relevance, has on people’s calibration of relevance. Participants judged the relevance of a prologue of documents containing highly relevant, moderately relevant, or non-relevant documents, followed by a common epilogue of documents of mixed relevance. We observe that participants exposed to only non-relevant documents in the prologue assigned significantly higher average relevance scores to prologue and epilogue documents than participants exposed to moderately or highly relevant documents in the prologue. We also examine how need for cognition, an individual difference measure of the extent to which a person enjoys engaging in effortful cognitive activity, impacts relevance assessments. High need for cognition participants had a significantly higher level of agreement with expert assessors than low need for cognition participants did. Our findings indicate that assessors should be exposed to documents from multiple relevance levels early in the judging process, in order to calibrate their relevance thresholds in a balanced way, and that individual difference measures might be a useful way to screen assessors.
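The comparison the study reports amounts to averaging the relevance scores assigned to the shared epilogue documents, grouped by which prologue condition the assessor saw; the Python sketch below shows that computation on invented scores, not the study's data.

from statistics import mean

# Hypothetical per-assessor mean scores on the shared epilogue documents.
epilogue_scores = {
    "non-relevant prologue":        [3.1, 2.9, 3.4, 3.0],
    "moderately relevant prologue": [2.4, 2.2, 2.6, 2.3],
    "highly relevant prologue":     [2.1, 2.0, 2.3, 1.9],
}
for condition, scores in epilogue_scores.items():
    print(f"{condition}: mean epilogue score {mean(scores):.2f}")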