Results 1 - 10 of 57
Crowdsourced judgement elicitation with endogenous proficiency
- ACM International World Wide Web Conference (WWW), 2013
"... Crowdsourcing is now widely used to replace judgement or evaluation by an expert authority with an aggregate evaluation from a number of non-experts, in applications ranging from rating and categorizing online content all the way to evaluation of student assignments in massively open online courses ..."
Abstract
-
Cited by 9 (0 self)
- Add to MetaCart
(Show Context)
Crowdsourcing is now widely used to replace judgement or evaluation by an expert authority with an aggregate evaluation from a number of non-experts, in applications ranging from rating and categorizing online content all the way to evaluation of student assignments in massively open online courses (MOOCs) via peer grading. A key issue in these settings, where direct monitoring of both effort and accuracy is infeasible, is incentivizing agents in the ‘crowd’ to put in effort to make good evaluations, as well as to truthfully report their evaluations. We study the design of mechanisms for crowdsourced judgement elicitation when workers strategically choose both their reports and the effort they put into their evaluations. This leads to a new family of information elicitation problems with unobservable ground truth, where an agent’s proficiency, the probability with which she correctly evaluates the underlying ground truth, …
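The abstract stops short of the mechanism itself, but the role proficiency plays in agreement-based scoring can be seen in a minimal Monte-Carlo sketch. This is a generic output-agreement payment, not this paper's mechanism; every name and number below is assumed for illustration.

    import random

    def evaluate(ground_truth, proficiency):
        # An agent's judgement of a binary task: correct with
        # probability `proficiency`, flipped otherwise.
        return ground_truth if random.random() < proficiency else 1 - ground_truth

    def expected_agreement_payment(p_agent, p_peer, trials=100_000, reward=1.0):
        # Monte-Carlo estimate of a generic output-agreement payment:
        # the agent is paid `reward` whenever her report matches a peer's.
        paid = 0
        for _ in range(trials):
            truth = random.randint(0, 1)
            if evaluate(truth, p_agent) == evaluate(truth, p_peer):
                paid += 1
        return reward * paid / trials

    # With a competent peer, expected payment rises with own proficiency.
    for p in (0.5, 0.7, 0.9):
        print(p, round(expected_agreement_payment(p, 0.9), 3))

Against a peer of proficiency 0.9, the agreement probability is p*0.9 + (1-p)*0.1, which is strictly increasing in the agent's own proficiency p; that monotonicity is what makes effort worth exerting under such a payment rule.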
Putting Out a HIT: Crowdsourcing Malware Installs
"... Today, several actors within the Internet’s burgeoning underground economy specialize in providing services to like-minded criminals. At the same time, gray and white markets exist for services on the Internet providing reasonably similar products. In this paper we explore a hypothetical arbitrage b ..."
Abstract
-
Cited by 7 (1 self)
- Add to MetaCart
(Show Context)
Today, several actors within the Internet’s burgeoning underground economy specialize in providing services to like-minded criminals. At the same time, gray and white markets exist for services on the Internet providing reasonably similar products. In this paper we explore a hypothetical arbitrage between these two markets by purchasing “Human Intelligence” on Amazon’s Mechanical Turk service, determining the vulnerability of and cost to compromise the computers being used by the humans to provide this service, and estimating the underground value of the computers which are vulnerable to exploitation. We show that it is economically feasible for an attacker to purchase access to high value hosts via Mechanical Turk, compromise the subset with unpatched, vulnerable browser plugins, and sell access to these hosts via Pay-Per-Install programs for a tidy profit. We also present supplementary statistics gathered regarding Mechanical Turk workers’ browser security, antivirus usage, and willingness to run arbitrary programs in exchange for a small monetary reward.
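The arbitrage claim reduces to break-even arithmetic. The numbers below are placeholders, not the paper's measurements; only the shape of the calculation is the point.

    # Back-of-the-envelope profitability check with purely hypothetical
    # prices; the paper's measured values are not reproduced here.
    hit_cost = 0.05        # assumed payment per Mechanical Turk task (USD)
    p_vulnerable = 0.20    # assumed fraction of workers with an exploitable plugin
    ppi_payout = 0.30      # assumed pay-per-install payout per compromised host (USD)

    profit_per_hit = p_vulnerable * ppi_payout - hit_cost
    print(f"expected profit per HIT: ${profit_per_hit:.3f}")
    # Profitable whenever p_vulnerable * ppi_payout > hit_cost.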
Reviewing versus Doing: Learning and Performance in Crowd Assessment
"... In modern crowdsourcing markets, requesters face the chal-lenge of training and managing large transient workforces. Requesters can hire peer workers to review others ’ work, but the value may be marginal, especially if the reviewers lack requisite knowledge. Our research explores if and how workers ..."
Abstract
-
Cited by 6 (0 self)
- Add to MetaCart
(Show Context)
In modern crowdsourcing markets, requesters face the challenge of training and managing large transient workforces. Requesters can hire peer workers to review others’ work, but the value may be marginal, especially if the reviewers lack requisite knowledge. Our research explores if and how workers learn and improve their performance in a task domain by serving as peer reviewers. Further, we investigate whether peer reviewing may be more effective in teams where the reviewers can reach consensus through discussion. An online between-subjects experiment compares the tradeoffs of reviewing versus producing work using three different organization strategies: working individually, working as an interactive team, and aggregating individuals into nominal groups. The results show that workers who review others’ work perform better on subsequent tasks than workers who just produce. We also find that interactive reviewer teams outperform individual reviewers on all quality measures. However, aggregating individual reviewers into nominal groups produces better quality assessments than interactive teams, except in task domains where discussion helps overcome individual misconceptions.
Perception of average value in multiclass scatterplots
- IEEE Transactions on Visualization and Computer Graphics
"... (a) Larger differences between means lead to im-proved performance. (b) As the number of points per class increases per-formance remains good (in fact it may improve). (c) Stronger cues (color) outperform weaker ones (shape). Although, participants performed well even with weak cues. (d) Combining c ..."
Abstract
-
Cited by 6 (3 self)
- Add to MetaCart
(Show Context)
Fig. 1. Summary of results: viewers can efficiently make comparative mean judgements, choosing the class with the highest average position in multiclass scatterplots across a wide variety of conditions and encodings. (a) Larger differences between means lead to improved performance. (b) As the number of points per class increases, performance remains good (in fact, it may improve). (c) Stronger cues (color) outperform weaker ones (shape), although participants performed well even with weak cues. (d) Combining cues redundantly does not improve performance. (e) Irrelevant cues do not degrade performance: here class is shown by color, and the random shape assignment does no harm. (f) Adding irrelevant additional classes to the scatterplot does not degrade performance.
The visual system can make highly efficient aggregate judgements about a set of objects, with speed roughly independent of the number of objects considered. While there is a rich literature on these mechanisms and their ramifications for visual summarization tasks, this prior work rarely considers more complex tasks requiring multiple judgements over long periods of time, and it has not considered certain critical aggregation types, such as the localization of the mean value of a set of points. In this paper, we explore these questions using a common visualization task as a case study: relative mean value judgements within multiclass scatterplots. We describe how the perception literature provides a set of expected constraints on the task, and we evaluate these predictions with a large-scale perceptual study with crowdsourced participants. Judgements are no harder when each set contains more points; redundant and conflicting encodings, as well as additional sets, do not strongly affect performance; and judgements are harder when using less salient encodings. These results have concrete ramifications for the design of scatterplots.
Index Terms: Psychophysics, Information Visualization, Perceptual Study
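The study's task reduces to a comparison of per-class mean positions. A small sketch of the "correct answer" computation, on synthetic data, makes the task concrete.

    import numpy as np

    # The judgement the study asks of viewers: which class of points in a
    # multiclass scatterplot has the highest average (vertical) position?
    # Data here is synthetic, purely for illustration.
    rng = np.random.default_rng(0)
    y_by_class = {
        "circles":   rng.normal(loc=0.50, scale=0.10, size=50),
        "squares":   rng.normal(loc=0.55, scale=0.10, size=50),
        "triangles": rng.normal(loc=0.45, scale=0.10, size=50),
    }

    means = {cls: ys.mean() for cls, ys in y_by_class.items()}
    print(max(means, key=means.get))  # the class a viewer should report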
Because Hitler did it! Quantitative tests of Bayesian argumentation using ad hominem
- Thinking &amp; Reasoning, 2012
"... Abstract Bayesian probability has recently been proposed as a normative theory of argumentation. In this article, we provide a Bayesian formalisation of the ad Hitlerum argument, as a special case of the ad hominem argument. Across 3 experiments, we demonstrate that people's evaluation of the ..."
Abstract
-
Cited by 5 (0 self)
- Add to MetaCart
(Show Context)
Bayesian probability has recently been proposed as a normative theory of argumentation. In this article, we provide a Bayesian formalisation of the ad Hitlerum argument, as a special case of the ad hominem argument. Across 3 experiments, we demonstrate that people's evaluation of the argument is sensitive to probabilistic factors deemed relevant on a Bayesian formalisation. Moreover, we provide the first quantitative evidence in favour of the Bayesian approach to argumentation. Quantitative Bayesian prescriptions were derived from participants' stated subjective probabilities (Experiments 1 &amp; 2), as well as from frequency information explicitly provided in the experiment (Experiment 3). Participants' stated evaluations of the convincingness of the argument were well matched to these prescriptions.
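On the Bayesian account, an argument's convincingness tracks the posterior probability of its conclusion. A worked example with hypothetical numbers (not the authors' stimuli or data):

    # Bayes' rule applied to an ad Hitlerum argument: how convincing should
    # "X is wrong because Hitler endorsed X" be? All numbers are hypothetical.
    prior = 0.5                # P(X is wrong) before hearing the argument
    p_endorse_if_wrong = 0.6   # P(Hitler endorsed X | X is wrong)
    p_endorse_if_not = 0.3     # P(Hitler endorsed X | X is not wrong)

    posterior = (p_endorse_if_wrong * prior) / (
        p_endorse_if_wrong * prior + p_endorse_if_not * (1 - prior)
    )
    print(f"P(X is wrong | endorsement) = {posterior:.2f}")  # 0.67
    # The argument carries force only insofar as the likelihood
    # ratio p_endorse_if_wrong / p_endorse_if_not differs from 1.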
Quality control of crowd labeling through expert evaluation
- In Second Workshop on Computational Social Science and the Wisdom of Crowds (NIPS), 2011
"... We propose a general scheme for quality-controlled labeling of large-scale data using multiple labels from the crowd and a “few ” ground truth labels from an expert of the field. Expert-labeled instances are used to assign weights to the expertise of each crowd labeler and to the difficulty of each ..."
Abstract
-
Cited by 4 (0 self)
- Add to MetaCart
(Show Context)
We propose a general scheme for quality-controlled labeling of large-scale data using multiple labels from the crowd and a “few” ground truth labels from an expert in the field. Expert-labeled instances are used to assign weights to the expertise of each crowd labeler and to the difficulty of each instance. Ground truth labels for all instances are then approximated through those weights along with the crowd labels. We argue that injecting a little expertise into the labeling process will significantly improve the accuracy of the labeling task. Indeed, empirical evaluation demonstrates that our methodology is efficient and effective, as it gives better quality labels than majority voting and other state-of-the-art methods.
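The abstract does not give the exact weighting scheme; the sketch below shows the generic idea of expert-calibrated weighted voting, with the instance-difficulty weighting omitted and all names and data made up.

    from collections import defaultdict

    def labeler_weights(expert_labels, crowd_labels):
        # Weight each labeler by accuracy on the expert-labeled instances.
        # Generic calibration; the paper's scheme also weights instance
        # difficulty, which this sketch omits.
        weights = {}
        for labeler, labels in crowd_labels.items():
            scored = [(i, l) for i, l in labels.items() if i in expert_labels]
            correct = sum(1 for i, l in scored if l == expert_labels[i])
            weights[labeler] = correct / len(scored) if scored else 0.5
        return weights

    def weighted_vote(instance, crowd_labels, weights):
        # Aggregate crowd labels for one instance by weighted majority.
        tally = defaultdict(float)
        for labeler, labels in crowd_labels.items():
            if instance in labels:
                tally[labels[instance]] += weights[labeler]
        return max(tally, key=tally.get)

    # Tiny worked example with made-up labels.
    expert = {"x1": "cat", "x2": "dog"}
    crowd = {
        "w1": {"x1": "cat", "x2": "dog", "x3": "cat"},  # agrees with the expert
        "w2": {"x1": "dog", "x2": "cat", "x3": "dog"},  # disagrees with the expert
    }
    print(weighted_vote("x3", crowd, labeler_weights(expert, crowd)))  # cat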
Bootstrapping trust in online dating: Social verification of online dating profiles
- In Proceedings of the Financial Cryptography and Data Security Workshop on Usable Security (USEC’13), 2013
"... Abstract. Online dating is an increasingly thriving business which boasts billion-dollar revenues and attracts users in the tens of millions. Notwithstanding its pop-ularity, online dating is not impervious to worrisome trust and privacy concerns raised by the disclosure of potentially sensitive dat ..."
Abstract
-
Cited by 3 (1 self)
- Add to MetaCart
(Show Context)
Online dating is an increasingly thriving business which boasts billion-dollar revenues and attracts users in the tens of millions. Notwithstanding its popularity, online dating is not impervious to worrisome trust and privacy concerns raised by the disclosure of potentially sensitive data as well as the exposure to self-reported (and thus potentially misrepresented) information. Nonetheless, little research has, thus far, focused on how to enhance privacy and trustworthiness. In this paper, we report on a series of semi-structured interviews involving 20 participants, and show that users are significantly concerned with the veracity of online dating profiles. To address some of these concerns, we present the user-centered design of an interface, called Certifeye, which aims to bootstrap trust in online dating profiles using existing social network data. Certifeye verifies that the information users report on their online dating profile (e.g., age, relationship status, and/or photos) matches that displayed on their own Facebook profile. Finally, we present the results of a 161-user Mechanical Turk study assessing whether our veracity-enhancing interface successfully reduced concerns in online dating users, and find a statistically significant trust increase.
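The abstract describes the check only at the level of fields (age, relationship status, photos). A minimal sketch of that cross-profile comparison, with field names and data assumed:

    # Hypothetical sketch of the cross-profile check Certifeye describes:
    # compare what a user claims on a dating profile against the same
    # fields on their linked social-network profile. Field names assumed.
    CHECKED_FIELDS = ("age", "relationship_status")

    def verify_profile(dating_profile: dict, social_profile: dict) -> dict:
        # Per-field verdict: True where the claimed value matches.
        return {
            f: dating_profile.get(f) == social_profile.get(f)
            for f in CHECKED_FIELDS
        }

    print(verify_profile(
        {"age": 29, "relationship_status": "single"},
        {"age": 34, "relationship_status": "single"},
    ))  # {'age': False, 'relationship_status': True}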
CrowdPark: A Crowdsourcing-based Parking Reservation System for Mobile Phones
"... Parking in crowded urban areas is a precious resource that impacts driver stress levels, daily productivity, and the environment. A reservation system that enables individuals to buy parking spots prior to leaving their home would significantly ease these concerns. However, designing an infrastructu ..."
Abstract
-
Cited by 3 (0 self)
- Add to MetaCart
(Show Context)
Parking in crowded urban areas is a precious resource that impacts driver stress levels, daily productivity, and the environment. A reservation system that enables individuals to buy parking spots prior to leaving their home would significantly ease these concerns. However, designing an infrastructure for guaranteed parking requires extensive sensor deployment and manpower, which is an expensive and time-consuming proposition. In this paper, we present CrowdPark, a crowdsourcing platform that enables users to “loosely reserve” parking spots. Unlike traditional reservation platforms where sellers are usually the owners of resources, CrowdPark achieves parking reservation by crowdsourcing information about when parking resources will be available, and using this availability information to help other users find parking spots. The design of such a crowdsourcing-based parking reservation system presents several challenges including incentive design, robustness to malicious users, and handling the spatial and temporal uncertainty due to real-world vagaries. We present novel solutions to address these challenges that combine protocol design, game-theoretic and cost-benefit analysis, sensor data processing techniques, and navigation-based tools. With a combination of simulation and real-world experiments, we show that CrowdPark can 1) effectively incentivize user participation and detect malicious users with accuracy of over 95%, and 2) handle over 95% of spatial uncertainty and achieve over 90% successful parking reservation with a waiting time of a few minutes.
Crowdsourcing GUI Tests
"... Abstract—Graphical user interfaces are difficult to test: automated tests are hard to create and maintain, while manual tests are time-consuming, expensive and hard to integrate in a continuous testing process. In this paper, we show that it is possible to crowdsource GUI tests, that is, to outsourc ..."
Abstract
-
Cited by 3 (0 self)
- Add to MetaCart
(Show Context)
Graphical user interfaces are difficult to test: automated tests are hard to create and maintain, while manual tests are time-consuming, expensive and hard to integrate in a continuous testing process. In this paper, we show that it is possible to crowdsource GUI tests, that is, to outsource them to individuals drawn from a very large pool of workers on the Internet. This is made possible by instantiating virtual machines running the system under test and letting testers access the VMs through their web browsers, enabling semi-automated continuous testing of GUIs and usability experiments with large numbers of participants at low cost. Several large experiments on the Amazon Mechanical Turk demonstrate that our approach is technically feasible and sufficiently reliable.
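The setup the abstract describes, pre-booted VMs handed to testers through the browser, implies a small session dispatcher. A minimal sketch of that logic only; URLs, pool size, and function names are all invented:

    import queue

    # Hypothetical dispatcher matching crowd testers to a pool of
    # pre-booted VMs running the system under test (SUT).
    vm_pool: "queue.Queue[str]" = queue.Queue()
    for i in range(3):
        vm_pool.put(f"https://vmhost.example/session/{i}")  # made-up URLs

    def assign_session(worker_id: str) -> str:
        # Hand the next idle VM to a worker; blocks if the pool is empty.
        url = vm_pool.get()
        print(f"worker {worker_id} -> {url}")
        return url

    def release_session(url: str) -> None:
        # Return the VM to the pool after the test; restoring the VM
        # snapshot between testers is omitted here.
        vm_pool.put(url)

    assign_session("A1B2C3")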
A medical risk attitude subscale for DOSPERT
"... Background: The Domain-Specific Risk Taking scale (DOSPERT) is a widely used instrument that measures perceived risk and benefit and attitude toward risk for activities in several domains, but does not include medical risks. Objective: To develop a medical risk domain subscale for DOSPERT. Methods: ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
(Show Context)
Background: The Domain-Specific Risk Taking scale (DOSPERT) is a widely used instrument that measures perceived risk and benefit and attitude toward risk for activities in several domains, but it does not include medical risks.
Objective: To develop a medical risk domain subscale for DOSPERT.
Methods: Sixteen candidate risk items were developed through expert discussion. We conducted cognitive telephone interviews, an online survey, and a random-digit dialing (RDD) telephone survey to reduce and refine the scale, explore its factor structure, and obtain estimates of reliability.
Participants: Eight patients recruited from UIC medical center waiting rooms participated in 45-60 minute cognitive interviews. Thirty Amazon Mechanical Turk workers completed the online survey. One hundred Chicago-area residents completed the RDD telephone survey.
Results: On the basis of the cognitive interviews, we eliminated five items due to poor variance or participant misunderstanding. The online survey suggested that two additional items were negatively correlated with the scale, and we considered them candidates for removal. Factor analysis of the responses in the RDD telephone survey, together with nonstatistical considerations, led us to recommend a final set of 6 items to represent the medical risk domain. The final set of items included blood donation, kidney donation, daily medication use for allergies, knee replacement surgery, general anesthesia in dentistry, and clinical trial participation. The interitem reliability (Cronbach’s α) of the final set of 6 items ranged from 0.57 to 0.59 depending on the response task. Older respondents gave lower overall ratings of expected benefit from the activities.
Conclusion: We refined a set of items to measure risk and benefit perceptions for medical activities. Our next step will be to add these items to the complete DOSPERT scale, confirm the scale’s psychometric properties, determine whether medical risks constitute a psychologically distinct domain from other risky activities, and characterize individual differences in medical risk attitudes.
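The reliability statistic reported is Cronbach's α: alpha = k/(k-1) * (1 - sum of item variances / variance of the total score). For reference, a direct implementation of that standard formula, run on synthetic scores:

    import numpy as np

    def cronbach_alpha(scores: np.ndarray) -> float:
        # Cronbach's alpha for a respondents x items score matrix:
        # alpha = k/(k-1) * (1 - sum(item variances) / var(total score)).
        k = scores.shape[1]
        item_vars = scores.var(axis=0, ddof=1).sum()
        total_var = scores.sum(axis=1).var(ddof=1)
        return k / (k - 1) * (1 - item_vars / total_var)

    # Synthetic ratings (5 respondents x 6 items), purely illustrative.
    ratings = np.array([
        [3, 4, 2, 5, 3, 4],
        [2, 3, 2, 4, 2, 3],
        [4, 5, 3, 5, 4, 4],
        [1, 2, 1, 3, 2, 2],
        [3, 3, 2, 4, 3, 3],
    ])
    print(round(cronbach_alpha(ratings), 2))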