Results 1 - 10 of 10
Chimera: Large-Scale Classification using Machine Learning, Rules, and Crowdsourcing
"... Large-scale classification is an increasingly critical Big Data problem. So far, however, very little has been published on how this is done in practice. In this paper we describe Chimera, our solution to classify tens of millions of prod-ucts into 5000+ product types at WalmartLabs. We show that at ..."
Abstract
-
Cited by 4 (1 self)
- Add to MetaCart
(Show Context)
Abstract: Large-scale classification is an increasingly critical Big Data problem. So far, however, very little has been published on how this is done in practice. In this paper we describe Chimera, our solution to classify tens of millions of products into 5000+ product types at WalmartLabs. We show that at this scale, many conventional assumptions regarding learning and crowdsourcing break down, and that existing solutions cease to work. We describe how Chimera employs a combination of learning, rules (created by in-house analysts), and crowdsourcing to achieve accurate, continuously improving, and cost-effective classification. We discuss a set of lessons learned for other similar Big Data systems. In particular, we argue that at large scales crowdsourcing is critical, but must be used in combination with learning, rules, and in-house analysts. We also argue that using rules (in conjunction with learning) is a must, and that more research attention should be paid to helping analysts create and manage (tens of thousands of) rules more effectively.
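As a concrete illustration of the abstract's central argument, the sketch below shows one plausible reading of the rule-plus-learning pattern: high-precision analyst rules fire first, and a learned classifier covers everything the rules miss. The rule patterns, class names, and stub model are illustrative assumptions, not Chimera's actual design.

```python
import re

# Hypothetical analyst rules: (compiled regex over the product title, product type).
RULES = [
    (re.compile(r"\b(laptop|notebook)s?\b", re.I), "Computers/Laptops"),
    (re.compile(r"\bdiapers?\b", re.I), "Baby/Diapering"),
]

class StubModel:
    """Stand-in for a trained classifier (hypothetical)."""
    def predict(self, title):
        return "Unknown"

def classify(title, model):
    """Return (product_type, source); rules take precedence over the model."""
    for pattern, product_type in RULES:
        if pattern.search(title):
            return product_type, "rule"
    return model.predict(title), "model"

print(classify("Dell Inspiron 15 Laptop", StubModel()))  # ('Computers/Laptops', 'rule')
```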
Data Quality: From Theory to Practice
"... Data quantity and data quality, like two sides of a coin, are equally important to data management. This paper provides an overview of recent advances in the study of data quality, from theory to practice. We also address challenges intro-duced by big data to data quality management. 1. ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
(Show Context)
Abstract: Data quantity and data quality, like two sides of a coin, are equally important to data management. This paper provides an overview of recent advances in the study of data quality, from theory to practice. We also address challenges introduced by big data to data quality management.
Wisteria: Nurturing scalable data cleaning infrastructure
Proceedings of the VLDB Endowment
"... ABSTRACT Analysts report spending upwards of 80% of their time on problems in data cleaning. The data cleaning process is inherently iterative, with evolving cleaning workflows that start with basic exploratory data analysis on small samples of dirty data, then refine analysis with more sophisticat ..."
Abstract
-
Cited by 1 (1 self)
- Add to MetaCart
(Show Context)
Abstract: Analysts report spending upwards of 80% of their time on problems in data cleaning. The data cleaning process is inherently iterative, with evolving cleaning workflows that start with basic exploratory data analysis on small samples of dirty data, then refine analysis with more sophisticated/expensive cleaning operators (e.g., crowdsourcing), and finally apply the insights to a full dataset. While an analyst often knows at a logical level what operations need to be done, they often have to manage a large search space of physical operators and parameters. We present Wisteria, a system designed to support the iterative development and optimization of data cleaning workflows, especially ones that utilize the crowd. Wisteria separates logical operations from physical implementations, and driven by analyst feedback, suggests optimizations and/or replacements to the analyst's choice of physical implementation. We highlight research challenges in sampling, in-flight operator replacement, and crowdsourcing. We overview the system architecture and these techniques, then provide a demonstration designed to showcase how Wisteria can improve iterative data analysis and cleaning. The code is available at: http://www.sampleclean.org.
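The separation of logical operations from physical implementations that the abstract describes can be pictured with a small sketch: one logical deduplication operator, two interchangeable physical implementations of very different cost. The class and method names below are assumptions for illustration, not Wisteria's actual API.

```python
from abc import ABC, abstractmethod

class Deduplicate(ABC):
    """Logical operator: remove duplicate records."""
    @abstractmethod
    def run(self, records): ...

class ExactMatchDedup(Deduplicate):
    """Cheap physical implementation: drop byte-identical records."""
    def run(self, records):
        seen, out = set(), []
        for r in records:
            if r not in seen:
                seen.add(r)
                out.append(r)
        return out

class CrowdDedup(Deduplicate):
    """Expensive physical implementation: ask the crowd about near-duplicates."""
    def __init__(self, ask_crowd):
        self.ask_crowd = ask_crowd  # callable(record_a, record_b) -> bool
    def run(self, records):
        out = []
        for r in records:
            if not any(self.ask_crowd(r, kept) for kept in out):
                out.append(r)
        return out

# An optimizer can swap implementations without changing the logical workflow:
# plan = [ExactMatchDedup()]  ->  plan = [CrowdDedup(ask_crowd=my_crowd_ui)]
```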
Refining Automatically Extracted Knowledge Bases Using Crowdsourcing
"... Machine-constructed knowledge bases often contain noisy and inaccurate facts. There exists significant work in developing automated algorithms for knowledge base refinement. Automated approaches improve the quality of knowledge bases but are far from perfect. In this paper, we leverage crowdsourcin ..."
Abstract
- Add to MetaCart
(Show Context)
Abstract: Machine-constructed knowledge bases often contain noisy and inaccurate facts. There exists significant work in developing automated algorithms for knowledge base refinement. Automated approaches improve the quality of knowledge bases but are far from perfect. In this paper, we leverage crowdsourcing to improve the quality of automatically extracted knowledge bases. As human labelling is costly, an important research challenge is how we can use limited human resources to maximize the quality improvement for a knowledge base. To address this problem, we first introduce a concept of semantic constraints that can be used to detect potential errors and do inference among candidate facts. Then, based on semantic constraints, we propose rank-based and graph-based algorithms for crowdsourced knowledge refining, which judiciously select the most beneficial candidate facts to conduct crowdsourcing and prune unnecessary questions. Our experiments show that our method improves the quality of knowledge bases significantly and outperforms state-of-the-art automatic methods under a reasonable crowdsourcing cost.
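A minimal sketch of how a semantic constraint can detect potential errors among candidate facts, assuming a hypothetical functional constraint (certain relations admit at most one true object per subject). The relation names and data are made up, and the paper's rank-based and graph-based question selection is not shown.

```python
from collections import defaultdict

# Relations assumed functional: each subject has at most one true object.
FUNCTIONAL = {"born_in"}

def conflicting_facts(candidates):
    """candidates: iterable of (subject, relation, object, confidence).
    Returns groups of candidate facts that violate a functional constraint,
    i.e., the facts worth sending to the crowd."""
    groups = defaultdict(list)
    for s, r, o, conf in candidates:
        if r in FUNCTIONAL:
            groups[(s, r)].append((o, conf))
    return {k: v for k, v in groups.items() if len(v) > 1}

facts = [("Turing", "born_in", "London", 0.9),
         ("Turing", "born_in", "Paris", 0.4)]
print(conflicting_facts(facts))  # {('Turing', 'born_in'): [('London', 0.9), ('Paris', 0.4)]}
```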
Crowdsourcing Entity Resolution: a Short Overview and Open Issues
"... ABSTRACT Entity resolution (ER) is a process to identify records that stand for the same real-world entity. Although automatic algorithms aiming at solving this problem have been developed for many years, their accuracy remains far from perfect. Crowdsourcing is a technology currently investigated, ..."
Abstract
- Add to MetaCart
(Show Context)
Abstract: Entity resolution (ER) is the process of identifying records that refer to the same real-world entity. Although automatic algorithms for this problem have been developed for many years, their accuracy remains far from perfect. Crowdsourcing is a technology currently under investigation that leverages the crowd, soliciting contributions to complete tasks via crowdsourced marketplaces. One of its advantages is that it injects human reasoning into problems that are still hard for computers to process, which makes it suitable for ER and offers an opportunity to achieve higher accuracy. As crowdsourced ER is still a relatively new area in data processing, this paper provides an overview and a brief classification of the current state of research in crowdsourced ER. In addition, we identify some open issues that will serve as a starting point for our future research.
Interactive Data Integration and Entity Resolution for Exploratory Visual Data Analytics
2015
"... Data has become more widely available to the public for consumption, for example, through the Web and the recent “Open Data ” movement. An emerging cohort of users, called Data Enthusiasts, want to analyze this data, but have limited technical or data science expertise. In response to these trends, ..."
Abstract
- Add to MetaCart
(Show Context)
Abstract: Data has become more widely available to the public for consumption, for example, through the Web and the recent “Open Data” movement. An emerging cohort of users, called Data Enthusiasts, want to analyze this data, but have limited technical or data science expertise. In response to these trends, online visual analytics systems have emerged as a popular tool for data analysis and sharing. Current visual analytics systems such as Tableau and Many Eyes enable this user cohort to perform sophisticated data analysis visually, at interactive speeds, and without any programming. Together, these two systems have been used by tens of thousands of authors to create hundreds of thousands of views, yet we know very little about how these systems are being used. The first challenge we address in this thesis, thus, is: how are popular visual analytics systems such as Tableau and Many Eyes being used for data analysis? To the best of our knowledge, this is the first study of its kind, and it presents important details about the use of online visual analytics systems. Visual analytics systems provide basic support for data integration. A simple approach for interactive data integration in Tableau was implemented in that tool in the context of …
Managing General and Individual Knowledge in Crowd Mining Applications
"... ABSTRACT Crowd mining frameworks combine general knowledge, which can refer to an ontology or information in a database, with individual knowledge obtained from the crowd, which captures habits and preferences. To account for such mixed knowledge, along with user interaction and optimization issues ..."
Abstract
- Add to MetaCart
(Show Context)
Abstract: Crowd mining frameworks combine general knowledge, which can refer to an ontology or information in a database, with individual knowledge obtained from the crowd, which captures habits and preferences. To account for such mixed knowledge, along with user interaction and optimization issues, such frameworks must employ a complex process of reasoning, automatic crowd task generation and result analysis. In this paper, we describe a generic architecture for crowd mining applications. This architecture allows us to examine and compare the components of existing crowdsourcing systems and point out extensions required by crowd mining. It also highlights new research challenges and potential reuse of existing techniques/components. We exemplify this for the OASSIS project and for other prominent crowdsourcing frameworks.
SampleClean: Fast and Reliable Analytics on Dirty Data
"... An important obstacle to accurate data analytics is dirty data in the form of missing, duplicate, incorrect, or inconsistent values. In the SampleClean project, we have developed a new suite of techniques to esti-mate the results of queries when only a sample of data can be cleaned. Some forms of da ..."
Abstract
- Add to MetaCart
(Show Context)
Abstract: An important obstacle to accurate data analytics is dirty data in the form of missing, duplicate, incorrect, or inconsistent values. In the SampleClean project, we have developed a new suite of techniques to estimate the results of queries when only a sample of data can be cleaned. Some forms of data corruption, such as duplication, can affect sampling probabilities, and thus, new techniques have to be designed to ensure correctness of the approximate query results. We first describe our initial project on computing statistically bounded estimates of sum, count, and avg queries from samples of cleaned data. We subsequently explored how the same techniques could apply to other problems in database research, namely, materialized view maintenance. To avoid expensive incremental maintenance, we maintain only a sample of rows in a view, and then leverage SampleClean to approximate aggregate query results. Finally, we describe our work on a gradient-descent algorithm that extends the key ideas to the increasingly common Machine Learning-based analytics.
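The core estimation idea, a statistically bounded avg computed from a cleaned sample, can be sketched as follows. This is a simplified illustration using the normal approximation; it omits the reweighting the project describes for corruption (such as duplication) that skews sampling probabilities, and the function names are made up.

```python
import math
import random

def estimate_avg(dirty, clean_fn, sample_size, z=1.96):
    """Clean only a random sample, then return (estimated mean,
    95% confidence-interval half-width) for the cleaned population."""
    sample = random.sample(dirty, sample_size)
    cleaned = [clean_fn(v) for v in sample]   # the costly step, sample only
    n = len(cleaned)
    mean = sum(cleaned) / n
    var = sum((v - mean) ** 2 for v in cleaned) / (n - 1)
    return mean, z * math.sqrt(var / n)

# Example: values mistakenly recorded in cents instead of dollars.
dirty = [100 * x for x in range(1, 1001)]
est, ci = estimate_avg(dirty, lambda v: v / 100, sample_size=100)
print(f"avg ≈ {est:.1f} ± {ci:.1f}")  # close to the true mean of 500.5
```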
Human-Powered Blocking in Entity Resolution: A Feasibility Study
"... Entity Resolution (ER) is the problem of matching the records that refer to the same entity within or across two or more data sources. In recent years, human-powered ER solutions have been proposed so that challenging ER tasks, that machines cannot do well, can be helped by human workers. While succ ..."
Abstract
- Add to MetaCart
(Show Context)
Abstract: Entity Resolution (ER) is the problem of matching records that refer to the same entity within or across two or more data sources. In recent years, human-powered ER solutions have been proposed so that challenging ER tasks that machines cannot handle well can be assisted by human workers. While successful in achieving high matching accuracy, existing human-powered ER methods have not incorporated a core technique for improving the scalability of the ER process, namely blocking. To address this issue, this paper carries out a feasibility study to validate whether the blocking technique can be integrated into human-powered ER. Specifically, we first propose two variations of human-powered blocking methods. We then validate their effectiveness in improving the scalability of the ER process through simulated crowdsourcing and AMT-based experiments on synthetic and real-life datasets, respectively.
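For readers unfamiliar with blocking, the sketch below shows the technique in its simplest machine-only form: records are grouped by a blocking key so that pairwise matching runs only within blocks. In the paper's human-powered variants the key assignment would involve crowd workers; the first-letter key here is just a stand-in.

```python
from collections import defaultdict
from itertools import combinations

def block(records, key_fn):
    """Group records into blocks by a blocking key."""
    blocks = defaultdict(list)
    for r in records:
        blocks[key_fn(r)].append(r)
    return blocks

def candidate_pairs(blocks):
    """Only pairs within the same block are compared."""
    for group in blocks.values():
        yield from combinations(group, 2)

names = ["Alice Smith", "alice smith", "Bob Jones", "B. Jones"]
pairs = list(candidate_pairs(block(names, lambda s: s[0].lower())))
print(len(pairs))  # 2 within-block pairs, versus 6 pairs without blocking
```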
Minimizing Efforts in Validating Crowd Answers
"... In recent years, crowdsourcing has become essential in a wide range of Web applications. One of the biggest challenges of crowdsourc-ing is the quality of crowd answers as workers have wide-ranging levels of expertise and the worker community may contain faulty workers. Although various techniques f ..."
Abstract
- Add to MetaCart
(Show Context)
Abstract: In recent years, crowdsourcing has become essential in a wide range of Web applications. One of the biggest challenges of crowdsourcing is the quality of crowd answers, as workers have wide-ranging levels of expertise and the worker community may contain faulty workers. Although various techniques for quality control have been proposed, a post-processing phase in which crowd answers are validated is still required. Validation is typically conducted by experts, whose availability is limited and who incur high costs. Therefore, we develop a probabilistic model that helps to identify the most beneficial validation questions in terms of both improvement of result correctness and detection of faulty workers. Our approach allows us to guide the expert’s work by collecting input on the most problematic cases, thereby achieving a set of high-quality answers even if the expert does not validate the complete answer set. Our comprehensive evaluation using both real-world and synthetic datasets demonstrates that our techniques save up to 50% of expert effort compared to baseline methods when striving for perfect result correctness. In absolute terms, for most cases, we achieve close to perfect correctness after expert input has been sought for only 20% of the questions.
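One plausible reading of "identifying the most beneficial validation questions" is an uncertainty-driven ordering, sketched below: questions whose aggregated answers the model is least sure about go to the expert first. The entropy criterion and all names are illustrative assumptions; the paper's probabilistic model, which also detects faulty workers, is not shown.

```python
import math

def entropy(p):
    """Binary entropy of P(aggregated answer is correct) = p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def pick_for_validation(posteriors, budget):
    """posteriors: {question_id: P(aggregated answer is correct)}.
    Returns the `budget` most uncertain questions for expert review."""
    ranked = sorted(posteriors, key=lambda q: entropy(posteriors[q]),
                    reverse=True)
    return ranked[:budget]

posteriors = {"q1": 0.99, "q2": 0.55, "q3": 0.80}
print(pick_for_validation(posteriors, budget=1))  # ['q2']
```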