## A Multilevel Bayesian Model of Categorical Data Annotation

### BibTeX

@MISC{_amultilevel,

author = {},

title = {A Multilevel Bayesian Model of Categorical Data Annotation},

year = {}

}

### Citations

1708 |
Nonparametric Statistics for the Behavioral Sciences
- Siegel, Castellan
- 1988
Citation Context ...j and j′ from the formula: $A_{j,j'} = \pi\,(\theta_{1,j}\theta_{1,j'} + (1-\theta_{1,j})(1-\theta_{1,j'})) + (1-\pi)\,(\theta_{0,j}\theta_{0,j'} + (1-\theta_{0,j})(1-\theta_{0,j'}))$. The chance agreement rate can be computed given the overall true prevalence (Siegel and Castellan, 1988): $E = \pi\pi + (1-\pi)(1-\pi)$, or with a slightly more complex formula relative to annotator-specific distributions. With a sequence of posterior samples $\pi^{(n)}, \theta_0^{(n)}, \theta_1^{(n)}$ we can compute a posterior... |
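The agreement formulas in this excerpt can be sketched in a few lines of Python. This is only an illustration, not the paper's code: it assumes θ1 is an annotator's sensitivity and θ0 the specificity, it uses the definition κ = (A − E)/(1 − E) quoted in another context on this page, and the function names and the sample layout `(pi, theta0, theta1)` are hypothetical.

```python
def pairwise_agreement(pi, theta0_j, theta0_k, theta1_j, theta1_k):
    """Expected agreement A_{j,j'} between two annotators.

    Assumption: theta1 is sensitivity, theta0 is specificity, pi is prevalence.
    """
    # On a positive item the pair agrees if both are right or both are wrong.
    agree_pos = theta1_j * theta1_k + (1 - theta1_j) * (1 - theta1_k)
    # Same reasoning on a negative item, using the specificities.
    agree_neg = theta0_j * theta0_k + (1 - theta0_j) * (1 - theta0_k)
    return pi * agree_pos + (1 - pi) * agree_neg


def chance_agreement(pi):
    """Chance agreement E computed from the overall prevalence alone."""
    return pi * pi + (1 - pi) * (1 - pi)


def posterior_kappa(samples, j, k):
    """Return one kappa value per posterior draw for annotators j and k.

    `samples` is a hypothetical list of (pi, theta0, theta1) draws, where
    theta0 and theta1 are per-annotator lists. E equals 1 when pi is exactly
    0 or 1, so such draws would need separate handling.
    """
    draws = []
    for pi, theta0, theta1 in samples:
        a = pairwise_agreement(pi, theta0[j], theta0[k], theta1[j], theta1[k])
        e = chance_agreement(pi)
        draws.append((a - e) / (1 - e))
    return draws


# Illustrative draws only; these numbers are made up, not results.
example_samples = [
    (0.30, [0.90, 0.85], [0.80, 0.75]),
    (0.35, [0.88, 0.87], [0.82, 0.70]),
]
kappa_draws = posterior_kappa(example_samples, 0, 1)
```

Summarizing `kappa_draws` (for example by its mean or by quantiles) would give a posterior estimate and interval for κ rather than a single point value.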

1472 |
A coefficient of agreement for nominal scales
- Cohen
- 1960
Citation Context ...least 3% in the gold standard. For the effects of this on evaluating classifiers and hence annotators, see (Lam and Stork, 2003). 6 Bayesian κ The family of κ “chance corrected agreement” statistics (Cohen, 1960) is widely used, despite a number of well-studied problems due to bias, prevalence, the lack of probabilistic interpretation, and difficulties in extending to multiple annotators or varying panel des... |

816 |
Inference from iterative simulation using multiple sequences. Statistical Science 7:457–511
- Gelman, Rubin
- 1992
Citation Context ...d real data, we ran the Gibbs samplers multiple times for 1000 iterations from dispersed starting points, discarded the first half of the chains, and computed potential scale reduction ($\hat{R}$) values (Gelman and Rubin, 1992) very close to 1 for all parameters in the remaining 500 samples. Figure 2 shows how well the posterior estimates for the θ values match the simulated true values. As expected, the posterior interval... |
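The convergence check described in this excerpt can be sketched as follows. This is a minimal illustration of the basic potential scale reduction statistic for one scalar parameter, not the code used in the paper; the helper names are hypothetical, and the warm-up rule (drop the first half of each chain) simply mirrors the excerpt.

```python
import math
from statistics import mean, variance


def discard_warmup(chain):
    """Drop the first half of a chain, as described in the excerpt."""
    return chain[len(chain) // 2:]


def potential_scale_reduction(chains):
    """Basic R-hat for one scalar parameter.

    `chains` is a list of equal-length lists of draws (at least two chains,
    each with at least two draws and nonzero within-chain variance).
    """
    n = len(chains[0])
    chain_means = [mean(c) for c in chains]
    # W: average of the within-chain sample variances.
    within = mean([variance(c) for c in chains])
    # B: n times the sample variance of the chain means.
    between = n * variance(chain_means)
    # Pooled estimate of the marginal posterior variance.
    var_plus = (n - 1) / n * within + between / n
    return math.sqrt(var_plus / within)


# Hypothetical usage: `raw_chains` stands in for sampler output.
raw_chains = [
    [float(i % 2) for i in range(1000)],
    [float((i + 1) % 2) for i in range(1000)],
]
r_hat = potential_scale_reduction([discard_warmup(c) for c in raw_chains])
```

Values close to 1 suggest the chains are sampling the same distribution; chains stuck in different regions inflate the between-chain term and push the statistic well above 1.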

436 |
Probabilistic models for some intelligence and attainment tests
- Rasch
- 1960
Citation Context ...ching from a collection of clinician’s estimates. With only two annotators, the priors exert a strong influence. (Uebersax and Grove, 1993) introduce a variant of the item-response model (Lord, 1980; Rasch, 1980) in which categories are unknown (latent) to the problem of agreement analysis in an ordinal response setting. They model items with a single latent trait generated by binary normal mixtures of 0 and... |

329 |
Applications of item response theory to practical testing problems
- Lord
- 1980
Citation Context ...g moment matching from a collection of clinician’s estimates. With only two annotators, the priors exert a strong influence. (Uebersax and Grove, 1993) introduce a variant of the item-response model (Lord, 1980; Rasch, 1980) in which categories are unknown (latent) to the problem of agreement analysis in an ordinal response setting. They model items with a single latent trait generated by binary normal mixt... |

288 | The PASCAL Recognising Textual Entailment Challenge
- Dagan, Glickman, et al.
- 2005
Citation Context ...nd thus remain highly uncertain. 4 RTE-1 Data (Snow et al., 2008) used the Amazon Mechanical Turk to re-annotate the 800 item test set from the First Recognizing Textual Entailment Challenge (RTE-1) (Dagan et al., 2006). Items consist of a text (e.g. “The city Tenochtitlan grew rapidly and was the center of the Aztec’s great empire.”) and a hypothesis (e.g. “Tenochtitlan quickly spread over the island, marshes, an... |

167 | Cheap and Fast - But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks - Snow, O'Connor, et al. - 2008 |

91 | Maximum likelihood estimation of observer Error-Rates using the EM algorithm
- Dawid, Skene
- 1979
Citation Context ...tinomial setting. The Bernoulli distributions for prevalence and category are replaced with corresponding discrete (multinomial) distributions, and the beta priors are replaced with Dirichlet priors (Dawid and Skene, 1979). To properly estimate the Dirichlet and multinomial parameters will require more data, but otherwise nothing changes. Ordinal outcomes, as involved in sentiment and other rating tasks, may be peform... |

67 | Inferring ground truth from subjective labeling of Venus images
- Smyth, Fayyad, et al.
- 1994
Citation Context ...fiability of this model. (Mendoza-Blanco et al., 1996) apply Bayesian inference to the uncertainty in sensitivity and specificity estimates in estimating prevalence in a model like Dawid and Skene’s. (Smyth et al., 1995) apply Dawid and Skene’s model to estimate gold standards from annotations by four geologists of radar images collected by the Magellan spacecraft of Venus for volcanos. (Smyth, 1995) shows that prob... |

49 | Recognizing subjectivity: A case study in manual tagging
- Bruce, Wiebe
- 1999
Citation Context ...ards from annotations by four geologists of radar images collected by the Magellan spacecraft of Venus for volcanos. (Smyth, 1995) shows that probabilistic supervision works well for simulated data. (Bruce and Wiebe, 1999) perform an EM estimate over a model very similar to Dawid and Skene’s to estimate annotator accuracies and prevalence from which they could assign a single true category for each item in a corpus. (... |

45 | R2WinBUGS: a package for running WinBUGS from R
- Sturtz, Ligges, et al.
- 2005
Citation Context ...formed with R 2.7.1 (R Core Development Team, 2008). Gibbs sampling was performed with WinBUGS 1.4.3 (Spiegelhalter et al., 2008). Communication between BUGS and R and evaluation used R2WinBUGS 2.1.8 (Sturtz et al., 2005). Amazon Mechanical Turk services were carried out using their command line tools over the 2008-08-02 API release. [Figure: Estimated vs. Simulated theta.0; axis label: mean estimate theta.0]... |

34 |
Design of the MUC-6 evaluation
- Grishman, Sundheim
- 1995
Citation Context ...ely, inference is not a relation between texts, but rather requires context and disambiguation. 5 MUC-6 Data MUC-6 is a corpus that includes named entity mention annotations in sentences of newswire (Grishman and Sundheim, 1995). We had Mechanical Turk annotators re-annotate the person-name section of the data, which constituted 190,124 tokens, 4127 of which were part of person names according to the gold standard. To reduc... |

34 |
Random effects models in latent class analysis for evaluating accuracy of diagnostic tests
- Qu, Tan, et al.
- 1996
Citation Context ...the κ statistic between annotators j and j′. 7 Why No Item-Level Predictors? Although it is possible to extend these models to general item- and annotator-level predictors (Uebersax and Grove, 1993; Qu et al., 1996; Albert et al., 2001) using a logistic or probit generalized linear model, our simulation and real data experiments have shown that even with ten annotators per item, item-level difficulty parameters... |

32 |
Bayesian estimation of disease prevalence and the parameters of diagnostic tests in the absence of a gold standard
- Joseph, Gyorkos, et al.
- 1995
Citation Context ...tive in the sense that all tests are expected to return the same result. (Klebanov et al., 2008) apply a similar mixture of easy and regular cases to filter unreliable examples from a gold standard. (Joseph et al., 1995) introduce a binomial model for sensitivity and specificity with independent beta priors. They estimate priors using moment matching from a collection of clinician’s estimates. With only two annotato... |

31 |
Estimating the error rates of diagnostic tests
- Hui, Walter
- 1980
Citation Context ...79) introduced a multinomial model where annotators’ responses vary by category, thus generalizing the notion of sensitivity and specificity. They used EM to find maximum likelihood point estimates. (Hui and Walter, 1980) discuss identifiability of this model. (Mendoza-Blanco et al., 1996) apply Bayesian inference to the uncertainty in sensitivity and specificity estimates in estimating prevalence in a model like Da... |

21 |
Reliability measurement without limits
- Reidsma, Carletta
- 2008
Citation Context ...idely used, despite a number of well-studied problems due to bias, prevalence, the lack of probabilistic interpretation, and difficulties in extending to multiple annotators or varying panel designs (Reidsma and Carletta, 2008; Artstein and Poesio, In press). The definition $\kappa = (A - E)/(1 - E)$ takes $A$ to be the annotator agreement rate and $E$ the chance agreement rate. Given annotator sensitivity, specificity and prevalence... |

18 | Bayesian approaches to modeling the conditional dependence between multiple diagnostic tests
- Dendukuri, Joseph
- 2001
Citation Context ...elike model, but separately model annotator sensitivity and specificity. They hand-tune item-difficulty interaction parameters given known correlations among clinical diagnostics (e.g. blood tests). (Dendukuri and Joseph, 2001) apply Bayesian inference to Qu et al.’s model, again fixing priors by interviewing human experts. (Basu et al., 2000) develop a Bayesian estimate of the kappa statistic (chance adjusted inter-rater ... |

16 | Evaluating classifiers by means of test data with noisy labels
- Lam, Stork
- 1995
Citation Context ...gold standard are actually errors in the gold standard itself, indicating an impurity of at least 3% in the gold standard. For the effects of this on evaluating classifiers and hence annotators, see (Lam and Stork, 2003). 6 Bayesian κ The family of κ “chance corrected agreement” statistics (Cohen, 1960) is widely used, despite a number of well-studied problems due to bias, prevalence, the lack of probabilistic inter... |

12 |
Using latent class models to characterize and assess relative-error in discrete measurements
- Espeland, Handelman
- 1989
Citation Context ...difficulty parameters allow a second explanation of errors, greatly widening the posteriors on annotator accuracy. Another approach that has been tried is a mixture model of regular and “easy” items (Espeland and Handelman, 1989; Albert et al., 2001; Klebanov et al., 2008). Such a model resembles a zero-inflated Poisson (Gelman et al., 2003) and improves the fit of posterior marginal all-1 and all-0 item annotations. It is o... |

12 | R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing - R Development Core Team - 2008 |

9 | Analyzing disagreements
- Klebanov, Beata, et al.
- 2008
Citation Context ... errors, greatly widening the posteriors on annotator accuracy. Another approach that has been tried is a mixture model of regular and “easy” items (Espeland and Handelman, 1989; Albert et al., 2001; Klebanov et al., 2008). Such a model resembles a zero-inflated Poisson (Gelman et al., 2003) and improves the fit of posterior marginal all-1 and all-0 item annotations. It is otherwise rather unrealistic in the setting w... |

5 | Get Another Label? Improving Data Quality and Data Mining Using Multiple, Noisy Labelers - Sheng, Provost, et al. - 2008 |

4 | Bayesian inference on prevalence using a missing-data approach with simulation-based techniques: applications to HIV screening - Mendoza-Blanco, Tu, et al. - 1996 |

3 | Latent class modeling approaches for assessing diagnostic error without a gold standard: with applications to p53 immunohistochemical assays in bladder tumors - Albert, McShane, et al. (The US National Cancer Institute Bladder Tumor Marker Network) - 2001 |

1 |
Bayesian inference for kappa from single and multiple studies. Biometrics
- Basu, Banerjee, et al.
- 2000
Citation Context ...given known correlations among clinical diagnostics (e.g. blood tests). (Dendukuri and Joseph, 2001) apply Bayesian inference to Qu et al.’s model, again fixing priors by interviewing human experts. (Basu et al., 2000) develop a Bayesian estimate of the kappa statistic (chance adjusted inter-rater agreement) by supplying a beta prior for binomial responses and then reasoning about values of kappa derived in the po... |

1 |
Computational Learning Theory and Natural Learning Systems Volume III, chapter Learning with probabilistic supervision
- Smyth
- 1995
Citation Context ...Skene’s. (Smyth et al., 1995) apply Dawid and Skene’s model to estimate gold standards from annotations by four geologists of radar images collected by the Magellan spacecraft of Venus for volcanos. (Smyth, 1995) shows that probabilistic supervision works well for simulated data. (Bruce and Wiebe, 1999) perform an EM estimate over a model very similar to Dawid and Skene’s to estimate annotator accuracies and... |

1 |
WinBUGS version 1.4.3 user manual
- Spiegelhalter, Thomas, et al.
- 2008
Citation Context ...drawn from their respective betas. 1 Simulations, data manipulations and graphical display were performed with R 2.7.1 (R Core Development Team, 2008). Gibbs sampling was performed with WinBUGS 1.4.3 (Spiegelhalter et al., 2008). Communication between BUGS and R and evaluation used R2WinBUGS 2.1.8 (Sturtz et al., 2005). Amazon Mechanical Turk services were carried out using their command line tools over the 2008-08-02 API re... |