
Overview of the 2nd International Competition on Wikipedia Vandalism Detection (2011)

by M. Potthast, T. Holfeld
Venue: CLEF

Results 1 - 6 of 6

User Edits Classification Using Document Revision Histories

by Amit Bronner, Christof Monz
Abstract - Cited by 4 (0 self)
Document revision histories are a useful and abundant source of data for natural language processing, but selecting relevant data for the task at hand is not trivial. In this paper we introduce a scalable approach for automatically distinguishing between factual and fluency edits in document revision histories. The approach is based on supervised machine learning using language model probabilities, string similarity measured over different representations of user edits, comparison of part-of-speech tags and named entities, and a set of adaptive features extracted from large amounts of unlabeled user edits. Applied to contiguous edit segments, our method achieves statistically significant improvements over a simple yet effective edit-distance baseline. It reaches high classification accuracy (88%) and is shown to generalize to additional sets of unseen data.
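As a rough illustration (not the authors' implementation), the edit-distance baseline the abstract contrasts against can be sketched as surface-similarity features over the two versions of an edited segment; the function names and the 0.8 threshold below are assumptions:

```python
import difflib

def edit_features(old_sentence: str, new_sentence: str) -> dict:
    """Character- and word-level similarity features for one edit segment."""
    char_ratio = difflib.SequenceMatcher(None, old_sentence, new_sentence).ratio()
    old_tokens, new_tokens = old_sentence.split(), new_sentence.split()
    word_ratio = difflib.SequenceMatcher(None, old_tokens, new_tokens).ratio()
    return {
        "char_similarity": char_ratio,
        "word_similarity": word_ratio,
        "length_delta": abs(len(new_tokens) - len(old_tokens)),
    }

def classify_edit(old: str, new: str, threshold: float = 0.8) -> str:
    # Baseline heuristic: near-identical surface strings suggest a fluency
    # edit (rewording); low similarity suggests new factual content.
    f = edit_features(old, new)
    return "fluency" if f["char_similarity"] >= threshold else "factual"
```

Note that a small numeric change (e.g., correcting a population figure) still yields high string similarity, so this baseline would mislabel it as a fluency edit; that weakness is presumably why the paper adds part-of-speech, named-entity, and language-model features on top.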

Citation Context

...annotated examples. Wikipedia vandalism detection is a user edits classification problem addressed by a yearly competition (since 2010) in conjunction with the CLEF conference (Potthast et al., 2010; Potthast and Holfeld, 2011). State-of-the-art solutions involve supervised machine learning using various content and metadata features. Content features use spelling, grammar, character- and word-level attributes. Many of the...

Automatically Classifying Edit Categories in Wikipedia Revisions

by Johannes Daxenberger, Iryna Gurevych
Abstract - Cited by 1 (0 self)
In this paper, we analyze a novel set of fea-

Language Resources & Evaluation, DOI 10.1007/s10579-013-9232-5 (Original Paper)

by Enrique Alfonseca, Guillermo Garrido, Jean-Yves Delort, Anselmo Peñas
Abstract
Historical structured data extraction and vandalism detection from the Wikipedia edit history

Citation Context

...analysis of these bots. Much research has been encouraged by the release of the manually annotated PAN-WVC-10 English Wikipedia vandalism corpus (Potthast 2010), extended later to German and Spanish (Potthast and Holfeld 2011), and the first two editions of a vandalism detection competition, PAN 2010 (Potthast et al. 2010) and PAN 2011 (Potthast and Holfeld 2011). The proposed systems can be grouped by the kind of feat...

“People’s Web Meets NLP: Collaboratively Constructed Language Resources” (Springer)

by Oliver Ferschke, Johannes Daxenberger, Iryna Gurevych
Abstract
With the rise of the Web 2.0, participatory and collaborative content production have largely replaced the traditional ways of information sharing and have created the novel genre of collaboratively constructed language resources. A vast untapped potential lies in the dynamic aspects of these resources, which cannot be unleashed with traditional methods designed for static corpora. In this chapter, we focus on Wikipedia as the most prominent instance of collaboratively constructed language resources. In particular, we discuss the significance of Wikipedia’s revision history for applications in Natural Language Processing (NLP) and the unique prospects of the user discussions, a new resource that has just begun to be mined. While the body of research on processing Wikipedia’s revision history is dominated by works that use the revision data as the basis for practical applications such as spelling correction or vandalism detection, most of the work focused on user discussions uses NLP for analyzing and understanding the data itself.

Citation Context

... Table 5.11 Resources based on Wikipedia articles and Talk pages

Resource        Based on   Annotations                       Format    License
WVC [26, 27]    revisions  vandalism                         CSV       CC
WiCoPaCo [18]   revisions  spelling errors and paraphrases   XML       GFDL
[40]            revisions  lexical simplifications          CSV       –
[41]            revisions  textual entailment                XML, TXT  –
[43]            revisions  re...

Snooping Wikipedia Vandals with MapReduce

by Michele Spina, Dario Rossi, Mauro Sozio, Silviu Maniu, Bogdan Cautis
Abstract
In this paper, we present and validate an algorithm able to accurately identify anomalous behaviors on online and collaborative social networks, based on their interaction with other fellows. We focus on Wikipedia, where accurate ground truth for the classification of vandals can be reliably gathered by manual inspection of the page edit history. We develop distributed crawler and classifier tasks, both implemented in MapReduce, with which we are able to explore a very large dataset consisting of over 5 million articles collaboratively edited by 14 million authors, resulting in over 8 billion pairwise interactions. We represent Wikipedia as a signed network, where positive arcs imply constructive interaction between editors. We then isolate a set of high-reputation editors (i.e., nodes having many positive incoming links) and classify the remaining ones based on their interactions with high-reputation editors. We demonstrate our approach to be not only practically relevant (due to the size of our dataset), but also feasible (as it requires few MapReduce iterations) and accurate (over 95% true positive rate). At the same time, we are able to classify only about half of the dataset's editors (recall of 50%), for which we outline some solutions under study.
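The classification step the abstract describes (seed a trusted set of high-reputation editors, then label the remaining editors by the sign of their interactions with that set) could be sketched, in single-machine form rather than MapReduce, as follows; the triple format and the reputation threshold are assumptions, not the paper's actual parameters:

```python
from collections import defaultdict

def classify_editors(interactions, reputation_threshold=5):
    """interactions: list of (src, dst, sign) triples with sign in {+1, -1},
    where a positive arc means src interacted constructively with dst."""
    positive_in = defaultdict(int)
    for _, dst, sign in interactions:
        if sign > 0:
            positive_in[dst] += 1

    # Editors with many positive incoming arcs form the trusted seed set.
    trusted = {e for e, n in positive_in.items() if n >= reputation_threshold}

    # Label the rest by the net sign of their interactions with trusted
    # editors; editors with no such interactions stay unlabeled, which is
    # consistent with the ~50% recall limitation the abstract mentions.
    votes = defaultdict(int)
    for src, dst, sign in interactions:
        if dst in trusted and src not in trusted:
            votes[src] += sign

    labels = {e: ("benign" if v > 0 else "vandal") for e, v in votes.items()}
    return trusted, labels
```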

Citation Context

... 32,452 edits on 28,468 Wikipedia articles, with 2,391 vandals (smaller than ours, but with exhaustive ground truth). On the PAN-WVC-10 corpus, which has become fairly popular since, more recent work [20], [19] reported an accuracy of 96% (thus comparable to ours). Yet, as previously pointed out, vandals could easily “game” some features (e.g., breaking long suspicious edits in many smaller ones) that...

Damage Detection and Mitigation in Open Collaboration Applications

by Andrew Granville West, 2013
Abstract
Collaborative functionality is changing the way information is amassed, refined, and disseminated in online environments. A subclass of these systems characterized by "open collaboration" uniquely allows participants to *modify* content with low barriers to entry. A prominent example and our case study, English Wikipedia, exemplifies the vulnerabilities: 7%+ of its edits are blatantly unconstructive. Our measurement studies show this damage manifests in novel socio-technical forms, limiting the effectiveness of computational detection strategies from related domains. In turn, this has made much mitigation the responsibility of a poorly organized and ill-routed human workforce. We aim to improve all facets of this incident response workflow. Complementing language-based solutions, we first develop content-agnostic predictors of damage. We implicitly glean reputations for system entities and overcome sparse behavioral histories with a spatial reputation model combining evidence from multiple granularities. We also identify simple yet indicative metadata features that capture participatory dynamics and content maturation. When brought to bear over damage corpora, our contributions: (1) advance benchmarks over a broad set of security issues ("vandalism"), (2) perform well in the first anti-spam specific approach, and (3) demonstrate their portability over diverse open collaboration use cases.
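To make the idea of content-agnostic metadata features concrete, here is a small hypothetical sketch (not the thesis's actual feature set); the `edit` dict keys and the specific features chosen are assumptions:

```python
from datetime import datetime, timezone

def metadata_features(edit: dict) -> dict:
    """Content-agnostic features for one revision, computed without
    looking at the edited text itself."""
    ts = datetime.fromtimestamp(edit["timestamp"], tz=timezone.utc)
    return {
        # Anonymous (IP) editors are commonly reported to commit a
        # disproportionate share of damage.
        "is_anonymous": int(edit["is_anonymous"]),
        # Participatory dynamics: when the edit was made.
        "hour_of_day": ts.hour,
        "is_weekend": int(ts.weekday() >= 5),
        # Empty or terse edit comments often correlate with
        # unconstructive edits.
        "comment_length": len(edit.get("comment", "")),
    }
```

Features like these feed a standard supervised classifier alongside reputation scores; the appeal is that they are cheap to compute and cannot be evaded by rewording the vandalism itself.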

Citation Context

...e evaluation uses the PAN-WVC-10 corpus [110] which has become authoritative in the field. The corpus contains ≈32,000 randomly selected English Wikipedia revisions (a 2011 extension adds 10,000 more [111]) with 7% being vandalism, matching research estimates of vandalism prevalence. Tagging was outsourced to Amazon Mechanical Turk where multiple workers reviewed each edit and only those with strong c...
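The crowdsourced labeling procedure described in this snippet (multiple workers per edit, keeping only strong-consensus labels) can be sketched as a simple vote-aggregation step; the two-thirds agreement threshold is an assumption, not the corpus's documented criterion:

```python
def consensus_labels(annotations, min_agreement=2/3):
    """annotations: dict mapping edit_id -> list of worker votes
    (True = vandalism). Returns labels only for edits where the
    vote fraction shows strong agreement either way."""
    labels = {}
    for edit_id, votes in annotations.items():
        frac = sum(votes) / len(votes)
        if frac >= min_agreement:
            labels[edit_id] = "vandalism"
        elif frac <= 1 - min_agreement:
            labels[edit_id] = "regular"
        # Otherwise there is no consensus and the edit is dropped
        # (or sent to additional annotators) rather than labeled.
    return labels
```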

Developed at and hosted by The College of Information Sciences and Technology

© 2007-2019 The Pennsylvania State University