Author Identification on the Large Scale (2005)
| Venue: | In Proc. of the Meeting of the Classification Society of North America |
| Citations: | 9 - 0 self |
BibTeX
@INPROCEEDINGS{Madigan05authoridentification,
author = {David Madigan and Alexander Genkin and David D. Lewis and Shlomo Argamon and Dmitriy Fradkin and Li Ye},
title = {Author Identification on the Large Scale},
booktitle = {Proc. of the Meeting of the Classification Society of North America},
year = {2005}
}
Abstract
This paper concerns techniques for identifying authors in large collections of textual artifacts (e-mails, communiques, transcribed speech, etc.). Our approach focuses on very high-dimensional, topic-free document representations and on particular attribution problems, such as: (1) Which one of these K authors wrote this particular document? (2) Did any of these K authors write this particular document? Scientific investigation into measuring the style and authorship of texts goes back to the late nineteenth century, with the pioneering studies of Mendenhall [36] and Mascol [34, 35] on distributions of sentence and word lengths in works of literature and the gospels of the New Testament. The underlying notion was that works by different authors are strongly distinguished by quantifiable features of the text. By the mid-twentieth century, this line of research had grown into what became known as "stylometrics", and a variety of textual statistics had been proposed to quantify textual style. Early work was characterized by a search for invariant properties of textual statistics, such as Zipf's distribution and Yule's K statistic.







