The Author-Topic Model for Authors and Documents
Cited by 153 (10 self)
We introduce the author-topic model, a generative model for documents that extends Latent Dirichlet Allocation (LDA; Blei, Ng, & Jordan, 2003) to include authorship information. Each author is associated with a multinomial distribution over topics and each topic is associated with a multinomial distribution over words. A document with multiple authors is modeled as a distribution over topics that is a mixture of the distributions associated with the authors. We apply the model to a collection of 1,700 NIPS conference papers and 160,000 CiteSeer abstracts. Exact inference is intractable for these datasets and we use Gibbs sampling to estimate the topic and author distributions. We compare the performance with two other generative models for documents, which are special cases of the author-topic model: LDA (a topic model) and a simple author model in which each author is associated with a distribution over words rather than a distribution over topics. We show topics recovered by the author-topic model, and demonstrate applications to computing similarity between authors and entropy of author output.
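The generative story in this abstract can be illustrated with a short sampling sketch. It is not the authors' code and it does no inference (no Gibbs sampling). It only draws words from fixed distributions. The function name `generate_document`, the toy distributions, and the choice to pick each word's author uniformly from the co-authors are assumptions made for this example.

```python
import random

def generate_document(authors, author_topics, topic_words, n_words, rng):
    """Sample the words of one document from fixed author and topic distributions."""
    words = []
    for _ in range(n_words):
        # Assumption: the author responsible for each word is drawn
        # uniformly from the document's co-authors.
        author = rng.choice(authors)
        # Draw a topic from that author's distribution over topics.
        topics, topic_probs = zip(*author_topics[author].items())
        topic = rng.choices(topics, weights=topic_probs, k=1)[0]
        # Draw a word from that topic's distribution over words.
        vocab, word_probs = zip(*topic_words[topic].items())
        words.append(rng.choices(vocab, weights=word_probs, k=1)[0])
    return words

# Toy, hand-made distributions (hypothetical values, not estimated from data).
author_topics = {
    "author_a": {"inference": 0.9, "vision": 0.1},
    "author_b": {"inference": 0.2, "vision": 0.8},
}
topic_words = {
    "inference": {"sampling": 0.5, "posterior": 0.3, "prior": 0.2},
    "vision": {"image": 0.6, "pixel": 0.3, "edge": 0.1},
}

rng = random.Random(0)
doc = generate_document(["author_a", "author_b"], author_topics, topic_words, 20, rng)
print(doc)
```

Because each word may come from a different co-author, the document's overall topic mix is a mixture of the co-authors' topic distributions, which is the property the abstract describes.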
Mining E-mail Content for Author Identification Forensics
SIGMOD Record, 2001
Cited by 59 (1 self)
We describe an investigation into e-mail content mining for author identification, or authorship attribution, for the purpose of forensic investigation. We focus our discussion on the ability to discriminate between authors for the case of both aggregated e-mail topics as well as across different e-mail topics. An extended set of e-mail document features, including structural characteristics and linguistic patterns, was derived and, together with a Support Vector Machine learning algorithm, was used for mining the e-mail content. Experiments using a number of e-mail documents generated by different authors on a set of topics gave promising results for both aggregated and multi-topic author categorisation.
Automatically categorizing written texts by author gender
Literary and Linguistic Computing, 2003
Cited by 42 (8 self)
The problem of automatically determining the gender of a document's author would appear to be a more subtle problem than those of categorization by topic or authorship attribution. Nevertheless, it is shown that automated text categorization techniques can exploit combinations of simple lexical and syntactic features to infer the gender of the author of an unseen formal written document with approximately 80% accuracy. The same techniques can be used to determine if a document is fiction or non-fiction with approximately 98% accuracy.
Augmenting Naive Bayes Classifiers with Statistical Language Models
2003
Cited by 38 (0 self)
We augment naive Bayes models with statistical n-gram language models to address shortcomings of the standard naive Bayes text classifier. The result is a generalized naive Bayes classifier
N-gram-based author profiles for authorship attribution
2003
Cited by 34 (3 self)
We present a novel method for computer-assisted authorship attribution based on character-level n-gram author profiles, which is motivated by an almost-forgotten, pioneering method from 1976. The existing approaches to automated authorship attribution implicitly build author profiles as vectors of feature weights, as language models, or similar. Our approach is based on byte-level n-grams; it is language independent, and the generated author profiles are limited in size. The effectiveness of the approach and its language independence are demonstrated in experiments performed on English, Greek, and Chinese data. The accuracy of the results is at the level of the current state-of-the-art approaches, or higher in some cases.
Key words: Authorship attribution, character n-grams, text categorization
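The idea of a size-limited, byte-level n-gram author profile can be sketched as follows. This is a minimal illustration rather than the authors' implementation. The function names, the default of n = 3, the profile size of 50, and the relative-difference distance used to compare profiles are all assumptions made for this example.

```python
from collections import Counter

def ngram_profile(text, n=3, size=50):
    """Return the `size` most frequent byte n-grams with normalized frequencies."""
    data = text.encode("utf-8")
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    total = sum(counts.values())
    if total == 0:  # text shorter than n bytes
        return {}
    return {gram: c / total for gram, c in counts.most_common(size)}

def profile_distance(p1, p2):
    """Sum of squared relative frequency differences over the union of n-grams.

    Assumption: this particular dissimilarity is chosen for illustration;
    an n-gram missing from a profile is treated as having frequency 0.
    """
    dist = 0.0
    for gram in set(p1) | set(p2):
        f1, f2 = p1.get(gram, 0.0), p2.get(gram, 0.0)
        dist += ((f1 - f2) / ((f1 + f2) / 2)) ** 2
    return dist

def attribute(text, author_profiles, n=3, size=50):
    """Pick the author whose profile is closest to the profile of `text`."""
    unknown = ngram_profile(text, n, size)
    return min(author_profiles,
               key=lambda author: profile_distance(unknown, author_profiles[author]))

# Toy usage with made-up "authors".
profiles = {
    "x": ngram_profile("aaaa aaaa aaaa"),
    "y": ngram_profile("zzzz zzzz zzzz"),
}
print(attribute("aaaa aaaa", profiles))
```

Working on bytes rather than words means no tokenizer or language-specific preprocessing is needed, which is what makes this kind of profile language independent.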
Language and Task Independent Text Categorization with Simple Language Models
In Proc. of HLT-NAACL ’03, 2003
Cited by 19 (0 self)
We present a simple method for language independent and task independent text categorization learning, based on character-level n-gram language models. Our approach uses simple information theoretic principles and achieves effective performance across a variety of languages and tasks without requiring feature selection or extensive pre-processing. To demonstrate the language and task independence of the proposed technique, we present experimental results on several languages (Greek, English, Chinese and Japanese) in several text categorization problems (language identification, authorship attribution, text genre classification, and topic detection). Our experimental results show that the simple approach achieves state-of-the-art performance in each case.
A Survey of Modern Authorship Attribution Methods
Journal of the American Society for Information Science and Technology
Cited by 18 (0 self)
Authorship attribution supported by statistical or computational methods has a long history, starting from the 19th century and marked by the seminal study of Mosteller and Wallace (1964) on the authorship of the disputed Federalist Papers. During the last decade, this scientific field has developed substantially, taking advantage of research advances in areas such as machine learning, information retrieval, and natural language processing. The plethora of available electronic texts (e.g., e-mail messages, online forum messages, blogs, source code, etc.) indicates a wide variety of applications of this technology, provided it is able to handle short and noisy text from multiple candidate authors. In this paper, a survey of recent advances in automated approaches to attributing authorship is presented, examining their characteristics for both text representation and text classification. The focus of this survey is on computational requirements and settings rather than linguistic or literary issues. We also discuss evaluation methodologies and criteria for authorship attribution studies and list open questions that will attract future work in this area.
A controlled-corpus experiment in authorship identification by cross-entropy
Literary and Linguistic Computing, 2003
Cited by 15 (3 self)
This paper describes an authorship, and more generally document classification, experiment on a preexisting Dutch corpus of university writings. By measuring linguistic distances using a cross-entropy technique, a technique sensitive not only to the distributions of language features, but also to their relative intersequencing, classification judgments can be made with great sensitivity, significance, confidence, and accuracy. In particular, despite the designed difficulty of the Dutch corpus used, the technique was still able to reliably detect not only authorship, but also subtle features of register, topic, and even the educational attainments of the author. We present evidence suggesting that this technique outperforms more well-known techniques such as function word principal components analysis or linear discriminant analysis, and suggest ways in which performance can be improved.
Language Independent Authorship Attribution using Character Level Language Models
2003
Cited by 15 (0 self)
We present a method for computer-assisted authorship attribution based on character-level n-gram language models.
Using Markov Chains for Identification of Writers
2002
Cited by 14 (0 self)
In this paper we present a technique for authorship attribution based on a simple Markov chain of letters (i.e., just letter bigrams are used). Many proposed methods of authorship attribution are illustrated on small examples. We show that this technique provides excellent results when applied to over 380 texts from the Project Gutenberg archives, as well as to two previously published data sets.
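A letter-bigram Markov chain of the kind this abstract describes can be sketched as below. This is an illustrative sketch under stated assumptions, not the paper's procedure: the 27-symbol alphabet (lowercase letters plus space), the add-alpha smoothing, and the function names `normalize`, `train`, `log_likelihood`, and `attribute` are all choices made for this example.

```python
import math
from collections import Counter

# Assumption: texts are reduced to lowercase ASCII letters and single spaces.
ALPHABET = "abcdefghijklmnopqrstuvwxyz "

def normalize(text):
    """Lowercase, map anything outside ALPHABET to a space, collapse spaces."""
    chars = [c if c in ALPHABET else " " for c in text.lower()]
    return " ".join("".join(chars).split())

def train(text):
    """Count letter bigrams and how often each letter starts a bigram."""
    s = normalize(text)
    pair_counts = Counter(zip(s, s[1:]))
    first_counts = Counter(s[:-1])
    return pair_counts, first_counts

def log_likelihood(text, model, alpha=1.0):
    """Log-probability of the text's transitions under an author's chain.

    Assumption: add-alpha smoothing keeps unseen bigrams from having
    zero probability.
    """
    pair_counts, first_counts = model
    s = normalize(text)
    total = 0.0
    for a, b in zip(s, s[1:]):
        p = (pair_counts[(a, b)] + alpha) / (first_counts[a] + alpha * len(ALPHABET))
        total += math.log(p)
    return total

def attribute(text, models):
    """Return the author whose chain assigns the text the highest likelihood."""
    return max(models, key=lambda name: log_likelihood(text, models[name]))

# Toy usage with made-up training texts.
models = {
    "a": train("the cat sat on the mat the cat sat"),
    "b": train("zebra quiz jazz buzz fizz pizza"),
}
print(attribute("the cat", models))
```

In practice each author's chain would be trained on a much larger body of text than these toy strings, and the comparison would be made across many candidate authors.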

