Results 1 - 10 of 23
Automatically categorizing written texts by author gender
Literary and Linguistic Computing, 2003
Cited by 42 (8 self)
The problem of automatically determining the gender of a document's author would appear to be a more subtle problem than those of categorization by topic or authorship attribution. Nevertheless, it is shown that automated text categorization techniques can exploit combinations of simple lexical and syntactic features to infer the gender of the author of an unseen formal written document with approximately 80% accuracy. The same techniques can be used to determine if a document is fiction or non-fiction with approximately 98% accuracy.
Augmenting Naive Bayes Classifiers with Statistical Language Models
2003
Cited by 38 (0 self)
We augment naive Bayes models with statistical n-gram language models to address shortcomings of the standard naive Bayes text classifier. The result is a generalized naive Bayes classifier
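To make the idea in this abstract concrete, here is a minimal sketch of a Bayes-style classifier whose per-class likelihood comes from an n-gram language model instead of independent word counts. The class name `NgramBayes`, the choice of character bigrams, and the add-one smoothing are all assumptions made for illustration; the abstract does not specify these details, and this is not the authors' implementation.

```python
import math
from collections import Counter, defaultdict


class NgramBayes:
    """Toy classifier: one character n-gram language model per class.

    Hypothetical illustration only (add-one smoothing, n=2 by default).
    """

    def __init__(self, n=2):
        self.n = n
        self.ngram_counts = defaultdict(Counter)    # class -> n-gram -> count
        self.context_counts = defaultdict(Counter)  # class -> (n-1)-gram -> count
        self.class_docs = Counter()                 # class -> number of documents
        self.vocab = set()                          # characters seen in training

    def _ngrams(self, text):
        # Pad on the left so the first characters also have a context.
        padded = " " * (self.n - 1) + text
        return [padded[i:i + self.n] for i in range(len(padded) - self.n + 1)]

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.class_docs[label] += 1
            self.vocab.update(text)
            for gram in self._ngrams(text):
                self.ngram_counts[label][gram] += 1
                self.context_counts[label][gram[:-1]] += 1
        return self

    def log_score(self, text, label):
        # log P(class) + sum over n-grams of log P(char | context, class)
        total_docs = sum(self.class_docs.values())
        score = math.log(self.class_docs[label] / total_docs)
        v = len(self.vocab) + 1  # +1 reserves mass for unseen characters
        for gram in self._ngrams(text):
            num = self.ngram_counts[label][gram] + 1
            den = self.context_counts[label][gram[:-1]] + v
            score += math.log(num / den)
        return score

    def predict(self, text):
        return max(self.class_docs, key=lambda c: self.log_score(text, c))
```

With `n=1` the model reduces to a character-level unigram naive Bayes classifier, which is one way to see the n-gram version as a generalization of it.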
N-gram-based author profiles for authorship attribution
2003
Cited by 34 (3 self)
We present a novel method for computer-assisted authorship attribution based on character-level n-gram author profiles, which is motivated by an almost-forgotten, pioneering method from 1976. The existing approaches to automated authorship attribution implicitly build author profiles as vectors of feature weights, as language models, or similar. Our approach is based on byte-level n-grams; it is language independent, and the generated author profiles are limited in size. The effectiveness of the approach and its language independence are demonstrated in experiments performed on English, Greek, and Chinese data. The accuracy of the results is at the level of the current state-of-the-art approaches, or higher in some cases.
Key words: Authorship attribution, character n-grams, text categorization
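The profile idea in this abstract can be sketched as follows: each author is represented by a size-limited set of the most frequent byte-level n-grams with their relative frequencies, and an unknown text is assigned to the author whose profile is least dissimilar. The function names, the defaults `n=3` and `size=100`, and the normalized squared-difference measure are assumptions for illustration; the abstract itself does not state which dissimilarity is used.

```python
from collections import Counter


def build_profile(text, n=3, size=100):
    """Return the `size` most frequent byte n-grams with relative frequencies."""
    data = text.encode("utf-8")  # byte-level, so no language-specific tokenizer
    grams = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.most_common(size)}


def dissimilarity(p, q):
    """Sum of squared differences, each normalized by the mean frequency.

    Assumed measure: an n-gram missing from one profile counts as frequency 0.
    """
    score = 0.0
    for g in set(p) | set(q):
        a, b = p.get(g, 0.0), q.get(g, 0.0)
        score += ((a - b) / ((a + b) / 2)) ** 2
    return score


def attribute(unknown_text, author_texts, n=3, size=100):
    """Pick the author whose profile is closest to the unknown text's profile."""
    target = build_profile(unknown_text, n, size)
    profiles = {a: build_profile(t, n, size) for a, t in author_texts.items()}
    return min(profiles, key=lambda a: dissimilarity(target, profiles[a]))
```

Because the profile is built from raw bytes, the same code runs unchanged on English, Greek, or Chinese input, which mirrors the language independence the abstract claims.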
Exploiting Stylistic Idiosyncrasies for Authorship Attribution
In IJCAI’03 Workshop on Computational Approaches to Style Analysis and Synthesis, 2003
Cited by 23 (3 self)
Introduction. Early researchers in authorship attribution used a variety of statistical methods to identify stylistic discriminators, that is, characteristics which remain approximately invariant within the works of a given author but which tend to vary from author to author (Holmes 1998, McEnery & Oakes 2000). In recent years machine learning methods have been applied to authorship attribution. A few examples include (Matthews & Merriam 1993, Holmes & Forsyth 1995, Stamatatos et al. 2001, de Vel et al. 2001). Both the earlier "stylometric" work and the more recent machine-learning work have tended to focus on initial sets of candidate discriminators which are fairly ubiquitous. For example, the classical work of Mosteller and Wallace (1964) on the Federalist Papers used a set of several hundred function words, that is, words that are context-independent and hence unlikely to be biased towards specific topics. Other features used in even earlier work (Yule 1938) are complexity-base
Applying authorship analysis to extremist-group web forum messages
IEEE Intelligent Systems, 2005
"... linguistic features of ..."
A Survey of Modern Authorship Attribution Methods
Journal of the American Society for Information Science and Technology
Cited by 18 (0 self)
Authorship attribution supported by statistical or computational methods has a long history, starting from the 19th century and marked by the seminal study of Mosteller and Wallace (1964) on the authorship of the disputed Federalist Papers. During the last decade, this scientific field has developed substantially, taking advantage of research advances in areas such as machine learning, information retrieval, and natural language processing. The plethora of available electronic texts (e.g., e-mail messages, online forum messages, blogs, source code, etc.) indicates a wide variety of applications of this technology, provided it is able to handle short and noisy text from multiple candidate authors. In this paper, a survey of recent advances in automated approaches to attributing authorship is presented, examining their characteristics for both text representation and text classification. The focus of this survey is on computational requirements and settings rather than linguistic or literary issues. We also discuss evaluation methodologies and criteria for authorship attribution studies and list open questions that will attract future work in this area.
Change of Writing Style With Time
Computers and the Humanities, 2004
Cited by 13 (5 self)
This study investigates the writing style change of two Turkish authors, Çetin Altan and Yaşar Kemal, in their old and new works using, respectively, their newspaper columns and novels. The style markers are the frequencies of word lengths in both text and vocabulary, and the rate of usage of the most frequent words. For both authors, t-tests and logistic regressions show that the words in the new works are significantly longer than those in the old. The principal component analyses graphically illustrate the separation between old and new texts. The works are correctly categorized as old or new with 75 to 100% accuracy and 92% average accuracy using discriminant analysis-based cross validation. The results imply that a larger time gap may have a positive impact on separation and categorization. For Altan a regression analysis demonstrates a decrease in average word length as the age of his column increases. One interesting observation is that for one word each author has similar preference changes over time.
Gender, genre, and writing style in formal written texts
Text, 2003
Cited by 12 (3 self)
This paper explores differences between male and female writing in a large subset of the British National Corpus covering a range of genres. Several classes of simple lexical and syntactic features that differ substantially according to author gender are identified, both in fiction and in non-fiction documents. In particular, we find significant differences between male- and female-authored documents in the use of pronouns and certain types of noun modifiers: although the total number of nominals used by male and female authors is virtually identical, females use many more pronouns and males use many more noun specifiers. More generally, it is found that even in formal writing, female writing exhibits greater usage of features identified by previous researchers as "involved" while male writing exhibits greater usage of features which have been identified as "informational". Finally, a strong correlation between the characteristics of male (female) writing and those of non-fiction (fiction) is demonstrated.
Effective and Scalable Authorship Attribution Using Function Words
Proceedings of the Second AIRS Asian Information Retrieval Symposium, 2005
Cited by 9 (0 self)
Techniques for identifying the author of an unattributed document can be applied to problems in information analysis and in academic scholarship. A range of methods have been proposed in the research literature, using a variety of features and machine learning approaches, but the methods have been tested on very different data and the results cannot be compared. It is not even clear whether the differences in performance are due to feature selection or other variables. In this paper we examine the use of a large publicly available collection of newswire articles as a benchmark for comparing authorship attribution methods. To demonstrate the value of having a benchmark, we experimentally compare several recent feature-based techniques for authorship attribution, and test how well these methods perform as the volume of data is increased. We show that the benchmark is able to clearly distinguish between different approaches, and that the scalability of the best methods based on function-word features is acceptable, with only moderate decline as the difficulty of the problem is increased.
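As a rough illustration of what function-word features look like in practice, the sketch below turns a text into a vector of relative frequencies over a fixed function-word list. The ten-word list `FUNCTION_WORDS`, the simple regular-expression tokenizer, and the function name are hypothetical choices for this example; the paper's actual word list and preprocessing are not given in the abstract.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny list; real studies use far larger lists.
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "it", "is", "was")


def function_word_features(text, vocabulary=FUNCTION_WORDS):
    """Relative frequency of each function word, in vocabulary order."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[w] / total for w in vocabulary]
```

Vectors of this kind can be passed to any standard classifier. Since the vocabulary is fixed in advance, the feature dimension does not grow with the size of the collection, which is one reason such features tend to scale well.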
Analyzing large collections of electronic text using OLAP
In APICS 2005
Cited by 7 (4 self)
Computer-assisted reading and analysis of text has applications in the humanities and social sciences. Ever-larger electronic text archives have the advantage of allowing a more complete analysis but the disadvantage of forcing longer waits for results. On-Line Analytical Processing (OLAP) allows quick analysis of multidimensional data. By storing text-analysis information in an OLAP system, queries may be solved in seconds instead of minutes or hours. This analysis is user-driven, allowing users the freedom to pursue their own directions of research.

