## Authorship Attribution with Support Vector Machines (2000)

Venue: APPLIED INTELLIGENCE

Citations: 62 (0 self)

### BibTeX

```bibtex
@ARTICLE{Diederich00authorshipattribution,
  author  = {Joachim Diederich and Jörg Kindermann and Edda Leopold and Gerhard Paass},
  title   = {Authorship Attribution with Support Vector Machines},
  journal = {APPLIED INTELLIGENCE},
  year    = {2000},
  volume  = {19},
  pages   = {2003}
}
```

### Abstract

In this paper we explore the use of text-mining methods for the identification of the author of a text. For the first time we apply the support vector machine (SVM) to this problem. Since it can cope with half a million inputs, it requires no feature selection and can process the frequency vector of all words of a text. We performed a number of experiments with texts from a German newspaper. With nearly perfect reliability the SVM was able to reject other authors and detected the target author in 60-80% of the cases. In a second experiment we ignored nouns, verbs and adjectives and replaced them by grammatical tags and bigrams. This resulted in slightly reduced performance. Author detection with SVM on full word forms was remarkably robust even if the author wrote about different topics.
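
The representation described in the abstract, a frequency vector over every word of a text with no feature selection, can be sketched as follows. This is a minimal illustration, not the authors' code; the toy documents and vocabulary are invented (the paper uses all word forms of a German newspaper corpus):

```python
from collections import Counter

def frequency_vector(text, vocabulary):
    """Relative frequency of every vocabulary word in a text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[w] / total for w in vocabulary]

# Toy corpus, for illustration only.
docs = ["the cat sat on the mat", "the dog barked at the cat"]
vocab = sorted({w for d in docs for w in d.lower().split()})
vectors = [frequency_vector(d, vocab) for d in docs]
```

Each document becomes one point in a space with as many dimensions as vocabulary entries; the SVM is then trained directly on these points.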

### Citations

8973 | Statistical Learning Theory
- Vapnik
- 1998
Citation Context: ...ently, a boosting version was applied to text categorization with good results [ADW98]. 3.3 Support Vector Machines: Support Vector Machines (SVMs) recently gained popularity in the learning community [Vap98]. In its simplest linear form, an SVM is a hyperplane that separates a set of positive examples from a set of negative examples with maximum interclass distance, the margin. Figure 1 shows such a hype...
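
The maximum-margin idea in this excerpt can be made concrete with a small sketch. The toy points and the hyperplane below are hand-chosen for illustration; an actual SVM solver would find the w and b that maximise the margin:

```python
import math

# Toy 2-D data: positives lie above the line x1 + x2 = 3, negatives below.
positives = [(2.0, 2.0), (3.0, 3.0)]
negatives = [(0.0, 0.0), (1.0, 0.5)]

# A separating hyperplane w·x + b = 0 (hand-chosen here).
w, b = (1.0, 1.0), -3.0

def decide(x):
    """Linear SVM decision rule: sign of w·x + b."""
    return 1 if w[0] * x[0] + w[1] * x[1] + b >= 0 else -1

def distance(x):
    """Euclidean distance of a point to the hyperplane."""
    return abs(w[0] * x[0] + w[1] * x[1] + b) / math.hypot(*w)

# The geometric margin is the distance of the closest training point;
# those closest points are the support vectors.
margin = min(distance(x) for x in positives + negatives)
```
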

3122 | Introduction to Modern Information Retrieval
- Salton, McGill
Citation Context: ...is often thought in terms of reduction of dimensionality of the input space. Common importance weights like inverse document frequency (idf) originally have been designed for identifying index words [SM83]. However, SVMs are able to manage a large number of dimensions. Therefore a reduction of dimensionality is not necessary and importance weights can be used to quantify how important a specific given ...
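
The idf weighting mentioned in this excerpt can be sketched briefly; the toy document sets are invented, and this shows the standard log(N/df) form rather than any paper-specific variant:

```python
import math

# Toy collection: each document reduced to its set of distinct words.
docs = [
    {"support", "vector", "machine"},
    {"vector", "space", "model"},
    {"support", "vector", "networks"},
]
N = len(docs)

def idf(word):
    """Inverse document frequency: rare words get high weight."""
    df = sum(1 for d in docs if word in d)
    return math.log(N / df)

# "vector" appears in every document -> weight 0; "space" is rare -> high weight.
```
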

2168 | Support-Vector Networks
- Cortes, Vapnik
- 1995
Citation Context: ...h training example. Note that the hyperplane is only determined by the training instances x_i on the margin, the support vectors. Of course, not all problems are linearly separable. Cortes and Vapnik [CV95] proposed a modification to the optimization formulation that allows, but penalizes, examples that fall on the wrong side of the decision boundary. Support vector machines are based on the structural ...

1696 | Text categorization with support vector machines: learning with many relevant features
- Joachims
- 1998
Citation Context: ...y 1,000 style markers have already been isolated." There is clearly no agreement on significant style markers. It seems that in text categorization nearly all words contain some information. Joachims [Joa98b] ranked 10,000 word stems of a large corpus according to their information gain with respect to some classification. It turned out that a model using features with ranks 201-500 performed nearly as wel...
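
The information-gain ranking described here scores each word by how much knowing its presence reduces class uncertainty. A minimal sketch with invented labels and word-presence indicators:

```python
import math

def entropy(p):
    """Entropy of a Bernoulli(p) class variable, in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def information_gain(labels, has_word):
    """IG of a binary word-presence feature with respect to the class."""
    n = len(labels)
    gain = entropy(sum(labels) / n)
    for present in (True, False):
        group = [l for l, h in zip(labels, has_word) if h == present]
        if group:
            gain -= len(group) / n * entropy(sum(group) / len(group))
    return gain

labels = [1, 1, 0, 0]                   # document classes
perfect = [True, True, False, False]    # word splits the classes perfectly
useless = [True, False, True, False]    # word independent of the class
```

Ranking all word stems by this score and keeping a band of ranks (as Joachims did with ranks 201-500) is then a simple sort.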

1526 | Term Weighting Approaches in Automatic Text Retrieval
- Salton, Buckley
- 1988
Citation Context: ...tion procedure. This affects mainly high-frequency terms and collections of larger documents. Empirical studies show that the utilization of F_idf(w_k) in many cases leads to an improved performance [SB88]. Redundancy quantifies the skewness of a probability distribution. We consider the empirical distribution of each type over the different documents and define an importance weight by R(w_k) = Mmax ...

1440 | Making large-scale SVM learning practical
- Joachims
- 1999
Citation Context: ...esian probabilistic networks [TH93], multilayer perceptrons [NGL97], radial basis function networks [LM95], decision trees [ADW98], nearest neighbor classification [LH98], and support vector machines [Joa98a]. These models are universal approximators as they are able to approximate any functional relation arbitrarily well. They model the underlying distribution with a potentially infinite number of parame...

1289 | A training algorithm for optimal margin classifiers
- Boser, Guyon, et al.
- 1992
Citation Context: ...ls. The SVM can be extended to nonlinear models by mapping the input space into a very high-dimensional feature space chosen a priori. In this space the optimal separating hyperplane is constructed [BGV92] [Vap98, p. 421]. Let Φ : ℝ^N → F be a mapping x → z = Φ(x) such that N ≪ dim(F). Then for a given Φ the linear classification u = s·z + b with parameter s may be learned. Note, however, that t...

757 | A Comparison of Event Models for Naive Bayes Text Classification
- McCallum, Nigam
- 1998
Citation Context: ...c Models for Classification: The advent of powerful computers initiated the development of machine learning techniques with larger flexibility. While regression models [YC94] and naive Bayesian models [MN98] for text classification still have structural limitations, a number of more and more versatile procedures were applied to text categorization, e.g. inductive rule learning [MRG96], Bayesian probabili...

682 | Transductive inference for text classification using support vector machines
- Joachims
- 1999
Citation Context: ...every case, the two SVM versions (polynomial and rbf) performed substantially better than the currently best performing conventional methods (naive Bayes, Rocchio, decision trees, k-nearest neighbor). [Joa99] used a transductive SVM for text categorization, which is able to exploit the information in unlabeled training data. Dumais et al. [DPHS98] use linear SVMs for text categorization because they are bo...

561 | Statistical language learning - Charniak - 1996

503 | Inductive learning algorithms and representations for text categorization
- Dumais, Platt, et al.
- 1998
Citation Context: ...s (naive Bayes, Rocchio, decision trees, k-nearest neighbor). [Joa99] used a transductive SVM for text categorization, which is able to exploit the information in unlabeled training data. Dumais et al. [DPHS98] use linear SVMs for text categorization because they are both accurate and fast. They are 35 times faster to train than the next most accurate (a decision tree) of the tested classifiers. They applie...

417 | Discrete Multivariate Analysis: Theory and Practice - Bishop, Fienberg, et al. - 1975

320 | Human Behaviour and the Principle of Least-Effort
- Zipf
- 1949
Citation Context: ...owerful criterion of stylometry is the 'richness' or 'diversity' of an author's vocabulary. Zipf observed that the number of words α(f) which occur exactly f times is given by α(f) = f^(−γ), where γ ≈ 2 [Zip32a]. He conjectured that the parameter γ depends on the age and intelligence of an author [Zip37]. Sichel [Sic75] was able to successfully fit a family of compound Poisson distributions to the word freq...
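
The frequency spectrum α(f) that Zipf's observation is about, the number of distinct words occurring exactly f times, is easy to compute; a minimal sketch on an invented token list:

```python
from collections import Counter

def frequency_spectrum(tokens):
    """Map f -> number of distinct words occurring exactly f times."""
    word_counts = Counter(tokens)
    return Counter(word_counts.values())

tokens = "a a a b b c c d e f".split()
spectrum = frequency_spectrum(tokens)
# spectrum[1] = words seen once, spectrum[2] = words seen twice, ...
```

Fitting α(f) ∝ f^(−γ) to such a spectrum (e.g. by linear regression on log f vs. log α(f)) yields the richness parameter γ discussed in the excerpt.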

239 | Support Vector Machines for Spam Categorization
- Drucker, Vapnik
- 1999
Citation Context: ...fast. They are 35 times faster to train than the next most accurate (a decision tree) of the tested classifiers. They applied SVMs to the Reuters-21578 collection, emails and web pages. Drucker et al. [DWV99] classify emails as spam and non-spam. They find that boosting trees and SVMs have similar performance in terms of accuracy and speed. SVMs train significantly faster. 4 Transformations of Frequency...

234 | A survey and critique of techniques for extracting rules from trained artificial neural networks
- Andrews, Diederich, et al.
- 1995
Citation Context: ...t into the process by which they arrived at a given result nor, in general, the totality of "knowledge" actually embedded in them. A number of techniques to explain these networks have been developed recently [ADT95]. Radial basis function (RBF) networks start with a number of prototype feature vectors for each class and assume that the feature vector of a new exemplar is 'close' to some prototype of its...

188 | Inference and disputed authorship: The Federalist
- Mosteller, Wallace
- 1964
Citation Context: ...pecific words has a larger discriminatory power. Obviously word usage highly depends on the topic of the text. For discrimination purposes we need "content-free" or function words. In a seminal paper, Mosteller and Wallace [MW64] counted the use of words like 'while' and 'upon' to discriminate between possible authors. Burrows [Bur87] developed the idea of using sets of more than fifty common high-frequency words and conducte...

115 | Feature selection, perceptron learning, and a usability case study for text categorization
- Ng, Goh, et al.
- 1997
Citation Context: ...l limitations, a number of more and more versatile procedures were applied to text categorization, e.g. inductive rule learning [MRG96], Bayesian probabilistic networks [TH93], multilayer perceptrons [NGL97], radial basis function networks [LM95], decision trees [ADW98], nearest neighbor classification [LH98], and support vector machines [Joa98a]. These models are universal approximators as they are able...

115 | An example-based mapping method for text categorization and retrieval
- Yang, Chute
- 1994
Citation Context: ...owe's plays. 3.2 Semi-parametric Models for Classification: The advent of powerful computers initiated the development of machine learning techniques with larger flexibility. While regression models [YC94] and naive Bayesian models [MN98] for text classification still have structural limitations, a number of more and more versatile procedures were applied to text categorization, e.g. inductive rule lea...

110 | Selected studies of the principle of relative frequency in language
- Zipf
- 1932
Citation Context: ...4 Transformations of Frequency Vectors. 4.1 Normalization of Length: It is a well-known fact that the frequency distribution of words in large texts is fairly skewed. Zipf's law in the original version [Zip32b] was given by f(r) = A/(B + r), (8) where f(r) is the frequency of the term of rank r in a text and A and B are positive parameters. It can be seen that the distribution of frequencies of terms in text...

77 | Outside the Cave of Shadows: Using Syntactic Annotation to Enhance Authorship Attribution
- Baayen, Halteren, et al.
- 1996
Citation Context: ...of 25 prepositions to distinguish between Oscar Wilde's plays and essays. Instead of using word counts directly one can employ features derived from words. An example is the syntactic class of words [BvT96]. Compared to the use of syntax, word use is more easily influenced by choices which are under the conscious control of the author. As the discourse structure of texts from the same author and the cor...

62 | Using a generalized instance set for automatic text categorization
- Lam, Ho
- 1998
Citation Context: ...inductive rule learning [MRG96], Bayesian probabilistic networks [TH93], multilayer perceptrons [NGL97], radial basis function networks [LM95], decision trees [ADW98], nearest neighbor classification [LH98], and support vector machines [Joa98a]. These models are universal approximators as they are able to approximate any functional relation arbitrarily well. They model the underlying distribution with a...

54 | Automatic indexing based on Bayesian inference networks
- Tzeras, Hartmann
- 1993
Citation Context: ...sification still have structural limitations, a number of more and more versatile procedures were applied to text categorization, e.g. inductive rule learning [MRG96], Bayesian probabilistic networks [TH93], multilayer perceptrons [NGL97], radial basis function networks [LM95], decision trees [ADW98], nearest neighbor classification [LH98], and support vector machines [Joa98a]. These models are universa...

44 | Text categorization: a symbolic approach
- Moulinier, Raskinis, et al.
- 1996
(Show Context)
Citation Context ...aive Bayesian models [MN98] for text classification still have structural limitations, a number of more and more versatile procedures were applied to text categorization, e.g. inductive rule learning =-=[MRG96]-=-, Bayesian probabilistic networks [TH93], multilayer perceptrons [NGL97], radial basis function networks [LM95], decision trees [ADW98], nearest neighbor classification [LH98], and support vector mach... |

41 | INFERNO: A Cautious Approach to Uncertain
- Quinlan
- 1983
Citation Context: ...ees sequentially partition the input space along a single dimension. The corresponding variable is selected in a one-step lookahead greedy search using some heuristic measure of classification quality [Qui83]. This splits the training set into two parts and the procedure starts over. The size of the resulting tree is limited by cross-validation. Recently, a boosting version was applied to text categorizat...

36 | The state of authorship attribution studies: Some problems and solutions - Rudman - 1997

34 | The characteristic curves of composition
- Mendenhall
- 1887
Citation Context: ...are expressive enough to discriminate an author from other writers. Early stylometric studies introduced the idea of counting features in a text and applied this to word lengths and sentence lengths [Men87]. Yule [Yul38] reported a wider variation of sentence lengths than word lengths. There are differences in sentence length distributions for the same author, not only depending on time but also on the ...

30 | The evolution of stylometry in humanities scholarship
- Holmes
- 1998
Citation Context: ...]. They parallel differences in word length distributions in the prose and verse of the same author. Other features are counts of words beginning with a vowel or counts of words with specific lengths [Hol98]. A powerful criterion of stylometry is the 'richness' or 'diversity' of an author's vocabulary. Zipf observed that the number of words α(f) which occur exactly f times is given by α(f) = f^(−γ), where ...

29 | Neural network applications in stylometry: The Federalist papers
- Tweedie, Singh, et al.
- 1996
Citation Context: ...he resulting algorithm is very efficient. It has been extended to a mixture of multinomials and successfully applied to text categorization [MN98]. Multi-layer perceptrons were used by Tweedie et al. [TSH96] to attribute authorship to the disputed Federalist papers. They used the normalized frequency of eleven common function words in a text as input to the network. The neural network had three hidden an...

27 | On the theory of word frequencies and on related Markovian models of discourse
- Mandelbrot
- 1953
Citation Context: ...(cf. [CB93]). Zipf himself explained equation (8) by a "principle of least effort". A generalization of Zipf's law is given by f(r) = (A/(B + r))^(1/(γ−1)), (9) which is known as the Zipf-Mandelbrot law [Man53]. It contains equation (8) as a special case. In order to compare documents of different length, term-frequency vectors d_i have to be normalized to a standard length. From the standpoint of performanc...

26 | Text mining with decision rules and decision trees
- Apte, Damerau, et al.
- 1998
Citation Context: ...ere applied to text categorization, e.g. inductive rule learning [MRG96], Bayesian probabilistic networks [TH93], multilayer perceptrons [NGL97], radial basis function networks [LM95], decision trees [ADW98], nearest neighbor classification [LH98], and support vector machines [Joa98a]. These models are universal approximators as they are able to approximate any functional relation arbitrarily well. They ...

23 | Word patterns and story shapes: the statistical analysis of narrative style
- Burrows
- 1987
Citation Context: ...ext. For discrimination purposes we need "content-free" or function words. In a seminal paper, Mosteller and Wallace [MW64] counted the use of words like 'while' and 'upon' to discriminate between possible authors. Burrows [Bur87] developed the idea of using sets of more than fifty common high-frequency words and conducted a version of principal component analysis on the data. This technique has been successfully applied to th...

21 | Analysing for Authorship: A Guide to the Cusum Technique
- Farringdon
- 1996
Citation Context: ...ic assumptions, e.g. the independence of the terms (x_i − x̄). These bounds allow a nice graphical representation. It was used in a number of court cases and received significant public attention [Far96]. However, a number of independent investigations found the method unreliable [Hol98], as again the stability of these characteristics over multiple texts is not warranted. A test developed by Thisted...

21 | On a distribution law for word frequencies
- Sichel
- 1975
Citation Context: ...number of words α(f) which occur exactly f times is given by α(f) = f^(−γ), where γ ≈ 2 [Zip32a]. He conjectured that the parameter γ depends on the age and intelligence of an author [Zip37]. Sichel [Sic75] was able to successfully fit a family of compound Poisson distributions to the word frequencies of a number of authors and works in different languages. In order to remove the dependency of vocabular...

13 | A freely available morphological analyzer, disambiguator, and context sensitive lemmatizer for German
- Lezius, Rapp, et al.
- 1998
Citation Context: ...on Words: The second set of inputs was aimed at using less content information and more structural data. Here we lemmatized the corpus using the lemmatizer Morphy. Morphy was designed by Lezius et al. [LRW98] and is freely available on the Web. Lemmatization produced 455 different categories. An example is shown in table 4. We excluded all nouns (SUB), verbs (VER), and adjectives (ADJ) and used the word c...

13 | On sentence length as a statistical characteristic of style in prose with application to two cases of disputed authorship
- Yule
Citation Context: ...e enough to discriminate an author from other writers. Early stylometric studies introduced the idea of counting features in a text and applied this to word lengths and sentence lengths [Men87]. Yule [Yul38] reported a wider variation of sentence lengths than word lengths. There are differences in sentence length distributions for the same author, not only depending on time but also on the genre of text ...

12 | The advanced theory of language as choice and chance - Herdan - 1966

8 | Word frequency distributions
- Chitashvili, Baayen
- 1993
Citation Context: ...cur only once. Unfortunately, it is especially the rare units that contain highly specific information about the content of the text. Formula (8) has been generalized by various authors (for a summary see e.g. [CB93]). Zipf himself explained equation (8) by a "principle of least effort". A generalization of Zipf's law is given by f(r) = (A/(B + r))^(1/(γ−1)), (9) which is known as the Zipf-Mandelbrot law [Man53]....

8 | Shakespeare vs Fletcher: A stylometric analysis by radial basis functions
- Lowe, Matthews
- 1995
Citation Context: ...versatile procedures were applied to text categorization, e.g. inductive rule learning [MRG96], Bayesian probabilistic networks [TH93], multilayer perceptrons [NGL97], radial basis function networks [LM95], decision trees [ADW98], nearest neighbor classification [LH98], and support vector machines [Joa98a]. These models are universal approximators as they are able to approximate any functional relation...

8 | Did Shakespeare write a newly discovered poem?
- Thisted, Efron
- 1987
Citation Context: ...An interesting feature is the comparison of the number of words which occur exactly j times in the training data and the number of words which occur exactly j times in a new text, for j = 0, 1, ... [Thi87]. Thisted and Efron estimated the size of Shakespeare's vocabulary by asking "How many new words would Shakespeare use if he were to write another play?" Many studies found distinct differences in voc...

4 | Are the Thisted-Efron authorship tests valid
- Valenza
- 1991
Citation Context: ...in the corpus and a sample text and propose various significance tests under the assumption that word selection of an author is a Poisson process. As for the other tests, the results are mixed. Valenza [Val91] applied these tests to the works of Shakespeare and Marlowe and found good consistency for the Shakespeare plays but poor consistency between Shakespeare poems and plays or Marlowe's plays. 3.2 Sem...

2 | Ein Modell der Häufigkeitsstruktur des Vokabulars
- Orlov
- 1983
Citation Context: ...to remove the dependency of vocabulary size from the text length N, alternate features have been proposed. These range from the simple type-token ratio to more complex measures such as Orlov's Zipf size [Orl83]. An interesting feature is the comparison of the number of words which occur exactly j times in the training data and the number of words which occur exactly j times in a new text, for j = 0, 1, ...

1 | A bridge between statistics and literature: The graphs of Oscar Wilde's literary genres
- Binongo, Smith
- 1999
Citation Context: ...ponent analysis on the data. This technique has been successfully applied to the classical 'Federalist Papers' problem and promises large gains if more computing power is available. Binongo and Smith [BS99] used the frequency of occurrence of 25 prepositions to distinguish between Oscar Wilde's plays and essays. Instead of using word counts directly one can employ features derived from words. An example...

1 | Literature and statistics
- Gani
- 1985
Citation Context: ...reported a wider variation of sentence lengths than word lengths. There are differences in sentence length distributions for the same author, not only depending on time but also on the genre of text [Gan85]. They parallel differences in word length distributions in the prose and verse of the same author. Other features are counts of words beginning with a vowel or counts of words with specific lengths [...

1 | Cumulative sum control charts
- Goel
- 1982
Citation Context: ...ibution. A number of tests have been developed checking different distributional features. Starting with a characteristic x_i of the i-th sentence in an author's text, e.g. its length, the cusum test [Goe82] detects significant deviations from the mean value x̄. It establishes bounds for the test quantity Σ_i (x_i − x̄) which are valid under specific assumptions, e.g. the independence of the terms (x_i...

1 | Observations on the possible effects of mental age upon the frequency-distribution of words from the viewpoint of dynamic philology
- Zipf
- 1937
Citation Context: ...f observed that the number of words α(f) which occur exactly f times is given by α(f) = f^(−γ), where γ ≈ 2 [Zip32a]. He conjectured that the parameter γ depends on the age and intelligence of an author [Zip37]. Sichel [Sic75] was able to successfully fit a family of compound Poisson distributions to the word frequencies of a number of authors and works in different languages. In order to remove the depende...