Learning to Classify Documents According to Genre
- In IJCAI-03 Workshop on Computational Approaches to Style Analysis and Synthesis, 2003
Cited by 56 (0 self)
Genre or style analysis can be used to improve results achieved using standard IR techniques. A genre class is a group of documents that are written in a similar style. Genre classification can identify documents that are written in a style most likely to satisfy a user's information need.
Automatic text categorization in terms of genre and author
- Computational Linguistics, 2001
Cited by 53 (5 self)
The two main factors that characterize a text are its content and its style, and both can be used as a means of categorization. In this paper we present an approach to text categorization in terms of genre and author for Modern Greek. In contrast to previous stylometric approaches, we attempt to take full advantage of existing natural language processing (NLP) tools. To this end, we propose a set of style markers including analysis-level measures that represent the way in which the input text has been analyzed and capture useful stylistic information without additional cost. We present a set of small-scale but reasonable experiments in text genre detection, author identification, and author verification tasks and show that the proposed method performs better than the most popular distributional lexical measures, i.e., functions of vocabulary richness and frequencies of occurrence of the most frequent words. All the presented experiments are based on unrestricted text downloaded from the World Wide Web without any manual text preprocessing or text sampling. Various performance issues regarding the training set size and the significance of the proposed style markers are discussed. Our system can be used in any application that requires fast and easily adaptable text categorization in terms of stylistically homogeneous categories. Moreover, the procedure of defining analysis-level markers can be followed in order to extract useful stylistic information using existing text processing tools.
Integrating Automatic Genre Analysis into Digital Libraries
- In First ACM-IEEE Joint Conf on Digital Libraries, 2001
Cited by 39 (2 self)
With the number and types of documents in digital library systems increasing, tools for automatically organizing and presenting the content have to be found. While many approaches focus on topic-based organization and structuring, hardly any system incorporates automatic structural analysis and representation. Yet, genre information (unconsciously) forms one of the most distinguishing features in conventional libraries and in information searches. In this paper we present an approach to automatically analyze the structure of documents and to integrate this information into an automatically created content-based organization. In the resulting visualization, documents on similar topics, yet representing different genres, are depicted as books in differing colors. This representation supports users intuitively in locating relevant information presented in a relevant form.
Computer-based Authorship Attribution without Lexical Measures
- Computers and the Humanities, 2001
Cited by 34 (2 self)
The most important approaches to computer-assisted authorship attribution are exclusively based on lexical measures that either represent the vocabulary richness of the author or simply comprise frequencies of occurrence of common words. In this paper we present a fully-automated approach to the identification of the authorship of unrestricted text that excludes any lexical measure. Instead we adapt a set of style markers to the analysis of the text performed by an already existing natural language processing tool using three stylometric levels, i.e., token-level, phrase-level, and analysis-level measures. The latter represent the way in which the text has been analyzed. The presented experiments on a Modern Greek newspaper corpus show that the proposed set of style markers is able to distinguish reliably the authors of a randomly-chosen group and performs better than a lexically-based approach. However, the combination of these two approaches provides the most accurate solution (i.e., 87% accuracy). Moreover, we describe experiments on various sizes of the training data as well as tests dealing with the significance of the proposed set of style markers.
Style Mining of Electronic Messages for Multiple Authorship Discrimination: First Results
- 2003
Cited by 33 (5 self)
This paper considers the use of computational stylistics for performing authorship attribution of electronic messages, addressing categorization problems with as many as 20 different classes (authors). Effective stylistic characterization of text is potentially useful for a variety of tasks, as language style contains cues regarding the authorship, purpose, and mood of the text, all of which would be useful adjuncts to information retrieval or knowledge-management tasks. We focus here on the problem of determining the author of an anonymous message, based only on the message text. Several multiclass variants of the Winnow algorithm were applied to a vector representation of the message texts to learn models for discriminating different authors. We present results comparing the classification accuracy of the different approaches. The results show that stylistic models can be learned to determine an author's identity with promising accuracy. This work thus forms a baseline for future research in author attribution of electronic messages.
Genre Classification and Domain Transfer for Information Filtering
- 2002
Cited by 26 (3 self)
The World Wide Web is a vast repository of information, but the sheer volume makes it difficult to identify useful documents. We identify document genre as an important factor in retrieving useful documents and focus on the novel document genre dimension of subjectivity.
Web-Specific Genre Visualization
- In Proceedings of the Webnet World Conference on the WWW and Internet, 1998
Cited by 10 (2 self)
User interfaces to WWW search engines typically present results as ranked lists of documents. Such lists give users little help in understanding document variation: we propose a richer representation of retrieval results in the search interface. Fundamental to us is the notion of document grouping. We use both stylistic genre-based document categorization and statistical content-based clustering, and organize documents along these criteria in a highly interactive visualization front-end to WWW search engines, enabling quick overview and incremental query refinement.
Introduction
The vast majority of user interfaces to WWW search engines are still based on an exceedingly simple interaction model where a linear list of hits, i.e. document items, is sorted after so-called "relevance" with inner workings and metrics hidden and all but incomprehensible to most users: "This is appealing in its simplicity, but users are often frustrated as they do not know what the results mean, nor can th...
Author Identification on the Large Scale
- In Proc. of the Meeting of the Classification Society of North America, 2005
Cited by 9 (0 self)
this paper is on techniques for identifying authors in large collections of textual artifacts (e-mails, communiques, transcribed speech, etc.). Our approach focuses on very high-dimensional, topic-free document representations and particular attribution problems, such as: (1) Which one of these K authors wrote this particular document? (2) Did any of these K authors write this particular document? Scientific investigation into measuring style and authorship of texts goes back to the late nineteenth century, with the pioneering studies of Mendenhall [36] and Mascol [34, 35] on distributions of sentence and word lengths in works of literature and the gospels of the New Testament. The underlying notion was that works by different authors are strongly distinguished by quantifiable features of the text. By the mid-twentieth century, this line of research had grown into what became known as "stylometrics", and a variety of textual statistics had been proposed to quantify textual style. The style of early work was characterized by a search for invariant properties of textual statistics, such as Zipf's distribution and Yule's K statistic
Lexical Predictors Of Personality Type
- In Proceedings of the Joint Annual Meeting of the Interface and the Classification Society of North America, 2005
Cited by 7 (0 self)
We are currently pursuing methods for "author profiling" in which various aspects of the author's identity might be identified from a text, without necessarily having a corpus of documents from the same individual. A key component of such an identity profile is personality; this paper addresses distinguishing high from low neuroticism and extraversion in authors of informal text. We consider four different sets of lexical features for this task: a standard function word list, conjunctive phrases, modality indicators, and appraisal adjectives and modifiers. SMO, a support vector machine learner, was used to learn linear separators for the high and low classes in each of the two tasks.
Finding buying guides with a web carnivore
- In Proceedings of the 1st Latin American Web Congress (LA-WEB
, 2003

