Results 1 - 10 of 25
Machine Learning in Automated Text Categorization
ACM Computing Surveys, 2002
Cited by 839 (13 self)
The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last ten years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting in the manual definition of a classifier by domain experts) are a very good effectiveness, considerable savings in terms of expert labor power, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely document representation, classifier construction, and classifier evaluation.
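The inductive process described in this abstract (a learner builds a classifier from preclassified documents) can be illustrated with a minimal sketch. The learner used here, a multinomial naive Bayes model with add-one smoothing, is only one of many possible choices and is an assumption for illustration, as are the function names `train` and `classify` and the toy training set.

```python
import math
from collections import Counter, defaultdict

def train(labelled_docs):
    """Learn per-category document and word counts from (text, category) pairs."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, category in labelled_docs:
        tokens = text.lower().split()
        doc_counts[category] += 1
        word_counts[category].update(tokens)
        vocab.update(tokens)
    return doc_counts, word_counts, vocab

def classify(text, model):
    """Return the category with the highest smoothed log-probability."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, -math.inf
    for category in doc_counts:
        # Log prior estimated from the category frequencies in the training set.
        score = math.log(doc_counts[category] / total_docs)
        total_words = sum(word_counts[category].values())
        for token in text.lower().split():
            if token not in vocab:
                continue  # words never seen in training carry no evidence
            count = word_counts[category][token]
            # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category

# Hypothetical preclassified documents (illustration only).
training_set = [
    ("stocks fell as markets closed", "finance"),
    ("the bank raised interest rates", "finance"),
    ("the team won the final match", "sport"),
    ("the striker scored a late goal", "sport"),
]
model = train(training_set)
print(classify("markets and interest rates", model))
```

Porting such a classifier to another domain only requires a new set of labelled documents, which is the portability advantage the abstract refers to.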
Automatic Web Page Categorization by Link and Context Analysis
1999
Cited by 46 (4 self)
Assistance in retrieving documents on the World Wide Web is provided either by search engines, through keyword-based queries, or by catalogues, which organize documents into hierarchical collections. Maintaining catalogues manually is becoming increasingly difficult, due to the sheer amount of material on the Web; it is thus becoming necessary to resort to techniques for the automatic classification of documents. Automatic classification is traditionally performed by extracting the information for representing a document ("indexing") from the document itself. The paper describes the novel technique of categorization by context, which instead extracts useful information for classifying a document from the context where a URL referring to it appears. We present the results of experimenting with Theseus, a classifier that exploits this technique.
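One simple form of "context where a URL appears" is the anchor text of the links that point to it. The sketch below collects that text per URL using Python's standard `html.parser`. It is not the Theseus implementation, which the abstract does not detail; the class name `AnchorContextCollector` and the sample page are assumptions for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser

class AnchorContextCollector(HTMLParser):
    """Collect, for each linked URL, the anchor text that refers to it."""

    def __init__(self):
        super().__init__()
        self.contexts = defaultdict(list)  # URL -> list of anchor texts
        self._current_href = None
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current_href = dict(attrs).get("href")
            self._buffer = []

    def handle_data(self, data):
        # Only text inside an <a href=...> element is recorded.
        if self._current_href is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            text = " ".join("".join(self._buffer).split())  # normalise whitespace
            if text:
                self.contexts[self._current_href].append(text)
            self._current_href = None

# Hypothetical referring page (illustration only).
page = (
    '<h2>Sports</h2><ul>'
    '<li><a href="http://example.org/a">Football league results</a></li>'
    '</ul>'
)
collector = AnchorContextCollector()
collector.feed(page)
print(dict(collector.contexts))
```

The collected strings could then be fed to any text classifier in place of, or alongside, the content of the target document. A fuller treatment would also use surrounding headings and list structure.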
The Application of Classical Information Retrieval Techniques to Spoken Documents
1995
Cited by 32 (1 self)
Table 4.1: Message classes in classification experiments of Rose et al.: Object Description, General Discussion, Map Reading, Photographic Interpretation, Cartoon Description.
Now, an estimate of $I(C_i; w_k)$ can be calculated by a four-way partition of the set of test messages, depending on (a) whether or not a message belongs to topic class $C_i$ and (b) whether or not it contains word $w_k$. If $N$ is the number of messages in the test collection, $R_i$ is the number belonging to topic class $C_i$, $n_k$ is the number of messages containing word $w_k$ and $r_{ik}$ is the number of messages in class $C_i$ containing word $w_k$, then, estimating the probabilities by frequency counts,
$$I(C_i; w_k) = \log \frac{r_{ik} / R_i}{n_k / N}.$$
This is actually identical to a form of retrospective term relevance weight, initially proposed in the IR literature by both Barkla [66] and Miller [67], and reviewed by Robertson and Sparck Jones in their classic paper on the subject [42]. Moreover, Rose proposed, but did no...
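The frequency-count estimate above can be checked numerically with a short sketch. It assumes the formula reads as the log of the ratio between the in-class word rate r_ik / R_i and the overall word rate n_k / N. The function name and the example counts are hypothetical.

```python
import math

def term_class_association(N, R_i, n_k, r_ik):
    """Estimate I(C_i; w_k) = log((r_ik / R_i) / (n_k / N)) from counts.

    N    : messages in the collection
    R_i  : messages belonging to class C_i
    n_k  : messages containing word w_k
    r_ik : messages in class C_i that contain word w_k
    """
    if min(N, R_i, n_k, r_ik) <= 0:
        raise ValueError("all counts must be positive for the log to be defined")
    return math.log((r_ik / R_i) / (n_k / N))

# Hypothetical counts: the word occurs four times as often inside the class
# (20 of 100 messages) as in the whole collection (50 of 1000 messages).
score = term_class_association(N=1000, R_i=100, n_k=50, r_ik=20)

# When the word is equally frequent inside and outside the class, the ratio
# is 1 and the estimate is approximately zero.
neutral = term_class_association(N=1000, R_i=100, n_k=50, r_ik=5)
print(score, neutral)
```

Positive values indicate words that occur more often in the class than in the collection as a whole; negative values indicate the opposite.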
A tutorial on automated text categorisation
Proceedings of ASAI-99, 1st Argentinian Symposium on Artificial Intelligence, pages 7-35, Buenos Aires, AR, 1999
Cited by 27 (0 self)
The automated categorisation (or classification) of texts into topical categories has a long history, dating back at least to 1960. Until the late ’80s, the dominant approach to the problem involved knowledge-engineering automatic categorisers, i.e. manually building a set of rules encoding expert knowledge on how to classify documents. In the ’90s, with the booming production and availability of on-line documents, automated text categorisation has witnessed an increased and renewed interest. A newer paradigm based on machine learning has superseded the previous approach. Within this paradigm, a general inductive process automatically builds a classifier by “learning”, from a set of previously classified documents, the characteristics of one or more categories; the advantages are very good effectiveness, considerable savings in terms of expert manpower, and domain independence. In this tutorial we look at the main approaches that have been taken towards automatic text categorisation within the general machine learning paradigm. Issues of document indexing, classifier construction, and classifier evaluation will be touched upon.
Proofs in Context
In Principles of Knowledge Representation and Reasoning, 1994
Cited by 26 (4 self)
Assistance in retrieving documents on the World Wide Web is provided either by search engines, through keyword-based queries, or by catalogues, which organise documents into hierarchical collections. Maintaining catalogues manually is becoming increasingly difficult due to the sheer amount of material, and therefore it will be necessary to resort to techniques for automatic classification of documents. Classification is traditionally performed by extracting information for indexing a document from the document itself. The paper describes the technique of categorisation by context, which exploits the context perceivable from the structure of HTML documents to extract useful information for classifying the documents they refer to. We present the results of experiments with a preliminary implementation of the technique.
Parallel Text Search Methods
Communications of the ACM, 1988
Cited by 24 (1 self)
A comparison of recently proposed parallel text search methods to alternative available search strategies that use serial processing machines suggests parallel methods do not provide large-scale gains in either retrieval effectiveness or efficiency.
Searchers’ selection of search keys: II. Controlled vocabulary or free-text searching
Journal of the American Society for Information Science, 1991
Cited by 18 (1 self)
Searching with descriptors from controlled vocabularies complements free-text searching with textwords. The case study method provided data about the manner in which the two types of search keys interact through: (1) observation of 47 professional searchers performing their job-related searches; and (2) analysis of verbal and search protocols, denoting reasons for the selection of each search key and for each search modification. Results show that searchers used thesauri and indexing when it was of satisfactory quality and available to them, and that these and other database-related reasons were the most influential in search-key selection. Further, having to perform a multidatabase search induced the use of textwords without consulting a thesaurus. There is a need for high quality thesauri which are easily available and for mechanisms, such as switching languages, to aid in multidatabase searches.
Machine Learning in Automated Text Categorisation
1999
Cited by 16 (1 self)
The automated categorisation (or classification) of texts into topical categories has a long history, dating back at least to the early ’60s. Until the late ’80s, the most effective approach to the problem seemed to be that of manually building automatic classifiers by means of knowledge-engineering techniques, i.e. manually defining a set of rules encoding expert knowledge on how to classify documents under a given set of categories. In the ’90s, with the booming production and availability of on-line documents, automated text categorisation has witnessed an increased and renewed interest, prompted by which the machine learning approach to automatic classifier construction has emerged and definitely superseded the knowledge-engineering approach. Within the machine learning paradigm, a general inductive process (called the learner) automatically builds a classifier (also called the rule, or the hypothesis) by “learning”, from a set of previously classified documents, the characteristics of one or more categories. The advantages of this approach are very good effectiveness, considerable savings in terms of expert manpower, and domain independence. In this survey we look at the main approaches that have been taken towards automatic text categorisation within the general machine learning paradigm. Issues pertaining to document indexing, classifier construction, and classifier evaluation will be discussed in detail. A final section will be devoted to the techniques that have specifically been devised for an emerging application such as the automatic classification of Web pages into “Yahoo!-like” hierarchically structured sets of categories.
Effective Reformulation of Boolean Queries with Concept Lattices
In Proceedings of the 3rd International Conference on Flexible Query-Answering Systems, 1998
Cited by 12 (3 self)
In this paper we describe an approach, implemented in a system named REFINER, to combining Boolean information retrieval and content-based navigation with concept lattices. When REFINER is presented with a Boolean query, it builds and displays a portion of the concept lattice associated with the documents being searched, centered around the user query. The cluster network displayed by REFINER shows the result of the query along with a set of minimal query refinements/enlargements. REFINER has two main advantages. The first is that it can be used to improve the effectiveness of Boolean retrieval, because it allows content-driven query reformulation with a controlled amount of output. The second is that it has potential for information exploration, because the displayed network is navigable. We compared information retrieval using REFINER with conventional Boolean retrieval. The results of an experiment conducted on a medium-sized bibliographic database showed that the performance of REFINER was better than unrefined Boolean retrieval.
Evaluation of Learning Schemes Used in Information Retrieval
1996
Cited by 9 (1 self)
Searching within the context of information retrieval may be viewed as a communication process between the users and the indexers (or the authors). It is known that in expressing the same concept or idea, different people tend to use different words or phrases, and also that the meaning of words attached to document surrogates tends to change over time. To overcome these phenomena, various learning schemes have been designed so as to automatically infer knowledge about document content from the relevance assessments of past queries. Thus, in contrast to most retrieval models that represent the semantic content of documents as static entities, these adaptive search models might change the descriptions of documents through an inductive learning scheme. The evaluation of such dynamic document space strategies may be based on retrospective tests within which the same set of queries is applied to train and test the system. Based on cross-validation principles, this paper suggests a more "ho...

