Results 1 - 10 of 63
Term-weighting approaches in automatic text retrieval
- INFORMATION PROCESSING AND MANAGEMENT, 1988
Cited by 1216 (9 self)
The experimental evidence accumulated over the past 20 years indicates that text indexing systems based on the assignment of appropriately weighted single terms produce retrieval results that are superior to those obtainable with other more elaborate text representations. These results depend crucially on the choice of effective term-weighting systems. This article summarizes the insights gained in automatic term weighting, and provides baseline single-term-indexing models with which other more elaborate content analysis procedures can be compared.
Modern information retrieval: a brief overview
- BULLETIN OF THE IEEE COMPUTER SOCIETY TECHNICAL COMMITTEE ON DATA ENGINEERING, 2001
Cited by 101 (0 self)
For thousands of years people have realized the importance of archiving and finding information. With the advent of computers, it became possible to store large amounts of information, and finding useful information in such collections became a necessity. The field of Information Retrieval (IR) was born in the 1950s out of this necessity. Over the last forty years, the field has matured considerably. Several IR systems are used on an everyday basis by a wide variety of users. This article is a brief overview of the key advances in the field of Information Retrieval, and a description of the current state of the art in the field.
SCAM: A Copy Detection Mechanism for Digital Documents
- In Proceedings of the Second Annual Conference on the Theory and Practice of Digital Libraries, 1995
Cited by 91 (9 self)
Copy detection in Digital Libraries may provide the necessary guarantees for publishers and newsfeed services to offer valuable on-line data. We consider the case for a registration server that maintains registered documents against which new documents can be checked for overlap. In this paper we present a new scheme for detecting copies based on comparing the word frequency occurrences of the new document against those of registered documents. We also report on an experimental comparison between our proposed scheme and COPS [6], a detection scheme based on sentence overlap. The tests involve over a million comparisons of netnews articles and show that in general the new scheme performs better in detecting documents that have partial overlap. Keywords: Copy Detection, Plagiarism, Registration Server, Databases.
A Web Browser for Small Terminals
- In Proc. UIST, 1999
Cited by 52 (7 self)
We describe WEST, a WEb browser for Small Terminals, that aims to solve some of the problems associated with accessing web pages on hand-held devices. Through a novel combination of text reduction and focus+context visualization, users can access web pages from a very limited display environment, since the system will provide an overview of the contents of a web page even when it is too large to be displayed in its entirety. To make maximum use of the limited resources available on a typical hand-held terminal, much of the most demanding work is done by a proxy server, allowing the terminal to concentrate on the task of providing responsive user interaction. The system makes use of some interaction concepts reminiscent of those defined in the Wireless Application Protocol (WAP), making it possible to utilize the techniques described here for WAP-compliant devices and services that may become available in the near future. Keywords: Hand-held devices, web browser, proxy systems, focus+context visualization, text reduction, flip zooming, WAP (wireless application protocol)
Stylistic Experiments For Information Retrieval
2000
Cited by 47 (8 self)
Information retrieval systems are built to handle texts as topical items: texts are tabulated by occurrence frequencies of content words in them, under the assumption that text topic is reasonably well modeled by content word occurrence. But texts have several interesting characteristics beyond topic. The experiments described in this text investigate stylistic variation. Roughly put, style is the difference between two ways of saying the same thing -- and systematic stylistic variation can be used to characterize the genre of documents. These experiments investigate if stylistic information is distinguishable using simple language engineering methods, and if in that case this type of information can be used to improve information retrieval systems.
Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information
2003
Cited by 31 (1 self)
This paper explains a keyword extraction algorithm based solely on a single document. First, frequent terms are extracted. Co-occurrences of a term and frequent terms are counted. If a term appears frequently with a particular subset of terms, the term is likely to have important meaning. The degree of bias of the co-occurrence distribution is measured by the χ²-measure. We show that our keyword extraction performs well without the need for a corpus. In this paper, a term is defined as a word or a word sequence. We do not intend to limit the meaning in a terminological sense. A word sequence is written as a phrase
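The steps named in this abstract (pick frequent terms, count co-occurrences, score the bias of the co-occurrence distribution) can be illustrated with a minimal Python sketch. This is not the paper's exact formulation: it assumes co-occurrence means "appears in the same sentence", estimates the expected count from sentence frequencies, and uses the hypothetical names `chi2_keywords` and `num_frequent`.

```python
from collections import Counter


def chi2_keywords(sentences, num_frequent=2):
    """Score each term by how biased its co-occurrence with frequent terms is.

    sentences: list of token lists (one list per sentence).
    Assumption: co-occurrence = both terms appear in the same sentence.
    """
    n_sent = len(sentences)
    sent_sets = [set(s) for s in sentences]

    # Sentence frequency: in how many sentences does each term appear?
    sf = Counter()
    for s in sentences:
        sf.update(list(dict.fromkeys(s)))  # de-duplicate, keep order

    frequent = [t for t, _ in sf.most_common(num_frequent)]

    scores = {}
    for w, n_w in sf.items():
        chi2 = 0.0
        for g in frequent:
            if g == w:
                continue  # a term trivially co-occurs with itself
            observed = sum(1 for s in sent_sets if w in s and g in s)
            # Expected count if w were independent of g (assumed estimate).
            expected = n_w * sf[g] / n_sent
            chi2 += (observed - expected) ** 2 / expected
        scores[w] = chi2
    return scores


sentences = [
    ["a", "x"], ["a", "x"], ["a", "c"],
    ["b", "c"], ["b"], ["a", "b"],
]
scores = chi2_keywords(sentences, num_frequent=2)
```

In this toy input, `x` appears only alongside `a` and never alongside `b`, so its co-occurrence distribution is more biased than that of `c`, which is split evenly; `x` therefore receives the higher score.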
Highlights: Language- and domain-independent automatic indexing terms for abstracting
- Journal of the American Society for Information Science, 1995
Cited by 27 (0 self)
A method of drawing index terms from text is presented. The approach uses no stop list, stemmer, or other language- and domain-specific component, allowing operation in any language or domain with only trivial modification. The method uses n-gram counts, achieving a function similar to, but more general than, a stemmer. The generated index terms, which the author calls “highlights,” are suitable for identifying the topic for perusal and selection. An extension is also described and demonstrated which selects index terms to represent a subset of documents, distinguishing them from the corpus. Some experimental results are presented, showing operation in English, Spanish, German, Georgian, Russian, and Japanese.
Methods of Automatic Term Recognition - A Review
1996
Cited by 24 (1 self)
Following the growing interest in "corpus-based" approaches to computational linguistics, a number of studies have recently appeared on the topic of automatic term recognition or extraction. Because a successful term recognition method has to be based on proper insights into the nature of terms, studies of automatic term recognition not only contribute to the applications of computational linguistics but also to the theoretical foundation of terminology. Many studies on automatic term recognition treat interesting aspects of terms, but most of them are not well founded and described. This paper tries to give an overview of the principles and methods of automatic term recognition. For that purpose, two major trends are examined, i.e. studies in automatic recognition of significant elements for indexing mainly carried out in information retrieval circles, and current research in automatic term recognition in the field of computational linguistics. Keywords Automatic term recognition, au...
Searching and browsing collections of structural information
- In IEEE Advances in Digital Libraries (ADL’2000), 1997
Cited by 19 (0 self)
This paper proposes a new approach to querying collections of structured textual information such as SGML/XML documents. Knowledge about the structure of documents is an additional resource that should be exploited during retrieval since the semantics of the different textual objects can be used to specify an information need much more precisely. However, the traditional probabilistic retrieval model lacks the ability to handle structural information. We define a new retrieval function based on the probabilistic model which overcomes this drawback. The presented query language allows the assignment of structural roles to individual terms. The efficient evaluation of queries in this framework requires appropriate index structures. We design text and structure indexes and show how their information is combined during evaluation. The implementation supports additional functionalities such as a table of contents for browsing. First evaluation results show the feasibility of the approach on collections of unstructured documents.
Text Categorization Based on Weighted Inverse Document Frequency
- Special Interest Groups and Information Process Society of Japan (SIG-IPSJ), 1994
Cited by 16 (0 self)
This paper proposes a new term weighting method called weighted inverse document frequency (WIDF). As its name indicates, WIDF is an extension of IDF (inverse document frequency) to incorporate the term frequency over the collection of texts. The WIDF of a term in a text is given by dividing the frequency of the term in the text by the sum of the frequencies of the term over the collection of texts. WIDF is applied to the text categorization task and proved to be superior to the other methods. The maximum improvement in accuracy over IDF is 7.4%.
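The WIDF definition stated in this abstract (a term's frequency in one text divided by its total frequency over the collection) can be written out as a short Python sketch. The function name `widf` and the toy documents are illustrative assumptions; tokenization and any further normalization the paper may apply are not covered here.

```python
from collections import Counter


def widf(docs):
    """Return one {term: weight} dict per document.

    docs: list of token lists.
    weight(term, doc) = tf(term, doc) / sum of tf(term, d) over all docs d
    """
    per_doc = [Counter(doc) for doc in docs]

    # Total frequency of each term over the whole collection.
    totals = Counter()
    for counts in per_doc:
        totals.update(counts)

    return [
        {term: tf / totals[term] for term, tf in counts.items()}
        for counts in per_doc
    ]


docs = [
    ["apple", "apple", "pie"],
    ["apple", "tart"],
    ["pie", "pie", "pie"],
]
weights = widf(docs)
```

Under this definition the weights of a given term sum to 1 across the collection, so a term concentrated in a single text (such as `tart` above) gets the maximum weight of 1 in that text.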

