Results 1 - 10 of 38
Automatic Word Sense Discrimination
- Journal of Computational Linguistics, 1998
"... This paper presents context-group discrimination, a disambiguation algorithm based on clustering. Senses are interpreted as groups (or clusters) of similar contexts of the ambiguous word. Words, contexts, and senses are represented in Word Space, a high-dimensional, real-valued space in which closen ..."
Cited by 536 (1 self)
This paper presents context-group discrimination, a disambiguation algorithm based on clustering. Senses are interpreted as groups (or clusters) of similar contexts of the ambiguous word. Words, contexts, and senses are represented in Word Space, a high-dimensional, real-valued space in which closeness corresponds to semantic similarity. Similarity in Word Space is based on second-order co-occurrence: two tokens (or contexts) of the ambiguous word are assigned to the same sense cluster if the words they co-occur with in turn occur with similar words in a training corpus. The algorithm is automatic and unsupervised in both training and application: senses are induced from a corpus without labeled training instances or other external knowledge sources. The paper demonstrates good performance of context-group discrimination for a sample of natural and artificial ambiguous words.
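For concreteness, here is a minimal Python sketch of the second-order co-occurrence scheme this abstract describes. The toy corpus, window sizes, and the choice of two clusters are illustrative assumptions rather than the paper's actual settings, and the clustering is a bare-bones k-means rather than the paper's procedure.

```python
# A minimal sketch of context-group discrimination: word vectors come
# from first-order co-occurrence counts, each occurrence of the ambiguous
# word gets a second-order context vector (the centroid of its neighbors'
# word vectors), and the context vectors are clustered into induced senses.
import numpy as np

corpus = [
    "the bank raised interest rates on deposits",
    "loans from the bank carry high interest",
    "the river bank was muddy after the flood",
    "we walked along the bank of the river",
]
docs = [s.split() for s in corpus]
vocab = sorted({w for d in docs for w in d})
idx = {w: i for i, w in enumerate(vocab)}

# First-order co-occurrence: count neighbors within a +/-2 word window.
counts = np.zeros((len(vocab), len(vocab)))
for d in docs:
    for i, w in enumerate(d):
        for j in range(max(0, i - 2), min(len(d), i + 3)):
            if j != i:
                counts[idx[w], idx[d[j]]] += 1

def context_vector(doc, pos, window=3):
    """Second-order representation: centroid of neighbors' word vectors."""
    neighbors = [w for k, w in enumerate(doc) if k != pos and abs(k - pos) <= window]
    return np.mean([counts[idx[w]] for w in neighbors], axis=0)

target = "bank"
X = np.array([context_vector(d, d.index(target)) for d in docs])

# Tiny k-means with k=2: each resulting cluster is one induced "sense".
rng = np.random.default_rng(0)
centers = X[rng.choice(len(X), 2, replace=False)]
for _ in range(10):
    labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
    centers = np.array([X[labels == k].mean(axis=0) if (labels == k).any()
                        else centers[k] for k in (0, 1)])
print(labels)  # cluster id per occurrence of "bank"
```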
Reexamining the Cluster Hypothesis: Scatter/Gather on Retrieval Results
- 1996
"... We present Scatter/Gather, a cluster-based document browsing method, as an alternative to ranked titles for the organization and viewing of retrieval results. We systematically evaluate Scatter/Gather in this context and find significant improvements over similarity search ranking alone. This resul ..."
Cited by 480 (5 self)
We present Scatter/Gather, a cluster-based document browsing method, as an alternative to ranked titles for the organization and viewing of retrieval results. We systematically evaluate Scatter/Gather in this context and find significant improvements over similarity search ranking alone. This result provides evidence validating the cluster hypothesis, which states that relevant documents tend to be more similar to each other than to non-relevant documents. We describe a system employing Scatter/Gather and demonstrate that users are able to use this system close to its full potential.

1 Introduction
An important service offered by an information access system is the organization of retrieval results. Conventional systems rank results based on an automatic assessment of relevance to the query [20]. Alternatives include graphical displays of interdocument similarity (e.g., [1, 22, 7]), relationship to fixed attributes (e.g., [21, 14]), and query term distribution patterns (e.g., [12]). ...
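A rough sketch of the Scatter/Gather idea follows, under the assumption that clustering is a plain k-means over normalized term vectors and that cluster summaries are just frequent terms; the paper's actual clustering and summarization algorithms differ.

```python
# Cluster the top-ranked documents of a query and summarize each cluster
# by its most frequent terms, so a user can "gather" promising clusters
# instead of scanning a ranked list. Documents and k are invented.
from collections import Counter
import numpy as np

results = [
    "jaguar speed engine road test",
    "jaguar habitat rainforest prey",
    "engine tuning road car speed",
    "rainforest cats prey hunting",
]
vocab = sorted({w for d in results for w in d.split()})
X = np.array([[d.split().count(w) for w in vocab] for d in results], float)
X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit vectors for cosine

k, rng = 2, np.random.default_rng(1)
centers = X[rng.choice(len(X), k, replace=False)]
for _ in range(10):  # a few k-means iterations (the "scatter" step)
    labels = np.argmax(X @ centers.T, axis=1)  # assign by cosine similarity
    centers = np.array([X[labels == c].mean(axis=0) if (labels == c).any()
                        else centers[c] for c in range(k)])

for c in range(k):  # summarize clusters by top terms (the "gather" cue)
    terms = Counter()
    for d, lab in zip(results, labels):
        if lab == c:
            terms.update(d.split())
    print(c, [t for t, _ in terms.most_common(3)])
```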
A practical part-of-speech tagger
- In Proceedings of the Third Conference on Applied Natural Language Processing, 1992
"... We present an implementation of a part-of-speech tagger based on a hidden Markov model. The methodology enables robust and accurate tagging with few resource requirements. Only a lexicon and some unlabeled training text are required. Accuracy exceeds 96%. We describe implementation strategies and op ..."
Cited by 409 (5 self)
We present an implementation of a part-of-speech tagger based on a hidden Markov model. The methodology enables robust and accurate tagging with few resource requirements. Only a lexicon and some unlabeled training text are required. Accuracy exceeds 96%. We describe implementation strategies and optimizations which result in high-speed operation. Three applications for tagging are described: phrase recognition; word sense disambiguation; and grammatical function assignment.
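The decoding half of such a tagger is straightforward to sketch. The transition and emission tables below are invented for illustration; the paper estimates its parameters from a lexicon and unlabeled text, and that training step is omitted here.

```python
# A minimal Viterbi decoder for an HMM part-of-speech tagger: find the
# tag sequence maximizing the product of transition and emission
# probabilities. Unknown words get a small floor probability.
import math

tags = ["DET", "NOUN", "VERB"]
trans = {  # P(tag_i | tag_{i-1}), with "<s>" as the start state
    "<s>": {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1},
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
emit = {  # P(word | tag)
    "DET": {"the": 0.9, "a": 0.1},
    "NOUN": {"dog": 0.5, "walk": 0.2, "park": 0.3},
    "VERB": {"walk": 0.7, "dog": 0.1, "barks": 0.2},
}

def viterbi(words):
    # best[t] = (log-prob of best path ending in tag t, that path)
    best = {t: (math.log(trans["<s>"][t]) + math.log(emit[t].get(words[0], 1e-9)), [t])
            for t in tags}
    for w in words[1:]:
        best = {t: max(
            (p + math.log(trans[prev][t]) + math.log(emit[t].get(w, 1e-9)),
             path + [t])
            for prev, (p, path) in best.items())
            for t in tags}
    return max(best.values())[1]

print(viterbi("the dog walk".split()))  # -> ['DET', 'NOUN', 'VERB']
```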
TileBars: Visualization of Term Distribution Information in Full Text Information Access
- 1995
"... The field of information retrieval has traditionally focused on textbases consisting of titles and abstracts. As a consequence, many underlying assumptions must be altered for retrieval from full-length text collections. This paper argues for making use of text structure when retrieving from full te ..."
Cited by 341 (10 self)
The field of information retrieval has traditionally focused on textbases consisting of titles and abstracts. As a consequence, many underlying assumptions must be altered for retrieval from full-length text collections. This paper argues for making use of text structure when retrieving from full text documents, and presents a visualization paradigm, called TileBars, that demonstrates the usefulness of explicit term distribution information in Boolean-type queries. TileBars simultaneously and compactly indicate relative document length, query term frequency, and query term distribution. The patterns in a column of TileBars can be quickly scanned and deciphered, aiding users in making judgments about the potential relevance of the retrieved documents.

Keywords: Information retrieval, Full-length text, Visualization.
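The core representation is easy to sketch: split a document into fixed-size tiles, count query terms per tile, and shade each cell by frequency. The sample document, tile size, and ASCII shading below are stand-ins; TileBars itself segments text into subtopic units and renders graphical bars.

```python
# A minimal text-mode TileBar: one row per query term, one cell per tile,
# darker cells meaning more occurrences of the term in that tile.
doc = ("law enforcement agencies use dna testing . dna evidence from the "
       "crime scene is compared with samples . courts debate privacy and "
       "the legal status of genetic databases . privacy advocates object").split()
query = ["dna", "privacy"]
tile_size = 8
tiles = [doc[i:i + tile_size] for i in range(0, len(doc), tile_size)]

shades = " .:#"  # 0 hits = blank, more hits = darker
for term in query:
    cells = [min(t.count(term), len(shades) - 1) for t in tiles]
    bar = "".join(shades[c] for c in cells)
    print(f"{term:>8} |{bar}|")
```

Reading the two rows together shows at a glance whether the terms overlap in the same part of the document or occur in disjoint passages.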
Silk from a Sow's Ear: Extracting Usable Structures from the Web
- 1996
"... In its current implementation, the World-Wide Web lacks much of the explicit structure and strong typing found in many closed hypertext systems. While this property has directly fueled the explosive acceptance of the Web, it further complicates the already difficult problem of identifying usable str ..."
Cited by 270 (9 self)
In its current implementation, the World-Wide Web lacks much of the explicit structure and strong typing found in many closed hypertext systems. While this property has directly fueled the explosive acceptance of the Web, it further complicates the already difficult problem of identifying usable structures and aggregates in large hypertext collections. These reduced structures, or localities, form the basis for simplifying visualizations of and navigation through complex hypertext systems. Much of the previous research into identifying aggregates utilizes graph-theoretic algorithms based upon structural topology, i.e., the linkages between items. Other research has focused on content analysis to form document collections. This paper presents our exploration into techniques that harness both the topology and textual similarity between items, as well as integrate new analyses based upon actual usage of Xerox's WWW space. Linear equations and spreading activation models are employed to arrange Web pages based upon functional categories, node types, and relevancy.

Keywords: Information Visualization, World Wide Web, Hypertext.
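A minimal sketch of spreading activation over a page graph follows, assuming the edge weights already blend topology, text similarity, and usage as the abstract suggests; the graph, weights, and decay constant are invented.

```python
# Spreading activation: energy starts at seed pages and flows along
# weighted edges until it stabilizes; highly activated pages form the
# "locality" around the seeds.
import numpy as np

pages = ["home", "products", "research", "paper1", "paper2"]
# W[i, j]: strength of the edge from page i to page j (assumed to blend
# link structure, textual similarity, and access frequency).
W = np.array([
    [0.0, 0.8, 0.6, 0.0, 0.0],
    [0.2, 0.0, 0.1, 0.0, 0.0],
    [0.3, 0.0, 0.0, 0.9, 0.7],
    [0.0, 0.0, 0.4, 0.0, 0.5],
    [0.0, 0.0, 0.3, 0.6, 0.0],
])
W = W / W.sum(axis=1, keepdims=True)  # normalize outgoing weight per page

activation = np.array([0.0, 0.0, 1.0, 0.0, 0.0])  # seed at "research"
decay = 0.8
for _ in range(20):  # iterate until activation stabilizes
    activation = decay * (activation @ W) + (1 - decay) * activation

for p, a in sorted(zip(pages, activation), key=lambda x: -x[1]):
    print(f"{p:>9} {a:.3f}")
```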
A comparison of classifiers and document representations for the routing problem
- Annual ACM Conference on Research and Development in Information Retrieval (ACM SIGIR), 1995
"... In this paper, we compare learning techniques based on statistical classification to traditional methods of relevance feedback for the document routing problem. We consider three classification techniques which have decision rules that are derived via explicit error minimization: linear discriminant ..."
Cited by 196 (2 self)
In this paper, we compare learning techniques based on statistical classification to traditional methods of relevance feedback for the document routing problem. We consider three classification techniques which have decision rules that are derived via explicit error minimization: linear discriminant analysis, logistic regression, and neural networks. We demonstrate that the classifiers perform 10-15% better than relevance feedback via Rocchio expansion for the TREC-2 and TREC-3 routing tasks.
Error minimization is difficult in high-dimensional feature spaces because the convergence process is slow and the models are prone to overfitting. We use two different strategies, latent semantic indexing and optimal term selection, to reduce the number of features. Our results indicate that features based on latent semantic indexing are more effective for techniques such as linear discriminant analysis and logistic regression, which have no way to protect against overfitting. Neural networks perform equally well with either set of features and can take advantage of the additional information available when both feature sets are used as input.
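Putting the two paragraphs together, a toy version of the LSI-plus-classifier pipeline might look like this; the corpus, the reduced dimensionality k=2, and the plain gradient-descent training are illustrative assumptions, not the paper's setup.

```python
# Reduce a term-document matrix with an SVD (latent semantic indexing),
# then train a logistic-regression classifier on the reduced features
# via explicit error minimization (plain gradient descent).
import numpy as np

docs = [
    "oil prices rise opec output",       # relevant to an "oil" profile
    "crude oil exports and opec",        # relevant
    "election results and parliament",   # non-relevant
    "parliament votes on election law",  # non-relevant
]
y = np.array([1, 1, 0, 0], float)
vocab = sorted({w for d in docs for w in d.split()})
A = np.array([[d.split().count(w) for w in vocab] for d in docs], float)

# LSI: project documents onto the top-k singular directions.
k = 2
U, S, Vt = np.linalg.svd(A, full_matrices=False)
X = A @ Vt[:k].T  # k-dimensional document representations

# Logistic regression trained by gradient descent on the reduced features.
w, b = np.zeros(k), 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w + b)))
    w -= 0.1 * X.T @ (p - y) / len(y)
    b -= 0.1 * np.mean(p - y)

print(np.round(1 / (1 + np.exp(-(X @ w + b))), 2))  # routing scores per doc
```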
Method Combination For Document Filtering
- 1996
"... There is strong empirical and theoretic evidence that combination of retrieval methods can improve performance. In this paper, we systematically compare combination strategies in the context of document filtering, using queries from the Tipster reference corpus. We find that simple averaging strateg ..."
Cited by 56 (1 self)
There is strong empirical and theoretical evidence that combination of retrieval methods can improve performance. In this paper, we systematically compare combination strategies in the context of document filtering, using queries from the Tipster reference corpus. We find that simple averaging strategies do indeed improve performance, but that direct averaging of probability estimates is not the correct approach. Instead, the probability estimates must be renormalized using logistic regression on the known relevance judgements. We examine more complex combination strategies but find them less successful, due to the high correlations among our filtering methods, which are optimized over the same training data and employ similar document representations.

1 Introduction
A text filtering system monitors an incoming document stream and selects documents identified as relevant to one or more of its query profiles. If profile interactions are ignored, this reduces to a number of independent binary ...
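The renormalization step the abstract argues for can be sketched as follows: fit a one-dimensional logistic regression per method on known relevance judgements, then average the calibrated probabilities rather than the raw scores. All scores and judgements below are invented.

```python
# Calibrate each filtering method's raw scores into probabilities with
# logistic regression, then combine methods by averaging the calibrated
# estimates.
import numpy as np

def fit_logistic(scores, relevant, steps=2000, lr=0.5):
    """1-D logistic regression: returns a map from score to P(relevant)."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(w * scores + b)))
        w -= lr * np.mean((p - relevant) * scores)
        b -= lr * np.mean(p - relevant)
    return lambda s: 1 / (1 + np.exp(-(w * s + b)))

# Raw scores from two filtering methods on the same training documents,
# plus binary relevance judgements for those documents.
method_a = np.array([0.9, 0.7, 0.4, 0.2, 0.1])
method_b = np.array([2.5, 1.0, 1.2, -0.5, -1.0])  # a different scale entirely
judged = np.array([1, 1, 0, 0, 0], float)

cal_a = fit_logistic(method_a, judged)
cal_b = fit_logistic(method_b, judged)

# On new documents: average calibrated probabilities, not raw scores.
new_a, new_b = np.array([0.8, 0.15]), np.array([1.8, -0.7])
combined = (cal_a(new_a) + cal_b(new_b)) / 2
print(np.round(combined, 2))
```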
Improving Full-Text Precision on Short Queries using Simple Constraints
- Proceedings of the Symposium on Document Analysis and Information Retrieval, 1996
"... We show that two simple constraints, when applied to short user queries (on the order of 5--10 words) can yield precision scores comparable to or better than those achieved using long queries (50--85 words) at low document cutoff levels. These constraints are meant to detect documents that have subt ..."
Cited by 34 (0 self)
We show that two simple constraints, when applied to short user queries (on the order of 5-10 words), can yield precision scores comparable to or better than those achieved using long queries (50-85 words) at low document cutoff levels. These constraints are meant to detect documents that have subtopic passages that include the most important components of the query. The constraints are: (i) a simple Boolean constraint which requires the user to specify the query as a list of topics; this list is converted into a conjunct of disjuncts by the system, and (ii) a subtopic-sized proximity constraint imposed over the Boolean constraint. The vector space model is used to rank the documents that satisfy both constraints. Experiments run over 45 TREC queries show significant, almost consistent improvements over rankings that use no constraints. These results have important ramifications for interactive systems intended for casual users, such as those searching on the World Wide Web.
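A minimal sketch of the two constraints follows, assuming a fixed-width word window as a stand-in for a "subtopic-sized" passage; the example query, documents, and window size are invented.

```python
# The query is a conjunct of disjuncts: a document passes only if some
# window of text contains at least one term from every topic, so the
# Boolean constraint is checked under the proximity constraint.
# Query: (reform OR law) AND (health OR medical)
topics = [{"reform", "law"}, {"health", "medical"}]
window = 6  # proximity constraint, roughly subtopic-sized

def passes(text, topics, window):
    words = text.lower().split()
    for start in range(max(1, len(words) - window + 1)):
        span = set(words[start:start + window])
        # Every topic must contribute at least one term to this span.
        if all(span & topic for topic in topics):
            return True
    return False

docs = [
    "the senate debated health insurance reform for weeks",
    "reform was discussed early on and only much later did health "
    "statistics appear in an appendix",
    "tax law changes dominated the agenda",
]
for d in docs:
    print(passes(d, topics, window), "-", d[:40])
```

The second document contains terms from both topics but never close together, so the proximity constraint correctly rejects it.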
The Information Grid: A Framework for Information Retrieval and Retrieval-Centered Applications
- 1992
"... The Information Grid (InfoGrid) is a framework for building information access applications that provides a user interface design and an interaction model. It focuses on retrieval of application objects as its top level mechanism for accessing user information, documents, or services. We have e ..."
Cited by 25 (1 self)
The Information Grid (InfoGrid) is a framework for building information access applications that provides a user interface design and an interaction model. It focuses on retrieval of application objects as its top-level mechanism for accessing user information, documents, or services. We have embodied the InfoGrid design in an object-oriented application framework that supports rapid construction of applications. This application framework has been used to build a number of applications, some that are classically characterized as information retrieval applications, others that are more typically viewed as personal work tools.
Speech-based retrieval using semantic co-occurrence filtering
- In Proc. ARPA Human Language Technology Workshop, Plainsboro, NJ, 1994
"... In this paper we demonstrate that speech recognition can be effectively applied to information retrieval (IR) applications. Our system exploits the fact that the intended words of a spo-ken query tend to co-occur in text documents in close proxim-ity whereas word combinations that are the result of ..."
Cited by 24 (1 self)
In this paper we demonstrate that speech recognition can be effectively applied to information retrieval (IR) applications. Our system exploits the fact that the intended words of a spoken query tend to co-occur in text documents in close proximity, whereas word combinations that are the result of recognition errors are usually not semantically correlated and thus do not appear together. Termed "Semantic Co-occurrence Filtering," this enables the system to simultaneously disambiguate word hypotheses and find relevant text for retrieval. The system is built by integrating standard IR and speech recognition techniques. An evaluation of the system is presented and we discuss several refinements to the functionality.
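The filtering idea can be sketched in a few lines: score each combination of per-slot word hypotheses by how often its words co-occur in the collection, and keep the best. The hypothesis lists and toy corpus below are invented examples; the real system integrates this with standard IR ranking.

```python
# Semantic co-occurrence filtering: among acoustically confusable word
# hypotheses, prefer the combination whose words actually co-occur in
# documents, disambiguating the query and locating relevant text at once.
from itertools import product

corpus = [
    "interest rates at the central bank",
    "the bank raised interest rates again",
    "a great crested grebe on the lake",
]

def cooccurrence(w1, w2):
    """Number of documents in which both words appear."""
    return sum(1 for d in corpus if w1 in d.split() and w2 in d.split())

# Two slots of a spoken query, each with recognizer hypotheses.
slot1 = ["interest", "interests", "interred"]
slot2 = ["rates", "raids", "dates"]

best = max(product(slot1, slot2), key=lambda pair: cooccurrence(*pair))
print(best)  # -> ('interest', 'rates'): the semantically coherent pair
```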