Improved Algorithms for Topic Distillation in a Hyperlinked Environment
, 1998
Cited by 374 (6 self)
This paper addresses the problem of topic distillation on the World Wide Web, namely, given a typical user query to find quality documents related to the query topic. Connectivity analysis has been shown to be useful in identifying high quality pages within a topic specific graph of hyperlinked documents. The essence of our approach is to augment a previous connectivity analysis based algorithm with content analysis. We identify three problems with the existing approach and devise algorithms to tackle them. The results of a user evaluation are reported that show an improvement of precision at 10 documents by at least 45% over pure connectivity analysis.
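The "precision at 10 documents" figure quoted in this abstract can be illustrated with a minimal sketch. The document IDs, the relevance judgments, and the function names below are hypothetical and not taken from the paper. Reading the reported 45% as a relative improvement over the baseline is also an assumption.

```python
def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the top-k ranked documents that are judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k


def relative_improvement(baseline, improved):
    """Relative gain of `improved` over `baseline` (assumed reading of 'improvement by X%')."""
    if baseline == 0:
        raise ValueError("baseline precision must be non-zero")
    return (improved - baseline) / baseline


# Hypothetical ranking and relevance judgments, for illustration only.
ranking = ["d1", "d7", "d3", "d9", "d4", "d8", "d2", "d6", "d5", "d0"]
relevant = {"d1", "d2", "d3", "d4"}

p10 = precision_at_k(ranking, relevant, k=10)
print(f"precision@10 = {p10:.2f}")
```

Under this reading, a baseline precision@10 of 0.40 and an improved value of 0.60 would correspond to a 50% relative improvement.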
Scalable Feature Selection, Classification and Signature Generation for Organizing Large Text Databases Into Hierarchical Topic Taxonomies
, 1998
Cited by 87 (7 self)
We explore how to organize large text databases hierarchically by topic to aid better searching, browsing and filtering. Many corpora, such as internet directories, digital libraries, and patent databases are manually organized into topic hierarchies, also called taxonomies. Similar to indices for relational data, taxonomies make search and access more efficient. However, the exponential growth in the volume of on-line textual information makes it nearly impossible to maintain such taxonomic organization for large, fast-changing corpora by hand. We describe an automatic system that starts with a small sample of the corpus in which topics have been assigned by hand, and then updates the database with new documents as the corpus grows, assigning topics to these new documents with high speed and accuracy. To do this, we use techniques from statistical pattern recognition to efficiently separate the feature words, or...
Using taxonomy, discriminants, and signatures for navigating in text databases
- In Proceedings of the 23rd VLDB Conference
, 1997
Cited by 67 (4 self)
We explore how to organize a text database hierarchically to aid better searching and browsing. We propose to exploit the natural hierarchy of topics, or taxonomy, that many corpora, such as internet directories, digital libraries, and patent databases enjoy. In our system, the user navigates through the query response not as a flat unstructured list, but embedded in the familiar taxonomy, and annotated with document signatures computed dynamically with respect to where the user is located at any time. We show how to update such databases with new documents with high speed and accuracy. We use techniques from statistical pattern recognition to efficiently separate the feature words or discriminants from the noise words at each node of the taxonomy. Using these, we build a multi-level classifier. At each node, this classifier can ignore the large number of noise words in a document. Thus the classifier has a small model size and is very fast. However, owing to the use of context-sensitive features, the classifier is very accurate. We report on experiences with the Reuters newswire benchmark, the US Patent database, and web document samples from Yahoo!.
Information retrieval on the Web
- ACM Computing Surveys
, 2000
Cited by 58 (0 self)
In this paper we review studies of the growth of the Internet and technologies that are useful for information search and retrieval on the Web. We present data on the Internet from several different sources, e.g., current as well as projected number of users, hosts, and Web sites. Although numerical figures vary, overall trends cited
Improving Browsing in Digital Libraries with Keyphrase Indexes
, 1998
Cited by 49 (9 self)
Browsing accounts for much of people's interaction with digital libraries, but it is poorly supported by standard search engines. Conventional systems often operate at the wrong level, indexing words when people think in terms of topics, and returning documents when people want a broader view. As a result, users cannot easily determine what is in a collection, how well a particular topic is covered, or what kinds of queries will provide useful results. We have built
Improving Interactive Retrieval by Combining Ranked Lists and Clustering
- IN PROCEEDINGS OF RIAO’2000
, 2000
Cited by 18 (5 self)
We study the problem of organizing the documents returned by an information retrieval system in response to a natural language query. We consider two well-known approaches -- the ranked list and clustering of the results -- and we show how they can be integrated. This new procedure is designed to accept user feedback and direct the user toward the relevant material as effectively as the traditional relevance feedback approach. We show how our technique can be explained to the user by visualizing the process in two or three dimensions, providing him or her with complete control of the procedure. We show that increasing the dimensionality of the visualization generally improves its quality, albeit only a small amount. Additionally we present the result of a small user study designed to investigate how effective our visualization is in supporting the user navigating the retrieved results.
User-Chosen Phrases in Interactive Query Formulation for Information Retrieval
- in: Proceedings of The 20th BCS Colloquium on Information Retrieval (IRSG’98) (Springer-Verlag
, 1998
Cited by 11 (0 self)
The impact of using phrases as content representation for documents and for queries has generally been accepted as a desirable feature in information retrieval systems because phrases are generally regarded as being more content-bearing than their constituent words. This has been borne out by experiments in which the impact of phrases on retrieval performance has usually been found to be positive. However, most of the experimental results reported have derived phrases from documents and from queries in a fully automatic way. While this is acceptable for document indexing, it is less acceptable for query formulation, which is increasingly heading towards being an iterative process with users investing time in browsing the term space to choose appropriate search terms. In this paper we report a series of experiments in which two users, one experienced and the other a novice, formulate their queries by browsing the term space in advance of issuing a retrieval request. For these users we analyse...
Evaluating combinations of ranked lists and visualizations of inter-document similarity
- Information Processing and Management
, 2001
Human Evaluation of Kea, an Automatic Keyphrasing System
, 2001
Cited by 10 (2 self)
This paper describes an evaluation of the Kea automatic keyphrase extraction algorithm. Tools that automatically identify keyphrases are desirable because document keyphrases have numerous applications in digital library systems, but are costly and time consuming to manually assign. Keyphrase extraction algorithms are usually evaluated by comparison to author-specified keywords, but this methodology has several well-known shortcomings. The results presented in this paper are based on subjective evaluations of the quality and appropriateness of keyphrases by human assessors, and make a number of contributions. First, they validate previous evaluations of Kea that rely on author keywords. Second, they show Kea's performance is comparable to that of similar systems that have been evaluated by human assessors. Finally, they justify the use of author keyphrases as a performance metric by showing that authors generally choose good keywords.
A Medical Digital Library to Support Scenario and User-Tailored Information Retrieval
, 2000
Cited by 4 (1 self)
Current large scale information sources are designed to support general queries and lack the ability to support scenario specific information navigation, gathering, and presentation. As a result, users are often unable to obtain desired specific information within a well defined subject area. Today's information systems do not provide efficient content navigation, incremental appropriate matching, or content correlation. We are developing the following innovative technologies to remedy these problems: (1) Scenario-based proxies, enabling the gathering and filtering of information customized for users within a pre-defined domain; (2) Context-sensitive navigation and matching, providing approximate matching and similarity links when an exact match to a user's request is unavailable; (3) Content correlation of documents, creating semantic links between documents and information sources; and (4) User models for customization of retrieved information and result presentation. A digital medical library is currently being constructed using these technologies to provide customized information for the user. The technologies are general in nature and can provide custom and scenario-specific information in many other domains (e.g. crisis management).

