Constant Interaction-Time Scatter/Gather Browsing of Very Large Document Collections (1993)

by Douglass R. Cutting, David R. Karger, Jan O. Pedersen

Results 11 - 20 of 74

HyPursuit: A hierarchical network search engine that exploits content-link hypertext clustering

by Ron Weiss, Bienvenido Velez, Mark A. Sheldon, Chanathip Namprempre, Peter Szilagyi, Andrzej Duda, David K. Gifford - Proceedings of the Seventh ACM Conference on Hypertext, 1996
Cited by 88 (2 self)
HyPursuit is a new hierarchical network search engine that clusters hypertext documents to structure a given information space for browsing and search activities. Our content-link clustering algorithm is based on the semantic information embedded in hyperlink structures and document contents. HyPursuit admits multiple, coexisting cluster hierarchies based on different principles for grouping documents, such as the Library of Congress catalog scheme and automatically created hypertext clusters. HyPursuit's abstraction functions summarize cluster contents to support scalable query processing. The abstraction functions satisfy system resource limitations with controlled information loss. The result of query processing operations on a cluster summary approximates the result of performing the operations on the entire information space. We constructed a prototype system comprising 100 leaf World Wide Web sites and a hierarchy of 42 servers that route queries to the leaf sites. Experience with our system suggests that abstraction functions based on hypertext clustering can be used to construct meaningful and scalable cluster hierarchies. We are also encouraged by preliminary results on clustering based on both document contents and hyperlink structures.

Projections for Efficient Document Clustering

by Hinrich Schütze, Craig Silverstein, 1997
Cited by 86 (0 self)
Clustering is increasing in importance, but linear- and even constant-time clustering algorithms are often too slow for real-time applications. A simple way to speed up clustering is to speed up the distance calculations at the heart of clustering routines. We study two techniques for improving the cost of distance calculations, LSI and truncation, and determine both how much these techniques speed up clustering and how much they affect the quality of the resulting clusters. We find that the speed increase is significant while --- surprisingly --- the quality of clustering is not adversely affected. We conclude that truncation yields clusters as good as those produced by full-profile clustering while offering a significant speed advantage.
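
As a rough illustration of the truncation idea in this abstract, the sketch below assumes that truncation means keeping only the highest-weighted terms of each sparse term vector before computing a similarity. The function names, the `keep` parameter, and the sample weights are hypothetical and are not taken from the paper.

```python
import math

def truncate(vec, keep):
    """Keep only the `keep` highest-weighted terms of a sparse term->weight dict."""
    # Ties are broken alphabetically so the result is deterministic.
    top = sorted(vec.items(), key=lambda kv: (-kv[1], kv[0]))[:keep]
    return dict(top)

def cosine(a, b):
    """Cosine similarity of two sparse vectors; cost scales with the shorter one."""
    if len(a) > len(b):
        a, b = b, a
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical term weights for one document and one cluster centroid.
doc = {"cluster": 3.0, "browse": 2.0, "the": 0.1, "of": 0.1}
centroid = {"cluster": 2.5, "search": 1.0, "the": 0.2}

full = cosine(doc, centroid)                            # full-profile similarity
fast = cosine(truncate(doc, 2), truncate(centroid, 2))  # truncated similarity
```

In this toy case the truncated similarity stays close to the full-profile one because the dropped terms carry little weight; whether that holds on real collections is the empirical question the paper studies.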

Knowledge Discovery in Textual Databases (KDT)

by Ronen Feldman, Ido Dagan - In Proceedings of the First International Conference on Knowledge Discovery and Data Mining (KDD-95), 1995
Cited by 80 (2 self)
The information age is characterized by a rapid growth in the amount of information available in electronic media. Traditional data handling methods are not adequate to cope with this information flood. Knowledge Discovery in Databases (KDD) is a new paradigm that focuses on computerized exploration of large amounts of data and on discovery of relevant and interesting patterns within them. While most work on KDD is concerned with structured databases, it is clear that this paradigm is required for handling the huge amount of information that is available only in unstructured textual form. To apply traditional KDD to texts, it is necessary to impose some structure on the data that is rich enough to allow for interesting KDD operations. On the other hand, we have to consider the severe limitations of current text processing technology and define rather simple structures that can be extracted from texts fairly automatically and at a reasonable cost. We propose using a text categoriza...

Using taxonomy, discriminants, and signatures for navigating in text databases

by Soumen Chakrabarti, Byron Dom, Rakesh Agrawal, Prabhakar Raghavan - In Proceedings of the 23rd VLDB Conference, 1997
Cited by 67 (4 self)
We explore how to organize a text database hierarchically to aid better searching and browsing. We propose to exploit the natural hierarchy of topics, or taxonomy, that many corpora, such as internet directories, digital libraries, and patent databases enjoy. In our system, the user navigates through the query response not as a flat unstructured list, but embedded in the familiar taxonomy, and annotated with document signatures computed dynamically with respect to where the user is located at any time. We show how to update such databases with new documents with high speed and accuracy. We use techniques from statistical pattern recognition to efficiently separate the feature words or discriminants from the noise words at each node of the taxonomy. Using these, we build a multi-level classifier. At each node, this classifier can ignore the large number of noise words in a document. Thus the classifier has a small model size and is very fast. However, owing to the use of context-sensitive features, the classifier is very accurate. We report on experiences with the Reuters newswire benchmark, the US Patent database, and web document samples from Yahoo!.

Integrating content-based access mechanisms with hierarchical file systems

by Burra Gopal, Udi Manber, 1999
Cited by 62 (0 self)
We present a new file system that combines name-based and content-based access to files at the same time. Our design allows both methods to be used at any time, thus preserving the benefits of both. Users can create their own name spaces based on queries, on explicit path names, or on any combination interleaved arbitrarily. All regular file operations -- such as adding, deleting, or moving files -- are supported in the same way, and in addition, query consistency is maintained and adapted to what the user is manually doing. One can add, remove, or move results of queries, and in general handle them as if they were regular files. This creates interesting new consistency problems, for which we suggest and implement solutions. Remote file systems or remote query systems (e.g., web search) can be integrated by users into their own coherent name spaces in a clean way. We believe that our design can serve as the basis for future information-rich file systems, giving users a better handle on their information.

Data mining for hypertext: A tutorial survey

by Soumen Chakrabarti - ACM SIGKDD Explorations, 2000
Cited by 61 (0 self)
With over 800 million pages covering most areas of human endeavor, the World-wide Web is a fertile ground for data mining research to make a difference to the effectiveness of information search. Today, Web surfers access the Web through two dominant interfaces: clicking on hyperlinks and searching via keyword queries. This process is often tentative and unsatisfactory. Better support is needed for expressing one's information need and dealing with a search result in more structured ways than available now. Data mining and machine learning have significant roles to play towards this end. In this paper we will survey recent advances in learning and mining problems related to hypertext in general and the Web in particular. We will review the continuum of supervised to semi-supervised to unsupervised learning problems, highlight the specific challenges which distinguish data mining in the hypertext domain from data mining in the context of data warehouses, and summarize the key areas of ...

Exact and Approximation Algorithms for Clustering

by Pankaj K. Agarwal, Cecilia M. Procopiuc, 1997
Cited by 48 (4 self)
In this paper we present an $n^{O(k^{1-1/d})}$ time algorithm for solving the k-center problem in $\mathbb{R}^d$, under $L_\infty$ and $L_2$ metrics. The algorithm extends to other metrics, and can be used to solve the discrete k-center problem, as well. We also describe a simple $(1+\varepsilon)$-approximation algorithm for the k-center problem, with running time $O(n \log k) + (k/\varepsilon)^{O(k^{1-1/d})}$. Finally, we present an $n^{O(k^{1-1/d})}$ time algorithm for solving the L-capacitated k-center problem, provided that $L = \Omega(n/k^{1-1/d})$ or $L = O(1)$. We conclude with a simple approximation algorithm for the L-capacitated k-center problem. The work on this paper was partially supported by a National Science Foundation Grant CCR-93-01259, by an Army Research Office MURI grant DAAH04-96-1-0013, by a Sloan fellowship, by an NYI award and matching funds from Xerox Corporation, and by a grant from the U.S.-Israeli Binational Science Foundation. † Department of Computer Science, Box ...
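
To make the k-center objective concrete, here is a minimal sketch of the classical greedy farthest-first heuristic, which is a well-known 2-approximation for k-center. It is not the exact or $(1+\varepsilon)$-approximation algorithm described in this abstract; the function name and sample points are hypothetical, and the sketch assumes the Euclidean metric and Python 3.8 or later for `math.dist`.

```python
import math

def farthest_first_centers(points, k):
    """Greedy farthest-first traversal: choose up to k centers from `points`.

    Returns the chosen centers and the resulting covering radius, i.e. the
    largest distance from any point to its nearest chosen center.
    """
    centers = [points[0]]  # arbitrary starting center
    # dist[i] = distance from points[i] to its nearest chosen center so far
    dist = [math.dist(p, centers[0]) for p in points]
    while len(centers) < min(k, len(points)):
        # The next center is the point currently farthest from all centers.
        i = max(range(len(points)), key=dist.__getitem__)
        centers.append(points[i])
        dist = [min(d, math.dist(p, points[i])) for d, p in zip(dist, points)]
    return centers, max(dist)

# Hypothetical input: two well-separated pairs of points on a line.
pts = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
centers, radius = farthest_first_centers(pts, 2)
```

The heuristic restricts centers to input points, so it addresses the discrete variant of the problem mentioned in the abstract.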

Evaluating Document Clustering for Interactive Information Retrieval

by Anton Leuski - In Proceedings of the Tenth International Conference on Information and Knowledge Management (CIKM), 2001
Cited by 43 (3 self)
We consider the problem of organizing and browsing the top-ranked portion of the documents returned by an information retrieval system. We study the effectiveness of a document organization in helping a user to locate the relevant material among the retrieved documents as quickly as possible. In this context we examine a set of clustering algorithms and experimentally show that a clustering of the retrieved documents can be significantly more effective than the traditional ranked-list approach. We also show that the clustering approach can be as effective as interactive relevance feedback based on query expansion while retaining an important advantage -- it provides the user with a valuable sense of control over the feedback process.

WebGlimpse - Combining Browsing and Searching

by Udi Manber, Mike Smith, Burra Gopal - Proc. of the Sixteenth ACM Symposium on Principles of Database Systems, 1997
Cited by 37 (2 self)
The two paradigms of searching and browsing are currently almost always used separately. One can either look at the library card catalog, or browse the shelves; one can either search large WWW sites (or the whole web), or browse page by page. In this paper we describe a software tool we developed, called WebGlimpse, that combines the two paradigms. It allows the search to be limited to a neighborhood of the current document. WebGlimpse automatically analyzes collections of web pages and computes those neighborhoods (at indexing time). With WebGlimpse users can browse at will, using the same pages; they can also jump from each page, through a search, to "close-by" pages related to their needs. In a sense, our combined paradigm allows users to browse using hypertext links that are constructed on the fly through a neighborhood search. The design of WebGlimpse concentrated on four goals: fast search, efficient indexing (both in terms of time and space), flexible facilities for d...

An evaluation of techniques for clustering search results

by Anton V. Leouski, W. Bruce Croft, 1996
Cited by 35 (3 self)
The ability to effectively organize retrieval results becomes more important as the focus of Information Retrieval (IR) shifts towards interactive search processes. Automatic classification techniques are capable of providing the necessary information organization by arranging the retrieved data into groups of documents with common subjects. In this paper, we compare classification methods from IR and Machine Learning (ML) for clustering search results. Issues such as document representation, classification algorithms, and cluster representation are discussed. We introduce several evaluation techniques and use them in preliminary experiments. These experiments indicate that the proposed techniques have promise, but it is clear that user experiments are required to carry out a more thorough evaluation.