Efficient Clustering Of Very Large Document Collections
, 2001
Cited by 74 (9 self)
An invaluable portion of scientific data occurs naturally in text form. Given a large unlabeled document collection, it is often helpful to organize this collection into clusters of related documents. By using a vector space model, text data can be treated as high-dimensional but sparse numerical data vectors. It is a contemporary challenge to efficiently preprocess and cluster very large document collections. In this paper we present a time- and memory-efficient technique for the entire clustering process, including the creation of the vector space model. This efficiency is obtained by (i) a memory-efficient multi-threaded preprocessing scheme, and (ii) a fast clustering algorithm that fully exploits the sparsity of the data set. We show that this entire process takes time that is linear in the size of the document collection. Detailed experimental results are presented; a highlight of our results is that we are able to effectively cluster a collection of 113,716 NSF award abstracts in 23 minutes (including disk I/O costs) on a single workstation with modest memory consumption.
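To illustrate how a vector space model lets a clustering step exploit sparsity, here is a minimal sketch. It is not the paper's algorithm, which the abstract does not name. As an assumption, it uses a plain cosine-similarity k-means over term-frequency vectors stored as dictionaries, seeded with the first k documents. The function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def to_unit_vector(text):
    """Map a document to a sparse, L2-normalized term-frequency vector."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {term: c / norm for term, c in counts.items()} if norm else {}


def sparse_dot(sparse_vec, centroid):
    """Dot product that visits only the nonzero terms of the document."""
    return sum(w * centroid.get(term, 0.0) for term, w in sparse_vec.items())


def cluster(docs, k, iterations=10):
    """Cosine-similarity k-means over sparse vectors (illustrative only)."""
    vectors = [to_unit_vector(d) for d in docs]
    # Assumption: seed centroids with the first k documents for determinism.
    centroids = [dict(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each document goes to its most similar centroid.
        labels = [
            max(range(k), key=lambda j: sparse_dot(v, centroids[j]))
            for v in vectors
        ]
        # Update step: sum member vectors, then renormalize to unit length.
        sums = [defaultdict(float) for _ in range(k)]
        for v, label in zip(vectors, labels):
            for term, w in v.items():
                sums[label][term] += w
        for j, s in enumerate(sums):
            norm = math.sqrt(sum(w * w for w in s.values()))
            if norm:
                centroids[j] = {t: w / norm for t, w in s.items()}
    return labels


docs = [
    "sparse matrix vector",
    "stock market price",
    "sparse vector data",
    "market price index",
]
labels = cluster(docs, k=2)
```

In this sketch, each assignment pass costs time proportional to k times the total number of nonzero entries, because `sparse_dot` never iterates over the full vocabulary.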
SETS: Search Enhanced by Topic Segmentation
, 2003
Cited by 61 (4 self)
We present SETS, an architecture for building topic-segmented networks for efficient search. The key idea is to arrange participants in a topic-segmented topology where most of the links are short-distance links joining pairs of sites with similar content. The resulting topically focused regions are joined together into a single network by long-distance links. Queries are then matched and routed to only the topically closest regions. We draw on ideas from machine learning and social network theory to build an efficient search network. We discuss a variety of design issues and tradeoffs that an implementor of SETS would face. We show that SETS is efficient in network traffic and query processing load.
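The idea of routing a query only to the topically closest regions can be sketched as follows. This is a hedged illustration, not the SETS protocol. As assumptions, each topic segment is summarized by a single term-count centroid, closeness is measured by cosine similarity, and the names `route_query`, `segment_centroids`, and `fanout` are all hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term counts for a piece of text."""
    return dict(Counter(text.lower().split()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def route_query(query, segment_centroids, fanout=1):
    """Return the names of the `fanout` segments most similar to the query."""
    q = term_vector(query)
    ranked = sorted(
        segment_centroids,
        key=lambda name: cosine(q, segment_centroids[name]),
        reverse=True,
    )
    return ranked[:fanout]


# Hypothetical segments, each summarized by one centroid.
segments = {
    "databases": term_vector("sql index transaction query"),
    "networks": term_vector("routing packet latency link"),
}
targets = route_query("packet routing", segments)
```

Here only the sites in the returned segments would receive the query, which is where any saving in traffic and query processing load would come from in a design of this kind.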
A Medical Digital Library to Support Scenario and User-Tailored Information Retrieval
, 2000
Cited by 4 (1 self)
Current large-scale information sources are designed to support general queries and lack the ability to support scenario-specific information navigation, gathering, and presentation. As a result, users are often unable to obtain desired specific information within a well-defined subject area. Today's information systems do not provide efficient content navigation, incremental appropriate matching, or content correlation. We are developing the following innovative technologies to remedy these problems: (1) Scenario-based proxies, enabling the gathering and filtering of information customized for users within a pre-defined domain; (2) Context-sensitive navigation and matching, providing approximate matching and similarity links when an exact match to a user's request is unavailable; (3) Content correlation of documents, creating semantic links between documents and information sources; and (4) User models for customization of retrieved information and result presentation. A digital medical library is currently being constructed using these technologies to provide customized information for the user. The technologies are general in nature and can provide custom and scenario-specific information in many other domains (e.g., crisis management).
Bayesian Multidimensional Scaling and Choice of Dimension
- Journal of the American Statistical Association
, 2001
Multidimensional scaling is widely used to handle data which consist of dissimilarity measures between pairs of objects or people. We deal with two major problems in metric multidimensional scaling (configuration of objects and determination of the dimension of object configuration) within a Bayesian framework. A Markov chain Monte Carlo algorithm is proposed for object configuration, along with a simple Bayesian criterion for choosing their effective dimension, called MDSIC. Simulation results are presented, as well as examples on real data. Our method provides better results than classical multidimensional scaling for object configuration, and MDSIC seems to work well for dimension choice in the examples considered. Key Words: Clustering, Dimensionality, Dissimilarity, Markov chain Monte Carlo, Metric scaling, Model selection.
KhojYantra: An Integrated MetaSearch Engine with Classification, Clustering and Ranking
, 2000
Present search engines generally return a long ordered list of results, which users are forced to sift through to find relevant documents. The Information Retrieval community has explored various methods for information presentation, which help users to separate interesting information from the set of retrieved documents. The ranked list is a well-known and widely accepted technique for information presentation. Clustering and classification have also been explored as successful methods for organizing retrieved results, but these techniques are yet to be deployed as part of major search engines. It is

