Building efficient and effective metasearch engines
- ACM Computing Surveys
, 2002
Cited by 107 (9 self)
Frequently a user's information needs are stored in the databases of multiple search engines. It is inconvenient and inefficient for an ordinary user to invoke multiple search engines and identify useful documents from the returned results. To support unified access to multiple search engines, a metasearch engine can be constructed. When a metasearch engine receives a query from a user, it invokes the underlying search engines to retrieve useful information for the user. Metasearch engines have other benefits as a search tool such as increasing the search coverage of the Web and improving the scalability of the search. In this article, we survey techniques that have been proposed to tackle several underlying challenges for building a good metasearch engine. Among the main challenges, the database selection problem is to identify search engines that are likely to return useful documents to a given query. The document selection problem is to determine what documents to retrieve from each identified search engine. The result merging problem is to combine the documents returned from multiple search engines. We will also point out some problems that need to be further researched.
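The three problems named in this abstract (database selection, document selection, result merging) can be pictured with a small sketch. Everything here is an illustrative assumption rather than a technique taken from the survey: `ToyEngine`, its vocabulary-based representative, the term-coverage usefulness estimate, and the weighted-score merge are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    doc_id: str
    score: float  # local score in [0, 1]


class ToyEngine:
    """Hypothetical stand-in for one underlying search engine."""

    def __init__(self, name, docs):
        self.name = name
        self.docs = docs  # doc_id -> text
        # Assumed "representative": the set of words the engine indexes.
        self.vocabulary = {w for text in docs.values() for w in text.lower().split()}

    def search(self, query, limit):
        terms = query.lower().split()
        hits = []
        for doc_id, text in self.docs.items():
            words = set(text.lower().split())
            matched = sum(1 for t in terms if t in words)
            if matched:
                hits.append(Hit(doc_id, matched / len(terms)))
        hits.sort(key=lambda h: (-h.score, h.doc_id))
        return hits[:limit]


def select_databases(engines, terms, k):
    """Database selection: keep the k engines whose representative covers
    the largest fraction of the query terms (an assumed estimate)."""
    scored = []
    for engine in engines:
        usefulness = len(terms & engine.vocabulary) / len(terms)
        if usefulness > 0:
            scored.append((usefulness, engine))
    scored.sort(key=lambda p: (-p[0], p[1].name))
    return scored[:k]


def metasearch(engines, query, k=2, per_engine=2, top=3):
    terms = set(query.lower().split())
    if not terms:
        return []
    merged = {}
    for usefulness, engine in select_databases(engines, terms, k):
        # Document selection: here simply a fixed number of hits per engine.
        for hit in engine.search(query, per_engine):
            # Result merging: weight each local score by the engine's
            # estimated usefulness and keep the best score per document.
            global_score = usefulness * hit.score
            merged[hit.doc_id] = max(merged.get(hit.doc_id, 0.0), global_score)
    return sorted(merged.items(), key=lambda p: (-p[1], p[0]))[:top]
```

A real metasearch engine would not see the engines' documents directly; the vocabulary set stands in for whatever summary the broker can actually obtain.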
Detection of Heterogeneities in a Multiple Text Database Environment
- IN PROCEEDINGS OF THE FOURTH IFCIS INTERNATIONAL CONFERENCE ON COOPERATIVE INFORMATION SYSTEMS
, 1999
Cited by 20 (7 self)
As the number of text retrieval systems (search engines) grows rapidly on the World Wide Web, there is an increasing need to build search brokers (metasearch engines) on top of them. Often, the task of building an effective and efficient metasearch engine is hindered by the heterogeneities among the underlying local search engines. In this paper, we first analyze the impact of various heterogeneities on building a metasearch engine. We then present some techniques that can be used to detect the most prominent heterogeneities among multiple search engines. Applications of utilizing the detected heterogeneities in building better metasearch engines will be provided.
Content Routing in a Network of WAIS Servers
- In 14th IEEE International Conference on Distributed Computing Systems
, 1993
Cited by 18 (5 self)
Locating and accessing information in a large distributed system is a difficult problem of growing importance. This paper reports on our experience building and using a prototype system for transparent, user-guided associative access to the contents of a large, distributed set of WAIS servers. Our system is based on content routing, an architecture that makes use of content labels for locating and accessing information in large distributed systems [12]. Our content router for WAIS servers is implemented as a Semantic File System that constructs content labels from WAIS source and catalog files. The content router guides locating documents by suggesting terms that frequently appear with a given query term in document headlines. Sufficiently narrowed queries are routed to WAIS servers and processed in parallel. We have successfully used our content router to locate documents on a large number of WAIS servers. Along with demonstrating the feasibility of distributed finding in a large netw...
Data Structures for Efficient Broker Implementation
- ACM TRANSACTIONS ON INFORMATION SYSTEMS
, 1997
Generalizing GlOSS to vector-space databases and broker hierarchies
- VLDB’95, Proceedings of 21th International Conference on Very Large Data Bases
, 1995
Cited by 14 (0 self)
As large numbers of text databases have become available on the Internet, it is harder to locate the right sources for given queries. In this paper we present gGlOSS, a generalized Glossary-Of-Servers Server, that keeps statistics on the available databases to estimate which databases are the potentially most useful for a given query. gGlOSS extends our previous work [1], which focused on databases using the boolean model of document retrieval, to cover databases using the more sophisticated vector-space retrieval model. We evaluate our new techniques using real-user queries and 53 databases. Finally, we further generalize our approach by showing how to build a hierarchy of gGlOSS brokers. The top level of the hierarchy is so small it could be widely replicated, even at end-user workstations.
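The idea of ranking databases from per-database statistics, as this abstract describes, might look roughly like the sketch below. It is not gGlOSS's actual estimator: the layout of `DatabaseSummary` (document frequency and summed term weight per term) and the rule of summing term weights over the query are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DatabaseSummary:
    name: str
    # term -> (document frequency, summed term weight); hypothetical layout
    stats: dict = field(default_factory=dict)


def estimate_goodness(summary, query_terms):
    """Assumed scoring rule: add up the summed weight the database
    reports for each query term (0 for terms it does not contain)."""
    return sum(summary.stats.get(term, (0, 0.0))[1] for term in query_terms)


def rank_databases(summaries, query):
    """Return database names ordered by estimated goodness, best first,
    leaving out databases whose estimate is zero."""
    terms = query.lower().split()
    scored = [(estimate_goodness(s, terms), s.name) for s in summaries]
    scored.sort(key=lambda p: (-p[0], p[1]))
    return [name for score, name in scored if score > 0]
```

The broker only consults the summaries, never the databases themselves, which is what keeps the top of a broker hierarchy small enough to replicate.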
Boolean Similarity Measures for Resource Discovery
- IEEE Transactions on Knowledge and Data Engineering
, 1997
Cited by 8 (1 self)
We develop a new method to rank the degree of similarity between Boolean expressions, contrast it with other known methods, and describe its implementation. Our method reduces time and space complexity from exponential to polynomial in the number of Boolean terms.
Index Terms - Boolean query, information retrieval, ranking, resource discovery, similarity measure.
1 Introduction
Most library information systems let users make Boolean queries against their database. Internet resource discovery systems, such as WAIS [1] and our Indie [2], also support Boolean queries. Frequently, users find it convenient if the retrieval system returns the answers to their queries in a ranked order. This paper develops an efficient algorithm to rank the similarity between a user's Boolean query and a set of objects, each described by a Boolean expression. Our method produces similarity rankings between zero and one. If the query and the object with which it is compared contain some identical terms, the ...
Descriptive Name Services For Large Internets
, 1993
Cited by 8 (2 self)
This thesis addresses the challenge of locating people, resources, and other objects in the global Internet. As the Internet grows beyond a million hosts in tens of thousands of organizations, it is increasingly difficult to locate any particular object. Hierarchical name services are frustrating, because users must guess the unique names for objects or navigate the name space to find information. Descriptive (i.e. relational) name services offer the promise of simple resource location through a non-procedural query language. Users locate resources by describing resource attributes. This thesis makes the promise of descriptive name services real by providing fast query processing in large internets. The key to speed in descriptive query processing is constraining the search space using two new techniques, called an active catalog and meta-data caching. The active catalog constrains the search space for a query by returning a list of data repositories where the answer to the query is li...
dSCAM: Finding Document Copies across Multiple Databases
- In Proceedings of the 4th International Conference on Parallel and Distributed Information Systems
, 1996
Cited by 1 (0 self)
The advent of the Internet has made the illegal dissemination of copyrighted material easy. An important problem is how to automatically detect when a "new" digital document is "suspiciously close" to existing ones. The SCAM project at Stanford University has addressed this problem when there is a single registered-document database. However, in practice, text documents may appear in many autonomous databases, and one would like to discover copies without having to exhaustively search in all databases. Our approach, dSCAM, is a distributed version of SCAM that keeps succinct metainformation about the contents of the available document databases. Given a suspicious document S, dSCAM uses its information to prune all databases that cannot contain any document that is close enough to S, and hence the search can focus on the remaining sites. We also study how to query the remaining databases so as to minimize different querying costs. We empirically study the pruning and searching schemes,...
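The pruning step this abstract mentions can be illustrated under strong simplifying assumptions. The sketch below assumes the metainformation kept per database is just its word vocabulary, and that "close enough" means a document contains at least a `threshold` fraction of the suspicious document's distinct words; neither assumption comes from the dSCAM paper, and the function names are hypothetical.

```python
def can_prune(db_vocabulary, suspicious_words, threshold):
    """True if no document in the database can contain at least
    `threshold` of the suspicious document's distinct words.

    Every document's words are a subset of the database vocabulary, so the
    vocabulary's coverage is an upper bound on any single document's coverage.
    """
    distinct = set(suspicious_words)
    if not distinct:
        return True
    covered = len(distinct & db_vocabulary)
    return covered / len(distinct) < threshold


def candidate_databases(vocabularies, suspicious_text, threshold=0.8):
    """Names of the databases that survive pruning and still need querying."""
    words = suspicious_text.lower().split()
    return sorted(
        name
        for name, vocab in vocabularies.items()
        if not can_prune(vocab, words, threshold)
    )
```

Because the test is an upper bound, it never discards a database that holds a qualifying copy under the assumed measure; it can only keep databases that later turn out to hold none.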

