Inverted files for text search engines
ACM Computing Surveys, 2006
Cited by 136 (2 self)
The technology underlying text search engines has advanced dramatically in the past decade. The development of a family of new index representations has led to a wide range of innovations in index storage, index construction, and query evaluation. While some of these developments have been consolidated in textbooks, many specific techniques are not widely known or the textbook descriptions are out of date. In this tutorial, we introduce the key techniques in the area, describing both a core implementation and how the core can be enhanced through a range of extensions. We conclude with a comprehensive bibliography of text indexing literature.
Self-Indexing Inverted Files for Fast Text Retrieval
ACM Transactions on Information Systems, 1996
Cited by 127 (23 self)
Query processing costs on large text databases are dominated by the need to retrieve and scan the inverted list of each query term. Here we show that query response time for conjunctive Boolean queries and for informal ranked queries can be dramatically reduced, at little cost in terms of storage, by the inclusion of an internal index in each inverted list. This method has been applied in a retrieval system for a collection of nearly two million short documents. Our experimental results show that the self-indexing strategy adds less than 20% to the size of the inverted file, but, for Boolean queries of 5 to 10 terms, can reduce processing time to under one fifth of the previous cost. Similarly, ranked queries of 40 to 50 terms can be evaluated in as little as 25% of the previous time, with little or no loss of retrieval effectiveness.
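The "internal index in each inverted list" described in this abstract can be pictured as a sparse set of skip entries laid over a sorted list of document numbers. The sketch below is only an illustration under stated assumptions: the class name `SkippedList`, the square-root block size, and the in-memory list layout are hypothetical and are not taken from the paper, which works with compressed lists on disk.

```python
from bisect import bisect_right
from math import isqrt


class SkippedList:
    """Sorted document numbers plus a sparse internal index (hypothetical layout)."""

    def __init__(self, doc_ids, step=None):
        self.doc_ids = sorted(doc_ids)
        # Assumption: roughly one skip entry per sqrt(n) postings.
        self.step = step or max(1, isqrt(len(self.doc_ids)))
        # Each skip entry records the first document number of one block.
        self.skips = self.doc_ids[::self.step]

    def contains(self, target):
        """Consult the skip entries, then scan only the one block that may hold target."""
        block = bisect_right(self.skips, target) - 1
        if block < 0:
            return False
        start = block * self.step
        return target in self.doc_ids[start:start + self.step]


def conjunctive_query(lists):
    """Return the documents that appear in every inverted list."""
    if not lists:
        return []
    # Start from the shortest list so that few candidates need to be checked.
    ordered = sorted(lists, key=lambda lst: len(lst.doc_ids))
    candidates = ordered[0].doc_ids
    for other in ordered[1:]:
        candidates = [d for d in candidates if other.contains(d)]
    return candidates
```

Starting from the shortest list means each longer list is probed only for a handful of candidates, and the skip entries let each probe touch a single block rather than the whole list.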
A survey of technologies for parsing and indexing digital video
Journal of Visual Communication and Image Representation, 1996
Cited by 64 (8 self)
In the future we envision systems that will provide video information delivery services to customers on a very large scale. These systems must provide customers with mechanisms to select programs of their choice from live broadcasts. Customers should also be provided with easy means of browsing and accessing pre-recorded digital data (e.g., distributed digital multimedia libraries), and downloading data from other information sources. To be viable for such large information sets, these systems must understand customer preferences and tailor the available information to the customer’s needs. To support this vision, a number of issues must be addressed and obstacles overcome. Intuitive interfaces, powerful query formulation and evaluation techniques, comprehensive data models, and flexible presentation functionalities must be developed. To realize these components, an effective query evaluation engine with the capabilities of query resolution in different content-specific formats (e.g., by graphics, by image, by sound) and in different domain-specific models (e.g., database of movies, database of newsclips) should be present. Additionally, the digital video database will require an efficient indexing system for easy access to the stored information. In this paper we discuss existing research trends in this
An efficient indexing technique for full-text database systems
Proceedings of the 18th International Conference on Very Large Data Bases, 1992
Cited by 62 (10 self)
Full-text database systems require an index to allow fast access to documents based on their content. We propose an inverted file indexing scheme based on compression. This scheme allows users to retrieve documents using words occurring in the documents, sequences of adjacent words, and statistical ranking techniques. The compression methods chosen ensure that the storage requirements are small and that dynamic update is straightforward. The only assumption that we make is that sufficient main memory is available to support an in-memory vocabulary; given this assumption, the method we describe requires at most one disc access per query term to identify answers to queries.
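To make the idea of a compressed inverted list concrete, the sketch below stores the gaps between successive document numbers with a variable-byte code. This is a swapped-in illustration: the abstract does not say which compression methods the authors chose, so the coding scheme and the function names `encode_postings` and `decode_postings` are assumptions.

```python
def encode_postings(doc_ids):
    """Encode strictly increasing document numbers as variable-byte gaps."""
    out = bytearray()
    previous = 0
    for doc_id in doc_ids:
        gap = doc_id - previous
        previous = doc_id
        # Emit 7 bits per byte, low-order group first.
        while gap >= 128:
            out.append(gap & 0x7F)
            gap >>= 7
        # The high bit marks the final byte of each gap.
        out.append(gap | 0x80)
    return bytes(out)


def decode_postings(data):
    """Rebuild the document numbers from the variable-byte gap stream."""
    doc_ids = []
    current = 0
    gap = 0
    shift = 0
    for byte in data:
        if byte & 0x80:
            # Final byte of this gap: complete it and emit a document number.
            gap |= (byte & 0x7F) << shift
            current += gap
            doc_ids.append(current)
            gap = 0
            shift = 0
        else:
            gap |= byte << shift
            shift += 7
    return doc_ids
```

Because small gaps fit in a single byte, lists for frequent words, whose gaps are short, shrink the most under this kind of coding.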
Semantic cache mechanism for heterogeneous Web querying
1999
Cited by 24 (2 self)
In Web-based searching systems that access distributed information providers, efficient query processing requires an advanced caching mechanism to reduce the query response time. Keyword-based querying is often the only way to retrieve data from Web providers, so standard page-based and tuple-based caching mechanisms are unsuitable for the task. In this work, we develop a mechanism for efficient caching of Web queries and the answers received from heterogeneous Web providers. We also report results of experiments and show how the caching mechanism is implemented in the Knowledge Broker system. Published by Elsevier Science B.V. All rights reserved.
PAT expressions: an algebra for text search
Acta Linguistica Hungarica, 1994
Cited by 20 (4 self)
this paper is to introduce the powerful search capabilities of PAT expressions. Text search is usually considered so simple that only a rough description of the operations is given. For example, when word search is discussed, we are seldom told what is meant by a "word". The reader has to find out through experimentation how many words are contained in the strings "Jean-Marie" and "O'Hara". However, a careless description of search operations may lead to search errors or unnecessarily long retrieval sessions. A second goal of the paper, therefore, is to introduce a mechanism for precise specification of text search semantics.

Text search using PAT is typically simple and straightforward [Raymond90]. However, because of the powerful definition capabilities included in PAT, explaining and understanding the semantics of some operations may be difficult. As a side-effect of our systematic specification of PAT, we have identified some features of PAT expressions that cause problems and thus would benefit from further development. From this we see that precise specification also serves as a means for evaluation and offers a means for comparing text search systems.

As is common in information retrieval systems, a PAT search is applied to indexed text [see, for example, Gonnet83, Croft84, Larson84, Faloutsos85, Salton89, Burkowski91]. Indexing is usually described from the point of view of implementation, for example, by giving an algorithm for the indexing [Salton81, Salton89, Gonnet91]. However, since the way text is indexed affects search behaviour, our systematic approach to precise description must include mechanisms that accommodate indexing definition capabilities.

PAT is a registered trademark of Open Text Corporation.
Fast image retrieval using color-spatial information
The VLDB Journal, 1998
Cited by 10 (3 self)
In this paper, we present an image retrieval system that employs both the color and spatial information of images to facilitate the retrieval process. The basic unit used in our technique is a single-colored cluster, which bounds a homogeneous region of that color in an image. Two clusters from two images are similar if they are of the same color and overlap in the image space. The number of clusters that can be extracted from an image can be very large, and it affects the accuracy of retrieval. We study the effect of the number of clusters on retrieval effectiveness to determine an appropriate value for “optimal” performance. To facilitate efficient retrieval, we also propose a multi-tier indexing mechanism called the Sequenced Multi-Attribute Tree (SMAT). We implemented a two-tier SMAT, where the first layer is used to prune away clusters that are of different colors, while the second layer discriminates clusters of different spatial locality. We conducted an experimental study on an image database consisting of 12,000 images. Our results show the effectiveness of the proposed color-spatial approach, and the efficiency of the proposed indexing mechanism.
Key words: Single-colored cluster – Content-based retrieval – Color-spatial information – Sequenced multi-attribute tree
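The similarity rule stated in this abstract (same color, overlapping in image space) can be written as a short predicate. The sketch assumes that a cluster is represented by a quantized color code and an axis-aligned bounding rectangle; the `Cluster` type and its field names are hypothetical, and the paper's own representation may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    """A single-colored region, assumed here to be an axis-aligned rectangle."""
    color: int   # quantized color code (assumption)
    left: int
    top: int
    right: int   # exclusive
    bottom: int  # exclusive


def clusters_similar(a, b):
    """True when both clusters share a color and their rectangles overlap."""
    same_color = a.color == b.color
    overlap = (a.left < b.right and b.left < a.right
               and a.top < b.bottom and b.top < a.bottom)
    return same_color and overlap
```

Checking the color first mirrors the two-tier pruning order the abstract describes: clusters of a different color are rejected before any spatial comparison is made.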
Using Bitmaps for Medium Sized Information Retrieval Systems
Information Processing & Management, 1990
Cited by 6 (3 self)
We describe the use of various forms of bitmaps as a basic tool for improving the search algorithms in medium sized information retrieval systems. The bitmaps considered include and extend known techniques using occurrence maps and signatures. Such an approach to text retrieval is flexible, efficient and, relative to the customary concordance approach, inexpensive in storage costs.

1. Introduction

Our ability to control textual information is being strongly influenced by a variety of technological advances. These include new means of storing and sharing information that makes possible and realistic an information system model in which large bodies of full text are compactly stored, widely distributed, and shared by a large number of interested persons. Such changes require a careful search for techniques that promise convenient and effective access to such textual databases. The research that is required in this environment differs from that traditional in Information Retrieval (IR)...
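The occurrence maps mentioned in this abstract can be illustrated with one bitmap per word, in which bit i is set when document i contains the word. The sketch below uses Python integers as bitmaps; the function names and the whitespace tokenization are assumptions made for illustration and do not reproduce the paper's data structures.

```python
def build_occurrence_maps(documents):
    """Map each word to an integer whose bit i is set when document i contains it."""
    maps = {}
    for doc_number, text in enumerate(documents):
        # Assumption: words are lowercased, whitespace-separated tokens.
        for word in set(text.lower().split()):
            maps[word] = maps.get(word, 0) | (1 << doc_number)
    return maps


def documents_with_all(maps, words):
    """AND the bitmaps of the query words and list the surviving document numbers."""
    if not words:
        return []
    result = maps.get(words[0].lower(), 0)
    for word in words[1:]:
        result &= maps.get(word.lower(), 0)
    return [i for i in range(result.bit_length()) if (result >> i) & 1]
```

A conjunctive query then costs one bitwise AND per query word, whatever the number of documents each word occurs in.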
Content-Aware DataGuides: Interleaving IR and DB Indexing Techniques for Efficient Retrieval of Textual XML Data
2003
Cited by 6 (3 self)
Not only since the advent of XML, many applications call for efficient structured document retrieval, challenging both Information Retrieval (IR) and database (DB) research. Most approaches combining indexing techniques from both fields still separate path and content matching, merging the hits in an expensive join. This paper shows that retrieval is significantly accelerated by processing text and structure simultaneously. The
Signature File Methods for Semantic Query Caching
Proceedings of the 2nd European Conference on Digital Libraries, LNCS 1513, 1998
Cited by 4 (2 self)
In digital libraries accessing distributed Web-based bibliographic repositories, performance is a major issue. Efficient query processing requires an appropriate caching mechanism. Unfortunately, standard page-based as well as tuple-based caching mechanisms designed for conventional databases are not efficient on the Web, where keyword-based querying is often the only way to retrieve data. Therefore, we study the problem of semantic caching of Web queries and develop a caching mechanism for conjunctive Web queries based on signature files. We propose two implementation...
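One way to picture signature-based caching of conjunctive queries: if every term of a cached query also appears in a new query, the new query's answers are a subset of the cached answers and can be filtered locally. The sketch below uses superimposed bit signatures as a cheap pre-filter for that containment test. It is an assumption-laden illustration, not either of the implementations the abstract refers to; `SemanticCache`, the 64-bit signature width, and the CRC-based hashing are all hypothetical choices.

```python
import zlib

SIGNATURE_BITS = 64  # assumption: fixed signature width


def term_signature(term, bits_per_term=3):
    """Set a few hash-selected bits for one term."""
    sig = 0
    for seed in range(bits_per_term):
        h = zlib.crc32(f"{seed}:{term}".encode("utf-8"))
        sig |= 1 << (h % SIGNATURE_BITS)
    return sig


def query_signature(terms):
    """Superimpose (OR together) the signatures of all terms in a query."""
    sig = 0
    for term in terms:
        sig |= term_signature(term)
    return sig


class SemanticCache:
    """Stores conjunctive queries with their answers (hypothetical structure)."""

    def __init__(self):
        self.entries = []  # (signature, frozenset of terms, answers)

    def store(self, terms, answers):
        terms = frozenset(terms)
        self.entries.append((query_signature(terms), terms, answers))

    def find_containing(self, terms):
        """Return cached entries whose answers must include the new query's answers."""
        terms = frozenset(terms)
        sig = query_signature(terms)
        hits = []
        for cached_sig, cached_terms, answers in self.entries:
            # Signature filter: cheap, but may pass false positives.
            if cached_sig & sig == cached_sig:
                # Exact subset check removes the false positives.
                if cached_terms <= terms:
                    hits.append((cached_terms, answers))
        return hits
```

The signature comparison never rejects a true containment, since a subset of terms can only set a subset of bits, so the exact check is needed only for the entries that pass the filter.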

