A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces
, 1998
Cited by 413 (12 self)
For similarity search in high-dimensional vector spaces (HDVSs), researchers have proposed a number of new methods (or adaptations of existing methods) based, in the main, on data-space partitioning. However, the performance of these methods generally degrades as dimensionality increases. Although this phenomenon, known as the "dimensional curse", is well known, little or no quantitative analysis of it is available. In this paper, we provide a detailed analysis of partitioning and clustering techniques for similarity search in HDVSs. We show formally that these methods exhibit linear complexity at high dimensionality, and that existing methods are outperformed on average by a simple sequential scan if the number of dimensions exceeds around 10. Consequently, we come up with an alternative organization based on approximations to make the unavoidable sequential scan as fast as possible. We describe a simple vector approximation scheme, called VA-file, and report on an ...
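The abstract names the approximation idea without giving details, so the following is only a minimal sketch of how a scan over compact approximations might filter candidates before touching full vectors. The uniform grid on [0, 1), the `bits` parameter, and all function names are assumptions made for illustration, not the authors' specification.

```python
import math

def build_approximations(vectors, bits=2):
    """Quantize each coordinate in [0, 1) into 2**bits equal-width slices (assumed grid)."""
    cells = 2 ** bits
    return [tuple(min(int(x * cells), cells - 1) for x in v) for v in vectors]

def lower_bound(query, approx, bits=2):
    """Smallest possible Euclidean distance from the query to any point in the cell."""
    cells = 2 ** bits
    total = 0.0
    for q, c in zip(query, approx):
        lo, hi = c / cells, (c + 1) / cells
        if q < lo:
            total += (lo - q) ** 2
        elif q > hi:
            total += (q - hi) ** 2
    return math.sqrt(total)

def nearest_neighbor(query, vectors, approximations, bits=2):
    """Scan all approximations, then read full vectors only while the bound can still win."""
    bounds = [lower_bound(query, a, bits) for a in approximations]
    order = sorted(range(len(vectors)), key=lambda i: bounds[i])
    best_i, best_d, visited = -1, math.inf, 0
    for i in order:
        if bounds[i] >= best_d:
            break  # no remaining cell can contain a closer vector
        visited += 1
        d = math.dist(query, vectors[i])
        if d < best_d:
            best_i, best_d = i, d
    return best_i, best_d, visited

vectors = [(0.1, 0.1), (0.9, 0.9), (0.5, 0.5), (0.15, 0.2)]
approximations = build_approximations(vectors)
best_i, best_d, visited = nearest_neighbor((0.12, 0.18), vectors, approximations)
```

In this toy run only the two vectors sharing the query's cell are read in full; the other two are ruled out by their lower bounds alone.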
Fast Subsequence Matching in Time-Series Databases
- SIGMOD 94
, 1994
Cited by 372 (18 self)
We present an efficient indexing method to locate 1-dimensional subsequences within a collection of sequences, such that the subsequences match a given (query) pattern within a specified tolerance. The idea is to map each data sequence into a small set of multidimensional rectangles in feature space. Then, these rectangles can be readily indexed using traditional spatial access methods, like the R*-tree [9]. In more detail, we use a sliding window over the data sequence and extract its features; the result is a trail in feature space. We propose an efficient and effective algorithm to divide such trails into sub-trails, which are subsequently represented by their Minimum Bounding Rectangles (MBRs). We also examine queries of varying lengths, and we show how to handle each case efficiently. We implemented our method and carried out experiments on synthetic and real data (stock price movements). We compared the method to sequential scanning, which is the only obvious competitor. The results were excellent: our method sped up the search by a factor of 3 to 100.
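The sliding-window, trail, and MBR steps described in the abstract can be sketched roughly as follows. The abstract does not say which features are extracted or how trails are divided, so the (mean, range) features and the fixed-size split used here are placeholder assumptions, and the function names are hypothetical.

```python
def window_features(sequence, w):
    """Slide a window of length w over the sequence; map each window to (mean, range)."""
    trail = []
    for start in range(len(sequence) - w + 1):
        window = sequence[start:start + w]
        trail.append((sum(window) / w, max(window) - min(window)))
    return trail

def trail_to_mbrs(trail, points_per_subtrail):
    """Cut the trail into consecutive fixed-size sub-trails and bound each with a rectangle."""
    mbrs = []
    for start in range(0, len(trail), points_per_subtrail):
        chunk = trail[start:start + points_per_subtrail]
        dims = list(zip(*chunk))            # one tuple of values per feature dimension
        low = tuple(min(d) for d in dims)   # lower corner of the rectangle
        high = tuple(max(d) for d in dims)  # upper corner of the rectangle
        mbrs.append((start, low, high))     # start = offset of the first window covered
    return mbrs

trail = window_features([1, 2, 3, 4, 5, 6], w=3)
mbrs = trail_to_mbrs(trail, points_per_subtrail=2)
```

Each resulting `(start, low, high)` triple is the kind of rectangle that could then be handed to a spatial access method; the index itself is out of scope for this sketch.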
Inverted files for text search engines
- ACM Computing Surveys
, 2006
Cited by 136 (2 self)
The technology underlying text search engines has advanced dramatically in the past decade. The development of a family of new index representations has led to a wide range of innovations in index storage, index construction, and query evaluation. While some of these developments have been consolidated in textbooks, many specific techniques are not widely known or the textbook descriptions are out of date. In this tutorial, we introduce the key techniques in the area, describing both a core implementation and how the core can be enhanced through a range of extensions. We conclude with a comprehensive bibliography of text indexing literature.
Self-Indexing Inverted Files for Fast Text Retrieval
- ACM Transactions on Information Systems
, 1996
Cited by 127 (23 self)
Query processing costs on large text databases are dominated by the need to retrieve and scan the inverted list of each query term. Here we show that query response time for conjunctive Boolean queries and for informal ranked queries can be dramatically reduced, at little cost in terms of storage, by the inclusion of an internal index in each inverted list. This method has been applied in a retrieval system for a collection of nearly two million short documents. Our experimental results show that the self-indexing strategy adds less than 20% to the size of the inverted file, but, for Boolean queries of 5 to 10 terms, can reduce processing time to under one fifth of the previous cost. Similarly, ranked queries of 40 to 50 terms can be evaluated in as little as 25% of the previous time, with little or no loss of retrieval effectiveness.
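One way to picture an internal index inside an inverted list is a small table of block heads that lets a lookup jump to a single block instead of scanning the whole list. The sketch below is an in-memory, uncompressed illustration under that assumption; the class name, the `skip_every` parameter, and the query routine are hypothetical and not taken from the paper.

```python
import bisect

class SkippedList:
    """Sorted inverted list plus an internal index of every skip_every-th entry."""

    def __init__(self, doc_ids, skip_every=4):
        self.doc_ids = sorted(doc_ids)
        self.skip_every = skip_every
        # Internal index: the first document id of each block.
        self.block_heads = self.doc_ids[::skip_every]

    def contains(self, doc_id):
        """Use the internal index to pick one block, then scan only that block."""
        block = bisect.bisect_right(self.block_heads, doc_id) - 1
        if block < 0:
            return False
        start = block * self.skip_every
        return doc_id in self.doc_ids[start:start + self.skip_every]

def conjunctive_query(lists):
    """AND query: take candidates from the shortest list, probe the others via their skips."""
    ordered = sorted(lists, key=lambda lst: len(lst.doc_ids))
    candidates = ordered[0].doc_ids
    for other in ordered[1:]:
        candidates = [d for d in candidates if other.contains(d)]
    return candidates

a = SkippedList([2, 5, 9, 14, 21, 30, 41])
b = SkippedList([5, 21, 50])
c = SkippedList([1, 5, 7, 21, 22, 41, 50, 51, 60])
result = conjunctive_query([a, b, c])
```

Starting from the rarest term keeps the candidate set small, so most of each longer list is never examined.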
Scalable Internet Resource Discovery: Research Problems and Approaches
, 1994
Cited by 121 (3 self)
Over the past several years, a number of information discovery and access tools have been introduced in the Internet, including Archie, Gopher, Netfind, and WAIS. These tools have become quite popular, and are helping to redefine how people think about wide-area network applications. Yet, they are not well suited to supporting the future information infrastructure, which will be characterized by enormous data volume, rapid growth in the user base, and burgeoning data diversity. In this paper we indicate trends in these three dimensions and survey problems these trends will create for current approaches. We then suggest several promising directions of future resource discovery research, along with some initial results from projects carried out by members of the Internet Research Task Force Research Group on Resource Discovery and Directory Service.
Searching the Web
- ACM TRANSACTIONS ON INTERNET TECHNOLOGY
, 2001
Cited by 108 (1 self)
We offer an overview of current Web search engine design. After introducing a generic search engine architecture, we examine each engine component in turn. We cover crawling, local Web page storage, indexing, and the use of link analysis for boosting search performance. The most common design and implementation techniques for each of these components are presented. For this presentation we draw from the literature and from our own experimental search engine testbed. Emphasis is on introducing the fundamental concepts and the results of several performance analyses we conducted to compare different designs.
Horting Hatches an Egg: A New Graph-Theoretic Approach to Collaborative Filtering
- In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge discovery and data mining
, 1999
Cited by 76 (1 self)
This paper introduces a novel approach to rating-based collaborative filtering. The new technique is most appropriate for e-commerce merchants offering one or more groups of relatively homogeneous items such as compact disks, videos, books, software and the like. In contrast with other known collaborative filtering techniques, the new algorithm is graph-theoretic, based on the twin new concepts of horting and predictability. As is demonstrated in this paper, the technique is fast, scalable, accurate, and requires only a modest learning curve. It makes use of a hierarchical classification scheme in order to introduce context into the rating process, and uses so-called creative links in order to find surprising and atypical items to recommend, perhaps even items which cross the group boundaries. The new technique is one of the key engines of the Intelligent Recommendation Algorithm (IRA) project, now being developed at IBM Research. In addition to several other recommendation engines, IRA contains a situation analyzer to determine the most appropriate mix of engines for a particular e-commerce merchant, as well as an engine for optimizing the placement of advertisements.
Inverted files versus signature files for text indexing
- ACM Transactions on Database Systems
, 1998
Cited by 74 (3 self)
Two well-known indexing methods are inverted files and signature files. We have undertaken a detailed comparison of these two approaches in the context of text indexing, paying particular attention to query evaluation speed and space requirements. We have examined their relative performance using both experimentation and a refined approach to modeling of signature files, and demonstrate that inverted files are distinctly superior to signature files. Not only can inverted files be used to evaluate typical queries in less time than can signature files, but inverted files require less space and provide greater functionality. Our results also show that a synthetic text database can provide a realistic indication of the behavior of an actual text database. The tools used to generate the synthetic database have been made publicly available.
Fast Incremental Indexing for Full-Text Information Retrieval
, 1994
Cited by 69 (3 self)
Full-text information retrieval systems have traditionally been designed for archival environments. They often provide little or no support for adding new documents to an existing document collection, requiring instead that the entire collection be re-indexed. Modern applications, such as information filtering, operate in dynamic environments that require frequent additions to document collections. We provide this ability using a traditional inverted file index built on top of a persistent object store. The data management facilities of the persistent object store are used to produce efficient incremental update of the inverted lists. We describe our system and present experimental results showing superior incremental indexing and competitive query processing performance. Keywords: full-text document retrieval, incremental indexing, persistent object store, performance.
A survey of technologies for parsing and indexing digital video
- Journal of visual Communication and image representation
, 1996
Cited by 64 (8 self)
In the future we envision systems that will provide video information delivery services to customers on a very large scale. These systems must provide customers with mechanisms to select programs of their choice from live broadcasts. Customers should also be provided with easy means of browsing and accessing pre-recorded digital data (e.g., distributed digital multimedia libraries), and downloading data from other information sources. To be viable for such large information sets, these systems must understand customer preferences and tailor the available information to the customer’s needs. To support this vision, a number of issues must be addressed and obstacles overcome. Intuitive interfaces, powerful query formulation and evaluation techniques, comprehensive data models, and flexible presentation functionalities must be developed. To realize these components, an effective query evaluation engine with the capabilities of query resolution in different content-specific formats (e.g., by graphics, by image, by sound) and in different domain-specific models (e.g., database of movies, database of newsclips) should be present. Additionally, the digital video database will require an efficient indexing system for easy access to the stored information. In this paper we discuss existing research trends in this

