Results 1 - 10
of
23
Adaptive Intersection and t-Threshold Problems
, 2002
"... Consider the problem of computing the intersection of k sorted sets. In the comparison model, we prove a new lower bound which depends on the non-deterministic complexity of the instance, and implies that the algorithm of Demaine, Lopez-Ortiz and Munro [2] is usually optimal in this \adaptive" sense ..."
Abstract
-
Cited by 26 (11 self)
- Add to MetaCart
Consider the problem of computing the intersection of k sorted sets. In the comparison model, we prove a new lower bound which depends on the non-deterministic complexity of the instance, and implies that the algorithm of Demaine, Lopez-Ortiz and Munro [2] is usually optimal in this \adaptive" sense. We extend the lower bound and the algorithm to the t-Threshold Problem, which consists in nding the elements which are in at least t of the k sets. These problems are motivated by boolean queries in text database systems.
Faster adaptive set intersections for text searching
- Experimental Algorithms: 5th International Workshop, WEA 2006, Cala Galdana, Menorca
, 2006
"... Abstract. The intersection of large ordered sets is a common problem in the context of the evaluation of boolean queries to a search engine. In this paper we engineer a better algorithm for this task, which improves over those proposed by Demaine, Munro and López-Ortiz [SODA 2000/ALENEX 2001], by us ..."
Abstract
-
Cited by 22 (5 self)
- Add to MetaCart
Abstract. The intersection of large ordered sets is a common problem in the context of the evaluation of boolean queries to a search engine. In this paper we engineer a better algorithm for this task, which improves over those proposed by Demaine, Munro and López-Ortiz [SODA 2000/ALENEX 2001], by using a variant of interpolation search. More specifically, our contributions are threefold. First, we corroborate and complete the practical study from Demaine et al. on comparison based intersection algorithms. Second, we show that in practice replacing binary search and galloping (one-sided binary) search [4] by interpolation search improves the performance of each main intersection algorithms. Third, we introduce and test variants of interpolation search: this results in an even better intersection algorithm.
Adaptive searching in succinctly encoded binary relations and tree-structured documents (Extended Abstract)
- THEORETICAL COMPUTER SCIENCE
, 2005
"... This paper deals with succinct representations of data types motivated by applications in posting lists for search engines, in querying XML documents, and in the more general setting (which extends XML) of multi-labeled trees, where several labels can be assigned to each node of a tree. To find th ..."
Abstract
-
Cited by 21 (9 self)
- Add to MetaCart
This paper deals with succinct representations of data types motivated by applications in posting lists for search engines, in querying XML documents, and in the more general setting (which extends XML) of multi-labeled trees, where several labels can be assigned to each node of a tree. To find the set of references corresponding to a set of keywords, one typically intersects the list of references associated with each keyword. We view this instead as having a single list of objects [n] = {1,..., n} (the references), each of which has a subset of the labels [σ] = {1,..., σ} (the keywords) associated with it. We are able to find the objects associated with an arbitrary set of keywords in time O(δk lg lg σ) using a data structure requiring only t(lg σ +o(lg σ)) bits, where δ is the number of steps required by a non-deterministic algorithm to check the answer, k is the number of keywords in the query, σ is the size of the set from which the keywords are chosen, and t is the number of associations between references and keywords. The data structure is succinct in that it differs from the space needed to write down all t occurrences of keywords by only a lower order term. An XML document is, for our purpose, a labeled rooted tree. We deal primarily with “non-recursive labeled trees”, where no label occurs more than once on any root to leaf path. We find the set of nodes which path from the root include a set of keywords in the same time, O(δk lg lg σ), on a representation of the tree using essentially minimum space, 2n + n(lg σ + o(lg σ)) bits, where n is the number of nodes in the tree. If we permit nodes to have multiple
Experimental Analysis of a Fast Intersection Algorithm for Sorted Sequences
- In Proceedings of 12th International Conference on String Processing and Information Retrieval (SPIRE
, 2005
"... Abstract. This work presents an experimental comparison of intersection algorithms for sorted sequences, including the recent algorithm of Baeza-Yates. This algorithm performs on average less comparisons than the total number of elements of both inputs (n and m respectively) when n = αm (α> 1). We c ..."
Abstract
-
Cited by 19 (1 self)
- Add to MetaCart
Abstract. This work presents an experimental comparison of intersection algorithms for sorted sequences, including the recent algorithm of Baeza-Yates. This algorithm performs on average less comparisons than the total number of elements of both inputs (n and m respectively) when n = αm (α> 1). We can find applications of this algorithm on query processing in Web search engines, where large intersections, or differences, must be performed fast. In this work we concentrate in studying the behavior of the algorithm in practice, using for the experiments test data that is close to the actual conditions of its applications. We compare the efficiency of the algorithm with other intersection algorithm and we study different optimizations, showing that the algorithm is more efficient than the alternatives in most cases, especially when one of the sequences is much larger than the other. 1
Querying Multiple Sets of Discovered Rules
- In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’02
, 2002
"... Rule mining is an important data mining task that has been applied to numerous real-world applications. Often a rule mining system generates a large number of rules and only a small subset of them is really useful in applications. Although there exist some systems allowing the user to query the disc ..."
Abstract
-
Cited by 15 (4 self)
- Add to MetaCart
Rule mining is an important data mining task that has been applied to numerous real-world applications. Often a rule mining system generates a large number of rules and only a small subset of them is really useful in applications. Although there exist some systems allowing the user to query the discovered rules, they are less suitable for complex ad hoc querying of multiple data mining rulebases to retrieve interesting rules. In this paper, we propose a new powerful rule query language Rule-QL for querying multiple rulebases that is modeled after SQL and has rigorous theoretical foundations of a rule-based calculus. In particular, we first propose a rule-based calculus RC based on the first-order logic, and then present the language Rule-QL that is at least as expressive as the safe fragment of RC. We also propose a number of efficient query evaluation techniques for Rule-QL and test them experimentally on some representative queries to demonstrate the feasibility of Rule-QL.
An Experimental Investigation of Set Intersection Algorithms for Text Searching ⋆
"... Abstract. The intersection of large ordered sets is a common problem in the context of the evaluation of boolean queries to a search engine. In this paper we propose several improved algorithms for computing the intersection of sorted arrays, and in particular for searching sorted arrays in the inte ..."
Abstract
-
Cited by 11 (1 self)
- Add to MetaCart
Abstract. The intersection of large ordered sets is a common problem in the context of the evaluation of boolean queries to a search engine. In this paper we propose several improved algorithms for computing the intersection of sorted arrays, and in particular for searching sorted arrays in the intersection context. We perform an experimental comparison with the algorithms from the previous studies from Demaine, López-Ortiz and Munro [ALENEX 2001], and from Baeza-Yates and Salinger [SPIRE 2005]; in addition, we implement and test the intersection algorithm from Barbay and Kenyon [SODA 2002] and its randomized variant [SAGA 2003]. We consider both the random data set from Baeza-Yates and Salinger, the Google queries used by Demaine et al., a corpus provided by Google and a larger corpus from the TREC Terabyte 2006 efficiency query stream, along with its own query log. We measure the performance both in terms of the number of comparisons and searches performed, and in terms of the CPU time on two different architectures. Our results confirm or improve the results from both previous studies in their respective context (comparison model on real data and CPU measures on random data), and extend them to new contexts. In particular we show that value-based search algorithms perform well in posting lists in terms of the number of comparisons performed. 1
Alternation and Redundancy Analysis of the Intersection Problem
"... The intersection of sorted arrays problem has applications in search engines such as Google. Previous work propose and compare deterministic algorithms for this problem, in an adaptive analysis based on the encoding size of a certificate of the result (cost analysis). We define the alternation analy ..."
Abstract
-
Cited by 7 (2 self)
- Add to MetaCart
The intersection of sorted arrays problem has applications in search engines such as Google. Previous work propose and compare deterministic algorithms for this problem, in an adaptive analysis based on the encoding size of a certificate of the result (cost analysis). We define the alternation analysis, based on the non-deterministic complexity of an instance. In this analysis we prove that there is a deterministic algorithm asymptotically performing as well as any randomized algorithm in the comparison model. We define the redundancy analysis, based on a measure of the internal redundancy of the instance. In this analysis we prove that any algorithm optimal in the redundancy analysis is optimal in the alternation analysis, but that there is a randomized algorithm which performs strictly better than any deterministic algorithm in the comparison model. Finally, we describe how those results can be extended beyond the comparison model.
An Efficient Indexing and Search Technique for Multimedia Databases
- SIGIR Multimedia Information Retrieval Workshop
, 2003
"... We present a novel index-based approach for searching multimedia databases by content. Our approach integrates methods from classical full-text retrieval with the mathematical concept of groups acting on sets. This yields a flexible framework applicable to a wide range of content-based search probl ..."
Abstract
-
Cited by 5 (2 self)
- Add to MetaCart
We present a novel index-based approach for searching multimedia databases by content. Our approach integrates methods from classical full-text retrieval with the mathematical concept of groups acting on sets. This yields a flexible framework applicable to a wide range of content-based search problems such as audio- or image-identification. We propose space-e#cient indexing methods as well as very fast fault-tolerant searching algorithms. In contrast to other approaches, our query response times decrease with incresing query complexity. As a further main benefit, a concept of partial matches is an integral part of our technique. We demonstrate the capabilities of our approach with examples from content-based music- and image-retrieval.
Multi-User File System Search
, 2007
"... Information retrieval research usually deals with globally visible, static document collections. Practical applications, in contrast, like file system search and enterprise search, have to cope with highly dynamic text collections and have to take into account user-specific access permissions when ..."
Abstract
-
Cited by 4 (1 self)
- Add to MetaCart
Information retrieval research usually deals with globally visible, static document collections. Practical applications, in contrast, like file system search and enterprise search, have to cope with highly dynamic text collections and have to take into account user-specific access permissions when generating the results to a search query. The goal of this thesis is to close the gap between information retrieval research and the requirements exacted by these real-life applications. The algorithms and data structures presented in this thesis can be used to implement a file system search engine that is able to react to changes in the file system by updating its index data in real time. File changes (in-sertions, deletions, or modifications) are reflected by the search results within a few seconds,
Adaptive search algorithm for patterns, in succinctly encoded XML
, 2006
"... Abstract. We propose an adaptive algorithm for context queries (queries expressed as preorder and ancestordescendant relations on labeled nodes), which can be used to find patterns in XML documents. Our algorithm takes advantage of the correlation between terms of the query without any preprocessed ..."
Abstract
-
Cited by 2 (2 self)
- Add to MetaCart
Abstract. We propose an adaptive algorithm for context queries (queries expressed as preorder and ancestordescendant relations on labeled nodes), which can be used to find patterns in XML documents. Our algorithm takes advantage of the correlation between terms of the query without any preprocessed information, and it runs in time (kd(lg lg min(n,s)+lg lg(r))) in the RAM model, where k is the number of terms in the query, d is the non-deterministic complexity of the query on the multi-labeled tree (i.e. the minimum number of operations required to check the answer to the query), n is the number of nodes in the tree, s is the number of relations between nodes and labels, and r is the maximal number of nodes matching a label on any rooted path in the tree.

