## Structural selectivity estimation for XML documents (2007)

### Download Links

- [wwwcs.uni-paderborn.de]
- [www2.cs.uni-paderborn.de]
- [www.cse.unsw.edu.au]
- DBLP

### Other Repositories/Bibliography

Venue: In ICDE

Citations: 10 (3 self)

### BibTeX

@INPROCEEDINGS{Fisher07structuralselectivity,
  author = {Damien K. Fisher},
  title = {Structural selectivity estimation for XML documents},
  booktitle = {In ICDE},
  year = {2007}
}

### Abstract

Estimating the selectivity of queries is a crucial problem in database systems. Virtually all database systems rely on the use of selectivity estimates to choose amongst the many possible execution plans for a particular query. In terms of XML databases, the problem of selectivity estimation of queries presents new challenges: many evaluation operators are possible, such as simple navigation, structural joins, or twig joins, and many different indexes are possible, ranging from traditional B-trees to complicated XML-specific graph indexes. A new synopsis for XML documents is introduced which can be effectively used to estimate the selectivity of complex path queries. The synopsis is based on a lossy compression of the document tree that underlies the XML document, and can be computed in one pass from the document. It has several advantages over existing approaches: (1) it allows one to estimate the selectivity of queries containing all XPath axes, including the order-sensitive ones, (2) the estimator returns a range within which the actual selectivity is guaranteed to lie, with the size of this range implicitly providing a confidence measure of the estimate, and (3) the synopsis can be incrementally updated to reflect changes in the XML database.

### Citations

273 | Efficient algorithms for processing XPath queries
- Gottlob, Koch, et al.
- 2002
Citation Context ...rlying document. As we shall see in our experiments, our construction cost is between 50 and 100 times faster than for other synopses. • Our synopsis can give selectivity estimates for any Core XPath =-=[9]-=- query, including those which make use of order-sensitive axes. • Unlike other selectivity estimation strategies, our approach returns a range within which the exact selectivity is guaranteed to lie. ... |

169 | XML path language (XPath) version 1.0
- Clark, DeRose
- 1999
Citation Context ...ent; therefore, for convenience we write λE(q) = λE(〈PARENT(q), q〉). One of the vertices of Q, mQ ∈ VQ, is the match node (cf. the boxed node in Fig. 2(a)). The semantics of an XPath query is well known =-=[7]-=-, and so we only briefly summarize it here. An embedding of a query Q in a document D is a tree homomorphism h : VQ → VD such that, for every node v of Q, h(v) has the... [Figure 1: A sample XML document.] |

125 | An algorithm for optimal lambda calculus reductions
- Lamping
- 1990
Citation Context ...pattern “c(d(” appears three times in the tree. The idea of sharing tree patterns gave rise to the notion of sharing graphs, which were studied in the context of optimal reductions of lambda-calculus =-=[10]-=-. The problem of finding a smallest sharing graph for a given tree is NP-complete. The first approximation algorithm for finding a small sharing graph is the BPLEX algorithm of [5]. Instead of sharing... |

123 | XPath: Looking Forward
- Olteanu, Meuss, et al.
- 2002
Citation Context ... can be handled in an analogous fashion to the others. The remaining axes can be divided into forward and reverse axes: in this paper, we need to consider only the forward axes, as Olteanu et al. =-=[14]-=- have demonstrated that any query involving reverse axes can be rewritten into one using only forward axes. Additionally, it is trivial to rewrite the descendant axis in terms of the descendant-or-sel... |

96 | Estimating the Selectivity of XML Path Expressions for Internet Scale Applications. VLDB
- Aboulnaga, Alameldeen, et al.
- 2001
Citation Context ...tem, being able to accurately estimate the result size of the sub-expressions in a query is of great practical importance. There has been a lot of work on this problem in the context of XML databases =-=[1, 12, 6, 9, 23, 20, 15, 16, 17, 18, 21]-=-. All previous work suffers from some combination of the following problems: Expensive construction: A problem with many techniques is that synopsis construction is extremely expensive. Any algorithm ... |

58 | Structure and Value Synopses for XML Data Graphs
- Polyzotis, Garofalakis

55 | StatiX: making XML count
- Freire, Haritsa, et al.
Citation Context ...tem, being able to accurately estimate the result size of the sub-expressions in a query is of great practical importance. There has been a lot of work on this problem in the context of XML databases =-=[1, 12, 6, 9, 23, 20, 15, 16, 17, 18, 21]-=-. All previous work suffers from some combination of the following problems: Expensive construction: A problem with many techniques is that synopsis construction is extremely expensive. Any algorithm ... |

53 | Statistical Synopses for Graph-Structured XML Databases
- Polyzotis, Garofalakis
- 2002

50 | Estimating answer sizes for xml queries
- Wu, Patel, et al.
- 2002
Citation Context ...fiers. Ramanath et al. [19] extended StatiX with updates, but their work still suffers from the same limitations. Research upon the estimation of result sizes for the structural join operator, such as =-=[24,27]-=-, is also relevant to selectivity estimation for path expressions of the form //p1//p2. Unfortunately, these results cannot be easily generalized to other path expressions. Polyzotis and Garofalakis... |

47 | XBench benchmark and performance testing of XML DBMSs
- Yao, Özsu, et al.
- 2004
Citation Context ...Average F/B (MB) Count Depth Depth Size DBLP [11] 43.61 1103703 5 3.00 1158 SwissProt [2] 30.29 756329 6 4.39 21441 XMark [22] 5.34 78414 12 5.56 35558 PSD [26] 683.64 21305818 7 5.45 1944543 Catalog =-=[28]-=- 10.36 225194 8 5.65 235 Table 1: Characteristics of experimental data sets. For our data sets, we chose DBLP [11], XMark [22], SwissProt [2], and the Protein Sequence Database [26]. These data sets h... |

44 | Approximate xml query answers
- Polyzotis, Garofalakis, et al.
- 2004
Citation Context ...ds, using a synopsis size of about 62 KB (0.24%) for SwissProt. We obtained an implementation of the TreeSketch estimation structure, which allows us to give a more direct comparison with the work of =-=[17]-=- (to our knowledge, this is the most competitive XML selectivity estimator currently available). We compared our work with this implementation using the XMark database, however, we had to slightly sim... |

40 | Path queries on compressed XML
- Buneman, Grohe, et al.
- 2003
Citation Context ...nces of equal subtrees and to replace them by pointers to a single occurrence of the subtree. In this way, the minimal unique DAG (directed acyclic graph) of a tree can be computed in linear time. In =-=[4]-=- this idea was applied to XML document trees, and it was shown that for most document trees, the size of the minimal DAG is approximately 10% of the size of the original tree (where size is measured a... |
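The subtree-sharing idea quoted in this context can be illustrated with a minimal sketch. It is not the implementation from the cited work: the `(label, children)` tuple representation and the names `minimal_dag` and `tree_size` are assumptions made only for this example. Each distinct subtree is stored once in a table keyed by its label and the ids of its already-shared children, so repeated subtrees collapse to a single DAG node.

```python
# Hypothetical representation: a tree node is (label, [child nodes]).

def tree_size(node):
    """Number of nodes in the unshared tree."""
    label, children = node
    return 1 + sum(tree_size(child) for child in children)


def minimal_dag(tree):
    """Share equal subtrees by hash-consing.

    Returns (root_id, nodes), where nodes[i] is (label, tuple_of_child_ids).
    Two subtrees receive the same id exactly when they are equal.
    """
    table = {}   # (label, child ids) -> node id
    nodes = []   # node id -> (label, child ids)

    def visit(node):
        label, children = node
        # Children are shared first, so equal subtrees yield equal keys.
        key = (label, tuple(visit(child) for child in children))
        if key not in table:
            table[key] = len(nodes)
            nodes.append(key)
        return table[key]

    return visit(tree), nodes


# The subtree b(c, d) occurs twice and the leaf c three times.
example = ("a", [("b", [("c", []), ("d", [])]),
                 ("b", [("c", []), ("d", [])]),
                 ("c", [])])

root_id, dag_nodes = minimal_dag(example)
print(tree_size(example), len(dag_nodes))
```

On this small input the 8-node tree collapses to a 4-node DAG. The recursion visits each tree node once, which matches the linear-time flavour described in the quoted context, assuming constant-time dictionary operations.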

40 | Bloom Histogram: Path Selectivity Estimation for XML Data with Updates
- Wang, Jiang, et al.
Citation Context ...s. These heuristics, while based on well-justified assumptions in many cases, do not provide any guarantee of accuracy, and hence the computed estimate can be wildly inaccurate. With the exception of =-=[25]-=-, no previous technique gives the user any sort of confidence measure on the result. In this work, we extend recent work on the lossless compression of XML [5] to the problem of selectivity estimation... |

35 | XPathLearner: An on-line self-tuning markov histogram for XML path selectivity estimation
- Lim, Wang, et al.
Citation Context ...ensive; their experiments also demonstrate that their schemes have inconsistent performance. The idea of using a Markov table is extended to adaptive selectivity estimation by the XPathLearner system =-=[12]-=-, which uses feedback from the query processor. The first paper to study the problem of selectivity estimation for more complicated queries is that of Chen et al [6]; they use pruned suffix trees to e... |

34 | Efficient memory representation of xml document trees
- Busatto, Lohrey, et al.
- 2008
Citation Context ...ildly inaccurate. With the exception of [21], no previous technique gives the user any sort of confidence measure on the result. In this work, we extend recent work on the lossless compression of XML =-=[5]-=- to the problem of selectivity estimation. Our work has the following advantages over previous approaches: • Our synopsis can be constructed in a single pass of the underlying document. As we shall se... |

26 | The Universal Protein Resource (UniProt)
- Bairoch, Apweiler, et al.
- 2005
Citation Context ...lgorithm was run with maximal rank 5, maximal right-hand side size 20, and window size 1000, cf. [5] for more details on these parameters. For our data sets, we chose DBLP [11], XMark [19], SwissProt =-=[2]-=-, and the Protein Sequence Database [22]. These data sets have intrinsically different structures, ranging from the simplest (DBLP) to the most complicated (XMark). The following table gives the salien... |

21 | Selectivity estimation for xml twigs
- Polyzotis, Garofalakis, et al.
- 2004
Citation Context ...[Figure panel (a) Updates: "Update Performance" plotted against "Number of Updates", with series "Insertions only" and "80% Insertions, 20% Deletions"] ...al [6], the XSketch structural synopsis of Polyzotis et al. =-=[18]-=-, and StatiX [8]. The authors of [6] and [8] were kind enough to provide us with an implementation; unfortunately, we did not succeed in getting the implementations to run on our query workloads, even... |

20 | Finite tree automata with cost functions, Theoret
- Seidl
Citation Context ...ment is straightforward, as we have seen above. However, in the context of selectivity estimation, we do not want to test acceptance, but instead want to return the size of result of the query. Seidl =-=[23]-=- developed a framework for finite tree automata with cost functions which addresses such problems: each transition in the automaton is assigned a cost, and the task is then to find the “cheapest” acce... |

16 | XMark: A Benchmark for XML Data Management
- Schmidt, Waas, et al.
- 2002
Citation Context ...Section 4.1 for an explanation of these parameters. Data Set Size Element Max Average F/B (MB) Count Depth Depth Size DBLP [11] 43.61 1103703 5 3.00 1158 SwissProt [2] 30.29 756329 6 4.39 21441 XMark =-=[22]-=- 5.34 78414 12 5.56 35558 PSD [26] 683.64 21305818 7 5.45 1944543 Catalog [28] 10.36 225194 8 5.65 235 Table 1: Characteristics of experimental data sets. For our data sets, we chose DBLP [11], XMark ... |

15 | The digital bibliography & library project
- Ley
- 1997
Citation Context ...and side 20, and window size 40000 (1000 in the case of updates), cf. end of Section 4.1 for an explanation of these parameters. Data Set Size Element Max Average F/B (MB) Count Depth Depth Size DBLP =-=[11]-=- 43.61 1103703 5 3.00 1158 SwissProt [2] 30.29 756329 6 4.39 21441 XMark [22] 5.34 78414 12 5.56 35558 PSD [26] 683.64 21305818 7 5.45 1944543 Catalog [28] 10.36 225194 8 5.65 235 Table 1: Characteris... |

15 | Containment join size estimation: models and methods
- Wang, Jiang, et al.
- 2003
Citation Context ...fiers. Ramanath et al. [19] extended StatiX with updates, but their work still suffers from the same limitations. Research upon the estimation of result sizes for the structural join operator, such as =-=[24,27]-=-, is also relevant to selectivity estimation for path expressions of the form //p1//p2. Unfortunately, these results cannot be easily generalized to other path expressions. Polyzotis and Garofalakis... |

8 | A General Framework for Estimating XML Query Cardinality
- Sartiani
- 2003
Citation Context ...eness of their approach, their technique requires the combination of two separate sketch structures, and the effect of the interaction of these structures on the estimation error is unclear. Sartiani =-=[20,21]-=- has developed a general framework which can extend existing work on selectivity estimation to the problem of estimation for XQuery. As his technique is quite general, it can be applied to the work of... |

7 | The Protein Information Resource
- Wu, et al.
- 2003
Citation Context ...these parameters. Data Set Size Element Max Average F/B (MB) Count Depth Depth Size DBLP [11] 43.61 1103703 5 3.00 1158 SwissProt [2] 30.29 756329 6 4.39 21441 XMark [22] 5.34 78414 12 5.56 35558 PSD =-=[26]-=- 683.64 21305818 7 5.45 1944543 Catalog [28] 10.36 225194 8 5.65 235 Table 1: Characteristics of experimental data sets. For our data sets, we chose DBLP [11], XMark [22], SwissProt [2], and the Prote... |

5 | Counting twig matches in a tree
- Chen, et al.
- 2001
Citation Context ...tem, being able to accurately estimate the result size of the sub-expressions in a query is of great practical importance. There has been a lot of work on this problem in the context of XML databases =-=[1, 12, 6, 9, 23, 20, 15, 16, 17, 18, 21]-=-. All previous work suffers from some combination of the following problems: Expensive construction: A problem with many techniques is that synopsis construction is extremely expensive. Any algorithm ... |

4 | Tree automata and XPath on compressed trees
- Lohrey, Maneth
- 2005
Citation Context ...rammars over other compressed structures are: (1) they can be represented in a highly succinct way (see Section 7), and (2) they can be queried in a direct and natural way without prior decompression =-=[13]-=-. In particular, it is shown in Section 5 how to translate XPath queries into certain tree automata which can be executed on SLT grammars. 4.1 Tree Compression using SLT Grammars Most XML documents ar... |

2 | IMAX: The big picture of dynamic XML statistics
- Ramanath, Zhang, et al.
- 2005
Citation Context ...ver, these histograms are built over the object identifiers of the nodes, which means that the quality of their estimation is highly dependent on the distribution of these identifiers. Ramanath et al =-=[19]-=- extended StatiX with updates, but their work still suffers from the same limitations. Research upon the estimation of result sizes for the structural join operator, such as [24,27], is also relevant ... |

1 | Two simplified algorithms for maintaining order in a list
- Bender, et al.
- 2002
Citation Context ...e synopsis from scratch. For larger synopses, we split the encoding into an array of blocks, leaving padding in each block. A standard ordered file maintenance algorithm, such as that of Bender et al. =-=[3]-=-, can then be used to speed up insertions and deletions (for an array of n elements, we can insert and delete elements maintaining the order of the array in O(log² n) time). 7 Experiments In this sect... |