Results 1 - 3 of 3
Estimating the selectivity of approximate string queries - ACM Trans. Database Syst.
"... Approximate queries on string data are important due to the prevalence of such data in databases and various conventions and errors in string data. We present the VSol estimator, a novel technique for estimating the selectivity of approximate string queries. The VSol estimator is based on inverse st ..."
Abstract - Cited by 12 (1 self)
Approximate queries on string data are important due to the prevalence of such data in databases and various conventions and errors in string data. We present the VSol estimator, a novel technique for estimating the selectivity of approximate string queries. The VSol estimator is based on inverse strings and makes the performance of the selectivity estimator independent of the number of strings. To get inverse strings we decompose all database strings into overlapping substrings of length q (q-grams) and then associate each q-gram with its inverse string: the IDs of all strings that contain the q-gram. We use signatures to compress inverse strings, and clustering to group similar signatures. We study our technique analytically and experimentally. The space complexity of our estimator only depends on the number of neighborhoods in the database and the desired estimation error. The time to estimate the selectivity is independent of the number of database strings and linear with respect to the length of query string. We give a detailed empirical performance evaluation of our solution for synthetic and real-world datasets. We show that VSol is effective for large skewed databases of short strings. Categories and Subject Descriptors: H.2.4 [Database Management]: Systems—Query processing; textual databases; H.2.8 [Database Management]: Database Applications—Statistical databases
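A minimal Python sketch of the q-gram "inverse string" construction described above. The inverted index follows the abstract directly; the MinHash-style compression of inverse strings is only an illustrative assumption, since the paper's actual signature and clustering scheme is not detailed here.

```python
# Sketch, assuming: MinHash stands in for the paper's (unspecified) signature scheme.
import hashlib
from collections import defaultdict

def qgrams(s, q=3):
    """Decompose a string into its overlapping substrings of length q."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def build_inverse_strings(strings, q=3):
    """Map each q-gram to its inverse string: the IDs of all strings containing it."""
    inverse = defaultdict(set)
    for sid, s in enumerate(strings):
        for g in qgrams(s, q):
            inverse[g].add(sid)
    return inverse

def minhash_signature(id_set, num_hashes=16):
    """Compress an inverse string into a fixed-size signature (illustrative only)."""
    return tuple(
        min(int(hashlib.md5(f"{seed}:{sid}".encode()).hexdigest(), 16) for sid in id_set)
        for seed in range(num_hashes)
    )

strings = ["smith", "smyth", "smithe", "jones"]
inverse = build_inverse_strings(strings, q=2)
signatures = {g: minhash_signature(ids) for g, ids in inverse.items()}
print(inverse["sm"])  # IDs of all strings containing the 2-gram "sm" -> {0, 1, 2}
```

Because estimation works over the (compressed) inverse strings of the query's q-grams rather than over the raw data, the cost of an estimate scales with the query length instead of the number of database strings, as the abstract states.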
We proposed to organize the CleanDB workshop [6]
"... as a forum focusing on the issues to maintain and improve the “Quality of Data ” (QoD) toward clean databases. The existence of poor or erroneous data in databases causes the so-called “Garbage-in, Garbageout” problem. For any mission-critical analysis or applications, the first and foremost task to ..."
Abstract
as a forum focusing on the issues of maintaining and improving the "Quality of Data" (QoD) toward clean databases. The existence of poor or erroneous data in databases causes the so-called "Garbage-in, Garbage-out" problem. For any mission-critical analysis or application, the first and foremost task is to improve the quality of the data. However, as the sources of data become more diverse, their formats more heterogeneous, and the volume of data grows rapidly, maintaining and improving the quality of such data becomes harder. To address these challenging issues, the CleanDB workshop solicited papers on database-centric data quality problems and solutions. The program committee consisted of 27 international members, and each of the 21 submissions received was reviewed by at least two program committee members. The workshop accepted 7 full papers and 2 short papers, presented in a one-day program. The covered topics include XML object identification, quality measures, sensor-data cleaning, and data cleaning. In addition, the program included an invited talk by Divesh Srivastava from AT&T Labs - Research, USA. The summaries here are taken and adapted from the abstracts or conclusions of the actual papers. The accepted papers were divided into three technical sessions and can be downloaded from the workshop website [3], which also has additional information about the program.
Data Quality of Native XML Databases in the Healthcare Domain
"... As XML data is being widely adopted as a data and object exchange format for both structured and semi structured data, the need for quality control and measurement is only to be expected. This can be attributed to the increase in the need for data quality metrics in traditional databases over the pa ..."
Abstract
As XML is being widely adopted as a data and object exchange format for both structured and semi-structured data, the need for quality control and measurement is only to be expected. This can be attributed to the increase in the need for data quality metrics in traditional databases over the past decade. Traditional models provide constraint mechanisms and features to control quality defects, but unfortunately these methods are not foolproof. This report reviews work on data quality in both the database and management research areas. The review includes (i) an exploration of the notion of data quality, its definitions, metrics, control, and improvement in data and information sets, and (ii) an investigation of the techniques used in traditional databases, such as relational and object databases, where most focus and resources have been directed. Despite the wide adoption of XML data since its inception, the exploration not only shows a huge gap between research on data quality in relational databases and in XML databases, but also shows how little support database systems provide for measuring the quality of the data they hold. This points to the need to formalize mechanisms and techniques for embedding data quality control and metrics into XML data sets. It also presents the viability of a process-based approach to data quality measurement with suitable techniques, applicable in dynamic decision environments with multidimensional data and heterogeneous sources. This will involve modelling the interdependencies and categories of the attributes of data quality, generally referred to as data quality dimensions, and the adoption of formal means such as process algebra, fuzzy logic, or other appropriate approaches. The attempt is contextualised using the healthcare domain, as it bears all the required characteristics.
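As a concrete illustration of one such data quality dimension, the following hypothetical Python sketch computes a simple completeness score over XML records. The patient schema, element names, and required fields are assumptions made for the example, not taken from the report.

```python
# Sketch, assuming: a flat <patients>/<patient> schema with required child elements.
import xml.etree.ElementTree as ET

REQUIRED_FIELDS = ["name", "dob", "blood_type"]  # assumed required elements

def completeness(xml_text, record_tag="patient", required=REQUIRED_FIELDS):
    """Average fraction of required elements that are present and non-empty per record."""
    root = ET.fromstring(xml_text)
    scores = []
    for record in root.iter(record_tag):
        present = sum(
            1 for f in required
            if record.findtext(f) and record.findtext(f).strip()
        )
        scores.append(present / len(required))
    return sum(scores) / len(scores) if scores else 0.0

sample = """
<patients>
  <patient><name>A. Okafor</name><dob>1980-02-01</dob><blood_type>O+</blood_type></patient>
  <patient><name>B. Lin</name><dob></dob></patient>
</patients>
"""
print(completeness(sample))  # 1.0 and 1/3 per record -> average 0.666...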