Results 1 - 10
of
295
Data Mining: An Overview from Database Perspective
- IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING
, 1996
"... Mining information and knowledge from large databases has been recognized by many researchers as a key research topic in database systems and machine learning, and by many industrial companies as an important area with an opportunity of major revenues. Researchers in many different fields have sh ..."
Abstract
-
Cited by 532 (26 self)
- Add to MetaCart
Mining information and knowledge from large databases has been recognized by many researchers as a key research topic in database systems and machine learning, and by many industrial companies as an important area with an opportunity of major revenues. Researchers in many different fields have shown great interest in data mining. Several emerging applications in information providing services, such as data warehousing and on-line services over the Internet, also call for various data mining techniques to better understand user behavior, to improve the service provided, and to increase the business opportunities. In response to such a demand, this article is to provide a survey, from a database researcher's point of view, on the data mining techniques developed recently. A classification of the available data mining techniques is provided and a comparative study of such techniques is presented.
Why and Where: A Characterization of Data Provenance
- In ICDT
, 2001
"... With the proliferation of database views and curated databases, the issue of data provenance # where a piece of data came from and the process by which it arrived in the database # is becoming increasingly important, especially in scienti#c databases where understanding provenance is crucial to ..."
Abstract
-
Cited by 430 (18 self)
- Add to MetaCart
(Show Context)
With the proliferation of database views and curated databases, the issue of data provenance # where a piece of data came from and the process by which it arrived in the database # is becoming increasingly important, especially in scienti#c databases where understanding provenance is crucial to the accuracy and currency of data. In this paper we describe an approach to computing provenance when the data of interest has been created by a database query.We adopt a syntactic approach and present results for a general data model that applies to relational databases as well as to hierarchical data such as XML. A novel aspect of our work is a distinction between #why" provenance #refers to the source data that had some in#uence on the existence of the data# and #where" provenance #refers to the location#s# in the source databases from which the data was extracted#.
An overview of data warehousing and OLAP technology
- SIGMOD Record
, 1997
"... Data warehousing and on-line analytical processing (OLAP) are essential elements of decision support, which has increasingly become a focus of the database industry. Many commercial products and services are now available, and all of the principal database management system vendors now have offering ..."
Abstract
-
Cited by 386 (3 self)
- Add to MetaCart
Data warehousing and on-line analytical processing (OLAP) are essential elements of decision support, which has increasingly become a focus of the database industry. Many commercial products and services are now available, and all of the principal database management system vendors now have offerings in these areas. Decision support places some rather different requirements on database technology compared to traditional on-line transaction processing applications. This paper provides an overview of data warehousing and OLAP technologies, with an emphasis on their new requirements. We describe back end tools for extracting, cleaning and loading data into a data warehouse; multidimensional data models typical of OLAP; front end client tools for querying and data analysis; server extensions for efficient query processing; and tools for metadata management and for managing the warehouse. In addition to surveying the state of the art, this paper also identifies some promising research issues, some of which are related to problems that the database research community has worked on for years, but others are only just beginning to be addressed. This overview is based on a tutorial that the authors presented at the VLDB Conference, 1996. 1.
Maintenance of Materialized Views: Problems, Techniques, and Applications
, 1995
"... In this paper we motivate and describe materialized views, their applications, and the problems and techniques for their maintenance. We present a taxonomy of view maintenanceproblems basedupon the class of views considered, upon the resources used to maintain the view, upon the types of modi#cati ..."
Abstract
-
Cited by 314 (9 self)
- Add to MetaCart
In this paper we motivate and describe materialized views, their applications, and the problems and techniques for their maintenance. We present a taxonomy of view maintenanceproblems basedupon the class of views considered, upon the resources used to maintain the view, upon the types of modi#cations to the base data that areconsidered during maintenance, and whether the technique works for all instances of databases and modi#cations. We describe some of the view maintenancetechniques proposed in the literature in terms of our taxonomy. Finally, we consider new and promising application domains that are likely to drive work in materialized views and view maintenance. 1 Introduction What is a view? A view is a derived relation de#ned in terms of base #stored# relations. A view thus de#nes a function from a set of base tables to a derived table; this function is typically recomputed every time the view is referenced. What is a materialized view? A view can be materialized by storin...
Research Problems in Data Warehousing
, 1995
"... The topic of data warehousing encompasses architectures, algorithms, and tools for bringing together selected data from multiple databases or other information sources into a single repository, called a data warehouse, suitable for direct querying or analysis. In recent years data warehousing has be ..."
Abstract
-
Cited by 297 (9 self)
- Add to MetaCart
(Show Context)
The topic of data warehousing encompasses architectures, algorithms, and tools for bringing together selected data from multiple databases or other information sources into a single repository, called a data warehouse, suitable for direct querying or analysis. In recent years data warehousing has become a prominent buzzword in the database industry, but attention from the database research community has been limited. In this paper we motivate the concept of a data warehouse, we outline a general data warehousing architecture, and we propose a number of technical issues arising from the architecture that we believe are suitable topics for exploratory research. 1 Introduction Providing integrated access to multiple, distributed, heterogeneous databases and other information sources has become one of the leading issues in database research and industry #6#. In the research community, most approaches to the data integration problem are based on the following very general two-step process...
Selection of Views to Materialize in a Data Warehouse
, 1997
"... . A data warehouse stores materialized views of data from one or more sources, with the purpose of efficiently implementing decisionsupport or OLAP queries. One of the most important decisions in designing a data warehouse is the selection of materialized views to be maintained at the warehouse. The ..."
Abstract
-
Cited by 246 (5 self)
- Add to MetaCart
. A data warehouse stores materialized views of data from one or more sources, with the purpose of efficiently implementing decisionsupport or OLAP queries. One of the most important decisions in designing a data warehouse is the selection of materialized views to be maintained at the warehouse. The goal is to select an appropriate set of views that minimizes total query response time and the cost of maintaining the selected views, given a limited amount of resource, e.g., materialization time, storage space etc. In this article, we develop a theoretical framework for the general problem of selection of views in a data warehouse. We present competitive polynomial-time heuristics for selection of views to optimize total query response time, for some important special cases of the general data warehouse scenario, viz.: (i) an AND view graph, where each query/view has a unique evaluation, and (ii) an OR view graph, in which any view can be computed from any one of its related views, e.g.,...
Synchronizing a database to Improve Freshness
, 1999
"... In this paper we study how to refresh a local copy of an autonomous data source to maintain the copy up-to-date. As the size of the data grows, it becomes more di#cult to maintain the copy "fresh," making it crucial to synchronize the copy e#ectively. We define two freshness metrics, chang ..."
Abstract
-
Cited by 195 (17 self)
- Add to MetaCart
(Show Context)
In this paper we study how to refresh a local copy of an autonomous data source to maintain the copy up-to-date. As the size of the data grows, it becomes more di#cult to maintain the copy "fresh," making it crucial to synchronize the copy e#ectively. We define two freshness metrics, change models of the underlying data, and synchronization policies. We analytically study how effective the various policies are. We also experimentally verify our analysis, based on data collected from 270 web sites for more than 4 months, and we show that our new policy improves the "freshness" very significantly compared to current policies in use.
Change Detection in Hierarchically Structured Information
- In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data
, 1996
"... {chaw,anand,hector, widom}%s.stanford. edu Detecting and representing changes to data is important for active databases, data warehousing, view maintenance, and version and configuration management. Most previous work in change management has dealt with flat-file and relational data; we focus on hie ..."
Abstract
-
Cited by 183 (11 self)
- Add to MetaCart
{chaw,anand,hector, widom}%s.stanford. edu Detecting and representing changes to data is important for active databases, data warehousing, view maintenance, and version and configuration management. Most previous work in change management has dealt with flat-file and relational data; we focus on hierarchically structured data. Since in many cases changes must be computed from old and new versions of the data, we define the hierarchical change detection problem as the problem of finding a “minimum-cost edit script ” that transforms one data tree to another, and we present efficient algorithms for computing such an edit script. Our algorithms make use of some key domain characteristics to acKleve substantially better performance than previous, general-purpose algorithms. We study the performance of our algorithms both analytically and empirically, and we describe the application of our techniques to hierarchically structured documents. 1
Continual Queries for Internet Scale Event-Driven Information Delivery
- IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING
, 1999
"... In this paper we introduce the concept of continual queries, describe the design of a distributed event-driven continual query system -- OpenCQ, and outline the initial implementation of OpenCQ on top of the distributed interoperable information mediation system DIOM [21, 19]. Continual queries a ..."
Abstract
-
Cited by 179 (14 self)
- Add to MetaCart
In this paper we introduce the concept of continual queries, describe the design of a distributed event-driven continual query system -- OpenCQ, and outline the initial implementation of OpenCQ on top of the distributed interoperable information mediation system DIOM [21, 19]. Continual queries are standing queries that monitor update of interest and return results whenever the update reaches specified thresholds. In OpenCQ, users may specify to the system the information they would like to monitor (such as the events or the update thresholds they are interested in). Whenever the information of interest becomes available, the system immediately delivers it to the relevant users; otherwise, the system continually monitors the arrival of the desired information and pushes it to the relevant users as it meets the specified update thresholds. In contrast to conventional pull-based data management systems such as DBMSs and Web search engines, OpenCQ exhibits two important featu...
Estimating Frequency of Change
- ACM TRANSACTIONS ON INTERNET TECHNOLOGY
, 2000
"... Many online data sources are updated autonomously and independently. In this paper, we make the case for estimating the change frequency of the data, to improve web crawlers, web caches and to help data mining. We first identify various scenarios, where different applications have different requirem ..."
Abstract
-
Cited by 149 (9 self)
- Add to MetaCart
Many online data sources are updated autonomously and independently. In this paper, we make the case for estimating the change frequency of the data, to improve web crawlers, web caches and to help data mining. We first identify various scenarios, where different applications have different requirements on the accuracy of the estimated frequency. Then we develop several "frequency estimators" for the identified scenarios. In developing the estimators, we analytically show how precise/effective the estimators are, and we show that the estimators that we propose can improve precision significantly.