QPipe: A Simultaneously Pipelined Relational Query Engine
In Proc. SIGMOD, 2005
Cited by 35 (10 self)
Relational DBMS typically execute concurrent queries independently by invoking a set of operator instances for each query. To exploit common data retrievals and computation in concurrent queries, researchers have proposed a wealth of techniques, ranging from buffering disk pages to constructing materialized views and optimizing multiple queries. The ideas proposed, however, are inherently limited by the query-centric philosophy of modern engine designs. Ideally, the query engine should proactively coordinate same-operator execution among concurrent queries, thereby exploiting common accesses to memory and disks as well as common intermediate result computation.
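The contrast drawn above between query-centric execution and coordinated same-operator execution can be illustrated with a toy shared scan. This is a minimal sketch under stated assumptions, not QPipe's actual design: the function names, the predicate-based query model, and the use of a rows-scanned counter as a stand-in for disk and memory accesses are all hypothetical.

```python
def independent_execution(table, queries):
    """Query-centric model: every query runs its own scan of the table."""
    rows_scanned = 0
    results = {}
    for name, predicate in queries.items():
        matched = []
        for row in table:
            rows_scanned += 1
            if predicate(row):
                matched.append(row)
        results[name] = matched
    return results, rows_scanned


def shared_scan_execution(table, queries):
    """Coordinated model: one scan hands each row to every concurrent query."""
    rows_scanned = 0
    results = {name: [] for name in queries}
    for row in table:
        rows_scanned += 1
        for name, predicate in queries.items():
            if predicate(row):
                results[name].append(row)
    return results, rows_scanned


if __name__ == "__main__":
    table = list(range(10))  # hypothetical single-column table
    queries = {"even": lambda r: r % 2 == 0, "big": lambda r: r > 6}
    print(independent_execution(table, queries)[1])
    print(shared_scan_execution(table, queries)[1])
```

Both functions return the same per-query results; only the number of rows fetched differs, which is the kind of common access the abstract says a query engine should exploit.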
Column-Stores vs. Row-Stores: How Different Are They Really?
In SIGMOD, 2008
Cited by 34 (3 self)
There has been a significant amount of excitement and recent work on column-oriented database systems (“column-stores”). These database systems have been shown to perform more than an order of magnitude better than traditional row-oriented database systems (“row-stores”) on analytical workloads such as those found in data warehouses, decision support, and business intelligence applications. The elevator pitch behind this performance difference is straightforward: column-stores are more I/O efficient for read-only queries since they only have to read from disk (or from memory) those attributes accessed by a query. This simplistic view leads to the assumption that one can obtain the performance benefits of a column-store using a row-store: either by vertically partitioning the schema, or by indexing every column so that columns can be accessed independently. In this paper, we demonstrate that this assumption is false. We compare the performance of a commercial row-store under a variety of different configurations with a column-store and show that the row-store performance is significantly slower on a recently proposed data warehouse benchmark. We then analyze the performance difference and show that there are some important differences between the two systems at the query executor level (in addition to the obvious differences at the storage layer level). Using the column-store, we then tease apart these differences, demonstrating the impact on performance of a variety of column-oriented query execution techniques, including vectorized query processing, compression, and a new join algorithm we introduce in this paper. We conclude that while it is not impossible for a row-store to achieve some of the performance advantages of a column-store, changes must be made to both the storage layer and the query executor to fully obtain the benefits of a column-oriented approach.
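The I/O argument in the abstract above (a column-store reads only the attributes a query accesses) can be made concrete with a toy model. This is a minimal sketch, not taken from the paper: the sample table, the helper names, and the idea of counting values touched as a proxy for I/O are illustrative assumptions.

```python
# Hypothetical table: each row is (order_id, customer, region, amount).
ROWS = [
    (1, "alice", "EU", 120.0),
    (2, "bob", "US", 75.5),
    (3, "carol", "EU", 210.0),
]
COLUMNS = ("order_id", "customer", "region", "amount")


def to_column_layout(rows):
    """Store each attribute contiguously: one list per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(COLUMNS)}


def values_read_row_store(rows, needed):
    """Row layout fetches whole rows; `needed` is ignored on purpose."""
    return len(rows) * len(COLUMNS)


def values_read_column_store(columns, needed):
    """Column layout touches only the attributes the query names."""
    return sum(len(columns[name]) for name in needed)


def sum_amount_by_region(columns, region):
    """SUM(amount) WHERE region = ?, evaluated over two columns only."""
    return sum(
        amount
        for r, amount in zip(columns["region"], columns["amount"])
        if r == region
    )


if __name__ == "__main__":
    cols = to_column_layout(ROWS)
    needed = ("region", "amount")
    print(values_read_row_store(ROWS, needed))
    print(values_read_column_store(cols, needed))
    print(sum_amount_by_region(cols, "EU"))
```

This models only the storage-layer difference. The abstract's point is that this view alone is too simplistic, since the query executor also has to change to obtain the full benefit.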
Plug and Play with Query Algebras: SECONDO -- A Generic DBMS Development Environment
2000
Cited by 28 (12 self)
We present SECONDO, a new generic environment supporting the implementation of database systems for a wide range of data models and query languages. On the one hand, this framework is more flexible than common extensible and object-relational systems, offering the full extensibility of second-order signature, the formal basis for data and query language definitions in SECONDO. On the other hand, it is much more complete and structured than database system toolkits. Extensibility is provided by the concept of algebra modules defining and implementing new types (type constructors, in fact) and operators. Support functions are used to register them with the system frame. After a review of second-order signature essentials, this paper presents the system functionality, given by a uniform set of user commands valid for all data models, and the extensible system architecture. All common DBMS features are implemented in the system frame; only purely data model dependent functionality is coded in algebra modules, supported by a variety of tools. Furthermore, we describe the key strategies for extensible query processing in the SECONDO environment and explain the structure of algebra modules.
Building a High-Level Dataflow System on top of Map-Reduce: The Pig Experience
In VLDB '09: Proceedings of the VLDB Endowment, 2009
Cited by 27 (4 self)
Increasingly, organizations capture, transform and analyze enormous data sets. Prominent examples include internet companies and e-science. The Map-Reduce scalable dataflow paradigm has become popular for these applications. Its simple, explicit dataflow programming model is favored by some over the traditional high-level declarative approach: SQL. On the other hand, the extreme simplicity of Map-Reduce leads to much low-level hacking to deal with the many-step, branching dataflows that arise in practice. Moreover, users must repeatedly code standard operations such as join by hand. These practices waste time, introduce bugs, harm readability, and impede optimizations. Pig is a high-level dataflow system that aims at a sweet spot between SQL and Map-Reduce. Pig offers SQL-style high-level data manipulation constructs, which can be assembled in an explicit dataflow and interleaved with custom Map- and Reduce-style functions or executables. Pig programs are compiled into sequences of Map-Reduce jobs, and executed in the Hadoop Map-Reduce environment. Both Pig and Hadoop are open-source projects administered by the Apache Software Foundation. This paper describes the challenges we faced in developing Pig, and reports performance comparisons between Pig execution and raw Map-Reduce execution.
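To give a sense of what coding a join by hand in a Map-Reduce style involves, here is a minimal sketch of a reduce-side equi-join in plain Python. It uses neither the Pig nor the Hadoop API; the `map_phase`, `shuffle`, `reduce_phase`, and `join` names and the sample data are hypothetical, and the in-memory `shuffle` only stands in for the grouping a real framework performs between the two phases.

```python
from collections import defaultdict


def map_phase(users, orders):
    """Emit (join_key, tagged_record) pairs, tagging each record by source."""
    for user_id, name in users:
        yield user_id, ("user", name)
    for user_id, item in orders:
        yield user_id, ("order", item)


def shuffle(pairs):
    """Group values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """For each key, pair every user record with every order record."""
    for key in sorted(groups):
        names = [v for tag, v in groups[key] if tag == "user"]
        items = [v for tag, v in groups[key] if tag == "order"]
        for name in names:
            for item in items:
                yield key, name, item


def join(users, orders):
    """Inner join of users and orders on user_id."""
    return list(reduce_phase(shuffle(map_phase(users, orders))))


if __name__ == "__main__":
    users = [(1, "ann"), (2, "bo")]
    orders = [(1, "book"), (1, "pen"), (3, "cup")]
    print(join(users, orders))
```

Even this small inner join needs source tagging, grouping, and a nested loop, which is the kind of repeated boilerplate the abstract says a higher-level dataflow language is meant to remove.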
AmbientDB: relational query processing in a P2P network
In Proceedings of the International Workshop on Databases, Information Systems and Peer-to-Peer Computing (DBISP2P), LNCS 2788, 2003
Cited by 13 (1 self)
A new generation of applications running on a network of nodes that share data on an ad-hoc basis will benefit from data management services including powerful querying facilities. In this paper, we introduce the goals, assumptions and architecture of AmbientDB, a new peer-to-peer (P2P) DBMS prototype developed at CWI. Our focus is on the query processing facilities of AmbientDB, which are based on a three-level translation of a global query algebra into multiwave stream processing plans, distributed over an ad-hoc P2P network. We illustrate the usefulness of our system by outlining how it eases construction of a music player that generates intelligent playlists with collaborative filtering over distributed music logs. Finally, we show how the use of Distributed Hash Tables (DHT) at the basis of AmbientDB allows applications like the P2P music player to scale to large numbers of nodes.
A Dataflow Approach to Agent-based Information Management
In Proceedings of the 2000 International Conference on Artificial Intelligence, Las Vegas, NV, 2000
Cited by 12 (4 self)
Recent research has made it possible to build information agents that retrieve and integrate information from the World Wide Web. Although there now exist solutions for modeling Web sources, query planning, and information extraction, less attention has been given to the problem of optimizing agent execution. In this paper, we describe Theseus, an efficient plan execution system for information agents. Through its pipelined, dataflow-style architecture, Theseus offers a high degree of parallelism and asynchronous information routing during execution. Theseus differs from prior work in reactive planning systems and parallel databases because it gathers information from the Web, a domain where information retrieval is a problem that is network-bound and is often based on interleaved data gathering and navigation. The Theseus plan language and architecture directly address these issues, resulting in an efficient execution system.
Spatial Database Programming Using SAND
1996
Cited by 12 (6 self)
SAND (Spatial and Non-spatial Data) is an interactive environment that enables the development of spatial database applications. It was designed as a tool for rapid prototyping of algorithms and query evaluation plans dealing with spatial and non-spatial data. In this paper we give an overview of SAND's architecture and illustrate how typical spatial and non-spatial queries can be processed by means of short code fragments.
Keywords: Spatial databases, GIS, query optimization.
1. Introduction
The design of spatial database applications involves many stages. The first stage is choosing a proper development environment, which entails the appraisal of the many existing software packages that will supply the basic facilities needed for the task. The software components most commonly used in such applications are programs or libraries specialized in performing operations on spatial data, while non-spatial data is frequently handled by database management systems (DBMS). In fact, this co...
Customizable Parallel Execution of Scientific Stream
2005
Cited by 11 (5 self)
Scientific applications require processing high-volume on-line streams of numerical data from instruments and simulations. We present an extensible stream database system that allows scalable and flexible continuous queries on such streams. Application-dependent streams and query functions are defined through an object-relational model. Distributed execution plans for continuous queries are described as high-level data flow distribution templates.
Query execution in column-oriented database systems
2008
Cited by 9 (2 self)
There are two obvious ways to map a two-dimensional relational database table onto a one-dimensional storage interface: store the table row-by-row, or store the table column-by-column. Historically, database system implementations and research have focused on the row-by-row data layout, since it performs best on the most common application for database systems: business transactional data processing. However, there is a set of emerging applications for database systems for which the row-by-row layout performs poorly. These applications are more analytical in nature; their goal is to read through the data to gain new insight and use it to drive decision making and planning. In this dissertation, we study the problem of poor performance of row-by-row data layout for these emerging
Experience with SAND-Tcl: A Scripting Tool for Spatial Databases
Cited by 9 (7 self)
The use of scripting makes it possible to overcome many important difficulties in the development of database applications. By extending a general-purpose scripting language with constructs derived both from the database kernel and from the intended application domain, issues such as query processing and user interfacing can be approached in an economical and flexible way. This is illustrated by describing our experience with SAND-Tcl, a scripting tool developed by us for building spatial database applications. SAND-Tcl is an extension of the Tcl embedded scripting language with the constructs of the SAND environment for developing applications involving both spatial and non-spatial data. SAND-Tcl acts as a "glue" to hold together all the subsystems of SAND. In fact, query evaluation plans are SAND-Tcl programs (or scripts) which are written on-the-fly by SAND in response to a query defined by the user. This permits the rapid prototyping of algorithms and makes SAND a useful tool both f...

