Results 1 - 10
of
41
Pig Latin: A Not-So-Foreign Language for Data Processing
"... There is a growing need for ad-hoc analysis of extremely large data sets, especially at internet companies where innovation critically depends on being able to analyze terabytes of data collected every day. Parallel database products, e.g., Teradata, offer a solution, but are usually prohibitively e ..."
Abstract
-
Cited by 196 (10 self)
- Add to MetaCart
There is a growing need for ad-hoc analysis of extremely large data sets, especially at internet companies where innovation critically depends on being able to analyze terabytes of data collected every day. Parallel database products, e.g., Teradata, offer a solution, but are usually prohibitively expensive at this scale. Besides, many of the people who analyze this data are entrenched procedural programmers, who find the declarative, SQL style to be unnatural. The success of the more procedural map-reduce programming model, and its associated scalable implementations on commodity hardware, is evidence of the above. However, the map-reduce paradigm is too low-level and rigid, and leads to a great deal of custom user code that is hard to maintain, and reuse. We describe a new language called Pig Latin that we have designed to fit in a sweet spot between the declarative style of SQL, and the low-level, procedural style of map-reduce. The accompanying system, Pig, is fully implemented, and compiles Pig Latin into physical plans that are executed over Hadoop, an open-source, map-reduce implementation. We give a few examples of how engineers at Yahoo! are using Pig to dramatically reduce the time required for the development and execution of their data analysis tasks, compared to using Hadoop directly. We also report on a novel debugging environment that comes integrated with Pig, that can lead to even higher productivity gains. Pig is an open-source, Apache-incubator project, and available for general use. 1.
PNUTS: Yahoo!’s hosted data serving platform
- IN PROC. 34TH VLDB
, 2008
"... We describe PNUTS, a massively parallel and geographically distributed database system for Yahoo!’s web applications. PNUTS provides data storage organized as hashed or ordered tables, low latency for large numbers of concurrent requests including updates and queries, and novel per-record consistenc ..."
Abstract
-
Cited by 60 (2 self)
- Add to MetaCart
We describe PNUTS, a massively parallel and geographically distributed database system for Yahoo!’s web applications. PNUTS provides data storage organized as hashed or ordered tables, low latency for large numbers of concurrent requests including updates and queries, and novel per-record consistency guarantees. It is a hosted, centrally managed, and geographically distributed service, and utilizes automated load-balancing and failover to reduce operational complexity. The first version of the system is currently serving in production. We describe the motivation for PNUTS and the design and implementation of its table storage and replication layers, and then present experimental results.
Benchmarking cloud serving systems with ycsb
- SoCC
"... While the use of MapReduce systems (such as Hadoop) for large scale data analysis has been widely recognized and studied, we have recently seen an explosion in the number of systems developed for cloud data serving. These newer systems address “cloud OLTP ” applications, though they typically do not ..."
Abstract
-
Cited by 43 (0 self)
- Add to MetaCart
While the use of MapReduce systems (such as Hadoop) for large scale data analysis has been widely recognized and studied, we have recently seen an explosion in the number of systems developed for cloud data serving. These newer systems address “cloud OLTP ” applications, though they typically do not support ACID transactions. Examples of systems proposed for cloud serving use include BigTable, PNUTS, Cassandra, HBase, Azure, CouchDB, SimpleDB, Voldemort, and many others. Further, they are being applied to a diverse range of applications that differ considerably from traditional (e.g., TPC-C like) serving workloads. The number of emerging cloud serving systems and the wide range of proposed applications, coupled with a lack of applesto-apples performance comparisons, makes it difficult to understand the tradeoffs between systems and the workloads for which they are suited. We present the Yahoo! Cloud Serving Benchmark (YCSB) framework, with the goal of facilitating performance comparisons of the new generation of cloud data serving systems. We define a core set of benchmarks and report results for four widely used systems: Cassandra, HBase, Yahoo!’s PNUTS, and a simple sharded MySQL implementation. We also hope to foster the development of additional cloud benchmark suites that represent other classes of applications by making our benchmark tool available via open source. In this regard, a key feature of the YCSB framework/tool is that it is extensible—it supports easy definition of new workloads, in addition to making it easy to benchmark new systems.
Building a Database on S3
- SIGMOD
, 2008
"... There has been a great deal of hype about Amazon’s simple storage service (S3). S3 provides infinite scalability and high availability at low cost. Currently, S3 is used mostly to store multi-media documents (videos, photos, audio) which are shared by a community of people and rarely updated. The pu ..."
Abstract
-
Cited by 30 (5 self)
- Add to MetaCart
There has been a great deal of hype about Amazon’s simple storage service (S3). S3 provides infinite scalability and high availability at low cost. Currently, S3 is used mostly to store multi-media documents (videos, photos, audio) which are shared by a community of people and rarely updated. The purpose of this paper is to demonstrate the opportunities and limitations of using S3 as a storage system for general-purpose database applications which involve small objects and frequent updates. Read, write, and commit protocols are presented. Furthermore, the cost ($), performance, and consistency properties of such a storage system are studied.
Boom analytics: exploring data-centric, declarative programming for the cloud
- In EuroSys
, 2010
"... Building and debugging distributed software remains extremely difficult. We conjecture that by adopting a datacentric approach to system design and by employing declarative programming languages, a broad range of distributed software can be recast naturally in a data-parallel programming model. Our ..."
Abstract
-
Cited by 14 (3 self)
- Add to MetaCart
Building and debugging distributed software remains extremely difficult. We conjecture that by adopting a datacentric approach to system design and by employing declarative programming languages, a broad range of distributed software can be recast naturally in a data-parallel programming model. Our hope is that this model can significantly raise the level of abstraction for programmers, improving code simplicity, speed of development, ease of software evolution, and program correctness. This paper presents our experience with an initial largescale experiment in this direction. First, we used the Overlog language to implement a “Big Data ” analytics stack that is API-compatible with Hadoop and HDFS and provides comparable performance. Second, we extended the system with complex distributed features not yet available in Hadoop, including high availability, scalability, and unique monitoring and debugging facilities. We present both quantitative and anecdotal results from our experience, providing some concrete evidence that both data-centric design and declarative languages can substantially simplify distributed systems programming.
Consistability: Describing usually consistent systems
"... Current weak consistency semantics provide worst-case guarantees to clients. These guarantees fail to adequately describe systems that provide varying levels of consistency in the face of distinct failure modes, or that achieve better than worst-case guarantees during normal execution. The inability ..."
Abstract
-
Cited by 6 (0 self)
- Add to MetaCart
Current weak consistency semantics provide worst-case guarantees to clients. These guarantees fail to adequately describe systems that provide varying levels of consistency in the face of distinct failure modes, or that achieve better than worst-case guarantees during normal execution. The inability to make precise statements about consistency throughout a system’s execution represents a lost opportunity to clearly understand client application requirements and to optimize systems and services appropriately. In this position paper, we motivate the need for and introduce the concept of consistability—a unified metric of consistency and availability. Consistability offers a means of describing, specifying, and discussing how much consistency a usually consistent system provides, and how often it does so. We describe our initial results of applying consistability reasoning to a keyvalue store we are developing and to other recent distributed systems. We also discuss the limitations of our consistability definition. 1
Fawndamentally power-efficient clusters
- In HotOS
, 2009
"... Power is becoming an increasingly large financial and scaling burden for computing and society. The costs of running large data centers are becoming dominated by power and cooling to the degree that companies such as ..."
Abstract
-
Cited by 6 (0 self)
- Add to MetaCart
Power is becoming an increasingly large financial and scaling burden for computing and society. The costs of running large data centers are becoming dominated by power and cooling to the degree that companies such as
Securing Structured Overlays Against Identity Attacks
"... Abstract—Structured overlay networks can greatly simplify data storage and management for a variety of distributed applications. Despite their attractive features, these overlays remain vulnerable to the Identity attack, where malicious nodes assume control of application components by intercepting ..."
Abstract
-
Cited by 5 (1 self)
- Add to MetaCart
Abstract—Structured overlay networks can greatly simplify data storage and management for a variety of distributed applications. Despite their attractive features, these overlays remain vulnerable to the Identity attack, where malicious nodes assume control of application components by intercepting and hijacking key-based routing (KBR) requests. Attackers can assume arbitrary application roles such as storage node for a given file, or return falsified contents of an online shopper’s shopping cart. In this paper, we define a generalized form of the Identity attack, and propose a light-weight detection and tracking system that protects applications by redirecting traffic away from attackers. We describe how this attack can be amplified by a Sybil or Eclipse attack, and analyze the costs of performing such an attack. Finally, we present measurements of a deployed overlay that show our techniques to be significantly more light-weight than prior techniques, and highly effective at detecting and avoiding both single node and colluding attacks under a variety of conditions. I.
Transactional storage for geo-replicated systems
- In SOSP
, 2011
"... We describe the design and implementation of Walter, a key-value store that supports transactions and replicates data across distant sites. A key feature behind Walter is a new property called Parallel Snapshot Isolation (PSI). PSI allows Walter to replicate data asynchronously, while providing stro ..."
Abstract
-
Cited by 5 (0 self)
- Add to MetaCart
We describe the design and implementation of Walter, a key-value store that supports transactions and replicates data across distant sites. A key feature behind Walter is a new property called Parallel Snapshot Isolation (PSI). PSI allows Walter to replicate data asynchronously, while providing strong guarantees within each site. PSI precludes write-write conflicts, so that developers need not worry about conflict-resolution logic. To prevent write-write conflicts and implement PSI, Walter uses two new and simple techniques: preferred sites and counting sets. We use Walter to build a social networking application and port a Twitter-like application.
Towards low latency state machine replication for uncivil wide-area networks
- In Workshop on Hot Topics in System Dependability
, 2009
"... We consider the problem of building state machines in a multi-site environment in which there is lack of trust between sites, but not within a site. This system model recognizes the fact that if a server is attacked, then there are larger issues at play than simply masking the failure of the server. ..."
Abstract
-
Cited by 3 (0 self)
- Add to MetaCart
We consider the problem of building state machines in a multi-site environment in which there is lack of trust between sites, but not within a site. This system model recognizes the fact that if a server is attacked, then there are larger issues at play than simply masking the failure of the server. We describe the design principles of a low-latency Byzantine state machine protocol, called RAM, for this system model. RAM also makes judicious use of attested append-only memory (A2M) and uses a rotating leader design to further reduce latency. 1.

