Pig Latin: A Not-So-Foreign Language for Data Processing
Cited by 348 (11 self)
There is a growing need for ad hoc analysis of extremely large data sets, especially at internet companies where innovation critically depends on being able to analyze terabytes of data collected every day. Parallel database products, e.g., Teradata, offer a solution, but are usually prohibitively expensive at this scale. Besides, many of the people who analyze this data are entrenched procedural programmers, who find the declarative, SQL style to be unnatural. The success of the more procedural map-reduce programming model, and its associated scalable implementations on commodity hardware, is evidence of the above. However, the map-reduce paradigm is too low-level and rigid, and leads to a great deal of custom user code that is hard to maintain and reuse. We describe a new language called Pig Latin that we have designed to fit in a sweet spot between the declarative style of SQL and the low-level, procedural style of map-reduce. The accompanying system, Pig, is fully implemented, and compiles Pig Latin into physical plans that are executed over Hadoop, an open-source map-reduce implementation. We give a few examples of how engineers at Yahoo! are using Pig to dramatically reduce the time required for the development and execution of their data analysis tasks, compared to using Hadoop directly. We also report on a novel debugging environment that comes integrated with Pig, that can lead to even higher productivity gains. Pig is an open-source Apache-incubator project, and available for general use.
Semantic database modeling: Survey, applications, and research issues
 ACM Computing Surveys
, 1987
Cited by 225 (3 self)
Most common database management systems represent information in a simple record-based format. Semantic modeling provides richer data structuring capabilities for database applications. In particular, research in this area has articulated a number of constructs that provide mechanisms for representing structurally complex interrelations among data typically arising in commercial applications. In general terms, semantic modeling complements work on knowledge representation (in artificial intelligence) and on the new generation of database models based on the object-oriented paradigm of programming languages. This paper presents an in-depth discussion of semantic data modeling. It reviews the philosophical motivations of semantic models, including the need for high-level modeling abstractions and the reduction of semantic overloading of data type constructors. It then provides a tutorial introduction to the primary components of semantic models, which are the explicit representation of objects, attributes of and relationships among objects, type constructors for building complex types, ISA relationships, and derived schema components. Next, a survey of the prominent semantic models in the literature is presented. Further, since a broad area of research has developed around semantic modeling, a number of related topics based on these models are discussed, including data languages, graphical interfaces, theoretical investigations, and physical implementation strategies.
Linking data to ontologies
 J. on Data Semantics
, 2008
Cited by 142 (57 self)
Many organizations nowadays face the problem of accessing existing data sources by means of flexible mechanisms that are both powerful and efficient. Ontologies are widely considered as a suitable formal tool for sophisticated data access. The ontology expresses the domain of interest of the information system at a high level of abstraction, and the relationship between data at the sources and instances of concepts and roles in the ontology is expressed by means of mappings. In this paper we present a solution to the problem of designing effective systems for ontology-based data access. Our solution is based on three main ingredients. First, we present a new ontology language, based on Description Logics, that is particularly suited to reason with large amounts of instances. The second ingredient is a novel mapping language that is able to deal with the so-called impedance mismatch problem, i.e., the problem arising from the difference between the basic elements managed by the sources, namely data, and the elements managed by the ontology, namely objects. The third ingredient is the query answering method, that combines reasoning at the level of the ontology with specific mechanisms for both taking into account the mappings and efficiently accessing the data at the sources.
On The Power Of Languages For The Manipulation Of Complex Objects
 In Proceedings of International Workshop on Theory and Applications of Nested Relations and Complex Objects
, 1993
Cited by 121 (6 self)
Various models and languages for describing and manipulating hierarchically structured data have been proposed. Algebraic, calculus-based and logic-programming oriented languages have all been considered. This paper presents a general model for complex objects, and languages for it based on the three paradigms. The algebraic language generalizes those presented in the literature; it is shown to be related to the functional style of programming advocated by Backus. The notion of domain independence familiar from relational databases is defined, and syntactic restrictions (referred to as safety conditions) on calculus queries are formulated, that guarantee domain independence. The main results are: The domain-independent calculus, the safe calculus, the algebra, and the logic-programming oriented language have equivalent expressive power. In particular, recursive queries, such as the transitive closure, can be expressed in each of the languages. For this result, the algebra needs the pow...
Schema Equivalence in Heterogeneous Systems: Bridging Theory and Practice
, 1993
Cited by 62 (2 self)
Current theoretical work offers measures of schema equivalence based on the information capacity of schemas. This work is based on the existence of abstract functions satisfying various restrictions between the sets of all instances of two schemas. In considering schemas that arise in practice, however, it is not clear how to reason about the existence of such abstract functions. Further, these notions of equivalence tend to be too liberal in that schemas are often considered equivalent when a practitioner would consider them to be different. As a result, practical integration methodologies have not utilized this theoretical foundation and most of them have relied on ad hoc approaches. We present results that seek to bridge this gap. First, we consider the problem of deciding information capacity equivalence and dominance of schemas that occur in practice, i.e., those that can express inheritance and simple integrity constraints. We show that this problem is undecidable. This undecidab...
Thematic Map Modeling
, 1989
Cited by 57 (6 self)
We study here how to provide the designer of geographic databases with an extensible and customizable database query language. The model presented here is a first step toward a high-level spatial query language adapted to the manipulation of thematic maps. For this, we take as an example a toy application on thematic maps, and show by using a complex objects algebra that application-dependent geometric operations can be expressed through an extension of the replace operator of [AB88].
Structured objects: Modeling and reasoning
 Proc. of DOOD95
, 1995
Cited by 51 (34 self)
One distinctive characteristic of object-oriented data models over traditional database systems is that they provide more expressive power in schema definition. Nevertheless, the defining power of object-oriented models is still somewhat limited, mainly because it is commonly accepted that part of the semantics of the application can be represented within methods. The research work reported in this paper explores the possibility of enhancing the power of object-oriented data models in schema definition, thus offering more possibilities to reason about the intension of the database and better supporting data management. We demonstrate our approach by presenting a new data model, called CVL, that extends the usual object-oriented data models with several aspects, including view definition, recursive structure modeling, navigation of the schema through forward and backward traversal of links (attributes and relations), subsetting of attributes, and cardinality ratio constraints on links. CVL is equipped with sound, complete, and terminating inference procedures, that allow various forms of reasoning to be carried out on the intensional level of the database.
The Power of Languages for the Manipulation of Complex Values
 VLDB Journal
, 1995
Cited by 48 (0 self)
Various models and languages for describing and manipulating hierarchically structured data have been proposed. Algebraic, calculus-based, and logic-programming oriented languages have all been considered. This article presents a general model for complex values (i.e., values with hierarchical structures), and languages for it based on the three paradigms. The algebraic language generalizes those presented in the literature; it is shown to be related to the functional style of programming advocated by Backus (1978). The notion of domain independence (from relational databases) is defined, and syntactic restrictions (referred to as safety conditions) on calculus queries are formulated to guarantee domain independence. The main results are: The domain-independent calculus, the safe calculus, the algebra, and the logic-programming oriented language have equivalent expressive power. In particular, recursive queries, such as the transitive closure, can be expressed in each of the languages. For this result, the algebra needs the powerset operation. A more restricted version of safety is presented, such that the restricted safe calculus is equivalent to the algebra without the powerset. The results are extended to the case where arbitrary functions and predicates are used in the languages. Key Words. Database, query language, complex value, complex object, database model.
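The remark that the algebra needs the powerset operation to express transitive closure can be illustrated with a small sketch. It uses a well-known characterization, assumed here for illustration and not taken from the article itself: the transitive closure of a relation R is the intersection of all transitive relations over R's active domain that contain R. Enumerating those candidates is where a powerset comes in. The function names below are hypothetical.

```python
from itertools import chain, combinations

def powerset(items):
    """All subsets of items, as tuples (this is the expensive step)."""
    items = list(items)
    return chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )

def is_transitive(rel):
    """True if (a, b) and (b, d) in rel always implies (a, d) in rel."""
    return all((a, d) in rel for (a, b) in rel for (c, d) in rel if b == c)

def closure_via_powerset(edges):
    """Transitive closure without iteration: intersect every transitive
    superset of `edges` drawn from the powerset of domain x domain."""
    domain = {x for pair in edges for x in pair}
    all_pairs = [(a, b) for a in domain for b in domain]
    candidates = [
        set(subset)
        for subset in powerset(all_pairs)
        if edges <= set(subset) and is_transitive(set(subset))
    ]
    return set.intersection(*candidates)

if __name__ == "__main__":
    print(sorted(closure_via_powerset({(1, 2), (2, 3)})))
```

The cost is exponential in the number of pairs, so this is only workable on tiny inputs; the point is that no loop or recursion over the data is needed once a powerset is available.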
On the expressive power of database queries with intermediate types
 Journal of Computer and System Sciences
, 1991
Cited by 44 (2 self)
The set-height of a complex object type is defined to be its level of nesting of the set construct. In a query of the complex object calculus which maps a database D to an output type T, an intermediate type is a type which is used by some variable of the query, but which is not present in D or T. For each k, i ≥ 0 we define CALC_{k,i} to be the family of calculus queries mapping from and to types with set-height ≤ k and using intermediate types with set-height ≤ i. In particular, CALC_{0,0} is the classical relational calculus, and CALC_{0,1} is equivalent to the family of second-order (relational) queries. Several results concerning these families of languages are obtained. A primary focus is on the families CALC_{0,i}, which map relations to relations. Upper and lower bounds in terms of hyperexponential time and space on the complexity of these families are provided. The CALC_{0,i} hierarchy does not collapse with respect to expressive power. The union ∪_{0≤i} CALC_{0,i} is exactly the family of elementary queries, i.e., queries with hyperexponential complexity. The expressive power of queries from the complex object calculus interpreted using semantics based on the use of arbitrarily large finite or infinite sets of invented values is studied. Under these semantics, the expressive power of the relational calculus is not increased, and the CALC_{0,i} hierarchy collapses at CALC_{0,1}. In general, queries with these semantics may not be computable. We also consider an alternative semantics which yields a family of queries equivalent to the computable queries.
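The set-height definition above can be made concrete with a minimal sketch. The type representation (`Atom`, `TupleType`, `SetType`) and the function name are hypothetical. The sketch also assumes that a relation is described by its tuple type, so a flat relation has set-height 0; that convention is inferred from the abstract's statement that the height-0 family is the classical relational calculus, not stated in it directly.

```python
from dataclasses import dataclass
from typing import Tuple, Union

@dataclass(frozen=True)
class Atom:
    name: str  # a basic domain, e.g. "D"

@dataclass(frozen=True)
class TupleType:
    fields: Tuple["Type", ...]

@dataclass(frozen=True)
class SetType:
    element: "Type"

Type = Union[Atom, TupleType, SetType]

def set_height(t):
    """Nesting depth of the set construct in type t."""
    if isinstance(t, Atom):
        return 0
    if isinstance(t, SetType):
        return 1 + set_height(t.element)
    # Tuple construct: adds no nesting itself, take the deepest field.
    return max((set_height(f) for f in t.fields), default=0)

if __name__ == "__main__":
    d = Atom("D")
    flat = TupleType((d, d))                         # flat relational tuple
    nested = TupleType((d, SetType(TupleType((d,)))))  # one set-valued field
    print(set_height(flat), set_height(nested), set_height(SetType(SetType(d))))
```

Under these assumptions, a query over flat relations whose variables range over sets of tuples uses an intermediate type of set-height 1.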