Results 1 - 10 of 44
Automating String Processing in Spreadsheets Using Input-Output Examples
"... We describe the design of a string programming/expression language that supports restricted forms of regular expressions, conditionals and loops. The language is expressive enough to represent a wide variety of string manipulation tasks that end-users struggle with. We describe an algorithm based on ..."
Abstract
-
Cited by 75 (24 self)
- Add to MetaCart
(Show Context)
We describe the design of a string programming/expression language that supports restricted forms of regular expressions, conditionals and loops. The language is expressive enough to represent a wide variety of string manipulation tasks that end-users struggle with. We describe an algorithm, based on several novel concepts, for synthesizing a desired program in this language from input-output examples. The synthesis algorithm is very efficient, taking a fraction of a second on various benchmark examples. The synthesis algorithm is interactive and has several desirable features: it can rank multiple solutions and converges quickly, it can detect noise in the user input, and it supports an active interaction model wherein the user is prompted to provide outputs on inputs that may have multiple computational interpretations. The algorithm has been implemented as an interactive add-in for the Microsoft Excel spreadsheet system. The prototype tool has met the golden test: it has synthesized part of itself, and has been used to solve problems beyond the authors' imagination.
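The synthesis idea can be illustrated in miniature: enumerate a space of candidate string programs and keep those consistent with every input-output example. The sketch below is a toy over constant-offset substrings only, not the paper's language of regular expressions, conditionals, and loops, and it does no ranking; all names are illustrative.

```python
# Minimal illustration of synthesis from input-output examples: enumerate a
# tiny space of substring programs and keep those consistent with every
# example. A toy, not the paper's ranking-based algorithm.

def candidate_programs(inp: str):
    """Yield (description, function) pairs over a tiny program space:
    all constant-offset substrings of one input."""
    n = len(inp)
    for i in range(n):
        for j in range(i + 1, n + 1):
            yield (f"substr({i}, {j})",
                   lambda s, i=i, j=j: s[i:j])

def synthesize(examples):
    """Return descriptions of programs consistent with all examples."""
    first_in, _ = examples[0]
    return [desc for desc, prog in candidate_programs(first_in)
            if all(prog(i) == o for i, o in examples)]

# Two examples disambiguate "first word" from coincidental substrings.
print(synthesize([("Jane Smith", "Jane"), ("Adam Jones", "Adam")]))
# -> ['substr(0, 4)']
```

Note how the second example prunes candidates that match the first input only by coincidence, which is the role additional examples play in the paper's interaction model.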
binpac: A yacc for Writing Application Protocol Parsers
- In submission, 2006
"... A key step in the semantic analysis of network traffic is to parse the traffic stream according to the high-level protocols it contains. This process transforms raw bytes into structured, typed, and semantically meaningful data fields that provide a high-level representation of the traffic. However, ..."
Abstract
-
Cited by 64 (4 self)
- Add to MetaCart
(Show Context)
A key step in the semantic analysis of network traffic is to parse the traffic stream according to the high-level protocols it contains. This process transforms raw bytes into structured, typed, and semantically meaningful data fields that provide a high-level representation of the traffic. However, constructing protocol parsers by hand is a tedious and error-prone affair due to the complexity and sheer number of application protocols. This paper presents binpac, a declarative language and compiler designed to simplify the task of constructing robust and efficient semantic analyzers for complex network protocols. We discuss the design of the binpac language and a range of issues in generating efficient parsers from high-level specifications. We have used binpac to build several protocol parsers for the “Bro” network intrusion detection system, replacing some of its existing analyzers (handcrafted in C++), and supplementing its operation with analyzers for new protocols. We can then use Bro’s powerful scripting language to express application-level analysis of network traffic in high-level terms that are both concise and expressive. binpac is now part of the open-source Bro distribution.
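The core binpac idea, writing a declarative description of message layout and letting a generic engine interpret it, can be gestured at in a few lines of Python. This is only a hedged illustration using the struct module over a fixed-layout DNS header; real binpac specifications compile to C++ and handle variable-length, stateful protocols.

```python
# Sketch of declarative protocol parsing: a message is described as a list
# of typed fields, and one generic routine interprets that description.
# The spec format here is illustrative, not binpac syntax.
import struct

# (field name, struct format) pairs; ">" means network byte order.
DNS_HEADER_SPEC = [
    ("id",      ">H"),
    ("flags",   ">H"),
    ("qdcount", ">H"),
    ("ancount", ">H"),
    ("nscount", ">H"),
    ("arcount", ">H"),
]

def parse(spec, data: bytes) -> dict:
    """Interpret a field-list description over raw bytes."""
    out, offset = {}, 0
    for name, fmt in spec:
        (out[name],) = struct.unpack_from(fmt, data, offset)
        offset += struct.calcsize(fmt)
    return out

raw = bytes.fromhex("1a2b 0100 0001 0000 0000 0000")
print(parse(DNS_HEADER_SPEC, raw))
# {'id': 6699, 'flags': 256, 'qdcount': 1, 'ancount': 0, ...}
```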
From dirt to shovels: Fully automatic tool generation from ad hoc data
- In POPL, 2008
"... An ad hoc data source is any semistructured data source for which useful data analysis and transformation tools are not readily available. Such data must be queried, transformed and displayed by systems administrators, computational biologists, financial analysts and hosts of others on a regular bas ..."
Abstract
-
Cited by 45 (11 self)
- Add to MetaCart
(Show Context)
An ad hoc data source is any semistructured data source for which useful data analysis and transformation tools are not readily available. Such data must be queried, transformed and displayed by system administrators, computational biologists, financial analysts and hosts of others on a regular basis. In this paper, we demonstrate that it is possible to generate a suite of useful data processing tools, including a semistructured query engine, several format converters, a statistical analyzer and data visualization routines, directly from the ad hoc data itself, without any human intervention. The key technical contribution of the work is a multi-phase algorithm that automatically infers the structure of an ad hoc data source and produces a format specification in the PADS data description language. Programmers wishing to implement custom data analysis tools can use such descriptions to generate printing and parsing libraries for the data. Alternatively, our software infrastructure will push these descriptions through the PADS compiler and automatically generate fully functional tools. We evaluate the performance of our inference algorithm, showing that it scales linearly in the size of the training data, completing in seconds rather than the hours or days it takes to write a description by hand. We also evaluate the correctness of the algorithm, demonstrating that generating accurate descriptions often requires less than 5% of the available data.
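As rough intuition for format inference (not the paper's multi-phase algorithm), one can split records on a delimiter and vote on a base type per column. The type names below are loosely modeled on PADS base types and are otherwise illustrative.

```python
# Toy structure inference: split each record on a delimiter and guess a
# base type per column. The real system infers far richer PADS
# descriptions; this only distinguishes int, float, and string.
import re

def infer_type(values):
    if all(re.fullmatch(r"-?\d+", v) for v in values):
        return "Pint"
    if all(re.fullmatch(r"-?\d+\.\d+", v) for v in values):
        return "Pfloat"
    return "Pstring"

def infer_description(lines, sep=","):
    columns = zip(*(line.split(sep) for line in lines))
    return [infer_type(col) for col in columns]

data = ["10,alpha,3.5", "22,beta,0.25", "7,gamma,9.0"]
print(infer_description(data))   # ['Pint', 'Pstring', 'Pfloat']
```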
The Power of Pi
2008
"... This paper exhibits the power of programming with dependent types by dint of embedding three domain-specific languages: Cryptol, a language for cryptographic protocols; a small data description language; and relational algebra. Each example demonstrates particular design patterns inherent to depende ..."
Abstract
-
Cited by 36 (7 self)
- Add to MetaCart
This paper exhibits the power of programming with dependent types by dint of embedding three domain-specific languages: Cryptol, a language for cryptographic protocols; a small data description language; and relational algebra. Each example demonstrates particular design patterns inherent to dependently-typed programming. Documenting these techniques paves the way for further research in domain-specific embedded type systems.
Semantic subtyping with an SMT solver
2010
"... We study a first-order functional language with the novel combination of the ideas of refinement type (the subset of a type to satisfy a Boolean expression) and type-test (a Boolean expression testing whether a value belongs to a type). Our core calculus can express a rich variety of typing idioms; ..."
Abstract
-
Cited by 31 (1 self)
- Add to MetaCart
(Show Context)
We study a first-order functional language with the novel combination of the ideas of refinement types (subsets of a type whose values satisfy a Boolean expression) and type-tests (Boolean expressions testing whether a value belongs to a type). Our core calculus can express a rich variety of typing idioms; for example, intersection, union, negation, singleton, nullable, variant, and algebraic types are all derivable. We formulate a semantics in which expressions denote terms and types are interpreted as first-order logic formulas. Subtyping is defined as valid implication between the semantics of types. The formulas are interpreted in a specific model that we axiomatize using standard first-order theories. On this basis, we present a novel type-checking algorithm able to eliminate many dynamic tests and to detect many errors statically. The key idea is to rely on an SMT solver to compute subtyping efficiently. Moreover, interpreting types as formulas allows us to call the SMT solver at run-time to compute instances of types.
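The key reduction, checking T1 <: T2 by asking an SMT solver whether the implication between the types' interpretations is valid, can be sketched directly with the Z3 Python bindings. Assuming the z3-solver package is available (an assumption of this sketch, not something the paper prescribes):

```python
# Subtyping as valid implication: T1 <: T2 iff [[T1]](x) => [[T2]](x) is
# valid, which the solver checks by searching for a counterexample.
# Requires the z3-solver package (assumed here).
from z3 import Int, Implies, Not, Solver, unsat

x = Int("x")

def is_subtype(refinement1, refinement2) -> bool:
    """T1 <: T2 holds iff the negated implication is unsatisfiable."""
    s = Solver()
    s.add(Not(Implies(refinement1, refinement2)))
    return s.check() == unsat

# {x: Int | x > 0} is a subtype of {x: Int | x >= 0}, not vice versa.
print(is_subtype(x > 0, x >= 0))   # True
print(is_subtype(x >= 0, x > 0))   # False (x = 0 is a counterexample)
```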
PADS/ML: A Functional Data Description Language
2007
"... Massive amounts of useful data are stored and processed in ad hoc formats for which common tools like parsers, printers, query engines and format converters are not readily available. In this paper, we explain the design and implementation of PADS/ML, a new language and system that facilitates the g ..."
Abstract
-
Cited by 25 (12 self)
- Add to MetaCart
Massive amounts of useful data are stored and processed in ad hoc formats for which common tools like parsers, printers, query engines and format converters are not readily available. In this paper, we explain the design and implementation of PADS/ML, a new language and system that facilitates the generation of data processing tools for ad hoc formats. The PADS/ML design includes features such as dependent, polymorphic and recursive datatypes, which allow programmers to describe the syntax and semantics of ad hoc data in a concise, easy-to-read notation. The PADS/ML implementation compiles these descriptions into ML structures and functors that include types for parsed data, functions for parsing and printing, and auxiliary support for user-specified, format-dependent and format-independent tool generation.
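The "one description, many tools" idea can be gestured at with ordinary combinators: a single field-level description yields both a parser and a printer. This hedged Python sketch is a stand-in for, not a rendering of, PADS/ML's dependent and recursive descriptions compiled to ML structures.

```python
# One description, two derived tools: the same field list drives both
# parsing and printing. Combinators here are illustrative stand-ins.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Field:
    name: str
    parse: Callable    # str -> value
    render: Callable   # value -> str

def record(sep, *fields):
    """Derive a (parser, printer) pair from a field description."""
    def parse(line):
        return {f.name: f.parse(p)
                for f, p in zip(fields, line.split(sep))}
    def render(rec):
        return sep.join(f.render(rec[f.name]) for f in fields)
    return parse, render

# Description of "host:port" entries; parser and printer come for free.
parse, render = record(":", Field("host", str, str),
                            Field("port", int, str))
entry = parse("example.com:8080")
print(entry)          # {'host': 'example.com', 'port': 8080}
print(render(entry))  # example.com:8080
```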
The PADS Project: An Overview
"... The goal of the PADS project, which started in 2001, is to make it easier for data analysts to extract useful information from ad hoc data files. This paper does not report new results, but rather gives an overview of the project and how it helps bridge the gap between the unmanaged world of ad hoc ..."
Abstract
-
Cited by 10 (1 self)
- Add to MetaCart
The goal of the PADS project, which started in 2001, is to make it easier for data analysts to extract useful information from ad hoc data files. This paper does not report new results, but rather gives an overview of the project and how it helps bridge the gap between the unmanaged world of ad hoc data and the managed world of typed programming languages and databases. In particular, the paper reviews the design of PADS data description languages, describes the generated parsing tools and discusses the importance of meta-data. It also sketches the formal semantics, discusses useful tools and how they can be generated automatically from PADS descriptions, and describes an inference system that can learn useful PADS descriptions from positive examples of the data format.
Systematic Characterisation of Objects in Digital Preservation: The eXtensible Characterisation Languages
"... Abstract: During the last decades, digital objects have become the primary medium to create, shape, and exchange information. However, in contrast to analog objects such as books that directly represent their content, digital objects are not usable without a corresponding technical environment. The ..."
Abstract
-
Cited by 10 (7 self)
- Add to MetaCart
During the last decades, digital objects have become the primary medium to create, shape, and exchange information. However, in contrast to analog objects such as books that directly represent their content, digital objects are not usable without a corresponding technical environment. The fast pace of change in these environments, formats, and technologies means that digital documents have a short lifespan before they become obsolete. Digital preservation, i.e., actions to ensure the longevity of digital information, has thus become a pressing challenge. The dominant strategies today are migration and emulation; for each strategy, different tools are available. When converting an object to a different representation, a validation of the content is needed to verify that the transformed objects still authentically represent the same intellectual content. This validation is so far largely done manually, which is infeasible for large collections. Preservation planning supports decision makers in reaching accountable decisions by evaluating potential strategies against well-defined requirements. Especially the evaluation of different migration tools for digital preservation has to rely on validating the
Domain-Specific Languages for Digital Forensics
"... Due to strict deadlines, custom requirements for nearly every case and the scale of digital forensic investigations, forensic software needs to be extremely flexible. There is a clear separation between different types of knowledge in the domain, making domain-specific languages (DSLs) a possible so ..."
Abstract
-
Cited by 9 (6 self)
- Add to MetaCart
Due to strict deadlines, custom requirements for nearly every case and the scale of digital forensic investigations, forensic software needs to be extremely flexible. There is a clear separation between different types of knowledge in the domain, making domain-specific languages (DSLs) a possible solution for these applications. To determine their effectiveness, DSL-based systems must be implemented and compared to the original systems. Furthermore, existing systems must be migrated to these DSL-based systems to preserve the knowledge that has been encoded in them over the years. Finally, a cost analysis must be made to determine whether these DSL-based systems are a good investment.
Semantics and Algorithms for Data-dependent Grammars
"... Traditional parser generation technologies are incapable of handling the demands of modern programmers. In this paper, we present the design and theory of a new parsing engine, YAKKER, capable of handling the requirements of modern applications including (1) full scannerless context-free grammars wi ..."
Abstract
-
Cited by 8 (4 self)
- Add to MetaCart
(Show Context)
Traditional parser generation technologies are incapable of handling the demands of modern programmers. In this paper, we present the design and theory of a new parsing engine, YAKKER, capable of handling the requirements of modern applications including (1) full scannerless context-free grammars with (2) regular expressions as right-hand sides for defining nonterminals. YAKKER also includes (3) facilities for binding variables to intermediate parse results and (4) using such bindings within arbitrary constraints to control parsing. These facilities allow the kind of data-dependent parsing commonly needed in systems applications, particularly those that operate over binary data. In addition, (5) nonterminals may be parameterized by arbitrary values, which gives the system good modularity and abstraction properties in the presence of data-dependent parsing. Finally, (6) legacy parsing libraries, such as sophisticated libraries for dates and times, may be directly incorporated into parser specifications. We illustrate the importance and utility of this rich format specification language by presenting its use on examples ranging from difficult programming language grammars to web server logs to binary data specification. We also show that our grammars have important compositionality properties and explain why such properties are important in modern applications such as automatic grammar induction. In terms of technical contributions, we provide a traditional high-level semantics for our new grammar formalization and show how to compile these grammars into nondeterministic automata. These automata are stack-based, somewhat like conventional pushdown automata, but are also equipped with environments to track data-dependent parsing state. We prove the correctness of our translation of data-dependent grammars into these new automata and then show how to implement the automata efficiently using a variation of Earley’s parsing algorithm. 1.