Results 1 - 10 of 23
Curated databases
PODS'08, 2008
Cited by 43 (6 self)
Curated databases are databases that are populated and updated with a great deal of human effort. Most reference works that one traditionally found on the reference shelves of libraries – dictionaries, encyclopedias, gazetteers etc. – are now curated databases. Since it is now easy to publish databases on the web, there has been an explosion in the number of new curated databases used in scientific research. The value of curated databases lies in the organization and the quality of the data they contain. Like the paper reference works they have replaced, they usually represent the efforts of a dedicated group of people to produce a definitive description of some subject area. Curated databases present a number of challenges for database research. The topics of annotation, provenance, and citation are central, because curated databases are heavily cross-referenced with, and include data from, other databases, and much of the work of a curator is annotating existing data. Evolution of structure is important because these databases often evolve from semistructured representations, and because they have to accommodate new scientific discoveries. Much of the work in these areas is in its infancy, but it is beginning to suggest new research for both theory and practice. We discuss some of this research and emphasize the need to find appropriate models of the processes associated with curated databases.
Tupni: Automatic Reverse Engineering of Input Formats
In Proceedings of the 15th ACM Conference on Computer and Communications Security (CCS), 2008
Cited by 28 (2 self)
Recent work has established the importance of automatic reverse engineering of protocol or file format specifications. However, the formats reverse engineered by previous tools have missed important information that is critical for security applications. In this paper, we present Tupni, a tool that can reverse engineer an input format with a rich set of information, including record sequences, record types, and input constraints. Tupni can generalize the format specification over multiple inputs. We have implemented a prototype of Tupni and evaluated it on 10 different formats: five file formats (WMF, BMP, JPG, PNG and TIF) and five network protocols (DNS, RPC, TFTP, HTTP and FTP). Tupni identified all record sequences in the test inputs. We also show that, by aggregating over multiple WMF files, Tupni can derive a more complete format specification for WMF. Furthermore, we demonstrate the utility of Tupni by using the rich information it provides for zero-day vulnerability signature generation, which was not possible with previous reverse engineering tools.
Detecting Large-Scale System Problems by Mining Console Logs
Cited by 26 (0 self)
Surprisingly, console logs rarely help operators detect problems in large-scale datacenter services, for they often consist of the voluminous intermixing of messages from many software components written by independent developers. We propose a general methodology to mine this rich source of information to automatically detect system runtime problems. We first parse console logs by combining source code analysis with information retrieval to create composite features. We then analyze these features using machine learning to detect operational problems. We show that our method enables analyses that are impossible with previous methods because of its superior ability to create sophisticated features. We also show how to distill the results of our analysis to an operator-friendly one-page decision tree showing the critical messages associated with the detected problems. We validate our approach using the Darkstar online game server and the Hadoop File System, where we detect numerous real problems with high accuracy and few false positives. In the Hadoop case, we are able to analyze 24 million lines of console logs in 3 minutes. Our methodology works on textual console logs of any size and requires no changes to the service software, no human input, and no knowledge of the software’s internals.
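The parse-then-featurize step described in this abstract can be illustrated with a rough sketch. This is not the paper's method: the abstract says parsing combines source code analysis with information retrieval, whereas the code below substitutes a plain regular expression that masks numbers to approximate message templates. The log lines, the `blk_<number>` identifier pattern, and the function names are all hypothetical.

```python
import re
from collections import Counter, defaultdict


def template(line: str) -> str:
    """Approximate a message template by masking every run of digits.

    Assumption: variable parts of a message are numeric. A regex stands in
    here for the source-code-based parsing the abstract describes.
    """
    return re.sub(r"\d+", "<*>", line)


def count_vectors(lines, key_pattern=r"blk_\d+"):
    """Group lines by an identifier and count message templates per group.

    key_pattern is a hypothetical identifier format; lines without a match
    are skipped. Returns {identifier: Counter({template: count})}.
    """
    vectors = defaultdict(Counter)
    for line in lines:
        match = re.search(key_pattern, line)
        if match:
            vectors[match.group()][template(line)] += 1
    return dict(vectors)
```

Each resulting counter is a sparse feature vector that a downstream learner could consume; which learner to use is left open here.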
SherLog: Error Diagnosis by Connecting Clues from Run-time Logs
Cited by 18 (5 self)
Computer systems often fail due to factors such as software bugs or administrator errors. Diagnosing such production-run failures is an important but challenging task, since they are difficult to reproduce in house for several reasons: (1) unavailability of users’ inputs and file content due to privacy concerns; (2) difficulty in building the exact same execution environment; and (3) non-determinism of concurrent executions on multi-processors. Therefore, programmers often have to diagnose a production-run failure based on logs collected from customers and the corresponding source code. Such diagnosis requires expert knowledge, and narrowing down root causes is time-consuming and tedious. To address this problem, we propose a tool, called SherLog, that analyzes source code by leveraging information provided by run-time logs to infer what must or may have happened during the failed production run. It requires neither re-execution of the program nor knowledge of the log’s semantics. It infers both control and data value information regarding the failed execution. We evaluate SherLog with 8 representative real-world software failures (6 software bugs and 2 configuration errors) from 7 applications, including 3 servers. The information inferred by SherLog is very useful for programmers diagnosing these failures. Our results also show that SherLog can analyze large server applications such as Apache, with thousands of logging messages, within only 40 minutes.
Automating String Processing in Spreadsheets Using Input-Output Examples
Cited by 17 (9 self)
We describe the design of a string programming/expression language that supports restricted forms of regular expressions, conditionals and loops. The language is expressive enough to represent a wide variety of string manipulation tasks that end-users struggle with. We describe an algorithm based on several novel concepts for synthesizing a desired program in this language from input-output examples. The synthesis algorithm is very efficient, taking a fraction of a second for various benchmark examples. The synthesis algorithm is interactive and has several desirable features: it can rank multiple solutions and has fast convergence, it can detect noise in the user input, and it supports an active interaction model wherein the user is prompted to provide outputs on inputs that may have multiple computational interpretations. The algorithm has been implemented as an interactive add-in for the Microsoft Excel spreadsheet system. The prototype tool has met the golden test: it has synthesized part of itself, and has been used to solve problems beyond the authors’ imagination.
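To make the idea of synthesis from input-output examples concrete, here is a deliberately tiny sketch. It is not the algorithm or the language from this paper: it uses brute-force enumeration over a hypothetical two-parameter language in which a program `(delimiter, index)` means "split the input on the delimiter and take the piece at that index". The delimiter and index lists and all function names are assumptions chosen for illustration.

```python
from itertools import product

# Hypothetical search space: every program is a (delimiter, index) pair.
DELIMS = [" ", ",", "-", "/", "@", "."]
INDICES = [0, 1, 2, -1]


def run(program, text):
    """Evaluate a (delimiter, index) program; None if the index is out of range."""
    delim, idx = program
    parts = text.split(delim)
    try:
        return parts[idx]
    except IndexError:
        return None


def synthesize(examples):
    """Return every program consistent with all (input, output) examples."""
    return [
        program
        for program in product(DELIMS, INDICES)
        if all(run(program, inp) == out for inp, out in examples)
    ]
```

With the single example `("a-b", "b")`, two programs remain (`("-", 1)` and `("-", -1)`). Adding `("a-b-c", "c")` leaves only `("-", -1)`, which shows in miniature why an ambiguous input is a useful thing to ask the user about.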
Bidirectional Transformations: A Cross-Discipline Perspective. GRACE meeting notes, state of the art, and outlook
Cited by 12 (4 self)
The GRACE meeting on bidirectional transformations was held in December 2008 near Tokyo, Japan. The meeting brought together researchers and practitioners from a variety of subdisciplines of computer science to share research efforts and help create a new community. In this report, we survey the state of the art and summarize the technical presentations delivered at the meeting. We also describe some insights gathered from our discussions and introduce a new effort to establish a benchmark for bidirectional transformations.
Table Transformations from Examples
2011
Cited by 8 (6 self)
Every day, millions of computer end-users need to perform tasks over large, tabular data, yet lack the programming knowledge to do such tasks automatically. In this work, we present an automatic technique that takes from a user an example of how the user needs to transform a table of data, and provides to the user a program that implements the transformation described by the example. In particular, we present a language of programs TableProg that can describe transformations that real users require. We then present an algorithm ProgFromEx that takes an example input and output table, and infers a program in TableProg that implements the transformation described by the example. When the program is applied to the example input, it reproduces the example output. When the program is applied to another, potentially larger, table with a “similar” layout as the example input table, then the program produces
Incremental Learning of System Log Formats
Cited by 6 (2 self)
System logs come in a large and evolving variety of formats, many of which are semi-structured and/or non-standard. As a consequence, off-the-shelf tools for processing such logs often do not exist, forcing analysts to develop their own tools, which is costly and time-consuming. In this paper, we present an incremental algorithm that automatically infers the format of system log files. From the resulting format descriptions, we can generate a suite of data processing tools automatically. The system can handle large-scale data sources whose formats evolve over time. Furthermore, it allows analysts to modify inferred descriptions as desired and incorporates those changes in future revisions.
LearnPADS: Automatic tool generation from ad hoc data
In SIGMOD, 2008
Cited by 5 (2 self)
In this demonstration, we will present LEARNPADS, a fully automatic system for generating ad hoc data processing tools. When presented with a collection of ad hoc data, the system (1) analyzes the data, (2) infers a PADS [4, 5] description, (3) generates parser, printer, validation and traversal libraries and (4) links these libraries with format-independent tool suites to form stand-alone applications. These applications provide statistical analysis, XML conversion, CSV conversion, the ability to query with the Galax XQuery engine [3], and the ability to graph selected data elements, all directly from ASCII ad hoc data without human intervention. SIGMOD attendees will see both the user experience with LEARNPADS and the internals of the multi-phase inference algorithm which lies at the heart of the system.
Semantics and Algorithms for Data-dependent Grammars
Cited by 5 (3 self)
Traditional parser generation technologies are incapable of handling the demands of modern programmers. In this paper, we present the design and theory of a new parsing engine, YAKKER, capable of handling the requirements of modern applications including (1) full scannerless context-free grammars with (2) regular expressions as right-hand sides for defining nonterminals. YAKKER also includes (3) facilities for binding variables to intermediate parse results and (4) using such bindings within arbitrary constraints to control parsing. These facilities allow the kind of data-dependent parsing commonly needed in systems applications, particularly those that operate over binary data. In addition, (5) nonterminals may be parameterized by arbitrary values, which gives the system good modularity and abstraction properties in the presence of data-dependent parsing. Finally, (6) legacy parsing libraries, such as sophisticated libraries for dates and times, may be directly incorporated into parser specifications. We illustrate the importance and utility of this rich format specification language by presenting its use on examples ranging from difficult programming language grammars to web server logs to binary data specification. We also show that our grammars have important compositionality properties and explain why such properties are important in modern applications such as automatic grammar induction. In terms of technical contributions, we provide a traditional high-level semantics for our new grammar formalization and show how to compile these grammars into nondeterministic automata. These automata are stack-based, somewhat like conventional pushdown automata, but are also equipped with environments to track data-dependent parsing state. We prove the correctness of our translation of data-dependent grammars into these new automata and then show how to implement the automata efficiently using a variation of Earley’s parsing algorithm.
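For readers unfamiliar with data-dependent parsing, a small hand-written sketch may help. It does not use YAKKER or its grammar notation. It only shows the underlying idea from points (3) and (4): a value parsed earlier (here a length byte) is bound to a variable and then constrains how the rest of the input is parsed. The record layout (one unsigned length byte followed by that many payload bytes) and the function name are assumptions made for illustration.

```python
import struct


def parse_records(data: bytes) -> list[bytes]:
    """Parse a sequence of length-prefixed records.

    Assumed layout: each record is one unsigned length byte n followed by
    exactly n payload bytes.
    """
    records = []
    pos = 0
    while pos < len(data):
        # Bind the parsed length field to a variable ...
        (n,) = struct.unpack_from("B", data, pos)
        pos += 1
        # ... and use that binding as a constraint on what follows.
        if pos + n > len(data):
            raise ValueError(
                f"record at offset {pos - 1} claims {n} bytes, "
                f"but only {len(data) - pos} remain"
            )
        records.append(data[pos:pos + n])
        pos += n
    return records
```

A plain context-free grammar cannot express "exactly n bytes, where n was just read", which is why formats like this are usually parsed by hand or with a data-dependent formalism.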

