Monadic Datalog and the Expressive Power of Languages for Web Information Extraction
 J. ACM
, 2002
Research on information extraction from Web pages (wrapping) has seen much activity in recent times (particularly systems implementations), but little work has been done on formally studying the expressiveness of the formalisms proposed or on the theoretical foundations of wrapping. In this paper, we first study monadic datalog as a wrapping language (over ranked or unranked tree structures). Using previous work by Neven and Schwentick, we show that this simple language is equivalent to full monadic second-order logic (MSO) in its ability to specify wrappers. We believe that MSO has the right expressiveness required for Web information extraction and thus propose MSO as a yardstick for evaluating and comparing wrappers. Using the above result, we study the kernel fragment Elog⁻ of the Elog wrapping language used in the Lixto system (a visual wrapper generator). The striking fact here is that Elog⁻ exactly captures MSO, yet is easier to use. Indeed, programs in this language can be entirely visually specified. We also formally compare Elog to other wrapping languages proposed in the literature.
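As a rough illustration of what a monadic datalog wrapper looks like, the sketch below evaluates a three-rule program by naive fixpoint iteration. The rules, the predicate name Under, the example tree and the helper names are all assumptions made for this illustration and are not taken from the paper; the only idea borrowed from the abstract is that every derived predicate is unary and is defined over tree relations such as firstchild and nextsibling.

```python
# Hypothetical monadic datalog program (all derived predicates are unary):
#   Under(x) :- firstchild(y, x), label_table(y).
#   Under(x) :- firstchild(y, x), Under(y).
#   Under(x) :- nextsibling(y, x), Under(y).
# It selects every proper descendant of a node labeled "table".

# Hypothetical document tree: node id -> (label, ordered list of child ids).
tree = {
    0: ("html", [1, 4]),
    1: ("table", [2, 3]),
    2: ("tr", []),
    3: ("tr", [5]),
    4: ("p", []),
    5: ("td", []),
}


def tree_relations(tree):
    """Derive the binary relations firstchild and nextsibling from the tree."""
    firstchild, nextsibling = set(), set()
    for parent, (_, kids) in tree.items():
        if kids:
            firstchild.add((parent, kids[0]))
        for left, right in zip(kids, kids[1:]):
            nextsibling.add((left, right))
    return firstchild, nextsibling


def under_table(tree):
    """Naive fixpoint evaluation of the three rules above."""
    firstchild, nextsibling = tree_relations(tree)
    is_table = {n for n, (label, _) in tree.items() if label == "table"}
    under = set()
    changed = True
    while changed:
        derived = set()
        for y, x in firstchild:      # rules 1 and 2
            if y in is_table or y in under:
                derived.add(x)
        for y, x in nextsibling:     # rule 3
            if y in under:
                derived.add(x)
        changed = not derived <= under
        under |= derived
    return under


selected = under_table(tree)
```

Naive iteration is used here only for brevity; it says nothing about the evaluation strategies or complexity bounds discussed in the paper.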
The Regular Viewpoint on PA-Processes
 Theoretical Computer Science
, 1999
PA is the process algebra allowing nondeterminism, sequential and parallel composition, and recursion. We suggest viewing PA-processes as trees, and using tree-automata techniques for verification problems on PA. Our main result is that the set of iterated predecessors of a regular set of PA-processes is a regular tree language, and similarly for iterated successors. Furthermore, the corresponding tree automata can be built effectively in polynomial time. This has many immediate applications to verification problems for PA-processes, including a simple and general model-checking algorithm.
Tree-Walking Pebble Automata
 Jewels are forever, contributions to Theoretical Computer Science in honor of Arto Salomaa
, 1999
this paper is to investigate the power of tree-walking automata with pebbles. Obviously, the unrestricted use of pebbles leads to a class of tree languages much larger than the regular tree languages, in fact to all tree languages in NSPACE(log n). Thus, we restrict the automaton to the recursive use of pebbles, in the sense that the lifetimes of pebbles, i.e., the times between dropping a pebble and lifting it again, are properly nested. A similar, but stronger, nesting requirement is studied in [13] for 2-way automata on strings. We prove in Section 5 that our restriction indeed guarantees that all tree languages recognized by the tree-walking pebble automaton are regular, but we conjecture that the automaton is not powerful enough to recognize all regular tree languages. In Section 6 we generalize the notion of pebble to that of a "set-pebble", in such a way that the tree-walking set-pebble automaton recognizes exactly the regular tree languages.
Bottom-up and Top-down Tree Series Transformations
 J. Autom. Lang. Combin
, 2000
We generalize bottom-up tree transducers and top-down tree transducers to the concept of bottom-up tree series transducer and top-down tree series transducer, respectively, by allowing formal tree series as output rather than trees, where a formal tree series is a mapping from output trees to some semiring. We associate two semantics with a tree series transducer: a mapping which transforms trees into tree series (for short: tree to tree series transformation or t-ts transformation), and a mapping which transforms tree series into tree series (for short: tree series transformation or ts-ts transformation). We show that the standard case of tree transducers is reobtained by choosing the boolean semiring under the t-ts semantics. Also, for each of the two types of tree series transducers and for both types of semantics, we prove a characterization which generalizes in a straightforward way the corresponding characterization result for the underlying tree transducer class. Mo...
Caterpillars: A Context Specification Technique
 Markup Languages
, 2000
We present a novel, yet simple, technique for the specification of context in structured documents that we call caterpillar expressions. Although we are primarily applying this technique in the specification of context-dependent style sheets for HTML, SGML and XML documents, it can also be used for query specification for structured documents, as we shall demonstrate, and for the specification of computer program transformations. From a conceptual point of view, structured documents are trees, and one of the oldest and best-established techniques to process trees and, hence, structured documents is tree automata. We present a number of theoretical results that allow us to compare the expressive power of caterpillar expressions and caterpillar automata, their companions, to the expressive power of tree automata. In particular, we demonstrate that each caterpillar expression describes a regular tree language that is, hence, recognizable by a tree automaton. Finally, we empl...
Frontiers of tractability for typechecking simple XML transformations
 PODS
, 2004
Typechecking consists of statically verifying whether the output of an XML transformation always conforms to an output type for documents satisfying a given input type. We focus on complete algorithms which always produce the correct answer. We consider top-down XML transformations incorporating XPath expressions and abstract document types by grammars and tree automata. By restricting schema languages and transformations, we identify several practical settings for which typechecking is in polynomial time. Moreover, the resulting framework provides a rather complete picture as we show that most scenarios cannot be enlarged without rendering the typechecking problem intractable. So, the present research sheds light on when to use fast complete algorithms and when to resort to sound but incomplete ones.
Extensions of Attribute Grammars for Structured Document Queries
, 1999
Document specification languages such as XML model documents using extended context-free grammars. These differ from standard context-free grammars in that they allow arbitrary regular expressions on the right-hand side of productions. To query such documents, we introduce a new form of attribute grammars (extended AGs) that work directly over extended context-free grammars rather than over standard context-free grammars. Viewed as a query language, extended AGs are particularly relevant as they can take into account the inherent order of the children of a node in a document.
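To make the notion of an extended context-free grammar concrete, here is a minimal sketch in which each production's right-hand side is a regular expression over child labels, checked with Python's re module. The grammar, the element names and the valid helper are hypothetical, and the sketch covers only the grammar side, not the extended attribute grammars introduced in the paper.

```python
import re

# Hypothetical extended context-free grammar: each right-hand side is a
# regular expression over the space-separated labels of a node's children.
productions = {
    "book": r"title( author)+( chapter)*",
    "chapter": r"title( para)*",
}


def valid(node):
    """Check a (label, children) tree against the productions above.

    Assumption of this sketch: a label without a production must be a leaf.
    """
    label, children = node
    child_labels = " ".join(child[0] for child in children)
    pattern = productions.get(label)
    if pattern is None:
        ok = not children
    else:
        ok = re.fullmatch(pattern, child_labels) is not None
    return ok and all(valid(child) for child in children)


good = ("book", [
    ("title", []),
    ("author", []),
    ("author", []),
    ("chapter", [("title", []), ("para", [])]),
])
bad = ("book", [("author", []), ("title", [])])  # children in the wrong order
```

Because the right-hand side is matched against the ordered sequence of child labels, the check is sensitive to sibling order, which is the property the abstract highlights.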
Typechecking Top-Down Uniform Unranked Tree Transducers
 in 9th International Conference on Database Theory, ser. LNCS
We investigate the typechecking problem for XML queries: statically verifying that every answer to a query conforms to a given output schema, for inputs satisfying a given input schema. As typechecking quickly turns undecidable for query languages capable of testing equality of data values, we return to the limited framework where we abstract XML documents as labeled ordered trees. We focus on simple top-down recursive transformations motivated by XSLT and structural recursion on trees. We parameterize the problem by several restrictions on the transformations (deleting, non-deleting, bounded width) and consider both tree automata and DTDs as output schemas. The complexity of the typechecking problems in this scenario ranges from PTIME to EXPTIME.
Efficient memory representation of XML documents
 In DBPL
, 2005
Implementations that load XML documents and give access to them via, e.g., the DOM, suffer from huge memory demands: the space needed to load an XML document is usually many times larger than the size of the document. A considerable amount of memory is needed to store the tree structure of the XML document. Here a technique is presented that makes it possible to represent the tree structure of an XML document in an efficient way. The representation exploits the high regularity in XML documents by "compressing" their tree structure; the latter means detecting and removing repetitions of tree patterns. The functionality of basic tree operations, like traversal along edges, is preserved in the compressed representation. This makes it possible to execute queries (and in particular, bulk operations) directly, without prior decompression. For certain tasks like validation against an XML type or checking equality of documents, the representation allows for provably more efficient algorithms than those running on conventional representations.
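The simplest instance of removing repeated structure is sharing identical subtrees, which turns the tree into a DAG. The sketch below does this by hash-consing and then computes the size of the original tree directly on the shared form. It is an illustration under stated assumptions, not the paper's method: the abstract speaks of tree patterns, which may be more general than whole subtrees, and the names build_dag and unfolded_size, like the example document, are hypothetical.

```python
def build_dag(tree):
    """Share identical subtrees of a (label, children) tree.

    Returns (root_id, nodes), where nodes[i] = (label, tuple of child ids)
    and every child id is smaller than the id of its parent.
    """
    table = {}   # (label, child ids) -> node id
    nodes = []   # node id -> (label, child ids)

    def intern(node):
        label, children = node
        key = (label, tuple(intern(child) for child in children))
        if key not in table:
            table[key] = len(nodes)
            nodes.append(key)
        return table[key]

    return intern(tree), nodes


def unfolded_size(root, nodes):
    """Node count of the original tree, computed without decompressing."""
    sizes = []
    for _, child_ids in nodes:   # children always precede their parents
        sizes.append(1 + sum(sizes[c] for c in child_ids))
    return sizes[root]


# Hypothetical document with three structurally identical <book> elements.
book = ("book", (("title", ()), ("author", ())))
doc = ("library", (book, book, book))

root, nodes = build_dag(doc)
```

In this example the ten-node tree is stored as four shared nodes, and unfolded_size visits each shared node once instead of walking the unfolded tree.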