Finite-State Transducers in Language and Speech Processing
 Computational Linguistics
, 1997
Finite-state machines have been used in various domains of natural language processing. We consider here the use of a type of transducers that supports very efficient programs: sequential transducers. We recall classical theorems and give new ones characterizing sequential string-to-string transducers. Transducers that output weights also play an important role in language and speech processing. We give a specific study of string-to-weight transducers, including algorithms for determinizing and minimizing these transducers very efficiently, and characterizations of the transducers admitting determinization and the corresponding algorithms. Some applications of these algorithms in speech recognition are described and illustrated.
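To illustrate why sequential transducers lend themselves to efficient programs, here is a minimal Python sketch, not taken from the paper. A sequential transducer is deterministic on its input, so applying it costs one table lookup per input symbol. The class name `SequentialTransducer`, the toy transition table, and the choice to return `None` on rejection are all assumptions made for illustration.

```python
from typing import Dict, Optional, Set, Tuple


class SequentialTransducer:
    """Deterministic-on-input string-to-string transducer (illustrative only)."""

    def __init__(self, start: int, finals: Set[int],
                 delta: Dict[Tuple[int, str], Tuple[int, str]]) -> None:
        self.start = start
        self.finals = finals
        # (state, input symbol) -> (next state, emitted output string)
        self.delta = delta

    def apply(self, word: str) -> Optional[str]:
        """Return the output for `word`, or None if the word is rejected."""
        state = self.start
        pieces = []
        for symbol in word:
            step = self.delta.get((state, symbol))
            if step is None:          # no transition defined: reject
                return None
            state, emitted = step
            pieces.append(emitted)
        return "".join(pieces) if state in self.finals else None


# Hypothetical toy machine: rewrites a -> x and b -> y, and accepts only
# words containing an even number of b's (state 0 is the only final state).
TOY = SequentialTransducer(
    start=0,
    finals={0},
    delta={
        (0, "a"): (0, "x"), (0, "b"): (1, "y"),
        (1, "a"): (1, "x"), (1, "b"): (0, "y"),
    },
)

accepted_output = TOY.apply("abba")
rejected_output = TOY.apply("ab")
```

Because each input symbol triggers exactly one lookup, the running time is linear in the input length and independent of the size of the machine.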
Visibly pushdown languages
, 2004
Abstract. We study congruences on words in order to characterize the class of visibly pushdown languages (Vpl), a subclass of context-free languages. For any language L, we define a natural congruence on words that resembles the syntactic congruence for regular languages, such that this congruence is of finite index if, and only if, L is a Vpl. We then study the problem of finding canonical minimal deterministic automata for Vpls. Though Vpls in general do not have unique minimal automata, we consider a subclass of VPAs called k-module single-entry VPAs that correspond to programs with recursive procedures without input parameters, and show that the class of well-matched Vpls do indeed have unique minimal k-module single-entry automata. We also give a polynomial time algorithm that minimizes such k-module single-entry VPAs.
1 Introduction
The class of visibly pushdown languages (Vpl), introduced in [1], is a subclass of context-free languages accepted by pushdown automata in which the input letter determines the type of operation permitted on the stack. Visibly pushdown languages are closed under all boolean operations, and problems such as inclusion, that are undecidable for context-free languages, are decidable for Vpl. Vpls are relevant to several applications that use context-free languages such as the model-checking of software programs using their pushdown models [13]. Recent work has shown applications in other contexts: in modeling semantics of effects in processing XML streams [4], in game semantics for programming languages [5], and in identifying larger classes of pushdown specifications that admit decidable problems for infinite games on pushdown graphs [6].
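The defining property mentioned above, that the input letter alone decides whether the automaton pushes, pops, or leaves the stack untouched, can be sketched in Python as follows. This is an illustrative simulation, not the paper's construction. The function `run_vpa`, the toy alphabet, and the decision to accept only well-matched words (empty stack at the end, no return on an empty stack) are assumptions.

```python
def run_vpa(word, calls, returns, push, pop, internal, start, finals):
    """Simulate a deterministic visibly pushdown automaton on `word`.

    push:     (state, call symbol)            -> (next state, stack symbol)
    pop:      (state, return symbol, stack top) -> next state
    internal: (state, internal symbol)        -> next state
    """
    state, stack = start, []
    for symbol in word:
        if symbol in calls:                 # call letters always push
            step = push.get((state, symbol))
            if step is None:
                return False
            state, gamma = step
            stack.append(gamma)
        elif symbol in returns:             # return letters always pop
            if not stack:
                return False                # unmatched return: reject here
            state = pop.get((state, symbol, stack.pop()))
            if state is None:
                return False
        else:                               # internal letters ignore the stack
            state = internal.get((state, symbol))
            if state is None:
                return False
    return state in finals and not stack


# Hypothetical toy VPA: properly nested round and square brackets around 'x'.
def accepts(word):
    return run_vpa(
        word,
        calls={"(", "["},
        returns={")", "]"},
        push={("q", "("): ("q", "P"), ("q", "["): ("q", "B")},
        pop={("q", ")", "P"): "q", ("q", "]", "B"): "q"},
        internal={("q", "x"): "q"},
        start="q",
        finals={"q"},
    )
```

Since the stack height after any prefix depends only on the letters read, two such automata over the same partitioned alphabet move their stacks in lockstep, which is what makes product constructions and hence boolean closure possible.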
Automata and coinduction (an exercise in coalgebra)
 LNCS
, 1998
The classical theory of deterministic automata is presented in terms of the notions of homomorphism and bisimulation, which are the cornerstones of the theory of (universal) coalgebra. This leads to a transparent and uniform presentation of automata theory and yields some new insights, amongst which coinduction proof methods for language equality and language inclusion. At the same time, the present treatment of automata theory may serve as an introduction to coalgebra.
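As a rough illustration of a coinduction proof of language equality, the Python sketch below tries to build a bisimulation containing a given pair of states: it starts from that pair and closes it under transitions, failing as soon as a related pair disagrees on acceptance. The function name `language_equal`, the DFA encoding as `(delta, accepting)` with a total transition dictionary, and the toy automata are assumptions, not notation from the paper.

```python
from collections import deque


def language_equal(dfa1, s1, dfa2, s2, alphabet):
    """Decide whether state s1 of dfa1 and state s2 of dfa2 accept the same language.

    Each DFA is a pair (delta, accepting) where delta maps (state, symbol)
    to a state and is assumed to be total over `alphabet`.
    """
    delta1, accepting1 = dfa1
    delta2, accepting2 = dfa2
    relation = {(s1, s2)}           # candidate bisimulation
    queue = deque([(s1, s2)])
    while queue:
        p, q = queue.popleft()
        if (p in accepting1) != (q in accepting2):
            return False            # related states must agree on acceptance
        for a in alphabet:
            pair = (delta1[(p, a)], delta2[(q, a)])
            if pair not in relation:
                relation.add(pair)  # successors must be related as well
                queue.append(pair)
    return True                     # `relation` is a bisimulation


# Hypothetical toy automata over the one-letter alphabet {a}.
EVEN = ({(0, "a"): 1, (1, "a"): 0}, {0})                      # even length
MOD4 = ({(i, "a"): (i + 1) % 4 for i in range(4)}, {0, 2})    # also even length
MOD3 = ({(i, "a"): (i + 1) % 3 for i in range(3)}, {0})       # length divisible by 3

same = language_equal(EVEN, 0, MOD4, 0, ["a"])
different = language_equal(EVEN, 0, MOD3, 0, ["a"])
```

The relation explored has at most one pair per combination of states, so the check terminates after a number of steps bounded by the product of the two state counts.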
Complete inverted files for efficient text retrieval and analysis
 Journal of the ACM
, 1987
Abstract. Given a finite set of texts S = {w1, ..., wk} over some fixed finite alphabet Σ, a complete inverted file for S is an abstract data type that provides the functions find(w), which returns the longest prefix of w that occurs (as a subword of a word) in S, freq(w), which returns the number of times w occurs in S, and locations(w), which returns the set of positions where w occurs in S. A data structure that implements a complete inverted file for S that occupies linear space and can be built in linear time, using the uniform-cost RAM model, is given. Using this data structure, the time for each of the above query functions is optimal. To accomplish this, techniques from the theory of finite automata and the work on suffix trees are used to build a deterministic finite automaton that recognizes the set of all subwords of the set S. This automaton is then annotated with additional information and compacted to facilitate the desired query functions. The result is a data structure that is smaller and more flexible than the suffix tree.
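To give a feel for the kind of automaton involved, the Python sketch below builds a deterministic automaton recognizing all subwords of a single text, using the widely known online suffix-automaton construction, and answers a find-style query on it. It is a simplification under stated assumptions: it handles one text rather than a set, omits the annotations and compaction the abstract describes, and the function names `build_subword_automaton` and `find` are hypothetical.

```python
def build_subword_automaton(text):
    """Return transition tables of a DFA accepting exactly the subwords of `text`.

    State 0 is the start state; every state is accepting. `link` holds
    suffix links and `length` the longest word reaching each state; both
    are used only during construction.
    """
    trans, link, length = [{}], [-1], [0]
    last = 0
    for ch in text:
        cur = len(trans)
        trans.append({})
        link.append(-1)
        length.append(length[last] + 1)
        p = last
        while p != -1 and ch not in trans[p]:
            trans[p][ch] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = trans[p][ch]
            if length[p] + 1 == length[q]:
                link[cur] = q
            else:
                # Split q: the clone keeps q's transitions but a shorter length.
                clone = len(trans)
                trans.append(dict(trans[q]))
                link.append(link[q])
                length.append(length[p] + 1)
                while p != -1 and trans[p].get(ch) == q:
                    trans[p][ch] = clone
                    p = link[p]
                link[q] = clone
                link[cur] = clone
        last = cur
    return trans


def find(trans, w):
    """Longest prefix of `w` that occurs as a subword of the indexed text."""
    state, matched = 0, 0
    for ch in w:
        if ch not in trans[state]:
            break
        state = trans[state][ch]
        matched += 1
    return w[:matched]


AUTOMATON = build_subword_automaton("abcbc")
```

A query walks at most one transition per character of w, so its cost depends on the query length and not on the size of the indexed text.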
Characterizing the Behavior of a Program Using Multiple-Length N-grams
, 2000
Some recent advances in intrusion detection are based on detecting anomalies in program behavior, as characterized by the sequence of kernel calls the program makes. Specifically, traces of kernel calls are collected during a training period. The substrings of fixed length N (for some N) of those traces are called N-grams. The set of N-grams occurring during normal execution has been found to discriminate effectively between normal behavior of a program and the behavior of the program under attack. The N-gram characterization, while effective, requires the user to choose a suitable value for N. This paper presents an alternative characterization, as a finite state machine whose states represent predictive sequences of different lengths. An algorithm is presented to construct the finite state machine from training data, based on traditional string-processing data structures but employing some novel techniques.
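The fixed-length N-gram baseline that the abstract starts from can be sketched in a few lines of Python. This shows only that baseline, not the paper's multiple-length finite state machine. The function names and the made-up kernel-call traces are assumptions for illustration.

```python
def ngrams(trace, n):
    """Set of all length-n windows (as tuples) occurring in `trace`."""
    return {tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)}


def anomalous_windows(trace, normal, n):
    """Windows of `trace`, in order, that never occurred during training."""
    windows = (tuple(trace[i:i + n]) for i in range(len(trace) - n + 1))
    return [w for w in windows if w not in normal]


# Hypothetical training trace of kernel calls collected during normal runs.
TRAINING = ["open", "read", "write", "close", "open", "read", "close"]
N = 3
NORMAL = ngrams(TRAINING, N)

# Hypothetical trace observed later; "exec" never appeared in training.
OBSERVED = ["open", "read", "write", "exec", "close"]
FLAGGED = anomalous_windows(OBSERVED, NORMAL, N)
```

The sketch makes the sensitivity to N visible: with N = 1 only the unseen call itself is flagged, while larger N flags every window that overlaps it and also requires more training data to cover normal behavior.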
From Regular Expressions to DFA's Using Compressed NFA's
 Theoretical Computer Science
, 1992
The sk-strings method for inferring PFSA
 In Proceedings of the
, 1997
We describe a simple, fast and easy to implement recursive algorithm with four alternate intuitive heuristics for inferring Probabilistic Finite State Automata. The algorithm is an extension for stochastic machines of the k-tails method introduced in 1972 by Biermann and Feldman for non-stochastic machines. Experiments comparing the two are done and benchmark results are also presented. It is also shown that sk-strings performs better than k-tails at least in inferring small automata.
Introduction
When given a finite number of examples of the behaviour of a probabilistic state determined machine, it is possible to imagine methods by which we can infer its structure. Ideally, we would like to identify the exact automaton which generated the strings. But it is impossible to do this from the behaviour of the machine because more than one nonminimal machine may generate the same language. This paper is concerned not with identifying the generating machine, which is demonstrably impossib...
A Taxonomy of Finite Automata Minimization Algorithms
, 1993
This paper presents a taxonomy of finite automata minimization algorithms. Brzozowski's elegant minimization algorithm differs from all other known minimization algorithms, and is derived separately. All of the remaining algorithms depend upon computing an equivalence relation on states. We define the equivalence relation, the partition that it induces, and its complement. Additionally, some useful properties are derived. It is shown that the equivalence relation is the greatest fixed point of an equation, providing a useful characterization of the required computation. We derive an upper bound on the number of approximation steps required to compute the fixed point. Algorithms computing the equivalence relation (or the partition, or its complement) are derived systematically in the same framework. The algorithms include Hopcroft's, several algorithms from textbooks (including Hopcroft and Ullman's [HU79], Wood's [Wood87], and Aho, Sethi, and Ullman's [ASU86]), and several new algorith...
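The fixed-point view of state equivalence can be illustrated with a simple refinement loop in Python. It starts from the coarsest candidate partition (accepting versus non-accepting) and splits blocks until nothing changes. This is a plain textbook-style sketch, not one of the paper's derived algorithms and not Hopcroft's algorithm. The function name `equivalence_classes` and the toy automaton are assumptions, and the transition function is assumed to be total.

```python
def equivalence_classes(states, alphabet, delta, accepting):
    """Map each state to a block id; equivalent states share an id.

    delta maps (state, symbol) to a state and must be total;
    `alphabet` must be an ordered sequence so signatures are comparable.
    """
    # Coarsest approximation: states differ only if acceptance differs.
    block = {s: (s in accepting) for s in states}
    while True:
        # A state's signature is its own block plus its successors' blocks.
        signature = {
            s: (block[s],) + tuple(block[delta[(s, a)]] for a in alphabet)
            for s in states
        }
        ids = {}
        refined = {s: ids.setdefault(signature[s], len(ids)) for s in states}
        # Each step only splits blocks, so an unchanged block count
        # means the partition is stable: the fixed point is reached.
        if len(ids) == len(set(block.values())):
            return refined
        block = refined


# Hypothetical toy DFA: four states in a cycle on 'a', accepting {0, 2}.
STATES = [0, 1, 2, 3]
DELTA = {(i, "a"): (i + 1) % 4 for i in STATES}
CLASSES = equivalence_classes(STATES, ["a"], DELTA, {0, 2})
```

Every pass that does not terminate increases the number of blocks by at least one, so the loop runs at most as many times as there are states.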