NFAs with Tagged Transitions, their Conversion to Deterministic Automata and Application to Regular Expressions
In Proceedings of the 7th International Symposium on String Processing and Information Retrieval, 2000
Abstract

A conservative extension to traditional nondeterministic finite automata is proposed that adds "tags" to transitions in order to keep track of the positions in the input string at which selected transitions were last used. The resulting automata are reminiscent of nondeterministic Mealy machines. The formal semantics of automata with tagged transitions is given. An algorithm is given to convert these augmented automata to corresponding deterministic automata, which can be used to process strings efficiently. Application to regular expressions is discussed, explaining how the algorithms can be used to implement, for example, substring addressing and a lookahead operator, and an informal comparison with other widely used algorithms is made.
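The idea of a tag recording the input position at which a transition was last used can be illustrated with a minimal sketch. The automaton below is a hypothetical, hand-built example for the language a*b* and is not taken from the paper; the names `TRANSITIONS`, `closure`, and `run` are assumptions made for illustration. It simulates the nondeterministic automaton directly rather than converting it to a deterministic one, it places tags only on epsilon transitions, and because the example automaton is unambiguous it needs no rule for choosing between competing tag values.

```python
# Hypothetical tagged NFA for the language a*b*, with one tag (index 0)
# on the epsilon transition between the "a" loop and the "b" loop.
# Each transition is (source, symbol, tag, target); symbol None means epsilon.
TRANSITIONS = [
    (0, "a", None, 0),   # loop on 'a'
    (0, None, 0, 1),     # epsilon move that sets tag 0 to the current position
    (1, "b", None, 1),   # loop on 'b'
]
START, ACCEPT, NUM_TAGS = 0, 1, 1


def closure(threads, pos):
    """Follow epsilon transitions, recording `pos` in any tag they carry."""
    result = {}
    stack = list(threads.items())
    while stack:
        state, tags = stack.pop()
        if state in result:
            continue  # keep the first tag values seen for a state
        result[state] = tags
        for src, sym, tag, dst in TRANSITIONS:
            if src == state and sym is None:
                new_tags = tags
                if tag is not None:
                    new_tags = tags[:tag] + (pos,) + tags[tag + 1:]
                stack.append((dst, new_tags))
    return result


def run(text):
    """Return the tag values on acceptance, or None if `text` is rejected."""
    threads = closure({START: (None,) * NUM_TAGS}, 0)
    for pos, ch in enumerate(text):
        stepped = {}
        for state, tags in threads.items():
            for src, sym, tag, dst in TRANSITIONS:
                if src == state and sym == ch:
                    stepped.setdefault(dst, tags)
        threads = closure(stepped, pos + 1)
    return threads.get(ACCEPT)


print(run("aabbb"))  # (2,): the tagged transition was last used at position 2
print(run("ba"))     # None: "ba" is not in a*b*
```

In this sketch, tag 0 ends up holding the position where the run of a's stops and the run of b's begins, which is the kind of information substring addressing needs.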
Efficient Submatch Addressing for Regular Expressions
2001
Abstract

String pattern matching in its different forms is an important topic in theoretical computer science. This thesis concentrates on the problem of regular expression matching with submatch addressing, where the position and extent of the substrings matched by given subexpressions must be provided. The algorithms in widespread use at the time either take exponential worst-case time to find a match, can handle only a subset of all regular expressions, or use space proportional to the length of the input string where constant space would suffice. This thesis proposes a new method for solving the submatch addressing problem using nondeterministic finite automata with transitions augmented by copy-on-write update operations. The resulting algorithm makes a single pass over the input string, always using time linear in the length of the input. Space consumption depends only on the regular expression used, and not on the input string. To the author's knowledge, this is a new result. A prototype of a POSIX.2-compatible regular expression matcher using the algorithm was implemented. Benchmarking results indicate that the prototype compares favorably against some popular implementations. Furthermore, the absence of exponential or polynomial time worst cases makes it possible to use any regular expression without performance problems, which is not the case with previous implementations or algorithms.
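As a small illustration of what submatch addressing asks for, the following sketch reports the position and extent of each parenthesized subexpression's match. It uses Python's standard `re` module, which is a backtracking matcher and not the algorithm proposed in the thesis; the helper name `submatch_spans` is an assumption made for this example. It shows only the problem statement, not the linear-time method.

```python
import re


def submatch_spans(pattern, text):
    """Return (start, end) for every capture group, or None if no match.

    Uses Python's backtracking `re` engine purely to illustrate the
    output that submatch addressing is expected to produce.
    """
    m = re.fullmatch(pattern, text)
    if m is None:
        return None
    return [m.span(i) for i in range(1, m.re.groups + 1)]


print(submatch_spans(r"(a*)(b*)", "aabbb"))  # [(0, 2), (2, 5)]
print(submatch_spans(r"(a*)(b*)", "ba"))     # None
```

Each pair gives the start and end offsets of the substring matched by the corresponding subexpression.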