Efficient Submatch Addressing for Regular Expressions
Helsinki University of Technology; Departm
String pattern matching in its different forms is an important topic in theoretical computer science. This thesis concentrates on the problem of regular expression matching with submatch addressing, where the position and extent of the substrings matched by given subexpressions must be reported. The algorithms in widespread use at the time either take exponential worst-case time to find a match, can handle only a subset of all regular expressions, or use space proportional to the length of the input string where constant space would suffice. This thesis proposes a new method for solving the submatch addressing problem using nondeterministic finite automata whose transitions are augmented with copy-on-write update operations. The resulting algorithm makes a single pass over the input string and always runs in time linear in the length of the input. Space consumption depends only on the regular expression used, not on the input string. To the author's knowledge, this is a new result. A prototype of a POSIX.2 compatible regular expression matcher using the algorithm was implemented. Benchmarking results indicate that the prototype compares favorably against some popular implementations. Furthermore, the absence of exponential or polynomial worst-case running times makes it possible to use any regular expression without performance problems, which is not the case with previous implementations or algorithms.
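The following is a minimal illustrative sketch of the general idea, not the thesis's actual algorithm or prototype. It simulates a small hand-built automaton for the pattern a*(b+)c, where some epsilon transitions carry a tag that records the current input position. The names TNFA, closure, and match, the tag numbering, and the use of immutable Python tuples as a stand-in for copy-on-write updates are all assumptions made for this example. The sketch also assumes that tags appear only on epsilon transitions and that the pattern must match the whole input.

```python
from typing import Dict, List, Optional, Tuple

# A transition is (symbol, tag, target). symbol None means an epsilon move;
# tag None means the transition records no position.
Transition = Tuple[Optional[str], Optional[int], int]
Tags = Tuple[int, ...]

# Hypothetical hand-built tagged NFA for a*(b+)c.
# Tag 0 marks the start of the group (b+), tag 1 marks its end.
TNFA: Dict[int, List[Transition]] = {
    0: [("a", None, 0), (None, 0, 1)],
    1: [("b", None, 2)],
    2: [("b", None, 2), (None, 1, 3)],
    3: [("c", None, 4)],
    4: [],
}
START, ACCEPT, NUM_TAGS = 0, 4, 2


def closure(seeds: List[Tuple[int, Tags]], pos: int) -> Dict[int, Tags]:
    """Follow epsilon transitions from the seeds, setting tags to pos.

    The first arrival at a state wins, so each state keeps exactly one
    tag tuple and memory use is bounded by the automaton size.
    """
    reached: Dict[int, Tags] = {}
    stack = list(reversed(seeds))
    while stack:
        state, tags = stack.pop()
        if state in reached:
            continue
        reached[state] = tags
        for symbol, tag, target in reversed(TNFA[state]):
            if symbol is None:
                new_tags = tags
                if tag is not None:
                    # Build a new tuple instead of mutating a shared one.
                    new_tags = tags[:tag] + (pos,) + tags[tag + 1:]
                stack.append((target, new_tags))
    return reached


def match(text: str) -> Optional[Tags]:
    """Return (group_start, group_end) if the whole text matches, else None."""
    current = closure([(START, (-1,) * NUM_TAGS)], 0)
    for pos, ch in enumerate(text):
        seeds = [
            (target, tags)
            for state, tags in current.items()
            for symbol, _tag, target in TNFA[state]
            if symbol == ch
        ]
        current = closure(seeds, pos + 1)
        if not current:
            return None
    return current.get(ACCEPT)


if __name__ == "__main__":
    span = match("aabbc")
    print(span)
```

In this sketch the input is read once from left to right, each step does work bounded by the size of the automaton, and the only state kept between steps is one tag tuple per automaton state, so the memory used does not grow with the length of the input. For the input "aabbc" the returned pair delimits the substring "bb".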