String pattern matching, in its various forms, is an important topic in theoretical computer science. This thesis concentrates on the problem of regular expression matching with submatch addressing, where the position and extent of the substrings matched by given subexpressions must be reported. The algorithms in widespread use at the time either take exponential worst-case time to find a match, handle only a subset of all regular expressions, or use space proportional to the length of the input string where constant space would suffice. This thesis proposes a new method for solving the submatch addressing problem using nondeterministic finite automata whose transitions are augmented with copy-on-write update operations. The resulting algorithm makes a single pass over the input string and always runs in time linear in the length of the input. Space consumption depends only on the regular expression used, not on the input string. To the author's knowledge, this is a new result. A prototype of a POSIX.2 compatible regular expression matcher using the algorithm was implemented. Benchmarking results indicate that the prototype compares favorably against some popular implementations. Furthermore, the absence of exponential or polynomial-time worst cases makes it possible to use any regular expression without performance problems, which is not the case with previous implementations or algorithms.
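To make the term "submatch addressing" concrete, the following minimal sketch reports the position and extent of each parenthesized subexpression for one match. It uses Python's standard `re` module purely as an illustration of the problem statement; it is not the algorithm proposed in the thesis, and the helper name `submatch_spans` is hypothetical.

```python
import re

def submatch_spans(pattern, text):
    """Return (start, end) offsets for the whole match and for each group.

    Index 0 is the whole match; index i is the i-th parenthesized
    subexpression. A group that took no part in the match is reported
    as (-1, -1). Returns None when the pattern does not match at all.
    Illustration only: this relies on Python's re engine, not on the
    tagged-automaton method described in the abstract.
    """
    m = re.search(pattern, text)
    if m is None:
        return None
    return [m.span(i) for i in range(m.re.groups + 1)]

if __name__ == "__main__":
    # Whole match "aabbbc", then the extents of (a+) and (b*).
    print(submatch_spans(r"(a+)(b*)c", "xxaabbbc"))
```

Each pair is a half-open offset range into the input string, which is the information a submatch-addressing matcher has to produce in addition to a plain yes/no answer.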
|
2010
|
The Design and Analysis of Computer Algorithms
– Aho, Hopcroft, et al.
- 1974
|
|
1052
|
The C Programming Language
– Kernighan, Ritchie
- 1978
|
|
553
|
Binary codes capable of correcting deletions, insertions and reversals
– Levenshtein
- 1966
|
|
461
|
A Logical Calculus of the Ideas Immanent in Nervous Activity
– McCulloch, Pitts
- 1943
|
|
226
|
Elements of the theory of computation
– Lewis, Papadimitriou
- 1981
|
|
179
|
Storing a sparse table with O(1) worst case access time
– Fredman, Komlós, et al.
- 1984
|
|
156
|
Purely functional data structures
– Okasaki
- 1998
|
|
117
|
LEX – a lexical analyzer generator
– Lesk, Schmidt
- 1975
|
|
113
|
Dynamic Perfect Hashing: Upper and Lower Bounds
– Dietzfelbinger, Karlin, et al.
- 1994
|
|
112
|
Derivatives of Regular Expressions
– Brzozowski
- 1964
|
|
92
|
Regular expression pattern matching for XML
– Hosoya, Pierce
|
|
71
|
Deterministic Part-of-Speech Tagging with Finite-State Transducers
– Roche, Schabes
- 1995
|
|
67
|
From Regular Expressions to Deterministic Automata
– Berry, Sethi
- 1986
|
|
66
|
Codes and Automata
– Berstel, Perrin, et al.
- 2009
|
|
56
|
Regular expressions and state graphs for automata
– McNaughton, Yamada
- 1960
|
|
51
|
Approximate matching of regular expressions
– Myers, Miller
- 1989
|
|
49
|
Storing a sparse table
– Tarjan, Yao
- 1979
|
|
42
|
Economy of description by automata, grammars, and formal systems
– Meyer, Fischer
- 1971
|
|
36
|
On the use of Regular Expressions for Searching Text
– Clarke, Cormack
- 1997
|
|
28
|
A four-russian algorithm for regular expression pattern matching
– Myers
- 1992
|
|
26
|
Functional Programming with Graphs
– Erwig
- 1997
|
|
17
|
Flex—Fast Lexical Analyzer Generator
– Paxson
- 1995
|
|
13
|
Representation of events in nerve nets and finite automata
– Kleene
- 1956
|
|
13
|
NFAs with tagged transitions, their conversion to deterministic automata and application to regular expressions
– Laurikari
- 2000
|
|
12
|
Programming techniques: Regular expression search algorithm
– Thompson
- 1968
|
|
10
|
Reporting exact and approximate regular expression matches
– Guimaraes, Oliva, et al.
- 1998
|
|
8
|
Storing a dynamic sparse table
– Aho, Lee
- 1986
|
|
8
|
Approximate regular expression pattern matching with concave gap penalties
– Knight, Myers
- 1995
|
|
8
|
Real-time Garbage Collection of a Functional Persistent Heap
– Oksanen
- 1999
|
|
7
|
A string manipulation language
– SNOBOL
- 1964
|
|
6
|
Efficiently building a parse tree from a regular expression
– Dubé, Feeley
- 2000
|
|
6
|
A procedure for checking equality of regular expressions
– Ginzburg
- 1967
|
|
5
|
Extending regular expressions with context operators and parse extraction
– Kearns
- 1991
|
|
5
|
Generation of pattern-matching algorithms by extended regular expressions
– Nakata
- 1993
|
|
5
|
Regular expressions with semantic rules and their application to data structure directed programs
– Nakata, Sassa
- 1991
|
|
4
|
Algorithms for finding patterns in strings
– Aho
- 1990
|
|
3
|
Finding patterns common to a set of strings (extended abstract)
– Angluin
- 1979
|
|
3
|
TLex v.68 user's manual
– Kearns
- 1990
|
|
3
|
Regular expressions with nested levels of back referencing form a hierarchy
– Larsen
- 1998
|
|
3
|
Design of sequential machines from their regular expressions
– Ott, Feinstein
- 1961
|
|
3
|
Languages and Parsing, volume 1 of Parsing Theory
– Sippu, Soisalon-Soininen
- 1988
|
|
2
|
Generating finite-state transducers for semistructured data extraction from the web
– Hsu, Dung
- 1998
|
|
1
|
Partial derivatives of regular expressions and finite automaton constructions
– Antimirov
- 1996
|
|
1
|
XML: The Annotated Specification
– DuCharme
- 1999
|
|
1
|
On the succinctness of different representations of languages
– Hartmanis
- 1980
|
|
1
|
Haruo Hosoya, Jérôme Vouillon. Regular expression types for XML
– P
- 2000
|
|
1
|
Approximate regular expression matching
– Mutko
- 1996
|
|
1
|
Parsing with finite state transducers
– Roche
|
|
1
|