## Extending Finite Automata to Efficiently Match Perl-Compatible Regular Expressions

Citations: 5 (0 self)

### BibTeX

@MISC{Becchi_extendingfinite,

author = {Michela Becchi},

title = {Extending Finite Automata to Efficiently Match Perl-Compatible Regular Expressions},

year = {}

}

### Abstract

Regular expression matching is a crucial task in several networking applications. Current implementations are based on one of two types of finite state machines. Non-deterministic finite automata (NFAs) have minimal storage demand but have high memory bandwidth requirements. Deterministic finite automata (DFAs) exhibit low and deterministic memory bandwidth requirements at the cost of increased memory space. It has already been shown how the presence of wildcards and repetitions of large character classes can render DFAs and NFAs impractical. Additionally, recent security-oriented rule-sets include patterns with advanced features, namely back-references, which add to the expressive power of traditional regular expressions and cannot therefore be supported through classical finite automata. In this work, we propose and evaluate an extended finite automaton designed to address these shortcomings. First, the automaton provides an alternative approach to handle character repetitions that limits memory space and bandwidth requirements. Second, it supports back-references without the need for backtracking in the input string. In our discussion of this proposal, we address practical implementation issues and evaluate the automaton on real-world rule-sets. To our knowledge, this is the first high-speed automaton that can accommodate all the Perl-compatible regular expressions present in the Snort network intrusion detection system.
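To make the abstract's first point concrete, the following is a minimal Python sketch, not the construction proposed in the paper. It assumes a hypothetical pattern of the form `a.{n}b` searched anywhere in the input. A DFA for such a pattern needs a number of states that grows exponentially in `n`, because it must remember which of the recent input positions held an `a`. The sketch instead keeps a short queue of pending start offsets, which play the role of counter instances, and scans the input once without backtracking. The function name `match_with_counters` and its parameters are illustrative assumptions.

```python
from collections import deque


def match_with_counters(text, n):
    """Return the end offsets of all matches of the pattern a.{n}b in text.

    Illustrative sketch only: counting is done with explicit offsets
    instead of replicating automaton states for every possible count.
    """
    pending = deque()  # offsets of 'a' occurrences still waiting for their 'b'
    matches = []
    for i, ch in enumerate(text):
        # An 'a' seen at offset s can only complete at offset s + n + 1.
        if pending and pending[0] + n + 1 == i:
            pending.popleft()
            if ch == "b":
                matches.append(i)
        # Every 'a' opens a new counter instance.
        if ch == "a":
            pending.append(i)
    return matches
```

For example, `match_with_counters("axxb", 2)` reports a single match ending at offset 3. The queue never holds more than `n + 1` offsets, so the state kept in this sketch grows linearly with `n` rather than exponentially.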

### Citations

3836 | Introduction to automata theory, languages, and computation
- Hopcroft, Motwani, et al.
Citation Context ...[14]. The basic challenge with high-speed regular expression evaluation is to minimize both memory space and memory bandwidth. Finite automata (FA) are typically used to represent regular expressions [2]. Two classic automata are used for this purpose, and each has its strengths and weaknesses. Non-deterministic finite automata (NFAs) have the benefit of a limited memory space requirement, which is d...

825 | Snort - lightweight intrusion detection for networks
- Roesch
- 1999
Citation Context ...line rates (up to several gigabits per second) against large data-sets, sometimes consisting of thousands of patterns. Examples include network intrusion detection and prevention systems (e.g., Snort [6][7], Bro [8], Cisco Security Appliance [10], Citrix Application Firewall [11]), Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without f...

532 | Efficient string matching: an aid to bibliographic search
- Aho, Corasick
- 1975
Citation Context ...ed an algorithm to compress a DFA through the introduction of default transitions, a generalization of the failure pointer concept presented in the classical Aho-Corasick algorithm for string-matching [1]. Their work is based on the idea of trading off memory storage requirement with processing time. A more general and less complex algorithm to achieve the same goal was recently proposed by Becchi et a...

301 | Efficient filtering of XML documents for selective dissemination of information
- Altinel, Franklin
- 2000
Citation Context ...k Crowley, Washington University, Computer Science and Engineering, St. Louis, MO 63130-4899, pcrowley@wustl.edu ... email scanning systems (ClamAV [9]), application-level filtering and content-based routing [12]. While a substantial amount of work has focused on exact-match string search, research interest has recently moved toward designing data structures, algorithms and architectures to support regular ex...

144 | Fast regular expression matching using FPGAs
- Sidhu, Prasanna
- 2001
Citation Context ...-sets. We conclude in Section 9. 2. BACKGROUND The prior work in the area of regular expression matching at line rate can be categorized by distinct implementation targets: FPGA-based implementations [22][23][24][25][26] and approaches suitable for deployment on a general-purpose processor or on ASIC hardware [15][16][17][18][19][20][21]. The extended automaton proposed in this paper can be applied t...

93 | Enhancing Byte-Level Network Intrusion Detection Signatures with Context
- Sommer, Paxson
- 2003
Citation Context ...g data structures, algorithms and architectures to support regular expressions, which are more expressive than exact-match strings and therefore able to describe a wider variety of pattern signatures [13][14]. The basic challenge with high-speed regular expression evaluation is to minimize both memory space and memory bandwidth. Finite automata (FA) are typically used to represent regular expressions ...

76 | A high throughput string matching architecture for intrusion detection and prevention
- Tan, Sherwood
- 2005
Citation Context ... rate can be categorized by distinct implementation targets: FPGA-based implementations [22][23][24][25][26] and approaches suitable for deployment on a general-purpose processor or on ASIC hardware [15][16][17][18][19][20][21]. The extended automaton proposed in this paper can be applied to all these implementation scenarios. However, we reserve the evaluation on FPGAs for future work. The two main ...

31 | An improved algorithm to accelerate regular expression evaluation
- Becchi, Crowley
Citation Context ...the 16.3% of regular expressions containing dot-star terms in order to compile them into feasible data structures. Note that this is independent of the use of efficient DFA compression techniques [17][19]. Rule partitioning implies that several DFA instances must be created and operated in parallel, which requires an increase in memory bandwidth linear in the number of DFAs. Therefore, 60.3% of Snort ...

29 | Bro: a system for detecting network intruders
- Paxson
- 1999
Citation Context ...up to several gigabits per second) against large data-sets, sometimes consisting of thousands of patterns. Examples include network intrusion detection and prevention systems (e.g., Snort [6][7], Bro [8], Cisco Security Appliance [10], Citrix Application Firewall [11]), ...

18 | NFAs with Tagged Transitions, their Conversion to Deterministic Automata and Application to Regular Expressions
- Laurikari
- 2000
Citation Context ...e performed through a finite state machine, which is intrinsically memory-less. One could consider augmenting an NFA with tags capturing the beginning (and the end) of a match as proposed by Laurikari [5] to solve the problem of determining the position of a match or of a sub-match in linear time. However, the problem is more complicated. In fact, there exist situations where the start and the end of ...

12 | Perl Compatible Regular Expressions. http://www.pcre.org

9 | Algorithms to accelerate multiple regular expressions matching for deep packet inspection
- Kumar, et al.
Citation Context ...ion the 16.3% of regular expressions containing dot-star terms in order to compile them into feasible data structures. Note that this is independent of the use of efficient DFA compression techniques [17][19]. Rule partitioning implies that several DFA instances must be created and operated in parallel, which requires an increase in memory bandwidth linear in the number of DFAs. Therefore, 60.3% of Sn...

4 | Fast and memory-efficient regular expression matching for deep packet inspection
- Yu, et al.
- 2006
Citation Context ...required to encode a DFA representing a set of regular expressions can increase exponentially as compared with an NFA representation, a fact that often renders DFAs infeasible for practical rule-sets [16][20]. Practical rule-sets often include three categories of patterns that make FA implementations problematic. The first are unbounded repetitions of sub-patterns, particularly those involving wildcar...

4 | A workload for evaluating deep packet inspection architectures
- Becchi, Franklin, et al.
- 2008
Citation Context ...ext state information, and one for the counters and the back-reference information. All DFAs are compressed through the techniques detailed in Section 5 [19] and are encoded using indirect addressing [27]. State identifiers carry the information about which labeled outgoing transitions are present, thus allowing one memory access per state traversal. As we will detail in Section 8, 32-bit state identif...

3 | Advanced Algorithms for Fast and Scalable Deep Packet Inspection
- Kumar, et al.

3 | A Hybrid Finite Automaton for
- Becchi, Crowley
Citation Context .... 6.2 The solution One way to avoid this effect is to isolate each counting constraint/back-reference/dot-star condition from the other. This can be accomplished by using a hybrid-FA, as described in [20]. Specifically, the hybrid-FA can be built as follows. The subset construction operation (that is, NFA ... [Figure 8: Exemplification of DFA obtained by compiling RegEx RE 1=.*RE 1a.*RE...]

3 | Curing regular expressions matching algorithms from insomnia, amnesia, and acalculia
- Kumar, et al.
Citation Context ...automata, either DFAs or NFAs, can handle the 6% of Snort rules containing back-references. The problem of unconstrained repetitions of large character classes has recently been addressed in [20] and [21]. However, as we discuss in Section 2, neither of these proposals treats counting constraints in an exhaustive way. Finally, back-references have been fully omitted from previous work. In this paper, w...

3 | A Scalable Architecture For High-Throughput Regular-Expression
- Brodie, et al.
Citation Context ...n [20] and, in fact, how our proposal can be viewed as a natural extension of it. Also, our extended deterministic finite automata can be used to generalize existing techniques based on multiple DFAs [25] in order to handle regular expressions with counting constraints. The remainder of this paper is organized as follows: In Section 2, we provide additional background and describe our contributions in...

2 | Assisting Network Intrusion Detection with Reconfigurable Hardware
- Franklin, et al.
Citation Context ...s. We conclude in Section 9. 2. BACKGROUND The prior work in the area of regular expression matching at line rate can be categorized by distinct implementation targets: FPGA-based implementations [22][23][24][25][26] and approaches suitable for deployment on a general-purpose processor or on ASIC hardware [15][16][17][18][19][20][21]. The extended automaton proposed in this paper can be applied to al...

2 | Efficient reconfigurable logic circuit for matching complex network intrusion detection patterns
- Clark, et al.
Citation Context ...e conclude in Section 9. 2. BACKGROUND The prior work in the area of regular expression matching at line rate can be categorized by distinct implementation targets: FPGA-based implementations [22][23][24][25][26] and approaches suitable for deployment on a general-purpose processor or on ASIC hardware [15][16][17][18][19][20][21]. The extended automaton proposed in this paper can be applied to all th...

2 | Compiling PCRE to FPGA for Accelerating
- Mitra, et al.
Citation Context ...e in Section 9. 2. BACKGROUND The prior work in the area of regular expression matching at line rate can be categorized by distinct implementation targets: FPGA-based implementations [22][23][24][25][26] and approaches suitable for deployment on a general-purpose processor or on ASIC hardware [15][16][17][18][19][20][21]. The extended automaton proposed in this paper can be applied to all these imple...

1 | Mastering Regular Expressions, Third Edition
- Friedl
- 2006
Citation Context ... the problem of back-references. To this end, it is worth mentioning how the problem is faced within the string-processing arena, that is, within tools like grep, awk, Perl and so on. As explained in [3], string-processing tools are based on either a text-directed or a regex-directed engine. In either case, the regular expression under consideration is a ...

1 | Polygraph: Automatic Signature Generation for Polymorphic Worms
- Newsome, et al.
- 2005
Citation Context ...ta structures, algorithms and architectures to support regular expressions, which are more expressive than exact-match strings and therefore able to describe a wider variety of pattern signatures [13][14]. The basic challenge with high-speed regular expression evaluation is to minimize both memory space and memory bandwidth. Finite automata (FA) are typically used to represent regular expressions [2]....