Implementing Regular Expressions

Russ Cox

rsc@swtch.com


This page collects resources about
implementing regular expression search
efficiently.

Articles and Notes

“Regular Expression Matching Can Be Simple And Fast”

An introduction to using finite automata to implement
regular expression matching, and why the standard
backtracking implementation is a bad idea.

Supporting programs:

NFA |
DFA |
bounded-memoryDFA |
timing scripts


NFAs with submatch tracking:
Perl rules |
POSIX rules


Transliteration of Thompson's code for bytecode machine and x86, by Jan Burgy.

“Regular Expression Matching: the Virtual Machine Approach”

An introduction to submatch tracking during
efficient (non-backtracking) NFA-based regular expression matching.

Supporting programs:
http://code.google.com/p/re1/

“Regular Expression Matching in the Wild”

A tour of RE2, an efficient, production regular expression implementation.

Supporting programs:
http://code.google.com/p/re2/

“Regular Expression Matching with a Trigram Index”
[New!]

How Google Code Search worked.

Supporting programs:
http://code.google.com/p/codesearch/

“IBM 7094 Cheat Sheet”

If you want to read Ken Thompson's original 1968 paper
(see below),
you'll want to take this with you.

“Regular Expressions: Languages, Algorithms, and Software”

by Brian W. Kernighan and Rob Pike

The cleanest, simplest, backtracking implementation
you'll ever see.

(But still backtracking, so runs very slowly on some inputs.)

See also Chapter 9 of The Practice of Programming
and Chapter 1 of Beautiful Code.

Efficient Implementations

RE2 regular expression library

Efficient automaton-based implementation of
Perl-syntax regular expressions (excluding backreferences).
Written by Russ Cox.

Plan9 regular expression library

Efficient (non-backtracking) NFA implementation
with submatch tracking.
Accepts UTF-8 and wide-character Unicode input.
Traditional egrep syntax only.
Written by Rob Pike.

TRE regular expression library

Efficient (non-backtracking) NFA implementation
with submatch tracking.
POSIX-compliant, wide-character input.
Written by Ville Laurikari.

Plan9 grep

Very fast DFA-based grep(1) implementation.
Accepts UTF-8 input.
Traditional egrep syntax only.
Written by Ken Thompson.

References

Michael Rabin and Dana Scott,
“Finite automata and their decision problems,”
IBM Journal of Research and Development 3 (1959), pp.114–125.
http://www.research.ibm.com/journal/rd/032/ibmrd0302C.pdf

Introduces the concept of NFAs, shows they are equivalent in power to DFAs.
Won Rabin and Scott a Turing Award

Ken Thompson,
“Regular expression search algorithm,”
Communications of the ACM 11(6) (June 1968), pp.419–422.
http://doi.acm.org/10.1145/363347.363387
(PDF)

The first efficient regular expression search.
Four pages but dense: every time I read it I learn something new.
Take an IBM 7094 cheat sheet with you.
See also old slides
from a lecture about the paper.

Ville Laurikari,
“NFAs with Tagged Transitions,
their Conversion to Deterministic Automata
and
Application to Regular Expressions,”
in Proceedings of the Symposium on String Processing and
Information Retrieval, September 2000.
http://laurikari.net/ville/spire2000-tnfa.ps

How to track submatches during efficient (non-backtracking)
NFA-based regular expression search.

M. Douglas McIlroy,
“Enumerating the strings of regular languages,”
Journal of Functional Programming 14 (2004), pp.503–518.
http://www.cs.dartmouth.edu/~doug/nfa.ps.gz (preprint)

Not relevant to implementing regular expression matching,
but a beautiful example of regular expression analysis
and manipulation using Haskell.