Regex – How a Regex Engine Works Internally?

By Vijay Gupta On June 6th, 2011 Posted in regex 1 Comment

More than the output, it is the internal operation of how a particular expression works which is important, because then it becomes easy for us to formulate any simple or a complex expression. This will also save us from lots of guesswork and confusions while formulating an expression.

The following regex tree image will illustrate how the regex engine works.

Regex Tree

Knowing how the regex engine works will enable you to craft better regexes more easily. It will help you understand quickly why a particular regex does not do what you initially expected. This will save you lots of guesswork and head scratching when you need to write more complex regexes.

There are two kinds of regular expression engines: text-directed engines, and regex-directed engines. Jeffrey Friedl calls them DFA and NFA engines, respectively. All the regex flavors treated in this tutorial are based on regex-directed engines. This is because certain very useful features, such as lazy quantifiers and backreferences, can only be implemented in regex-directed engines. No surprise that this kind of engine is more popular.

Notable tools that use text-directed engines are awk, egrep, flex, lex, MySQL and Procmail. For awk and egrep, there are a few versions of these tools that use a regex-directed engine.

Text-directed engine is a DFA (Deterministic Finite Automation) which runs in linear time because they do not require backtracking (and thus they never test the same character twice). They also match the longest possible string. However, since a DFA engine contains only finite state, it cannot match a pattern with back references, and because it does not construct an explicit expansion, it cannot capture sub expressions.

The features of text-directed engine are:

  • Searching is fast.
  • Search time depends on the length of the string.
  • Takes more memory (state explosion in NFA to DFA construction) than NFA.
  • Takes longer to compile the regex.
    • might be done when program is compiled.
    • might be done at runtime (just before string matching is needed).
    • some implementations even compile the regex in the midst of matching (if a match is found before the entire DFA is constructed, they can just stop).

For example:

Regular Expression: (a | b)*abb

Regex-directed engine is a NFA (Non-deterministic Finite Automation) and this algorithm tests all possible expansions of a regular expression in a specific order, accepting the first match. Because a NFA constructs a specific expansion of the regular expression for a successful match, it can capture sub expression matches and matching of back references. However, since a NFA backtracks, it can visit exactly the same state multiple times if the state is arrived at over different paths. As a result, it runs slowly in the worst case. Since a NFA accepts the first match it finds, it can also leave other (possibly longer) matches undiscovered.

Regex-directed engines are much favored by programmers because they are more expressive than the DFA engine. Although in the worst case they can run slowly, we can steer them up to find matches in linear or polynomial time using patterns that reduce ambiguities and limit backtracking.

The .NET Framework regular expression engine is a backtracking regular expression matcher that incorporates a Nondeterministic Finite Automaton (NFA) engine such as that used by PERL, PYTHON, EMACS, and TCL.

Note: A very important point that has to be understood while using a regex-directed engine is that it will always return the leftmost match, even if a “better” match could be found later. When applying a regular expression to a string, the engine will start with the first character of the string. It will try all possible permutations of the regular expression with the first character. Only if all possibilities have been tried and found to fail, will the engine continue with the second character in the text. Again, it will try all possible permutations of the expression, in exactly the same order. The result is that the regex-directed engine will return the leftmost match.

The features of a regex-directed engine are:

  • Searching can be slow (more on this later).
  • Search time depends on regex.
  • Takes less memory than DFA (how much depends on regex).
  • Compiles more quickly than DFA (how much depends on regex).
  • Needs to explore paths.
    • Uses backtracking algorithm to “try out the guesses”.
    • Different flavors of backtracking give different performance.
      • when there are options, what order do we try them in?
    • Can add in “extras” with no serious costs.
      • Sub-expression trapping.
      • Back-references (non-regular!).

For example:

Regular Expression: (a | b)*abb

The Regex-Directed Engine Always Returns the Leftmost Match

This is a very important point to understand: a regex-directed engine will always return the leftmost match, even if a “better” match could be found later. When applying a regex to a string, the engine will start at the first character of the string. It will try all possible permutations of the regular expression at the first character. Only if all possibilities have been tried and found to fail, will the engine continue with the second character in the text. Again, it will try all possible permutations of the regex, in exactly the same order. The result is that the regex-directed engine will return the leftmost match.

When applying cat to He captured a catfish for his cat., the engine will try to match the first token in the regex c to the first character in the match H. This fails. There are no other possible permutations of this regex, because it merely consists of a sequence of literal characters. So the regex engine tries to match the c with the e. This fails too, as does matching the c with the space. Arriving at the 4th character in the match, c matches c. The engine will then try to match the second token a to the 5th character, a. This succeeds too. But then, t fails to match p. At that point, the engine knows the regex cannot be matched starting at the 4th character in the match. So it will continue with the 5th: a. Again, c fails to match here and the engine carries on. At the 15th character in the match, c again matches c. The engine then proceeds to attempt to match the remainder of the regex at character 15 and finds that a matches a and t matches t.

The entire regular expression could be matched starting at character 15. The engine is “eager” to report a match. It will therefore report the first three letters of catfish as a valid match. The engine never proceeds beyond this point to see if there are any “better” matches. The first match is considered good enough.

In this first example of the engine’s internals, our regex engine simply appears to work like a regular text search routine. A text-directed engine would have returned the same result too. However, it is important that you can follow the steps the engine takes in your mind. In following examples, the way the engine works will have a profound impact on the matches it will find. Some of the results may be surprising. But they are always logical and predetermined, once you know how the engine works.

What is Backtracking?

Backtracking is like leaving a pile of bread crumbs at every fork in the road. If the path we choose turns out to be a dead end, then we can retrace our steps giving up ground until we come across a pile of crumbs that indicates an untried path. Should that path, too, turn out to be a dead end, we can continue to backtrack, retracing our steps to the next pile of crumbs, and so on, until we eventually find a path that leads to our goal or until we run out of untried paths.

There are basically two points on backtracking: The general idea of how backtracking works is fairly simple, but some of the details are quite important for real-world use. Specifically, when faced with multiple choices, which choice should be tried first? And secondly, when forced to backtrack, which saved choice should the engine use?

In situations where the decision is between “make an attempt” and “skip an attempt,” as with items governed by a question, the engine always chooses to first make the attempt. It will return later (to try skipping the item) only if forced by the overall need to reach a global expression-wide match.

This simple rule has far-reaching repercussions. For starters, it helps to explain regex greediness, but not completely. To complete the picture, we need to know which (among possibly many) saved options to use when we backtrack. To simply put, the most recently saved option is the one returned to when a local failure forces backtracking. It’s LIFO (last in first out).

This is easily understood in the crummy analogy — if your path becomes blocked, you simply retrace your steps until you come across a pile of bread crumbs. The first you’ll return to is the most recently laid. The traditional analogy for describing LIFO also holds, like stacking and unstacking dishes, the most-recently stacked will be the first you’ll unstack.

Discussion 1 Response

Trackbacks

  1. Learn writing regular expression in 60 minutes | WebTechTuts  

Leave a Reply

*

About Us

webtechtuts is a site by web developers aimed at web developers and designers offering articles on technologies, skills and techniques to improve how you design and build websites. We cover HTML, CSS, Javascript, PHP, Photoshop, MySql, Oracle, and TeraData.