"Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems." -- JWZ
Regexes can sometimes be cryptic and difficult to understand. On top of that, when a regex fails, regex implementations rarely give you any feedback to figure out why or where. I've spent a suboptimal portion of my life staring bemusedly at regexes that weren't quite right, so I decided to write a small hack to fix my annoyances.
Since I had a regex implementation kicking around for Myrddin, I decided to instrument it to make a tool that would make it easier to see where regexes failed.
The regex dialect used is documented on the Myrddin reference page
Understanding the Output
Before describing how it works, I'm going to go through some examples of what it dumps, and how it can be used to debug things:
We're going to be ignoring the fundamental rule of validating email addresses with regex ("don't") and running with a very broken email regex:
$ redump '[a-zA-Z0-9]+@[a-zA-Z0-9]+.(com|org|net)'
redump will sit waiting for text to match on stdin, so let's type in an email that matches our expectations:
email@example.com Matched: firstname.lastname@example.org group 1: org coverage: [a-zA-Z0-9]+@[a-zA-Z0-9]+.(com|org|net) ^ ^^^ ^^^ ^^^^^
This tells us, pretty clearly: The string matched, and the group that was selected was 'org', since that was the only parenthesized section. It also shows the coverage for the regex -- in other words, the part of the regex that we passed through in order to match the string. In this case, we covered these all clauses up to the final selection, where the final match only included com.
Now, on the other hand, If the regex fails to match, we also get an indicator showing where the regex was where it failed, making it far easier to figure out why it failed:
email@example.com Match failed: [a-zA-Z0-9]+@[a-zA-Z0-9]+.(com|org|net) ~~~~~~~~~~~~^ firstname.lastname@example.org ~~~~~~^ email@example.com Match failed: [a-zA-Z0-9]+@[a-zA-Z0-9]+.(com|org|net) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^ firstname.lastname@example.org ~~~~~~~~~~~~~~~~~~~~~^
This tells us where we were in the string when the regex failed, as well as
where we were in the regex. For example, with the input
email@example.com, the regex failed because of the '_'. Matching up
with the pattern, the failure is immediately after the
this pattern is failing because the first section of the pattern doesn't
The second input fails on the
edu in the string, and the
net in the
pattern, again highlighting the source of the failure quickly and easily,
showing where the farthest mismatch in the string is.
Note: in the current implementation of coverage printing, if we have a single clause, there is only one caret printed, even if it spans multiple characters. Therefore, if we passed through
We would mark that in coverage as
This is a restriction of the current way that debug information about the regex is recorded, but is probably not too difficult to fix.
Getting the Code
redump is built on top of the Myrddin regex
library, and as a result is shipped with the Myrddin programming language.
The dialect of regex implemented is documented here in the API reference
for the library
The code lives here:
Implementing The Debugger
The regex engine I'm using is implemented as a relatively simple NFA
virtual machine, which will walk through a set of instructions, either
matching the instruction or dying. So, for example, the regex
will generate the following code, as dumped by
redump -v abc:
0: `Ilbra 0 1: `Ibyte 97 (a) 2: `Ibyte 98 (b) 3: `Ibyte 99 (c) 4: `Irbra 0 5: `Imatch
The regex starts off at instruction 0.
Ilbra 0 marks the start of a capture
group 0, which captures the whole expression. The next three instructions will
fail the match if the next byte in the pattern is not
a, followed by
c. The next
Irbra instruction closes the match group started by
recording the capture group 0. Finally,
Imatch ends the expression
But this is a pretty boring example. What if we have more interesting regexes,
a*b? This compiles to something like:
0: `Ilbra 0 1: `Ifork (2, 4) 2: `Ibyte 97 (a) 3: `Ijmp 1 4: `Ibyte 98 (b) 5: `Irbra 0 6: `Imatch
The structure looks pretty similar, but there are a couple of very interesting
instructions in there.
Ifork will split the execution stream, effectively
branching both ways. Executing this regex program on 'aab' will succeed,
giving the following execution trace:
The red squares indicate failed attempts at matching. The final green square indicates a successful match.
However, if run over 'aacdef', the first mismatched character, 'c', will cause the final thread to exit, since it matches neither live thread. The
If we want to know why a regex failed, the last thread alive is the most interesting one. Simply mapping from the final program counter to the offset in the regex is sufficient to know where the regex failed with enough fidelity to enable debugging.
To support debugging, the regex parser and compiler needed to be instrumented to keep track of where nodes in the AST were generated, so that they could be mapped back to offsets within the pattern. The debug information is stored along side the regex, and is pulled out by redump when formatting the output:
0: offset 0: `Ilbra 0 // start 1: offset 1: `Ifork (2, 4) // from * 2: offset 0: `Ibyte 97 (a) // from 'a' 3: offset 1: `Ijmp 1 // from * 4: offset 2: `Ibyte 98 (b) // from b 5: offset 0: `Irbra 0 // don't care 6: offset 0: `Imatch // don't care
There's not too much more to it. We know what match the regex failed on, where we were when we failed, and where in the pattern that came from.
Often, it's just as confusing trying to sort out why a regex failed as why it succeeded. A bug in the pattern can allow some essential portion to be skipped, or some portion may capture too much data. Coming up with a useful method of debugging successful matches is slightly trickier, but I think that the coverage solution presented above does a good job of making it easier to understand why a pattern matched.
The debug information needed for generating the coverage info is a superset of the information needed to debug failures. In addition to knowing the farthest PC of the thread, a full trace of the states that it hit is needed.
This is recorded during regex execution as per-thread a bit set of instruction pointers, indexed by the thread id. Each step advanced through the instructions is recorded, and the winning thread is stored at the end of regex execution.
Finally, after the success is recorded, the set of 'hit' locations is generated by mapping from the instruction pointers to the regex offsets, and printing a caret for every regex offset that was hit.
There's one problem with the coverage information being used for this purpose: Currently, each pc only maps to a single point, rather than a range. This causes some loss of fidelity, and is the reason that the coverage caret only shows up under the start of a multi character entity.
Fixing this bug would involve a bit of extra book keeping, but should be fixable.
That's it. Regexes are debuggable, and you can use this code to do it.
If you want to learn more about how regexes work in detail, I'd suggest reading Russ Cox's excellent series: https://swtch.com/~rsc/regexp/