Debugging Python Regular Expressions

Ever since I learned regular expressions, they have been one of my most beloved features of any language, especially Python. I know lots of people just cringe at the thought of having to sit down for 30 minutes to hash out a 40-character pattern, but they're too powerful to pass up. Up until recently, my process was just to think the problem through and slowly build up the pattern until I got something that worked. Debugging consisted of merely pulling pieces out and trying them individually. That was until I discovered the hidden Python regex debugging feature a couple of weeks ago on Stack Overflow, thanks to BatchyX.

I was unable to find any other official resources on this particular trick, so I've decided to do a little experimentation and write out a few examples.

Let's start with a simple example of what I've been rambling about thus far. Don't feel like you should understand this at all, yet.

>>> import re
>>> re.compile(r'<a href="(.+?)">(.+)</a>', 128)
literal 60
literal 97
literal 32
literal 104
literal 114
literal 101
literal 102
literal 61
literal 34
subpattern 1
  min_repeat 1 65535
    any None
literal 34
literal 62
subpattern 2
  max_repeat 1 65535
    any None
literal 60
literal 47
literal 97
literal 62

This is a fairly simple regex. It will pull the link and the text out of an anchor tag. Notice, however, the 128 passed as the second argument. That is the key that enables the fancy debugging mode (128 is the integer value of the re.DEBUG flag). The printed output is a little daunting at first glance, but let's try to clear that up. I'll start with a slightly smaller regex, so you can get a feel for how this works.
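Before moving on, here is a minimal sketch of what that first pattern actually extracts. The sample HTML string is my own made-up input, and the pattern is compiled without the debug flag just to keep the output quiet.

```python
import re

# re.DEBUG is the named constant behind the magic number 128.
assert re.DEBUG == 128

# Same pattern as above, compiled without the debug flag.
link = re.compile(r'<a href="(.+?)">(.+)</a>')

# Hypothetical sample input, not taken from any real page.
sample = '<a href="http://example.com/">Example</a>'

# Group 1 is the lazy (.+?) for the URL, group 2 the greedy (.+) for the text.
url, text = link.search(sample).groups()
print(url)   # http://example.com/
print(text)  # Example
```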

>>> re.compile('foo', 128)
literal 102
literal 111
literal 111

This output is a little nicer to look at. You get three 'literal' statements followed by numbers. This just means that the pattern is looking for three exact characters represented by their ordinal integer. Try this:

>>> for x in 'foo':
...     ord(x)
... 
102
111
111

See! Same values!
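Going the other way is just as easy. As a quick sketch, chr() turns the ordinals from the debug output back into characters. The list below is copied from the first nine 'literal' lines of the opening example.

```python
# Ordinals from the first nine 'literal' lines of the opening example.
ordinals = [60, 97, 32, 104, 114, 101, 102, 61, 34]

# chr() is the inverse of ord(): it maps each integer back to its character.
decoded = ''.join(chr(n) for n in ordinals)
print(decoded)  # <a href="
```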

Moving on...

>>> re.compile('fo{1,2}', 128)
literal 102
max_repeat 1 2
  literal 111

In this example, you see the new statement: max_repeat. This is thanks to our '{1,2}' tacked on to the first 'o'. It should make sense. Literal 111, or 'o', will appear 1 or 2 times. This same designation is used for the other repetition operators: *, + and ?. The * and + both allow ranges up to 65535.
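If you want to compare several repetition operators side by side, one option is to capture the debug output as a string. This is only a sketch: debug_dump is a helper name I made up, and it assumes a Python 3 interpreter where the debug text goes to standard output. Newer versions print the names in uppercase, may show the open-ended upper bound as a named constant instead of 65535, and may append extra compiler details, so treat the exact text as version dependent.

```python
import contextlib
import io
import re


def debug_dump(pattern):
    """Return whatever re.compile(pattern, re.DEBUG) prints, as a string.

    debug_dump is a hypothetical helper, not part of the re module.
    """
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        re.compile(pattern, re.DEBUG)
    return buf.getvalue()


# '?', '*' and '+' are greedy; adding a trailing '?' makes a repeat lazy.
for pattern in ('fo?', 'fo*', 'fo+', 'fo+?'):
    print(pattern)
    print(debug_dump(pattern))
```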

>>> re.compile('f(o{1,2})', 128)
literal 102
subpattern 1
  max_repeat 1 2
    literal 111

This pattern introduces a group, designated by 'subpattern'. Everything inside that group is indented beneath the subpattern line. Each accessible group is identified by a number (the 1). Even if you use named groups, the output will still show a number.
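Here is a small sketch of how a named group lines up with that number. The group name 'os' is just something I picked for illustration.

```python
import re

# Same shape as the pattern above, but with a named group ('os' is made up).
pattern = re.compile(r'f(?P<os>o{1,2})')

print(pattern.groups)            # 1
print(dict(pattern.groupindex))  # {'os': 1}

# The name and the number refer to the same group.
match = pattern.match('foo')
print(match.group(1))     # oo
print(match.group('os'))  # oo
```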

>>> re.compile('f(?:o{1,2})', 128)
literal 102
subpattern None
  max_repeat 1 2
    literal 111

Notice what changed? The subpattern is now followed by a None. This is because we switched over to using a non-capturing group with (?: ... ).
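You can see the same difference from the compiled pattern itself, without the debug output. This is a minimal sketch comparing the two patterns from the last two examples.

```python
import re

capturing = re.compile(r'f(o{1,2})')
non_capturing = re.compile(r'f(?:o{1,2})')

# The non-capturing version reports no groups at all.
print(capturing.groups)      # 1
print(non_capturing.groups)  # 0

# Both match the same text; only the captured pieces differ.
print(capturing.match('foo').groups())      # ('oo',)
print(non_capturing.match('foo').groups())  # ()
print(non_capturing.match('foo').group(0))  # foo
```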

The next debugging statement you'll likely see often (and one which BatchyX alluded to in his example) is 'in' when using square brackets.

>>> re.compile('f(?:[Oo]{1,2})', 128)
literal 102
subpattern None
  max_repeat 1 2
    in
      literal 79
      literal 111

This pattern will match both capital and lowercase 'o'. In the debugging output, you see that with the 'in' used, either literal 79 or 111 can appear 1 or 2 times.

One more example with square brackets shows the 'range' statement.

>>> re.compile('f(?:[a-z]{1,3})', 128)
literal 102
subpattern None
  max_repeat 1 3
    in
      range (97, 122)

I had to skew my example a little bit, but you see that when I search for [a-z], it is actually shown as a range of ordinals from 97-122.
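To double check the numbers in the 'in' and 'range' output, here is a quick sketch using ord() and chr(). Note that the range in the debug output includes both endpoints, unlike Python's range().

```python
# The two literals listed under 'in' for [Oo].
print(ord('O'), ord('o'))  # 79 111

# The endpoints shown for [a-z].
print(ord('a'), ord('z'))  # 97 122

# Both endpoints are part of the character class, hence the + 1 here.
letters = ''.join(chr(n) for n in range(97, 122 + 1))
print(letters)  # abcdefghijklmnopqrstuvwxyz
```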

Let's say we give up on matching various forms of 'foo' and just want to match 'f' followed by any number of anything (previously stated to have a max of 65535).

>>> re.compile('f(?:.*)', 128)
literal 102
subpattern None
  max_repeat 0 65535
    any None

There you go. max_repeat 0 through 65535 with 'any' used to denote the period. I don't know what the None following the 'any' represents, yet. This concludes the main part of what I know about regex debugging. The only items left that I've experimented with are the escaped character patterns: \W, \d, etc.

>>> re.compile(r'\w+\D{5,10}', 128)
max_repeat 1 65535
  in
    category category_word
max_repeat 5 10
  in
    category category_not_digit

This is just a sample of what the output from these looks like. The \w gives you 'category category_word' and the \D gives you 'category category_not_digit'. Pretty self-explanatory. I'll leave you to play with the rest of the escaped characters.
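If the 'not' categories seem odd, this little sketch shows that the uppercase escapes are simply the complements of the lowercase ones. The sample characters are arbitrary picks of mine, and re.fullmatch needs Python 3.4 or later.

```python
import re

# Arbitrary sample characters chosen for illustration.
samples = ['a', '_', '7', ' ', '-']

results = {}
for ch in samples:
    is_word = re.fullmatch(r'\w', ch) is not None
    is_digit = re.fullmatch(r'\d', ch) is not None
    # \W and \D match exactly the characters that \w and \d reject.
    assert (re.fullmatch(r'\W', ch) is not None) == (not is_word)
    assert (re.fullmatch(r'\D', ch) is not None) == (not is_digit)
    results[ch] = (is_word, is_digit)
    print(repr(ch), 'word:', is_word, 'digit:', is_digit)
```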

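A couple of other things to note: You can combine the '128' passed as the second argument with other flags like re.VERBOSE and re.MULTILINE. Adding them together works, but combining them with a bitwise OR (|) is the safer habit. Also, when compiling your patterns, note that the debugging output is only printed the first time you compile the pattern. If you immediately follow with the same compilation, no debugging output will be printed, because the re module caches compiled patterns and hands back the cached object. (Newer Python versions may skip the cache when the debug flag is set, in which case the output is printed every time.)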
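Here is a short sketch of combining the flags by name instead of by number. It also shows one way plain addition can go wrong: adding the same flag twice silently produces a different flag.

```python
import re

# Bitwise OR combines flags and is safe even if a flag is repeated.
flags = re.DEBUG | re.VERBOSE | re.MULTILINE
print(int(flags))  # 200

# With each flag used once, addition gives the same number.
assert int(flags) == 128 + 64 + 8

# Adding re.VERBOSE (64) twice lands on 128, which is re.DEBUG.
assert int(re.VERBOSE) + int(re.VERBOSE) == int(re.DEBUG)

# OR-ing a flag with itself leaves it unchanged.
assert (re.VERBOSE | re.VERBOSE) == re.VERBOSE
```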

That's all the tips I've been able to come up with on this subject. Please, if anyone else knows some added syntax, leave a comment.

Cheers!

Posted by Sean Stoops on October 28, 2008


