Gotchas

RE can get quite complicated and cryptic. But if something is not working as expected, the cause may simply be a quirky corner case.

Escape sequences

Some RE engines match a character literally if the escape sequence is not defined. Python raises an exception in such cases. Apart from the sequences defined for RE (for example \d), these are allowed: \a \b \f \n \N \r \t \u \U \v \x \\ where \b means backspace only inside character classes and \u \U \N are valid only in Unicode patterns.

>>> bool(re.search(r'\t', 'cat\tdog'))
True

>>> bool(re.search(r'\c', 'cat\tdog'))
re.error: bad escape \c at position 0
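
Because an undefined escape raises an exception instead of matching literally, a pattern that comes from user input or a config file can be checked before it is used. Here is a minimal sketch, where is_valid_pattern is a hypothetical helper name and not part of the re module:

```python
import re

def is_valid_pattern(pattern):
    """Return True if re can compile the pattern, False otherwise."""
    try:
        re.compile(pattern)
    except re.error:
        # raised for bad escapes such as \c, among other pattern errors
        return False
    return True

print(is_valid_pattern(r'\t'))   # True
print(is_valid_pattern(r'\c'))   # False
```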

In the replacement section, raw strings support octal escapes but not \x or Unicode escapes. If you are not otherwise using the \ character, prefer a normal string for the replacement section, since Python itself will then process \x and Unicode escapes before re sees them.

>>> re.sub(r',', r'\x7c', '1,2')
re.error: bad escape \x at position 0

>>> re.sub(r',', r'\174', '1,2')
'1|2'
>>> re.sub(r',', '\x7c', '1,2')
'1|2'
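
To make the difference concrete, the sketch below contrasts a normal string with a function used as the replacement. The variable names are only for illustration. A function's return value is inserted as-is, so no escape processing is applied to it:

```python
import re

# Python converts \u007c in a normal string before re sees the replacement
unicode_repl = re.sub(r',', '\u007c', '1,2')

# a function's return value is used verbatim, so the backslash stays literal
literal_repl = re.sub(r',', lambda m: r'\x7c', '1,2')

print(unicode_repl)   # 1|2
print(literal_repl)   # 1\x7c2
```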

Line anchors with \n as last character

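If line anchors are used as a standalone pattern, there is an additional start/end of line match after the last newline character. The end of line match after the newline is straightforward to understand, since $ matches both end of line and end of string. Likewise, in multiline mode ^ matches immediately after every newline, including one at the very end of the string.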

# note the use of inline flag (?m) to enable multiline mode for ^ and $
>>> print(re.sub(r'(?m)^', 'foo ', '1\n2\n'))
foo 1
foo 2
foo 

>>> print(re.sub(r'(?m)$', ' baz', '1\n2\n'))
1 baz
2 baz
 baz
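
Listing the match positions makes the extra match easier to see. The following is a small sketch using re.finditer. The (?!\Z) lookahead shown at the end is one possible workaround for the ^ case, not the only one:

```python
import re

text = '1\n2\n'

# positions where the standalone anchors match in multiline mode
starts = [m.start() for m in re.finditer(r'(?m)^', text)]
ends = [m.start() for m in re.finditer(r'(?m)$', text)]
print(starts)   # [0, 2, 4]
print(ends)     # [1, 3, 4]

# (?!\Z) rejects the start of line match at the very end of the string
prefixed = re.sub(r'(?m)^(?!\Z)', 'foo ', text)
print(repr(prefixed))   # 'foo 1\nfoo 2\n'
```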

Zero length matches

Quantifiers such as * and *+ can match zero characters, so beware of empty matches. See also regular-expressions: Zero-Length Matches.

# there is an extra empty string match right after each non-empty match
>>> re.sub(r'[^,]*', r'{\g<0>}', ',cat,tiger')
'{},{cat}{},{tiger}{}'
>>> regex.sub(r'[^,]*+', r'{\g<0>}', ',cat,tiger')
'{},{cat}{},{tiger}{}'

# use lookarounds as a workaround
>>> re.sub(r'(?<![^,])[^,]*', r'{\g<0>}', ',cat,tiger')
'{},{cat},{tiger}'
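
Two other ways to look at the same input are sketched here, assuming Python 3.7 or later (the version whose empty-match behavior the outputs above reflect). The variable names are illustrative only:

```python
import re

text = ',cat,tiger'

# findall makes the empty matches visible
pieces = re.findall(r'[^,]*', text)
print(pieces)   # ['', 'cat', '', 'tiger', '']

# + requires at least one character, so empty fields are left untouched
non_empty = re.sub(r'[^,]+', r'{\g<0>}', text)
print(non_empty)   # ,{cat},{tiger}

# splitting on the delimiter handles empty fields without any regex
wrapped = ','.join('{' + field + '}' for field in text.split(','))
print(wrapped)   # {},{cat},{tiger}
```

Which one fits depends on whether empty fields should be wrapped or skipped.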

Capture group with quantifiers

Referring to the text matched by a capture group with a quantifier gives only the last repetition, not the entire match. Use a non-capturing group inside a capture group to get the entire matched portion.

>>> re.sub(r'\A([^,]+,){3}([^,]+)', r'\1(\2)', '1,2,3,4,5,6,7')
'3,(4),5,6,7'
>>> re.sub(r'\A((?:[^,]+,){3})([^,]+)', r'\1(\2)', '1,2,3,4,5,6,7')
'1,2,3,(4),5,6,7'

# as mentioned earlier, findall can be useful for debugging purposes
>>> re.findall(r'([^,]+,){3}', '1,2,3,4,5,6,7')
['3,', '6,']
>>> re.findall(r'(?:[^,]+,){3}', '1,2,3,4,5,6,7')
['1,2,3,', '4,5,6,']
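
The same behavior can be inspected through a match object. This is a short sketch, with variable names chosen only for illustration:

```python
import re

row = '1,2,3,4,5,6,7'

# quantified capture group: group(1) keeps only the final repetition
m = re.search(r'\A([^,]+,){3}', row)
last_repeat = m.group(1)
whole_match = m.group(0)
print(last_repeat)   # 3,
print(whole_match)   # 1,2,3,

# non-capturing group nested inside a capture group keeps all repetitions
m2 = re.search(r'\A((?:[^,]+,){3})', row)
all_repeats = m2.group(1)
print(all_repeats)   # 1,2,3,
```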

Converting re to regex module

When using the flags option with the regex module, the constants should also come from the regex module. A typical workflow is shown below:

# Using re module, unsure if a feature is available
>>> re.findall(r'[[:word:]]+', 'fox:αλεπού,eagle:αετός', flags=re.A)
<stdin>:1: FutureWarning: Possible nested set at position 1
[]

# Convert re to regex: oops, output is still wrong
>>> regex.findall(r'[[:word:]]+', 'fox:αλεπού,eagle:αετός', flags=re.A)
['fox', 'αλεπού', 'eagle', 'αετός']

# Finally correct solution, the constant had to be changed as well
>>> regex.findall(r'[[:word:]]+', 'fox:αλεπού,eagle:αετός', flags=regex.A)
['fox', 'eagle']

# or, use inline flags to avoid these shenanigans
>>> regex.findall(r'(?a)[[:word:]]+', 'fox:αλεπού,eagle:αετός')
['fox', 'eagle']
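
One way to keep the function and the flag constant in sync is to take both from the same module object. Below is a minimal sketch, where find_ascii_words is a hypothetical helper. It uses \w instead of [[:word:]] so that it also runs with the re module. Passing the third-party regex module in place of re should work the same way, assuming it is installed, but that call is not shown here:

```python
import re

def find_ascii_words(module, text):
    """Find ASCII-only words; the function and the flag come from the same module."""
    return module.findall(r'\w+', text, flags=module.A)

words = find_ascii_words(re, 'fox:αλεπού,eagle:αετός')
print(words)   # ['fox', 'eagle']
```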

Optional arguments syntax

Speaking of flags, always try to pass them as a keyword argument. Passing flags as a positional argument is a common mistake with re.findall and re.sub, because the argument sits at a different position in each. Their syntax, as per the docs, is shown below:

re.findall(pattern, string, flags=0)

re.sub(pattern, repl, string, count=0, flags=0)

Here's an example:

>>> +re.I
2

# works because flags is the only optional argument for findall
>>> re.findall(r'key', 'KEY portkey oKey Keyed', re.I)
['KEY', 'key', 'Key', 'Key']

# no error, because re.I has a value of 2
# this is same as count=2
>>> re.sub(r'key', 'X', 'KEY portkey oKey Keyed', re.I)
'KEY portX oKey Keyed'

# correct use of keyword argument
>>> re.sub(r'key', 'X', 'KEY portkey oKey Keyed', flags=re.I)
'X portX oX Xed'
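
Another way to sidestep the problem is to compile the pattern once with its flags. The methods of a compiled pattern take no flags argument, so the flag cannot end up in the count slot. Here is a small sketch, with key_ci as an illustrative name:

```python
import re

# flags are attached to the compiled pattern itself
key_ci = re.compile(r'key', flags=re.I)
text = 'KEY portkey oKey Keyed'

found = key_ci.findall(text)
replaced = key_ci.sub('X', text)
first_two = key_ci.sub('X', text, count=2)

print(found)       # ['KEY', 'key', 'Key', 'Key']
print(replaced)    # X portX oX Xed
print(first_two)   # X portX oKey Keyed
```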

Summary

I hope you have found Python regular expressions an interesting topic to learn. Sooner or later, you'll need them if you do a lot of text processing. At the same time, it is important to know when normal string methods are enough and when to reach for other text parsing modules like json. Happy coding!