Gotchas

Regular expressions can get quite complicated and cryptic. So, it is natural to assume you have made a mistake if something isn't working as expected. However, sometimes it might just be one of the quirky corner cases discussed in this chapter.

Escape sequences

Some RE engines match characters literally if an escape sequence is not defined. Python raises an exception for such cases. Apart from sequences defined for RE (for example \d), these are allowed: \a \b \f \n \N \r \t \u \U \v \x \\ where \b means backspace only in character classes. Also, \u and \U are valid only in Unicode patterns.

>>> bool(re.search(r'\t', 'cat\tdog'))
True

>>> bool(re.search(r'\c', 'cat\tdog'))
re.error: bad escape \c at position 0

Only octal escapes are allowed inside raw strings in the replacement section. If you are otherwise not using the \ character, then using normal strings in replacement section is preferred as it will also allow hexadecimal and unicode escapes.

>>> re.sub(r',', r'\x7c', '1,2')
re.error: bad escape \x at position 0

>>> re.sub(r',', r'\174', '1,2')
'1|2'
>>> re.sub(r',', '\x7c', '1,2')
'1|2'

Line anchors with \n as the last character

There is an additional start/end of line match after the last newline character if line anchors are used as a standalone pattern. End of line match after newline is straightforward to understand as $ matches both the end of lines and the end of strings.

>>> print(re.sub(r'(?m)^', 'apple ', '1\n2\n'))
apple 1
apple 2
apple 

>>> print(re.sub(r'(?m)$', ' banana', '1\n2\n'))
1 banana
2 banana
 banana

Zero-length matches

Beware of empty matches. See also regular-expressions: Zero-Length Matches.

# there is an extra empty string match at the end of matches
>>> re.sub(r'[^,]*', r'{\g<0>}', ',cat,tiger')
'{},{cat}{},{tiger}{}'
>>> re.sub(r'[^,]*+', r'{\g<0>}', ',cat,tiger')
'{},{cat}{},{tiger}{}'

# use lookarounds as a workaround
>>> re.sub(r'(?<![^,])[^,]*', r'{\g<0>}', ',cat,tiger')
'{},{cat},{tiger}'

Capture group with quantifiers

Referring to the text matched by a capture group with a quantifier will give only the last match, not the entire match. You'll need an outer capture group to get the entire matched portion.

>>> re.sub(r'\A([^,]+,){3}([^,]+)', r'\1(\2)', '1,2,3,4,5,6,7')
'3,(4),5,6,7'
>>> re.sub(r'\A((?:[^,]+,){3})([^,]+)', r'\1(\2)', '1,2,3,4,5,6,7')
'1,2,3,(4),5,6,7'

# as mentioned earlier, findall can be useful for debugging purposes
>>> re.findall(r'([^,]+,){3}', '1,2,3,4,5,6,7')
['3,', '6,']
>>> re.findall(r'(?:[^,]+,){3}', '1,2,3,4,5,6,7')
['1,2,3,', '4,5,6,']

Converting re to regex module

When using the flags option with the regex module, the constants should also be used from the regex module.

# Using re module, assuming you don't yet know that this feature isn't supported
>>> re.findall(r'[[:word:]]+', 'fox:αλεπού,eagle:αετός', flags=re.A)
<stdin>:1: FutureWarning: Possible nested set at position 1
[]

# Convert re to regex: oops, output is still wrong
>>> regex.findall(r'[[:word:]]+', 'fox:αλεπού,eagle:αετός', flags=re.A)
['fox', 'αλεπού', 'eagle', 'αετός']

# Finally correct solution, the constant had to be changed as well
>>> regex.findall(r'[[:word:]]+', 'fox:αλεπού,eagle:αετός', flags=regex.A)
['fox', 'eagle']

# or, use inline flags to entirely avoid such shenanigans
>>> regex.findall(r'(?a)[[:word:]]+', 'fox:αλεπού,eagle:αετός')
['fox', 'eagle']

Optional arguments syntax

Speaking of flags, you should always pass them as a keyword argument. Using it as positional argument leads to a common mistake between re.findall() and re.sub() functions due to difference in placement. Their syntax, as per the docs, is shown below:

re.findall(pattern, string, flags=0)
re.sub(pattern, repl, string, count=0, flags=0)

Here's an example:

>>> +re.I
2

# works because flags is the only optional argument for findall
>>> re.findall(r'key', 'KEY portkey oKey Keyed', re.I)
['KEY', 'key', 'Key', 'Key']

# wrong usage, but no error because re.I has a value of 2
# so, this is same as specifying count=2
>>> re.sub(r'key', r'(\g<0>)', 'KEY portkey oKey Keyed', re.I)
'KEY port(key) oKey Keyed'

# correct use of keyword argument
>>> re.sub(r'key', r'(\g<0>)', 'KEY portkey oKey Keyed', flags=re.I)
'(KEY) port(key) o(Key) (Key)ed'
# alternatively, you can use inline flags to avoid this problem altogether
>>> re.sub(r'(?i)key', r'(\g<0>)', 'KEY portkey oKey Keyed')
'(KEY) port(key) o(Key) (Key)ed'

Summary

Hope you have found Python regular expressions an interesting topic to learn. Sooner or later, you'll need to use them if your project has text processing tasks. At the same time, knowing when to use normal string methods and knowing when to reach for other text parsing modules like json is important. Happy coding!

Understanding Python re(gex)?