Python 3.11: possessive quantifiers and atomic grouping added to re module

Quoting from What's New In Python 3.11:

Atomic grouping ((?>...)) and possessive quantifiers (*+, ++, ?+, {m,n}+) are now supported in regular expressions. (Contributed by Jeffrey C. Jacobs and Serhiy Storchaka in bpo-433030.)

Python possessive quantifiers and atomic grouping

Poster created using Canva

info If you are not familiar with regular expressions, see my Understanding Python re(gex)? ebook to get started.

Backtracking🔗

Greedy quantifiers match as much as possible, provided the overall regex is satisfied. For example, :.* will match : followed by rest of the input line. However, if you change the pattern to :.*apple, the .* portion cannot simply consume the rest of the input line. The regex engine will have to find the largest portion such that apple is also part of the match (provided the input has such a string, of course).

>>> import re

>>> ip = 'fig:mango:pineapple:guava:apples:orange'

>>> re.search(r':.*', ip)[0]
':mango:pineapple:guava:apples:orange'

>>> re.search(r':.*apple', ip)[0]
':mango:pineapple:guava:apple'

For the :.*apple case, the Python regular expression engine actually does consume all the characters on seeing .*. Then realizing that the overall match failed, it gives back one character from the end of line and checks again. This process is repeated until a match is found or failure is confirmed. In regular expression parlance, this is called backtracking.

This type of exploring matches to satisfy overall regex also applies to non-greedy quantifiers. .*? will start with zero characters followed by one, two, three and so on until a match is found.

>>> ip = 'fig:mango:pineapple:guava:apples:orange'

>>> re.search(r':.*?', ip)[0]
':'

>>> re.search(r':.*?apple', ip)[0]
':mango:pineapple'

Note that some regex engines like re2 do not use backtracking.

Possessive quantifiers🔗

Until Python 3.10, you had to use alternatives like the third-party regex module for possessive quantifiers. The re module supports possessive quantifiers from Python 3.11 version.

The difference between greedy and possessive quantifiers is that possessive will not backtrack to find a match. In other words, possessive quantifiers will always consume every character that matches the pattern on which it is applied. Syntax wise, you need to append + to greedy quantifiers to make it possessive, similar to adding ? for non-greedy case.

Unlike greedy or non-greedy quantifiers, :.*+apple will never match, because .*+ will consume rest of the line, leaving no way to match apple.

$ python3.11 -q
>>> import re

>>> ip = 'fig:mango:pineapple:guava:apples:orange'

>>> re.search(r':.*+', ip)[0]
':mango:pineapple:guava:apples:orange'

>>> bool(re.search(r':.*+apple', ip))
False

Here's a more practical example. Suppose you want to match integer numbers greater than or equal to 100 where these numbers can optionally have leading zeros.

>>> numbers = '42 314 001 12 00984'

# this solution fails because 0* and \d{3,} can both match leading zeros
# and greedy quantifiers will give up characters to help overall regex succeed
>>> re.findall(r'0*\d{3,}', numbers)
['314', '001', '00984']

# here 0*+ will not give back leading zeros after they are consumed
>>> re.findall(r'0*+\d{3,}', numbers)
['314', '00984']

# workaround if possessive quantifiers are not supported
>>> re.findall(r'0*[1-9]\d{2,}', numbers)
['314', '00984']

Here's another example. The goal is to match lines whose first non-whitespace character is not a # character. A matching line should have at least one non-# character, so empty lines and those with only whitespace characters should not match.

>>> lines = ['#comment', 'c = "#"', '\t #comment', 'abc', '', ' \t ']

# this solution fails because \s* can backtrack
# and [^#] can match a whitespace character as well
>>> [e for e in lines if re.match(r'\s*[^#]', e)]
['c = "#"', '\t #comment', 'abc', ' \t ']

# this works because \s*+ will not give back any whitespace characters
>>> [e for e in lines if re.match(r'\s*+[^#]', e)]
['c = "#"', 'abc']

# workaround if possessive quantifiers are not supported
>>> [e for e in lines if re.match(r'\s*[^#\s]', e)]
['c = "#"', 'abc']

Atomic grouping🔗

(?>pat) is an atomic group, where pat is the pattern you want to safeguard from further backtracking by isolating it from other parts of the regex.

Here's an example with greedy quantifier:

>>> numbers = '42 314 001 12 00984'

# 0* is greedy and the (?>) grouping prevents backtracking
# same as: re.findall(r'0*+\d{3,}', numbers)
>>> re.findall(r'(?>0*)\d{3,}', numbers)
['314', '00984']

Here's an example with non-greedy quantifier:

>>> ip = 'fig::mango::pineapple::guava::apples::orange'

# this matches from the first '::' to the first occurrence of '::apple'
>>> re.search(r'::.*?::apple', ip)[0]
'::mango::pineapple::guava::apple'

# '(?>::.*?::)' will match only from '::' to the very next '::'
# '::mango::' fails because 'apple' isn't found afterwards
# similarly '::pineapple::' fails
# '::guava::' succeeds because it is followed by 'apple'
>>> re.search(r'(?>::.*?::)apple', ip)[0]
'::guava::apple'

info The regex module has a regex.REVERSE flag to match from right-to-left making it better suited than atomic grouping for certain cases.

>>> import regex

>>> ip = 'fig::mango::pineapple::guava::apples::orange'
>>> regex.search(r'(?r)::.*?::apple', ip)[0]
'::guava::apple'

# this won't be possible with just atomic grouping
>>> ip = 'and this book is good and those are okay and that movie is bad'
>>> regex.search(r'(?r)th.*?\bis bad', ip)[0]
'that movie is bad'

Catastrophic Backtracking🔗

Backtracking can become significantly time consuming for certain corner cases. Which is why some regex engines do not use them, at the cost of not supporting some features like lookarounds. If your application accepts user defined regex, you might need to protect against such catastrophic patterns. From wikipedia: ReDoS:

A regular expression denial of service (ReDoS) is an algorithmic complexity attack that produces a denial-of-service by providing a regular expression and/or an input that takes a long time to evaluate. The attack exploits the fact that many regular expression implementations have super-linear worst-case complexity; on certain regex-input pairs, the time taken can grow polynomially or exponentially in relation to the input size. An attacker can thus cause a program to spend substantial time by providing a specially crafted regular expression and/or input. The program will then slow down or becoming unresponsive.

Here's an example:

>>> from timeit import timeit

>>> greedy = re.compile(r'(a+|\w+)*:')
>>> possessive = re.compile(r'(a+|\w+)*+:')

# string that'll match the above patterns
>>> s1 = 'aaaaaaaaaaaaaaaa:123'
# string that does NOT match the above patterns
>>> s2 = 'aaaaaaaaaaaaaaaa-123'

# no issues when input string has a match
>>> timeit('greedy.search(s1)', number=10000, globals=globals())
0.016464739997900324
>>> timeit('possessive.search(s1)', number=10000, globals=globals())
0.016358205997676123

# if input doesn't match, greedy version suffers from catastrophic backtracking
# note that 'number' parameter is reduced to 10 since it takes a long time
>>> timeit('greedy.search(s2)', number=10, globals=globals())
53.71723825200024
>>> timeit('possessive.search(s2)', number=10, globals=globals())
0.00019008600065717474

(a+|\w+)*: is a silly regex pattern, since it can be rewritten as \w*: which will not suffer from catastrophic backtracking. But this example shows how quantifiers applied to a group with multiple alternatives using quantifiers can lead to explosive results. More such patterns and mitigation strategies can be found in the following links:

Contents

Backtracking🔗

Possessive quantifiers🔗

Atomic grouping🔗

Catastrophic Backtracking🔗