Python tip 22: possessive quantifiers

Before Python 3.11, you had to use alternatives like the third-party regex module for possessive quantifiers and atomic grouping. The re module supports these features from Python 3.11 onwards.

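Greedy quantifiers match as much as possible, but they backtrack (give up characters) if that helps the overall pattern succeed. Possessive quantifiers, written by appending + to a greedy quantifier (for example, *+ or ++), also match as much as possible but never backtrack.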

Suppose you want to match integer numbers greater than or equal to 100 where these numbers can optionally have leading zeros.

>>> import re
>>> numbers = '42 314 001 12 00984'

# this solution fails because 0* and \d{3,} can both match leading zeros
# and greedy quantifiers will give up characters to help the overall regex succeed
>>> re.findall(r'0*\d{3,}', numbers)
['314', '001', '00984']

# here 0*+ will not give back leading zeros after they are consumed
>>> re.findall(r'0*+\d{3,}', numbers)
['314', '00984']

# workaround if possessive quantifiers are not supported
>>> re.findall(r'0*[1-9]\d{2,}', numbers)
['314', '00984']

Here's another example. The goal is to match lines whose first non-whitespace character is not a # character. A matching line should have at least one non-# character, so empty lines and those with only whitespace characters should not match.

>>> lines = ['#cmt', 'c = "#"', '\t #comment', 'abc', '', ' \t ']

# this solution fails because \s* can backtrack
# and [^#] can match a whitespace character as well
>>> [e for e in lines if re.match(r'\s*[^#]', e)]
['c = "#"', '\t #comment', 'abc', ' \t ']

# this works because \s*+ will not give back any whitespace characters
>>> [e for e in lines if re.match(r'\s*+[^#]', e)]
['c = "#"', 'abc']

# workaround if possessive quantifiers are not supported
>>> [e for e in lines if re.match(r'\s*[^#\s]', e)]
['c = "#"', 'abc']
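If the same code has to run on both older and newer Python versions, one option is to pick the pattern at import time. This is only a sketch: the names NON_COMMENT and non_comment_lines are hypothetical, chosen for illustration, and the version check relies on the re module gaining possessive quantifiers in Python 3.11.

```python
import re
import sys

# hypothetical names, for illustration only
if sys.version_info >= (3, 11):
    # possessive quantifier: whitespace is never given back
    NON_COMMENT = re.compile(r'\s*+[^#]')
else:
    # workaround: exclude whitespace from the negated character class
    NON_COMMENT = re.compile(r'\s*[^#\s]')

def non_comment_lines(lines):
    """Keep lines whose first non-whitespace character is not '#'."""
    return [line for line in lines if NON_COMMENT.match(line)]

lines = ['#cmt', 'c = "#"', '\t #comment', 'abc', '', ' \t ']
print(non_comment_lines(lines))  # ['c = "#"', 'abc']
```

Both branches give the same result for this input, so the workaround pattern alone would also do. The check is mainly useful when you prefer the possessive form for readability or performance.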

See my blog post on possessive quantifiers and atomic grouping for more examples, details about catastrophic backtracking, and so on.

See also my 100 Page Python Intro and Understanding Python re(gex)? ebooks.