Python tip 12: negate a regex grouping
You might be familiar with negating a character class, for example:
>>> import re
# remove first two columns
>>> re.sub(r'\A([^:]+:){2}', '', 'apple:42:banana:1000:cherry:512')
'banana:1000:cherry:512'
# filter all elements not ending with `r` or `t`
>>> words = ['surrender', 'unicorn', 'newer', 'door', 'empty', 'eel', 'pest']
>>> [w for w in words if re.search(r'[^rt]\Z', w)]
['unicorn', 'empty', 'eel']
But do you know how to match characters based on a negated group? You can use a combination of negative lookahead and quantifiers as shown in the examples below:
>>> pets = 'fox,cat,dog,parrot'
# match if 'do' is not present between 'at' and 'par'
>>> bool(re.search(r'at((?!do).)*par', pets))
False
# match if 'go' is not present between 'at' and 'par'
>>> bool(re.search(r'at((?!go).)*par', pets))
True
# easier to understand by looking at the matched portions
>>> re.search(r'at((?!go).)*par', pets)[0]
'at,dog,par'
>>> re.search(r'\A((?!par).)*', pets)[0]
'fox,cat,dog,'
The `.` in `((?!go).)*` matches a character only if the two-character sequence starting at that character is not `go`. Similarly, the `.` in `((?!par).)*` matches a character only if that character and the next two do not spell `par`. The `*` quantifier is applied to the outer group to match zero or more characters satisfying the given condition.
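To make the character-by-character check concrete, here is a rough sketch that mimics `\A((?!par).)*` with a plain loop. The helper name `prefix_before` is made up for this illustration and isn't part of the `re` module. It also assumes the input has no newline characters, since `.` doesn't match a newline by default.

```python
import re

def prefix_before(text, forbidden):
    # Hypothetical helper: advance one character at a time and stop
    # as soon as `forbidden` starts at the current position.
    end = 0
    while end < len(text) and not text.startswith(forbidden, end):
        end += 1
    return text[:end]

pets = 'fox,cat,dog,parrot'
regex_result = re.search(r'\A((?!par).)*', pets)[0]
manual_result = prefix_before(pets, 'par')

print(regex_result)                   # fox,cat,dog,
print(manual_result == regex_result)  # True
```

The `startswith` test plays the role of the negative lookahead, and `end += 1` plays the role of the `.` that consumes a single character.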
The outer groups in the above examples are capturing groups, though capturing isn't required here; it just keeps the pattern concise. However, capturing groups affect the behavior of functions like `re.split` and `re.findall`. You can use non-capturing groups in such cases:
# capture group affects the behavior of 're.findall'
>>> re.findall(r'\b((?!42)\w)+\b', 'a422b good bad42 nice100')
['d', '0']
# so, use a non-capturing group here
>>> re.findall(r'\b(?:(?!42)\w)+\b', 'a422b good bad42 nice100')
['good', 'nice100']
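`re.split` is affected in a similar way, since it includes the text of capturing groups in its output. Here's a small sketch. The sample string and the `<<` and `>>` delimiters are invented for this illustration:

```python
import re

text = 'a<<x>y>>b<<z>>c'

# capturing group: the last character matched by the group
# is included in the output for each delimiter
with_capture = re.split(r'<<((?!>>).)*>>', text)
print(with_capture)     # ['a', 'y', 'b', 'z', 'c']

# non-capturing group: only the pieces between delimiters are returned
without_capture = re.split(r'<<(?:(?!>>).)*>>', text)
print(without_capture)  # ['a', 'b', 'c']
```

Note that a single `>` inside the delimiters is allowed, since the lookahead only rejects the two-character sequence `>>`.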
Test your understanding by solving this exercise. Construct a regex solution that works for all three sample transformations shown below:
`Power(x,2)` should be replaced with `(x)*(x)`
`Power(Power(x,2) + x,2)` should be changed to `((x)*(x) + x)*((x)*(x) + x)`
`Power(x + Power(x,2),2)` should be changed to `(x + (x)*(x))*(x + (x)*(x))`
If that was easy, make it work for general powers instead of just `2`:
`Power(Power(x,2),3)` translates to `((x)*(x))*((x)*(x))*((x)*(x))`
The above exercise is based on this Stack Overflow Q&A.
See also my Understanding Python re(gex)? ebook.