Lookarounds

You've already seen how to create custom character classes and various avatars of special groupings. In this chapter you'll learn more groupings, known as lookarounds, that help to create custom anchors and add conditions within RE definition. These assertions are also known as zero-width patterns because they add restrictions similar to anchors and are not part of the matched portions. Also, you will learn how to negate a grouping similar to negated character sets.

Conditional expressions

Before you get used to lookarounds too much, it is good to remember that Python is a programming language. You have control structures and you can combine multiple conditions using logical operators, functions like all/any, etc. Also, do not forget that re is only one of the tools available for string processing.

>>> items = ['1,2,3,4', 'a,b,c,d', '#foo 123']

# filter elements containing digit and '#' characters
>>> [s for s in items if re.search(r'\d', s) and '#' in s]
['#foo 123']

# modify elements only if it doesn't start with '#'
>>> for s in items:
...     if s[0] != '#':
...         print(re.sub(r',.+,', ' ', s))
... 
1 4
a d

Negative lookarounds

Lookaround assertions can be added in two ways — lookbehind and lookahead. Each of these can be a positive or a negative assertion. Syntax wise, lookbehind has an extra < compared to the lookahead version. Negative lookarounds can be identified by the use of ! whereas = is used for positive lookarounds. This section is about negative lookarounds, whose complete syntax is shown below.

  • (?!pat) for negative lookahead assertion
  • (?<!pat) for negative lookbehind assertion

As mentioned earlier, lookarounds are not part of matched portions and do not affect capture group numbering.

# change 'foo' only if it is not followed by a digit character
# note that end of string satisfies the given assertion
# 'foofoo' has two matches as the assertion doesn't consume characters
>>> re.sub(r'foo(?!\d)', 'baz', 'hey food! foo42 foot5 foofoo')
'hey bazd! foo42 bazt5 bazbaz'

# change 'foo' only if it is not preceded by _
# note how 'foo' at start of string is matched as well
>>> re.sub(r'(?<!_)foo', 'baz', 'foo _foo 42foofoo')
'baz _foo 42bazbaz'

# overlap example
# the final _ was replaced as well as played a part in the assertion
>>> re.sub(r'(?<!_)foo.', 'baz', 'food _fool 42foo_foot')
'baz _fool 42bazfoot'

Lookarounds can be mixed with already existing anchors and other features to define truly powerful restrictions:

# change whole word only if it is not preceded by : or -
>>> re.sub(r'(?<![:-])\b\w+\b', 'X', ':cart <apple -rest ;tea')
':cart <X -rest ;X'

# add space to word boundaries, but not at start or end of string
# similar to: re.sub(r'\b', ' ', 'foo_baz=num1+35*42/num2').strip()
>>> re.sub(r'(?<!\A)\b(?!\Z)', ' ', 'foo_baz=num1+35*42/num2')
'foo_baz = num1 + 35 * 42 / num2'

In all the examples so far, lookahead grouping was placed as a suffix and lookbehind as a prefix. This is how they are used most of the time, but not the only way to use them. Lookarounds can be placed anywhere and multiple lookarounds can be combined in any order. They do not consume characters nor do they play a role in matched portions. They just let you know whether the condition you want to test is satisfied from the current location in the input string.

# these two are equivalent
# replace a character as long as it is not preceded by 'p' or 'r'
>>> re.sub(r'(?<![pr]).', '*', 'spare')
'**a*e'
>>> re.sub(r'.(?<![pr].)', '*', 'spare')
'**a*e'

# replace 'par' as long as 's' is not present later in the input
# this assumes that the lookaround doesn't conflict with search pattern
# i.e. 's' will not conflict 'par' but would affect if it was 'r' and 'par'
>>> re.sub(r'par(?!.*s)', 'X', 'par spare part party')
'par sXe Xt Xty'
>>> re.sub(r'(?!.*s)par', 'X', 'par spare part party')
'par sXe Xt Xty'

# since the three assertions used here are all zero-width,
# all of the 6 possible combinations will be equivalent
>>> re.sub(r'(?!\Z)\b(?<!\A)', ' ', 'foo_baz=num1+35*42/num2')
'foo_baz = num1 + 35 * 42 / num2'

Positive lookarounds

Unlike negative lookarounds, absence of something will not satisfy positive lookarounds. Instead, for the condition to satisfy, the pattern has to match actual characters and/or zero-width assertions. Positive lookaround can be identified by use of = in the grouping. The complete syntax looks like:

  • (?=pat) for positive lookahead assertion
  • (?<=pat) for positive lookbehind assertion
# extract digits only if it is followed by ,
# note that end of string doesn't qualify as this is positive assertion
>>> re.findall(r'\d+(?=,)', '42 foo-5, baz3; x-83, y-20: f12')
['5', '83']

# extract digits only if it is preceded by - and followed by ; or :
>>> re.findall(r'(?<=-)\d+(?=[:;])', '42 foo-5, baz3; x-83, y-20: f12')
['20']

# replace 'par' as long as 'part' occurs as a whole word later in the line
>>> re.sub(r'par(?=.*\bpart\b)', 'X', 'par spare part party')
'X sXe part party'

Lookarounds can be quite handy in field based processing.

# except first and last fields
>>> re.findall(r'(?<=,)[^,]+(?=,)', '1,two,3,four,5')
['two', '3', 'four']

# replace empty fields with NA
# note that in this case, order of lookbehind and lookahead doesn't matter
>>> re.sub(r'(?<![^,])(?![^,])', 'NA', ',1,,,two,3,,,')
'NA,1,NA,NA,two,3,NA,NA,NA'

Capture groups inside positive lookarounds

Even though lookarounds are not part of matched portions, capture groups can be used inside positive lookarounds. Can you reason out why it won't work for negative lookarounds?

>>> print(re.sub(r'(\S+\s+)(?=(\S+)\s)', r'\1\2\n', 'a b c d e'))
a b
b c
c d
d e

# and of course, use non-capturing group where needed
>>> re.findall(r'(?<=(po|ca)re)\d+', 'pore42 car3 pare7 care5')
['po', 'ca']
>>> re.findall(r'(?<=(?:po|ca)re)\d+', 'pore42 car3 pare7 care5')
['42', '5']

Conditional AND with lookarounds

As promised earlier, here's how lookarounds make it simpler to construct AND conditionals.

>>> words = ['sequoia', 'subtle', 'questionable', 'exhibit', 'equation']

# words containing 'b' and 'e' and 't' in any order
# same as: r'b.*e.*t|b.*t.*e|e.*b.*t|e.*t.*b|t.*b.*e|t.*e.*b'
>>> [w for w in words if re.search(r'(?=.*b)(?=.*e).*t', w)]
['subtle', 'questionable', 'exhibit']

# words containing all lowercase vowels in any order
>>> [w for w in words if re.search(r'(?=.*a)(?=.*e)(?=.*i)(?=.*o).*u', w)]
['sequoia', 'questionable', 'equation']

# words containing 'a' and 'q' but not 'n' at the end of the element
>>> [w for w in words if re.search(r'(?=.*a)(?=.*q)(?!.*n\Z)', w)]
['sequoia', 'questionable']

Variable length lookbehind

With lookbehind assertion (both positive and negative), the pattern used for the assertion cannot imply matching variable length of text. Fixed length quantifier is allowed. Different length alternations are not allowed, even if the individual alternations are of fixed length. Here's some examples to clarify these points.

>>> s = 'pore42 tar3 dare7 care5'

# allowed
>>> re.findall(r'(?<=(?:po|da)re)\d+', s)
['42', '7']
>>> re.findall(r'(?<=\b[a-z]{4})\d+', s)
['42', '7', '5']

# not allowed
>>> re.findall(r'(?<!tar|dare)\d+', s)
re.error: look-behind requires fixed-width pattern
>>> re.findall(r'(?<=\b[pd][a-z]*)\d+', s)
re.error: look-behind requires fixed-width pattern
>>> re.sub(r'(?<=\A|,)(?=,|\Z)', 'NA', ',1,,,two,3,,,')
re.error: look-behind requires fixed-width pattern

There are various workarounds possible depending upon the use case. See Variable length lookbehind section for regex module for the easiest and most robust solution.

First up, when the lookbehind alternatives are of fixed length. In such cases, you can manually expand the alternatives into individual lookbehinds. For negatives, place them next to each other. For positives, separate them with alternation inside another grouping.

>>> s = 'pore42 tar3 dare7 care5'

# workaround for r'(?<!tar|dare)\d+'
>>> re.findall(r'(?<!tar)(?<!dare)\d+', s)
['42', '5']

# workaround for r'(?<=tar|dare)\d+'
>>> re.findall(r'(?:(?<=tar)|(?<=dare))\d+', s)
['3', '7']

# workaround for r'(?<=\A|,)(?=,|\Z)'
>>> re.sub(r'((?<=\A)|(?<=,))(?=,|\Z)', 'NA', ',1,,,two,3,,,')
'NA,1,NA,NA,two,3,NA,NA,NA'

For some of the positive lookbehind cases, you can skip lookbehind entirely and workaround with normal groupings. This works even when you don't know the length of patterns.

>>> s = 'pore42 tar3 dare7 care5'

# same as: re.findall(r'(?:(?<=tar)|(?<=dare))\d+', s)
>>> re.findall(r'(?:tar|dare)(\d+)', s)
['3', '7']

# delete digits only if they are preceded by 'tar' or 'dare'
>>> re.sub(r'(tar|dare)\d+', r'\1', s)
'pore42 tar dare care5'

# workaround for r'(?<=\b[pd][a-z]*)\d+'
# get digits only if they are preceded by a word starting with 'p' or 'd'
>>> re.findall(r'\b[pd][a-z]*(\d+)', s)
['42', '7']

# delete digits only if they are preceded by a word starting with 'p' or 'd'
>>> re.sub(r'(\b[pd][a-z]*)\d+', r'\1', s)
'pore tar3 dare care5'

However, if you don't know the lengths for negative lookbehind, you cannot use the above workarounds. The next section will show how to negate a grouping, and that helps for some of the variable negative lookbehind cases.

Negated groups

Variable length negative lookbehind can be simulated using negative lookahead (which doesn't have restriction on variable length) inside a grouping and applying quantifier to match characters one by one. This also showcases how grouping can be negated in certain cases. Note that this will only work if you have well defined conditions before the negated group.

# note the use of \A anchor to force matching all characters up to 'dog'
>>> bool(re.search(r'\A((?!cat).)*dog', 'fox,cat,dog,parrot'))
False
>>> bool(re.search(r'\A((?!parrot).)*dog', 'fox,cat,dog,parrot'))
True
# without the anchor, you'll get false matches
>>> bool(re.search(r'((?!cat).)*dog', 'fox,cat,dog,parrot'))
True

# easier to understand by checking matched portion
>>> re.search(r'\A((?!cat).)*', 'fox,cat,dog,parrot')[0]
'fox,'
>>> re.search(r'\A((?!parrot).)*', 'fox,cat,dog,parrot')[0]
'fox,cat,dog,'
>>> re.search(r'\A((?!(.)\2).)*', 'fox,cat,dog,parrot')[0]
'fox,cat,dog,pa'

Here's some examples of matching two patterns with a negated group between them.

# match if 'do' is not there between 'at' and 'par'
>>> bool(re.search(r'at((?!do).)*par', 'fox,cat,dog,parrot'))
False

# match if 'go' is not there between 'at' and 'par'
>>> bool(re.search(r'at((?!go).)*par', 'fox,cat,dog,parrot'))
True
>>> re.search(r'at((?!go).)*par', 'fox,cat,dog,parrot')[0]
'at,dog,par'

# use non-capturing group if required
>>> re.findall(r'a(?:(?!\d).)*z', 'at,baz,a2z,bad-zoo')
['at,baz', 'ad-z']

Cheatsheet and Summary

NoteDescription
lookaroundscustom assertions, zero-width like anchors
(?!pat)negative lookahead assertion
(?<!pat)negative lookbehind assertion
(?=pat)positive lookahead assertion
(?<=pat)positive lookbehind assertion
(?!pat1)(?=pat2)multiple assertions can be specified next to each other in any order
as they mark a matching location without consuming characters
((?!pat).)*Negate a grouping, similar to negated character class

In this chapter, you learnt how to use lookarounds to create custom restrictions and also how to use negated grouping. With this, most of the powerful features of regular expressions have been covered. The special groupings seem never ending though, there's some more of them in coming chapters!!

Exercises

info Please use lookarounds for solving the following exercises even if you can do it without lookarounds. Unless you cannot use lookarounds for cases like variable length lookbehinds.

a) Replace all whole words with X unless it is preceded by ( character.

>>> ip = '(apple) guava berry) apple (mango) (grape'

##### add your solution here
'(apple) X X) X (mango) (grape'

b) Replace all whole words with X unless it is followed by ) character.

>>> ip = '(apple) guava berry) apple (mango) (grape'

##### add your solution here
'(apple) X berry) X (mango) (X'

c) Replace all whole words with X unless it is preceded by ( or followed by ) characters.

>>> ip = '(apple) guava berry) apple (mango) (grape'

##### add your solution here
'(apple) X berry) X (mango) (grape'

d) Extract all whole words that do not end with e or n.

>>> ip = 'at row on urn e note dust n'

##### add your solution here
['at', 'row', 'dust']

e) Extract all whole words that do not start with a or d or n.

>>> ip = 'at row on urn e note dust n'

##### add your solution here
['row', 'on', 'urn', 'e']

f) Extract all whole words only if they are followed by : or , or -.

>>> ip = 'poke,on=-=so:ink.to/is(vast)ever-sit'

##### add your solution here
['poke', 'so', 'ever']

g) Extract all whole words only if they are preceded by = or / or -.

>>> ip = 'poke,on=-=so:ink.to/is(vast)ever-sit'

##### add your solution here
['so', 'is', 'sit']

h) Extract all whole words only if they are preceded by = or : and followed by : or ..

>>> ip = 'poke,on=-=so:ink.to/is(vast)ever-sit'

##### add your solution here
['so', 'ink']

i) Extract all whole words only if they are preceded by = or : or . or ( or - and not followed by . or /.

>>> ip = 'poke,on=-=so:ink.to/is(vast)ever-sit'

##### add your solution here
['so', 'vast', 'sit']

j) Remove leading and trailing whitespaces from all the individual fields where , is the field separator.

>>> csv1 = ' comma  ,separated ,values \t\r '
>>> csv2 = 'good bad,nice  ice  , 42 , ,   stall   small'

>>> remove_whitespace = re.compile()     ##### add your solution here

>>> remove_whitespace.sub('', csv1)
'comma,separated,values'
>>> remove_whitespace.sub('', csv2)
'good bad,nice  ice,42,,stall   small'

k) Filter all elements that satisfy all of these rules:

  • should have at least two alphabets
  • should have at least 3 digits
  • should have at least one special character among % or * or # or $
  • should not end with a whitespace character
>>> pwds = ['hunter2', 'F2H3u%9', '*X3Yz3.14\t', 'r2_d2_42', 'A $B C1234']

##### add your solution here
['F2H3u%9', 'A $B C1234']

l) For the given string, surround all whole words with {} except for whole words par and cat and apple.

>>> ip = 'part; cat {super} rest_42 par scatter apple spar'

##### add your solution here
'{part}; cat {{super}} {rest_42} par {scatter} apple {spar}'

m) Extract integer portion of floating-point numbers for the given string. A number ending with . and no further digits should not be considered.

>>> ip = '12 ab32.4 go 5 2. 46.42 5'

##### add your solution here
['32', '46']

n) For the given input strings, extract all overlapping two character sequences.

>>> s1 = 'apple'
>>> s2 = '1.2-3:4'

>>> pat = re.compile()       ##### add your solution here

##### add your solution here for s1
['ap', 'pp', 'pl', 'le']
##### add your solution here for s2
['1.', '.2', '2-', '-3', '3:', ':4']

o) The given input strings contain fields separated by : character. Delete : and the last field if there is a digit character anywhere before the last field.

>>> s1 = '42:cat'
>>> s2 = 'twelve:a2b'
>>> s3 = 'we:be:he:0:a:b:bother'

>>> pat = re.compile()      ##### add your solution here

>>> pat.sub()       ##### add your solution here for s1
'42'
>>> pat.sub()       ##### add your solution here for s2
'twelve:a2b'
>>> pat.sub()       ##### add your solution here for s3
'we:be:he:0:a:b'

p) Extract all whole words unless they are preceded by : or <=> or ---- or #.

>>> ip = '::very--at<=>row|in.a_b#b2c=>lion----east'

##### add your solution here
['at', 'in', 'a_b', 'lion']

q) Match strings if it contains qty followed by price but not if there is whitespace or the string error between them.

>>> str1 = '23,qty,price,42'
>>> str2 = 'qty price,oh'
>>> str3 = '3.14,qty,6,errors,9,price,3'
>>> str4 = '42\nqty-6,apple-56,price-234,error'
>>> str5 = '4,price,3.14,qty,4'

>>> neg = re.compile()       ##### add your solution here

>>> bool(neg.search(str1))
True
>>> bool(neg.search(str2))
False
>>> bool(neg.search(str3))
False
>>> bool(neg.search(str4))
True
>>> bool(neg.search(str5))
False

r) Can you reason out why the output shown is different for these two regular expressions?

>>> ip = 'I have 12, he has 2!'

>>> re.sub(r'\b..\b', '{\g<0>}', ip)
'{I }have {12}{, }{he} has{ 2}!'

>>> re.sub(r'(?<!\w)..(?!\w)', '{\g<0>}', ip)
'I have {12}, {he} has {2!}'