Working with matched portions

Having seen a few features that can match varying text, you'll learn how to extract and work with those matching portions in this chapter. First, you'll learn about the re.Match object in detail. Then you'll learn about the re.findall and re.finditer functions, which give all the matches instead of just the first one. You'll also learn a few tricks, like using functions in the replacement section of re.sub and using the re.subn function.

re.Match object

The re.search and re.fullmatch functions return a re.Match object, from which various details can be extracted, such as the matched portion of the string, the location of the matched portion, etc. Note that you get these details only for the first match. You'll see how to work with multiple matches later in this chapter. Here are some examples with re.Match output.

>>> re.search(r'ab*c', 'abc ac adc abbbc')
<re.Match object; span=(0, 3), match='abc'>

>>> re.fullmatch(r'1(2|3)*4', '1233224')
<re.Match object; span=(0, 7), match='1233224'>

The details in the output above are for quick reference only. There are methods and attributes that you can apply on the re.Match object to get exactly the information you need. Use the span method to get the starting index and the ending index of the matching portion. The ending index is one past the last matched character, similar to slice notation.

>>> sentence = 'that is quite a fabricated tale'
>>> m = re.search(r'q.*?t', sentence)
>>> m.span()
(8, 12)
>>> m.span()[0]
8

# you can also directly use the method without using intermediate variable
>>> re.search(r'q.*?t', sentence).span()
(8, 12)
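
As a small sketch of what the span values mean, the two indexes can be used to slice the original string and recover the matched text. The variable names start, end, matched, before and after are only illustrative.

```python
import re

sentence = 'that is quite a fabricated tale'
m = re.search(r'q.*?t', sentence)

# span() gives (start, end), where end is one past the last matched character
start, end = m.span()

# slicing with these indexes gives back the matched portion
matched = sentence[start:end]
print(matched)                      # quit

# text before and after the match
before, after = sentence[:start], sentence[end:]
print(repr(before), repr(after))    # 'that is ' 'e a fabricated tale'
```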

The () grouping is also known as a capture group. It has multiple uses, one of which is the ability to work with the portions matched by those groups. When capture groups are used with re.search or re.fullmatch, the text they matched can be retrieved using an index or the group method on the re.Match object. Index 0 is always the entire matched portion and the remaining indexes are for the capture groups, if present. The leftmost ( in the pattern gets group number 1, the second leftmost ( gets group number 2 and so on. Use the groups method to get a tuple of only the capture group portions.

>>> re.search(r'b.*d', 'abc ac adc abbbc')
<re.Match object; span=(1, 9), match='bc ac ad'>
# retrieving entire matched portion using index
>>> re.search(r'b.*d', 'abc ac adc abbbc')[0]
'bc ac ad'
# retrieving entire matched portion using 'group' method
# you can also skip passing '0' as that is the default value
>>> re.search(r'b.*d', 'abc ac adc abbbc').group(0)
'bc ac ad'

# capture group example
>>> m = re.fullmatch(r'a(.*?) (.*)d(.*)c', 'abc ac adc abbbc')
# to get matched portion of second capture group, can also use m.group(2)
>>> m[2]
'ac a'
# to get matched portion of third and first capture groups
>>> m.group(3, 1)
('c abbb', 'bc')
# to get a tuple of all the capture groups
# note that this will not have the entire matched portion
>>> m.groups()
('bc', 'ac a', 'c abbb')
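
Since groups returns a tuple, it can be unpacked directly into variables. Here's a minimal sketch, assuming a made-up input line with three fields separated by colons (the names line, name, qty and color are hypothetical).

```python
import re

# hypothetical input, used only for illustration
line = 'apple:42:red'
m = re.fullmatch(r'(.*?):(.*?):(.*)', line)

# re.fullmatch returns None when the pattern doesn't match,
# so check the result before using it
if m is not None:
    # one variable per capture group, in left to right order
    name, qty, color = m.groups()
    print(name, int(qty), color)    # apple 42 red
```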

To get the matching locations for the capture groups, pass the group number to the span method. You can also use the start and end methods to get either of those locations. Passing 0 is optional when you need the information for the entire matched portion.

>>> m = re.search(r'w(.*)me', 'awesome')

>>> m.span()
(1, 7)
>>> m.span(1)
(2, 5)

>>> m.start()
1
>>> m.end(1)
5

info There are many more methods and attributes available, a few of which are shown below. See docs.python: Match Objects for details.

>>> pat = re.compile(r'hi.*bye')
>>> m = pat.search('This is goodbye then', 1, 15)
>>> m.pos
1
>>> m.endpos
15
>>> m.re
re.compile('hi.*bye')
>>> m.string
'This is goodbye then'
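
One detail worth noting is that searching with pos and endpos is not the same as searching a sliced string when it comes to reported locations. The sketch below (m1 and m2 are illustrative names) compares the two approaches for the same input.

```python
import re

pat = re.compile(r'hi.*bye')
s = 'This is goodbye then'

# with pos and endpos, the indexes refer to the original string
m1 = pat.search(s, 1, 15)
print(m1.span())    # (1, 15)

# with a slice, the indexes refer to the sliced string
m2 = pat.search(s[1:15])
print(m2.span())    # (0, 14)

# the matched text is the same in both cases
print(m1[0] == m2[0])    # True
```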

The groupdict method will be covered in the Named capture groups section and the expand method will be covered in the Match.expand section.

Assignment expressions

Since Python 3.8 introduced assignment expressions, it has become easier to work with matched portions in conditional statements.

# print capture group content only if the pattern matches
>>> if m := re.search(r'(.*)s', 'oh!'):
...     print(m[1])
... 
>>> if m := re.search(r'(.*)s', 'awesome'):
...     print(m[1])
... 
awe

This comes up often when you are processing a text file and the instructions depend on which pattern matches.

>>> text = ['type: fruit', 'date: 2020/04/28']
>>> for ip in text:
...     if m := re.search(r'type: (.*)', ip):
...         print(m[1])
...     elif m := re.search(r'date: (.*?)/(.*?)/', ip):
...         print(f'month: {m[2]}, year: {m[1]}')
... 
fruit
month: 04, year: 2020
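
For comparison, here's a sketch of the same logic written without assignment expressions. The match has to be assigned on a separate line before each test, which leads to extra nesting. The results list is only there to collect the output for illustration.

```python
import re

text = ['type: fruit', 'date: 2020/04/28']
results = []

for ip in text:
    # assign first, then test
    m = re.search(r'type: (.*)', ip)
    if m:
        results.append(m[1])
    else:
        # the second pattern needs its own assignment and nested check
        m = re.search(r'date: (.*?)/(.*?)/', ip)
        if m:
            results.append(f'month: {m[2]}, year: {m[1]}')

print(results)    # ['fruit', 'month: 04, year: 2020']
```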

info Did you know that PEP 572 uses re module as one of the examples for assignment expressions?

Using functions in replacement section

Functions can be used in the replacement section of re.sub instead of a string. A re.Match object will be passed to the function as its argument. In the Backreference section, you'll also learn how to directly reference the matching portions in the replacement section.

# m[0] will contain entire matched portion
# a^2 and b^2 for the two matches in this example
>>> re.sub(r'(a|b)\^2', lambda m: m[0].upper(), 'a^2 + b^2 - C*3')
'A^2 + B^2 - C*3'

>>> re.sub(r'2|3', lambda m: str(int(m[0])**2), 'a^2 + b^2 - C*3')
'a^4 + b^4 - C*9'

Note that the output of the function has to be a string, otherwise you'll get an error. You'll see more examples with lambda and user defined functions in coming sections (for example, see Numeric ranges section).
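
A named function works the same way as a lambda in the replacement section. Here's a minimal sketch, where square_number is a hypothetical function name chosen for this illustration.

```python
import re

def square_number(m):
    # m is the re.Match object for the current match
    # convert the result back to a string before returning it
    return str(int(m[0]) ** 2)

# pass the function itself, without calling it
result = re.sub(r'2|3', square_number, 'a^2 + b^2 - C*3')
print(result)    # a^4 + b^4 - C*9
```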

Using dict in replacement section

Using a function in replacement section, you can specify a dict variable to determine the replacement string based on the matched text.

# one to one mappings
>>> d = { '1': 'one', '2': 'two', '4': 'four' }
>>> re.sub(r'1|2|4', lambda m: d[m[0]], '9234012')
'9two3four0onetwo'

# if the matched text doesn't exist as a key, default value will be used
# you'll later learn a much easier way to specify all digits
>>> re.sub(r'0|1|2|3|4|5|6|7|8|9', lambda m: d.get(m[0], 'X'), '9234012')
'XtwoXfourXonetwo'

For swapping two or more portions without using intermediate result, using a dict object is recommended.

>>> swap = { 'cat': 'tiger', 'tiger': 'cat' }
>>> words = 'cat tiger dog tiger cat'

>>> re.sub(r'cat|tiger', lambda m: swap[m[0]], words)
'tiger cat dog cat tiger'
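
To see why a single pass with a dict helps, here's a sketch comparing it with chained str.replace calls. In the chained version, the second replacement also changes the text produced by the first one. The names chained and swapped are only illustrative.

```python
import re

swap = {'cat': 'tiger', 'tiger': 'cat'}
words = 'cat tiger dog tiger cat'

# chained replacements: every 'cat' becomes 'tiger', and then
# every 'tiger' (including the new ones) becomes 'cat'
chained = words.replace('cat', 'tiger').replace('tiger', 'cat')
print(chained)    # cat cat dog cat cat

# single pass: each match is replaced once, based on the original text
swapped = re.sub(r'cat|tiger', lambda m: swap[m[0]], words)
print(swapped)    # tiger cat dog cat tiger
```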

For dict objects that have many entries and are likely to undergo changes during development, building the alternation list manually is not a good choice. Also, recall that alternatives are tried in left to right order, so when one alternative is a prefix of another, the longer string should come first.

# note that numbers have been converted to strings here
# otherwise, you'd need to convert it in the lambda code
>>> d = { 'hand': '1', 'handy': '2', 'handful': '3', 'a^b': '4' }

# sort the keys to handle precedence rules
>>> words = sorted(d.keys(), key=len, reverse=True)
# add anchors and flags if needed
>>> pat = re.compile('|'.join(re.escape(s) for s in words))
>>> pat.pattern
'handful|handy|hand|a\\^b'
>>> pat.sub(lambda m: d[m[0]], 'handful hand pin handy (a^b)')
'3 1 pin 2 (4)'
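
The steps above could be wrapped in a small helper. This is only a sketch: dict_sub is a hypothetical function name, not something provided by the re module. The second half shows what happens when the keys are not sorted by length.

```python
import re

def dict_sub(mapping, text):
    # hypothetical helper: replace every key of 'mapping' found in 'text'
    # longer keys come first so that 'handy' is tried before 'hand'
    keys = sorted(mapping, key=len, reverse=True)
    # re.escape makes sure metacharacters in the keys are matched literally
    pat = re.compile('|'.join(re.escape(k) for k in keys))
    return pat.sub(lambda m: mapping[m[0]], text)

d = {'hand': '1', 'handy': '2', 'handful': '3', 'a^b': '4'}
print(dict_sub(d, 'handful hand pin handy (a^b)'))    # 3 1 pin 2 (4)

# without sorting, 'hand' is tried first and wins over the longer keys
unsorted_pat = re.compile('|'.join(re.escape(k) for k in d))
naive = unsorted_pat.sub(lambda m: d[m[0]], 'handful hand pin handy (a^b)')
print(naive)    # 1ful 1 pin 1y (4)
```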

info If you have thousands of key-value pairs, using specialized libraries like github: flashtext is highly recommended instead of regular expressions.

re.findall

The re.findall function returns all the matched portions as a list of strings.

re.findall(pattern, string, flags=0)

The first argument is the RE pattern you want to test and extract against the input string, which is the second argument. flags is optional. Here are some examples.

>>> re.findall(r'ab*c', 'abc ac adc abbbc')
['abc', 'ac', 'abbbc']

>>> re.findall(r'ab+c', 'abc ac adc abbbc')
['abc', 'abbbc']

>>> s = 'PAR spar apparent SpArE part pare'
>>> re.findall(r'\bs?pare?\b', s, flags=re.I)
['PAR', 'spar', 'SpArE', 'pare']

It is useful for debugging purposes as well, for example to see the potential matches before applying substitution.

>>> re.findall(r't.*a', 'that is quite a fabricated tale')
['that is quite a fabricated ta']

>>> re.findall(r't.*?a', 'that is quite a fabricated tale')
['tha', 't is quite a', 'ted ta']

The presence of capture groups changes the output of re.findall, depending on the number of groups used.

  • If a single capture group is used, output will be a list of strings. Each element will have only the portion matched by the capture group
  • If more than one capture group is used, output will be a list of tuples. Each element will be a tuple containing portions matched by all the capturing groups

In both cases, text matched by any part of the pattern outside the capture groups will not be represented in the output. Also, you'll get an empty string if a particular capture group didn't match any character.

# without capture groups
>>> re.findall(r'ab*c', 'abc ac adc abbc xabbbcz bbb bc abbbbbc')
['abc', 'ac', 'abbc', 'abbbc', 'abbbbbc']
# with single capture group
>>> re.findall(r'a(b*)c', 'abc ac adc abbc xabbbcz bbb bc abbbbbc')
['b', '', 'bb', 'bbb', 'bbbbb']

# multiple capture groups
# note that last date didn't match because there's no comma at the end
# you'll later learn better ways to match such patterns
>>> re.findall(r'(.*?)/(.*?)/(.*?),', '2020/04/25,1986/Mar/02,77/12/31')
[('2020', '04', '25'), ('1986', 'Mar', '02')]
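
If you need the whole date as well as its parts, one possible approach is to wrap the date portion in an outer capture group. Group numbers follow the position of the opening (, so the outer group becomes the first element of each tuple. This is a sketch using the same input as above; the name result is illustrative.

```python
import re

dates = '2020/04/25,1986/Mar/02,77/12/31'

# the outer group is number 1, the three inner groups are 2, 3 and 4
# the comma is outside all the groups, so it is not part of the output
result = re.findall(r'((.*?)/(.*?)/(.*?)),', dates)
for item in result:
    print(item)
# ('2020/04/25', '2020', '04', '25')
# ('1986/Mar/02', '1986', 'Mar', '02')
```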

See Non-capturing groups section if you need to use groupings without affecting re.findall output.

re.finditer

Use re.finditer to get an iterator object where each element is a re.Match object for a matched portion.

re.finditer(pattern, string, flags=0)

# output of finditer is an iterator object
>>> re.finditer(r'ab+c', 'abc ac adc abbbc')
<callable_iterator object at 0x7fb65e103438>

# each element is a re.Match object corresponding to the matched portion
>>> m_iter = re.finditer(r'ab+c', 'abc ac adc abbbc')
>>> for m in m_iter:
...     print(m)
... 
<re.Match object; span=(0, 3), match='abc'>
<re.Match object; span=(11, 16), match='abbbc'>

Use the re.Match object's methods and attributes as needed. You can replicate re.findall functionality as well.

>>> m_iter = re.finditer(r'ab+c', 'abc ac adc abbbc')
>>> for m in m_iter:
...     print(m[0].upper(), m.span(), sep='\t')
... 
ABC     (0, 3)
ABBBC   (11, 16)

# same as: re.findall(r'(.*?)/(.*?)/(.*?),', d)
>>> d = '2020/04/25,1986/Mar/02,77/12/31'
>>> m_iter = re.finditer(r'(.*?)/(.*?)/(.*?),', d)
>>> [m.groups() for m in m_iter]
[('2020', '04', '25'), ('1986', 'Mar', '02')]

warning Since the output of re.finditer is an iterator object, you cannot iterate over it again without re-assigning. That's not the case with re.findall, which gives a list.

>>> d = '2020/04/25,1986/Mar/02,77/12/31'
>>> m_iter = re.finditer(r'(.*?),', d)

>>> [m[1] for m in m_iter]
['2020/04/25', '1986/Mar/02']
>>> [m[1] for m in m_iter]
[]
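
If the re.Match objects are needed more than once, one option is to convert the iterator to a list first. This is a minimal sketch (matches, dates and spans are illustrative names). Keep in mind that the list holds all the re.Match objects in memory at once.

```python
import re

d = '2020/04/25,1986/Mar/02,77/12/31'

# store the re.Match objects in a list so that they can be reused
matches = list(re.finditer(r'(.*?),', d))

dates = [m[1] for m in matches]
spans = [m.span(1) for m in matches]

print(dates)    # ['2020/04/25', '1986/Mar/02']
print(spans)    # [(0, 10), (11, 22)]
```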

re.split with capture groups

Capture groups affect the re.split function as well. If the pattern used to split contains capture groups, the portions matched by those groups will also be a part of the output list.

# without capture group
>>> re.split(r'1*4?2', '31111111111251111426')
['3', '5', '6']

# to include the matching portions of the pattern as well in the output
>>> re.split(r'(1*4?2)', '31111111111251111426')
['3', '11111111112', '5', '111142', '6']

If part of the pattern is outside a capture group, the text thus matched won't be in the output. If a capture group didn't participate, it will be represented by None in the output list.

# here 4?2 is outside capture group, so that portion won't be in output
>>> re.split(r'(1*)4?2', '31111111111251111426')
['3', '1111111111', '5', '1111', '6']

# multiple capture groups example
# note that the portion matched by b+ isn't present in the output
>>> re.split(r'(a+)b+(c+)', '3.14aabccc42')
['3.14', 'aa', 'ccc', '42']

# here (4)? matches zero times on the first occasion
>>> re.split(r'(1*)(4)?2', '31111111111251111426')
['3', '1111111111', None, '5', '1111', '4', '6']

Using a capture group along with maxsplit=1 gives behavior similar to the str.partition method.

# first element is portion before the first match
# second element is portion matched by the pattern itself
# third element is rest of the input string
>>> re.split(r'(a+b+c+)', '3.14aabccc42abc88', maxsplit=1)
['3.14', 'aabccc', '42abc88']
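
The behavior is similar but not identical to str.partition. The sketch below compares the two for a literal separator and for an input without a match (the variable names are only illustrative). str.partition always returns a tuple of three elements, whereas re.split returns a list with just the input string when there is no match.

```python
import re

s = '3.14aabccc42abc88'

# literal separator: both approaches give the same three pieces
p1 = s.partition('aabccc')
r1 = re.split(r'(aabccc)', s, maxsplit=1)
print(p1)    # ('3.14', 'aabccc', '42abc88')
print(r1)    # ['3.14', 'aabccc', '42abc88']

# no match: the outputs differ in length
p2 = 'xyz'.partition('abc')
r2 = re.split(r'(a+b+c+)', 'xyz', maxsplit=1)
print(p2)    # ('xyz', '', '')
print(r2)    # ['xyz']
```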

re.subn

The re.subn function has the same functionality as re.sub except that the output is a tuple. The first element of the tuple is the same output as the re.sub function. The second element gives the number of substitutions made. In other words, you also get the number of matches.

re.subn(pattern, repl, string, count=0, flags=0)

>>> greeting = 'Have a nice weekend'

>>> re.sub(r'e', 'E', greeting)
'HavE a nicE wEEkEnd'

# with re.subn, you can infer that 5 substitutions were made
>>> re.subn(r'e', 'E', greeting)
('HavE a nicE wEEkEnd', 5)
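
The count argument shown in the signature limits the number of substitutions, and the second element of the tuple reflects that limit. A small sketch (result is an illustrative name):

```python
import re

greeting = 'Have a nice weekend'

# only the first two matches are replaced
result = re.subn(r'e', 'E', greeting, count=2)
print(result)    # ('HavE a nicE weekend', 2)
```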

Here's an example that performs a conditional operation based on whether the substitution was successful.

>>> word = 'coffining'
# repeatedly delete 'fin' until there are no more matches
>>> while True:
...     word, cnt = re.subn(r'fin', '', word)
...     if cnt == 0:
...         break
... 
>>> word
'cog'
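
The same loop could be packaged as a function that also reports the total number of deletions. This is a sketch under the assumption that the pattern never matches an empty string, otherwise the count would never reach zero and the loop would not end. remove_all is a hypothetical name.

```python
import re

def remove_all(pattern, text):
    # hypothetical helper: keep deleting matches until none are left
    # assumes 'pattern' cannot match an empty string
    total = 0
    while True:
        text, cnt = re.subn(pattern, '', text)
        if cnt == 0:
            return text, total
        total += cnt

print(remove_all(r'fin', 'coffining'))    # ('cog', 2)
```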

Cheatsheet and Summary

Note                  Description
re.Match object       get details like matched portions, location, etc
m[0] or m.group(0)    entire matched portion of re.Match object m
m[1] or m.group(1)    matched portion of first capture group
m[2] or m.group(2)    matched portion of second capture group and so on
m.groups()            tuple of all the capture groups' matched portions
m.span()              start and end+1 index of entire matched portion
                      pass a number to get span of that particular capture group
                      can also use m.start() and m.end()
re.sub(r'pat', f, s)  function f will get re.Match object as argument
using dict            replacement string based on the matched text as dictionary key
                      ex: re.sub(r'pat', lambda m: d.get(m[0], default), s)
re.findall            returns all the matches as a list of strings
                      re.findall(pattern, string, flags=0)
                      if 1 capture group is used, only its matches are returned
                      if more than 1, each element will be tuple of capture groups
                      portion matched by pattern outside group won't be in output
                      empty matches will be represented by empty string
re.finditer           iterator with re.Match object for each match
                      re.finditer(pattern, string, flags=0)
re.split              capture groups affect re.split too
                      text matched by the groups will be part of the output
                      portion matched by pattern outside group won't be in output
                      group that didn't match will be represented by None
re.subn               gives tuple of modified string and number of substitutions
                      re.subn(pattern, repl, string, count=0, flags=0)

This chapter introduced different ways to work with the matching portions of an input string. The re.Match object helps you get the portion matched by the RE pattern and its capture groups, the location of the match, etc. Functions can be used in the replacement section, and they get a re.Match object as an argument. Using functions, you can do substitutions based on dict mappings. To get all the matches instead of just the first match, you can use re.findall (which gives a list of strings as output) and re.finditer (which gives an iterator of re.Match objects). You also learnt how capture groups affect the output of the re.findall and re.split functions. You'll see many more uses of groupings in the coming chapters. The re.subn function is like re.sub, but it also gives the number of substitutions made.

Exercises

a) For the given strings, extract the matching portion from first is to last t.

>>> str1 = 'This the biggest fruit you have seen?'
>>> str2 = 'Your mission is to read and practice consistently'

>>> pat = re.compile()     ##### add your solution here

##### add your solution here for str1
'is the biggest fruit'
##### add your solution here for str2
'ission is to read and practice consistent'

b) Find the starting index of first occurrence of is or the or was or to for the given input strings.

>>> s1 = 'match after the last newline character'
>>> s2 = 'and then you want to test'
>>> s3 = 'this is good bye then'
>>> s4 = 'who was there to see?'

>>> pat = re.compile()      ##### add your solution here

##### add your solution here for s1
12
##### add your solution here for s2
4
##### add your solution here for s3
2
##### add your solution here for s4
4

c) Find the starting index of last occurrence of is or the or was or to for the given input strings.

>>> s1 = 'match after the last newline character'
>>> s2 = 'and then you want to test'
>>> s3 = 'this is good bye then'
>>> s4 = 'who was there to see?'

>>> pat = re.compile()      ##### add your solution here

##### add your solution here for s1
12
##### add your solution here for s2
18
##### add your solution here for s3
17
##### add your solution here for s4
14

d) The given input string contains : exactly once. Extract all characters after the : as output.

>>> ip = 'fruits:apple, mango, guava, blueberry'

##### add your solution here
'apple, mango, guava, blueberry'

e) The given input strings contain some text followed by - followed by a number. Replace that number with its log value using math.log().

>>> s1 = 'first-3.14'
>>> s2 = 'next-123'

>>> pat = re.compile()      ##### add your solution here

>>> import math
>>> pat.sub()     ##### add your solution here for s1
'first-1.144222799920162'
>>> pat.sub()     ##### add your solution here for s2
'next-4.812184355372417'

f) Replace all occurrences of par with spar, spare with extra and park with garden for the given input strings.

>>> str1 = 'apartment has a park'
>>> str2 = 'do you have a spare cable'
>>> str3 = 'write a parser'

##### add your solution here

>>> pat.sub()        ##### add your solution here for str1
'aspartment has a garden'
>>> pat.sub()        ##### add your solution here for str2
'do you have a extra cable'
>>> pat.sub()        ##### add your solution here for str3
'write a sparser'

g) Extract all words between ( and ) from the given input string as a list. Assume that the input will not contain any broken parentheses.

>>> ip = 'another (way) to reuse (portion) matched (by) capture groups'

>>> re.findall()        ##### add your solution here
['way', 'portion', 'by']

h) Extract all occurrences of < up to next occurrence of >, provided there is at least one character in between < and >.

>>> ip = 'a<apple> 1<> b<bye> 2<> c<cat>'

>>> re.findall()        ##### add your solution here
['<apple>', '<> b<bye>', '<> c<cat>']

i) Use re.findall to get the output as shown below for the given input strings. Note the characters used in the input strings carefully.

>>> row1 = '-2,5 4,+3 +42,-53 4356246,-357532354 '
>>> row2 = '1.32,-3.14 634,5.63 63.3e3,9907809345343.235 '

>>> pat = re.compile()       ##### add your solution here

>>> pat.findall(row1)
[('-2', '5'), ('4', '+3'), ('+42', '-53'), ('4356246', '-357532354')]
>>> pat.findall(row2)
[('1.32', '-3.14'), ('634', '5.63'), ('63.3e3', '9907809345343.235')]

j) This is an extension of the previous question.

  • For row1, find the sum of integers of each tuple element. For example, sum of -2 and 5 is 3.
  • For row2, find the sum of floating-point numbers of each tuple element. For example, sum of 1.32 and -3.14 is -1.82.

>>> row1 = '-2,5 4,+3 +42,-53 4356246,-357532354 '
>>> row2 = '1.32,-3.14 634,5.63 63.3e3,9907809345343.235 '

# should be same as previous question
>>> pat = re.compile()       ##### add your solution here

##### add your solution here for row1
[3, 7, -11, -353176108]

##### add your solution here for row2
[-1.82, 639.63, 9907809408643.234]

k) Use re.split to get the output as shown below.

>>> ip = '42:no-output;1000:car-truck;SQEX49801'

>>> re.split()        ##### add your solution here
['42', 'output', '1000', 'truck', 'SQEX49801']

l) For the given list of strings, change the elements into a tuple of original element and number of times t occurs in that element.

>>> words = ['sequoia', 'attest', 'tattletale', 'asset']

##### add your solution here
[('sequoia', 0), ('attest', 3), ('tattletale', 4), ('asset', 1)]

m) The given input string has fields separated by :. Each field contains four uppercase letters, optionally followed by two digits. Ignore the last field, which is empty. See docs.python: Match.groups and use re.finditer to get the output as shown below. If the optional digits aren't present, show 'NA' instead of None.

>>> ip = 'TWXA42:JWPA:NTED01:'

##### add your solution here
[('TWXA', '42'), ('JWPA', 'NA'), ('NTED', '01')]

info Note that this is different from re.findall which will just give empty string instead of None when a capture group doesn't participate.

n) Convert the comma separated strings to corresponding dict objects as shown below.

>>> row1 = 'name:rohan,maths:75,phy:89,'
>>> row2 = 'name:rose,maths:88,phy:92,'

>>> pat = re.compile()      ##### add your solution here

##### add your solution here for row1
{'name': 'rohan', 'maths': '75', 'phy': '89'}
##### add your solution here for row2
{'name': 'rose', 'maths': '88', 'phy': '92'}