Working with matched portions

You have already seen a few features that can match varying text. In this chapter, you'll learn how to extract and work with those matching portions. First, the re.Match object will be discussed in detail. And then you'll learn about re.findall() and re.finditer() functions to get all the matches instead of just the first match. You'll also learn a few tricks like using functions in the replacement section of re.sub(). And finally, some examples for the re.subn() function.

re.Match object

The re.search() and re.fullmatch() functions return a re.Match object from which various details can be extracted like the matched portion of string, location of the matched portion, etc. Note that you'll get the details only for the first match. Working with multiple matches will be covered later in this chapter. Here are some examples with re.Match output.

>>> re.search(r'so+n', 'too soon a song snatch')
<re.Match object; span=(4, 8), match='soon'>

>>> re.fullmatch(r'1(2|3)*4', '1233224')
<re.Match object; span=(0, 7), match='1233224'>

The details in the output above are for quick reference only. There are methods and attributes that you can apply on the re.Match object to get only the exact information you need. Use the span() method to get the starting and ending + 1 indexes of the matching portion.

>>> sentence = 'that is quite a fabricated tale'
>>> m = re.search(r'q.*?t', sentence)
>>> m.span()
(8, 12)
>>> m.span()[0]
8

# you can also directly apply the method instead of intermediate variables
>>> re.search(r'q.*?t', sentence).span()
(8, 12)

The () grouping is also known as a capture group. It has multiple uses, one of which is the ability to work with matched portions of those groups. When capture groups are used with re.search() or re.fullmatch(), they can be retrieved using an index or the group() method on the re.Match object. The first element is always the entire matched portion and the rest of the elements are for capture groups (if they are present).

>>> motivation = 'Doing is often better than thinking of doing.'

>>> re.search(r'of.*ink', motivation)
<re.Match object; span=(9, 32), match='often better than think'>

# retrieving the entire matched portion using index
>>> re.search(r'of.*ink', motivation)[0]
'often better than think'

# retrieving the entire matched portion using the 'group' method
# passing '0' is optional as that is the default value
>>> re.search(r'of.*ink', motivation).group(0)
'often better than think'

Here's an example with capture groups. The leftmost ( in the pattern will get group number 1, second leftmost ( will get group number 2 and so on. Use the groups() method to get a tuple of only the capture group portions.

>>> purchase = 'coffee:100g tea:250g sugar:75g chocolate:50g'
>>> m = re.search(r':(.*?)g.*?:(.*?)g.*?chocolate:(.*?)g', purchase)

# matched portion of the second capture group, can also use m.group(2)
>>> m[2]
'250'
# matched portion of third and first capture groups
>>> m.group(3, 1)
('50', '100')
# tuple of all the capture groups (entire matched portion won't be present)
>>> m.groups()
('100', '250', '50')

To get the matching locations for the capture groups, pass the group number to the span() method. You can also use the start() and end() methods to get either of those locations. Passing 0 is optional when you need the information for the entire matched portion.

>>> m = re.fullmatch(r'aw(.*)me', 'awesome')

>>> m.span(1)
(2, 5)

>>> m.start()
0
>>> m.end(1)
5

There are many more methods and attributes available. See docs.python: Match Objects for details.
>>> pat = re.compile(r'hi.*bye')
>>> m = pat.search('This is goodbye then', 1, 15)
>>> m.pos
1
>>> m.endpos
15
>>> m.re
re.compile('hi.*bye')
>>> m.string
'This is goodbye then'

The groupdict() method will be discussed in the Named capture groups section and the expand() method will be covered in the Match.expand() section.

Assignment expressions

Introduced in Python 3.8, assignment expressions has made it easier to work with matched portions in conditional structures. Here's an example to print the capture group content only if the pattern matches:

# no output since there's no match
>>> if m := re.search(r'(.*)s', 'oh!'):
...     print(m[1])
... 

# a match is found in this case
>>> if m := re.search(r'(.*)s', 'awesome'):
...     print(m[1])
... 
awe

Here's a practical example that comes up often when you are processing a text file.

>>> text = ['type: fruit', 'date: 2023/04/28']
>>> for ip in text:
...     if m := re.search(r'type: (.+)', ip):
...         print(m[1])
...     elif m := re.search(r'date: (.*?)/(.*?)/', ip):
...         print(f'month: {m[2]}, year: {m[1]}')
... 
fruit
month: 04, year: 2023

Did you know that PEP 572 gives re module as one of the use cases for assignment expressions?

Using functions in the replacement section

Functions can be used in the replacement section of re.sub() instead of a string. A re.Match object will be passed to the function as the argument. In the Backreference section, you'll learn an easier way to directly reference the matching portions in the replacement section.

# m[0] will contain entire matched portion
# a^2 and b^2 for the two matches in this example
>>> re.sub(r'(a|b)\^2', lambda m: m[0].upper(), 'a^2 + b^2 - c*3')
'A^2 + B^2 - c*3'

>>> re.sub(r'2|3', lambda m: str(int(m[0])**2), 'a^2 + b^2 - c*3')
'a^4 + b^4 - c*9'

>>> re.sub(r'a|b|c', lambda m: m[0]*4, 'a^2 + b^2 - c*3')
'aaaa^2 + bbbb^2 - cccc*3'

Note that the output of the function has to be a string, otherwise you'll get an error. You'll see more examples with lambda and user defined functions in the coming sections (for example, see the Numeric ranges section).

Using dict in the replacement section

Using a function in the replacement section, you can specify a dict variable to determine the replacement string based on the matched text.

# one to one mappings
>>> d = {'1': 'one', '2': 'two', '4': 'four'}
>>> re.sub(r'1|2|4', lambda m: d[m[0]], '9234012')
'9two3four0onetwo'

# if the matched text doesn't exist as a key, the default value will be used
# recall that \d matches all the digit characters
>>> re.sub(r'\d', lambda m: d.get(m[0], 'X'), '9234012')
'XtwoXfourXonetwo'

For swapping two or more portions without using intermediate results, using a dict object is recommended.

>>> swap = {'cat': 'tiger', 'tiger': 'cat'}
>>> words = 'cat tiger dog tiger cat'

>>> re.sub(r'cat|tiger', lambda m: swap[m[0]], words)
'tiger cat dog cat tiger'

For dict objects that have many entries and likely to undergo changes during development, building an alternation list manually is not a good choice. Also, recall that as per precedence rules, longest length string should come first.

# note that numbers have been converted to strings here
# otherwise, you'd need to convert it in the lambda code
>>> d = {'hand': '1', 'handy': '2', 'handful': '3', 'a^b': '4'}

# sort the keys to handle precedence rules
>>> words = sorted(d, key=len, reverse=True)
# add anchors and flags if needed
>>> pat = re.compile('|'.join(re.escape(s) for s in words))
>>> pat.pattern
'handful|handy|hand|a\\^b'
>>> pat.sub(lambda m: d[m[0]], 'handful hand pin handy (a^b)')
'3 1 pin 2 (4)'

If you have thousands of key-value pairs, using specialized libraries like flashtext (or flashtext2) is highly recommended instead of regular expressions.

re.findall()

The re.findall() function returns all the matched portions as a list of strings.

re.findall(pattern, string, flags=0)

The first argument is the RE pattern you want to test and extract against the input string, which is the second argument. flags is optional. Here are some examples.

>>> re.findall(r'so*n', 'too soon a song snatch')
['soon', 'son', 'sn']

>>> re.findall(r'so+n', 'too soon a song snatch')
['soon', 'son']

>>> s = 'PAR spar apparent SpArE part pare'
>>> re.findall(r'\bs?pare?\b', s, flags=re.I)
['PAR', 'spar', 'SpArE', 'pare']

It is useful for debugging purposes as well. For example, to see the potential matches before applying substitution.

>>> s = 'green:3.14:teal::brown:oh!:blue'

>>> re.findall(r':.*:', s)
[':3.14:teal::brown:oh!:']

>>> re.findall(r':.*?:', s)
[':3.14:', '::', ':oh!:']

>>> re.findall(r':.*+:', s)
[]

Presence of capture groups affects re.findall() in different ways depending on the number of groups used:

If a single capture group is used, output will be a list of strings. Each element will have only the portion matched by the capture group
If more than one capture group is used, output will be a list of tuples. Each element will be a tuple containing portions matched by all the capturing groups

For both cases, any pattern outside the capture groups will not be represented in the output. Also, you'll get an empty string if a particular capture group didn't match any character.

>>> purchase = 'coffee:100g tea:250g sugar:75g chocolate:50g salt:g'
# without capture groups
>>> re.findall(r':.*?g', purchase)
[':100g', ':250g', ':75g', ':50g', ':g']
# single capture group
>>> re.findall(r':(.*?)g', purchase)
['100', '250', '75', '50', '']

# multiple capture groups
# note that the last date didn't match because there's no comma at the end
# you'll later learn better ways to match such patterns
>>> re.findall(r'(.*?)/(.*?)/(.*?),', '2023/04/25,1986/Mar/02,77/12/31')
[('2023', '04', '25'), ('1986', 'Mar', '02')]

See the Non-capturing groups section if you need to use groupings without the behavior shown above.

re.finditer()

You can use the re.finditer() function to get an iterator object with each element as re.Match objects for the matched portions.

re.finditer(pattern, string, flags=0)

Here's an example:

# output of finditer is an iterator object
>>> re.finditer(r'so+n', 'song too soon snatch')
<callable_iterator object at 0x7fb65e103438>

# each element is a re.Match object corresponding to the matched portion
>>> m_iter = re.finditer(r'so+n', 'song too soon snatch')
>>> for m in m_iter:
...     print(m)
... 
<re.Match object; span=(0, 3), match='son'>
<re.Match object; span=(9, 13), match='soon'>

Use the re.Match object's methods and attributes as needed. You can also replicate the re.findall() functionality.

>>> m_iter = re.finditer(r'so+n', 'song too soon snatch')
>>> for m in m_iter:
...     print(m[0].upper(), m.span(), sep='\t')
... 
SON     (0, 3)
SOON    (9, 13)

# same as: re.findall(r'(.*?)/(.*?)/(.*?),', d)
>>> d = '2023/04/25,1986/Mar/02,77/12/31'
>>> m_iter = re.finditer(r'(.*?)/(.*?)/(.*?),', d)
>>> [m.groups() for m in m_iter]
[('2023', '04', '25'), ('1986', 'Mar', '02')]

Since the output of re.finditer() is an iterator object, you cannot iterate over it more than once.
>>> d = '2023/04/25,1986/Mar/02,77/12/31'
>>> m_iter = re.finditer(r'(.*?),', d)

>>> [m[1] for m in m_iter]
['2023/04/25', '1986/Mar/02']
>>> [m[1] for m in m_iter]
[]

re.split() with capture groups

Capture groups affect the re.split() function as well. If the pattern used to split contains capture groups, the portions matched by those groups will also be a part of the output list.

# without capture group
>>> re.split(r'1*4?2', '31111111111251111426')
['3', '5', '6']

# to include the matching portions of the pattern as well in the output
>>> re.split(r'(1*4?2)', '31111111111251111426')
['3', '11111111112', '5', '111142', '6']

If part of the pattern is outside a capture group, the text thus matched won't be in the output. If a capture group didn't participate, it will be represented by None in the output list.

# here 4?2 is outside the capture group, so that portion won't be in the output
>>> re.split(r'(1*)4?2', '31111111111251111426')
['3', '1111111111', '5', '1111', '6']

# multiple capture groups example
# note that the portion matched by b+ isn't present in the output
>>> re.split(r'(a+)b+(c+)', '3.14aabccc42')
['3.14', 'aa', 'ccc', '42']

# here (4)? matches zero times on the first occasion
>>> re.split(r'(1*)(4)?2', '31111111111251111426')
['3', '1111111111', None, '5', '1111', '4', '6']

Use of capture groups and maxsplit=1 gives behavior similar to the str.partition() method.

# first element is the portion before the first match
# second element is the portion matched by the pattern itself
# third element is the rest of the input string
>>> re.split(r'(a+b+c+)', '3.14aabccc42abc88', maxsplit=1)
['3.14', 'aabccc', '42abc88']

re.subn()

The re.subn() function behaves the same as re.sub() except that the output is a tuple. The first element of the tuple is the same output as the re.sub() function. The second element gives the number of substitutions made. In other words, you also get the number of matches.

re.subn(pattern, repl, string, count=0, flags=0)

>>> greeting = 'Have a nice weekend'

>>> re.sub(r'e', 'E', greeting)
'HavE a nicE wEEkEnd'

# with re.subn, you can also infer that 5 substitutions were made
>>> re.subn(r'e', 'E', greeting)
('HavE a nicE wEEkEnd', 5)

Here's an example that performs a conditional operation based on whether the substitution was successful or not.

>>> word = 'coffining'
# recursively delete 'fin'
>>> while True:
...     word, cnt = re.subn(r'fin', '', word)
...     if cnt == 0:
...         break
... 
>>> word
'cog'

If you like using assignment expressions, the above while loop can be shortened to:

while (op := re.subn(r'fin', '', word))[1]:
    word = op[0]

Cheatsheet and Summary

Note	Description
`re.Match` object	get details like matched portions, location, etc
`m[0]` or `m.group(0)`	entire matched portion of `re.Match` object `m`
`m[1]` or `m.group(1)`	matched portion of the first capture group
`m[2]` or `m.group(2)`	matched portion of the second capture group and so on
`m.groups()`	tuple of all the capture groups' matched portions
`m.span()`	start and end+1 index of the entire matched portion
	pass a number to get span of that particular capture group
	can also use `m.start()` and `m.end()`
`re.sub(r'pat', f, s)`	function `f` will get a `re.Match` object as the argument
using `dict`	replacement string based on the matched text as dictionary key
	ex: `re.sub(r'pat', lambda m: d.get(m[0], default), s)`
`re.findall()`	returns all the matches as a list of strings
	`re.findall(pattern, string, flags=0)`
	if 1 capture group is used, only its matches are returned
	1+, each element will be tuple of capture groups
	portion matched by pattern outside groups won't be in output
	empty matches will be represented by empty string
`re.finditer()`	iterator with `re.Match` object for each match
	`re.finditer(pattern, string, flags=0)`
`re.split()`	capture groups affects `re.split()` too
	text matched by the groups will be part of the output
	portion matched by pattern outside groups won't be in output
	group that didn't match will be represented by `None`
`re.subn()`	gives tuple of modified string and number of substitutions
	`re.subn(pattern, repl, string, count=0, flags=0)`

This chapter introduced different ways to work with various matching portions of the input string. The re.Match object helps you get the portion matched by the RE pattern and capture groups, location of the match, etc. Functions can be used in the replacement section, which gets re.Match object as an argument. Using functions, you can do substitutions based on dict mappings. To get all the matches instead of just the first match, you can use re.findall() (which gives a list of strings as output) and re.finditer() (which gives an iterator of re.Match objects). You also learnt how capture groups affect the output of re.findall() and re.split() functions. You'll see many more uses of groupings in the coming chapters. The re.subn() function is like re.sub() but additionally gives number of matches as well.

Exercises

1) For the given strings, extract the matching portion from the first is to the last t.

>>> str1 = 'This the biggest fruit you have seen?'
>>> str2 = 'Your mission is to read and practice consistently'

>>> pat = re.compile()      ##### add your solution here

##### add your solution here for str1
'is the biggest fruit'
##### add your solution here for str2
'ission is to read and practice consistent'

2) Find the starting index of the first occurrence of is or the or was or to for the given input strings.

>>> s1 = 'match after the last newline character'
>>> s2 = 'and then you want to test'
>>> s3 = 'this is good bye then'
>>> s4 = 'who was there to see?'

>>> pat = re.compile()      ##### add your solution here

##### add your solution here for s1
12
##### add your solution here for s2
4
##### add your solution here for s3
2
##### add your solution here for s4
4

3) Find the starting index of the last occurrence of is or the or was or to for the given input strings.

>>> s1 = 'match after the last newline character'
>>> s2 = 'and then you want to test'
>>> s3 = 'this is good bye then'
>>> s4 = 'who was there to see?'

>>> pat = re.compile()      ##### add your solution here

##### add your solution here for s1
12
##### add your solution here for s2
18
##### add your solution here for s3
17
##### add your solution here for s4
14

4) The given input string contains : exactly once. Extract all characters after the : as output.

>>> ip = 'fruits:apple, mango, guava, blueberry'

##### add your solution here
'apple, mango, guava, blueberry'

5) The given input strings contain some text followed by - followed by a number. Replace that number with its log value using math.log().

>>> s1 = 'first-3.14'
>>> s2 = 'next-123'

>>> pat = re.compile()      ##### add your solution here

>>> import math
>>> pat.sub()     ##### add your solution here for s1
'first-1.144222799920162'
>>> pat.sub()     ##### add your solution here for s2
'next-4.812184355372417'

6) Replace all occurrences of par with spar, spare with extra and park with garden for the given input strings.

>>> str1 = 'apartment has a park'
>>> str2 = 'do you have a spare cable'
>>> str3 = 'write a parser'

##### add your solution here

>>> pat.sub()        ##### add your solution here for str1
'aspartment has a garden'
>>> pat.sub()        ##### add your solution here for str2
'do you have a extra cable'
>>> pat.sub()        ##### add your solution here for str3
'write a sparser'

7) Extract all words between ( and ) from the given input string as a list. Assume that the input will not contain any broken parentheses.

>>> ip = 'another (way) to reuse (portion) matched (by) capture groups'

>>> re.findall()        ##### add your solution here
['way', 'portion', 'by']

8) Extract all occurrences of < up to the next occurrence of >, provided there is at least one character in between < and >.

>>> ip = 'a<apple> 1<> b<bye> 2<> c<cat>'

>>> re.findall()        ##### add your solution here
['<apple>', '<> b<bye>', '<> c<cat>']

9) Use re.findall() to get the output as shown below for the given input strings. Note the characters used in the input strings carefully.

>>> row1 = '-2,5 4,+3 +42,-53 4356246,-357532354 '
>>> row2 = '1.32,-3.14 634,5.63 63.3e3,9907809345343.235 '

>>> pat = re.compile()       ##### add your solution here

>>> pat.findall(row1)
[('-2', '5'), ('4', '+3'), ('+42', '-53'), ('4356246', '-357532354')]
>>> pat.findall(row2)
[('1.32', '-3.14'), ('634', '5.63'), ('63.3e3', '9907809345343.235')]

10) This is an extension to the previous question.

For row1, find the sum of integers of each tuple element. For example, sum of -2 and 5 is 3.
For row2, find the sum of floating-point numbers of each tuple element. For example, sum of 1.32 and -3.14 is -1.82.

>>> row1 = '-2,5 4,+3 +42,-53 4356246,-357532354 '
>>> row2 = '1.32,-3.14 634,5.63 63.3e3,9907809345343.235 '

# should be the same as previous question
>>> pat = re.compile()       ##### add your solution here

##### add your solution here for row1
[3, 7, -11, -353176108]

##### add your solution here for row2
[-1.82, 639.63, 9907809408643.234]

11) Use re.split() to get the output as shown below.

>>> ip = '42:no-output;1000:car-tr:u-ck;SQEX49801'

>>> re.split()        ##### add your solution here
['42', 'output', '1000', 'tr:u-ck', 'SQEX49801']

12) For the given list of strings, change the elements into a tuple of original element and the number of times t occurs in that element.

>>> words = ['sequoia', 'attest', 'tattletale', 'asset']

##### add your solution here
[('sequoia', 0), ('attest', 3), ('tattletale', 4), ('asset', 1)]

13) The given input string has fields separated by :. Each field contains four uppercase alphabets followed optionally by two digits. Ignore the last field, which is empty. See docs.python: Match.groups and use re.finditer() to get the output as shown below. If the optional digits aren't present, show 'NA' instead of None.

>>> ip = 'TWXA42:JWPA:NTED01:'

##### add your solution here
[('TWXA', '42'), ('JWPA', 'NA'), ('NTED', '01')]

Note that this is different from re.findall() which will just give empty string instead of None when a capture group doesn't participate.

14) Convert the comma separated strings to corresponding dict objects as shown below.

>>> row1 = 'name:rohan,maths:75,phy:89,'
>>> row2 = 'name:rose,maths:88,phy:92,'

>>> pat = re.compile()      ##### add your solution here

##### add your solution here for row1
{'name': 'rohan', 'maths': '75', 'phy': '89'}
##### add your solution here for row2
{'name': 'rose', 'maths': '88', 'phy': '92'}

Understanding Python re(gex)?