Working with matched portions
You have already seen a few features that can match varying text. In this chapter, you'll learn how to extract and work with those matching portions. First, the re.Match object will be discussed in detail. Then you'll learn about the re.findall() and re.finditer() functions, which get all the matches instead of just the first match. You'll also learn a few tricks like using functions in the replacement section of re.sub(). And finally, you'll see some examples for the re.subn() function.
re.Match object
The re.search() and re.fullmatch() functions return a re.Match object from which various details can be extracted like the matched portion of string, location of the matched portion, etc. Note that you'll get the details only for the first match. Working with multiple matches will be covered later in this chapter. Here are some examples with re.Match output.
>>> re.search(r'so+n', 'too soon a song snatch')
<re.Match object; span=(4, 8), match='soon'>
>>> re.fullmatch(r'1(2|3)*4', '1233224')
<re.Match object; span=(0, 7), match='1233224'>
The details in the output above are for quick reference only. There are methods and attributes that you can apply on the re.Match object to get only the exact information you need. Use the span() method to get the starting and ending + 1 indexes of the matching portion.
>>> sentence = 'that is quite a fabricated tale'
>>> m = re.search(r'q.*?t', sentence)
>>> m.span()
(8, 12)
>>> m.span()[0]
8
# you can also directly apply the method instead of intermediate variables
>>> re.search(r'q.*?t', sentence).span()
(8, 12)
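Since the second value of span() is one past the last matched index, the pair fits Python's slicing convention directly. Here's a small sketch of that idea (the variable names are only for illustration):

```python
import re

sentence = 'that is quite a fabricated tale'
m = re.search(r'q.*?t', sentence)

# span() gives (start, end), where end is one past the last matched index
start, end = m.span()

# slicing the input with those values gives back the matched portion
matched = sentence[start:end]

# the same indexes also give the text before and after the match
before = sentence[:start]
after = sentence[end:]

print(matched)
print(repr(before), repr(after))
```

This can be handy when you need the text around a match and not just the match itself.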
The () grouping is also known as a capture group. It has multiple uses, one of which is the ability to work with matched portions of those groups. When capture groups are used with re.search() or re.fullmatch(), the portions they matched can be retrieved using an index or the group() method on the re.Match object. The first element is always the entire matched portion and the rest of the elements are for capture groups (if they are present).
>>> motivation = 'Doing is often better than thinking of doing.'
>>> re.search(r'of.*ink', motivation)
<re.Match object; span=(9, 32), match='often better than think'>
# retrieving entire matched portion using index
>>> re.search(r'of.*ink', motivation)[0]
'often better than think'
# retrieving the entire matched portion using the 'group' method
# passing '0' is optional as that is the default value
>>> re.search(r'of.*ink', motivation).group(0)
'often better than think'
Here's an example with capture groups. The leftmost ( in the pattern will get group number 1, second leftmost ( will get group number 2 and so on. Use the groups() method to get a tuple of only the capture group portions.
>>> purchase = 'coffee:100g tea:250g sugar:75g chocolate:50g'
>>> m = re.search(r':(.*?)g.*?:(.*?)g.*?chocolate:(.*?)g', purchase)
# matched portion of the second capture group, can also use m.group(2)
>>> m[2]
'250'
# matched portion of third and first capture groups
>>> m.group(3, 1)
('50', '100')
# tuple of all the capture groups (entire matched portion won't be present)
>>> m.groups()
('100', '250', '50')
To get the matching locations for the capture groups, pass the group number to the span() method. You can also use the start() and end() methods to get either of those locations. Passing 0 is optional when you need the information for the entire matched portion.
>>> m = re.fullmatch(r'aw(.*)me', 'awesome')
>>> m.span(1)
(2, 5)
>>> m.start()
0
>>> m.end(1)
5
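To tie the group() and span() methods together, here's a sketch that collects the text and location of every group from the earlier purchase example. The report list is just an illustrative name, not something provided by the re module.

```python
import re

purchase = 'coffee:100g tea:250g sugar:75g chocolate:50g'
m = re.search(r':(.*?)g.*?:(.*?)g.*?chocolate:(.*?)g', purchase)

# group 0 is the entire matched portion, 1 to N are the capture groups
report = []
for n in range(len(m.groups()) + 1):
    report.append((n, m.group(n), m.span(n)))

for n, text, (start, end) in report:
    # slicing the input with the span gives the same text as group(n)
    print(n, repr(text), (start, end), purchase[start:end] == text)
```

Every line of the output should end with True, since the span of a group always points to the text captured by that group.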
There are many more methods and attributes available. See docs.python: Match Objects for details.
>>> pat = re.compile(r'hi.*bye')
>>> m = pat.search('This is goodbye then', 1, 15)
>>> m.pos
1
>>> m.endpos
15
>>> m.re
re.compile('hi.*bye')
>>> m.string
'This is goodbye then'
The groupdict() method will be covered in the Named capture groups section and the expand() method will be covered in the Match.expand() section.
Assignment expressions
Introduced in Python 3.8, assignment expressions have made it easier to work with matched portions in conditional structures. Here's an example to print the capture group content only if the pattern matches:
# no output since there's no match
>>> if m := re.search(r'(.*)s', 'oh!'):
...     print(m[1])
...
# a match is found in this case
>>> if m := re.search(r'(.*)s', 'awesome'):
...     print(m[1])
...
awe
Here's a practical example that comes up often when you are processing a text file.
>>> text = ['type: fruit', 'date: 2023/04/28']
>>> for ip in text:
...     if m := re.search(r'type: (.+)', ip):
...         print(m[1])
...     elif m := re.search(r'date: (.*?)/(.*?)/', ip):
...         print(f'month: {m[2]}, year: {m[1]}')
...
fruit
month: 04, year: 2023
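The same idea can be wrapped in a function when the parsed values are needed later instead of being printed. The sketch below assumes a hypothetical parse_line() helper and a stricter date format (digits only); neither is part of the re module.

```python
import re

def parse_line(line):
    """Return a (kind, value) tuple for recognized lines, None otherwise."""
    # hypothetical helper, the line formats are assumptions for illustration
    if m := re.fullmatch(r'type: (.+)', line):
        return ('type', m[1])
    elif m := re.fullmatch(r'date: (\d+)/(\d+)/(\d+)', line):
        year, month, day = m.groups()
        return ('date', (int(year), int(month), int(day)))
    # neither pattern matched
    return None

lines = ['type: fruit', 'date: 2023/04/28', 'unknown entry']
parsed = [parse_line(s) for s in lines]
print(parsed)
```

Each branch tests and captures in a single step, so there's no need to call the matching function twice or to assign m on a separate line.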
Did you know that PEP 572 gives the re module as one of the use cases for assignment expressions?
Using functions in the replacement section
Functions can be used in the replacement section of re.sub() instead of a string. A re.Match object will be passed to the function as its argument. In the Backreference section, you'll learn an easier way to directly reference the matching portions in the replacement section.
# m[0] will contain entire matched portion
# a^2 and b^2 for the two matches in this example
>>> re.sub(r'(a|b)\^2', lambda m: m[0].upper(), 'a^2 + b^2 - c*3')
'A^2 + B^2 - c*3'
>>> re.sub(r'2|3', lambda m: str(int(m[0])**2), 'a^2 + b^2 - c*3')
'a^4 + b^4 - c*9'
>>> re.sub(r'a|b|c', lambda m: m[0]*4, 'a^2 + b^2 - c*3')
'aaaa^2 + bbbb^2 - cccc*3'
Note that the output of the function has to be a string, otherwise you'll get an error. You'll see more examples with lambda and user defined functions in the coming sections (for example, see the Numeric ranges section).
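A lambda is convenient for one-liners, but a named function works just as well. Here's a minimal sketch with a hypothetical double_number() function that doubles every integer found in the input:

```python
import re

def double_number(m):
    # m is the re.Match object for the current match
    # the return value has to be a string, hence the str() call
    return str(int(m[0]) * 2)

result = re.sub(r'\d+', double_number, 'a^2 + b^12 - c*3')
print(result)
```

Note that only the function name is passed to re.sub(), without parentheses. The function is called once for each match.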
Using dict in the replacement section
Using a function in the replacement section, you can specify a dict variable to determine the replacement string based on the matched text.
# one to one mappings
>>> d = {'1': 'one', '2': 'two', '4': 'four'}
>>> re.sub(r'1|2|4', lambda m: d[m[0]], '9234012')
'9two3four0onetwo'
# if the matched text doesn't exist as a key, the default value will be used
# recall that \d matches all the digit characters
>>> re.sub(r'\d', lambda m: d.get(m[0], 'X'), '9234012')
'XtwoXfourXonetwo'
For swapping two or more portions without using intermediate results, using a dict object is recommended.
>>> swap = {'cat': 'tiger', 'tiger': 'cat'}
>>> words = 'cat tiger dog tiger cat'
>>> re.sub(r'cat|tiger', lambda m: swap[m[0]], words)
'tiger cat dog cat tiger'
For dict objects that have many entries and are likely to undergo changes during development, building the alternation list manually is not a good choice. Also, recall that as per precedence rules, the longest string should come first.
# note that numbers have been converted to strings here
# otherwise, you'd need to convert it in the lambda code
>>> d = {'hand': '1', 'handy': '2', 'handful': '3', 'a^b': '4'}
# sort the keys to handle precedence rules
>>> words = sorted(d, key=len, reverse=True)
# add anchors and flags if needed
>>> pat = re.compile('|'.join(re.escape(s) for s in words))
>>> pat.pattern
'handful|handy|hand|a\\^b'
>>> pat.sub(lambda m: d[m[0]], 'handful hand pin handy (a^b)')
'3 1 pin 2 (4)'
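The steps above can be bundled into a reusable function. This is only a sketch: replace_from_dict() is a hypothetical name, and it assumes that all the keys and values are strings.

```python
import re

def replace_from_dict(mapping, text):
    """Replace every key of mapping found in text with its value."""
    # hypothetical helper, assumes keys and values are strings
    if not mapping:
        # an empty alternation would match the empty string everywhere
        return text
    # longest keys first, so that 'handful' is tried before 'hand'
    keys = sorted(mapping, key=len, reverse=True)
    # re.escape() ensures keys like 'a^b' are matched literally
    pat = re.compile('|'.join(re.escape(k) for k in keys))
    return pat.sub(lambda m: mapping[m[0]], text)

d = {'hand': '1', 'handy': '2', 'handful': '3', 'a^b': '4'}
print(replace_from_dict(d, 'handful hand pin handy (a^b)'))

swap = {'cat': 'tiger', 'tiger': 'cat'}
print(replace_from_dict(swap, 'cat tiger dog tiger cat'))
```

Since the pattern is built from the keys every time, you only have to update the dict when the mappings change.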
If you have thousands of key-value pairs, using specialized libraries like flashtext is highly recommended instead of regular expressions.
re.findall()
The re.findall() function returns all the matched portions as a list of strings.
re.findall(pattern, string, flags=0)
The first argument is the RE pattern to match against the input string, which is the second argument. The flags argument is optional. Here are some examples.
>>> re.findall(r'so*n', 'too soon a song snatch')
['soon', 'son', 'sn']
>>> re.findall(r'so+n', 'too soon a song snatch')
['soon', 'son']
>>> s = 'PAR spar apparent SpArE part pare'
>>> re.findall(r'\bs?pare?\b', s, flags=re.I)
['PAR', 'spar', 'SpArE', 'pare']
The re.findall() function is useful for debugging purposes as well, for example to see the potential matches before applying a substitution.
>>> s = 'green:3.14:teal::brown:oh!:blue'
>>> re.findall(r':.*:', s)
[':3.14:teal::brown:oh!:']
>>> re.findall(r':.*?:', s)
[':3.14:', '::', ':oh!:']
>>> re.findall(r':.*+:', s)
[]
Presence of capture groups affects re.findall() in different ways depending on the number of groups used:
- If a single capture group is used, output will be a list of strings. Each element will have only the portion matched by the capture group
- If more than one capture group is used, output will be a list of tuples. Each element will be a tuple containing portions matched by all the capturing groups
For both cases, any pattern outside the capture groups will not be represented in the output. Also, you'll get an empty string if a particular capture group didn't match any character.
>>> purchase = 'coffee:100g tea:250g sugar:75g chocolate:50g salt:g'
# without capture groups
>>> re.findall(r':.*?g', purchase)
[':100g', ':250g', ':75g', ':50g', ':g']
# single capture group
>>> re.findall(r':(.*?)g', purchase)
['100', '250', '75', '50', '']
# multiple capture groups
# note that the last date didn't match because there's no comma at the end
# you'll later learn better ways to match such patterns
>>> re.findall(r'(.*?)/(.*?)/(.*?),', '2023/04/25,1986/Mar/02,77/12/31')
[('2023', '04', '25'), ('1986', 'Mar', '02')]
See the Non-capturing groups section if you need to use groupings without the behavior shown above.
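The list of tuples you get with multiple capture groups is easy to unpack. As a sketch, here's one way to turn the purchase string into a dict of item names and integer weights (the pattern assumes every entry looks like name:digits followed by g):

```python
import re

purchase = 'coffee:100g tea:250g sugar:75g chocolate:50g'

# two capture groups, so each element is a (name, quantity) tuple
pairs = re.findall(r'(\w+):(\d+)g', purchase)
print(pairs)

# unpack each tuple to build a dict, converting quantity to int
weights = {name: int(qty) for name, qty in pairs}
print(weights)
print(sum(weights.values()))
```

The : and g characters are outside the capture groups, so they aren't part of the output.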
re.finditer()
You can use the re.finditer() function to get an iterator object whose elements are re.Match objects for the matched portions.
re.finditer(pattern, string, flags=0)
Here's an example:
# output of finditer is an iterator object
>>> re.finditer(r'so+n', 'song too soon snatch')
<callable_iterator object at 0x7fb65e103438>
# each element is a re.Match object corresponding to the matched portion
>>> m_iter = re.finditer(r'so+n', 'song too soon snatch')
>>> for m in m_iter:
...     print(m)
...
<re.Match object; span=(0, 3), match='son'>
<re.Match object; span=(9, 13), match='soon'>
Use the re.Match object's methods and attributes as needed. You can also replicate the re.findall() functionality.
>>> m_iter = re.finditer(r'so+n', 'song too soon snatch')
>>> for m in m_iter:
...     print(m[0].upper(), m.span(), sep='\t')
...
SON (0, 3)
SOON (9, 13)
# same as: re.findall(r'(.*?)/(.*?)/(.*?),', d)
>>> d = '2023/04/25,1986/Mar/02,77/12/31'
>>> m_iter = re.finditer(r'(.*?)/(.*?)/(.*?),', d)
>>> [m.groups() for m in m_iter]
[('2023', '04', '25'), ('1986', 'Mar', '02')]
Since the output of re.finditer() is an iterator object, you cannot iterate over it more than once.
>>> d = '2023/04/25,1986/Mar/02,77/12/31'
>>> m_iter = re.finditer(r'(.*?),', d)
>>> [m[1] for m in m_iter]
['2023/04/25', '1986/Mar/02']
>>> [m[1] for m in m_iter]
[]
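If you do need to go over the matches more than once, one option is to save the re.Match objects in a list first. Here's a minimal sketch:

```python
import re

d = '2023/04/25,1986/Mar/02,77/12/31'

# list() consumes the iterator once and keeps all the re.Match objects
matches = list(re.finditer(r'(.*?),', d))

# the list can be traversed as many times as needed
dates = [m[1] for m in matches]
locations = [m.span(1) for m in matches]

print(dates)
print(locations)
```

The trade-off is memory usage, since all the re.Match objects are held at the same time instead of being produced one by one.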
re.split() with capture groups
Capture groups affect the re.split() function as well. If the pattern used to split contains capture groups, the portions matched by those groups will also be a part of the output list.
# without capture group
>>> re.split(r'1*4?2', '31111111111251111426')
['3', '5', '6']
# to include the matching portions of the pattern as well in the output
>>> re.split(r'(1*4?2)', '31111111111251111426')
['3', '11111111112', '5', '111142', '6']
If part of the pattern is outside a capture group, the text thus matched won't be in the output. If a capture group didn't participate, it will be represented by None in the output list.
# here 4?2 is outside the capture group, so that portion won't be in the output
>>> re.split(r'(1*)4?2', '31111111111251111426')
['3', '1111111111', '5', '1111', '6']
# multiple capture groups example
# note that the portion matched by b+ isn't present in the output
>>> re.split(r'(a+)b+(c+)', '3.14aabccc42')
['3.14', 'aa', 'ccc', '42']
# here (4)? matches zero times on the first occasion
>>> re.split(r'(1*)(4)?2', '31111111111251111426')
['3', '1111111111', None, '5', '1111', '4', '6']
Use of capture groups and maxsplit=1 gives behavior similar to the str.partition() method.
# first element is the portion before the first match
# second element is the portion matched by the pattern itself
# third element is the rest of the input string
>>> re.split(r'(a+b+c+)', '3.14aabccc42abc88', maxsplit=1)
['3.14', 'aabccc', '42abc88']
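When the entire pattern is inside a single capture group, the output list alternates between the text around the matches and the matches themselves. The sketch below uses slicing to separate the two (fields and seps are illustrative names):

```python
import re

s = '31111111111251111426'
parts = re.split(r'(1*4?2)', s)

# even indexes have the text between the matches
fields = parts[0::2]
# odd indexes have the portions matched by the capture group
seps = parts[1::2]

print(fields)
print(seps)
# nothing is lost, so joining the parts gives back the input string
print(''.join(parts) == s)
```

This is useful when you want to process the fields and then put the string back together with the original separators.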
re.subn()
The re.subn() function behaves the same as re.sub() except that the output is a tuple. The first element of the tuple is the same output as the re.sub() function. The second element gives the number of substitutions made. In other words, you also get the number of matches.
re.subn(pattern, repl, string, count=0, flags=0)
>>> greeting = 'Have a nice weekend'
>>> re.sub(r'e', 'E', greeting)
'HavE a nicE wEEkEnd'
# with re.subn, you can also infer that 5 substitutions were made
>>> re.subn(r'e', 'E', greeting)
('HavE a nicE wEEkEnd', 5)
Here's an example that performs a conditional operation based on whether the substitution was successful or not.
>>> word = 'coffining'
# repeatedly delete 'fin' until there are no more matches
>>> while True:
...     word, cnt = re.subn(r'fin', '', word)
...     if cnt == 0:
...         break
...
>>> word
'cog'
If you like using assignment expressions, the above while loop can be shortened to:
while (op := re.subn(r'fin', '', word))[1]:
    word = op[0]
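The count returned by re.subn() can also be accumulated across iterations. Here's a sketch of the same loop as a function, where remove_repeatedly() is a hypothetical helper that assumes the pattern never matches an empty string (otherwise the loop wouldn't terminate):

```python
import re

def remove_repeatedly(pattern, text):
    """Delete matches of pattern until none remain.

    Returns the final string and the total number of deletions.
    """
    # hypothetical helper, assumes pattern cannot match an empty string
    total = 0
    while True:
        text, cnt = re.subn(pattern, '', text)
        if cnt == 0:
            # no substitution was made in this pass, so we are done
            return text, total
        total += cnt

result = remove_repeatedly(r'fin', 'coffining')
print(result)
```

For 'coffining', the first pass gives 'cofing' and the second pass gives 'cog', so two deletions are reported in total.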
Cheatsheet and Summary
Note | Description |
---|---|
re.Match object | get details like matched portions, location, etc |
m[0] or m.group(0) | entire matched portion of re.Match object m |
m[1] or m.group(1) | matched portion of the first capture group |
m[2] or m.group(2) | matched portion of the second capture group and so on |
m.groups() | tuple of all the capture groups' matched portions |
m.span() | start and end+1 index of the entire matched portion; pass a number to get the span of that particular capture group; can also use m.start() and m.end() |
re.sub(r'pat', f, s) | function f will get a re.Match object as the argument |
using dict | replacement string based on the matched text as dictionary key; ex: re.sub(r'pat', lambda m: d.get(m[0], default), s) |
re.findall() | returns all the matches as a list of strings; re.findall(pattern, string, flags=0); if 1 capture group is used, only its matches are returned; if more than 1, each element will be a tuple of capture groups; portion matched by pattern outside groups won't be in output; empty matches will be represented by empty string |
re.finditer() | iterator with re.Match object for each match; re.finditer(pattern, string, flags=0) |
re.split() | capture groups affect re.split() too; text matched by the groups will be part of the output; portion matched by pattern outside groups won't be in output; group that didn't match will be represented by None |
re.subn() | gives tuple of modified string and number of substitutions; re.subn(pattern, repl, string, count=0, flags=0) |
This chapter introduced different ways to work with various matching portions of the input string. The re.Match object helps you get the portion matched by the RE pattern and capture groups, location of the match, etc. Functions can be used in the replacement section, and they get a re.Match object as the argument. Using functions, you can do substitutions based on dict mappings. To get all the matches instead of just the first match, you can use re.findall() (which gives a list of strings as output) and re.finditer() (which gives an iterator of re.Match objects). You also learnt how capture groups affect the output of the re.findall() and re.split() functions. You'll see many more uses of groupings in the coming chapters. The re.subn() function is like re.sub() but additionally gives the number of substitutions made.
Exercises
a) For the given strings, extract the matching portion from the first 'is' to the last 't'.
>>> str1 = 'This the biggest fruit you have seen?'
>>> str2 = 'Your mission is to read and practice consistently'
>>> pat = re.compile() ##### add your solution here
##### add your solution here for str1
'is the biggest fruit'
##### add your solution here for str2
'ission is to read and practice consistent'
b) Find the starting index of the first occurrence of 'is' or 'the' or 'was' or 'to' for the given input strings.
>>> s1 = 'match after the last newline character'
>>> s2 = 'and then you want to test'
>>> s3 = 'this is good bye then'
>>> s4 = 'who was there to see?'
>>> pat = re.compile() ##### add your solution here
##### add your solution here for s1
12
##### add your solution here for s2
4
##### add your solution here for s3
2
##### add your solution here for s4
4
c) Find the starting index of the last occurrence of 'is' or 'the' or 'was' or 'to' for the given input strings.
>>> s1 = 'match after the last newline character'
>>> s2 = 'and then you want to test'
>>> s3 = 'this is good bye then'
>>> s4 = 'who was there to see?'
>>> pat = re.compile() ##### add your solution here
##### add your solution here for s1
12
##### add your solution here for s2
18
##### add your solution here for s3
17
##### add your solution here for s4
14
d) The given input string contains ':' exactly once. Extract all characters after the ':' as output.
>>> ip = 'fruits:apple, mango, guava, blueberry'
##### add your solution here
'apple, mango, guava, blueberry'
e) The given input strings contain some text followed by '-' followed by a number. Replace that number with its log value using math.log().
>>> s1 = 'first-3.14'
>>> s2 = 'next-123'
>>> pat = re.compile() ##### add your solution here
>>> import math
>>> pat.sub() ##### add your solution here for s1
'first-1.144222799920162'
>>> pat.sub() ##### add your solution here for s2
'next-4.812184355372417'
f) Replace all occurrences of 'par' with 'spar', 'spare' with 'extra' and 'park' with 'garden' for the given input strings.
>>> str1 = 'apartment has a park'
>>> str2 = 'do you have a spare cable'
>>> str3 = 'write a parser'
##### add your solution here
>>> pat.sub() ##### add your solution here for str1
'aspartment has a garden'
>>> pat.sub() ##### add your solution here for str2
'do you have a extra cable'
>>> pat.sub() ##### add your solution here for str3
'write a sparser'
g) Extract all words between '(' and ')' from the given input string as a list. Assume that the input will not contain any broken parentheses.
>>> ip = 'another (way) to reuse (portion) matched (by) capture groups'
>>> re.findall() ##### add your solution here
['way', 'portion', 'by']
h) Extract all occurrences of '<' up to the next occurrence of '>', provided there is at least one character in between '<' and '>'.
>>> ip = 'a<apple> 1<> b<bye> 2<> c<cat>'
>>> re.findall() ##### add your solution here
['<apple>', '<> b<bye>', '<> c<cat>']
i) Use re.findall() to get the output as shown below for the given input strings. Note the characters used in the input strings carefully.
>>> row1 = '-2,5 4,+3 +42,-53 4356246,-357532354 '
>>> row2 = '1.32,-3.14 634,5.63 63.3e3,9907809345343.235 '
>>> pat = re.compile() ##### add your solution here
>>> pat.findall(row1)
[('-2', '5'), ('4', '+3'), ('+42', '-53'), ('4356246', '-357532354')]
>>> pat.findall(row2)
[('1.32', '-3.14'), ('634', '5.63'), ('63.3e3', '9907809345343.235')]
j) This is an extension to the previous question.
- For row1, find the sum of integers of each tuple element. For example, the sum of -2 and 5 is 3.
- For row2, find the sum of floating-point numbers of each tuple element. For example, the sum of 1.32 and -3.14 is -1.82.
>>> row1 = '-2,5 4,+3 +42,-53 4356246,-357532354 '
>>> row2 = '1.32,-3.14 634,5.63 63.3e3,9907809345343.235 '
# should be the same as previous question
>>> pat = re.compile() ##### add your solution here
##### add your solution here for row1
[3, 7, -11, -353176108]
##### add your solution here for row2
[-1.82, 639.63, 9907809408643.234]
k) Use re.split() to get the output as shown below.
>>> ip = '42:no-output;1000:car-tr:u-ck;SQEX49801'
>>> re.split() ##### add your solution here
['42', 'output', '1000', 'tr:u-ck', 'SQEX49801']
l) For the given list of strings, change the elements into a tuple of original element and the number of times 't' occurs in that element.
>>> words = ['sequoia', 'attest', 'tattletale', 'asset']
##### add your solution here
[('sequoia', 0), ('attest', 3), ('tattletale', 4), ('asset', 1)]
m) The given input string has fields separated by ':'. Each field contains four uppercase letters followed optionally by two digits. Ignore the last field, which is empty. See docs.python: Match.groups and use re.finditer() to get the output as shown below. If the optional digits aren't present, show 'NA' instead of None.
>>> ip = 'TWXA42:JWPA:NTED01:'
##### add your solution here
[('TWXA', '42'), ('JWPA', 'NA'), ('NTED', '01')]
Note that this is different from re.findall(), which will just give an empty string instead of None when a capture group doesn't participate.
n) Convert the comma separated strings to corresponding dict objects as shown below.
>>> row1 = 'name:rohan,maths:75,phy:89,'
>>> row2 = 'name:rose,maths:88,phy:92,'
>>> pat = re.compile() ##### add your solution here
##### add your solution here for row1
{'name': 'rohan', 'maths': '75', 'phy': '89'}
##### add your solution here for row2
{'name': 'rose', 'maths': '88', 'phy': '92'}