Dot metacharacter and Quantifiers

This chapter introduces the dot metacharacter and the metacharacters related to quantifiers. As the name implies, quantifiers allow you to specify how many times a character or grouping should be matched. With the * string operator, you can do something like 'no' * 5 to get 'nonononono'. This saves you from manual repetition and gives you the ability to programmatically repeat a string object as many times as you need. Quantifiers support this simple repetition as well as ways to specify a range of repetition. This range has the flexibility of being bounded or unbounded with respect to the start and end values. Combined with the dot metacharacter (and alternation if needed), quantifiers allow you to construct conditional AND logic between patterns.
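As a rough sketch of that parallel (assuming the re module is already imported, and using the grouping and {n} quantifier forms covered later in this chapter):

# string repetition operator
>>> 'no' * 5
'nonononono'
# regex equivalent: match 'no' exactly 5 times
>>> bool(re.search(r'(no){5}', 'nonononono'))
True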

Dot metacharacter

The dot metacharacter serves as a placeholder to match any character except the newline character.

# matches character 'c', any character and then character 't'
>>> re.sub(r'c.t', 'X', 'tac tin cat abc;tuv acute')
'taXin X abXuv aXe'

# matches character 'r', any two characters and then character 'd'
>>> re.sub(r'r..d', 'X', 'breadth markedly reported overrides')
'bXth maXly repoX oveXes'

# matches character '2', any character and then character '3'
>>> re.sub(r'2.3', '8', '42\t35')
'485'

# by default, dot metacharacter doesn't match the newline character
>>> bool(re.search(r'a.b', 'a\nb'))
False

See the re.DOTALL section to know how the . metacharacter can be made to match the newline character as well. The Character class chapter will discuss how to define your own custom placeholder for a limited set of characters.
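As a minimal preview of that section, passing the re.DOTALL flag changes this default:

# with re.DOTALL, the dot metacharacter matches the newline character too
>>> bool(re.search(r'a.b', 'a\nb', flags=re.DOTALL))
True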

warning Some characters, like the g̈ in the example below, have more than one codepoint (the numerical value of a character). You'll need to use as many . metacharacters as there are codepoints to match such characters. Or, you can use the regex module to handle such cases (see the \X vs dot metacharacter section for more details).

>>> re.sub(r'a.e', 'o', 'cag̈ed')
'cag̈ed'
>>> re.sub(r'a..e', 'o', 'cag̈ed')
'cod'

re.split()

This chapter will additionally use the re.split() function to illustrate examples. For normal strings, you'd use the str.split() method. For regular expressions, use the re.split() function, whose argument list is shown below.

re.split(pattern, string, maxsplit=0, flags=0)

The first argument is the RE pattern to be used for splitting the input string, which is the second argument. maxsplit and flags are optional arguments. Output is a list of strings.

# same as: 'apple-85-mango-70'.split('-')
>>> re.split(r'-', 'apple-85-mango-70')
['apple', '85', 'mango', '70']

# maxsplit determines the maximum number of times to split the input
>>> re.split(r'-', 'apple-85-mango-70', maxsplit=1)
['apple', '85-mango-70']

# example with dot metacharacter
>>> re.split(r':.:', 'bus:3:car:-:van')
['bus', 'car', 'van']

See the re.split() with capture groups section for details of how capture groups affect the output of re.split().
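As a minimal preview of that behavior, a capture group in the pattern makes re.split() retain the matched separators in the output:

# the portion matched by the capture group is also part of the output
>>> re.split(r'(-)', 'apple-85-mango-70')
['apple', '-', '85', '-', 'mango', '-', '70']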

Greedy quantifiers

Quantifiers have functionality like the string repetition operator and the range() function. They can be applied to characters and groupings (and more, as you'll see in later chapters). Apart from the ability to specify an exact quantity and a bounded range, these can also match unbounded varying quantities. If the input string can satisfy a pattern with varying quantities in multiple ways, you can choose among three types of quantifiers to narrow down the possibilities. In this section, the greedy type of quantifiers is covered.

First up is the ? metacharacter, which quantifies a character or group to match 0 or 1 times. In other words, it makes that character or group optional. This leads to a terser RE compared to alternation and grouping.

# same as: r'ear|ar'
>>> re.sub(r'e?ar', 'X', 'far feat flare fear')
'fX feat flXe fX'

# same as: r'\bpar(t|)\b'
>>> re.sub(r'\bpart?\b', 'X', 'par spare part party')
'X spare X party'

# same as: r'\b(re.d|red)\b'
>>> words = ['red', 'read', 'ready', 're;d', 'road', 'redo', 'reed', 'rod']
>>> [w for w in words if re.search(r'\bre.?d\b', w)]
['red', 'read', 're;d', 'reed']

# same as: r'part|parrot'
>>> re.sub(r'par(ro)?t', 'X', 'par part parrot parent')
'par X X parent'
# same as: r'part|parent|parrot'
>>> re.sub(r'par(en|ro)?t', 'X', 'par part parrot parent')
'par X X X'

The * metacharacter quantifies a character or group to match 0 or more times. There is no upper bound.

# match 't' followed by zero or more of 'a' followed by 'r'
>>> re.sub(r'ta*r', 'X', 'tr tear tare steer sitaara')
'X tear Xe steer siXa'
# match 't' followed by zero or more of 'e' or 'a' followed by 'r'
>>> re.sub(r't(e|a)*r', 'X', 'tr tear tare steer sitaara')
'X X Xe sX siXa'
# match zero or more of '1' followed by '2'
>>> re.sub(r'1*2', 'X', '3111111111125111142')
'3X511114X'

Here are some more examples with the re.split() function.

# last element is empty because there is nothing after 2 at the end of string
>>> re.split(r'1*2', '3111111111125111142')
['3', '511114', '']
# later, you'll see how maxsplit helps to get behavior like str.partition
>>> re.split(r'1*2', '3111111111125111142', maxsplit=1)
['3', '5111142']

# empty string matches at the start and end of string
# it matches between every character
# and, there is an empty match after the split at u
>>> re.split(r'u*', 'cloudy')
['', 'c', 'l', 'o', '', 'd', 'y', '']

The + metacharacter quantifies a character or group to match 1 or more times. Similar to the * quantifier, there is no upper bound. More importantly, this doesn't have surprises like matching the empty string in between characters or at the start/end of the string.

>>> re.sub(r'ta+r', 'X', 'tr tear tare steer sitaara')
'tr tear Xe steer siXa'
>>> re.sub(r't(e|a)+r', 'X', 'tr tear tare steer sitaara')
'tr X Xe sX siXa'

>>> re.sub(r'1+2', 'X', '3111111111125111142')
'3X5111142'

>>> re.split(r'1+', '3111111111125111142')
['3', '25', '42']
>>> re.split(r'u+', 'cloudy')
['clo', 'dy']

You can specify a range of integer numbers, both bounded and unbounded, using the {} metacharacters. There are four ways to use this quantifier as shown below:

Pattern   Description
{m,n}     match m to n times
{m,}      match at least m times
{,n}      match up to n times (including 0 times)
{n}       match exactly n times

>>> repeats = ['abc', 'ac', 'adc', 'abbc', 'xabbbcz', 'bbb', 'bc', 'abbbbbc']

>>> [w for w in repeats if re.search(r'ab{1,4}c', w)]
['abc', 'abbc', 'xabbbcz']
>>> [w for w in repeats if re.search(r'ab{3,}c', w)]
['xabbbcz', 'abbbbbc']
>>> [w for w in repeats if re.search(r'ab{,2}c', w)]
['abc', 'ac', 'abbc']
>>> [w for w in repeats if re.search(r'ab{3}c', w)]
['xabbbcz']

info The {} metacharacters have to be escaped to match them literally. However, unlike the () metacharacters, these have more leeway. For example, escaping { alone is enough, or if it doesn't conform strictly to any of the four forms listed above, escaping is not needed at all.

>>> re.sub(r'a\{5}', 'a{6}', 'a{5} = 10')
'a{6} = 10'

>>> re.sub(r'_{a,b}', '-{c,d}', 'report_{a,b}.txt')
'report-{c,d}.txt'

Conditional AND

Next up, how to construct conditional AND using the dot metacharacter and quantifiers.

# match 'Error' followed by zero or more characters followed by 'valid'
>>> bool(re.search(r'Error.*valid', 'Error: not a valid input'))
True

>>> bool(re.search(r'Error.*valid', 'Error: key not found'))
False

To allow matching in any order, you'll have to bring in alternation as well. That is somewhat manageable for 2 or 3 patterns. See the Conditional AND with lookarounds section for an easier approach.

>>> s1 = 'cat and dog and parrot'
>>> s2 = 'dog and cat and parrot'
>>> pat = re.compile(r'cat.*dog|dog.*cat')

>>> pat.sub('X', s1)
'X and parrot'
>>> pat.sub('X', s2)
'X and parrot'

If you just need a boolean result, the all() function would be scalable and easier to use.

>>> s1 = 'cat and dog and parrot'
>>> s2 = 'dog and cat and parrot'
>>> patterns = (r'cat', r'dog')

>>> all(re.search(p, s1) for p in patterns)
True
>>> all(re.search(p, s2) for p in patterns)
True

What does greedy mean?

When you use the ? quantifier, how does Python decide to match 0 or 1 times, if both quantities can satisfy the RE? For example, consider the expression re.sub(r'f.?o', 'X', 'foot') — should foo be replaced or fo? It will always replace foo, because these are greedy quantifiers, i.e. they try to match as much as possible.

>>> re.sub(r'f.?o', 'X', 'foot')
'Xt'

# a more practical example
# prefix '<' with '\' if it is not already prefixed
# both '<' and '\<' will get replaced with '\<'
# note the use of raw string for all the three arguments
>>> print(re.sub(r'\\?<', r'\<', r'table \< fig < bat \< box < cake'))
table \< fig \< bat \< box \< cake

# say goodbye to r'handful|handy|hand' shenanigans
>>> re.sub(r'hand(y|ful)?', 'X', 'hand handy handful')
'X X X'

But wait, how did the r'Error.*valid' example work? Shouldn't .* consume all the characters after Error? Good question. The regular expression engine actually does consume all the characters. Then, realizing that the match failed, it gives back one character from the end of the string and checks again whether the overall RE is satisfied. This process is repeated until a match is found or failure is confirmed. This is known as backtracking.

>>> sentence = 'that is quite a fabricated tale'

# r't.*a' will always match from the first 't' to the last 'a'
# which implies that there cannot be more than one match for such patterns
>>> re.sub(r't.*a', 'X', sentence)
'Xle'
>>> re.sub(r't.*a', 'X', 'star')
'sXr'

# matching first 't' to last 'a' for 't.*a' won't work for these cases
# so, the engine backtracks until the overall RE can be matched
>>> re.sub(r't.*a.*q.*f', 'X', sentence)
'Xabricated tale'
>>> re.sub(r't.*a.*u', 'X', sentence)
'Xite a fabricated tale'

Non-greedy quantifiers

As the name implies, these quantifiers will try to match as minimally as possible. They are also known as lazy or reluctant quantifiers. Appending a ? to a greedy quantifier makes it non-greedy.

>>> re.sub(r'f.??o', 'X', 'foot')
'Xot'
>>> re.sub(r'f.??o', 'X', 'frost')
'Xst'

>>> re.sub(r'.{2,5}?', 'X', '123456789', count=1)
'X3456789'

Like greedy quantifiers, lazy quantifiers will try to satisfy the overall RE. For example, .*? will first start with an empty match and then move forward one character at a time until a match is found.

# r':.*:' will match from the first ':' to the last ':'
>>> re.split(r':.*:', 'green:3.14:teal::brown:oh!:blue')
['green', 'blue']

# r':.*?:' will match from ':' to the very next ':'
>>> re.split(r':.*?:', 'green:3.14:teal::brown:oh!:blue')
['green', 'teal', 'brown', 'blue']

Possessive quantifiers

Before Python 3.11, you had to use alternatives like the third-party regex module for possessive quantifiers. The difference between greedy and possessive quantifiers is that possessive quantifiers will not backtrack to find a match. In other words, possessive quantifiers will always consume every character that matches the pattern they are applied to. Syntax-wise, you append + to a greedy quantifier to make it possessive, similar to adding ? for the non-greedy case.

Unlike greedy and non-greedy quantifiers, a pattern like :.*+apple will never result in a match because .*+ will consume the rest of the string, leaving no way to match apple.

>>> ip = 'fig:mango:pineapple:guava:apples:orange'

>>> re.sub(r':.*+', 'X', ip)
'figX'

>>> bool(re.search(r':.*+apple', ip))
False

Here's a more practical example. Suppose you want to match integer numbers greater than or equal to 100, where these numbers can optionally have leading zeros. This illustration uses features yet to be introduced: [1-9] matches any digit from 1 to 9 and \d matches any digit from 0 to 9. See the Character class chapter for more details and the Escape sequence sets section for another practical example.

>>> numbers = '42 314 001 12 00984'

# this solution fails because 0* and \d{3,} can both match leading zeros
# and greedy quantifiers will give up characters to help the overall RE succeed
>>> re.findall(r'0*\d{3,}', numbers)
['314', '001', '00984']

# here 0*+ will never give back leading zeros
>>> re.findall(r'0*+\d{3,}', numbers)
['314', '00984']

# workaround if you can only use greedy quantifiers
>>> re.findall(r'0*[1-9]\d{2,}', numbers)
['314', '00984']

See the Atomic grouping section for another way to safeguard a pattern from backtracking.
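As a minimal preview of that section (atomic grouping also requires Python 3.11+), wrapping 0* in an atomic group gives the same safeguard for this example:

# (?>0*) will not give back the leading zeros it consumed
>>> re.findall(r'(?>0*)\d{3,}', numbers)
['314', '00984']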

Catastrophic Backtracking

Backtracking can become significantly time consuming for certain corner cases, which is why some regular expression engines do not use backtracking at all, at the cost of not supporting some features like lookarounds. If your application accepts user-defined REs, you might need to protect against such catastrophic patterns. From Wikipedia: ReDoS:

A regular expression denial of service (ReDoS) is an algorithmic complexity attack that produces a denial-of-service by providing a regular expression and/or an input that takes a long time to evaluate. The attack exploits the fact that many regular expression implementations have super-linear worst-case complexity; on certain regex-input pairs, the time taken can grow polynomially or exponentially in relation to the input size. An attacker can thus cause a program to spend substantial time by providing a specially crafted regular expression and/or input. The program will then slow down or become unresponsive.

Here's an example. \w matches a word character once (see the Character class chapter for more details).

>>> from timeit import timeit

>>> greedy = re.compile(r'(a+|\w+)*:')
>>> possessive = re.compile(r'(a+|\w+)*+:')

# string that'll match the above patterns
>>> s1 = 'aaaaaaaaaaaaaaaa:123'
# string that does NOT match the above patterns
>>> s2 = 'aaaaaaaaaaaaaaaa-123'

# no issues when input string has a match
>>> timeit('greedy.search(s1)', number=10000, globals=globals())
0.016464739997900324
>>> timeit('possessive.search(s1)', number=10000, globals=globals())
0.016358205997676123

# if input doesn't match, greedy version suffers from catastrophic backtracking
# note that 'number' parameter is reduced to 10 since it takes a long time
>>> timeit('greedy.search(s2)', number=10, globals=globals())
53.71723825200024
>>> timeit('possessive.search(s2)', number=10, globals=globals())
0.00019008600065717474
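The greedy pattern above is more complicated than it needs to be, as discussed next. Here's a quick check (without timings) that a plain \w*: pattern gives the same match results for these strings:

>>> bool(re.search(r'\w*:', s1))
True
>>> bool(re.search(r'\w*:', s2))
False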

(a+|\w+)*: is a silly pattern, since it can be rewritten as \w*: which will not suffer from catastrophic backtracking. But this example shows how a quantifier applied to a group whose alternatives themselves use quantifiers can lead to explosive results. More such patterns and mitigation strategies can be found in the following links:

Cheatsheet and Summary

Note                      Description
.                         match any character except the newline character
greedy                    match as much as possible
?                         greedy quantifier, match 0 or 1 times
*                         greedy quantifier, match 0 or more times
+                         greedy quantifier, match 1 or more times
{m,n}                     greedy quantifier, match m to n times
{m,}                      greedy quantifier, match at least m times
{,n}                      greedy quantifier, match up to n times (including 0 times)
{n}                       greedy quantifier, match exactly n times
pat1.*pat2                any number of characters between pat1 and pat2
pat1.*pat2|pat2.*pat1     match both pat1 and pat2 in any order
non-greedy                append ? to greedy quantifiers,
                          match as minimally as possible
possessive                append + to greedy quantifiers,
                          like greedy, but no backtracking
re.split()                split a string using regular expressions;
                          re.split(pattern, string, maxsplit=0, flags=0),
                          maxsplit and flags are optional arguments

This chapter introduced the concept of specifying a placeholder instead of a fixed string. Combined with quantifiers, you've seen a glimpse of how a simple RE can match a wide range of text. In the coming chapters, you'll learn how to create your own restricted set of placeholder characters.

Exercises

info Since the . metacharacter doesn't match the newline character by default, assume that the input strings in the following exercises will not contain newline characters.

a) Replace 42//5 or 42/5 with 8 for the given input.

>>> ip = 'a+42//5-c pressure*3+42/5-14256'

>>> re.sub()      ##### add your solution here
'a+8-c pressure*3+8-14256'

b) For the list items, filter all elements starting with hand and ending immediately with at most one more character or le.

>>> items = ['handed', 'hand', 'handled', 'handy', 'unhand', 'hands', 'handle']

##### add your solution here
['hand', 'handy', 'hands', 'handle']

c) Use re.split() to get the output as shown for the given input strings.

>>> eqn1 = 'a+42//5-c'
>>> eqn2 = 'pressure*3+42/5-14256'
>>> eqn3 = 'r*42-5/3+42///5-42/53+a'

##### add your solution here for eqn1
['a+', '-c']
##### add your solution here for eqn2
['pressure*3+', '-14256']
##### add your solution here for eqn3
['r*42-5/3+42///5-', '3+a']

d) For the given input strings, remove everything from the first occurrence of i till the end of the string.

>>> s1 = 'remove the special meaning of such constructs'
>>> s2 = 'characters while constructing'
>>> s3 = 'input output'

>>> pat = re.compile()        ##### add your solution here

>>> pat.sub('', s1)
'remove the spec'
>>> pat.sub('', s2)
'characters wh'
>>> pat.sub('', s3)
''

e) For the given strings, construct a RE to get the output as shown below.

>>> str1 = 'a+b(addition)'
>>> str2 = 'a/b(division) + c%d(#modulo)'
>>> str3 = 'Hi there(greeting). Nice day(a(b)'

>>> remove_parentheses = re.compile()     ##### add your solution here

>>> remove_parentheses.sub('', str1)
'a+b'
>>> remove_parentheses.sub('', str2)
'a/b + c%d'
>>> remove_parentheses.sub('', str3)
'Hi there. Nice day'

f) Correct the given RE to get the expected output.

>>> words = 'plink incoming tint winter in caution sentient'
>>> change = re.compile(r'int|in|ion|ing|inco|inter|ink')

# wrong output
>>> change.sub('X', words)
'plXk XcomXg tX wXer X cautX sentient'

# expected output
>>> change = re.compile()       ##### add your solution here
>>> change.sub('X', words)
'plX XmX tX wX X cautX sentient'

g) For the given greedy quantifiers, what would be the equivalent form using the {m,n} representation?

  • ? is same as
  • * is same as
  • + is same as

h) (a*|b*) is same as (a|b)* — True or False?

i) For the given input strings, remove everything from the first occurrence of test (irrespective of case) till the end of the string, provided test isn't at the end of the string.

>>> s1 = 'this is a Test'
>>> s2 = 'always test your RE for corner cases'
>>> s3 = 'a TEST of skill tests?'

>>> pat = re.compile()      ##### add your solution here

>>> pat.sub('', s1)
'this is a Test'
>>> pat.sub('', s2)
'always '
>>> pat.sub('', s3)
'a '

j) For the input list words, filter all elements starting with s and containing e and t in any order.

>>> words = ['sequoia', 'subtle', 'exhibit', 'a set', 'sets', 'tests', 'site']

##### add your solution here
['subtle', 'sets', 'site']

k) For the input list words, remove all elements having less than 6 characters.

>>> words = ['sequoia', 'subtle', 'exhibit', 'asset', 'sets', 'tests', 'site']

##### add your solution here
['sequoia', 'subtle', 'exhibit']

l) For the input list words, filter all elements starting with s or t and having a maximum of 6 characters.

>>> words = ['sequoia', 'subtle', 'exhibit', 'asset', 'sets', 't set', 'site']

##### add your solution here
['subtle', 'sets', 't set', 'site']

m) Can you reason out why this code results in the output shown? The aim was to remove all <characters> patterns but not the <> ones. The expected result was 'a 1<> b 2<> c'.

>>> ip = 'a<apple> 1<> b<bye> 2<> c<cat>'

>>> re.sub(r'<.+?>', '', ip)
'a 1 2'

n) Use re.split() to get the output as shown below for given input strings.

>>> s1 = 'go there  //   "this // that"'
>>> s2 = 'a//b // c//d e//f // 4//5'
>>> s3 = '42// hi//bye//see // carefully'

>>> pat = re.compile()      ##### add your solution here

>>> pat.split()     ##### add your solution here for s1
['go there', '"this // that"']
>>> pat.split()     ##### add your solution here for s2
['a//b', 'c//d e//f // 4//5']
>>> pat.split()     ##### add your solution here for s3
['42// hi//bye//see', 'carefully']

o) Modify the given regular expression such that it gives the expected results.

>>> s1 = 'appleabcabcabcapricot'
>>> s2 = 'bananabcabcabcdelicious'

# wrong output
>>> pat = re.compile(r'(abc)+a')
>>> bool(pat.search(s1))
True
>>> bool(pat.search(s2))
True

# expected output
# 'abc' shouldn't be considered when trying to match 'a' at the end
>>> pat = re.compile()      ##### add your solution here
>>> bool(pat.search(s1))
True
>>> bool(pat.search(s2))
False

p) Modify the given regular expression such that it gives the expected result.

>>> cast = 'dragon-unicorn--centaur---mage----healer'
>>> c = '-'

# wrong output
>>> re.sub(rf'{c}{3,}', c, cast)
'dragon-unicorn--centaur---mage----healer'

# expected output
>>> re.sub(rf'', c, cast)   ##### add your solution here
'dragon-unicorn--centaur-mage-healer'