Alternation and Grouping
Similar to logical OR, alternation in regular expressions allows you to combine multiple patterns. These patterns can have some common elements between them, in which case grouping helps to form terser expressions. This chapter will also discuss the precedence rules used to determine which alternation wins.
Alternation
A conditional expression combined with logical OR evaluates to True
if any of the condition is satisfied. Similarly, in regular expressions, you can use the |
metacharacter to combine multiple patterns to indicate logical OR. The matching will succeed if any of the alternate pattern is found in the input string. These alternatives have the full power of a regular expression, for example they can have their own independent anchors. Here are some examples.
# match either 'cat' or 'dog'
>>> pet = re.compile(r'cat|dog')
>>> bool(pet.search('I like cats'))
True
>>> bool(pet.search('I like dogs'))
True
>>> bool(pet.search('I like parrots'))
False
# replace 'cat' at the start of string or 'cat' at the end of word
>>> re.sub(r'\Acat|cat\b', 'X', 'catapults concatenate cat scat cater')
'Xapults concatenate X sX cater'
# replace 'cat' or 'dog' or 'fox' with 'mammal'
>>> re.sub(r'cat|dog|fox', 'mammal', 'cat dog bee parrot fox')
'mammal mammal bee parrot mammal'
You might infer from the above examples that there can be cases where many alternations are required. The join
string method can be used to build the alternation list automatically from an iterable of strings.
>>> '|'.join(['car', 'jeep'])
'car|jeep'
>>> words = ['cat', 'dog', 'fox']
>>> '|'.join(words)
'cat|dog|fox'
>>> re.sub('|'.join(words), 'mammal', 'cat dog bee parrot fox')
'mammal mammal bee parrot mammal'
In the above examples, the elements do not contain any special regular expression characters. Handling strings that contain metacharacters will be discussed in the re.escape() section.
If you have thousands of search terms to be matched, using specialized libraries like flashtext is highly recommended instead of regular expressions.
Grouping
Often, there are some common portions among the alternatives. It could be common characters, qualifiers like the anchors and so on. In such cases, you can group them using a pair of parentheses metacharacters. Similar to a(b+c)d = abd+acd
in maths, you get a(b|c)d = abd|acd
in regular expressions.
# without grouping
>>> re.sub(r'reform|rest', 'X', 'red reform read arrest')
'red X read arX'
# with grouping
>>> re.sub(r're(form|st)', 'X', 'red reform read arrest')
'red X read arX'
# without grouping
>>> re.sub(r'\bpar\b|\bpart\b', 'X', 'par spare part party')
'X spare X party'
# taking out common anchors
>>> re.sub(r'\b(par|part)\b', 'X', 'par spare part party')
'X spare X party'
# taking out common characters as well
# you'll later learn a better technique instead of using empty alternates
>>> re.sub(r'\bpar(|t)\b', 'X', 'par spare part party')
'X spare X party'
There are many more uses for grouping than just forming a terser RE. They will be discussed as they become relevant in the coming chapters.
For now, this is a good place to show how to incorporate normal strings (from a variable, expression result, etc) while building a regular expression. For example, adding anchors to an alternation list created using the join
method.
>>> words = ['cat', 'par']
>>> '|'.join(words)
'cat|par'
# without word boundaries, any matching portion will be replaced
>>> re.sub('|'.join(words), 'X', 'cater cat concatenate par spare')
'Xer X conXenate X sXe'
# you can also use: alt = re.compile(r'\b(' + '|'.join(words) + r')\b')
>>> alt = re.compile(rf"\b({'|'.join(words)})\b")
# only whole words will be replaced now
>>> alt.sub('X', 'cater cat concatenate par spare')
'cater X concatenate X spare'
# this is how the above RE looks as a normal string
>>> alt.pattern
'\\b(cat|par)\\b'
>>> alt.pattern == r'\b(cat|par)\b'
True
In the above example, you had to concatenate strings to add word boundaries. If you needed to add string anchors so that the pattern only matches whole string, you can use re.fullmatch()
instead of manually adding the anchors.
>>> terms = ['no', 'ten', 'it']
>>> items = ['dip', 'nobody', 'it', 'oh', 'no', 'bitten']
>>> pat = re.compile('|'.join(terms))
# matching only whole elements
>>> [w for w in items if(pat.fullmatch(w))]
['it', 'no']
# matching anywhere
>>> [w for w in items if(pat.search(w))]
['nobody', 'it', 'no', 'bitten']
Precedence rules
There are tricky situations when using alternation. There is no ambiguity if it is used to get a boolean result by testing a match against a string input. However, for cases like string replacement, it depends on a few factors. Say, you want to replace either are
or spared
— which one should get precedence? The bigger word spared
or the substring are
inside it or based on something else?
In Python, the alternative which matches earliest in the input string gets precedence. re.Match
output is handy to illustrate this concept.
>>> words = 'lion elephant are rope not'
# span shows the start and end+1 index of the matched portion
# match shows the text that satisfied the search criteria
>>> re.search(r'on', words)
<re.Match object; span=(2, 4), match='on'>
>>> re.search(r'ant', words)
<re.Match object; span=(10, 13), match='ant'>
# starting index of 'on' < index of 'ant' for the given string input
# so 'on' will be replaced irrespective of order
# count optional argument here restricts no. of replacements to 1
>>> re.sub(r'on|ant', 'X', words, count=1)
'liX elephant are rope not'
>>> re.sub(r'ant|on', 'X', words, count=1)
'liX elephant are rope not'
What happens if alternatives have the same starting index? The precedence is left-to-right in the order of declaration.
>>> mood = 'best years'
>>> re.search(r'year', mood)
<re.Match object; span=(5, 9), match='year'>
>>> re.search(r'years', mood)
<re.Match object; span=(5, 10), match='years'>
# starting index for 'year' and 'years' will always be the same
# so, which one gets replaced depends on the order of alternation
>>> re.sub(r'year|years', 'X', mood, count=1)
'best Xs'
>>> re.sub(r'years|year', 'X', mood, count=1)
'best X'
Another example (without count
restriction) to drive home the issue:
>>> words = 'ear xerox at mare part learn eye'
# this is going to be same as: r'ar'
>>> re.sub(r'ar|are|art', 'X', words)
'eX xerox at mXe pXt leXn eye'
# this is going to be same as: r'are|ar'
>>> re.sub(r'are|ar|art', 'X', words)
'eX xerox at mX pXt leXn eye'
# phew, finally this one works as needed
>>> re.sub(r'are|art|ar', 'X', words)
'eX xerox at mX pX leXn eye'
If you do not want substrings to sabotage your replacements, a robust workaround is to sort the alternations based on length, longest first.
>>> words = ['hand', 'handy', 'handful']
>>> alt = re.compile('|'.join(sorted(words, key=len, reverse=True)))
>>> alt.pattern
'handful|handy|hand'
>>> alt.sub('X', 'hands handful handed handy')
'Xs X Xed X'
# alternation order will come into play if you don't sort them properly
>>> re.sub('|'.join(words), 'X', 'hands handful handed handy')
'Xs Xful Xed Xy'
See also regular-expressions: alternation for more information regarding alternation and precedence rules in various regular expression implementations.
Cheatsheet and Summary
Note | Description |
---|---|
| | multiple RE combined as conditional OR |
each alternative can have independent anchors | |
'|'.join(iterable) | programmatically combine multiple RE |
() | group pattern(s) |
a(b|c)d | same as abd|acd |
Alternation precedence | pattern which matches earliest in the input gets precedence |
tie-breaker is left-to-right if patterns have the same starting location | |
robust solution: sort the alternations based on length, longest first | |
'|'.join(sorted(iterable, key=len, reverse=True)) |
So, this chapter was about specifying one or more alternate matches within the same RE using the |
metacharacter. Which can further be simplified using ()
grouping if the alternations have common portions. Among the alternations, earliest matching pattern gets precedence. Left-to-right ordering is used as a tie-breaker if multiple alternations have the same starting location. You also learnt ways to programmatically construct a RE.
Exercises
a) For the given list, filter all elements that start with den
or end with ly
.
>>> items = ['lovely', '1\ndentist', '2 lonely', 'eden', 'fly\n', 'dent']
##### add your solution here
['lovely', '2 lonely', 'dent']
b) For the given list, filter all elements having a line starting with den
or ending with ly
.
>>> items = ['lovely', '1\ndentist', '2 lonely', 'eden', 'fly\nfar', 'dent']
##### add your solution here
['lovely', '1\ndentist', '2 lonely', 'fly\nfar', 'dent']
c) For the given strings, replace all occurrences of removed
or reed
or received
or refused
with X
.
>>> s1 = 'creed refuse removed read'
>>> s2 = 'refused reed redo received'
>>> pat = re.compile() ##### add your solution here
>>> pat.sub('X', s1)
'cX refuse X read'
>>> pat.sub('X', s2)
'X X redo X'
d) For the given strings, replace all matches from the list words
with A
.
>>> s1 = 'plate full of slate'
>>> s2 = "slated for later, don't be late"
>>> words = ['late', 'later', 'slated']
>>> pat = re.compile() ##### add your solution here
>>> pat.sub('A', s1)
'pA full of sA'
>>> pat.sub('A', s2)
"A for A, don't be A"
e) Filter all whole elements from the input list items
based on elements listed in words
.
>>> items = ['slate', 'later', 'plate', 'late', 'slates', 'slated ']
>>> words = ['late', 'later', 'slated']
>>> pat = re.compile() ##### add your solution here
##### add your solution here
['later', 'late']