Anchors
Now that you're familiar with RE syntax and a couple of re module functions, the next step is to learn about the special features of regular expressions. In this chapter, you'll learn about qualifying a pattern. Instead of matching anywhere in the input string, you can restrict where a match is allowed. For now, you'll see the restrictions that are already part of the re module. In later chapters, you'll learn how to define custom rules.
These restrictions are made possible by assigning special meaning to certain characters and escape sequences. The characters with special meaning are known as metacharacters in regular expressions parlance. To match those characters literally, you need to escape them with a \ character (discussed in the Escaping metacharacters chapter).
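As a quick illustration of the difference between a metacharacter and its escaped, literal form, here is a minimal sketch using ^ (an anchor covered later in this chapter). The sample string 'a^b' and the variable names are made up for this example.

```python
import re

sample = 'a^b'  # hypothetical input, chosen only for illustration

# unescaped ^ is a metacharacter: it matches the start of the string
as_anchor = re.sub(r'^', '*', sample)

# escaped \^ matches a literal ^ character
as_literal = re.sub(r'\^', '*', sample)

print(as_anchor)   # *a^b
print(as_literal)  # a*b
```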
String anchors
This restriction is about qualifying a RE to match only at the start or the end of an input string. These anchors provide functionality similar to the str methods startswith() and endswith(). First up is the escape sequence \A, which restricts the matching to the start of string.
# \A is placed as a prefix to the search term
>>> bool(re.search(r'\Acat', 'cater'))
True
>>> bool(re.search(r'\Acat', 'concatenation'))
False
>>> bool(re.search(r'\Ahi', 'hi hello\ntop spot'))
True
>>> bool(re.search(r'\Atop', 'hi hello\ntop spot'))
False
To restrict the matching to the end of string, \Z is used.
# \Z is placed as a suffix to the search term
>>> bool(re.search(r'are\Z', 'spare'))
True
>>> bool(re.search(r'are\Z', 'nearest'))
False
>>> words = ['surrender', 'unicorn', 'newer', 'door', 'erase', 'eel', 'pest']
>>> [w for w in words if re.search(r'er\Z', w)]
['surrender', 'newer']
>>> [w for w in words if re.search(r't\Z', w)]
['pest']
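To make the comparison with startswith() and endswith() concrete, the following sketch runs the same checks both ways. It reuses the words list from the example above; the variable names with_re and with_str are only for illustration.

```python
import re

words = ['surrender', 'unicorn', 'newer', 'door', 'erase', 'eel', 'pest']

# filter using the \Z anchor
with_re = [w for w in words if re.search(r'er\Z', w)]
# same filter using the str method
with_str = [w for w in words if w.endswith('er')]
print(with_re == with_str)  # True

# likewise for the start of string
starts_re = bool(re.search(r'\Acat', 'cater'))
starts_str = 'cater'.startswith('cat')
print(starts_re == starts_str)  # True
```

For fixed strings like these, the str methods are usually the simpler choice. The anchors become more useful once they are combined with other RE features.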
You can emulate string concatenation operations by using the anchors by themselves as a pattern.
# insert text at the start of a string
>>> re.sub(r'\A', 're', 'live')
'relive'
>>> re.sub(r'\A', 're', 'send')
'resend'
# appending text
>>> re.sub(r'\Z', 'er', 'cat')
'cater'
>>> re.sub(r'\Z', 'er', 'hack')
'hacker'
Use the optional start and end index arguments of the Pattern.search() method with caution. They are not equivalent to string slicing. For example, specifying a start index greater than 0 when using \A will never produce a match. This is because, as far as the search() method is concerned, only the search space has been narrowed; the start of string position that \A refers to hasn't changed. When slicing is used, you are creating an entirely new string object with new anchor positions.
>>> word_pat = re.compile(r'\Aat')
>>> bool(word_pat.search('cater', 1))
False
>>> bool(word_pat.search('cater'[1:]))
True
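If the goal is to test whether a pattern matches exactly at a given index without slicing, one possible alternative is the Pattern.match() method, which anchors the match at the start index instead of the start of string. This method is not covered in this chapter, so treat the sketch below as an aside; the variable names are made up for illustration.

```python
import re

at_pat = re.compile(r'at')

# match() with a start index succeeds only if 'at' begins exactly at that index
m_at_1 = at_pat.match('cater', 1)
m_at_2 = at_pat.match('cater', 2)

print(m_at_1.span())  # (1, 3)
print(m_at_2)         # None
```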
re.fullmatch()
Combining both the start and end string anchors, you can restrict the matching to the whole string. The effect is similar to comparing strings using the == operator.
>>> word_pat = re.compile(r'\Acat\Z')
>>> bool(word_pat.search('cat'))
True
>>> bool(word_pat.search('concatenation'))
False
You can also use the re.fullmatch() function to ensure the pattern matches only the whole input string and not just a part of the input. This may not seem useful with the features introduced so far, but when you have a complex RE pattern with multiple alternatives, this function is quite handy. The argument list is the same as that of the re.search() function.
re.fullmatch(pattern, string, flags=0)
>>> word_pat = re.compile(r'cat', flags=re.I)
>>> bool(word_pat.fullmatch('Cat'))
True
>>> bool(word_pat.fullmatch('Scatter'))
False
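The three ways of checking a whole string mentioned in this section can be compared side by side. The helper name whole_string_checks below is hypothetical and exists only for this sketch.

```python
import re

def whole_string_checks(s):
    """Return the results of three ways of testing whether s is exactly 'cat'."""
    return (
        s == 'cat',                      # plain string comparison
        bool(re.search(r'\Acat\Z', s)),  # both string anchors
        bool(re.fullmatch(r'cat', s)),   # no anchors needed in the pattern
    )

for s in ('cat', 'concatenation', 'cat\n'):
    print(repr(s), whole_string_checks(s))
# 'cat' (True, True, True)
# 'concatenation' (False, False, False)
# 'cat\n' (False, False, False)
```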
Line anchors
A string input may contain single or multiple lines. The newline character \n is considered as the line separator. There are two line anchors: the ^ metacharacter matches the start of a line and $ matches the end of a line. If there are no newline characters in the input string, these will behave exactly the same as \A and \Z respectively.
>>> pets = 'cat and dog'
>>> bool(re.search(r'^cat', pets))
True
>>> bool(re.search(r'^dog', pets))
False
>>> bool(re.search(r'dog$', pets))
True
>>> bool(re.search(r'^dog$', pets))
False
By default, the input string is considered as a single line, even if multiple newline characters are present. In such cases, the $ metacharacter can match both at the end of string and just before \n if it is the last character. However, \Z will always match only at the end of string, irrespective of the characters present.
>>> greeting = 'hi there\nhave a nice day\n'
>>> bool(re.search(r'day$', greeting))
True
>>> bool(re.search(r'day\n$', greeting))
True
>>> bool(re.search(r'day\Z', greeting))
False
>>> bool(re.search(r'day\n\Z', greeting))
True
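One way to see this difference is to list the positions where each anchor matches. The sketch below uses re.finditer(), which may not have been introduced yet at this point in the book, to collect the match positions; the variable names are only for illustration.

```python
import re

greeting = 'hi there\nhave a nice day\n'

# without re.M, $ matches just before the trailing \n and at the very end
dollar_positions = [m.start() for m in re.finditer(r'$', greeting)]
# \Z matches only at the very end
z_positions = [m.start() for m in re.finditer(r'\Z', greeting)]

print(len(greeting))     # 25
print(dollar_positions)  # [24, 25]
print(z_positions)       # [25]
```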
To indicate that the input string should be treated as multiple lines, you need to enable the re.MULTILINE flag (re.M for short).
# check if any line in the string starts with 'top'
>>> bool(re.search(r'^top', 'hi hello\ntop spot', flags=re.M))
True
# check if any line in the string ends with 'ar'
>>> bool(re.search(r'ar$', 'spare\npar\ndare', flags=re.M))
True
# filter all elements having lines ending with 'are'
>>> elements = ['spare\ntool', 'par\n', 'dare']
>>> [e for e in elements if re.search(r'are$', e, flags=re.M)]
['spare\ntool', 'dare']
# check if any whole line in the string is 'par'
>>> bool(re.search(r'^par$', 'spare\npar\ndare', flags=re.M))
True
Just like string anchors, you can use the line anchors by themselves as a pattern.
>>> ip_lines = 'catapults\nconcatenate\ncat'
>>> print(re.sub(r'^', '* ', ip_lines, flags=re.M))
* catapults
* concatenate
* cat
>>> print(re.sub(r'$', '.', ip_lines, flags=re.M))
catapults.
concatenate.
cat.
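Note that the re.M flag changes the behavior of ^ and $ only. The string anchors \A and \Z still refer to the start and end of the whole string. Here is a small sketch of that contrast, with variable names chosen for illustration:

```python
import re

text = 'hi hello\ntop spot'

# line anchors respond to the re.M flag
line_start = bool(re.search(r'^top', text, flags=re.M))
line_end = bool(re.search(r'hello$', text, flags=re.M))

# string anchors are unaffected by the flag
string_start = bool(re.search(r'\Atop', text, flags=re.M))
string_end = bool(re.search(r'hello\Z', text, flags=re.M))

print(line_start, line_end)      # True True
print(string_start, string_end)  # False False
```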
If you are dealing with Windows OS based text files, you may have to convert \r\n line endings to \n first. Python functions and methods make it easier to handle such situations. For example, you can specify which line ending to use for the open() function, the split() string method handles all whitespace by default, and so on. Or, you can handle \r as an optional character with quantifiers (see the Dot metacharacter and Quantifiers chapter for details).
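The following sketch shows why \r\n line endings can get in the way of the $ anchor, along with two simple ways to deal with them. The sample string is made up, and the str methods replace() and splitlines() are only one possible approach.

```python
import re

dos_text = 'spare\r\npar\r\ndare\r\n'  # hypothetical dos-style content

# $ matches just before \n, so the \r sits between 'are' and the line end
before = bool(re.search(r'are$', dos_text, flags=re.M))

# option 1: normalize the line endings first
unix_text = dos_text.replace('\r\n', '\n')
after = bool(re.search(r'are$', unix_text, flags=re.M))

# option 2: split into lines, splitlines() treats \r\n as a single line break
lines = dos_text.splitlines()

print(before, after)  # False True
print(lines)          # ['spare', 'par', 'dare']
```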
Word anchors
The third type of restriction is word anchors. Letters (irrespective of case), digits and the underscore character qualify as word characters. You might wonder why digits and underscores are included, and not just letters. This comes from variable and function naming conventions, where letters, digits and underscores are typically allowed. So, the definition is more oriented to programming languages than natural ones.
The escape sequence \b denotes a word boundary. This works for both start of word and end of word anchoring. Start of word means either the character prior to the word is a non-word character or there is no character (start of string). Similarly, end of word means the character after the word is a non-word character or there is no character (end of string). This implies that you cannot have a word boundary \b without a word character.
>>> words = 'par spar apparent spare part'
# replace 'par' irrespective of where it occurs
>>> re.sub(r'par', 'X', words)
'X sX apXent sXe Xt'
# replace 'par' only at the start of word
>>> re.sub(r'\bpar', 'X', words)
'X spar apparent spare Xt'
# replace 'par' only at the end of word
>>> re.sub(r'par\b', 'X', words)
'X sX apparent spare part'
# replace 'par' only if it is not part of another word
>>> re.sub(r'\bpar\b', 'X', words)
'X spar apparent spare part'
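To visualize where word boundaries occur, the sketch below lists the positions matched by \b in a made-up sample string. It also checks the claim that there is no word boundary without a word character. The re.finditer() function is used here only to collect positions.

```python
import re

sample = 'hi_42 bye!'  # hypothetical input with two words

# positions where a word character meets a non-word character (or string edge)
boundaries = [m.start() for m in re.finditer(r'\b', sample)]
print(boundaries)  # [0, 5, 6, 9]

# no word characters means no word boundary anywhere
no_word_chars = re.search(r'\b', '---')
print(no_word_chars)  # None
```

Positions 0 and 5 enclose hi_42, since the underscore and digits count as word characters. Positions 6 and 9 enclose bye. The end of the string is not a boundary here because the last character ! is not a word character.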
Using word boundary as a pattern by itself can yield creative solutions:
# space separated words to double quoted csv
# note the use of 'replace' string method for normal string replacement
# 'translate' method can also be used
>>> words = 'par spar apparent spare part'
>>> print(re.sub(r'\b', '"', words).replace(' ', ','))
"par","spar","apparent","spare","part"
>>> re.sub(r'\b', ' ', '-----hello-----')
'----- hello -----'
# make a programming statement more readable
# shown for illustration purpose only, won't work for all cases
>>> re.sub(r'\b', ' ', 'output=num1+35*42/num2')
' output = num1 + 35 * 42 / num2 '
# excess space at start/end of string can be stripped off
# later you'll learn how to add a qualifier so that strip is not needed
>>> re.sub(r'\b', ' ', 'output=num1+35*42/num2').strip()
'output = num1 + 35 * 42 / num2'
The word boundary has an opposite anchor too. \B matches wherever \b doesn't match. This duality will be seen with some other escape sequences too. Negative logic is handy in many text processing situations. But use it with care, since you might end up matching things you didn't intend!
>>> words = 'par spar apparent spare part'
# replace 'par' if it is not at the start of word
>>> re.sub(r'\Bpar', 'X', words)
'par sX apXent sXe part'
# replace 'par' at the end of word but not the whole word 'par'
>>> re.sub(r'\Bpar\b', 'X', words)
'par sX apparent spare part'
# replace 'par' if it is not at the end of word
>>> re.sub(r'par\B', 'X', words)
'par spar apXent sXe Xt'
# replace 'par' if it is surrounded by word characters
>>> re.sub(r'\Bpar\B', 'X', words)
'par spar apXent sXe part'
Here are some standalone uses of these patterns to compare and contrast the two word anchors.
>>> re.sub(r'\b', ':', 'copper')
':copper:'
>>> re.sub(r'\B', ':', 'copper')
'c:o:p:p:e:r'
>>> re.sub(r'\b', ' ', '-----hello-----')
'----- hello -----'
>>> re.sub(r'\B', ' ', '-----hello-----')
' - - - - -h e l l o- - - - - '
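Since \B matches wherever \b doesn't, every position in a string (including the one after the last character) is matched by exactly one of the two. The sketch below checks this for the last sample string; the set names are chosen for illustration.

```python
import re

s = '-----hello-----'

b_pos = {m.start() for m in re.finditer(r'\b', s)}
nb_pos = {m.start() for m in re.finditer(r'\B', s)}

print(sorted(b_pos))  # [5, 10]
# together they cover every position from 0 to len(s), with no overlap
print(b_pos | nb_pos == set(range(len(s) + 1)))  # True
print(b_pos & nb_pos == set())                   # True
```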
Cheatsheet and Summary
Note | Description |
---|---|
\A | restricts the match to the start of string |
\Z | restricts the match to the end of string |
re.fullmatch() | ensures pattern matches the entire input string: re.fullmatch(pattern, string, flags=0) |
\n | line separator, dos-style files may need special attention |
metacharacter | characters with special meaning in RE |
^ | restricts the match to the start of line |
$ | restricts the match to the end of line |
re.MULTILINE or re.M | flag to treat input as multiline string |
\b | restricts the match to the start and end of words; word characters: letters, digits, underscore |
\B | matches wherever \b doesn't match |
In this chapter, you've begun to see the building blocks of regular expressions and how they can be used in interesting ways. At the same time, regular expressions are just another tool in the land of text processing. Often, you'd get a simpler solution by combining regular expressions with other string methods and expressions. Practice, experience and imagination will help you construct creative solutions. In the coming chapters, you'll see examples of anchors in combination with other features.
Exercises
a) Check if the given strings start with be.
>>> line1 = 'be nice'
>>> line2 = '"best!"'
>>> line3 = 'better?'
>>> line4 = 'oh no\nbear spotted'
>>> pat = re.compile() ##### add your solution here
>>> bool(pat.search(line1))
True
>>> bool(pat.search(line2))
False
>>> bool(pat.search(line3))
True
>>> bool(pat.search(line4))
False
b) For the given input string, change only the whole word red to brown.
>>> words = 'bred red spread credible red.'
>>> re.sub() ##### add your solution here
'bred brown spread credible brown.'
c) For the given input list, filter all elements that contain 42 surrounded by word characters.
>>> words = ['hi42bye', 'nice1423', 'bad42', 'cool_42a', '42fake', '_42_']
>>> [w for w in words if re.search()] ##### add your solution here
['hi42bye', 'nice1423', 'cool_42a', '_42_']
d) For the given input list, filter all elements that start with den or end with ly.
>>> items = ['lovely', '1\ndentist', '2 lonely', 'eden', 'fly\n', 'dent']
>>> [e for e in items if ] ##### add your solution here
['lovely', '2 lonely', 'dent']
e) For the given input string, change whole word mall to 1234 only if it is at the start of a line.
>>> para = '''\
... (mall) call ball pall
... ball fall wall tall
... mall call ball pall
... wall mall ball fall
... mallet wallet malls
... mall:call:ball:pall'''
>>> print(re.sub()) ##### add your solution here
(mall) call ball pall
ball fall wall tall
1234 call ball pall
wall mall ball fall
mallet wallet malls
1234:call:ball:pall
f) For the given list, filter all elements having a line starting with den or ending with ly.
>>> items = ['lovely', '1\ndentist', '2 lonely', 'eden', 'fly\nfar', 'dent']
##### add your solution here
['lovely', '1\ndentist', '2 lonely', 'fly\nfar', 'dent']
g) For the given input list, filter all whole elements 12\nthree irrespective of case.
>>> items = ['12\nthree\n', '12\nThree', '12\nthree\n4', '12\nthree']
##### add your solution here
['12\nThree', '12\nthree']
h) For the given input list, replace hand with X for all elements that start with hand followed by at least one word character.
>>> items = ['handed', 'hand', 'handy', 'un-handed', 'handle', 'hand-2']
##### add your solution here
['Xed', 'hand', 'Xy', 'un-handed', 'Xle', 'hand-2']
i) For the given input list, filter all elements starting with h. Additionally, replace e with X for these filtered elements.
>>> items = ['handed', 'hand', 'handy', 'unhanded', 'handle', 'hand-2']
##### add your solution here
['handXd', 'hand', 'handy', 'handlX', 'hand-2']