Anchors

Now that you're familiar with RE syntax and a couple of re module functions, the next step is to learn about the special features of regular expressions. In this chapter, you'll learn about qualifying a pattern. Instead of matching anywhere in the given input string, restrictions can be specified. For now, you'll see the ones that are already part of the re module. In later chapters, you'll learn how to define your own rules for restriction.

These restrictions are made possible by assigning special meaning to certain characters and escape sequences. The characters with special meaning are known as metacharacters in regular expression parlance. In case you need to match those characters literally, you need to escape them with a \ character (discussed in the Escaping metacharacters chapter).

String anchors

This restriction is about qualifying a RE to match only at the start or the end of an input string. These provide functionality similar to the str methods startswith and endswith. First up, the escape sequence \A which restricts the matching to the start of string.

>>> import re

# \A is placed as a prefix to the search term
>>> bool(re.search(r'\Acat', 'cater'))
True
>>> bool(re.search(r'\Acat', 'concatenation'))
False

>>> bool(re.search(r'\Ahi', 'hi hello\ntop spot'))
True
>>> bool(re.search(r'\Atop', 'hi hello\ntop spot'))
False

To restrict the matching to the end of string, \Z is used.

# \Z is placed as a suffix to the search term
>>> bool(re.search(r'are\Z', 'spare'))
True
>>> bool(re.search(r'are\Z', 'nearest'))
False

>>> words = ['surrender', 'unicorn', 'newer', 'door', 'empty', 'eel', 'pest']
>>> [w for w in words if re.search(r'er\Z', w)]
['surrender', 'newer']
>>> [w for w in words if re.search(r't\Z', w)]
['pest']
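As a rough comparison with the str methods mentioned earlier, the sketch below runs the same filter with the \Z anchor and with endswith, and likewise compares \A with startswith. The variable names with_regex and with_method are only illustrative.

```python
import re

words = ['surrender', 'unicorn', 'newer', 'door', 'empty', 'eel', 'pest']

# regex version, using the \Z string anchor
with_regex = [w for w in words if re.search(r'er\Z', w)]

# plain string method version
with_method = [w for w in words if w.endswith('er')]

print(with_regex)    # ['surrender', 'newer']
print(with_method)   # ['surrender', 'newer']

# same idea for the start of string
starts_regex = bool(re.search(r'\Acat', 'cater'))
starts_method = 'cater'.startswith('cat')
print(starts_regex, starts_method)   # True True
```

For fixed strings like these, the str methods are usually the simpler choice. The anchors become more useful once the pattern itself uses other RE features.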

You can emulate string concatenation operations by using the anchors by themselves as a pattern.

# insert text at the start of a string
>>> re.sub(r'\A', 're', 'live')
'relive'
>>> re.sub(r'\A', 're', 'send')
'resend'

# appending text
>>> re.sub(r'\Z', 'er', 'cat')
'cater'
>>> re.sub(r'\Z', 'er', 'hack')
'hacker'

Warning: Use the optional start and end index arguments of the Pattern.search method with caution. They are not equivalent to string slicing. For example, a pattern that uses \A will never match when the start index is greater than 0. This is because, as far as the search method is concerned, only the search space is narrowed and the anchor positions haven't changed. When slicing is used, you are creating an entirely new string object with new anchor positions.

>>> word_pat = re.compile(r'\Aat')

>>> bool(word_pat.search('cater', 1))
False
>>> bool(word_pat.search('cater'[1:]))
True
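If the intent is to anchor a match at a given index without slicing, one possible alternative is the Pattern.match method, which attempts the match exactly at the start index passed to it. This is only a sketch of that option. The names at_pat and anchored_pat are made up for this illustration.

```python
import re

at_pat = re.compile(r'at')
anchored_pat = re.compile(r'\Aat')

# \A refers to the real start of the string,
# so a start index of 1 cannot produce a match
no_match = anchored_pat.search('cater', 1)
print(no_match)                  # None

# Pattern.match tries to match at the given start index
m = at_pat.match('cater', 1)
print(m.group(), m.span())       # at (1, 3)

# 'at' does not begin at index 2, so there is no match
later = at_pat.match('cater', 2)
print(later)                     # None
```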

re.fullmatch

Combining both the start and end string anchors, you can restrict the matching to the whole string. This is similar to comparing strings using the == operator.

>>> word_pat = re.compile(r'\Acat\Z')

>>> bool(word_pat.search('cat'))
True
>>> bool(word_pat.search('concatenation'))
False

You can also use the re.fullmatch function to ensure the pattern matches only the whole input string and not just a part of the input. This may not seem useful with the features introduced so far, but when you have a complex RE pattern with multiple alternatives, this function is quite handy. The argument list is the same as that of the re.search function.

re.fullmatch(pattern, string, flags=0)

>>> word_pat = re.compile(r'cat', flags=re.I)

>>> bool(word_pat.fullmatch('Cat'))
True
>>> bool(word_pat.fullmatch('Scatter'))
False
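To hint at why fullmatch helps with multiple alternatives, here is a small preview sketch. It uses the | metacharacter, which separates alternatives and has not been covered in this chapter. The names loose and whole are chosen only for this example.

```python
import re

# anchors bind tighter than |, so this pattern means: \Acat OR dog\Z
loose = re.compile(r'\Acat|dog\Z')

# fullmatch applies the whole-string restriction to every alternative
whole = re.compile(r'cat|dog')

results = []
for s in ['cat', 'dog', 'hotdog', 'catalog']:
    results.append((s, bool(loose.search(s)), bool(whole.fullmatch(s))))

for row in results:
    print(*row)
# cat True True
# dog True True
# hotdog True False
# catalog True False
```

With search, each alternative would need both anchors written out (or grouping, covered later) to get the same effect.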

Line anchors

A string input may contain single or multiple lines. The newline character \n is used as the line separator. There are two line anchors: the ^ metacharacter for matching the start of line and $ for matching the end of line. If there are no newline characters in the input string, these behave the same as \A and \Z respectively.

>>> pets = 'cat and dog'

>>> bool(re.search(r'^cat', pets))
True
>>> bool(re.search(r'^dog', pets))
False

>>> bool(re.search(r'dog$', pets))
True
>>> bool(re.search(r'^dog$', pets))
False

By default, the input string is considered as a single line, even if multiple newline characters are present. In such cases, the $ metacharacter can match both the end of string and just before \n if it is the last character. However, \Z will always match the end of string, irrespective of the characters present.

>>> greeting = 'hi there\nhave a nice day\n'

>>> bool(re.search(r'day$', greeting))
True
>>> bool(re.search(r'day\n$', greeting))
True

>>> bool(re.search(r'day\Z', greeting))
False
>>> bool(re.search(r'day\n\Z', greeting))
True
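One way to see the difference is to list the positions where each anchor matches. The sketch below uses re.finditer on a short string ending with a newline. The variable names are only illustrative.

```python
import re

s = 'day\n'   # indices: d=0, a=1, y=2, \n=3, end of string=4

# without re.M, $ matches just before a trailing \n and at the end of string
dollar_positions = [m.start() for m in re.finditer(r'$', s)]
print(dollar_positions)   # [3, 4]

# \Z matches only at the end of string
z_positions = [m.start() for m in re.finditer(r'\Z', s)]
print(z_positions)        # [4]
```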

To indicate that the input string should be treated as multiple lines, you need to enable the re.MULTILINE flag (or re.M short form).

# check if any line in the string starts with 'top'
>>> bool(re.search(r'^top', 'hi hello\ntop spot', flags=re.M))
True

# check if any line in the string ends with 'ar'
>>> bool(re.search(r'ar$', 'spare\npar\ndare', flags=re.M))
True

# filter all elements having lines ending with 'are'
>>> elements = ['spare\ntool', 'par\n', 'dare']
>>> [e for e in elements if re.search(r'are$', e, flags=re.M)]
['spare\ntool', 'dare']

# check if any complete line in the string is 'par'
>>> bool(re.search(r'^par$', 'spare\npar\ndare', flags=re.M))
True
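For these simple cases, the re.M based checks behave like splitting the input on \n and testing each line separately. The following sketch shows that correspondence. It is meant as a mental model, not as a description of how the re module works internally.

```python
import re

text = 'spare\npar\ndare'
lines = text.split('\n')   # ['spare', 'par', 'dare']

# is any complete line equal to 'par'?
regex_whole = bool(re.search(r'^par$', text, flags=re.M))
split_whole = any(line == 'par' for line in lines)
print(regex_whole, split_whole)   # True True

# does any line start with 'are'?
regex_start = bool(re.search(r'^are', text, flags=re.M))
split_start = any(line.startswith('are') for line in lines)
print(regex_start, split_start)   # False False
```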

Just like string anchors, you can use the line anchors by themselves as a pattern.

# note that there is no \n at the end of this input string
>>> ip_lines = 'catapults\nconcatenate\ncat'
>>> print(re.sub(r'^', '* ', ip_lines, flags=re.M))
* catapults
* concatenate
* cat

>>> print(re.sub(r'$', '.', ip_lines, flags=re.M))
catapults.
concatenate.
cat.
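The note about the missing \n at the end of the input matters. As a sketch of what happens otherwise, the same substitutions are repeated below on a string that does end with a newline. The position after the final \n counts as one more start of line and end of line.

```python
import re

ip_lines = 'catapults\nconcatenate\ncat\n'

# an extra '* ' appears after the final newline
starts = re.sub(r'^', '* ', ip_lines, flags=re.M)
print(repr(starts))   # '* catapults\n* concatenate\n* cat\n* '

# an extra '.' appears after the final newline
ends = re.sub(r'$', '.', ip_lines, flags=re.M)
print(repr(ends))     # 'catapults.\nconcatenate.\ncat.\n.'
```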

Warning: If you are dealing with text files that use Windows-style line endings, you'll have to convert \r\n to \n first. Many Python functions and methods handle this easily. For example, you can specify which line ending to use with the open function, the split string method handles all whitespace by default, and so on. Or, you can treat \r as an optional character with quantifiers (see the Dot metacharacter and Quantifiers chapter).
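Here is a minimal sketch of the issue and two possible ways around it, using a made-up string with \r\n line endings. The names dos_text and unix_text are just for this illustration.

```python
import re

dos_text = 'spare\r\npar\r\ndare'

# \r sits between 'par' and \n, so 'par' is not at the end of a line
before = bool(re.search(r'par$', dos_text, flags=re.M))
print(before)   # False

# option 1: convert the line endings before matching
unix_text = dos_text.replace('\r\n', '\n')
after = bool(re.search(r'par$', unix_text, flags=re.M))
print(after)    # True

# option 2: let splitlines handle both styles and process each line
lines = dos_text.splitlines()
print(lines)    # ['spare', 'par', 'dare']
```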

Word anchors

The third type of restriction is word anchors. Alphabets (irrespective of case), digits and the underscore character qualify as word characters. You might wonder why digits and underscores are included as well, why not only alphabets? This comes from variable and function naming conventions, which typically allow alphabets, digits and underscores. So, the definition is more oriented to programming languages than natural ones.

The escape sequence \b denotes a word boundary. This works for both the start of word and end of word anchoring. Start of word means either the character prior to the word is a non-word character or there is no character (start of string). Similarly, end of word means the character after the word is a non-word character or no character (end of string). This implies that you cannot have a word boundary \b without a word character.

>>> words = 'par spar apparent spare part'

# replace 'par' irrespective of where it occurs
>>> re.sub(r'par', 'X', words)
'X sX apXent sXe Xt'
# replace 'par' only at start of word
>>> re.sub(r'\bpar', 'X', words)
'X spar apparent spare Xt'
# replace 'par' only at end of word
>>> re.sub(r'par\b', 'X', words)
'X sX apparent spare part'
# replace 'par' only if it is not part of another word
>>> re.sub(r'\bpar\b', 'X', words)
'X spar apparent spare part'
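To make the definition of a word boundary more concrete, the sketch below lists the positions where \b matches for a few inputs. The helper boundary_positions is a hypothetical function written only for this illustration.

```python
import re

def boundary_positions(s):
    # hypothetical helper: indices at which \b matches in s
    return [m.start() for m in re.finditer(r'\b', s)]

# start and end of each of the two words
two_words = boundary_positions('par spar')
print(two_words)    # [0, 3, 4, 8]

# '-' is a non-word character, so both 'a' and 'b' are separate words
hyphenated = boundary_positions('a-b')
print(hyphenated)   # [0, 1, 2, 3]

# no word characters at all, so there is no word boundary
no_words = boundary_positions('-----')
print(no_words)     # []
```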

You can get a lot more creative by using the word boundary as a pattern by itself:

# space separated words to double quoted csv
# note the use of 'replace' string method for normal string replacement
# 'translate' method can also be used
>>> words = 'par spar apparent spare part'
>>> print(re.sub(r'\b', '"', words).replace(' ', ','))
"par","spar","apparent","spare","part"

>>> re.sub(r'\b', ' ', '-----hello-----')
'----- hello -----'

# make a programming statement more readable
# shown for illustration purpose only, won't work for all cases
>>> re.sub(r'\b', ' ', 'foo_baz=num1+35*42/num2')
' foo_baz = num1 + 35 * 42 / num2 '
# excess space at start/end of string can be stripped off
# later you'll learn how to add a qualifier so that strip is not needed
>>> re.sub(r'\b', ' ', 'foo_baz=num1+35*42/num2').strip()
'foo_baz = num1 + 35 * 42 / num2'

The word boundary has an opposite anchor too. \B matches wherever \b doesn't match. This duality will be seen with some other escape sequences too. Negative logic is handy in many text processing situations. But use it with care, since you might end up matching things you didn't intend!

>>> words = 'par spar apparent spare part'

# replace 'par' if it is not start of word
>>> re.sub(r'\Bpar', 'X', words)
'par sX apXent sXe part'
# replace 'par' at end of word but not whole word 'par'
>>> re.sub(r'\Bpar\b', 'X', words)
'par sX apparent spare part'
# replace 'par' if it is not end of word
>>> re.sub(r'par\B', 'X', words)
'par spar apXent sXe Xt'
# replace 'par' if it is surrounded by word characters
>>> re.sub(r'\Bpar\B', 'X', words)
'par spar apXent sXe part'

Here's some standalone pattern usage to compare and contrast the two word anchors.

>>> re.sub(r'\b', ':', 'copper')
':copper:'
>>> re.sub(r'\B', ':', 'copper')
'c:o:p:p:e:r'

>>> re.sub(r'\b', ' ', '-----hello-----')
'----- hello -----'
>>> re.sub(r'\B', ' ', '-----hello-----')
' - - - - -h e l l o- - - - - '
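Another way to look at this duality: for a non-empty string, every position is matched by exactly one of \b and \B. The sketch below checks this for the input used above. The set names are only illustrative.

```python
import re

s = '-----hello-----'

b_pos = {m.start() for m in re.finditer(r'\b', s)}
B_pos = {m.start() for m in re.finditer(r'\B', s)}

# word boundaries are just before 'h' and just after 'o'
print(sorted(b_pos))    # [5, 10]

# no position is matched by both anchors
print(b_pos & B_pos)    # set()

# together they cover every position, including the end of the string
all_positions = set(range(len(s) + 1))
print((b_pos | B_pos) == all_positions)   # True
```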

Cheatsheet and Summary

| Note | Description |
| --- | --- |
| \A | restricts the match to the start of string |
| \Z | restricts the match to the end of string |
| re.fullmatch | ensures pattern matches the entire input string; re.fullmatch(pattern, string, flags=0) |
| \n | line separator, dos-style files need special attention |
| metacharacter | characters with special meaning in RE |
| ^ | restricts the match to the start of line |
| $ | restricts the match to the end of line |
| re.MULTILINE or re.M | flag to treat input as multiline string |
| \b | restricts the match to the start/end of words; word characters: alphabets, digits, underscore |
| \B | matches wherever \b doesn't match |

In this chapter, you've begun to see the building blocks of regular expressions and how they can be used in interesting ways. At the same time, regular expressions are just another tool in the land of text processing. Often, you'd get a simpler solution by combining regular expressions with other string methods and generator expressions. Practice, experience and imagination will help you construct creative solutions. In the coming chapters, you'll see more applications of anchors. The regex module also supports the \G anchor, which is best understood in combination with other regular expression features.

Exercises

a) Check if the given strings start with be.

>>> line1 = 'be nice'
>>> line2 = '"best!"'
>>> line3 = 'better?'
>>> line4 = 'oh no\nbear spotted'

>>> pat = re.compile()       ##### add your solution here

>>> bool(pat.search(line1))
True
>>> bool(pat.search(line2))
False
>>> bool(pat.search(line3))
True
>>> bool(pat.search(line4))
False

b) For the given input string, change only whole word red to brown.

>>> words = 'bred red spread credible'

>>> re.sub()     ##### add your solution here
'bred brown spread credible'

c) For the given input list, filter all elements that contain 42 surrounded by word characters.

>>> words = ['hi42bye', 'nice1423', 'bad42', 'cool_42a', 'fake4b']

>>> [w for w in words if re.search()]   ##### add your solution here
['hi42bye', 'nice1423', 'cool_42a']

d) For the given input list, filter all elements that start with den or end with ly.

>>> items = ['lovely', '1\ndentist', '2 lonely', 'eden', 'fly\n', 'dent']

>>> [e for e in items if ]        ##### add your solution here
['lovely', '2 lonely', 'dent']

e) For the given input string, change whole word mall to 1234 only if it is at the start of a line.

>>> para = '''\
... ball fall wall tall
... mall call ball pall
... wall mall ball fall
... mallet wallet malls'''

>>> print(re.sub())    ##### add your solution here
ball fall wall tall
1234 call ball pall
wall mall ball fall
mallet wallet malls

f) For the given list, filter all elements having a line starting with den or ending with ly.

>>> items = ['lovely', '1\ndentist', '2 lonely', 'eden', 'fly\nfar', 'dent']

##### add your solution here
['lovely', '1\ndentist', '2 lonely', 'fly\nfar', 'dent']

g) For the given input list, filter all whole elements 12\nthree irrespective of case.

>>> items = ['12\nthree\n', '12\nThree', '12\nthree\n4', '12\nthree']
##### add your solution here
['12\nThree', '12\nthree']

h) For the given input list, replace hand with X for all elements that start with hand followed by at least one word character.

>>> items = ['handed', 'hand', 'handy', 'unhanded', 'handle', 'hand-2']

##### add your solution here
['Xed', 'hand', 'Xy', 'unhanded', 'Xle', 'hand-2']

i) For the given input list, filter all elements starting with h. Additionally, replace e with X for these filtered elements.

>>> items = ['handed', 'hand', 'handy', 'unhanded', 'handle', 'hand-2']

##### add your solution here
['handXd', 'hand', 'handy', 'handlX', 'hand-2']