re introduction

In this chapter, you'll get an introduction to the re module, which is part of Python's standard library. For some examples, the equivalent normal string method is shown for comparison. This chapter focuses on syntax. Regular expression features will be covered from the next chapter onwards.

re module documentation

It is always a good idea to know where to find the documentation. The default offering for Python regular expressions is the re standard library module. Visit docs.python: re for information on available methods, syntax, features, examples and more. Here's a quote:

A regular expression (or RE) specifies a set of strings that matches it; the functions in this module let you check if a particular string matches a given regular expression

re.search

Normally you'd use the in operator to test whether a string is part of another string or not. For regular expressions, use the re.search function, whose argument list is shown below.

re.search(pattern, string, flags=0)

The first argument is the RE pattern you want to test against the input string, which is the second argument. flags is an optional argument that changes the default behavior of RE patterns.

As a good practice, always use raw strings to construct the RE pattern. This will become clearer in later chapters. Here are some examples.

>>> sentence = 'This is a sample string'

# check if 'sentence' contains the given search string
>>> 'is' in sentence
True
>>> 'xyz' in sentence
False

# need to load the re module before use
>>> import re

# check if 'sentence' contains the pattern described by RE argument
>>> bool(re.search(r'is', sentence))
True
>>> bool(re.search(r'xyz', sentence))
False

Before using the re module, you need to import it. Further example snippets will assume that the module is already loaded. The return value of the re.search function is a re.Match object when a match is found and None otherwise (note that I treat re as a word, not as r and e separately, hence the use of a instead of an). More details about the re.Match object will be discussed in the Working with matched portions chapter. For presentation purposes, the examples will use the bool function to show True or False depending on whether the RE pattern matched or not.
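To see the return values described above directly, here is a minimal sketch. The variable names found and missing are used only for illustration.

```python
import re

sentence = 'This is a sample string'

# a successful search returns a re.Match object
found = re.search(r'is', sentence)
print(type(found))      # <class 're.Match'>

# an unsuccessful search returns None
missing = re.search(r'xyz', sentence)
print(missing)          # None

# bool() converts these to True and False respectively
print(bool(found), bool(missing))   # True False
```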

Here's an example with the optional flags argument. It will be discussed in detail in the Flags chapter.

>>> sentence = 'This is a sample string'

>>> bool(re.search(r'this', sentence))
False

# re.IGNORECASE (or re.I) is a flag to enable case insensitive matching
>>> bool(re.search(r'this', sentence, flags=re.I))
True
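As a small supplementary sketch, the short and long names of the flag can be used interchangeably. The uppercase pattern below is just an illustration that the flag applies in both directions.

```python
import re

sentence = 'This is a sample string'

# both spellings refer to the same flag
same_flag = re.I is re.IGNORECASE
print(same_flag)    # True

# the long form can be easier to read in scripts
long_form = bool(re.search(r'this', sentence, flags=re.IGNORECASE))
print(long_form)    # True

# case is ignored both ways, so an uppercase pattern matches lowercase input
upper_pat = bool(re.search(r'SAMPLE', sentence, flags=re.I))
print(upper_pat)    # True
```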

re.search in conditional expressions

As Python evaluates None as False in boolean context, re.search can be used directly in conditional expressions. See also docs.python: Truth Value Testing.

>>> sentence = 'This is a sample string'
>>> if re.search(r'ring', sentence):
...     print('mission success')
... 
mission success

>>> if not re.search(r'xyz', sentence):
...     print('mission failed')
... 
mission failed
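The same truthiness rule applies to other boolean contexts as well. Here's a short sketch using a conditional expression and the or operator. The names status and result are only for illustration.

```python
import re

sentence = 'This is a sample string'

# conditional expression: the Match object (or None) decides the branch
status = 'found' if re.search(r'ring', sentence) else 'not found'
print(status)   # found

# 'or' falls through to the right-hand side when the search gives None
result = re.search(r'xyz', sentence) or 'no match'
print(result)   # no match
```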

Here are some examples with list comprehension and generator expressions.

>>> words = ['cat', 'attempt', 'tattle']

>>> [w for w in words if re.search(r'tt', w)]
['attempt', 'tattle']
>>> all(re.search(r'at', w) for w in words)
True
>>> any(re.search(r'stat', w) for w in words)
False
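Generator expressions combine with other built-in functions too. The following sketch counts the matching elements with sum and picks the first matching element with next. These are just two possible uses, not the only way to do it.

```python
import re

words = ['cat', 'attempt', 'tattle']

# count how many words contain 'tt'
# bool(...) gives True or False, which sum() treats as 1 or 0
count = sum(bool(re.search(r'tt', w)) for w in words)
print(count)    # 2

# first word containing 'tt', or None if there is no such word
first = next((w for w in words if re.search(r'tt', w)), None)
print(first)    # attempt
```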

re.sub

For normal search and replace, you'd use the str.replace method. For regular expressions, use the re.sub function, whose argument list is shown below.

re.sub(pattern, repl, string, count=0, flags=0)

The first argument is the RE pattern to match against the input string, which is the third argument. The second argument specifies the string which will replace the portions matched by the RE pattern. count and flags are optional arguments.

>>> greeting = 'Have a nice weekend'

# replace all occurrences of 'e' with 'E'
# same as: greeting.replace('e', 'E')
>>> re.sub(r'e', 'E', greeting)
'HavE a nicE wEEkEnd'

# replace first two occurrences of 'e' with 'E'
# same as: greeting.replace('e', 'E', 2)
>>> re.sub(r'e', 'E', greeting, count=2)
'HavE a nicE weekend'
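Here's a supplementary sketch showing that re.sub and str.replace agree for literal text like this, that the flags argument works with re.sub as well, and that the input is returned unchanged when nothing matches.

```python
import re

greeting = 'Have a nice weekend'

# for literal text like 'e', both approaches give the same result
same = re.sub(r'e', 'E', greeting) == greeting.replace('e', 'E')
print(same)     # True

# flags work with re.sub too: replace 'h' irrespective of case
ignore_case = re.sub(r'h', 'X', greeting, flags=re.I)
print(ignore_case)  # Xave a nice weekend

# if the pattern doesn't match, the input string is returned as is
no_match = re.sub(r'xyz', 'E', greeting)
print(no_match)     # Have a nice weekend
```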

Warning: A common mistake, not specific to re.sub, is forgetting that strings are immutable in Python.

>>> word = 'cater'
# this will return a string object, won't modify 'word' variable
>>> re.sub(r'cat', 'wag', word)
'wager'
>>> word
'cater'

# need to explicitly assign the result if 'word' has to be changed
>>> word = re.sub(r'cat', 'wag', word)
>>> word
'wager'

Compiling regular expressions

Regular expressions can be compiled using the re.compile function, which gives back a re.Pattern object.

re.compile(pattern, flags=0)

The matching and substitution functions available at the top level of the re module (re.search, re.sub and so on) are also available as methods for such objects. Compiling a regular expression is useful if the RE has to be used in multiple places or called upon multiple times inside a loop (speed benefit).

Info: By default, Python maintains a small cache of recently used RE, so the speed benefit doesn't apply for trivial use cases. See also stackoverflow: Is it worth using re.compile?

>>> pet = re.compile(r'dog')
>>> type(pet)
<class 're.Pattern'>

# note that 'search' is called upon 'pet' which is a 're.Pattern' object
# since 'pet' has the RE information, you only need to pass input string
>>> bool(pet.search('They bought a dog'))
True
>>> bool(pet.search('A cat crossed their path'))
False

# replace all occurrences of 'dog' with 'cat'
>>> pet.sub('cat', 'They bought a dog')
'They bought a cat'
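The loop use case mentioned earlier can be sketched as follows. The lines list is made up for illustration.

```python
import re

# compile once, outside the loop
pet = re.compile(r'dog')

lines = ['They bought a dog', 'A cat crossed their path', 'hot dog stand']

# reuse the same re.Pattern object for every line
matched = [line for line in lines if pet.search(line)]
print(matched)  # ['They bought a dog', 'hot dog stand']

# the original pattern string is available via the 'pattern' attribute
print(pet.pattern)  # dog
```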

Some of the methods available for compiled patterns also accept more arguments than the corresponding top level functions of the re module. For example, the search method of a compiled pattern has two optional arguments to specify the start and end index positions. Similar to the range function and slicing notation, the ending index has to be 1 greater than the index of the last character to be included.

Pattern.search(string[, pos[, endpos]])

Note that there's no flags option as that has to be specified with re.compile.

>>> sentence = 'This is a sample string'
>>> word = re.compile(r'is')

# search for 'is' starting from 5th character of 'sentence' variable
>>> bool(word.search(sentence, 4))
True

# search for 'is' starting from 7th character of 'sentence' variable
>>> bool(word.search(sentence, 6))
False

# search for 'is' between 3rd and 4th characters
>>> bool(word.search(sentence, 2, 4))
True
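Here's a further sketch that combines a flag given at compile time with the pos and endpos arguments. The comparison with slicing holds for simple patterns like the ones here and is only meant as an aid to understanding the index positions.

```python
import re

sentence = 'This is a sample string'

# flags are given once, at compile time
word = re.compile(r'this', flags=re.I)
from_start = bool(word.search(sentence))
print(from_start)   # True

# pos=1 skips the first character, so 'This' can no longer match
from_second = bool(word.search(sentence, 1))
print(from_second)  # False

# for a plain pattern like 'is', limiting the range gives the same
# True/False answer as searching a slice of the input
pat = re.compile(r'is')
same = bool(pat.search(sentence, 2, 4)) == bool(pat.search(sentence[2:4]))
print(same)         # True

# endpos=3 leaves only the single character 'i' in range
too_short = bool(pat.search(sentence, 2, 3))
print(too_short)    # False
```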

bytes

To work with the bytes data type, the RE must be of bytes type as well. Similar to str RE, use the raw format to construct a bytes RE.

>>> byte_data = b'This is a sample string'

# error message truncated for presentation purposes
>>> re.search(r'is', byte_data)
TypeError: cannot use a string pattern on a bytes-like object

# use rb'..' for constructing bytes pattern
>>> bool(re.search(rb'is', byte_data))
True
>>> bool(re.search(rb'xyz', byte_data))
False
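If the input starts out as a str, one option is to convert it with the encode method before applying a bytes pattern. The sketch below assumes UTF-8 encoding and also shows that re.sub on bytes input needs a bytes replacement.

```python
import re

text = 'This is a sample string'

# str input can be converted to bytes with the encode method
byte_data = text.encode('utf-8')
found = bool(re.search(rb'is', byte_data))
print(found)    # True

# mixing the two types raises TypeError
try:
    re.search(r'is', byte_data)
    mixed_ok = True
except TypeError:
    mixed_ok = False
print(mixed_ok)     # False

# re.sub on bytes needs a bytes replacement and returns bytes
replaced = re.sub(rb'sample', b'simple', byte_data)
print(replaced)     # b'This is a simple string'
```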

Cheatsheet and Summary

| Note | Description |
| --- | --- |
| docs.python: re | Python standard module for regular expressions |
| re.search | Check if given pattern is present anywhere in input string |
| | re.search(pattern, string, flags=0) |
| | Output is a re.Match object, usable in conditional expressions |
| | raw strings preferred to define RE |
| | Additionally, Python maintains a small cache of recent RE |
| re.sub | search and replace using RE |
| | re.sub(pattern, repl, string, count=0, flags=0) |
| re.compile | Compile a pattern for reuse, output is a re.Pattern object |
| | re.compile(pattern, flags=0) |
| rb'pat' | Use byte pattern for byte input |
| re.IGNORECASE or re.I | flag to ignore case while matching |

This chapter introduced the re module, which is part of the standard library. The re.search and re.sub functions were discussed, as well as how to compile RE using the re.compile function. The RE pattern is usually defined using raw strings. For byte input, the pattern has to be of byte type too. Although the re module is good enough for most use cases, there are situations where you need to use the third party regex module. To avoid mixing up features, a separate chapter is dedicated to the regex module towards the end of the book.

The next section has exercises to test your understanding of the concepts introduced in this chapter. Please do solve them before moving on to the next chapter.

Exercises

Info: Try to solve exercises in every chapter using only the features discussed until that chapter. Some of the exercises will be easier to solve with techniques presented in later chapters, but the aim of these exercises is to explore the features presented so far.

Info: All the exercises are also collated together in one place at Exercises.md. For solutions, see Exercise_solutions.md.

a) Check whether the given strings contain 0xB0. Display a boolean result as shown below.

>>> line1 = 'start address: 0xA0, func1 address: 0xC0'
>>> line2 = 'end address: 0xFF, func2 address: 0xB0'

>>> bool(re.search(r'', line1))     ##### add your solution here
False
>>> bool(re.search(r'', line2))     ##### add your solution here
True

b) Replace all occurrences of 5 with five for the given string.

>>> ip = 'They ate 5 apples and 5 oranges'

>>> re.sub()        ##### add your solution here
'They ate five apples and five oranges'

c) Replace the first occurrence of 5 with five for the given string.

>>> ip = 'They ate 5 apples and 5 oranges'

>>> re.sub()       ##### add your solution here
'They ate five apples and 5 oranges'

d) For the given list, filter all elements that do not contain e.

>>> items = ['goal', 'new', 'user', 'sit', 'eat', 'dinner']

>>> [w for w in items if not re.search()]        ##### add your solution here
['goal', 'sit']

e) Replace all occurrences of note irrespective of case with X.

>>> ip = 'This note should not be NoTeD'

>>> re.sub()        ##### add your solution here
'This X should not be XD'

f) Check if at is present in the given byte input data.

>>> ip = b'tiger imp goat'

>>> bool(re.search())     ##### add your solution here
True

g) For the given input string, display all lines not containing start irrespective of case.

>>> para = '''good start
... Start working on that
... project you always wanted
... stars are shining brightly
... hi there
... start and try to
... finish the book
... bye'''

>>> pat = re.compile()      ##### add your solution here
>>> for line in para.split('\n'):
...     if not pat.search(line):
...         print(line)
... 
project you always wanted
stars are shining brightly
hi there
finish the book
bye

h) For the given list, filter all elements that contain either a or w.

>>> items = ['goal', 'new', 'user', 'sit', 'eat', 'dinner']

##### add your solution here
>>> [w for w in items if re.search() or re.search()]
['goal', 'new', 'eat']

i) For the given list, filter all elements that contain both e and n.

>>> items = ['goal', 'new', 'user', 'sit', 'eat', 'dinner']

##### add your solution here
>>> [w for w in items if re.search() and re.search()]
['new', 'dinner']

j) For the given string, replace 0xA0 with 0x7F and 0xC0 with 0x1F.

>>> ip = 'start address: 0xA0, func1 address: 0xC0'

##### add your solution here
'start address: 0x7F, func1 address: 0x1F'