In this chapter, you'll get an introduction to the re module, which is part of Python's standard library. For some examples, the equivalent normal string method is shown for comparison. This chapter focuses on syntax; regular expression features will be covered from the next chapter onwards.
It is always a good idea to know where to find the documentation. The default offering for Python regular expressions is the re standard library module. Visit docs.python: re for information on available methods, syntax, features, examples and more. Here's a quote:
A regular expression (or RE) specifies a set of strings that matches it; the functions in this module let you check if a particular string matches a given regular expression
Normally you'd use the in operator to test whether a string is part of another string or not. For regular expressions, use the re.search function, whose argument list is shown below.
re.search(pattern, string, flags=0)
The first argument is the RE pattern you want to test against the input string, which is the second argument. The flags argument is optional; it changes the default behavior of RE patterns.
As a good practice, always use raw strings to construct the RE pattern. This will become clearer in later chapters. Here are some examples.
>>> sentence = 'This is a sample string'

# check if 'sentence' contains the given search string
>>> 'is' in sentence
True
>>> 'xyz' in sentence
False

# need to load the re module before use
>>> import re
# check if 'sentence' contains the pattern described by RE argument
>>> bool(re.search(r'is', sentence))
True
>>> bool(re.search(r'xyz', sentence))
False
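To get a first feel for why raw strings are recommended, here's a minimal sketch comparing a normal string and a raw string that both contain a backslash. The variable names normal and raw are made up for this illustration.

```python
# in a normal string, '\n' is processed by Python as a single newline character
normal = 'a\nb'

# in a raw string, the backslash and 'n' are kept as two separate characters
raw = r'a\nb'

print(len(normal))   # 3
print(len(raw))      # 4
```

With raw strings, the pattern reaches the re module exactly as typed, without Python's string escape processing getting in the way.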
Before using the re module, you need to import it. Further example snippets will assume that the module is already loaded. The return value of the re.search function is a re.Match object when a match is found and None otherwise (note that I treat re as a word, not as r and e separately, hence the use of a instead of an). More details about the re.Match object will be discussed in the Working with matched portions chapter. For presentation purposes, the examples will use the bool function to show True or False depending on whether the RE pattern matched or not.
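As a small sketch of the two possible return values, the snippet below stores the results in variables (found and missing are names chosen only for this illustration) and inspects them before converting to boolean values.

```python
import re

sentence = 'This is a sample string'

# a successful search gives back a re.Match object
found = re.search(r'is', sentence)
print(isinstance(found, re.Match))    # True

# an unsuccessful search gives back None
missing = re.search(r'xyz', sentence)
print(missing is None)                # True

# bool() converts these two cases to True and False respectively
print(bool(found), bool(missing))     # True False
```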
Here's an example with the optional flags argument. It will be discussed in detail in the Flags chapter.
>>> sentence = 'This is a sample string'
>>> bool(re.search(r'this', sentence))
False

# re.IGNORECASE (or re.I) is a flag to enable case insensitive matching
>>> bool(re.search(r'this', sentence, flags=re.I))
True
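Here's a short sketch showing that the short and long flag names refer to the same flag, and that flags can be given either as a keyword or as the third positional argument, as per the re.search argument list shown earlier.

```python
import re

sentence = 'This is a sample string'

# re.I is a short alias for re.IGNORECASE
print(re.I == re.IGNORECASE)    # True

# keyword form and positional form give the same result
print(bool(re.search(r'this', sentence, flags=re.IGNORECASE)))   # True
print(bool(re.search(r'this', sentence, re.I)))                  # True
```

Using the keyword form is a matter of taste here, but it tends to be easier to read.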
As Python evaluates None as False in boolean context, re.search can be used directly in conditional expressions. See also docs.python: Truth Value Testing.
>>> sentence = 'This is a sample string'
>>> if re.search(r'ring', sentence):
...     print('mission success')
...
mission success
>>> if not re.search(r'xyz', sentence):
...     print('mission failed')
...
mission failed
Here are some examples using a list comprehension and generator expressions.
>>> words = ['cat', 'attempt', 'tattle']
>>> [w for w in words if re.search(r'tt', w)]
['attempt', 'tattle']
>>> all(re.search(r'at', w) for w in words)
True
>>> any(re.search(r'stat', w) for w in words)
False
For normal search and replace, you'd use the str.replace method. For regular expressions, use the re.sub function, whose argument list is shown below.
re.sub(pattern, repl, string, count=0, flags=0)
The first argument is the RE pattern to match against the input string, which is the third argument. The second argument specifies the string which will replace the portions matched by the RE pattern. The count and flags arguments are optional.
>>> greeting = 'Have a nice weekend'

# replace all occurrences of 'e' with 'E'
# same as: greeting.replace('e', 'E')
>>> re.sub(r'e', 'E', greeting)
'HavE a nicE wEEkEnd'

# replace first two occurrences of 'e' with 'E'
# same as: greeting.replace('e', 'E', 2)
>>> re.sub(r'e', 'E', greeting, count=2)
'HavE a nicE weekend'
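As a further sketch of the same argument list, the snippet below shows what happens when the pattern doesn't match at all, and how count=1 limits the replacement to the first match. The variable names unchanged and first_only are made up for this illustration.

```python
import re

greeting = 'Have a nice weekend'

# when the pattern isn't found, the input string is returned as is
unchanged = re.sub(r'xyz', 'E', greeting)
print(unchanged == greeting)    # True

# count=1 replaces only the first occurrence
first_only = re.sub(r'e', 'E', greeting, count=1)
print(first_only)               # HavE a nice weekend
```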
A common mistake, not specific to re.sub, is forgetting that strings are immutable in Python.
>>> word = 'cater'

# this will return a string object, won't modify 'word' variable
>>> re.sub(r'cat', 'wag', word)
'wager'
>>> word
'cater'

# need to explicitly assign the result if 'word' has to be changed
>>> word = re.sub(r'cat', 'wag', word)
>>> word
'wager'
Regular expressions can be compiled using the re.compile function, which gives back a re.Pattern object. The top level re module functions are all available as methods for such objects. Compiling a regular expression is useful if the RE has to be used in multiple places or called upon multiple times inside a loop (speed benefit).
By default, Python maintains a small list of recently used RE, so the speed benefit doesn't apply for trivial use cases. See also stackoverflow: Is it worth using re.compile?
>>> pet = re.compile(r'dog')
>>> type(pet)
<class 're.Pattern'>

# note that 'search' is called upon 'pet' which is a 're.Pattern' object
# since 'pet' has the RE information, you only need to pass input string
>>> bool(pet.search('They bought a dog'))
True
>>> bool(pet.search('A cat crossed their path'))
False

# replace all occurrences of 'dog' with 'cat'
>>> pet.sub('cat', 'They bought a dog')
'They bought a cat'
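If you'd like to check the speed aspect on your own machine, here's a rough sketch using the timeit module from the standard library. The function names and the size of the input list are arbitrary choices for this illustration, and the timings depend on your machine and Python version, so no numbers are shown.

```python
import re
import timeit

# a larger list so that the search is called many times
words = ['cat', 'attempt', 'tattle'] * 1000
pat = re.compile(r'tt')

def with_module_function():
    # pattern is looked up (via the internal cache) on every call
    return [w for w in words if re.search(r'tt', w)]

def with_compiled_pattern():
    # pattern object is created once and reused
    return [w for w in words if pat.search(w)]

# both approaches give the same result
print(with_module_function() == with_compiled_pattern())   # True

# compare the time taken by the two approaches
print(timeit.timeit(with_module_function, number=20))
print(timeit.timeit(with_compiled_pattern, number=20))
```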
Some of the methods available for compiled patterns also accept more arguments than those available for top level functions of the re module. For example, the search method on a compiled pattern has two optional arguments to specify start and end index positions. Similar to the range function and slicing notation, the ending index has to be specified 1 greater than the desired index.
Pattern.search(string[, pos[, endpos]])
Note that there's no flags option as that has to be specified with re.compile.
>>> sentence = 'This is a sample string'
>>> word = re.compile(r'is')

# search for 'is' starting from 5th character of 'sentence' variable
>>> bool(word.search(sentence, 4))
True

# search for 'is' starting from 7th character of 'sentence' variable
>>> bool(word.search(sentence, 6))
False

# search for 'is' between 3rd and 4th characters
>>> bool(word.search(sentence, 2, 4))
True
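The sketch below relates the index arguments to slicing for this simple pattern, and shows flags being passed to re.compile instead of the search method. The name ci_word is made up for this illustration. Treat the slicing comparison as an approximation, since the two approaches aren't guaranteed to behave the same for every RE feature.

```python
import re

sentence = 'This is a sample string'
word = re.compile(r'is')

# for this pattern, searching between index 2 and 4
# gives the same result as searching the slice [2:4]
print(bool(word.search(sentence, 2, 4)))    # True
print(bool(word.search(sentence[2:4])))     # True

# ending index 3 leaves out the 's' at index 3, so 'is' cannot match
print(bool(word.search(sentence, 2, 3)))    # False

# flags are given to re.compile, not to the search method
ci_word = re.compile(r'this', flags=re.I)
print(bool(ci_word.search(sentence)))       # True
```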
To work with the bytes data type, the RE must be of bytes type as well. Similar to str RE, use raw format to construct a bytes RE.
>>> byte_data = b'This is a sample string'

# error message truncated for presentation purposes
>>> re.search(r'is', byte_data)
TypeError: cannot use a string pattern on a bytes-like object

# use rb'..' for constructing bytes pattern
>>> bool(re.search(rb'is', byte_data))
True
>>> bool(re.search(rb'xyz', byte_data))
False
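If your input and pattern happen to be of different types, one option is to convert the input first. Here's a minimal sketch, assuming the data is UTF-8 encoded text (the encoding is an assumption for this illustration, use whatever applies to your data).

```python
import re

text = 'This is a sample string'

# encode the str input to bytes, then use a bytes pattern
byte_data = text.encode('utf-8')
print(bool(re.search(rb'is', byte_data)))    # True

# alternatively, decode the bytes input and use a str pattern
decoded = byte_data.decode('utf-8')
print(bool(re.search(r'is', decoded)))       # True
```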
|docs.python: re|Python standard module for regular expressions|
|re.search|Check if given pattern is present anywhere in input string|
||Output is a re.Match object|
||raw strings preferred to define RE|
||Additionally, Python maintains a small cache of recent RE|
|re.sub|search and replace using RE|
|re.compile|Compile a pattern for reuse, output is a re.Pattern object|
|rb'..'|Use byte pattern for byte input|
|re.IGNORECASE or re.I|flag to ignore case while matching|
This chapter introduced the re module, which is part of the standard library. The re.search and re.sub functions were discussed, as well as how to compile RE using the re.compile function. The RE pattern is usually defined using raw strings. For byte input, the pattern has to be of byte type too. Although the re module is good enough for most use cases, there are situations where you need to use the third party regex module. To avoid mixing up features, a separate chapter is dedicated to the regex module towards the end of the book.
The next section has exercises to test your understanding of the concepts introduced in this chapter. Please do solve them before moving on to the next chapter.
Try to solve exercises in every chapter using only the features discussed until that chapter. Some of the exercises will be easier to solve with techniques presented in later chapters, but the aim of these exercises is to explore the features presented so far.
a) Check whether the given strings contain 0xB0. Display a boolean result as shown below.
>>> line1 = 'start address: 0xA0, func1 address: 0xC0'
>>> line2 = 'end address: 0xFF, func2 address: 0xB0'

>>> bool(re.search(r'', line1))        ##### add your solution here
False
>>> bool(re.search(r'', line2))        ##### add your solution here
True
b) Replace all occurrences of 5 with five for the given string.
>>> ip = 'They ate 5 apples and 5 oranges'

>>> re.sub()        ##### add your solution here
'They ate five apples and five oranges'
c) Replace only the first occurrence of 5 with five for the given string.
>>> ip = 'They ate 5 apples and 5 oranges'

>>> re.sub()        ##### add your solution here
'They ate five apples and 5 oranges'
d) For the given list, filter all elements that do not contain e.
>>> items = ['goal', 'new', 'user', 'sit', 'eat', 'dinner']

>>> [w for w in items if not re.search()]        ##### add your solution here
['goal', 'sit']
e) Replace all occurrences of note irrespective of case with X.
>>> ip = 'This note should not be NoTeD'

>>> re.sub()        ##### add your solution here
'This X should not be XD'
f) Check if at is present in the given byte input data.
>>> ip = b'tiger imp goat'

>>> bool(re.search())        ##### add your solution here
True
g) For the given input string, display all lines not containing start irrespective of case.
>>> para = '''good start
... Start working on that
... project you always wanted
... stars are shining brightly
... hi there
... start and try to
... finish the book
... bye'''

>>> pat = re.compile()        ##### add your solution here
>>> for line in para.split('\n'):
...     if not pat.search(line):
...         print(line)
...
project you always wanted
stars are shining brightly
hi there
finish the book
bye
h) For the given list, filter all elements that contain either a or w.
>>> items = ['goal', 'new', 'user', 'sit', 'eat', 'dinner']

##### add your solution here
>>> [w for w in items if re.search() or re.search()]
['goal', 'new', 'eat']
i) For the given list, filter all elements that contain both e and n.
>>> items = ['goal', 'new', 'user', 'sit', 'eat', 'dinner']

##### add your solution here
>>> [w for w in items if re.search() and re.search()]
['new', 'dinner']
j) For the given string, replace 0xA0 with 0x7F and 0xC0 with 0x1F.
>>> ip = 'start address: 0xA0, func1 address: 0xC0'

##### add your solution here
'start address: 0x7F, func1 address: 0x1F'