Flags

Just as options change the default behavior of command line tools, flags change aspects of RE behavior. You have already seen flags for ignoring case and for changing the behavior of line anchors. Flags can be applied to the entire RE using the optional flags argument, or to a particular portion of the RE using special groups. These two forms can also be mixed. In regular expression parlance, flags are also known as modifiers.

For the sake of completeness, the flags already seen will be discussed again in this chapter. You'll also learn how to combine multiple flags.

re.IGNORECASE

First up, the flag to ignore case while matching letters. When the flags argument is used, this can be specified with either the re.I or the re.IGNORECASE constant.

>>> bool(re.search(r'cat', 'Cat'))
False
>>> bool(re.search(r'cat', 'Cat', flags=re.IGNORECASE))
True

>>> re.findall(r'c.t', 'Cat cot CATER ScUtTLe', flags=re.I)
['Cat', 'cot', 'CAT', 'cUt']

# without flag, you need to use: r'[a-zA-Z]+'
# with flag, can also use: r'[A-Z]+'
>>> re.findall(r'[a-z]+', 'Sample123string42with777numbers', flags=re.I)
['Sample', 'string', 'with', 'numbers']
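
The comments in the last example can be checked directly. The following is a minimal sketch (the variable names are only illustrative) comparing the explicit two-range character class against the single-range versions that rely on re.I.

```python
import re

text = 'Sample123string42with777numbers'

# case-sensitive: both letter ranges have to be spelled out
explicit = re.findall(r'[a-zA-Z]+', text)

# case-insensitive: either range alone is enough
lower_only = re.findall(r'[a-z]+', text, flags=re.I)
upper_only = re.findall(r'[A-Z]+', text, flags=re.I)

print(explicit == lower_only == upper_only)  # True
print(upper_only)  # ['Sample', 'string', 'with', 'numbers']
```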

re.DOTALL

Use re.S or re.DOTALL to allow the . metacharacter to match newline characters as well.

# by default, the . metacharacter doesn't match newline
>>> re.sub(r'the.*ice', 'X', 'Hi there\nHave a Nice Day')
'Hi there\nHave a Nice Day'

# re.S flag will allow newline character to be matched as well
>>> re.sub(r'the.*ice', 'X', 'Hi there\nHave a Nice Day', flags=re.S)
'Hi X Day'

Multiple flags can be combined using the bitwise OR operator.

>>> re.sub(r'the.*day', 'Bye', 'Hi there\nHave a Nice Day', flags=re.S|re.I)
'Hi Bye'
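
Since the combined value is an ordinary object, it can be stored and reused, and a compiled pattern reports its flags through the flags attribute. Here is a small sketch of that idea. The name LOOSE is a hypothetical choice for illustration, not something defined by the re module.

```python
import re

# hypothetical name for a reusable flag combination
LOOSE = re.S | re.I

pat = re.compile(r'the.*day', flags=LOOSE)
result = pat.sub('Bye', 'Hi there\nHave a Nice Day')
print(result)  # Hi Bye

# each flag can be tested individually with the bitwise AND operator
has_dotall = bool(pat.flags & re.S)
has_ignorecase = bool(pat.flags & re.I)
has_multiline = bool(pat.flags & re.M)
print(has_dotall, has_ignorecase, has_multiline)  # True True False
```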

re.MULTILINE

As seen earlier, the re.M or re.MULTILINE flag allows the ^ and $ anchors to work line wise.

# check if any line in the string starts with 'top'
>>> bool(re.search(r'^top', 'hi hello\ntop spot', flags=re.M))
True

# check if any line in the string ends with 'ar'
>>> bool(re.search(r'ar$', 'spare\npar\ndare', flags=re.M))
True
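
To see the difference the flag makes, it may help to extract the matches instead of only testing for them. This sketch reuses the input from the example above; the variable names are illustrative.

```python
import re

text = 'spare\npar\ndare'

# without re.M, $ matches only at the end of the string
# (or just before a newline that ends the string)
last_only = re.findall(r'\w+$', text)
print(last_only)  # ['dare']

# with re.M, $ also matches at the end of every line
every_line = re.findall(r'\w+$', text, flags=re.M)
print(every_line)  # ['spare', 'par', 'dare']

# likewise, ^ matches at the start of every line
first_letters = re.findall(r'^\w', text, flags=re.M)
print(first_letters)  # ['s', 'p', 'd']
```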

re.VERBOSE

The re.X or re.VERBOSE flag is another provision, like named capture groups, that helps add clarity to RE definitions. This flag allows you to use literal whitespace for alignment and to add comments after the # character, so that a complex RE can be broken down into multiple lines.

# same as: pat = re.compile(r'\A((?:[^,]+,){3})([^,]+)')
# note the use of triple quoted string
>>> pat = re.compile(r'''
...         \A(                 # group-1, captures first 3 columns
...             (?:[^,]+,){3}   # non-capturing group to get the 3 columns
...           )
...         ([^,]+)             # group-2, captures 4th column
...         ''', flags=re.X)

>>> pat.sub(r'\1(\2)', '1,2,3,4,5,6,7')
'1,2,3,(4),5,6,7'
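
Verbose mode pairs naturally with named capture groups. As a further sketch, here is a date-like pattern written both ways. The pattern, the sample text and the names date_pat and compact are assumptions made up for this illustration.

```python
import re

# same as: compact = re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})')
date_pat = re.compile(r'''
    (?P<year>\d{4})    # four-digit year
    -                  # literal separator
    (?P<month>\d{2})   # two-digit month
    -                  # literal separator
    (?P<day>\d{2})     # two-digit day
    ''', flags=re.X)
compact = re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})')

text = 'released on 2024-03-15, patched later'
m = date_pat.search(text)
print(m[0])                              # 2024-03-15
print(m['year'], m['month'], m['day'])   # 2024 03 15
print(compact.search(text)[0] == m[0])   # True
```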

There are a few workarounds if you need to match whitespace and # characters literally. Here's the relevant quote from the documentation:

Whitespace within the pattern is ignored, except when in a character class, or when preceded by an unescaped backslash, or within tokens like *?, (?: or (?P<...>. When a line contains a # that is not in a character class and is not preceded by an unescaped backslash, all characters from the leftmost such # through the end of the line are ignored.

>>> bool(re.search(r't a', 'cat and dog', flags=re.X))
False
>>> bool(re.search(r't\ a', 'cat and dog', flags=re.X))
True
>>> bool(re.search(r't[ ]a', 'cat and dog', flags=re.X))
True
>>> bool(re.search(r't\x20a', 'cat and dog', flags=re.X))
True

>>> re.search(r'a#b', 'apple a#b 123', flags=re.X)[0]
'a'
>>> re.search(r'a\#b', 'apple a#b 123', flags=re.X)[0]
'a#b'
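
These workarounds can be combined inside a multiline verbose pattern. The sketch below uses character classes for both the space and the # character. The input string and the pattern are made up for illustration.

```python
import re

pat = re.compile(r'''
    item [ ] [#]   # character classes keep the space and the hash literal
    (\d+)          # group-1, the item number
    ''', flags=re.X)

m = pat.search('see item #42 here')
print(m[0])  # item #42
print(m[1])  # 42
```

The whitespace between item, [ ] and [#] is ignored, so only the characters inside the two character classes are matched literally.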

Inline comments

Comments can also be added using the (?#comment) special group. This is independent of the re.X flag.

>>> pat = re.compile(r'\A((?:[^,]+,){3})(?#3-cols)([^,]+)(?#4th-col)')

>>> pat.sub(r'\1(\2)', '1,2,3,4,5,6,7')
'1,2,3,(4),5,6,7'
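
A quick way to confirm that such comments do not affect matching is to compare a pattern with and without them. This is a small sketch; the pattern and the names plain and commented are assumptions for illustration.

```python
import re

plain = re.compile(r'(\d+)-(\w+)')
commented = re.compile(r'(\d+)(?#digits)-(\w+)(?#word)')

# the comment groups are not counted as capture groups
print(plain.groups, commented.groups)  # 2 2

# backreferences are numbered the same way as in the plain version
swapped = commented.sub(r'\2-\1', '42-abc')
print(swapped)  # abc-42
```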

Inline flags

To apply flags to specific portions of the RE, specify them inside a special grouping syntax. This will override the flags applied to the entire RE definition, if any. The syntax variations are:

(?flags:pat) will apply flags only for this portion
(?-flags:pat) will negate flags only for this portion
(?flags-flags:pat) will apply and negate particular flags only for this portion
(?flags) will apply flags for the whole RE definition
- can be specified only at the start of RE definition
- if anchors are needed, they should be specified after this group

In these ways, flags can be specified precisely where they are needed. The flags are to be given as the single letter lowercase version of the short form constants. For example, i for re.I, s for re.S and so on, except L for re.L or re.LOCALE (discussed in the re.ASCII section). And as can be observed from the examples below, these do not act as capture groups.

# case-sensitive for the whole RE definition
>>> re.findall(r'Cat[a-z]*\b', 'Cat SCatTeR CATER cAts')
['Cat']
# case-insensitive only for the '[a-z]*' portion
>>> re.findall(r'Cat(?i:[a-z]*)\b', 'Cat SCatTeR CATER cAts')
['Cat', 'CatTeR']

# case-insensitive for the whole RE definition using flags argument
>>> re.findall(r'Cat[a-z]*\b', 'Cat SCatTeR CATER cAts', flags=re.I)
['Cat', 'CatTeR', 'CATER', 'cAts']
# case-insensitive for the whole RE definition using inline flags
>>> re.findall(r'(?i)Cat[a-z]*\b', 'Cat SCatTeR CATER cAts')
['Cat', 'CatTeR', 'CATER', 'cAts']
# case-sensitive only for the 'Cat' portion
>>> re.findall(r'(?-i:Cat)[a-z]*\b', 'Cat SCatTeR CATER cAts', flags=re.I)
['Cat', 'CatTeR']
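
The examples above cover the (?flags:pat), (?-flags:pat) and (?flags) forms. Here is a minimal sketch of the remaining (?flags-flags:pat) form, along with an anchor placed after (?flags) and a check that inline flag groups do not capture. The input strings and variable names are made up for illustration.

```python
import re

text = 'A\nB a\nb'

# inside the group only: s is applied and i is negated
m = re.search(r'(?s-i:a.b)', text, flags=re.I)
print(repr(m[0]))  # 'a\nb'

# without the inline group, . cannot match the newline
no_match = re.search(r'a.b', text, flags=re.I)
print(no_match)  # None

# (?flags) at the very start, with the anchor placed after it
starts = re.findall(r'(?im)^cat', 'Cat\ncat\nscat')
print(starts)  # ['Cat', 'cat']

# the inline flag group is not counted, only (dog) is
group_count = re.compile(r'(?i:cat)(dog)').groups
print(group_count)  # 1
```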

Cheatsheet and Summary

Note	Description
`re.IGNORECASE` or `re.I`	flag to ignore case
`re.DOTALL` or `re.S`	allow `.` metacharacter to match newline characters
`flags=re.S|re.I`	multiple flags can be combined using the `|` operator
`re.MULTILINE` or `re.M`	allow `^` and `$` anchors to match line wise
`re.VERBOSE` or `re.X`	allows literal whitespace to be used for alignment
	and to add comments after the `#` character
	escape spaces and `#` if needed as part of actual RE
`(?#comment)`	another way to add comments (not a flag)
`(?flags:pat)`	inline flags only for this `pat`, overrides `flags` argument
	where flags is `i` for `re.I`, `s` for `re.S`, etc
	except `L` for `re.L`
`(?-flags:pat)`	negate flags only for this `pat`
`(?flags-flags:pat)`	apply and negate particular flags only for this `pat`
`(?flags)`	apply flags for whole RE, can be used only at start of RE
	anchors if any, should be specified after `(?flags)`

This chapter showed some of the flags that can be used to change the default behavior of an RE definition. It also covered a few more special groupings.

Exercises

1) For the given input strings, remove the text from the first occurrence of 'hat' to the last occurrence of 'it'. Match these markers case insensitively.

>>> s1 = 'But Cool THAT\nsee What okay\nwow quite'
>>> s2 = 'it this hat is sliced HIT.'

>>> pat = re.compile()       ##### add your solution here

>>> pat.sub('', s1)
'But Cool Te'
>>> pat.sub('', s2)
'it this .'

2) Delete the text from 'start', if it is at the beginning of a line, up to the next occurrence of 'end' at the end of a line. Match these markers case insensitively.

>>> para = '''\
... good start
... start working on that
... project you always wanted
... to, do not let it end
... hi there
... start and end the end
... 42
... Start and try to
... finish the End
... bye'''

>>> pat = re.compile()        ##### add your solution here

>>> print(pat.sub('', para))
good start

hi there

42

bye

3) For the given input strings, match only if all three of these terms are present:

'This', case sensitively
'nice' and 'cool', case insensitively

>>> s1 = 'This is nice and Cool'
>>> s2 = 'Nice and cool this is'
>>> s3 = 'What is so nice and cool about This?'
>>> s4 = 'nice,cool,This'
>>> s5 = 'not nice This?'
>>> s6 = 'This is not cool'

>>> pat = re.compile()       ##### add your solution here

>>> bool(pat.search(s1))
True
>>> bool(pat.search(s2))
False
>>> bool(pat.search(s3))
True
>>> bool(pat.search(s4))
True
>>> bool(pat.search(s5))
False
>>> bool(pat.search(s6))
False

4) For the given input strings, match if the string begins with Th and also contains a line that starts with There.

>>> s1 = 'There there\nHave a cookie'
>>> s2 = 'This is a mess\nYeah?\nThereeeee'
>>> s3 = 'Oh\nThere goes the fun'
>>> s4 = 'This is not\ngood\nno There'

>>> pat = re.compile()      ##### add your solution here

>>> bool(pat.search(s1))
True
>>> bool(pat.search(s2))
True
>>> bool(pat.search(s3))
False
>>> bool(pat.search(s4))
False

5) Explore what the re.DEBUG flag does. Here are some example patterns to check out.

re.compile(r'\Aden|ly\Z', flags=re.DEBUG)
re.compile(r'\b(0x)?[\da-f]+\b', flags=re.DEBUG)
re.compile(r'\b(?:0x)?[\da-f]+\b', flags=re.I|re.DEBUG)

Understanding Python re(gex)?