regex module

The third-party regex module offers advanced features like those found in Perl and other regular expression implementations. To install the module from the command line, use one of these commands, depending on your setup:

  • pip install regex in a virtual environment
  • python3.8 -m pip install --user regex for a per-user install that is accessible outside virtual environments

By default, the regex module uses VERSION0, which is compatible with the re module. If you want all the features, use VERSION1. For example, set operators are available only with VERSION1. You can choose the version in two ways: setting regex.DEFAULT_VERSION to regex.VERSION0 or regex.VERSION1 is a global option, whereas (?V0) and (?V1) are inline flag options.

Note: The examples in this chapter are presented assuming VERSION1 is enabled.

>>> import regex
>>> regex.DEFAULT_VERSION = regex.VERSION1

>>> sentence = 'This is a sample string'
>>> bool(regex.search(r'is', sentence))
True
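
The version can also be selected per pattern instead of globally. The sketch below assumes the regex module is installed and borrows the set operator pattern used in the Set operations section. Passing regex.VERSION1 through the flags argument is shown as an alternative that leaves regex.DEFAULT_VERSION untouched.

```python
import regex

text = 'tryst glyph pity why'

# inline flag: (?V1) at the start of the pattern enables VERSION1 features
inline = regex.findall(r'(?V1)\b[a-z&&[^aeiou]]+\b', text)

# flags argument: applies VERSION1 to this call only
via_flag = regex.findall(r'\b[a-z&&[^aeiou]]+\b', text, flags=regex.VERSION1)

print(inline)
print(via_flag)
```

Both calls should give the same list, since they differ only in how the version is requested.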

Possessive quantifiers

Appending a + to greedy quantifiers makes them possessive. These behave like greedy quantifiers, but without the backtracking. So, something like r'Error.*+valid' will never match because .*+ will consume all the remaining characters. If both greedy and possessive quantifier versions are functionally equivalent, then possessive is preferred because it will fail faster for non-matching cases.

# functionally equivalent greedy and possessive versions
>>> demo = ['abc', 'ac', 'adc', 'abbc', 'xabbbcz', 'bbb', 'bc', 'abbbbbc']
>>> [w for w in demo if regex.search(r'ab*c', w)]
['abc', 'ac', 'abbc', 'xabbbcz', 'abbbbbc']
>>> [w for w in demo if regex.search(r'ab*+c', w)]
['abc', 'ac', 'abbc', 'xabbbcz', 'abbbbbc']

# different results
# goal: match numbers >= 100, ignoring any leading zeros
# greedy 0* can give back a zero to \d{3,}, so '035' is matched as well
>>> regex.findall(r'\b0*\d{3,}\b', '0501 035 154 12 26 98234')
['0501', '035', '154', '98234']
>>> regex.findall(r'\b0*+\d{3,}\b', '0501 035 154 12 26 98234')
['0501', '154', '98234']
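
To make the "will never match" claim concrete, here is a small sketch comparing the greedy and possessive forms of the same pattern. It assumes the regex module is installed, and the sample line is made up for illustration.

```python
import regex

line = 'Error: not a valid input'

# greedy .* backtracks, giving characters back until 'valid' can match
greedy = regex.search(r'Error.*valid', line)

# possessive .*+ keeps everything it consumed, so 'valid' has nothing left to match
possessive = regex.search(r'Error.*+valid', line)

print(greedy[0])
print(possessive)
```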

The effect of possessive quantifier can also be expressed using atomic grouping. The syntax is (?>pat), where pat is the portion you want to match possessively.

# same as: r'[bo]++'
>>> regex.sub(r'(?>[bo]+)', 'X', 'abbbc foooooot')
'aXc fXt'

# same as: r'\b0*+\d{3,}\b'
>>> regex.findall(r'\b(?>0*)\d{3,}\b', '0501 035 154 12 26 98234')
['0501', '154', '98234']

Subexpression calls

If backreferences are like variables, then subexpression calls are like functions. Backreferences allow you to reuse the portion matched by the capture group. Subexpression calls allow you to reuse the pattern that was used inside the capture group. You can call subexpressions recursively too, see Recursive matching section for details.

The syntax is (?N) where N is the capture group you want to call. This is applicable only in RE definition, not in replacement sections.

>>> row = 'today,2008-03-24,food,2012-08-12,nice,5632'

# with re module and manually repeating the pattern
>>> import re
>>> re.search(r'\d{4}-\d{2}-\d{2}.*\d{4}-\d{2}-\d{2}', row)[0]
'2008-03-24,food,2012-08-12'

# with regex module and subexpression calling
>>> regex.search(r'(\d{4}-\d{2}-\d{2}).*(?1)', row)[0]
'2008-03-24,food,2012-08-12'

Named capture groups can be called using (?&name) syntax.

>>> row = 'today,2008-03-24,food,2012-08-12,nice,5632'

>>> regex.search(r'(?P<date>\d{4}-\d{2}-\d{2}).*(?&date)', row)[0]
'2008-03-24,food,2012-08-12'
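
The variable versus function analogy can be checked directly. In this sketch (assuming the regex module is installed), the backreference demands the same text twice, while the subexpression call only demands the same pattern twice.

```python
import regex

row = 'today,2008-03-24,food,2012-08-12,nice,5632'

# backreference: the second date must be the same text as the first
same_text = regex.search(r'(\d{4}-\d{2}-\d{2}).*\1', row)

# subexpression call: the second date only has to fit the same pattern
same_pattern = regex.search(r'(\d{4}-\d{2}-\d{2}).*(?1)', row)

print(same_text)
print(same_pattern[0])
```

The two dates in row are different, so only the subexpression call finds a match.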

Positive lookbehind with \K

Most (but not all) of the positive lookbehind cases can be solved by adding \K as a suffix to the pattern to be tested. This will work for variable length patterns as well.

# similar to: r'(?<=\b\w)\w*\W*'
# text matched before \K won't be replaced
>>> regex.sub(r'\b\w\K\w*\W*', '', 'sea eat car rat eel tea')
'secret'

# variable length example
# replace only 3rd occurrence of 'cat'
>>> regex.sub(r'(cat.*?){2}\Kcat', 'X', 'cat scatter cater scat', count=1)
'cat scatter Xer scat'
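
For comparison, the same results can be obtained without \K by capturing the text to keep and putting it back in the replacement. A minimal sketch using only the stdlib re module:

```python
import re

# keep the first letter of each word via a capture group instead of \K
first_letters = re.sub(r'\b(\w)\w*\W*', r'\1', 'sea eat car rat eel tea')

# capture the first two 'cat' chunks and replace only the third 'cat'
third_cat = re.sub(r'((?:cat.*?){2})cat', r'\1X',
                   'cat scatter cater scat', count=1)

print(first_letters)
print(third_cat)
```

The \K form avoids the extra group and the backreference in the replacement section, which is the convenience being described here.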

Here's another example, one that won't work if a greedy quantifier is used instead of a possessive quantifier.

>>> row = '421,foo,2425,42,5,foo,6,6,42'

# lookarounds used to ensure start/end of column matching
# possessive quantifier used to ensure partial column is not captured
# if a column has same text as another column, the latter column is deleted
>>> while True:
...     row, cnt = regex.subn(r'(?<![^,])([^,]++).*\K,\1(?![^,])', r'', row)
...     if cnt == 0:
...         break
>>> row
'421,foo,2425,42,5,6'

Variable length lookbehind

The regex module allows using variable length lookbehind without needing any change.

>>> s = 'pore42 tar3 dare7 care5'
>>> regex.findall(r'(?<!tar|dare)\d+', s)
['42', '5']
>>> regex.findall(r'(?<=\b[pd][a-z]*)\d+', s)
['42', '7']
>>> regex.sub(r'(?<=\A|,)(?=,|\Z)', 'NA', ',1,,,two,3,,,')
'NA,1,NA,NA,two,3,NA,NA,NA'

>>> regex.sub(r'(?<=(cat.*?){2})cat', 'X', 'cat scatter cater scat', count=1)
'cat scatter Xer scat'

>>> bool(regex.search(r'(?<!cat.*)dog', 'fox,cat,dog,parrot'))
False
>>> bool(regex.search(r'(?<!parrot.*)dog', 'fox,cat,dog,parrot'))
True
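
For comparison, the stdlib re module rejects such patterns at compile time. This holds for the Python versions available at the time of writing, so treat it as an assumption for future releases. A sketch of the error and one capture group workaround:

```python
import re

s = 'pore42 tar3 dare7 care5'

# stdlib re refuses lookbehinds whose length can vary
try:
    re.compile(r'(?<=\b[pd][a-z]*)\d+')
    compiled = True
except re.error:
    compiled = False

# a capture group gives the same extraction with re
digits = re.findall(r'\b[pd][a-z]*(\d+)', s)

print(compiled)
print(digits)
```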

Warning: As lookarounds do not consume characters, don't use variable length lookbehind between two patterns. Use negated groups instead.

# match if 'go' is not there between 'at' and 'par'

# wrong use of lookaround
>>> bool(regex.search(r'at(?<!go.*)par', 'fox,cat,dog,parrot'))
False

# correct use of negated group
>>> bool(regex.search(r'at((?!go).)*par', 'fox,cat,dog,parrot'))
True

\G anchor

The \G anchor restricts matching to the start of the string, like the \A anchor. In addition, after a match is found, the end of that match becomes the new anchor location. This process repeats until the given RE fails to match (assuming multiple matches with sub, findall, etc).

# all non-whitespace characters from start of string
>>> regex.findall(r'\G\S', '123-87-593 42 foo')
['1', '2', '3', '-', '8', '7', '-', '5', '9', '3']
>>> regex.sub(r'\G\S', '*', '123-87-593 42 foo')
'********** 42 foo'

# all digits and optional hyphen combo from start of string
>>> regex.findall(r'\G\d+-?', '123-87-593 42 foo')
['123-', '87-', '593']
>>> regex.sub(r'\G(\d+)(-?)', r'(\1)\2', '123-87-593 42 foo')
'(123)-(87)-(593) 42 foo'

# all word characters from start of string
# only if it is followed by word character
>>> regex.findall(r'\G\w(?=\w)', 'cat12 bat pin')
['c', 'a', 't', '1']
>>> regex.sub(r'\G\w(?=\w)', r'\g<0>:', 'cat12 bat pin')
'c:a:t:1:2 bat pin'

# all lowercase alphabets or space from start of string
>>> regex.sub(r'\G[a-z ]', r'(\g<0>)', 'par tar-den hen-food mood')
'(p)(a)(r)( )(t)(a)(r)-den hen-food mood'
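
One way to picture \G is as a loop that keeps matching at the position where the previous match ended and stops at the first failure. The helper below, anchored_findall, is a hypothetical name and only an approximation written with the stdlib re module. It is not how the regex module implements \G.

```python
import re

def anchored_findall(pattern, text):
    """Collect consecutive matches, each starting where the previous one ended."""
    pat = re.compile(pattern)
    pos, found = 0, []
    while True:
        m = pat.match(text, pos)       # the match must start exactly at pos
        if m is None or m.end() == pos:
            break                      # stop at the first failure (or an empty match)
        found.append(m[0])
        pos = m.end()                  # end of this match is the new anchor
    return found

print(anchored_findall(r'\d+-?', '123-87-593 42 foo'))
print(anchored_findall(r'\w(?=\w)', 'cat12 bat pin'))
```

The second call stops at '2' because the lookahead fails there, even though later words would match if the search were allowed to skip ahead.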

Recursive matching

The subexpression call special group was introduced as analogous to a function call and, in typical function fashion, it supports recursion. This is useful for matching nested patterns, a task for which regular expressions are usually not recommended. Indeed, use a proper parser library if you are looking to parse file formats like html, xml, json, csv, etc. But in some cases a parser might not be available, and using a RE might be simpler than writing a parser from scratch.

First up, a RE to match a set of parentheses that is not nested (termed as level-one RE for reference).

# note the use of possessive quantifier
>>> eqn0 = 'a + (b * c) - (d / e)'
>>> regex.findall(r'\([^()]++\)', eqn0)
['(b * c)', '(d / e)']

>>> eqn1 = '((f+x)^y-42)*((3-g)^z+2)'
>>> regex.findall(r'\([^()]++\)', eqn1)
['(f+x)', '(3-g)']

Next, matching a set of parentheses which may optionally contain any number of non-nested sets of parentheses (termed as level-two RE for reference). See debuggex for a railroad diagram, which visually shows the recursive nature of this RE.

>>> eqn1 = '((f+x)^y-42)*((3-g)^z+2)'
# note the use of non-capturing group
>>> regex.findall(r'\((?:[^()]++|\([^()]++\))++\)', eqn1)
['((f+x)^y-42)', '((3-g)^z+2)']

>>> eqn2 = 'a + (b) + ((c)) + (((d)))'
>>> regex.findall(r'\((?:[^()]++|\([^()]++\))++\)', eqn2)
['(b)', '((c))', '((d))']

That looks very cryptic. Better to use regex.X flag for clarity as well as for comparing against the recursive version. Breaking down the RE, you can see ( and ) have to be matched literally. Inside that, valid string is made up of either non-parentheses characters or a non-nested parentheses sequence (level-one RE).

>>> lvl2 = regex.compile(r'''
...          \(              #literal (
...            (?:           #start of non-capturing group
...             [^()]++      #non-parentheses characters
...             |            #OR
...             \([^()]++\)  #level-one RE
...            )++           #end of non-capturing group, 1 or more times
...          \)              #literal )
...          ''', flags=regex.X)

>>> lvl2.findall(eqn1)
['((f+x)^y-42)', '((3-g)^z+2)']

>>> lvl2.findall(eqn2)
['(b)', '((c))', '((d))']

To recursively match any number of nested sets of parentheses, use a capture group and call it within the capture group itself. Since entire RE needs to be called here, you can use the default zeroth capture group (this also helps to avoid having to use finditer). Comparing with level-two RE, the only change is that (?0) is used instead of the level-one RE in the second alternation.

>>> lvln = regex.compile(r'''
...          \(           #literal (
...            (?:        #start of non-capturing group
...             [^()]++   #non-parentheses characters
...             |         #OR
...             (?0)      #recursive call
...            )++        #end of non-capturing group, 1 or more times
...          \)           #literal )
...          ''', flags=regex.X)

>>> lvln.findall(eqn0)
['(b * c)', '(d / e)']

>>> lvln.findall(eqn1)
['((f+x)^y-42)', '((3-g)^z+2)']

>>> lvln.findall(eqn2)
['(b)', '((c))', '(((d)))']

>>> eqn3 = '(3+a) * ((r-2)*(t+2)/6) + 42 * (a(b(c(d(e)))))'
>>> lvln.findall(eqn3)
['(3+a)', '((r-2)*(t+2)/6)', '(a(b(c(d(e)))))']
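
To see what the recursive call is keeping track of, here is a plain Python sketch that finds the same top-level groups with a depth counter. The function name top_level_groups is made up for illustration, and the behaviour is not identical to lvln: it also returns an empty (), which lvln skips because it requires at least one item inside the parentheses, and unbalanced input is handled differently.

```python
def top_level_groups(text):
    """Return outermost parenthesized chunks using a nesting depth counter."""
    groups, depth, start = [], 0, None
    for i, ch in enumerate(text):
        if ch == '(':
            if depth == 0:
                start = i              # an outermost group begins here
            depth += 1
        elif ch == ')' and depth > 0:
            depth -= 1
            if depth == 0:             # the outermost group is closed
                groups.append(text[start:i + 1])
    return groups

eqn2 = 'a + (b) + ((c)) + (((d)))'
eqn3 = '(3+a) * ((r-2)*(t+2)/6) + 42 * (a(b(c(d(e)))))'
print(top_level_groups(eqn2))
print(top_level_groups(eqn3))
```

The recursive RE expresses the same bookkeeping in a single pattern, which is the trade-off this section describes.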

Named character sets

A named character set is defined by a name enclosed between [: and :] and has to be used within a character class [], along with any other characters as needed. Using [:^ instead of [: will negate the named character set. See regular-expressions: POSIX Bracket for full list, and refer to pypi: regex for notes on Unicode.

# similar to: r'\d+' or r'[0-9]+'
>>> regex.split(r'[[:digit:]]+', 'Sample123string42with777numbers')
['Sample', 'string', 'with', 'numbers']
# similar to: r'[a-zA-Z]+'
>>> regex.sub(r'[[:alpha:]]+', ':', 'Sample123string42with777numbers')
':123:42:777:'

# similar to: r'[\w\s]+'
>>> regex.findall(r'[[:word:][:space:]]+', 'tea sea-pit sit-lean\tbean')
['tea sea', 'pit sit', 'lean\tbean']
# similar to: r'\S+'
>>> regex.findall(r'[[:^space:]]+', 'tea sea-pit sit-lean\tbean')
['tea', 'sea-pit', 'sit-lean', 'bean']

# words not surrounded by punctuation characters
>>> regex.findall(r'(?<![[:punct:]])\b\w+\b(?![[:punct:]])', 'tie. ink eat;')
['ink']

Set operations

Set operators can be used inside character class between sets. Mostly used to get intersection or difference between two sets, where one/both of them is a character range or a predefined character set. To aid in such definitions, you can use [] in nested fashion. The four operators, in increasing order of precedence, are:

  • || union
  • ~~ symmetric difference
  • && intersection
  • -- difference

# [^aeiou] will match any non-vowel character
# which means space is also a valid character to be matched
>>> regex.findall(r'\b[^aeiou]+\b', 'tryst glyph pity why')
['tryst glyph ', ' why']
# intersection or difference can be used here
# to get a positive definition of characters to match
>>> regex.findall(r'\b[a-z&&[^aeiou]]+\b', 'tryst glyph pity why')
['tryst', 'glyph', 'why']

# [[a-l]~~[g-z]] is same as [a-fm-z]
>>> regex.findall(r'\b[[a-l]~~[g-z]]+\b', 'gets eat top sigh')
['eat', 'top']

# remove all punctuation characters except . ! and ?
>>> para = '"Hi", there! How *are* you? All fine here.'
>>> regex.sub(r'[[:punct:]--[.!?]]+', '', para)
'Hi there! How are you? All fine here.'
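
For comparison, similar results are possible with the stdlib re module, at some cost in readability. In this sketch, a negative lookahead stands in for the intersection, and the difference is built by hand from string.punctuation, which covers ASCII punctuation only (an assumption that happens to be enough for this sample).

```python
import re
import string

words = 'tryst glyph pity why'
# (?![aeiou])[a-z] : a lowercase letter that is not a vowel
consonant_words = re.findall(r'\b(?:(?![aeiou])[a-z])+\b', words)

para = '"Hi", there! How *are* you? All fine here.'
# ASCII punctuation minus the characters to keep
drop = ''.join(c for c in string.punctuation if c not in '.!?')
cleaned = re.sub('[' + re.escape(drop) + ']+', '', para)

print(consonant_words)
print(cleaned)
```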

Note: These set operators may get added to the re module in the future.

Unicode character sets

Similar to named character classes and escape sequence character sets, the regex module also supports \p{} construct that offers various predefined sets to work with Unicode strings. See regular-expressions: Unicode for details.

# extract all consecutive letters
>>> regex.findall(r'\p{L}+', 'fox:αλεπού,eagle:αετός')
['fox', 'αλεπού', 'eagle', 'αετός']
# extract all consecutive Greek letters
>>> regex.findall(r'\p{Greek}+', 'fox:αλεπού,eagle:αετός')
['αλεπού', 'αετός']

# extract all words
>>> regex.findall(r'\p{Word}+', 'φοο12,βτ_4,foo')
['φοο12', 'βτ_4', 'foo']

# delete all characters other than letters
# \p{^L} can also be used instead of \P{L}
>>> regex.sub(r'\P{L}+', '', 'φοο12,βτ_4,foo')
'φοοβτfoo'

Skipping matches

Sometimes, you want to change or extract all matches except particular matches. Usually, there are common characteristics between the two types of matches that make it hard or impossible to define a RE only for the required matches. For example, changing field values unless the value is a particular name, or leaving double quoted values untouched. To use the skipping feature, define the matches to be ignored suffixed by (*SKIP)(*FAIL) and then define the required matches as part of an alternation. (*F) can also be used instead of (*FAIL).

# change lowercase whole words other than imp or rat
>>> words = 'tiger imp goat eagle rat'
>>> regex.sub(r'\b(?:imp|rat)\b(*SKIP)(*F)|[a-z]++', r'(\g<0>)', words)
'(tiger) imp (goat) (eagle) rat'

# change all commas other than those inside double quotes
>>> row = '1,"cat,12",nice,two,"dog,5"'
>>> regex.sub(r'"[^"]++"(*SKIP)(*F)|,', '|', row)
'1|"cat,12"|nice|two|"dog,5"'
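
A common way to approximate this with the stdlib re module is to match the unwanted portion first in an alternation and let a replacement function put it back unchanged. This is a sketch of that workaround, not a general equivalent of (*SKIP)(*F).

```python
import re

row = '1,"cat,12",nice,two,"dog,5"'
# group 1 captures quoted fields, which are returned as they are;
# a bare comma leaves group 1 unset and gets replaced
result = re.sub(r'("[^"]+")|,', lambda m: m[1] if m[1] else '|', row)

words = 'tiger imp goat eagle rat'
# group 1 captures the words to skip; other lowercase words get wrapped
marked = re.sub(r'\b(imp|rat)\b|[a-z]+',
                lambda m: m[1] if m[1] else f'({m[0]})', words)

print(result)
print(marked)
```

With (*SKIP)(*F) the replacement section stays a plain string, since the skipped text never becomes part of a match.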

\m and \M word anchors

\m and \M anchors match only the start and end of word respectively.

>>> regex.sub(r'\b', ':', 'hi log_42 12b')
':hi: :log_42: :12b:'
>>> regex.sub(r'\m', ':', 'hi log_42 12b')
':hi :log_42 :12b'
>>> regex.sub(r'\M', ':', 'hi log_42 12b')
'hi: log_42: 12b:'

>>> regex.sub(r'\b..\b', r'[\g<0>]', 'I have 12, he has 2!')
'[I ]have [12][, ][he] has[ 2]!'
>>> regex.sub(r'\m..\M', r'[\g<0>]', 'I have 12, he has 2!')
'I have [12], [he] has 2!'

Overlapped matches

findall and finditer support the optional overlapped argument. Setting it to True gives you overlapped matches.

>>> words = 'on vast ever road lane at peak'
>>> regex.findall(r'\b\w+ \w+\b', words)
['on vast', 'ever road', 'lane at']
>>> regex.findall(r'\b\w+ \w+\b', words, overlapped=True)
['on vast', 'vast ever', 'ever road', 'road lane', 'lane at', 'at peak']

>>> regex.findall(r'\w{2}', 'apple', overlapped=True)
['ap', 'pp', 'pl', 'le']
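
Without the overlapped argument, a similar effect is usually obtained by placing the pattern inside a capturing lookahead, so that each match consumes no characters. A sketch with the stdlib re module:

```python
import re

words = 'on vast ever road lane at peak'

# the lookahead captures two words but consumes nothing,
# so the next search can start at the following word
pairs = re.findall(r'\b(?=(\w+ \w+)\b)', words)

# every pair of adjacent word characters
bigrams = re.findall(r'(?=(\w{2}))', 'apple')

print(pairs)
print(bigrams)
```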

regex.REVERSE flag

The regex.R or regex.REVERSE flag will result in right-to-left processing instead of the usual left-to-right order.

>>> words = 'par spare lion part cool'

# replaces first match
>>> regex.sub(r'par', 'co', words, count=1)
'co spare lion part cool'
# replaces last match
>>> regex.sub(r'par', 'co', words, count=1, flags=regex.R)
'par spare lion cot cool'

>>> regex.findall(r'(?r)\w+', words)
['cool', 'part', 'lion', 'spare', 'par']
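
For the specific task of replacing only the last match, a greedy prefix is a common stdlib re workaround. The sketch below is an approximation: reversing the findall output gives the same list for this pattern, but right-to-left matching can produce different matches for other patterns.

```python
import re

words = 'par spare lion part cool'

# greedy (.*) grabs as much as possible, so 'par' is found at its last position
last_replaced = re.sub(r'(.*)par', r'\1co', words, count=1)

# left-to-right matches listed in reverse order
reversed_words = re.findall(r'\w+', words)[::-1]

print(last_replaced)
print(reversed_words)
```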

\X vs dot metacharacter

Some characters have more than one codepoint. These are handled in Unicode with grapheme clusters. The dot metacharacter will only match one codepoint at a time. You can use \X to match any character (including newline), even if it has multiple codepoints.

>>> [c.encode('unicode_escape') for c in 'g̈']
[b'g', b'\\u0308']

>>> regex.sub(r'a.e', 'o', 'cag̈ed')
'cag̈ed'
>>> regex.sub(r'a..e', 'o', 'cag̈ed')
'cod'
>>> regex.sub(r'a\Xe', 'o', 'cag̈ed')
'cod'

# \X will match newline character as well
>>> regex.sub(r'e.a', 'ea', 'he\nat', flags=regex.S)
'heat'
>>> regex.sub(r'e\Xa', 'ea', 'he\nat')
'heat'
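
To see why the dot needs two steps here, the codepoints can be grouped by hand. The helper simple_clusters is a hypothetical name, and attaching combining marks to the preceding character is only a rough approximation of grapheme clusters. \X is meant to follow the fuller Unicode rules.

```python
import unicodedata

def simple_clusters(text):
    """Group each base character with the combining marks that follow it."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch         # attach the mark to the previous character
        else:
            clusters.append(ch)
    return clusters

word = 'cag\u0308ed'                  # 'g' followed by COMBINING DIAERESIS
print(len(word))
print(len(simple_clusters(word)))
```

The string has one more codepoint than it has visible characters, which is why a.e fails and a..e succeeds in the examples above.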

Cheatsheet and Summary

pypi: regex         third party module, has lots of advanced features
                    default is VERSION0 which is compatible with re module
(?V1)               inline flag to enable version 1 for regex module
                    regex.DEFAULT_VERSION=regex.VERSION1 can also be used
                    (?V0) or regex.VERSION0 to get back default version
possessive          enabled by appending + to greedy quantifier
                    like greedy, but no backtracking
(?>pat)             atomic grouping, similar to possessive quantifier
(?N)                subexpression call for Nth capture group
(?&name)            subexpression call for named capture group
                    subexpression call is similar to functions, recursion also possible
                    r'\((?:[^()]++|(?0))++\)' matches nested sets of parentheses
pat\K               pat won't be part of matching portion
                    \K is used similar to positive lookbehind
                    regex module allows variable length lookbehinds
\G                  restricts matching from start of string like \A
                    continues matching from end of match as new anchor until it fails
                    regex.findall(r'\G\d+-?', '12-34 42') gives ['12-', '34']
[[:digit:]]         named character set for \d
[[:^digit:]]        to indicate \D
                    See regular-expressions: POSIX Bracket for full list
set operations      feature for character classes, nested [] allowed
                    || union, ~~ symmetric difference
                    && intersection, -- difference
                    [[:punct:]--[.!?]] punctuation except . ! and ?
\p{}                Unicode character sets provided by regex module
                    see regular-expressions: Unicode for details
\P{L} or \p{^L}     match characters other than \p{L} set
pat(*SKIP)(*F)      ignore text matched by pat
                    "[^"]++"(*SKIP)(*F)|, will match , but not inside
                    double quoted pairs
\m and \M           anchors for start and end of word respectively
overlapped          set as True to match overlapping portions
regex.R             REVERSE flag to match from right-to-left
\X                  matches any character even if it has multiple codepoints
                    \X will also match newline characters by default
                    whereas . requires re.S flag to match newline character

The regex module provides many more features, and some of them have not been covered in this chapter, for example, fuzzy matching and splititer. See pypi: regex for details and examples. For those familiar with Perl style regular expressions, this module offers an easier transition compared to the re module.

Exercises


a) Filter all elements whose first non-whitespace character is not a # character. Any element made up of only whitespace characters should be ignored as well.

>>> items = ['    #comment', '\t\napple #42', '#oops', 'sure', 'no#1', '\t\r\f']

##### add your solution here
['\t\napple #42', 'sure', 'no#1']

b) Replace sequences made up of words separated by : or . by the first word of the sequence and the separator. Such sequences will end when : or . is not followed by a word character.

>>> ip = 'wow:Good:2_two:five: hi bye kite.777.water.'

##### add your solution here
'wow: hi bye kite.'

c) The given list of strings has fields separated by : character. Delete : and the last field if there is a digit character anywhere before the last field.

>>> items = ['42:cat', 'twelve:a2b', 'we:be:he:0:a:b:bother']

##### add your solution here
['42', 'twelve:a2b', 'we:be:he:0:a:b']

d) Extract all whole words unless they are preceded by : or <=> or ---- or #.

>>> ip = '::very--at<=>row|in.a_b#b2c=>lion----east'

##### add your solution here
['at', 'in', 'a_b', 'lion']

e) The given input string has fields separated by : character. Extract all fields if the previous field contains a digit character.

>>> ip = 'vast:a2b2:ride:in:awe:b2b:3list:end'

##### add your solution here
['ride', '3list', 'end']

f) The given input string has fields separated by : character. Delete all fields, including the separator, unless the field contains a digit character. Stop deleting once a field with digit character is found.

>>> row1 = 'vast:a2b2:ride:in:awe:b2b:3list:end'
>>> row2 = 'um:no:low:3e:s4w:seer'

>>> pat = regex.compile()      ##### add your solution here

>>> pat.sub('', row1)
>>> pat.sub('', row2)

g) For the given input strings, extract the text if followed by any number of nested parentheses. Assume that there will be only one such pattern per input string.

>>> ip1 = 'for (((i*3)+2)/6) if(3-(k*3+4)/12-(r+2/3)) while()'
>>> ip2 = 'if+while if(a(b)c(d(e(f)1)2)3) for(i=1)'

>>> pat = regex.compile()       ##### add your solution here


h) Read about the POSIX flag from pypi: regex. Is the following code snippet showing the correct output?

>>> words = 'plink incoming tint winter in caution sentient'

>>> change = regex.compile(r'int|in|ion|ing|inco|inter|ink', flags=regex.POSIX)

>>> change.sub('X', words)
'plX XmX tX wX X cautX sentient'

i) Extract all whole words for the given input strings. However, based on user input ignore, do not match words if they contain any character present in the ignore variable.

>>> s1 = 'match after the last newline character'
>>> s2 = 'and then you want to test'

>>> ignore = 'aty'
>>> regex.findall()     ##### add your solution here for s1
>>> regex.findall()     ##### add your solution here for s2

>>> ignore = 'esw'
>>> regex.findall()     ##### add your solution here for s1
>>> regex.findall()     ##### add your solution here for s2
['and', 'you', 'to']

j) Retain only punctuation characters for the given strings (generated from codepoints). Use Unicode character set definition for punctuation for solving this exercise.

>>> s1 = ''.join(chr(c) for c in range(0, 0x80))
>>> s2 = ''.join(chr(c) for c in range(0x80, 0x100))
>>> s3 = ''.join(chr(c) for c in range(0x2600, 0x27ec))

>>> pat = regex.compile()       ##### add your solution here

>>> pat.sub('', s1)
>>> pat.sub('', s2)
>>> pat.sub('', s3)

k) For the given markdown file, replace all occurrences of the string python (irrespective of case) with the string Python. However, any match within code blocks that start with whole line ```python and end with whole line ``` shouldn't be replaced. Consider the input file to be small enough to fit memory requirements.

Refer to github: exercises folder for the files required to solve this exercise.

>>> ip_str = open('', 'r').read()
>>> pat = regex.compile()      ##### add your solution here
>>> with open('', 'w') as op_file:
...     ##### add your solution here
>>> assert open('').read() == open('').read()

l) For the given input strings, construct a word that is made up of last characters of all the words in the input. Use last character of last word as first character, last character of last but one word as second character and so on.

>>> s1 = 'knack tic pi roar what'
>>> s2 = '42;rod;t2t2;car'

>>> pat = regex.compile()       ##### add your solution here

##### add your solution here for s1
##### add your solution here for s2

m) Replicate str.rpartition functionality with regular expressions. Split into three parts based on last match of sequences of digits, which is 777 and 12 for the given input strings.

>>> s1 = 'Sample123string42with777numbers'
>>> s2 = '12apples'

##### add your solution here for s1
['Sample123string42with', '777', 'numbers']
##### add your solution here for s2
['', '12', 'apples']

n) Read about fuzzy matching on pypi: regex. For the given input strings, return True if they are exactly the same as cat or if there is exactly one character difference. Ignore case when comparing differences. For example, Ca2 should give True. act will be False even though the characters are the same because position should be maintained.

>>> pat = regex.compile()       ##### add your solution here

>>> bool(pat.fullmatch('CaT'))
>>> bool(pat.fullmatch('scat'))
>>> bool(pat.fullmatch('ca.'))
>>> bool(pat.fullmatch('ca#'))
>>> bool(pat.fullmatch('c#t'))
>>> bool(pat.fullmatch('at'))
>>> bool(pat.fullmatch('act'))
>>> bool(pat.fullmatch('2a1'))