Unicode

The examples so far had input strings made up of ASCII characters only. However, the re module's matching is Unicode by default. See docs.python: Unicode for a tutorial on Unicode support in Python. This chapter will briefly discuss a few things related to Unicode matching.

re.ASCII

Flags can be used to override the default Unicode setting. The re.A or re.ASCII flag will change \b, \w, \d, \s and their opposites to match only based on ASCII characters.

# \w is Unicode aware
>>> re.findall(r'\w+', 'fox:αλεπού')
['fox', 'αλεπού']

# restrict matching to only ASCII characters
>>> re.findall(r'\w+', 'fox:αλεπού', flags=re.A)
['fox']
# or, explicitly define the characters to match using character class
>>> re.findall(r'[a-zA-Z0-9_]+', 'fox:αλεπού')
['fox']

However, with an explicit character class like [a-z] or [A-Z], the four non-ASCII characters shown in the code snippet below are also matched when re.I is used without the re.A flag. Here's the relevant quote from the docs:

Note that when the Unicode patterns [a-z] or [A-Z] are used in combination with the IGNORECASE flag, they will match the 52 ASCII letters and 4 additional non-ASCII letters: İ (U+0130, Latin capital letter I with dot above), ı (U+0131, Latin small letter dotless i), ſ (U+017F, Latin small letter long s) and K (U+212A, Kelvin sign). If the ASCII flag is used, only letters a to z and A to Z are matched.

>>> bool(re.search(r'[a-zA-Z]', 'İıſK'))
False

>>> re.search(r'[a-z]+', 'İıſK', flags=re.I)[0]
'İıſK'

>>> bool(re.search(r'[a-z]', 'İıſK', flags=re.I|re.A))
False
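The \d escape listed at the start of this section is affected by re.A in the same way as \w. Here's a minimal sketch, assuming a made-up sample string that contains Arabic-Indic digits (U+0664 and U+0662), which Unicode classifies as decimal digits. The variable names are for illustration only.

```python
import re

# illustrative sample: ASCII digits followed by Arabic-Indic digits
nums = '42 \u0664\u0662'

# \d is Unicode aware by default, so both runs of digits are matched
unicode_digits = re.findall(r'\d+', nums)
print(len(unicode_digits))   # 2

# with re.A, only the digits 0-9 are matched
ascii_digits = re.findall(r'\d+', nums, flags=re.A)
print(ascii_digits)          # ['42']

# the inline flag (?a) has the same effect as passing flags=re.A
inline_digits = re.findall(r'(?a)\d+', nums)
print(inline_digits)         # ['42']
```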

Use re.L or re.LOCALE to make \w, \b and case-insensitive matching depend on the current locale settings. This flag can be used only with patterns of the bytes data type.
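As a rough sketch of how bytes patterns differ from str patterns: bytes patterns match based on ASCII by default, and re.L is rejected for str patterns. The result of matching with re.L depends on the locale of the machine running the code, so no output is shown for that case. The variable names here are for illustration only.

```python
import re

# bytes patterns are ASCII based by default, no flag needed
data = 'fox:αλεπού'.encode('utf-8')
ascii_words = re.findall(rb'\w+', data)
print(ascii_words)       # [b'fox']

# re.L is accepted for bytes patterns
# what this pattern matches depends on the current locale
locale_pat = re.compile(rb'\w+', flags=re.L)

# re.L is rejected for str patterns
try:
    re.compile(r'\w+', flags=re.L)
    locale_with_str = 'accepted'
except ValueError:
    locale_with_str = 'rejected'
print(locale_with_str)   # rejected
```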

Codepoints and Unicode escapes

You can use escapes \u and \U to specify Unicode characters with 4 and 8 hexadecimal digits respectively. You'll also see how to get codepoints (numerical value of a character) in the illustration below.

# to get codepoints, ASCII characters used here (ord works for any character)
>>> [ord(c) for c in 'fox']
[102, 111, 120]
>>> [hex(ord(c)) for c in 'fox']
['0x66', '0x6f', '0x78']

# to get codepoints in \u escape form for non-ASCII characters
>>> [c.encode('unicode_escape') for c in 'αλεπού']
[b'\\u03b1', b'\\u03bb', b'\\u03b5', b'\\u03c0', b'\\u03bf', b'\\u03cd']
>>> [c.encode('unicode_escape') for c in 'İıſK']
[b'\\u0130', b'\\u0131', b'\\u017f', b'\\u212a']

# character range example using \u
# English lowercase letters
>>> re.findall(r'[\u0061-\u007a]+', 'fox:αλεπού,eagle:αετός')
['fox', 'eagle']
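The snippet below is a small sketch extending the illustration above: ord() applied to non-ASCII characters, chr() as its inverse, a \U escape for a character beyond U+FFFF, and a \u range covering the Greek and Coptic block (U+0370 to U+03FF). The choice of that block and of the emoji character is an assumption made for this example.

```python
import re

# ord() works for any character, not only ASCII
greek_codepoints = [hex(ord(c)) for c in 'αλ']
print(greek_codepoints)   # ['0x3b1', '0x3bb']

# chr() converts a codepoint back to a character
alpha = chr(0x3b1)
print(alpha)              # α

# characters beyond U+FFFF need \U with 8 hexadecimal digits
emoji = '\U0001F600'
print(hex(ord(emoji)))    # 0x1f600

# character range covering the Greek and Coptic block
text = 'fox:αλεπού,eagle:αετός'
greek_words = re.findall(r'[\u0370-\u03ff]+', text)
print(greek_words)        # ['αλεπού', 'αετός']
```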

See also: codepoints.net, a site dedicated to Unicode characters.

\N escape sequence

You can also specify a Unicode character using the \N{name} escape sequence. See unicode: NamesList for a full list of names. From the Python docs:

Changed in version 3.8: The \N{name} escape sequence has been added. As in string literals, it expands to the named Unicode character (e.g. \N{EM DASH}).

# can also use '\N{em dash}'
>>> '\N{EM DASH}'
'—'

>>> '\N{LATIN SMALL LETTER TURNED DELTA}'
'ƍ'
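The examples above use \N{name} in ordinary string literals. As a minimal sketch, the same escape can also be used inside a raw-string RE (assuming Python 3.8 or later, as per the quote from the docs), and the standard unicodedata module can convert between characters and names. The sample string is made up for illustration.

```python
import re
import unicodedata

# illustrative sample string containing an em dash
sample = '123\N{EM DASH}456'

# get the name of a character, and a character from its name
dash_name = unicodedata.name('\N{EM DASH}')
print(dash_name)    # EM DASH
dash_char = unicodedata.lookup('EM DASH')
print(dash_char == '\N{EM DASH}')    # True

# in a raw string, \N{name} is interpreted by the re module itself
replaced = re.sub(r'\N{EM DASH}', '-', sample)
print(replaced)     # 123-456
```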

Cheatsheet and Summary

Note	Description
docs.python: Unicode	tutorial on Unicode support in Python
`re.ASCII` or `re.A`	match only ASCII characters for `\b`, `\w`, `\d`, `\s`
	and their opposites, when using Unicode patterns
`re.LOCALE` or `re.L`	use locale settings for byte patterns and 8-bit locales
`İıſK`	characters that can match if `re.I` is used but not `re.A`
`ord(c)`	get codepoint (as an integer) for character `c`
`c.encode('unicode_escape')`	get codepoint in escape form (such as `\uXXXX`) for non-ASCII character `c`
`\uXXXX`	codepoint defined using 4 hexadecimal digits
`\UXXXXXXXX`	codepoint defined using 8 hexadecimal digits
`\N{name}`	Unicode character defined by its name
	See unicode: NamesList for full list

A comprehensive discussion of RE usage with Unicode characters is out of scope for this book. Resources like regular-expressions: unicode and Programmer's introduction to Unicode are recommended for further study. See also the Unicode character sets section.

Exercises

a) Output True or False depending on whether the input string is made up of ASCII characters only. Assume that the input strings are non-empty. Any character that isn't part of the 7-bit ASCII set should give False. Do you need regular expressions for this?

>>> str1 = '123—456'
>>> str2 = 'good fοοd'
>>> str3 = 'happy learning!'
>>> str4 = 'İıſK'
>>> str5 = 'àpple'

##### add your solution here for str1
False
##### add your solution here for str2
False
##### add your solution here for str3
True
##### add your solution here for str4
False
##### add your solution here for str5
False

b) Does the . metacharacter match non-ASCII characters even with the re.ASCII flag enabled?

c) Explore the following stackoverflow Q&A threads.

Understanding Python re(gex)?