Unicode
The examples so far had input strings made up of ASCII characters only. However, the re module's matching is Unicode by default. See docs.python: Unicode for a tutorial on Unicode support in Python. This chapter will briefly discuss a few things related to Unicode matching.
re.ASCII
Flags can be used to override the default Unicode setting. The re.A or re.ASCII flag will change \b, \w, \d, \s and their opposites to match based on ASCII characters only.
# \w is Unicode aware
>>> re.findall(r'\w+', 'fox:αλεπού')
['fox', 'αλεπού']
# restrict matching to only ASCII characters
>>> re.findall(r'\w+', 'fox:αλεπού', flags=re.A)
['fox']
# or, explicitly define the characters to match using character class
>>> re.findall(r'[a-zA-Z0-9_]+', 'fox:αλεπού')
['fox']
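The flag affects \d and \b in the same way. Here is a small sketch of that behavior; the sample strings (Arabic-Indic digits, two Greek letters) and the variable names are illustrative choices, not taken from the examples above.

```python
import re

# ASCII digits followed by Arabic-Indic digits four and two (U+0664, U+0662)
text = '42 \u0664\u0662'

# by default, \d matches any Unicode decimal digit
unicode_digits = re.findall(r'\d+', text)
# with re.A, only 0-9 are matched
ascii_digits = re.findall(r'\d+', text, flags=re.A)
print(unicode_digits, ascii_digits)

# \b depends on \w, so word boundaries around non-ASCII letters
# disappear when re.A is used
greek = 'αβ'
marked_default = re.sub(r'\b', ':', greek)
marked_ascii = re.sub(r'\b', ':', greek, flags=re.A)
print(marked_default, marked_ascii)
```

With re.A, the Greek letters are no longer word characters, so the second substitution should leave the string unchanged.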
However, the four characters shown in the code snippet below are also matched when re.I is used without the re.A flag. Here's the relevant quote from the docs:
Note that when the Unicode patterns [a-z] or [A-Z] are used in combination with the IGNORECASE flag, they will match the 52 ASCII letters and 4 additional non-ASCII letters: İ (U+0130, Latin capital letter I with dot above), ı (U+0131, Latin small letter dotless i), ſ (U+017F, Latin small letter long s) and K (U+212A, Kelvin sign). If the ASCII flag is used, only letters a to z and A to Z are matched.
>>> bool(re.search(r'[a-zA-Z]', 'İıſK'))
False
>>> re.search(r'[a-z]+', 'İıſK', flags=re.I)[0]
'İıſK'
>>> bool(re.search(r'[a-z]', 'İıſK', flags=re.I|re.A))
False
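One way to get a feel for why these four characters are special is to look at their case mappings with plain string methods. This is only an illustration using str.upper() and str.lower(); it is not a description of how the re module implements case-insensitive matching, and the variable names are made up for this sketch.

```python
# the four characters, written as escapes to avoid look-alike confusion
dotted_I, dotless_i, long_s, kelvin = '\u0130', '\u0131', '\u017f', '\u212a'

# three of them map straight to an ASCII letter when the case is changed
upper_dotless = dotless_i.upper()   # ASCII 'I'
upper_long_s = long_s.upper()       # ASCII 'S'
lower_kelvin = kelvin.lower()       # ASCII 'k'

# the dotted capital I lowercases to 'i' plus a combining dot above (U+0307)
lower_dotted = dotted_I.lower()

print(upper_dotless, upper_long_s, lower_kelvin, ascii(lower_dotted))
```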
Use re.L or re.LOCALE to work based on the locale settings for the bytes data type.
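A short sketch of how bytes patterns behave, assuming UTF-8 encoded input. What re.L actually matches depends on the current locale, so no output is shown for it; the variable names are illustrative.

```python
import re

data = 'fox:αλεπού'.encode('utf-8')

# for bytes patterns, \w is ASCII-only by default
words = re.findall(rb'\w+', data)
print(words)

# re.L is accepted for bytes patterns
# (what it matches depends on the locale in effect)
locale_pat = re.compile(rb'\w+', flags=re.L)

# but it is rejected for str patterns
try:
    re.compile(r'\w+', flags=re.L)
    str_pattern_error = None
except ValueError as e:
    str_pattern_error = str(e)
print(str_pattern_error)
```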
Codepoints and Unicode escapes
You can use the escapes \u and \U to specify Unicode characters with 4 and 8 hexadecimal digits respectively. You'll also see how to get codepoints (the numerical value of a character) in the illustration below.
# to get codepoints for ASCII characters
>>> [ord(c) for c in 'fox']
[102, 111, 120]
>>> [hex(ord(c)) for c in 'fox']
['0x66', '0x6f', '0x78']
# to get codepoints for Unicode characters
>>> [c.encode('unicode_escape') for c in 'αλεπού']
[b'\\u03b1', b'\\u03bb', b'\\u03b5', b'\\u03c0', b'\\u03bf', b'\\u03cd']
>>> [c.encode('unicode_escape') for c in 'İıſK']
[b'\\u0130', b'\\u0131', b'\\u017f', b'\\u212a']
# character range example using \u
# English lowercase letters
>>> re.findall(r'[\u0061-\u007a]+', 'fox:αλεπού,eagle:αετός')
['fox', 'eagle']
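Note that ord() is not limited to ASCII input, and chr() goes the other way. The sketch below also shows \U, which is needed for codepoints above U+FFFF. The emoji range used here is a rough, hand-picked assumption for illustration, not an official definition of emoji.

```python
import re

# ord() works for any single character
cp = ord('α')
hex_cp = f'{cp:04x}'      # zero-padded to 4 digits, as needed after \u
same_char = chr(0x3b1)
print(cp, hex_cp, same_char)

# \U takes exactly 8 hexadecimal digits
text = 'fox \U0001F98A'   # U+1F98A is FOX FACE
found = re.findall(r'[\U0001F300-\U0001FAFF]', text)
print(found)
```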
See also: codepoints.net, a site dedicated to Unicode characters.
\N escape sequence
You can also specify a Unicode character using the \N{name} escape sequence. See unicode: NamesList for a full list of names. From the Python docs:
Changed in version 3.8: The \N{name} escape sequence has been added. As in string literals, it expands to the named Unicode character (e.g. \N{EM DASH}).
# can also use '\N{em dash}'
>>> '\N{EM DASH}'
'—'
>>> '\N{LATIN SMALL LETTER TURNED DELTA}'
'ƍ'
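The snippet above uses \N{name} in ordinary string literals. As a minimal sketch, assuming Python 3.8 or later, the same escape can be used inside a raw-string regular expression, and the unicodedata module can look up the name of a character. The sample strings are made up for illustration.

```python
import re
import unicodedata

# sample text with an en dash (U+2013) and an em dash (U+2014)
text = 'a\u2013b\u2014c'

# \N{name} is handled by the re module itself here, since the pattern is raw
normalized = re.sub(r'[\N{EN DASH}\N{EM DASH}]', '-', text)
parts = re.split(r'\N{EM DASH}', '123\u2014456')
print(normalized, parts)

# going from a character back to its name
name = unicodedata.name('\u2014')
print(name)
```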
Cheatsheet and Summary
Note | Description |
---|---|
docs.python: Unicode | tutorial on Unicode support in Python |
re.ASCII or re.A | match only ASCII characters for \b, \w, \d, \s and their opposites, when using Unicode patterns |
re.LOCALE or re.L | use locale settings for byte patterns and 8-bit locales |
İıſK | characters that can match if re.I is used but not re.A |
ord(c) | get codepoint (as an integer) for character c |
c.encode('unicode_escape') | get escape sequence (with the codepoint in hexadecimal) for non-ASCII character c |
\uXXXX | codepoint defined using 4 hexadecimal digits |
\UXXXXXXXX | codepoint defined using 8 hexadecimal digits |
\N{name} | Unicode character defined by its name; see unicode: NamesList for the full list |
A comprehensive discussion on RE usage with Unicode characters is out of scope for this book. Resources like regular-expressions: unicode and Programmers introduction to Unicode are recommended for further study. See also the Unicode character sets section.
Exercises
a) Output True or False depending on whether the input string is made up of ASCII characters only. Assume the input strings are non-empty; any character that isn't part of the 7-bit ASCII set should give False. Do you need regular expressions for this?
>>> str1 = '123—456'
>>> str2 = 'good fοοd'
>>> str3 = 'happy learning!'
>>> str4 = 'İıſK'
>>> str5 = 'àpple'
##### add your solution here for str1
False
##### add your solution here for str2
False
##### add your solution here for str3
True
##### add your solution here for str4
False
##### add your solution here for str5
False
b) Does the . metacharacter match non-ASCII characters even with the re.ASCII flag enabled?
c) Explore the following stackoverflow Q&A threads.