Unicode

The examples so far had input strings made up of ASCII characters only. The Regexp class uses the source encoding by default, and the default string encoding is UTF-8. See ruby-doc: Encoding for details on working with different string encodings.
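To see what encoding a string actually carries, String#encoding, String#valid_encoding?, String#length and String#bytesize are handy. The following is a minimal sketch: the helper name encoding_report is made up for illustration, and the values in the comments assume the source file is saved as UTF-8.

```ruby
# Hypothetical helper (not from the book): summarize how Ruby sees a string.
def encoding_report(str)
  {
    encoding: str.encoding.name,  # e.g. "UTF-8"
    valid: str.valid_encoding?,   # false if the bytes are malformed for that encoding
    chars: str.length,            # number of codepoints
    bytes: str.bytesize           # number of bytes
  }
end

# ASCII only: one byte per character (14 characters, 14 bytes)
p encoding_report('apple - banana')
# each Greek letter here takes two bytes in UTF-8 (10 characters, 16 bytes)
p encoding_report('fox:αλεπού')
```

The gap between the character count and the byte count is a quick hint that a string contains non-ASCII characters.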

Encoding modifiers

Modifiers can be used to override the default encoding. For example, the n modifier uses ASCII-8BIT instead of the source encoding.

# example with ASCII characters only
>> 'apple - banana'.gsub(/\w+/n, '(\0)')
=> "(apple) - (banana)"

# example with non-ASCII characters as well
>> 'fox:αλεπού'.scan(/\w+/n)
(irb):2: warning: historical binary regexp match /.../n against UTF-8 string
=> ["fox"]
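The n modifier is a better fit for raw bytes than for UTF-8 text. Here is a small sketch, assuming the input is first tagged as binary with String#b. The helper name ascii_words and the sample bytes are made up for illustration.

```ruby
# Hypothetical helper (not from the book): extract ASCII word runs from raw bytes.
# String#b returns a copy tagged as ASCII-8BIT, so matching it against an /n
# regexp does not trigger the "historical binary regexp" warning.
def ascii_words(bytes)
  bytes.b.scan(/\w+/n)
end

# sample bytes: 0xFF, 0x00 and 0xFE are not valid UTF-8 on their own
raw = "\xFF\x00apple\xFE - banana".b

p ascii_words(raw)        # the two ASCII words: apple and banana
p raw.match?(/\xFF/n)     # true, a byte escape can be matched directly
```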

Character set escapes like \w match only ASCII characters whereas named character sets are Unicode aware. You can also use the (?u) inline modifier to allow character set escapes to match Unicode characters.

>> 'fox:αλεπού'.scan(/\w+/)
=> ["fox"]

>> 'fox:αλεπού'.scan(/[[:word:]]+/)
=> ["fox", "αλεπού"]

>> 'fox:αλεπού'.scan(/(?u)\w+/)
=> ["fox", "αλεπού"]
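The same split applies to digits: per the Ruby Regexp documentation, \d matches only the ASCII digits 0-9, whereas [[:digit:]] matches any character in the Unicode Nd category. The sketch below uses a made-up sample string with Arabic-Indic digits, and the helper names are hypothetical.

```ruby
# Hypothetical helpers (not from the book) contrasting \d with [[:digit:]].
def ascii_digit_runs(str)
  str.scan(/\d+/)
end

def unicode_digit_runs(str)
  str.scan(/[[:digit:]]+/)
end

# ASCII digits followed by Arabic-Indic digits U+0663 and U+0664,
# written with \u{} so that the codepoints are unambiguous
s = "42:\u{663 664}"

p ascii_digit_runs(s)     # only the ASCII run "42"
p unicode_digit_runs(s)   # both runs: "42" and the Arabic-Indic digits
```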

See ruby-doc: Regexp Encoding for other such modifiers and details.

Unicode character sets

Similar to named character classes and escape sequences, the \p{} construct offers various predefined sets that work with Unicode strings. See ruby-doc: Unicode Properties for the full list and other details.

# extract all consecutive letters
>> 'fox:αλεπού,eagle:αετός'.scan(/\p{L}+/)
=> ["fox", "αλεπού", "eagle", "αετός"]
# extract all consecutive Greek letters
>> 'fox:αλεπού,eagle:αετός'.scan(/\p{Greek}+/)
=> ["αλεπού", "αετός"]

# extract all words
>> 'φοο12,βτ_4;cat'.scan(/\p{Word}+/)
=> ["φοο12", "βτ_4", "cat"]

# delete all characters other than letters
# \p{^L} can also be used instead of \P{L}
>> 'φοο12,βτ_4;cat'.gsub(/\P{L}+/, '')
=> "φοοβτcat"
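Several properties can be applied to the same input to sort its contents into buckets. The sketch below is only an illustration: the helper name split_by_property is made up, and \p{Latin} and \p{N} are assumed here as the script and general category names for Latin letters and numbers.

```ruby
# Hypothetical helper (not from the book): bucket runs of characters
# by script (Greek, Latin) and by general category (N for numbers).
def split_by_property(str)
  {
    greek:  str.scan(/\p{Greek}+/),
    latin:  str.scan(/\p{Latin}+/),
    number: str.scan(/\p{N}+/)
  }
end

s = 'φοο12,βτ_4;cat'
p split_by_property(s)
# greek  -> "φοο" and "βτ"
# latin  -> "cat"
# number -> "12" and "4"

# \p{^L} and \P{L} are two spellings of the same negated set
p s.gsub(/\p{^L}+/, '') == s.gsub(/\P{L}+/, '')   # true
```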

Codepoints and Unicode escapes

For generic Unicode character ranges, you can specify codepoints using the \u{} construct. The snippet below also shows how to get codepoints (the numerical value of a character) in Ruby.

# to get codepoints from string
>> 'fox:αλεπού'.codepoints.map { '%x' % _1 }
=> ["66", "6f", "78", "3a", "3b1", "3bb", "3b5", "3c0", "3bf", "3cd"]
# one or more codepoints can be specified inside \u{}
>> puts "\u{66 6f 78 3a 3b1 3bb 3b5 3c0 3bf 3cd}"
fox:αλεπού

# character range example using \u{}
# all english lowercase letters
>> 'fox:αλεπού,eagle:αετός'.scan(/[\u{61}-\u{7a}]+/)
=> ["fox", "eagle"]
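Going the other way, from codepoints back to a string, can be done with Array#pack and the 'U*' directive. Here is a minimal sketch of the round trip along with one more \u{} range. The helper names are hypothetical, and the range U+03B1 to U+03C9 is an assumption chosen to cover the unaccented Greek lowercase letters.

```ruby
# Hypothetical helpers (not from the book) for working with codepoints.
def to_hex_codepoints(str)
  str.codepoints.map { |cp| format('U+%04X', cp) }
end

def from_codepoints(cps)
  cps.pack('U*')   # 'U*' packs each integer as a UTF-8 character
end

# Range from alpha (U+03B1) to omega (U+03C9). Accented letters such as
# U+03CD lie outside this range and will not match.
def greek_lowercase_runs(str)
  str.scan(/[\u{3b1}-\u{3c9}]+/)
end

p to_hex_codepoints('fox')               # U+0066, U+006F, U+0078
p from_codepoints([0x66, 0x6f, 0x78])    # "fox"
p greek_lowercase_runs('φοο12,βτ_4;cat') # "φοο" and "βτ"
```

Since accented letters fall outside such hand-written ranges, the \p{} sets are usually the safer choice when the intent is "any letter of this script".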

See also: codepoints, a site dedicated to Unicode characters.

\X vs dot metacharacter

Some characters are made up of more than one codepoint. Unicode handles these as grapheme clusters. The dot metacharacter matches only one codepoint at a time, whereas \X matches a whole character, even if it is made up of multiple codepoints.

>> 'g̈'.codepoints.map { '%x' % _1 }
=> ["67", "308"]
>> puts "\u{67 308}"
g̈

>> 'cag̈ed'.sub(/a.e/, 'o')
=> "cag̈ed"
>> 'cag̈ed'.sub(/a..e/, 'o')
=> "cod"

>> 'cag̈ed'.sub(/a\Xe/, 'o')
=> "cod"
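The difference also shows up when counting characters. The sketch below compares codepoint-based and grapheme-based counts. The helper name char_counts is made up for illustration, and String#grapheme_clusters assumes Ruby 2.5 or later.

```ruby
# Hypothetical helper (not from the book): three ways of counting "characters".
def char_counts(str)
  {
    codepoints:  str.length,               # counts every codepoint
    dot_matches: str.scan(/./m).length,    # dot consumes one codepoint per match
    graphemes:   str.scan(/\X/).length     # \X consumes one grapheme cluster per match
  }
end

# "g" followed by the combining diaeresis U+0308, written with \u{}
# so that the two codepoints are explicit
s = "ca\u{67 308}ed"

p char_counts(s)                  # 6 codepoints, 6 dot matches, 5 graphemes
p s.grapheme_clusters.length      # 5, same as the \X count
```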

Another difference is that \X always matches newline characters, whereas the dot metacharacter needs the m modifier to do so.

>> "he\nat".sub(/e.a/, 'ea')
=> "he\nat"
>> "he\nat".sub(/e.a/m, 'ea')
=> "heat"

>> "he\nat".sub(/e\Xa/, 'ea')
=> "heat"

Cheatsheet and Summary

Note                  Description
ruby-doc: Encoding    details on working with different string encodings
n                     modifier to use ASCII-8BIT instead of the source encoding
(?u)                  inline modifier to allow escapes like \w to match Unicode
\p{}                  Unicode character sets
                      see ruby-doc: Unicode Properties for full list and details
s.codepoints          get codepoints of characters in the string s
\u{}                  specify characters using codepoints
.                     matches only a single codepoint at a time
\X                    matches any character even if it has multiple codepoints
                      \X will always match newline characters
                      whereas . requires the m modifier to match newline characters

A comprehensive discussion of regexp usage with Unicode characters is out of scope for this book, and working with Unicode text can throw up surprising issues. Resources like regular-expressions: unicode and Programmer's introduction to Unicode are recommended for further study.

Exercises

1) Output true or false depending on whether the input string is made up of ASCII characters only. Assume the input strings are non-empty. Any character that isn't part of the 7-bit ASCII set should give false.

>> str1 = '123—456'
>> str2 = 'good fοοd'
>> str3 = 'happy learning!'

##### add your solution here for str1
=> false
##### add your solution here for str2
=> false
##### add your solution here for str3
=> true

2) Retain only the punctuation characters for the given strings (generated from codepoints). Use the Unicode character set definition for punctuation to solve this exercise.

>> s1 = (0..0x7f).to_a.pack('U*')
>> s2 = (0x80..0xff).to_a.pack('U*')
>> s3 = (0x2600..0x27eb).to_a.pack('U*')

>> pat =        ##### add your solution here

>> s1.gsub(pat, '')
=> "!\"#%&'()*,-./:;?@[\\]_{}"
>> s2.gsub(pat, '')
=> "¡§«¶·»¿"
>> s3.gsub(pat, '')
=> "❨❩❪❫❬❭❮❯❰❱❲❳❴❵⟅⟆⟦⟧⟨⟩⟪⟫"

3) Explore the following Q&A threads.