So far in the book, all examples were meant for strings made up of ASCII characters only. However, Regexp class uses source encoding by default. And the default string encoding is UTF-8. See ruby-doc: Encoding for details on working with different string encoding.

Encoding modifiers

Modifiers can be used to override the encoding to be used. For example, the n modifier will use ASCII-8BIT instead of source encoding.

# example with ASCII characters only
>> 'foo - baz'.gsub(/\w+/n, '(\0)')
=> "(foo) - (baz)"

# example with non-ASCII characters as well
>> 'fox:αλεπού'.scan(/\w+/n)
(irb):2: warning: historical binary regexp match /.../n against UTF-8 string
=> ["fox"]

Character set escapes like \w match only ASCII characters whereas named character sets are Unicode aware. You can also use (?u) inline modifier to allow character set escapes to match Unicode characters.

>> 'fox:αλεπού'.scan(/\w+/)
=> ["fox"]

>> 'fox:αλεπού'.scan(/[[:word:]]+/)
=> ["fox", "αλεπού"]

>> 'fox:αλεπού'.scan(/(?u)\w+/)
=> ["fox", "αλεπού"]

info See ruby-doc: Regexp Encoding for other such modifiers and details.

Unicode character sets

Similar to named character classes and escape sequences, the \p{} construct offers various predefined sets that will work for Unicode strings. See ruby-doc: Character Properties for full list and details.

# extract all consecutive letters
>> 'fox:αλεπού,eagle:αετός'.scan(/\p{L}+/)
=> ["fox", "αλεπού", "eagle", "αετός"]
# extract all consecutive Greek letters
>> 'fox:αλεπού,eagle:αετός'.scan(/\p{Greek}+/)
=> ["αλεπού", "αετός"]

# extract all words
>> 'φοο12,βτ_4,foo'.scan(/\p{Word}+/)
=> ["φοο12", "βτ_4", "foo"]

# delete all characters other than letters
# \p{^L} can also be used instead of \P{L}
>> 'φοο12,βτ_4,foo'.gsub(/\P{L}+/, '')
=> "φοοβτfoo"

Codepoints and Unicode escapes

For generic Unicode character ranges, specify codepoints using \u{} construct. The below snippet also shows how to get codepoints (numerical value of a character) in Ruby.

# to get codepoints from string
>> 'fox:αλεπού'.codepoints.map { |i| '%x' % i }
=> ["66", "6f", "78", "3a", "3b1", "3bb", "3b5", "3c0", "3bf", "3cd"]
# one or more codepoints can be specified inside \u{}
>> puts "\u{66 6f 78 3a 3b1 3bb 3b5 3c0 3bf 3cd}"

# character range example using \u{}
# all english lowercase letters
>> 'fox:αλεπού,eagle:αετός'.scan(/[\u{61}-\u{7a}]+/)
=> ["fox", "eagle"]

info See also: codepoints, a site dedicated for Unicode characters.

\X vs dot metacharacter

Some characters have more than one codepoint. These are handled in Unicode with grapheme clusters. The dot metacharacter will only match one codepoint at a time. You can use \X to match any character, even if it has multiple codepoints.

>> 'g̈'.codepoints.map { |i| '%x' % i }
=> ["67", "308"]
>> puts "\u{67 308}"

>> 'cag̈ed'.sub(/a.e/, 'o')
=> "cag̈ed"
>> 'cag̈ed'.sub(/a..e/, 'o')
=> "cod"

>> 'cag̈ed'.sub(/a\Xe/, 'o')
=> "cod"

Another difference is that \X will match newline characters by default.

>> "he\nat".sub(/e.a/, 'ea')
=> "he\nat"
>> "he\nat".sub(/e.a/m, 'ea')
=> "heat"

>> "he\nat".sub(/e\Xa/, 'ea')
=> "heat"

Cheatsheet and Summary

ruby-doc: Encodingdetails on working with different string encoding
nmodifier to use ASCII-8BIT instead of source encoding
(?u)inline modifier to allow escapes like \w to match unicode
\p{}Unicode character sets
see ruby-doc: Character Properties for full list and details
s.codepointsget codepoints of characters of string s
\u{}specify characters using codepoints
.matches only single codepoint at a time
\Xmatches any character even if it has multiple codepoints
\X will also match newline characters by default
whereas . requires m modifier to match newline character

A comprehensive discussion on regexp usage with Unicode characters is out of scope for this book. Also, it could throw up strange issues. Resources like regular-expressions: unicode and Programmers introduction to Unicode are recommended for further study.


a) Output true or false depending on input string made up of ASCII characters or not. Consider the input to be non-empty strings and any character that isn't part of 7-bit ASCII set should give false

>> str1 = '123—456'
>> str2 = 'good fοοd'
>> str3 = 'happy learning!'

##### add your solution here for str1
=> false
##### add your solution here for str2
=> false
##### add your solution here for str3
=> true

b) Retain only punctuation characters for the given strings (generated from codepoints). Use Unicode character set definition for punctuation for solving this exercise.

>> s1 = (0..0x7f).to_a.pack('U*')
>> s2 = (0x80..0xff).to_a.pack('U*')
>> s3 = (0x2600..0x27eb).to_a.pack('U*')

>> pat =        ##### add your solution here

>> s1.gsub(pat, '')
=> "!\"#%&'()*,-./:;?@[\\]_{}"
>> s2.gsub(pat, '')
=> "¡§«¶·»¿"
>> s3.gsub(pat, '')
=> "❨❩❪❫❬❭❮❯❰❱❲❳❴❵⟅⟆⟦⟧⟨⟩⟪⟫"

c) Explore the following Q&A threads.