Unicode

So far in this book, all the examples have used strings made up of ASCII characters only. A few years back, that would've been sufficient for most use cases. These days, it is rare to encounter an ASCII-only project. This chapter briefly discusses Unicode matching.

Unicode character sets and u flag

Similar to escape sequence character sets, the \p{} construct offers various predefined sets to work with Unicode. For negated sets, use \P{}. Both constructs require the u flag to be set.

// extract all consecutive letters
> 'fox:αλεπού,eagle:αετός'.match(/\p{L}+/gu)
< ["fox", "αλεπού", "eagle", "αετός"]

// delete all characters other than letters
> 'φοο12,βτ_4,foo'.replace(/\P{L}+/gu, '')
< "φοοβτfoo"

// extract whole words not surrounded by punctuation marks
> 'tie. ink east;'.match(/(?<!\p{P})\b\w+\b(?!\p{P})/gu)
< ["ink"]

See MDN: Unicode property escapes for more details and the list of \p{} sets. See also regular-expressions: unicode for an overview of the challenges with Unicode matching.
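
As a further sketch of the \p{} construct, the snippet below tries two other property names and shows what happens when the u flag is left out. The Script=Greek and Lu properties are picked here only for illustration (they are not used elsewhere in this chapter), and the snippet assumes a JavaScript engine that supports Unicode property escapes.

```javascript
const s = 'fox:αλεπού,eagle:αετός'

// script-based set: consecutive characters from the Greek script
const greek = s.match(/\p{Script=Greek}+/gu)
console.log(greek)      // ["αλεπού", "αετός"]

// general category Lu: uppercase letters from any script
const upper = 'Fox Αλεπού'.match(/\p{Lu}/gu)
console.log(upper)      // ["F", "Α"]

// without the u flag, \p{L} is not treated as a property escape
// it matches the literal text 'p{L}' instead
const noFlag = /\p{L}/.test('fox')
const literal = /\p{L}/.test('p{L}')
console.log(noFlag, literal)    // false true
```

The last two lines are the reason to double check the flags whenever a \p{} pattern silently fails to match: no error is raised when the u flag is missing.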

Codepoints

You can also use codepoints inside the \u{} construct to specify Unicode characters. This is similar to how \xhh can be used to specify ASCII characters with two hexadecimal digits.

// to get codepoints in hexadecimal from given string
> Array.from('fox:αλεπού', c => c.codePointAt().toString(16))
< ["66", "6f", "78", "3a", "3b1", "3bb", "3b5", "3c0", "3bf", "3cd"]

// using codepoint to represent a character
> '\u{3b1}'
< "α"

// character range for lowercase ASCII letters using \u{}
// note that \u{} inside a regexp requires the 'u' flag
> 'fox:αλεπού,eagle:αετός'.match(/[\u{61}-\u{7a}]+/gu)
< ["fox", "eagle"]

See also stackoverflow: Unicode string to hex.
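
The relationship between \xhh, \u{} and codepoints can be sketched as below. The variable names are made up for this illustration, and the range U+0370 to U+03FF (the Greek and Coptic block) is an assumption chosen to cover the accented letters in the sample string.

```javascript
// \xhh and \u{} name the same character for values up to ff
const sameChar = '\x61' === '\u{61}'
console.log(sameChar)    // true

// round trip between a character and its codepoint
const cp = 'α'.codePointAt(0)
console.log(cp.toString(16))             // "3b1"
console.log(String.fromCodePoint(cp))    // "α"

// \u{} also accepts codepoints above ffff
// Array.from iterates by codepoint, so such a character stays in one piece
const hex = Array.from('a\u{1f600}', c => c.codePointAt(0).toString(16))
console.log(hex)    // ["61", "1f600"]

// character range covering the Greek and Coptic block
const greek = 'fox:αλεπού,eagle:αετός'.match(/[\u{370}-\u{3ff}]+/gu)
console.log(greek)    // ["αλεπού", "αετός"]

// without the u flag, \u{61} is read as the letter 'u' with a {61} quantifier
const withFlag = /\u{61}/u.test('a')
const withoutFlag = /\u{61}/.test('a')
console.log(withFlag, withoutFlag)    // true false
```

A narrower range such as [\u{3b1}-\u{3c9}] would miss accented letters like ύ (3cd), which is one reason the \p{} sets are usually more convenient than hand-written codepoint ranges.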

Cheatsheet and Summary

Note    Description
u       flag to enable Unicode matching
\p{}    Unicode character sets
\P{}    negated Unicode character sets; see MDN: Unicode property escapes for details
\u{}    specify Unicode characters using codepoints

A comprehensive discussion on regexp usage with Unicode characters is out of scope for this book. Resources like regular-expressions: unicode and Programmers introduction to Unicode are recommended for further study.

Exercises

a) Check if the given input strings are made up of ASCII characters only. Assume that the inputs are non-empty strings. Any character that isn't part of the 7-bit ASCII set should result in false as the output.

> let str1 = '123 × 456'
> let str2 = 'good fοοd'
> let str3 = 'happy learning!'

> const pat1 =      // add your solution here

> pat1.test(str1)
< false
> pat1.test(str2)
< false
> pat1.test(str3)
< true

b) Retain only the punctuation characters in the given string.

> let ip = '❨a❩❪1❫❬b❭❮2❯❰c❱❲3❳❴xyz❵⟅123⟆⟦⟧⟨like⟩⟪3.14⟫'

// add your solution here
< "❨❩❪❫❬❭❮❯❰❱❲❳❴❵⟅⟆⟦⟧⟨⟩⟪.⟫"

c) Is the following code snippet showing the correct output?

> 'fox:αλεπού'.match(/\w+/g)
< ["fox"]