Character class

This chapter discusses how to create your own custom placeholders to match a limited set of characters, along with the various metacharacters applicable inside character classes. You'll also learn about escape sequences for predefined character sets.

Custom character sets

Characters enclosed inside the [] metacharacters form a character class (or set). A character class matches any one of those characters once. It is similar to using single character alternations inside a grouping, but without the additional effects of a capture group. In addition, character classes have their own versions of metacharacters and provide special predefined sets for common use cases. Quantifiers are applicable to character classes as well.

// same as: /cot|cut/ or /c(o|u)t/
> ['cute', 'cat', 'cot', 'coat', 'cost'].filter(w => /c[ou]t/.test(w))
< ["cute", "cot"]

// same as: /(a|e|o)+t/g
> 'meeting cute boat site foot'.replace(/[aeo]+t/g, 'X')
< "mXing cute bX site fX"

> 'Sample123string42with777numbers'.match(/[0123456789]+/g)
< ["123", "42", "777"]
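
The "additional effects of a capture group" mentioned above can be seen in what match() and split() return. The following is a minimal sketch; the variable names withGroup and withClass are made up for illustration.

```javascript
// Alternation inside a capture group records the captured text
const withGroup = 'cot'.match(/c(o|u)t/);
// A character class matches the same strings without capturing anything
const withClass = 'cot'.match(/c[ou]t/);

console.log(withGroup.length); // 2: full match plus one captured group
console.log(withGroup[1]);     // "o"
console.log(withClass.length); // 1: full match only

// With split(), captured text is included in the result
console.log('a1b2c'.split(/(1|2)/)); // ["a", "1", "b", "2", "c"]
console.log('a1b2c'.split(/[12]/));  // ["a", "b", "c"]
```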

Range of characters

Character classes have their own metacharacters to help define the sets succinctly. Metacharacters used outside of character classes, like ^, $, () and so on, either don't have a special meaning or have a completely different one inside character classes. First up is the - metacharacter, which defines a range of characters so that you don't have to specify them all individually.

// all digits
> 'Sample123string42with777numbers'.match(/[0-9]+/g)
< ["123", "42", "777"]

// whole words made up of lowercase alphabets and digits only
> 'coat Bin food tar12 best'.match(/\b[a-z0-9]+\b/g)
< ["coat", "food", "tar12", "best"]

// whole words made up of lowercase alphabets, but starting with 'p' to 'z'
> 'coat tin food put stoop best'.match(/\b[p-z][a-z]*\b/g)
< ["tin", "put", "stoop"]

// whole words made up of only 'a' to 'f' and 'p' to 't' lowercase alphabets
> 'coat tin food put stoop best'.match(/\b[a-fp-t]+\b/g)
< ["best"]
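
A range is defined by character code points, so both ends of a range should come from the same run of characters. The sketch below illustrates a common slip, assuming ASCII input: [A-z] is wider than [A-Za-z] because a few punctuation characters sit between Z and a. The names tooWide and lettersOnly are illustrative only.

```javascript
// ASCII places [ \ ] ^ _ ` between 'Z' and 'a', so [A-z] accepts them too
const tooWide = /^[A-z]+$/;
const lettersOnly = /^[A-Za-z]+$/;

console.log(tooWide.test('snake_case'));     // true
console.log(lettersOnly.test('snake_case')); // false
console.log(lettersOnly.test('camelCase'));  // true

// a range whose start comes after its end is a syntax error
let rangeError = null;
try {
    new RegExp('[z-a]');
} catch (e) {
    rangeError = e;
}
console.log(rangeError instanceof SyntaxError); // true
```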

Negating character sets

The ^ metacharacter has to be specified as the first character of the character class. It negates the set, so all characters other than those specified will be matched.

// all non-digits
> 'Sample123string42with777numbers'.match(/[^0-9]+/g)
< ["Sample", "string", "with", "numbers"]

// deleting characters from start of string based on a delimiter
> 'foo=42; baz=123'.replace(/^[^=]+/, '')
< "=42; baz=123"
> 'foo:123:bar:baz'.replace(/^([^:]+:){2}/, '')
< "bar:baz"

// deleting characters at end of string based on a delimiter
> 'foo=42; baz=123'.replace(/=[^=]+$/, '')
< "foo=42; baz"

As highlighted earlier, handle negative logic with care, as you might end up matching more than you wanted. Sometimes it is easier to use a positive character class and invert the test condition instead of using a negated character class.

> let words = ['tryst', 'fun', 'glyph', 'pity', 'why']

// elements not containing vowel characters
> words.filter(w => /^[^aeiou]+$/.test(w))
< ["tryst", "glyph", "why"]
// easier to write and maintain, note the use of '!' operator
// but this'll match empty strings too unlike the previous solution
> words.filter(w => !/[aeiou]/.test(w))
< ["tryst", "glyph", "why"]
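
The difference noted in the comments above can be checked by adding an empty string to the input. This is a small sketch; the array name samples is made up for illustration.

```javascript
const samples = ['tryst', 'fun', '', 'why'];

// negated class with + needs at least one character, so '' is rejected
const negated = samples.filter(w => /^[^aeiou]+$/.test(w));
// inverted test only checks that no vowel is present, so '' is accepted
const inverted = samples.filter(w => !/[aeiou]/.test(w));

console.log(negated);  // ["tryst", "why"]
console.log(inverted); // ["tryst", "", "why"]
```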

Matching metacharacters literally

Similar to other metacharacters, prefix \ to character class metacharacters to match them literally. Some of them can be achieved by different placement as well.

- should be the first or last character, or it should be escaped using \.

> 'ab-cd gh-c 12-423'.match(/\b[a-z-]{2,}\b/g)
< ["ab-cd", "gh-c"]

> 'ab-cd gh-c 12-423'.match(/\b[a-z\-0-9]{2,}\b/g)
< ["ab-cd", "gh-c", "12-423"]

^ should not be the first character, or it should be escaped using \.

> 'f*(a^b) - 3*(a+b)'.match(/a[+^]b/g)
< ["a^b", "a+b"]

> 'f*(a^b) - 3*(a+b)'.match(/a[\^+]b/g)
< ["a^b", "a+b"]

[ doesn't need escaping, but you can escape it if you wish. ] should be escaped with \.

> 'words[5] = tea'.match(/[a-z[\]0-9]+/)[0]
< "words[5]"

\ should be escaped using \.

> console.log('5ba\\babc2'.match(/[a\\b]+/)[0])
< ba\bab
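
When a character class is built from a string at run time, it is usually simpler to escape all four of these characters than to rely on placement. The sketch below assumes a hypothetical helper named charClass (not part of JavaScript or of this chapter) that takes the characters to match and returns a global regexp.

```javascript
// Hypothetical helper: builds a character class that matches one or more
// of the given characters literally
function charClass(chars) {
    // escape the four characters that are special inside []: \ ] ^ -
    const escaped = chars.replace(/[\\\]^-]/g, '\\$&');
    return new RegExp(`[${escaped}]+`, 'g');
}

console.log(charClass('^-]').source);
// [\^\-\]]+

// the input string is a^b-c]d\e
console.log('a^b-c]d\\e'.match(charClass('^-]\\')));
// ["^", "-", "]", "\\"]
```

Escaping a character that didn't strictly need it at its position, such as a trailing -, is harmless here. That is why the helper doesn't try to work out placement.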

Escape sequence character sets

Commonly used character sets have predefined escape sequences:

  • \w is similar to [A-Za-z0-9_] for matching word characters (recall the definition for word boundaries)
  • \d is similar to [0-9] for matching digit characters
  • \s is similar to [ \t\r\n\f\v] for matching whitespace characters

These escape sequences can be used as a standalone sequence or inside a character class. As mentioned before, the examples and description will assume that the input is made up of ASCII characters only. Use \W, \D and \S respectively for their negated set.

> 'Sample123string42with777numbers'.split(/\d+/)
< ["Sample", "string", "with", "numbers"]

> 'sea eat car rat eel tea'.match(/\b\w/g).join('')
< "secret"

> 'tea sea-pit sit-lean bean'.match(/[\w\s]+/g)
< ["tea sea", "pit sit", "lean bean"]

> 'Sample123string42with777numbers'.replace(/\D+/g, '-')
< "-123-42-777-"

> '   1..3  \v\f  foo_baz 42\tzzz   \r\n1-2-3  '.match(/\S+/g)
< ["1..3", "foo_baz", "42", "zzz", "1-2-3"]
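
The "similar to" wording in the list above matters once the input isn't plain ASCII. The sketch below illustrates two points under that caveat: escape sequences combine freely with other characters inside [], and without extra flags \w and \d don't match non-ASCII letters and digits, while \s matches a few whitespace characters beyond the listed set.

```javascript
// escape sequences mixed with literal characters inside a class
console.log('x=3.14, y=-2'.match(/[\d.-]+/g));       // ["3.14", "-2"]

// \w covers only [A-Za-z0-9_], so the accented letters break the words
// \u00EF is ï and \u00E9 is é
console.log('na\u00EFve caf\u00E9'.match(/\w+/g));   // ["na", "ve", "caf"]

// \d does not match digits from other scripts (\u0663 is Arabic-Indic three)
console.log(/\d/.test('\u0663'));                    // false

// \s is wider than the listed set, for example it matches non-breaking space
console.log(/\s/.test('\u00A0'));                    // true
```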

Numeric ranges

Character classes can also be used to construct numeric ranges.

// numbers from 10 to 29
> '23 154 12 26 98234'.match(/\b[12]\d\b/g)
< ["23", "12", "26"]

// numbers >= 100
> '23 154 12 26 98234'.match(/\b\d{3,}\b/g)
< ["154", "98234"]
// numbers >= 100 if there are leading zeros
> '0501 035 154 12 26 98234'.match(/\b0*[1-9]\d{2,}\b/g)
< ["0501", "154", "98234"]

However, it is easy to miss corner cases and some ranges are complicated to design. In such cases, it is better to match all the numbers and then add code to use actual numeric operations.

// numbers < 350
> '45 349 651 593 4 204'.match(/\d+/g).filter(n => n < 350)
< ["45", "349", "4", "204"]
> '45 349 651 593 4 204'.replace(/\d+/g, m => m < 350 ? 0 : 1)
< "0 0 1 1 0 0"

// numbers between 200 and 650
> '45 349 651 593 4 204'.match(/\d+/g).filter(n => n >= 200 && n <= 650)
< ["349", "593", "204"]
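
The examples above rely on JavaScript coercing the matched strings to numbers during comparison. A slightly more explicit version is sketched below, assuming a hypothetical helper named numbersInRange that converts the matches with Number() and returns numbers instead of strings.

```javascript
// Hypothetical helper: extract integers from text and keep those
// that fall within lo and hi (both inclusive)
function numbersInRange(text, lo, hi) {
    // match() returns null when there are no digits, hence the fallback
    const found = text.match(/\d+/g) || [];
    return found.map(Number).filter(n => n >= lo && n <= hi);
}

console.log(numbersInRange('45 349 651 593 4 204', 200, 650)); // [349, 593, 204]
// leading zeros are dropped by the numeric conversion
console.log(numbersInRange('0501 035 154', 100, 999));         // [501, 154]
console.log(numbersInRange('no digits here', 0, 10));          // []
```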

See regular-expressions: matching numeric ranges for more examples.

Cheatsheet and Summary

Note        Description
[ae;o]      match any of these characters once
            quantifiers are applicable to character classes too
[3-7]       range of characters from 3 to 7
[^=b2]      negated set, match other than = or b or 2
[a-z-]      - should be first/last or escaped using \ to match literally
[+^]        ^ should not be the first character, or should be escaped using \
[\]\\]      ] and \ should be escaped using \
            [ doesn't need escaping, but \[ can also be used
\w          similar to [A-Za-z0-9_] for matching word characters
\d          similar to [0-9] for matching digit characters
\s          similar to [ \t\n\r\f\v] for matching whitespace characters
            assumes input encoding is ASCII
            use \W, \D, and \S for their opposites respectively

This chapter focused on how to create custom placeholders to match a limited set of characters. Grouping and character classes can be considered as two levels of abstraction. On the one hand, you can have character sets inside [] and on the other hand, you can have multiple alternations, including character classes, grouped inside (). As anchoring and quantifiers can be applied to both of these abstractions, you can begin to see why regular expressions are considered a mini programming language. In the coming chapters, you'll even see how to negate groupings in certain scenarios, similar to a negated character class.

Exercises

a) For the array items, filter all elements starting with hand and ending with s or y or le. No other character in between, for example, hands should match but not hand-has.

> let items = ['-handy', 'hand', 'handy', 'handled', 'hands', 'handle']

// add your solution here
< ["handy", "hands", "handle"]

b) Replace all whole words reed or read or red with X.

> let ip = 'redo red credible :read: rod reed bred'

// add your solution here
< "redo X credible :X: rod X bred"

c) For the array words, filter all elements containing e or i followed by l or n. Note that the order mentioned should be followed.

> let words = ['surrender', 'unicorn', 'newer', 'door', 'empty', 'eel', 'pest']

// add your solution here
< ["surrender", "unicorn", "eel"]

d) For the array words, filter all elements containing e or i and l or n in any order.

> let words = ['surrender', 'unicorn', 'newer', 'door', 'empty', 'eel', 'pest']

// add your solution here
< ["surrender", "unicorn", "newer", "eel"]

e) Extract all hex character sequences, with 0x optional prefix. Match the characters case insensitively, and the sequences shouldn't be surrounded by other word characters.

> let str1 = '128A foo 0xfe32 34 0xbar'
> let str2 = '0XDEADBEEF place 0x0ff1ce bad'

> const hex_seq =       // add your solution here

> str1.match(hex_seq)
< ["128A", "0xfe32", "34"]
> str2.match(hex_seq)
< ["0XDEADBEEF", "0x0ff1ce", "bad"]

f) Delete from ( to the next occurrence of ) unless they contain parentheses characters in between.

> let str1 = 'def factorial()'
> let str2 = 'a/b(division) + c%d(#modulo) - (e+(j/k-3)*4)'
> let str3 = 'Hi there(greeting). Nice day(a(b)'

> const remove_parentheses =        // add your solution here

> str1.replace(remove_parentheses, '')
< "def factorial"
> str2.replace(remove_parentheses, '')
< "a/b + c%d - (e+*4)"
> str3.replace(remove_parentheses, '')
< "Hi there. Nice day(a"

g) For the array words, filter all elements not starting with e or p or u.

> let words = ['surrender', 'unicorn', 'newer', 'door', 'empty', 'eel', 'pest']

// add your solution here
< ["surrender", "newer", "door"]

h) For the array words, filter all elements not containing u or w or ee or -.

> let words = ['p-t', 'you', 'tea', 'heel', 'owe', 'new', 'reed', 'ear']

// add your solution here
< ["tea", "ear"]

i) The given input strings contain fields separated by , and fields can be empty too. Replace the last three fields with WHTSZ323.

> let row1 = '(2),kite,12,,D,C,,'
> let row2 = 'hi,bye,sun,moon'

> const pat1 =      // add your solution here

// add your solution here for row1
< "(2),kite,12,,D,WHTSZ323"
// add your solution here for row2
< "hi,WHTSZ323"

j) Split the given strings based on consecutive sequences of digit or whitespace characters.

> let s1 = 'lion \t Ink32onion Nice'
> let s2 = '**1\f2\n3star\t7 77\r**'

> const pat2 =      // add your solution here

> s1.split(pat2)
< ["lion", "Ink", "onion", "Nice"]
> s2.split(pat2)
< ["**", "star", "**"]

k) Delete all occurrences of the sequence <characters> where characters is one or more non > characters and cannot be empty.

> let ip = 'a<apple> 1<> b<bye> 2<> c<cat>'

// add your solution here
< "a 1<> b 2<> c"

l) \b[a-z](on|no)[a-z]\b is the same as \b[a-z][on]{2}[a-z]\b. True or False? The sample input lines shown below might help to understand the differences, if any.

> console.log('known\nmood\nknow\npony\ninns')
  known
  mood
  know
  pony
  inns

m) For the given array, filter all elements containing any number sequence greater than 624.

> let items = ['h0000432ab', 'car00625', '42_624 0512', '3.14 96 2 foo1234baz']

// add your solution here
< ["car00625", "3.14 96 2 foo1234baz"]

n) Convert the given input string to two different arrays as shown below.

> let ip = 'price_42 roast^\t\n^-ice==cat\neast'

// add your solution here
< ["price_42", "roast", "ice", "cat", "east"]

// add your solution here
< ["price_42", " ", "roast", "^\t\n^-", "ice", "==", "cat", "\n", "east"]

o) Filter all elements whose first non-whitespace character is not a # character. Any element made up of only whitespace characters should be ignored as well.

> let items = ['    #comment', '\t\napple #42', '#oops', 'sure', 'no#1', '\t\r\f']

// add your solution here
< ["\t\napple #42", "sure", "no#1"]

p) For the given string, surround all whole words with {} except for whole words par and cat.

> let ip = 'part; cat {super} rest_42 par scatter'

// add your solution here
< "{part}; cat {{super}} {rest_42} par {scatter}"