Character class

This chapter will discuss how to create your own custom placeholders to match a limited set of characters, along with the various metacharacters applicable inside character classes. You'll also learn about escape sequences for predefined character sets.

Custom character sets

Characters enclosed inside the [] metacharacters form a character class (or set). It will result in matching any one of those characters once. It is similar to using single character alternations inside a grouping, but terser and without the drawbacks of a capture group. In addition, character classes have their own versions of metacharacters and provide special predefined sets for common use cases. Quantifiers are applicable to character classes as well.

# same as: /cot|cut/ or /c(o|u)t/
>> %w[cute cat cot coat cost scuttle].grep(/c[ou]t/)
=> ["cute", "cot", "scuttle"]

# /.(a|e|o)+t/ won't work as capture group prevents getting the entire match
>> 'meeting cute boat site foot'.scan(/.[aeo]+t/)
=> ["meet", "boat", "foot"]
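The capture group drawback mentioned above can be seen by running both forms through scan. This is only a small sketch; the variable names are made up for illustration.

```ruby
s = 'meeting cute boat site foot'

# with a capture group, scan returns only the captured portion
# (the last character matched by the group, not the entire match)
with_capture = s.scan(/.(a|e|o)+t/)

# with a character class, there's no capture group to get in the way
with_class = s.scan(/.[aeo]+t/)

p with_capture  # [["e"], ["a"], ["o"]]
p with_class    # ["meet", "boat", "foot"]
```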

Range of characters

Character classes have their own metacharacters to help define the sets succinctly. Metacharacters outside of character classes like ^, $, () etc. either don't have special meaning or have a completely different one inside character classes. First up is the - metacharacter, which helps to define a range of characters instead of having to specify them all individually.

# all digits, same as: scan(/[0123456789]+/)
>> 'Sample123string42with777numbers'.scan(/[0-9]+/)
=> ["123", "42", "777"]

# whole words made up of lowercase alphabets and digits only
>> 'coat Bin food tar12 best Apple fig_42'.scan(/\b[a-z0-9]+\b/)
=> ["coat", "food", "tar12", "best"]

# whole words starting with 'p' to 'z' and having lowercase alphabets only
>> 'coat tin food put stoop best fig_42 Pet'.scan(/\b[p-z][a-z]*\b/)
=> ["tin", "put", "stoop"]

# whole words made up of only 'a' to 'f' and 'p' to 't' lowercase alphabets
>> 'coat tin food put stoop best fig_42 Pet'.scan(/\b[a-fp-t]+\b/)
=> ["best"]
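Ranges follow codepoint order, so a single range spanning both letter cases can include more than intended. Here's a short sketch, assuming ASCII input; the sample string is made up for illustration.

```ruby
s = 'A_b^c'

# [A-z] also covers the characters between 'Z' and 'a': [ \ ] ^ _ `
too_wide = s.scan(/[A-z]+/)

# separate ranges match only the alphabets
letters = s.scan(/[A-Za-z]+/)

p too_wide  # ["A_b^c"]
p letters   # ["A", "b", "c"]
```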

Negating character sets

The next metacharacter is ^, which has to be specified as the first character of the character class. It negates the set of characters, so all characters other than those specified will be matched. As highlighted earlier, handle negative logic with care; you might end up matching more than you wanted.

# non-digit characters
>> 'Sample123string42with777numbers'.scan(/[^0-9]+/)
=> ["Sample", "string", "with", "numbers"]

# remove the first two columns where : is the delimiter
>> 'apple:123:banana:cherry'.sub(/\A([^:]+:){2}/, '')
=> "banana:cherry"

# deleting characters at the end of string based on a delimiter
>> 'apple=42; cherry=123'.sub(/=[^=]+\z/, '')
=> "apple=42; cherry"

>> dates = '2024/04/25,1986/Mar/02,77/12/31'
# note that the third character set negates comma
# and comma is matched optionally outside the capture groups
>> dates.scan(%r{([^/]+)/([^/]+)/([^/,]+),?})
=> [["2024", "04", "25"], ["1986", "Mar", "02"], ["77", "12", "31"]]
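One way a negated set can match more than you wanted: unlike the dot metacharacter, it matches newline characters too unless you exclude them. A minimal sketch with a made-up two-line input:

```ruby
s = "hi:1\nbye:2"

# [^:] matches the newline character as well
loose = s.scan(/[^:]+/)

# adding \n to the negated set keeps each match within a line
strict = s.scan(/[^:\n]+/)

p loose   # ["hi", "1\nbye", "2"]
p strict  # ["hi", "1", "bye", "2"]
```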

Sometimes, it is easier to use a positive character class and invert the boolean result instead of using a negated character class.

>> words = %w[tryst fun glyph pity why]

# words not containing vowel characters
>> words.grep(/\A[^aeiou]+\z/)
=> ["tryst", "glyph", "why"]

# easier to write and maintain
# but this'll match empty strings too unlike the previous solution
# you can add \A\z as an alternate pattern to avoid empty matches
>> words.grep_v(/[aeiou]/)
=> ["tryst", "glyph", "why"]
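The empty string caveat noted in the comments above can be checked with a sketch like this one. The input array is an assumption, chosen to include an empty string.

```ruby
words = ['tryst', '', 'fun', 'why']

# an empty string has no vowels, so it survives the inverted filter
with_empty = words.grep_v(/[aeiou]/)

# \A\z matches only the empty string, so it gets filtered out as well
without_empty = words.grep_v(/[aeiou]|\A\z/)

p with_empty     # ["tryst", "", "why"]
p without_empty  # ["tryst", "why"]
```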

Set intersection

Using && between two sets of characters will result in matching only the intersection of those two sets. To aid in such definitions, you can use [] in a nested fashion.

# [^aeiou] will match any non-vowel character
# which means space is also a valid character to be matched
>> 'tryst glyph pity why'.scan(/\b[^aeiou]+\b/)
=> ["tryst glyph ", " why"]

# [a-z&&[^aeiou]] will be intersection of a-z and non-vowel characters
# this results in positive definition of characters to match
>> 'tryst glyph pity why'.scan(/\b[a-z&&[^aeiou]]+\b/)
=> ["tryst", "glyph", "why"]
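Intersection is also handy for carving an exception out of a range. The sketch below (with made-up sample input) matches whole numbers that don't contain the digit 5 and compares the result with the same set written as explicit ranges.

```ruby
s = '12345 6789 555 604'

# digits other than 5, defined via intersection
no_five = s.scan(/\b[0-9&&[^5]]+\b/)

# the same set written as explicit ranges
explicit = s.scan(/\b[0-46-9]+\b/)

p no_five   # ["6789", "604"]
p explicit  # ["6789", "604"]
```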

Matching metacharacters literally

You can prefix a \ to metacharacters to match them literally. Some of them can be achieved by different placement as well.

- should be the first or last character or escaped using \.

>> 'ab-cd gh-c 12-423'.scan(/\b[a-z-]{2,}\b/)
=> ["ab-cd", "gh-c"]

>> 'ab-cd gh-c 12-423'.scan(/\b[a-z\-0-9]{2,}\b/)
=> ["ab-cd", "gh-c", "12-423"]

^ should be other than the first character or escaped using \.

>> 'f*(a^b) - 3*(a+b)'.scan(/a[+^]b/)
=> ["a^b", "a+b"]

>> 'f*(a^b) - 3*(a+b)'.scan(/a[\^+]b/)
=> ["a^b", "a+b"]

[, ] and \ should be escaped using \.

>> 'words[5] = tea'[/[a-z\[\]0-9]+/]
=> "words[5]"

>> puts '5ba\babc2'[/[a\\b]+/]
ba\bab
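If the characters for a set come from a variable, escaping them by hand becomes error prone. One possible approach, shown here as a sketch rather than a complete solution, is to pass the string through Regexp.escape before interpolating it. The chars variable is hypothetical user input. Note that Regexp.escape doesn't escape the & character, so input containing && would still be treated as set intersection.

```ruby
# hypothetical user-supplied characters: ^ - ] and \
chars = '^-]\\'

# Regexp.escape prefixes \ to these characters,
# so they are matched literally inside the character class
pat = /[#{Regexp.escape(chars)}]/

# single quoted string, so \e here is a backslash followed by 'e'
result = 'a^b-c]d\e'.gsub(pat, '')

p result  # "abcde"
```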

Escape sequence sets

Commonly used character sets have predefined escape sequences:

\w is equivalent to [A-Za-z0-9_] for matching word characters (recall the definition for word boundaries)
\d is equivalent to [0-9] for matching digit characters
\s is equivalent to [ \t\r\n\f\v] for matching whitespace characters
\h is equivalent to [0-9a-fA-F] for matching hexadecimal characters

These escape sequences can be used as a standalone pattern or inside a character class.

>> '128A foo1 fe32 34 bar'.scan(/\b\h+\b/)
=> ["128A", "fe32", "34"]
>> '128A foo1 fe32 34 bar'.scan(/\b\h+\b/).map(&:hex)
=> [4746, 65074, 52]

>> 'Sample123string42with777numbers'.split(/\d+/)
=> ["Sample", "string", "with", "numbers"]
>> 'apple=5, banana=3; x=83, y=120'.scan(/\d+/).map(&:to_i)
=> [5, 3, 83, 120]

>> 'sea eat car rat eel tea'.scan(/\b\w/).join
=> "secret"

>> "tea sea-Pit Sit;(lean_2\tbean_3)".scan(/[\w\s]+/)
=> ["tea sea", "Pit Sit", "lean_2\tbean_3"]
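Escape sequences can also be combined with literal characters inside a character class. Here's a sketch with a made-up input that extracts filename-like tokens made up of word characters, . and - characters.

```ruby
s = 'report_v2.1-final.txt, notes.md; a+b'

# word characters plus literal . and -
# the - is placed last so that it isn't treated as a range
names = s.scan(/[\w.-]+/)

p names  # ["report_v2.1-final.txt", "notes.md", "a", "b"]
```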

And negative logic strikes again. Use \W, \D, \S and \H respectively for their negated sets.

>> 'Sample123string42with777numbers'.gsub(/\D+/, '-')
=> "-123-42-777-"

>> 'apple=5, banana=3; x=83, y=120'.gsub(/\W+/, '')
=> "apple5banana3x83y120"

# this output can be achieved with a normal string method too, guess which one?!
>> "   1..3  \v\f  fig_tea 42\tzzz   \r\n1-2-3  ".scan(/\S+/)
=> ["1..3", "fig_tea", "42", "zzz", "1-2-3"]

\R matches line break characters \n, \v, \f, \r, \u0085 (next line), \u2028 (line separator), \u2029 (paragraph separator) or \r\n. Unlike other escapes, \R cannot be used inside a character class.

>> "food\r\ngood\napple\vbanana".gsub(/\R/, " ")
=> "food good apple banana"

>> "food\r\ngood"[/\w+\R/]
=> "food\r\n"
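Since \R treats \r\n as a single line break, it behaves differently from a character class made up of the individual characters. A small sketch with mixed line endings (the sample string is an assumption):

```ruby
s = "a\r\nb\nc\rd"

# \r\n is consumed as a single line break by \R
lines = s.split(/\R/)

# a character class matches \r and \n separately,
# which leaves an empty string between them
pieces = s.split(/[\r\n]/)

p lines   # ["a", "b", "c", "d"]
p pieces  # ["a", "", "b", "c", "d"]
```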

Here's an example with possessive quantifiers. The goal is to match strings whose first non-whitespace character is not a # character. A matching string should have at least one non-# character, so empty strings and those with only whitespace characters should not match.

>> ip = ['#comment', 'c = "#"', "\t #comment", 'fig', '', " \t "]

# this solution with greedy quantifiers fails because \s* can backtrack
# and [^#] can match a whitespace character as well
>> ip.grep(/\A\s*[^#]/)
=> ["c = \"#\"", "\t #comment", "fig", " \t "]

# this works because \s*+ will not give back any whitespace characters
>> ip.grep(/\A\s*+[^#]/)
=> ["c = \"#\"", "fig"]

# workaround if you use only greedy quantifiers
>> ip.grep(/\A\s*[^#\s]/)
=> ["c = \"#\"", "fig"]

Named character sets

Ruby also provides named character sets, which are Unicode aware, unlike the escape sequence sets that are limited to ASCII characters. A named character set is defined by a name enclosed between [: and :] and has to be used within a character class [], along with any other characters as needed. Using [:^ instead of [: will negate the named set.

Four of the escape sequences presented above have named set equivalents. See ruby-doc: POSIX Bracket Expressions for the full list and details.

# similar to: /\d+/ or /[0-9]+/
>> 'Sample123string42with777numbers'.split(/[[:digit:]]+/)
=> ["Sample", "string", "with", "numbers"]

# similar to: /\S+/
>> "   1..3  \v\f  fig_tea 42\tzzz   \r\n1-2-3  ".scan(/[[:^space:]]+/)
=> ["1..3", "fig_tea", "42", "zzz", "1-2-3"]

# similar to: /[\w\s]+/
>> "tea sea-Pit Sit;(lean_2\tbean_3)".scan(/[[:word:][:space:]]+/)
=> ["tea sea", "Pit Sit", "lean_2\tbean_3"]
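The examples above use ASCII input, so the outputs are identical to those of the escape sequence versions. The difference shows up with non-ASCII text, as in this sketch (the Greek letters are arbitrary sample data and a UTF-8 source file is assumed):

```ruby
s = 'fox:αβγ,eagle:δεζ'

# \w is limited to ASCII, so the Greek letters are skipped
ascii_words = s.scan(/\w+/)

# [:word:] is Unicode aware
unicode_words = s.scan(/[[:word:]]+/)

p ascii_words    # ["fox", "eagle"]
p unicode_words  # ["fox", "αβγ", "eagle", "δεζ"]
```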

Here are some named character sets which do not have escape sequence versions:

# similar to: /[a-zA-Z]+/
>> 'Sample123string42with777numbers'.scan(/[[:alpha:]]+/)
=> ["Sample", "string", "with", "numbers"]

# remove all punctuation characters
>> ip = '"Hi", there! How *are* you? All fine here.'
>> ip.gsub(/[[:punct:]]+/, '')
=> "Hi there How are you All fine here"
# remove all punctuation characters except . ! and ?
>> ip.gsub(/[[^.!?]&&[:punct:]]+/, '')
=> "Hi there! How are you? All fine here."

Numeric ranges

Character classes can also be used to construct numeric ranges.

# numbers from 10 to 29
>> '23 154 12 26 98234'.scan(/\b[12]\d\b/)
=> ["23", "12", "26"]

# numbers >= 100
>> '23 154 12 26 98234'.scan(/\b\d{3,}\b/)
=> ["154", "98234"]

# numbers >= 100 if there are leading zeros
>> '0501 035 154 12 26 98234'.scan(/\b0*+\d{3,}\b/)
=> ["0501", "154", "98234"]

However, it is easy to miss corner cases and some ranges are complicated to design. In such cases, it is better to convert the matched portion to an appropriate numeric format first.

# numbers < 350
>> '45 349 651 593 4 204'.scan(/\d+/).filter { _1.to_i < 350 }
=> ["45", "349", "4", "204"]

# numbers between 200 and 650
>> '45 349 651 593 4 204'.gsub(/\d+/) { (200..650) === $&.to_i ? 0 : 1 }
=> "1 0 1 0 1 0"
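To get a feel for how quickly a regexp-only range becomes complicated, here's a sketch for numbers from 0 to 255 compared against the convert-and-check approach. The sample input is made up and has no leading zeros; the two versions can differ for inputs like 007. The (?:...) non-capturing group is used so that scan returns the entire match.

```ruby
s = '0 7 42 199 255 256 300 1000'

# regexp-only version: each alternative covers one slice of the range
# 250-255, 200-249, 100-199 and 0-99 respectively
by_regexp = s.scan(/\b(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\b/)

# convert and compare: easier to read and to change later
by_value = s.scan(/\d+/).select { |n| (0..255).cover?(n.to_i) }

p by_regexp  # ["0", "7", "42", "199", "255"]
p by_value   # ["0", "7", "42", "199", "255"]
```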

Cheatsheet and Summary

Note	Description
`[ae;o]`	match any of these characters once
	quantifiers are applicable to character classes too
`[3-7]`	range of characters from `3` to `7`
`[^=b2]`	match other than `=` or `b` or `2`
`[a-z&&[^aeiou]]`	intersection of `a-z` and `[^aeiou]`
`[a-z-]`	`-` should be the first/last or escaped using `\` to match literally
`[+^]`	`^` shouldn't be the first character or escaped using `\`
`[a-z\[\]\\]`	`[`, `]` and `\` should be escaped using `\`
`\w`	similar to `[A-Za-z0-9_]` for matching word characters
`\d`	similar to `[0-9]` for matching digit characters
`\s`	similar to `[ \t\n\r\f\v]` for matching whitespace characters
`\h`	similar to `[0-9a-fA-F]` for matching hexadecimal characters
	use `\W`, `\D`, `\S` and `\H` for their opposites respectively
	these escapes can be used inside character class as well
`[[:alpha:]]`	named character set to match alphabets
`[[:punct:]]`	match punctuation characters
`[[:^punct:]]`	match other than punctuation characters
	see ruby-doc: POSIX Bracket Expressions for full list
`\R`	matches line breaks `\n`, `\v`, `\f`, `\r`, `\u0085` (next line)
	`\u2028` (line separator), `\u2029` (paragraph separator) or `\r\n`
	`\R` has no special meaning inside a character class

This chapter focused on how to create custom placeholders for a limited set of characters. Grouping and character classes can be considered as two levels of abstraction. On the one hand, you can have character sets inside [] and on the other, you can have multiple alternations grouped inside () including character classes. As anchoring and quantifiers can be applied to both these abstractions, you can begin to see why regular expressions are considered a mini-programming language.

In the coming chapters, you'll even see how to negate groupings similar to negated character class in certain scenarios.

Exercises

1) For the array items, filter all elements starting with hand and ending immediately with s or y or le.

>> items = %w[-handy hand handy unhand hands hand-icy handle]

##### add your solution here
=> ["handy", "hands", "handle"]

2) Replace all whole words reed or read or red with X.

>> ip = 'redo red credible :read: rod reed'

##### add your solution here
=> "redo X credible :X: rod X"

3) For the array words, filter all elements containing e or i followed by l or n. Note that the order mentioned should be followed.

>> words = %w[surrender unicorn newer door empty eel pest]

##### add your solution here
=> ["surrender", "unicorn", "eel"]

4) For the array words, filter all elements containing e or i and l or n in any order.

>> words = %w[surrender unicorn newer door empty eel pest]

##### add your solution here
=> ["surrender", "unicorn", "newer", "eel"]

5) Convert the comma separated strings to corresponding hash objects as shown below.

>> row1 = 'name:rohan,maths:75,phy:89'
>> row2 = 'name:rose,maths:88,phy:92'

>> pat =        ##### add your solution here

##### add your solution here for row1
=> {"name"=>"rohan", "maths"=>"75", "phy"=>"89"}
##### add your solution here for row2
=> {"name"=>"rose", "maths"=>"88", "phy"=>"92"}

6) Delete from ( to the next occurrence of ) unless they contain parentheses characters in between.

>> str1 = 'def factorial()'
>> str2 = 'a/b(division) + c%d(#modulo) - (e+(j/k-3)*4)'
>> str3 = 'Hi there(greeting). Nice day(a(b)'

>> remove_parentheses =     ##### add your solution here

>> str1.gsub(remove_parentheses, '')
=> "def factorial"
>> str2.gsub(remove_parentheses, '')
=> "a/b + c%d - (e+*4)"
>> str3.gsub(remove_parentheses, '')
=> "Hi there. Nice day(a"

7) For the array words, filter all elements not starting with e or p or u.

>> words = %w[surrender unicorn newer door empty eel (pest)]

##### add your solution here
=> ["surrender", "newer", "door", "(pest)"]

8) For the array words, filter all elements not containing u or w or ee or -.

>> words = %w[p-t you tea heel owe new reed ear]

##### add your solution here
=> ["tea", "ear"]

9) The given input strings contain fields separated by , and fields can be empty too. Replace the last three fields with WHTSZ323.

>> row1 = '(2),kite,12,,D,C,,'
>> row2 = 'hi,bye,sun,moon'

>> pat =        ##### add your solution here

##### add your solution here for row1
=> "(2),kite,12,,D,WHTSZ323"
##### add your solution here for row2
=> "hi,WHTSZ323"

10) Split the given strings based on a consecutive sequence of digit or whitespace characters.

>> str1 = "lion \t Ink32onion Nice"
>> str2 = "**1\f2\n3star\t7 77\r**"

>> pat =        ##### add your solution here

>> str1.split(pat)
=> ["lion", "Ink", "onion", "Nice"]
>> str2.split(pat)
=> ["**", "star", "**"]

11) Delete all occurrences of the sequence <characters> where characters is one or more non > characters and cannot be empty.

>> ip = 'a<apple> 1<> b<bye> 2<> c<cat>'

##### add your solution here
=> "a 1<> b 2<> c"

12) \b[a-z](on|no)[a-z]\b is same as \b[a-z][on]{2}[a-z]\b. True or False? Sample input lines shown below might help to understand the differences, if any.

>> puts "known\nmood\nknow\npony\ninns"
known
mood
know
pony
inns

13) For the given array, filter elements containing any number sequence greater than 624.

>> items = ['h0000432ab', 'car00625', '42_624 0512', '96 foo1234baz 3.14 2']

##### add your solution here
=> ["car00625", "96 foo1234baz 3.14 2"]

14) Count the maximum depth of nested braces for the given strings. Unbalanced or wrongly ordered braces should return -1. Note that this will require a mix of regular expressions and Ruby code.

?> def max_nested_braces(ip)
##### add your solution here
>> end

>> max_nested_braces('a*b')
=> 0
>> max_nested_braces('}a+b{')
=> -1
>> max_nested_braces('a*b+{}')
=> 1
>> max_nested_braces('{{a+2}*{b+c}+e}')
=> 2
>> max_nested_braces('{{a+2}*{b+{c*d}}+e}')
=> 3
>> max_nested_braces("{{a+2}*{\n{b+{c*d}}+e*d}}")
=> 4
>> max_nested_braces('a*{b+c*{e*3.14}}}')
=> -1

15) By default, the split method will split on whitespace and remove empty strings from the result. Which regexp based method would you use to replicate this functionality?

>> ip = " \t\r  so  pole\t\t\t\n\nlit in to \r\n\v\f  "

>> ip.split
=> ["so", "pole", "lit", "in", "to"]

##### add your solution here
=> ["so", "pole", "lit", "in", "to"]

16) Convert the given input string to two different arrays as shown below. You can optimize the regexp based on characters present in the input string.

>> ip = "price_42 roast^\t\n^-ice==cat\neast"

##### add your solution here
=> ["price_42", "roast", "ice", "cat", "east"]

##### add your solution here
=> ["price_42", " ", "roast", "^\t\n^-", "ice", "==", "cat", "\n", "east"]

17) Filter all elements whose first non-whitespace character is not a # character. Any element made up of only whitespace characters should be ignored as well.

>> items = ['    #comment', "\t\napple #42", '#oops', 'sure', 'no#1', "\t\r\f"]

##### add your solution here
=> ["\t\napple #42", "sure", "no#1"]

18) Extract all whole words for the given input strings. However, based on user input ignore, do not match words if they contain any character present in the ignore variable. Assume that ignore variable will not contain any regexp metacharacters.

>> s1 = 'match after the last newline character'
>> s2 = 'and then you want to test'

>> ignore = 'aty'
>> pat =        ##### add your solution here
>> s1.scan(pat)
=> ["newline"]
>> s2.scan(pat)
=> []

>> ignore = 'esw'
>> pat =        ##### add your solution here
>> s1.scan(pat)
=> ["match"]
>> s2.scan(pat)
=> ["and", "you", "to"]

19) Filter all whole elements with optional whitespaces at the start followed by three to five non-digit characters. Whitespaces at the start should not be part of the calculation for non-digit characters.

>> items = ["\t \ncat", 'goal', ' oh', 'he-he', 'goal2', 'ok ', 'sparrow']

##### add your solution here
=> ["\t \ncat", "goal", "he-he", "ok "]

20) Modify the given regexp such that it gives the expected result.

>> ip = '( S:12 E:5 S:4 and E:123 ok S:100 & E:10 S:1 - E:2 S:42 E:43 )'

# wrong output
>> ip.scan(/S:\d+.*?E:\d{2,}/)
=> ["S:12 E:5 S:4 and E:123", "S:100 & E:10", "S:1 - E:2 S:42 E:43"]

# expected output
##### add your solution here
=> ["S:4 and E:123", "S:100 & E:10", "S:42 E:43"]
