Alternation and Grouping

Often, you want to check whether the input string matches any one of several patterns. For example, whether a product's color is green, blue or red. This chapter will show how to use alternation for such cases. These patterns can have some common elements between them, in which case grouping helps to form terser regexps. This chapter will also discuss the precedence rules used to determine which alternation wins.

Alternation

A conditional expression combined with logical OR evaluates to true if any of the conditions is satisfied. Similarly, in regular expressions, you can use the | metacharacter to combine multiple patterns to indicate logical OR. The matching will succeed if any of the alternate patterns is found in the input string. Each alternative has the full power of a regular expression; for example, each one can have its own independent anchors. Here are some examples.

# match either 'cat' or 'dog'
>> pet = /cat|dog/
>> 'I like cats'.match?(pet)
=> true
>> 'I like dogs'.match?(pet)
=> true
>> 'I like parrots'.match?(pet)
=> false

# replace 'cat' at the start of string or 'cat' at the end of word
>> 'catapults concatenate cat scat cater'.gsub(/\Acat|cat\b/, 'X')
=> "Xapults concatenate X sX cater"

# replace 'cat' or 'dog' or 'fox' with 'mammal'
>> 'cat dog bee parrot fox'.gsub(/cat|dog|fox/, 'mammal')
=> "mammal mammal bee parrot mammal"

Regexp.union method

You might infer from the above examples that there can be cases where many alternations are required. The Regexp.union method can be used to build the alternation list automatically. It accepts either an array as a single argument or a list of comma-separated arguments.

>> Regexp.union('car', 'jeep')
=> /car|jeep/

>> words = %w[cat dog fox]
>> pat = Regexp.union(words)
>> pat
=> /cat|dog|fox/
>> 'cat dog bee parrot fox'.gsub(pat, 'mammal')
=> "mammal mammal bee parrot mammal"

In the above examples, the elements do not contain any special regexp characters. Handling strings that contain metacharacters will be discussed in the Regexp.escape method section.
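
As a quick check of the two calling styles, the following sketch compares the regexp built from a list of arguments with the one built from an array. The variable names are illustrative only.

```ruby
# Regexp.union accepts either a list of arguments or a single array
from_list  = Regexp.union('cat', 'dog', 'fox')
from_array = Regexp.union(%w[cat dog fox])

# both calls build the same alternation
puts from_list.source            # cat|dog|fox
puts from_array.source           # cat|dog|fox
puts from_list == from_array     # true
```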

Grouping

Often, there are some common portions among the regexp alternatives. These could be common characters, or qualifiers such as anchors. In such cases, you can group the alternatives using a pair of parentheses metacharacters. Similar to a(b+c)d = abd+acd in maths, you get a(b|c)d = abd|acd in regular expressions.

# without grouping
>> 'red reform read arrest'.gsub(/reform|rest/, 'X')
=> "red X read arX"
# with grouping
>> 'red reform read arrest'.gsub(/re(form|st)/, 'X')
=> "red X read arX"

# without grouping
>> 'par spare part party'.gsub(/\bpar\b|\bpart\b/, 'X')
=> "X spare X party"
# taking out common anchors
>> 'par spare part party'.gsub(/\b(par|part)\b/, 'X')
=> "X spare X party"
# taking out common characters as well
# you'll later learn a better technique instead of using empty alternates
>> 'par spare part party'.gsub(/\bpar(|t)\b/, 'X')
=> "X spare X party"

There are many more uses for grouping than just forming a terser regexp. They will be discussed as they become relevant in the coming chapters.
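
Even for alternation alone, grouping affects meaning and not just brevity. As a sketch of what happens when the parentheses are left out, the comparison below reuses the sample string from the examples above. Without the group, each anchor belongs to only one alternative.

```ruby
s = 'par spare part party'

# without parentheses, this means: '\bpar' OR 'part\b'
puts s.gsub(/\bpar|part\b/, 'X')      # X spare Xt Xty

# with parentheses, both anchors apply to every alternative
puts s.gsub(/\b(par|part)\b/, 'X')    # X spare X party
```

In the first case, 'par' at the start of 'part' and 'party' satisfies the \bpar alternative, so only those three characters are replaced.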

Regexp.source method

The source method, called on a Regexp object, returns the pattern as a string. This helps to interpolate one regexp inside another regexp literal. For example, adding anchors to an alternation list created using the Regexp.union method.

>> words = %w[cat par]
>> alt = Regexp.union(words)
>> alt
=> /cat|par/
>> alt_w = /\b(#{alt.source})\b/
>> alt_w
=> /\b(cat|par)\b/

>> 'cater cat concatenate par spare'.gsub(alt, 'X')
=> "Xer X conXenate X sXe"
>> 'cater cat concatenate par spare'.gsub(alt_w, 'X')
=> "cater X concatenate X spare"

The above example will work without the source method too, but you'll see that /\b(#{alt})\b/ gives /\b((?-mix:cat|par))\b/ instead of /\b(cat|par)\b/. The meaning of the (?-mix:) portion will be explained in the Modifiers chapter.
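
The difference comes from how a regexp object is converted to a string: interpolating the object gives the same text as to_s, while source returns only the pattern. A minimal sketch comparing the two, with illustrative variable names:

```ruby
alt = Regexp.union(%w[cat par])

# pattern text only
puts alt.source         # cat|par
# pattern text wrapped with its modifier flags
puts alt.to_s           # (?-mix:cat|par)

# interpolating the regexp object carries the to_s form along
combined = /\b(#{alt})\b/
puts combined.source    # \b((?-mix:cat|par))\b
```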

Precedence rules

There are tricky situations when using alternation. There is no ambiguity if it is used to get a boolean result by testing a match against a string input. However, for cases like string replacement, the result depends on a few factors. Say, you want to replace either are or spared. Which one should get precedence: the bigger word spared, the substring are inside it, or something else?

In Ruby, the alternative which matches earliest in the input string gets precedence. The regexp operator =~ is handy to illustrate this concept.

>> words = 'lion elephant are rope not'

>> words =~ /on/
=> 2
>> words =~ /ant/
=> 10

# starting index of 'on' < index of 'ant' for the given string input
# so 'on' will be replaced irrespective of the order
>> words.sub(/on|ant/, 'X')
=> "liX elephant are rope not"
>> words.sub(/ant|on/, 'X')
=> "liX elephant are rope not"

What happens if alternatives have the same starting index? The precedence is left-to-right in the order of declaration.

>> mood = 'best years'

>> mood =~ /year/
=> 5
>> mood =~ /years/
=> 5

# starting index for 'year' and 'years' will always be the same
# so, which one gets replaced depends on the order of alternation
>> mood.sub(/year|years/, 'X')
=> "best Xs"
>> mood.sub(/years|year/, 'X')
=> "best X"
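
To see which alternative actually matched, rather than inferring it from the replacement, the matched portion can be extracted with String#[]. This is only a sketch that reuses the sample strings from the examples above.

```ruby
words = 'lion elephant are rope not'
# the earliest match in the input wins, irrespective of declaration order
puts words[/on|ant/]       # on
puts words[/ant|on/]       # on

mood = 'best years'
# same starting index, so the leftmost alternative in the regexp wins
puts mood[/year|years/]    # year
puts mood[/years|year/]    # years
```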

Another example with gsub to drive home the issue:

>> words = 'ear xerox at mare part learn eye'

# same as: gsub(/ar/, 'X')
>> words.gsub(/ar|are|art/, 'X')
=> "eX xerox at mXe pXt leXn eye"

# same as: gsub(/are|ar/, 'X')
>> words.gsub(/are|ar|art/, 'X')
=> "eX xerox at mX pXt leXn eye"

# phew, finally this one works as needed
>> words.gsub(/are|art|ar/, 'X')
=> "eX xerox at mX pX leXn eye"

If you do not want substrings to sabotage your replacements, a robust workaround is to sort the alternations based on length, longest first.

>> words = %w[hand handy handful]

>> alt = Regexp.union(words.sort_by { |w| -w.length })
>> alt
=> /handful|handy|hand/

>> 'hands handful handed handy'.gsub(alt, 'X')
=> "Xs X Xed X"

# alternation order will come into play if you don't sort them properly
>> 'hands handful handed handy'.gsub(Regexp.union(words), 'X')
=> "Xs Xful Xed Xy"
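
Sorting by length addresses shorter alternatives matching inside longer ones. If only whole words should be replaced, a different approach is to wrap the alternation in a group with word boundaries, using the source method shown earlier. The sketch below assumes that whole-word matching is what you want, which is a different requirement from the example above. When the trailing boundary fails for one alternative, the regexp engine moves on to the next one, so the declaration order does not affect the result for this input.

```ruby
words = %w[hand handy handful]

# unsorted alternation, wrapped in a group with word boundaries
whole = /\b(#{Regexp.union(words).source})\b/

# 'hands' and 'handed' are left alone since no listed word matches them fully
puts 'hands handful handed handy'.gsub(whole, 'X')    # hands X handed X
```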

Cheatsheet and Summary

Note	Description
`|`	helps to combine multiple patterns as conditional OR
	each alternative can have independent anchors
`Regexp.union(array)`	programmatically combine multiple strings/regexps
`()`	group pattern(s)
`a(b|c)d`	same as `abd|acd`
`/#{pat.source}/`	interpolate a regexp literal inside another regexp
Alternation precedence	pattern which matches earliest in the input gets precedence
	tie-breaker is left-to-right if patterns have the same starting location
	robust solution: sort the alternations based on length, longest first
	for example: `Regexp.union(words.sort_by { |w| -w.length })`

This chapter was about specifying one or more alternate matches within the same regexp using the | metacharacter. Such a regexp can be further simplified using () grouping if the alternations have common portions. Among the alternations, the earliest matching pattern gets precedence. Left-to-right ordering is used as a tie-breaker if multiple alternations have the same starting location. You also learnt a couple of Regexp methods that help to programmatically construct a regexp literal.

Exercises

1) For the given input array, filter all elements that start with den or end with ly.

>> items = ['lovely', "1\ndentist", '2 lonely', 'eden', "fly\n", 'dent']

>> items.grep()     ##### add your solution here
=> ["lovely", "2 lonely", "dent"]

2) For the given array, filter elements having a line starting with den or ending with ly.

>> items = ['lovely', "1\ndentist", '2 lonely', 'eden', "fly\nfar", 'dent']

>> items.grep()     ##### add your solution here
=> ["lovely", "1\ndentist", "2 lonely", "fly\nfar", "dent"]

3) For the given strings, replace all occurrences of removed or reed or received or refused with X.

>> s1 = 'creed refuse removed read'
>> s2 = 'refused reed redo received'

>> pat =        ##### add your solution here

>> s1.gsub(pat, 'X')
=> "cX refuse X read"
>> s2.gsub(pat, 'X')
=> "X X redo X"

4) For the given strings, replace all matches from the array words with A.

>> s1 = 'plate full of slate'
>> s2 = "slated for later, don't be late"
>> words = %w[late later slated]

>> pat =        ##### add your solution here

>> s1.gsub(pat, 'A')
=> "pA full of sA"
>> s2.gsub(pat, 'A')
=> "A for A, don't be A"

5) Filter all elements from the input array items that exactly match any of the elements present in the array words.

>> items = ['slate', 'later', 'plate', 'late', 'slates', 'slated ']
>> words = %w[late later slated]

>> pat =        ##### add your solution here

>> items.grep(pat)
=> ["later", "late"]

Understanding Ruby Regexp