Dot metacharacter and Quantifiers
This chapter introduces the dot metacharacter and metacharacters related to quantifiers. As the name implies, quantifiers allow you to specify how many times a character or grouping should be matched. With the * string operator, you can do something like 'no' * 5 to get 'nonononono'. This saves you manual repetition and lets you programmatically repeat a string object as many times as you need. Quantifiers support this simple repetition as well as ways to specify a range of repetition. This range has the flexibility of being bounded or unbounded with respect to the start and end values. Combined with the dot metacharacter (and alternation if needed), quantifiers allow you to construct conditional AND logic between patterns.
Dot metacharacter
The dot metacharacter serves as a placeholder to match any character except the newline character.
# matches character 'c', any character and then character 't'
>> 'tac tin c.t abc;tuv acute'.gsub(/c.t/, 'X')
=> "taXin X abXuv aXe"
# matches character 'r', any two characters and then character 'd'
>> 'breadth markedly reported overrides'.gsub(/r..d/) { _1.upcase }
=> "bREADth maRKEDly repoRTED oveRRIDes"
# matches character '2', any character and then character '3'
>> "42\t35".sub(/2.3/, '8')
=> "485"
# by default, the dot metacharacter doesn't match the newline character
>> "a\nb".match?(/a.b/)
=> false
See the m modifier section to know how the . metacharacter can match newlines as well. The Character class chapter will discuss how to define your own custom placeholder for a limited set of characters.
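As a quick preview of the m modifier mentioned above, here is a minimal sketch comparing the default behavior of the dot with the behavior under the m modifier. The variable names are only for illustration.

```ruby
# by default, the dot metacharacter doesn't match the newline character
default_dot = "a\nb".match?(/a.b/)

# with the m modifier, the dot matches newline characters as well
multiline_dot = "a\nb".match?(/a.b/m)

puts default_dot    # false
puts multiline_dot  # true
```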
split method
This chapter will additionally use the split method to illustrate examples. The split method separates the string based on a given regexp (or string) and returns an array of strings.
# same as: 'apple-85-mango-70'.split('-')
>> 'apple-85-mango-70'.split(/-/)
=> ["apple", "85", "mango", "70"]
>> 'bus:3:car:-:van'.split(/:.:/)
=> ["bus", "car", "van"]
# optional limit can be specified as the second argument
# when the limit is positive, you get a maximum of limit-1 splits
>> 'apple-85-mango-70'.split(/-/, 2)
=> ["apple", "85-mango-70"]
See the split with capture groups section for details of how capture groups affect the output of the split method.
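To make the effect of the limit argument concrete, here is a small sketch that splits the same string with and without a limit. The variable names are illustrative.

```ruby
record = 'apple-85-mango-70'

# without a limit, the string is split at every match
no_limit = record.split(/-/)

# with a limit of 3, at most 2 splits are performed
# and the rest of the string is kept as the last element
limit_three = record.split(/-/, 3)

p no_limit     # ["apple", "85", "mango", "70"]
p limit_three  # ["apple", "85", "mango-70"]
```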
Greedy quantifiers
Quantifiers have functionality like the string repetition operator and the range method. They can be applied to characters and groupings (and more, as you'll see in later chapters). Apart from the ability to specify exact quantity and bounded range, these can also match unbounded varying quantities. If the input string can satisfy a pattern with varying quantities in multiple ways, you can choose among three types of quantifiers to narrow down possibilities. This section covers the greedy type of quantifiers.
First up is the ? metacharacter, which quantifies a character or group to match 0 or 1 times. In other words, it makes that character or group optional. This leads to a terser regexp compared to alternation and grouping.
# same as: /ear|ar/
>> 'far feat flare fear'.gsub(/e?ar/, 'X')
=> "fX feat flXe fX"
# same as: /\bpar(t|)\b/
>> 'par spare part party'.gsub(/\bpart?\b/, 'X')
=> "X spare X party"
# same as: /\b(re.d|red)\b/
>> words = %w[red read ready re;d road redo reed rod]
>> words.grep(/\bre.?d\b/)
=> ["red", "read", "re;d", "reed"]
# same as: /part|parrot/
>> 'par part parrot parent'.gsub(/par(ro)?t/, 'X')
=> "par X X parent"
# same as: /part|parrot|parent/
>> 'par part parrot parent'.gsub(/par(en|ro)?t/, 'X')
=> "par X X X"
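The "same as" comments above can be checked programmatically. This is a minimal sketch that runs the ? version and the alternation version on the same input and compares the results.

```ruby
text = 'far feat flare fear'

# optional 'e' before 'ar'
with_quantifier = text.gsub(/e?ar/, 'X')

# the equivalent alternation spells out both possibilities
with_alternation = text.gsub(/ear|ar/, 'X')

puts with_quantifier                      # fX feat flXe fX
puts with_quantifier == with_alternation  # true
```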
The * metacharacter quantifies a character or group to match 0 or more times.
# match 't' followed by zero or more of 'a' followed by 'r'
>> 'tr tear tare steer sitaara'.gsub(/ta*r/, 'X')
=> "X tear Xe steer siXa"
# match 't' followed by zero or more of 'e' or 'a' followed by 'r'
>> 'tr tear tare steer sitaara'.gsub(/t(e|a)*r/, 'X')
=> "X X Xe sX siXa"
# match zero or more of '1' followed by '2'
>> '3111111111125111142'.gsub(/1*2/, 'X')
=> "3X511114X"
Here are some examples with split and related methods. The partition method splits the input string on the first match and the text matched by the regexp is also present in the output. rpartition is like partition but splits on the last match.
# note how '25' and '42' get split, there is '1' zero times in between them
>> '3111111111125111142'.split(/1*/)
=> ["3", "2", "5", "4", "2"]
# there is '1' zero times at end of string as well, note the use of -1 for limit
>> '3111111111125111142'.split(/1*/, -1)
=> ["3", "2", "5", "4", "2", ""]
>> '3111111111125111142'.partition(/1*2/)
=> ["3", "11111111112", "5111142"]
# last element is empty because there is nothing after 2 at the end of string
>> '3111111111125111142'.rpartition(/1*2/)
=> ["311111111112511114", "2", ""]
The + metacharacter quantifies a character or group to match 1 or more times. Similar to the * quantifier, there is no upper bound. More importantly, this doesn't have surprises like matching an empty string at unexpected places.
>> 'tr tear tare steer sitaara'.gsub(/ta+r/, 'X')
=> "tr tear Xe steer siXa"
>> 'tr tear tare steer sitaara'.gsub(/t(e|a)+r/, 'X')
=> "tr X Xe sX siXa"
>> '3111111111125111142'.gsub(/1+2/, 'X')
=> "3X5111142"
>> '3111111111125111142'.split(/1+/)
=> ["3", "25", "42"]
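Here is a small sketch of the kind of surprise the * quantifier can cause. The input and pattern are made up for illustration: since x* can match zero characters, it matches the empty string at every position, whereas x+ needs at least one x.

```ruby
# 'x*' matches the empty string before each character and at the end
star_result = 'abc'.gsub(/x*/, '-')

# 'x+' requires at least one 'x', so nothing is replaced here
plus_result = 'abc'.gsub(/x+/, '-')

puts star_result  # -a-b-c-
puts plus_result  # abc
```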
You can specify a range of integer numbers, both bounded and unbounded, using the {} metacharacters. There are four ways to use this quantifier as shown below:
Pattern | Description |
---|---|
{m,n} | match m to n times |
{m,} | match at least m times |
{,n} | match up to n times (including 0 times) |
{n} | match exactly n times |
>> repeats = %w[abc ac adc abbc xabbbcz bbb bc abbbbbc]
>> repeats.grep(/ab{1,4}c/)
=> ["abc", "abbc", "xabbbcz"]
>> repeats.grep(/ab{3,}c/)
=> ["xabbbcz", "abbbbbc"]
>> repeats.grep(/ab{,2}c/)
=> ["abc", "ac", "abbc"]
>> repeats.grep(/ab{3}c/)
=> ["xabbbcz"]
The {} metacharacters have to be escaped to match them literally. However, unlike the () metacharacters, these have more leeway. For example, escaping { alone is enough, or if it doesn't conform strictly to any of the four forms listed above, escaping is not needed at all. Also, if you are applying the {} quantifier to the # character, you need to escape # to override interpolation.
>> 'a{5} = 10'.sub(/a\{5}/, 'a{6}')
=> "a{6} = 10"
>> 'report_{a,b}.txt'.sub(/_{a,b}/, '-{c,d}')
=> "report-{c,d}.txt"
>> '# heading ### sub-heading'.gsub(/\#{2,}/, '%')
=> "# heading % sub-heading"
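Interpolation inside a regexp literal can also be put to use for building the {m,n} quantifier programmatically. The following is a minimal sketch, where min and max are hypothetical variables chosen for illustration.

```ruby
repeats = %w[abc ac adc abbc xabbbcz bbb bc abbbbbc]

# hypothetical bounds, these could come from configuration or user input
min, max = 2, 3

# #{...} is interpolated before the regexp is compiled
pat = /ab{#{min},#{max}}c/

puts pat.source      # ab{2,3}c
p repeats.grep(pat)  # ["abbc", "xabbbcz"]
```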
Conditional AND
Next up is how to construct a conditional AND using the dot metacharacter and quantifiers.
# match 'Error' followed by zero or more characters followed by 'valid'
>> 'Error: not a valid input'.match?(/Error.*valid/)
=> true
>> 'Error: key not found'.match?(/Error.*valid/)
=> false
To allow matching in any order, you'll have to bring in alternation as well. That is somewhat manageable for 2 or 3 patterns. See the Conditional AND with lookarounds section for an easier approach.
>> seq1, seq2 = ['cat and dog', 'dog and cat']
>> seq1.match?(/cat.*dog|dog.*cat/)
=> true
>> seq2.match?(/cat.*dog|dog.*cat/)
=> true
# if you just need true/false result, this would be a scalable approach
>> patterns = [/cat/, /dog/]
>> patterns.all? { seq1.match?(_1) }
=> true
>> patterns.all? { seq2.match?(_1) }
=> true
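To see why the alternation approach gets unwieldy as the number of patterns grows, here is a sketch that generates the any-order regexp for three words. The word list and variable names are assumptions made for illustration. Three words already need six alternatives, and the count grows factorially.

```ruby
words = %w[cat dog fox]

# one alternative per ordering, for example 'cat.*dog.*fox'
alternatives = words.permutation.map do |order|
  order.map { |word| Regexp.escape(word) }.join('.*')
end
any_order = Regexp.new(alternatives.join('|'))

puts alternatives.size                                   # 6
puts 'the fox chased a dog and a cat'.match?(any_order)  # true
puts 'cat and dog'.match?(any_order)                     # false
```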
What does greedy mean?
When you use the ? quantifier, how does Ruby decide to match 0 or 1 times, if both quantities can satisfy the regexp? For example, consider the substitution expression 'foot'.sub(/f.?o/, 'X'): should foo be replaced or fo? It will always replace foo because these are greedy quantifiers, meaning they try to match as much as possible.
>> 'foot'.sub(/f.?o/, 'X')
=> "Xt"
# a more practical example
# prefix '<' with '\' if it is not already prefixed
# both '<' and '\<' will get replaced with '\<'
>> puts 'blah \< fig < apple \< blah < cat'.gsub(/\\?</, '\<')
blah \< fig \< apple \< blah \< cat
# say goodbye to /handful|handy|hand/ shenanigans
>> 'hand handy handful'.gsub(/hand(y|ful)?/, 'X')
=> "X X X"
But wait, then how did the /Error.*valid/ example work? Shouldn't .* consume all the characters after Error? Good question. The regular expression engine actually does consume all the characters. Then, realizing that the match failed, it gives back one character from the end of the string and checks again if the overall regexp is satisfied. This process is repeated until a match is found or failure is confirmed. In regular expression parlance, this is known as backtracking.
>> sentence = 'that is quite a fabricated tale'
# /t.*a/ will always match from the first 't' to the last 'a'
# which implies that there cannot be more than one match for such patterns
>> sentence.sub(/t.*a/, 'X')
=> "Xle"
>> 'star'.sub(/t.*a/, 'X')
=> "sXr"
# matching first 't' to last 'a' for t.*a won't work for these cases
# so, the engine backtracks until the overall regexp can be matched
>> sentence.sub(/t.*a.*q.*f/, 'X')
=> "Xabricated tale"
>> sentence.sub(/t.*a.*u/, 'X')
=> "Xite a fabricated tale"
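It can help to look at the matched portion itself instead of the substitution result. This sketch uses match(...)[0], which returns only the text matched by the regexp, to show how far .* ends up after backtracking.

```ruby
sentence = 'that is quite a fabricated tale'

# .* first consumes everything, then gives back 'le' so that 'a' can match
first_t_to_last_a = sentence.match(/t.*a/)[0]

# here the engine has to give back much more, until 'a' and then 'u' can match
up_to_u = sentence.match(/t.*a.*u/)[0]

puts first_t_to_last_a  # that is quite a fabricated ta
puts up_to_u            # that is qu
```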
Non-greedy quantifiers
As the name implies, these quantifiers will try to match as minimally as possible. They are also known as lazy or reluctant quantifiers. Appending a ? to greedy quantifiers makes them non-greedy.
>> 'foot'.sub(/f.??o/, 'X')
=> "Xot"
>> 'frost'.sub(/f.??o/, 'X')
=> "Xst"
>> '123456789'.sub(/.{2,5}?/, 'X')
=> "X3456789"
Like greedy quantifiers, lazy quantifiers will try to satisfy the overall regexp. For example, .*? will first start with an empty match and then move forward one character at a time until a match is found.
# /:.*:/ will match from the first ':' to the last ':'
>> 'green:3.14:teal::brown:oh!:blue'.split(/:.*:/)
=> ["green", "blue"]
# /:.*?:/ will match from ':' to the very next ':'
>> 'green:3.14:teal::brown:oh!:blue'.split(/:.*?:/)
=> ["green", "teal", "brown", "blue"]
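The difference is easier to see by extracting the first match of each pattern. This is a small sketch that uses match(...)[0] to return only the matched text.

```ruby
ip = 'green:3.14:teal::brown:oh!:blue'

# greedy: from the first ':' to the last ':'
greedy = ip.match(/:.*:/)[0]

# non-greedy: from the first ':' to the very next ':'
lazy = ip.match(/:.*?:/)[0]

puts greedy  # :3.14:teal::brown:oh!:
puts lazy    # :3.14:
```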
Possessive quantifiers
The difference between greedy and possessive quantifiers is that possessive quantifiers will not backtrack to find a match. In other words, possessive quantifiers will always consume every character that matches the pattern on which they are applied. Syntax wise, you need to append + to greedy quantifiers to make them possessive, similar to adding ? for the non-greedy case.
Unlike greedy and non-greedy quantifiers, a pattern like :.*+apple will never result in a match because .*+ will consume the rest of the line, leaving no way to match apple.
>> ip = 'fig:mango:pineapple:guava:apples:orange'
>> ip.gsub(/:.*+/, 'X')
=> "figX"
>> ip.match?(/:.*+apple/)
=> false
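For contrast, here is a sketch that runs the greedy and possessive versions of the same pattern on this input. The greedy version backtracks to the last apple, while the possessive version has nothing left to match apple with.

```ruby
ip = 'fig:mango:pineapple:guava:apples:orange'

# greedy .* gives back characters until 'apple' can be matched
greedy = ip.match(/:.*apple/)[0]

# possessive .*+ never gives back characters, so there is no match
possessive = ip.match(/:.*+apple/)

puts greedy           # :mango:pineapple:guava:apple
puts possessive.nil?  # true
```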
Here's a more practical example. Suppose you want to match integer numbers greater than or equal to 100 where these numbers can optionally have leading zeros. This illustration will use features yet to be introduced. The scan method returns all the matched portions as an array. The pattern [1-9] matches any of the digits from 1 to 9 and \d matches digits 0 to 9. See the Character class chapter for more details and the Escape sequence sets section for another practical example.
>> numbers = '42 314 001 12 00984'
# this solution fails because 0* and \d{3,} can both match leading zeros
# and greedy quantifiers give up characters to help the overall regexp succeed
>> numbers.scan(/0*\d{3,}/)
=> ["314", "001", "00984"]
# here 0*+ will never give back leading zeros
>> numbers.scan(/0*+\d{3,}/)
=> ["314", "00984"]
# workaround with just greedy quantifiers
>> numbers.scan(/0*[1-9]\d{2,}/)
=> ["314", "00984"]
Atomic grouping
(?>pat) is an atomic group, where pat is the pattern you want to safeguard from further backtracking. You can think of it as a special group that is isolated from the other parts of the regular expression.
Here's an example with a greedy quantifier:
>> numbers = '42 314 001 12 00984'
# 0* is greedy and the (?>) grouping prevents backtracking
# same as: numbers.scan(/0*+\d{3,}/)
>> numbers.scan(/(?>0*)\d{3,}/)
=> ["314", "00984"]
Here's an example with a non-greedy quantifier. The match method is used here to extract only the matching portion.
>> ip = 'fig::mango::pineapple::guava::apples::orange'
# this matches from the first '::' to the first occurrence of '::apple'
>> ip.match(/::.*?::apple/)[0]
=> "::mango::pineapple::guava::apple"
# '(?>::.*?::)' will match only from '::' to the very next '::'
# '::mango::' fails because 'apple' isn't found afterwards
# similarly '::pineapple::' fails
# '::guava::' succeeds because it is followed by 'apple'
>> ip.match(/(?>::.*?::)apple/)[0]
=> "::guava::apple"
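Atomic grouping also affects alternation inside the group. The following sketch, with an input and patterns made up for illustration, shows that once an alternative inside (?>) succeeds, the remaining alternatives are not tried even if the rest of the regexp fails.

```ruby
word = 'handy'

# normal group: 'hand' fails at \b, so the engine retries with 'handy'
normal_group = word.match?(/(hand|handy)\b/)

# atomic group: 'hand' succeeds inside the group and is locked in
# \b then fails between 'd' and 'y', and 'handy' is never attempted
atomic_group = word.match?(/(?>hand|handy)\b/)

puts normal_group  # true
puts atomic_group  # false
```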
Catastrophic Backtracking
Backtracking can become significantly time consuming for certain corner cases. That is why some regular expression engines do not use backtracking, at the cost of not supporting some features like lookarounds. If your application accepts user-defined regexps, you might need to protect against such catastrophic patterns. From Wikipedia: ReDoS:
A regular expression denial of service (ReDoS) is an algorithmic complexity attack that produces a denial-of-service by providing a regular expression and/or an input that takes a long time to evaluate. The attack exploits the fact that many regular expression implementations have super-linear worst-case complexity; on certain regex-input pairs, the time taken can grow polynomially or exponentially in relation to the input size. An attacker can thus cause a program to spend substantial time by providing a specially crafted regular expression and/or input. The program will then slow down or become unresponsive.
Ruby can apply an optimization to prevent ReDoS attacks for certain cases. See ruby-doc: Regexp Optimization for details. Another option is to set a timeout limit, either globally via Regexp.timeout or by setting the timeout keyword argument via Regexp.new. See ruby-doc: Timeouts for details. These features were introduced in Ruby 3.2; see the release notes for details and links to proposals.
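Here is a minimal sketch of the per-regexp timeout, assuming Ruby 3.2 or later. The pattern, the input and the 0.5 second limit are illustrative choices and not taken from this chapter. Depending on the Ruby version, the optimization mentioned above may reject the input quickly, in which case the timeout never triggers.

```ruby
# assumes Ruby 3.2+, where Regexp.new accepts the timeout keyword argument
# nested quantifiers like (a+)+ are a classic catastrophic backtracking shape
risky = Regexp.new('^(a+)+$', timeout: 0.5)
input = ('a' * 25) + 'b'

outcome = begin
  risky.match?(input)
rescue Regexp::TimeoutError
  # raised when matching takes longer than the configured limit
  :timed_out
end

# either false (input rejected in time) or :timed_out (limit exceeded)
p outcome
```

A global limit can be set with Regexp.timeout = 0.5 instead, which applies to regexps that don't specify their own timeout.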
More examples and mitigation strategies can be found in the following links:
- The Explosive Quantifier Trap
- Runaway Regular Expressions: Catastrophic Backtracking
- Details of the Cloudflare outage on July 2, 2019
Cheatsheet and Summary
Note | Description |
---|---|
. | match any character except the newline character |
greedy | match as much as possible |
? | greedy quantifier, match 0 or 1 times |
* | greedy quantifier, match 0 or more times |
+ | greedy quantifier, match 1 or more times |
{m,n} | greedy quantifier, match m to n times |
{m,} | greedy quantifier, match at least m times |
{,n} | greedy quantifier, match up to n times (including 0 times) |
{n} | greedy quantifier, match exactly n times |
pat1.*pat2 | any number of characters between pat1 and pat2 |
pat1.*pat2|pat2.*pat1 | match both pat1 and pat2 in any order |
non-greedy | append ? to greedy quantifiers; match as minimally as possible |
possessive | append + to greedy quantifiers; like greedy, but no backtracking |
(?>pat) | atomic grouping, isolates pat from rest of the regexp |
s.split(/pat/) | split a string based on pat; accepts an optional limit argument to control no. of splits |
s.partition(/pat/) | returns array of 3 elements based on the first match: portion before match, matched portion, portion after match |
s.rpartition(/pat/) | returns array of 3 elements based on the last match |
This chapter introduced the concept of specifying a placeholder instead of fixed strings. When combined with quantifiers, you've seen a glimpse of how a simple regexp can match wide ranges of text. In the coming chapters, you'll learn how to create your own restricted set of placeholder characters.
Exercises
Since the . metacharacter doesn't match newline characters by default, assume that the input strings in the following exercises will not contain newline characters.
1) Replace 42//5 or 42/5 with 8 for the given input.
>> ip = 'a+42//5-c pressure*3+42/5-14256'
>> ip.gsub() ##### add your solution here
=> "a+8-c pressure*3+8-14256"
2) For the array items, filter all elements starting with hand and ending immediately with at most one more character or le.
>> items = %w[handed hand handled handy unhand hands handle]
>> items.grep() ##### add your solution here
=> ["hand", "handy", "hands", "handle"]
3) Use the split method to get the output as shown for the given input strings.
>> eqn1 = 'a+42//5-c'
>> eqn2 = 'pressure*3+42/5-14256'
>> eqn3 = 'r*42-5/3+42///5-42/53+a'
>> pat = ##### add your solution here
>> eqn1.split(pat)
=> ["a+", "-c"]
>> eqn2.split(pat)
=> ["pressure*3+", "-14256"]
>> eqn3.split(pat)
=> ["r*42-5/3+42///5-", "3+a"]
4) For the given input strings, remove everything from the first occurrence of i till the end of the string.
>> s1 = 'remove the special meaning of such constructs'
>> s2 = 'characters while constructing'
>> s3 = 'input output'
>> pat = ##### add your solution here
>> s1.sub(pat, '')
=> "remove the spec"
>> s2.sub(pat, '')
=> "characters wh"
>> s3.sub(pat, '')
=> ""
5) For the given strings, construct a regexp to get the output as shown below.
>> str1 = 'a+b(addition)'
>> str2 = 'a/b(division) + c%d(#modulo)'
>> str3 = 'Hi there(greeting). Nice day(a(b)'
>> remove_parentheses = ##### add your solution here
>> str1.gsub(remove_parentheses, '')
=> "a+b"
>> str2.gsub(remove_parentheses, '')
=> "a/b + c%d"
>> str3.gsub(remove_parentheses, '')
=> "Hi there. Nice day"
6) Correct the given regexp to get the expected output.
>> words = 'plink incoming tint winter in caution sentient'
# wrong output
>> change = /int|in|ion|ing|inco|inter|ink/
>> words.gsub(change, 'X')
=> "plXk XcomXg tX wXer X cautX sentient"
# expected output
>> change = ##### add your solution here
>> words.gsub(change, 'X')
=> "plX XmX tX wX X cautX sentient"
7) For the given greedy quantifiers, what would be the equivalent form using the {m,n} representation?
- ? is same as
- * is same as
- + is same as
8) (a*|b*) is same as (a|b)*, true or false?
9) For the given input strings, remove everything from the first occurrence of test (irrespective of case) till the end of the string, provided test isn't at the end of the string.
>> s1 = 'this is a Test'
>> s2 = 'always test your RE for corner cases'
>> s3 = 'a TEST of skill tests?'
>> pat = ##### add your solution here
>> s1.sub(pat, '')
=> "this is a Test"
>> s2.sub(pat, '')
=> "always "
>> s3.sub(pat, '')
=> "a "
10) For the input array words, filter all elements starting with s and containing e and t in any order.
>> words = ['sequoia', 'subtle', 'exhibit', 'a set', 'sets', 'tests', 'site']
>> words.grep() ##### add your solution here
=> ["subtle", "sets", "site"]
11) For the input array words, remove all elements having fewer than 6 characters.
>> words = %w[sequoia subtle exhibit asset sets tests site]
>> words.grep() ##### add your solution here
=> ["sequoia", "subtle", "exhibit"]
12) For the input array words, filter all elements starting with s or t and having a maximum of 6 characters.
>> words = ['sequoia', 'subtle', 'exhibit', 'asset', 'sets', 't set', 'site']
>> words.grep() ##### add your solution here
=> ["subtle", "sets", "t set", "site"]
13) Can you reason out why this code results in the output shown? The aim was to remove all <characters> patterns but not the <> ones. The expected result was 'a 1<> b 2<> c'.
>> ip = 'a<apple> 1<> b<bye> 2<> c<cat>'
>> ip.gsub(/<.+?>/, '')
=> "a 1 2"
14) Use the split method to get the output as shown below for the given input strings.
>> s1 = 'go there :: this :: that'
>> s2 = 'a::b :: c::d e::f :: 4::5'
>> s3 = '42:: hi::bye::see :: carefully'
>> pat = ##### add your solution here
>> s1.split(pat, 2)
=> ["go there", "this :: that"]
>> s2.split(pat, 2)
=> ["a::b", "c::d e::f :: 4::5"]
>> s3.split(pat, 2)
=> ["42:: hi::bye::see", "carefully"]
15) For the given input strings, match if the string starts with optional space characters followed by at least two # characters.
>> s1 = ' ## header2'
>> s2 = '#### header4'
>> s3 = '# comment'
>> s4 = 'normal string'
>> s5 = 'nope ## not this'
>> pat = ##### add your solution here
>> s1.match?(pat)
=> true
>> s2.match?(pat)
=> true
>> s3.match?(pat)
=> false
>> s4.match?(pat)
=> false
>> s5.match?(pat)
=> false
16) Modify the given regular expression such that it gives the expected results.
>> s1 = 'appleabcabcabcapricot'
>> s2 = 'bananabcabcabcdelicious'
# wrong output
>> pat = /(abc)+a/
>> pat.match?(s1)
=> true
>> pat.match?(s2)
=> true
# expected output
# 'abc' shouldn't be considered when trying to match 'a' at the end
>> pat = ##### add your solution here
>> pat.match?(s1)
=> true
>> pat.match?(s2)
=> false