Working with matched portions
You have already seen a few features that can match varying text. In this chapter, you'll learn how to extract and work with those matching portions. First, you'll learn about the match method and the resulting MatchData object. Then you'll learn to use the scan method to get all the matches instead of just the first match. You'll also see how capture groups affect the scan and split methods and how to use global variables related to regexp.
match method
First up is the match method, which is similar to the match? method. Both these methods accept a regexp and an optional index to indicate the starting location. Furthermore, these methods treat a string argument as if it were a regexp all along (which is not the case with other string methods like sub, split, etc). The match method returns a MatchData object from which various details can be extracted, like the matched portion of the string, the location of the matched portion, etc. nil is returned if there's no match for the given regexp.
# only the first matching portion is considered
>> 'too soon a song snatch'.match(/so+n/)
=> #<MatchData "soon">
# string argument is treated the same as a regexp
>> 'too soon a song snatch'.match('a.*g')
=> #<MatchData "a song">
# second argument specifies the starting location to search for a match
>> 'too soon a song snatch'.match(/so+n/, 7)
=> #<MatchData "son">
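Since match returns nil when nothing matches, indexing the result directly would raise NoMethodError. Here's a minimal sketch of two common ways to guard against that. The variable names (no_match, first, missing, found) are made up for this illustration.

```ruby
s = 'too soon a song snatch'

# no match, so there is no MatchData object to work with
no_match = s.match(/xyz/)

# the safe navigation operator skips the [] call when the receiver is nil
first = s.match(/so+n/)&.[](0)
missing = s.match(/xyz/)&.[](0)

# an explicit check is handy when more than one detail is needed
if (m = s.match(/so+n/, 7))
  found = "#{m[0]} at index #{m.begin(0)}"
end

p no_match   # nil
p first      # "soon"
p missing    # nil
p found      # "son at index 11"
```

Note that the index reported for a match found with a starting location is still relative to the beginning of the whole string.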
The () grouping is also known as a capture group. It has multiple uses, one of which is the ability to work with matched portions of those groups. When capture groups are used with the match method, they can be retrieved using array index slicing on the MatchData object. The first element is always the entire matched portion and the rest of the elements are for capture groups, if they are present. The leftmost ( will get group number 1, the second leftmost ( will get group number 2 and so on.
# retrieving the entire matched portion using index 0
>> 'too soon a song snatch'.match(/on.*g/)[0]
=> "on a song"
# capture group example
>> purchase = 'coffee:100g tea:250g sugar:75g chocolate:50g'
>> m = purchase.match(/:(.*?)g.*?:(.*?)g.*?chocolate:(.*?)g/)
# entire matching portion and capture group portions
>> m.to_a
=> [":100g tea:250g sugar:75g chocolate:50g", "100", "250", "50"]
# only the capture group portions
>> m.captures
=> ["100", "250", "50"]
# getting a specific capture group portion
>> m[1]
=> "100"
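The numbering rule (count the opening parentheses from left to right) matters most when groups are nested. The sketch below uses a made-up date string to show that, and also that a group which doesn't participate in the match shows up as nil.

```ruby
# group 1 is the outer group, groups 2 and 3 are nested inside it
m = '2023-04-25'.match(/((\d+)-(\d+))-(\d+)/)
nested = m.captures

# the optional group (x)? doesn't participate here, so its value is nil
optional = 'cat'.match(/(c)(x)?at/).captures

p nested     # ["2023-04", "2023", "04", "25"]
p optional   # ["c", nil]
```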
The offset method returns the starting index and the ending index plus one (that is, the end is exclusive) of the matching portion. Its argument selects either the entire matching portion (0) or a specific capture group (N). You can also use the begin and end methods to get either of those locations.
>> m = 'awesome'.match(/w(.*)me/)
>> m.offset(0)
=> [1, 7]
>> m.offset(1)
=> [2, 5]
>> m.begin(0)
=> 1
>> m.end(1)
=> 5
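Because the end index is exclusive, the two values from offset can be used directly as an exclusive range on the original string. This is only a sketch to show how the numbers relate to the text. The names start, stop, sliced, before and after are illustrative.

```ruby
s = 'awesome'
m = s.match(/w(.*)me/)

# slicing with the offsets of group 1 gives the same text as m[1]
start, stop = m.offset(1)
sliced = s[start...stop]

# text before and after the entire match, built from begin and end
before = s[0...m.begin(0)]
after = s[m.end(0)..]

p sliced   # "eso"
p before   # "a"
p after    # ""
```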
There are many more methods available. See ruby-doc: MatchData for details.
>> m = 'THIS is goodbye then'.match(/hi.*bye/i)
>> m.regexp
=> /hi.*bye/i
>> m.string
=> "THIS is goodbye then"
The named_captures
method will be covered in the Named capture groups section.
match method with block
The match
method also supports the block form, which is executed only if the regexp matching succeeds.
>> 'THIS is goodbye then'.match(/T(.*S).*(g.*?e)/) { |m| puts m[2], m[1] }
goodbye
HIS
>> 'apple mango'.match(/xyz/) { puts 2 * 3 }
=> nil
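When a block is given, the value of the block becomes the return value of match, so the result can be assigned directly. A small sketch, with upper and skipped as illustrative names:

```ruby
# block runs, its value is returned by match
upper = 'THIS is goodbye then'.match(/g.*?e/) { |m| m[0].upcase }

# block is skipped when there's no match, so nil is returned
skipped = 'apple mango'.match(/xyz/) { |m| m[0].upcase }

p upper     # "GOODBYE"
p skipped   # nil
```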
Using regexp as a string index
If you are a fan of code golfing, you can use a regexp inside []
on a string object to replicate some features of the match
and sub!
methods.
# same as: match(/so+n/)[0]
>> 'too soon a song snatch'[/so+n/]
=> "soon"
# same as: match(/(t.*?s).*(s.*g)/)[2]
>> 'too soon a song snatch'[/(t.*?s).*(s.*g)/, 2]
=> "song"
# same as: match(/so+n/, 7)[0]
>> 'too soon a song snatch'[7..][/so+n/]
=> "son"
>> word = 'elephant'
# same as: word.sub!(/e.*h/, 'w')
>> word[/e.*h/] = 'w'
=> "w"
>> word
=> "want"
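The two forms differ when the regexp doesn't match. Reading with [] gives nil, sub! returns nil and leaves the string alone, but assignment through [] raises IndexError. The sketch below assumes a mutable copy of the string (hence the dup call) and uses illustrative variable names.

```ruby
word = 'elephant'.dup

# reading with a regexp that doesn't match gives nil
lookup = word[/xyz/]

# sub! returns nil and leaves the string unchanged
result = word.sub!(/xyz/, 'w')

# assignment raises an exception instead of returning nil
begin
  word[/xyz/] = 'w'
rescue IndexError => e
  error_class = e.class
end

p lookup        # nil
p result        # nil
p error_class   # IndexError
p word          # "elephant"
```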
scan method
The scan
method returns all the matched portions as an array. With the match
method you can get only the first matching portion.
>> 'too soon a song snatch'.scan(/so*n/)
=> ["soon", "son", "sn"]
>> 'too soon a song snatch'.scan(/so+n/)
=> ["soon", "son"]
>> s = 'PAR spar apparent SpArE part pare'
>> s.scan(/\bs?pare?\b/i)
=> ["PAR", "spar", "SpArE", "pare"]
It is a useful method for debugging purposes as well, for example to see what is going on under the hood before applying substitution methods.
>> s = 'green:3.14:teal::brown:oh!:blue'
>> s.scan(/:.*:/)
=> [":3.14:teal::brown:oh!:"]
>> s.scan(/:.*?:/)
=> [":3.14:", "::", ":oh!:"]
>> s.scan(/:.*+:/)
=> []
If capture groups are used, each element of the output will be an array of strings of all the capture groups. Text matched by regexp outside of capture groups won't be present in the output array. Also, you'll get an empty string if a particular capture group didn't match any character. See the Non-capturing groups section if you need to use groupings without affecting the scan
output.
>> purchase = 'coffee:100g tea:250g sugar:75g chocolate:50g'
# without capture groups
>> purchase.scan(/:.*?g/)
=> [":100g", ":250g", ":75g", ":50g"]
# with a single capture group
>> purchase.scan(/:(.*?)g/)
=> [["100"], ["250"], ["75"], ["50"]]
# multiple capture groups
# note that the last date didn't match because there's no comma at the end
# you'll later learn better ways to match such patterns
>> '2023/04/25,1986/Mar/02,77/12/31'.scan(%r{(.*?)/(.*?)/(.*?),})
=> [["2023", "04", "25"], ["1986", "Mar", "02"]]
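Here's a short sketch of the empty string case mentioned above. It also shows a related case: a group that doesn't participate in the match at all (for example, an optional group) gives nil instead of an empty string. The second input string is made up for this illustration.

```ruby
# (y*) and (x*) can match zero characters, giving empty strings
empty = 'xx:yyy x: x:yy :y'.scan(/(x*):(y*)/)

# (\d)? is skipped entirely for 'b', giving nil
skipped = 'a1 b'.scan(/([a-z])(\d)?/)

p empty     # [["xx", "yyy"], ["x", ""], ["x", "yy"], ["", "y"]]
p skipped   # [["a", "1"], ["b", nil]]
```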
Use block form to iterate over the matched portions.
>> 'too soon a song snatch'.scan(/so+n/) { puts _1.upcase }
SOON
SON
>> 'xx:yyy x: x:yy :y'.scan(/(x*):(y*)/) { puts _1.size + _2.size }
5
1
3
1
split with capture groups
Capture groups affect the split
method as well. If the regexp used to split contains capture groups, the portions matched by those groups will also be a part of the output array.
# without capture groups
>> '31111111111251111426'.split(/1*4?2/)
=> ["3", "5", "6"]
# to include the matching portions of the regexp as well in the output
>> '31111111111251111426'.split(/(1*4?2)/)
=> ["3", "11111111112", "5", "111142", "6"]
If part of the regexp is outside a capture group, the text thus matched won't be in the output. If a capture group didn't participate, that element will be totally absent in the output.
# here 4?2 is outside capture group, so that portion won't be in output
>> '31111111111251111426'.split(/(1*)4?2/)
=> ["3", "1111111111", "5", "1111", "6"]
# multiple capture groups example
# note that the portion matched by b+ isn't present in the output
>> '3.14aabccc42'.split(/(a+)b+(c+)/)
=> ["3.14", "aa", "ccc", "42"]
# here (4)? matches zero times on the first occasion, thus absent
>> '31111111111251111426'.split(/(1*)(4)?2/)
=> ["3", "1111111111", "5", "1111", "4", "6"]
Use of capture groups and optional limit as 2 gives behavior similar to the partition
method.
# same as: partition(/a+b+c+/)
>> '3.14aabccc42abc88'.split(/(a+b+c+)/, 2)
=> ["3.14", "aabccc", "42abc88"]
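The similarity holds when the regexp matches. When it doesn't, the two methods give results of different shapes, as this sketch shows (variable names are illustrative):

```ruby
s = '3.14aabccc42abc88'
same = s.split(/(a+b+c+)/, 2) == s.partition(/a+b+c+/)

# no match: split gives a single element, partition always gives three
split_result = '3.14'.split(/(a+b+c+)/, 2)
partition_result = '3.14'.partition(/a+b+c+/)

p same               # true
p split_result       # ["3.14"]
p partition_result   # ["3.14", "", ""]
```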
regexp global variables
An expression involving regexp also sets regexp related global variables, except for the match?
method. Assume m
is a MatchData
object in the below description of four of the regexp related global variables.
- $~ contains the MatchData object, same as m
- $` string before the matched portion, same as m.pre_match
- $& matched portion, same as m[0]
- $' string after the matched portion, same as m.post_match
Here's an example:
>> sentence = 'that is quite a fabricated tale'
>> sentence =~ /q.*b/
=> 8
>> $~
=> #<MatchData "quite a fab">
>> $~[0]
=> "quite a fab"
>> $`
=> "that is "
>> $&
=> "quite a fab"
>> $'
=> "ricated tale"
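Two details are worth checking for yourself: match? leaves the global variables untouched, and a failed match resets them to nil. In this sketch, the values are copied into local variables (illustrative names) right after each step so they can be compared later.

```ruby
sentence = 'that is quite a fabricated tale'

sentence =~ /q.*b/
before = $&

# match? succeeds here, but doesn't update the global variables
predicate = sentence.match?(/t.*e/)
after_predicate = $&

# a failed match sets $~ (and the variables derived from it) to nil
sentence =~ /xyz/
after_failure = $~

p before            # "quite a fab"
p predicate         # true
p after_predicate   # "quite a fab"
p after_failure     # nil
```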
For methods that match multiple times, like scan and gsub, the global variables will be updated for each match. Referring to them in later instructions will give you information only for the final match.
# same as: { puts _1.upcase }
>> 'too soon a song snatch'.scan(/so+n/) { puts $&.upcase }
SOON
SON
# using 'gsub' for illustration purpose here, can also use 'scan'
>> 'too soon a song snatch'.gsub(/so+n/) { puts $~.begin(0) }
4
11
# using global variables afterwards will give info only for the final match
>> $~
=> #<MatchData "son">
>> $`
=> "too soon a "
If you need to apply methods like map along with regexp global variables, use gsub instead of scan.
>> sentence = 'that is quite a fabricated tale'
# you'll only get information for the last match with 'scan'
>> sentence.scan(/t.*?a/).map { $~.begin(0) }
=> [23, 23, 23]
# 'gsub' will get you information for each match
>> sentence.gsub(/t.*?a/).map { $~.begin(0) }
=> [0, 3, 23]
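The gsub version works because gsub returns an Enumerator when it is called with only a pattern (no replacement and no block), so map visits the matches one at a time. The scan version has already built the complete array before map starts. A small sketch, with enum, starts and pairs as illustrative names:

```ruby
sentence = 'that is quite a fabricated tale'

# no replacement and no block, so an Enumerator is returned
enum = sentence.gsub(/t.*?a/)
starts = enum.map { $~.begin(0) }

# collecting the matched text along with its starting index
pairs = sentence.gsub(/t.*?a/).map { [$&, $~.begin(0)] }

p enum.class   # Enumerator
p starts       # [0, 3, 23]
p pairs        # [["tha", 0], ["t is quite a", 3], ["ted ta", 23]]
```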
In addition to using $~, you can also use $N where N is the capture group you want. $1 will have the string matched by the first group, $2 will have the string matched by the second group and so on. As a special case, $+ will have the string matched by the last group. The value is nil if that particular capture group wasn't used in the regexp.
>> sentence = 'that is quite a fabricated tale'
>> sentence =~ /a.*(q.*(f.*b).*c)(.*a)/
=> 2
>> $&
=> "at is quite a fabricated ta"
# same as $~[1]
>> $1
=> "quite a fabric"
>> $2
=> "fab"
>> $+
=> "ated ta"
>> $4
=> nil
# $~ is handy if array slicing, negative index, etc are needed
>> $~[-2]
=> "fab"
>> $~.values_at(1, 3)
=> ["quite a fabric", "ated ta"]
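If you find the punctuation-style variables hard to read, Regexp.last_match gives the same information. Without an argument it returns the MatchData object that $~ refers to, and with an argument N it returns the Nth element. A sketch using the same input as above, with illustrative local variable names:

```ruby
sentence = 'that is quite a fabricated tale'
sentence =~ /a.*(q.*(f.*b).*c)(.*a)/

whole = Regexp.last_match(0)              # same as $&
second = Regexp.last_match(2)             # same as $2
captures = Regexp.last_match.captures     # same as $~.captures

p whole      # "at is quite a fabricated ta"
p second     # "fab"
p captures   # ["quite a fabric", "fab", "ated ta"]
```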
Using hashes
With the help of block form and global variables, you can use a hash variable to determine the replacement string based on the matched text. If the requirement is as simple as passing the entire matched portion to the hash, both the sub and gsub methods accept a hash instead of a string in the replacement section.
# one to one mappings
>> h = { '1' => 'one', '2' => 'two', '4' => 'four' }
# same as: '9234012'.gsub(/1|2|4/) { h[$&] }
>> '9234012'.gsub(/1|2|4/, h)
=> "9two3four0onetwo"
# if the matched text doesn't exist as a key, the default value will be used
>> h.default = 'X'
>> '9234012'.gsub(/./, h)
=> "XtwoXfourXonetwo"
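For comparison, here's a sketch of what happens when the hash has no default value and the matched text isn't a key: the lookup gives nil, which becomes an empty string, so the matched text is deleted. If you'd rather keep unmatched text as is, one option (an illustration, not the only way) is the block form with fetch.

```ruby
# fresh hash without a default value
h = { '1' => 'one', '2' => 'two', '4' => 'four' }

# 9, 3 and 0 aren't keys, so they are replaced with an empty string
dropped = '9234012'.gsub(/\d/, h)

# fall back to the matched text itself when the key is missing
kept = '9234012'.gsub(/\d/) { h.fetch($&, $&) }

p dropped   # "twofouronetwo"
p kept      # "9two3four0onetwo"
```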
For swapping two or more strings without using an intermediate result, using a hash object is recommended.
>> swap = { 'cat' => 'tiger', 'tiger' => 'cat' }
>> 'cat tiger dog tiger cat'.gsub(/cat|tiger/, swap)
=> "tiger cat dog cat tiger"
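To see why a single pass with a hash is preferable, here's a sketch of the naive approach of chaining two substitutions. The second gsub also changes the text produced by the first one, so the swap is lost.

```ruby
s = 'cat tiger dog tiger cat'

# first pass turns every 'cat' into 'tiger', second pass turns them all back
chained = s.gsub('cat', 'tiger').gsub('tiger', 'cat')

# single pass, each match is looked up exactly once
swap = { 'cat' => 'tiger', 'tiger' => 'cat' }
swapped = s.gsub(/cat|tiger/, swap)

p chained   # "cat cat dog cat cat"
p swapped   # "tiger cat dog cat tiger"
```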
For hashes that have many entries and are likely to undergo changes during development, building the alternation list manually is not a good choice. Also, recall that as per precedence rules, the longest string should come first.
>> h = { 'hand' => 1, 'handy' => 2, 'handful' => 3, 'a^b' => 4 }
>> pat = Regexp.union(h.keys.sort_by { |w| -w.length })
>> pat
=> /handful|handy|hand|a\^b/
>> 'handful hand pin handy (a^b)'.gsub(pat, h)
=> "3 1 pin 2 (4)"
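For contrast, this sketch builds the alternation from the keys in their original order, without sorting by length. Since hand comes first, it wins over handy and handful, and the leftover characters stay in the output. The names unsorted and wrong are illustrative.

```ruby
h = { 'hand' => 1, 'handy' => 2, 'handful' => 3, 'a^b' => 4 }

# keys used in insertion order, shortest alternative first
unsorted = Regexp.union(h.keys)
wrong = 'handful hand pin handy (a^b)'.gsub(unsorted, h)

p unsorted   # /hand|handy|handful|a\^b/
p wrong      # "1ful 1 pin 1y (4)"
```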
Substitution in conditional expression
The sub!
and gsub!
methods return nil
if the substitution fails. That makes them usable as part of a conditional expression leading to creative and terser solutions.
# display results only if the substitution succeeds
>> num = '4'
>> puts "#{num} apples" if num.sub!(/5/) { $&.to_i ** 2 }
=> nil
>> puts "#{num} apples" if num.sub!(/4/) { $&.to_i ** 2 }
16 apples
# delete 'fin' and keep repeating the process on the modified string
# 'cnt' keeps track of number of substitutions made
>> word, cnt = ['coffining', 0]
>> cnt += 1 while word.sub!(/fin/, '')
=> nil
>> [word, cnt]
=> ["cog", 2]
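The loop is needed here because removing one fin can bring together characters that form another fin. A single gsub pass doesn't revisit the modified text, as this sketch shows. The dup call is only there to make sure the string being modified in place is a mutable copy.

```ruby
# one pass: the 'fin' created by the first removal is not seen
single_pass = 'coffining'.gsub(/fin/, '')

# repeated in-place substitution until sub! returns nil
word = 'coffining'.dup
cnt = 0
cnt += 1 while word.sub!(/fin/, '')

p single_pass   # "cofing"
p word          # "cog"
p cnt           # 2
```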
Cheatsheet and Summary
| Note | Description |
|---|---|
| s.match(/pat/) | returns a MatchData object |
| | which has details like matched portions, location, etc |
| | match and match? methods treat string argument as regexp |
| m[0] | entire matched portion of MatchData object m |
| m[1] | matched portion of the first capture group |
| m[2] | matched portion of the second capture group and so on |
| m.to_a | array of the entire matched portion and capture groups |
| m.captures | array of only the capture group portions |
| m.offset(N) | array of start and end+1 index of Nth group |
| m.begin(N) | start index of Nth group |
| m.end(N) | end+1 index of Nth group |
| s[/pat/] | same as s.match(/pat/)[0] |
| s[/pat/, N] | same as s.match(/pat/)[N] |
| s[/pat/] = 'replace' | same as s.sub!(/pat/, 'replace') |
| s.scan(/pat/) | returns all the matches as an array |
| | if capture groups are used, only their matches are returned |
| | each element will be an array of capture group matches |
| | match and scan methods also support block form |
| split | capture groups affect the split method too |
| | text matched by the groups will be part of the output |
| | portion matched by pattern outside group won't be in output |
| | group that didn't match will be absent from the output |
| $~ | contains MatchData object |
| $` | string before the matched portion |
| $& | matched portion |
| $' | string after the matched portion |
| $N | matched portion of Nth capture group |
| $+ | matched portion of the last group |
| s.gsub(/pat/, h) | replacement string based on the matched text as hash key |
| | applicable for the sub method as well |
| in-place substitution | sub! and gsub! return nil if substitution fails |
| | makes them usable as part of a conditional expression |
| | ex: cnt += 1 while word.sub!(/fin/, '') |
This chapter introduced different ways to work with various matching portions of the input string. The match
method returns a MatchData
object that helps you get the portion matched by the regexp, capture groups, location of the match, etc. To get all the matching portions as an array of strings instead of just the first match, you can use the scan
method. You also learnt how capture groups affect the output of the scan
and split
methods.
You'll see many more uses of groupings in the coming chapters. All regexp usage (except the match? method) also sets global variables, which provide information similar to the MatchData object. You also learnt tricks like passing blocks to methods, using a hash as the source of the replacement string, using a regexp as a string index, etc.
Exercises
1) For the given strings, extract the matching portion from the first is to the last t.
>> str1 = 'This the biggest fruit you have seen?'
>> str2 = 'Your mission is to read and practice consistently'
>> pat = ##### add your solution here
##### add your solution here for str1
=> "is the biggest fruit"
##### add your solution here for str2
=> "ission is to read and practice consistent"
2) Find the starting index of the first occurrence of is
or the
or was
or to
for the given input strings.
>> s1 = 'match after the last newline character'
>> s2 = 'and then you want to test'
>> s3 = 'this is good bye then'
>> s4 = 'who was there to see?'
>> pat = ##### add your solution here
##### add your solution here for s1
=> 12
##### add your solution here for s2
=> 4
##### add your solution here for s3
=> 2
##### add your solution here for s4
=> 4
3) Find the starting index of the last occurrence of is
or the
or was
or to
for the given input strings.
>> s1 = 'match after the last newline character'
>> s2 = 'and then you want to test'
>> s3 = 'this is good bye then'
>> s4 = 'who was there to see?'
>> pat = ##### add your solution here
##### add your solution here for s1
=> 12
##### add your solution here for s2
=> 18
##### add your solution here for s3
=> 17
##### add your solution here for s4
=> 14
4) Extract everything after the :
character, which occurs only once in the input.
>> ip = 'fruits:apple, mango, guava, blueberry'
##### add your solution here
=> "apple, mango, guava, blueberry"
5) The given input strings contain some text followed by - followed by a number. Replace that number with its log value using Math.log().
>> s1 = 'first-3.14'
>> s2 = 'next-123'
>> pat = ##### add your solution here
##### add your solution here for s1
=> "first-1.144222799920162"
##### add your solution here for s2
=> "next-4.812184355372417"
6) Replace all occurrences of par with spar, spare with extra and park with garden for the given input strings.
>> str1 = 'apartment has a park'
>> str2 = 'do you have a spare cable'
>> str3 = 'write a parser'
##### add your solution here for str1
=> "aspartment has a garden"
##### add your solution here for str2
=> "do you have a extra cable"
##### add your solution here for str3
=> "write a sparser"
7) Extract all words between (
and )
from the given input string as an array. Assume that the input will not contain any broken parentheses.
>> ip = 'another (way) to reuse (portion) matched (by) capture groups'
# as nested array
##### add your solution here
=> [["way"], ["portion"], ["by"]]
# as array of strings
##### add your solution here
=> ["way", "portion", "by"]
8) Extract all occurrences of < up to the next occurrence of >, provided there is at least one character in between < and >.
>> ip = 'a<apple> 1<> b<bye> 2<> c<cat>'
##### add your solution here
=> ["<apple>", "<> b<bye>", "<> c<cat>"]
9) Use scan
to get the output as shown below for the given input strings. Note the characters used in the input strings carefully.
>> row1 = '-2,5 4,+3 +42,-53 4356246,-357532354 '
>> row2 = '1.32,-3.14 634,5.63 63.3e3,9907809345343.235 '
>> pat = ##### add your solution here
>> row1.scan(pat)
=> [["-2", "5"], ["4", "+3"], ["+42", "-53"], ["4356246", "-357532354"]]
>> row2.scan(pat)
=> [["1.32", "-3.14"], ["634", "5.63"], ["63.3e3", "9907809345343.235"]]
10) This is an extension to the previous question.
- For row1, find the sum of integers of each array element. For example, sum of -2 and 5 is 3.
- For row2, find the sum of floating-point numbers of each array element. For example, sum of 1.32 and -3.14 is -1.82.
>> row1 = '-2,5 4,+3 +42,-53 4356246,-357532354 '
>> row2 = '1.32,-3.14 634,5.63 63.3e3,9907809345343.235 '
# should be same as the previous question
>> pat = ##### add your solution here
##### add your solution here for row1
=> [3, 7, -11, -353176108]
##### add your solution here for row2
=> [-1.82, 639.63, 9907809408643.234]
11) Use the split
method to get the output as shown below.
>> ip = '42:no-output;1000:car-tr:u-ck;SQEX49801'
>> ip.split() ##### add your solution here
=> ["42", "output", "1000", "tr:u-ck", "SQEX49801"]
12) Convert the comma separated strings to corresponding hash
objects as shown below. Note that the input strings have an extra ,
at the end.
>> row1 = 'name:rohan,maths:75,phy:89,'
>> row2 = 'name:rose,maths:88,phy:92,'
>> pat = ##### add your solution here
##### add your solution here for row1
=> {"name"=>"rohan", "maths"=>"75", "phy"=>"89"}
##### add your solution here for row2
=> {"name"=>"rose", "maths"=>"88", "phy"=>"92"}