Groupings and backreferences

This chapter will show how to reuse portion matched by capture groups via backreferences within regexp definition and replacement section. You'll also learn some of the special grouping syntax for cases where plain capture groups isn't enough.

Backreferences

Backreferences are like variables in a programming language. You have already seen how to use MatchData object and regexp global variables to refer to the text matched by capture groups. Backreferences provide the same functionality, with the advantage that these can be directly used in regexp definition as well as replacement section. Another advantage is that you can apply quantifiers to backreferences.

The syntax for replacement section is \N where N is the capture group you want.

  • use \1, \2 up to \9 to refer to the corresponding capture group
    • use block form with global variables if you need more than 9 backreferences in replacement section
  • use \0 or \& to get entire the matched portion (same as $&)
    • similarly, \` and \' are equivalents for $` and $' respectively
# remove square brackets that surround digit characters
>> '[52] apples [and] [31] mangoes'.gsub(/\[(\d+)\]/, '\1')
=> "52 apples [and] 31 mangoes"

# replace __ with _ and delete _ if it is alone
>> '_foo_ __123__ _baz_'.gsub(/(_)?_/, '\1')
=> "foo _123_ baz"

# add something around the matched strings
>> '52 apples and 31 mangoes'.gsub(/\d+/, '(\0)')
=> "(52) apples and (31) mangoes"
>> 'Hello world'.sub(/.*/, 'Hi. \0. Have a nice day')
=> "Hi. Hello world. Have a nice day"

# duplicate first field and add it as last field
>> 'fork,42,nice,3.14'.sub(/,.+/, '\0,\`')
=> "fork,42,nice,3.14,fork"

# swap words that are separated by a comma
>> 'good,bad 42,24'.gsub(/(\w+),(\w+)/, '\2,\1')
=> "bad,good 24,42"

info If double quotes is used in replacement section, you'll have to use \\1, \\2, \\3 and so on.

In regexp definition, the syntax is \N or \k<N> to refer to the Nth capture group. Unlike replacement section, there is no upper limit of 9. Here's some examples for using backreferences within regexp definition.

# elements that have at least one consecutive repeated character
>> %w[effort flee facade oddball rat tool].grep(/(\w)\1/)
=> ["effort", "flee", "oddball", "tool"]

# remove any number of consecutive duplicate words separated by space
# note the use of quantifier on backreferences
# use \W+ instead of space to cover cases like 'a;a<-;a'
>> 'aa a a a 42 f_1 f_1 f_13.14'.gsub(/\b(\w+)( \1)+\b/, '\1')
=> "aa a 42 f_1 f_13.14"

Use \k<N> or escape sequences to avoid ambiguity between normal digit characters and backreferences.

# \12 is treated as 12th backreference even though there's only one capture group
>> 'two one 5 one2 three'.match?(/([a-z]+).*\12/)
=> false
# there's no ambiguity here as \k<1> can only mean 1st backreference
>> 'two one 5 one2 three'.match?(/([a-z]+).*\k<1>2/)
=> true

>> s = 'abcdefghijklmna1d'
# there are 12 capture groups here
# requirement is \1 as backreference and 1 as normal digit
# using escapes is another way to avoid ambiguity
>> s.sub(/(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.).*\1\x31/, 'X')
=> "Xd"

info See Negative backreferences section for another use of \k<N> backreference syntax.

Non-capturing groups

Grouping has many uses like applying quantifier on a regexp portion, creating terse regexp by factoring common portions and so on. It also affects the output of scan and split methods as seen in Working with matched portions chapter.

When backreferencing is not required, you can use a non-capturing group to avoid behavior change of scan and split methods. It also helps to avoid keeping a track of capture group numbers when that particular group is not needed for backreferencing. The syntax is (?:pat) to define a non-capturing group. Recall that (?>pat) is used to define atomic groups. More such special groups starting with (? syntax will be discussed later on.

# normal capture group will hinder ability to get whole match
# non-capturing group to the rescue
>> 'cost akin more east run against'.scan(/\b\w*(?:st|in)\b/)
=> ["cost", "akin", "east", "against"]

# capturing wasn't needed here, only common grouping and quantifier
>> '123hand42handy777handful500'.split(/hand(?:y|ful)?/)
=> ["123", "42", "777", "500"]

# with normal grouping, need to keep track of all the groups
>> '1,2,3,4,5,6,7'.sub(/\A(([^,]+,){3})([^,]+)/, '\1(\3)')
=> "1,2,3,(4),5,6,7"
# using non-capturing groups, only relevant groups have to be tracked
>> '1,2,3,4,5,6,7'.sub(/\A((?:[^,]+,){3})([^,]+)/, '\1(\2)')
=> "1,2,3,(4),5,6,7"

Referring to text matched by a capture group with a quantifier will give only the last match, not entire match. Use a capture group around the grouping and quantifier together to get the entire matching portion. In such cases, the inner grouping is an ideal candidate to use non-capturing group.

>> s = 'hi 123123123 bye 456123456'
>> s.scan(/(123)+/)
=> [["123"], ["123"]]
>> s.scan(/(?:123)+/)
=> ["123123123", "123"]
# note that this issue doesn't affect substitutions
>> s.gsub(/(123)+/, 'X')
=> "hi X bye 456X456"

>> row = 'one,2,3.14,42,five'
# surround only fourth column with double quotes
# note the loss of columns in the first case
>> puts row.sub(/\A([^,]+,){3}([^,]+)/, '\1"\2"')
3.14,"42",five
>> puts row.sub(/\A((?:[^,]+,){3})([^,]+)/, '\1"\2"')
one,2,3.14,"42",five

However, there are situations where capture groups cannot be avoided. In such cases, gsub can be used instead of scan. Recall that gsub can return an Enumerator which can be hacked to simulate scan like behavior.

# same as: scan(/\b\w*(?:st|in)\b/) but using capture group for gsub
>> 'cost akin more east run against'.gsub(/\b\w*(st|in)\b/).to_a
=> ["cost", "akin", "east", "against"]
# same as: scan(/\b\w*(?:st|in)\b/).map(&:upcase)
>> 'cost akin more east run against'.gsub(/\b\w*(st|in)\b/).map(&:upcase)
=> ["COST", "AKIN", "EAST", "AGAINST"]

# now for an example that is not possible with scan
# get whole words containing at least one consecutive repeated character
>> 'effort flee facade oddball rat tool'.gsub(/\b\w*(\w)\1\w*\b/).to_a
=> ["effort", "flee", "oddball", "tool"]

Subexpression calls

It may be obvious, but it should be noted that backreference will provide the string that was matched, not the regexp that was inside the capture group. For example, if (\d[a-f]) matches 3b, then backreferencing will only give 3b and not any other valid match of regexp like 8f, 0a etc. This is akin to how variables behave in programming, only the result of expression stays after variable assignment, not the expression itself.

To refer to the regexp itself, use \g<0>, \g<1>, \g<2> etc. This is applicable only in regexp definition, not in replacement sections. This behavior, which is similar to function call, is known as subexpression call in regular expression parlance. Recursion will be discussed in the next section.

>> row = 'today,2008-03-24,food,2012-08-12,nice,5632'

# same as: /\d{4}-\d{2}-\d{2}.*\d{4}-\d{2}-\d{2}/
>> row[/(\d{4}-\d{2}-\d{2}).*\g<1>/]
=> "2008-03-24,food,2012-08-12"

\g affects what text is represented by the original capture group in context of scan and split methods and normal backreference. You will get only the latest matched text.

>> d = '2008-03-24,2012-08-12 2017-06-27,2018-03-25 1999-12-23,2001-05-08'

# output has the value matched by \g<1> and not the original capture group
>> d.scan(/(\d{4}-\d{2}-\d{2}),\g<1>/)
=> [["2012-08-12"], ["2018-03-25"], ["2001-05-08"]]

# this will retain the second date of each pair
>> d.gsub(/(\d{4}-\d{2}-\d{2}),\g<1>/, '\1')
=> "2012-08-12 2018-03-25 2001-05-08"
# to retain the first date of each pair, use another capture group
# and adjust the backreference numbers
>> d.gsub(/((\d{4}-\d{2}-\d{2})),\g<2>/, '\1')
=> "2008-03-24 2017-06-27 1999-12-23"

info See also stackoverflow: How to prevent \g subexpression from overwriting the grouped content. The trick is to switch the subexpression call and the capture group.

Recursive matching

The \g<N> subexpression call supports recursion as well. Useful to match nested patterns, which is usually not recommended to be done with regular expressions. Indeed, use a proper parser library if you are looking to parse file formats like html, xml, json, csv, etc. But for some cases, a parser might not be available and using regexp might be simpler than writing a parser from scratch.

First up, matching a set of parentheses that is not nested (termed as level-one regexp for reference).

# note the use of possessive quantifier
>> eqn0 = 'a + (b * c) - (d / e)'
>> eqn0.scan(/\([^()]++\)/)
=> ["(b * c)", "(d / e)"]

>> eqn1 = '((f+x)^y-42)*((3-g)^z+2)'
>> eqn1.scan(/\([^()]++\)/)
=> ["(f+x)", "(3-g)"]

Next, matching a set of parentheses which may optionally contain any number of non-nested sets of parentheses (termed as level-two regexp for reference). See debuggex for a railroad diagram, which visually shows the recursive nature of this regexp.

>> eqn1 = '((f+x)^y-42)*((3-g)^z+2)'
# note the use of non-capturing group
>> eqn1.scan(/\((?:[^()]++|\([^()]++\))++\)/)
=> ["((f+x)^y-42)", "((3-g)^z+2)"]

>> eqn2 = 'a + (b) + ((c)) + (((d)))'
>> eqn2.scan(/\((?:[^()]++|\([^()]++\))++\)/)
=> ["(b)", "((c))", "((d))"]

That looks very cryptic. You can use x modifier for clarity as well as to make it easy to compare against the recursive version. Breaking down the regexp, you can see ( and ) have to be matched literally. Inside that, valid string is made up of either non-parentheses characters or a non-nested parentheses sequence (level-one regexp).

>> lvl2 = /\(               #literal (
             (?:            #start of non-capturing group
               [^()]++      #non-parentheses characters
               |            #OR
               \([^()]++\)  #level-one regexp
             )++            #end of non-capturing group, 1 or more times
           \)               #literal )
          /x

>> eqn1.scan(lvl2)
=> ["((f+x)^y-42)", "((3-g)^z+2)"]

>> eqn2.scan(lvl2)
=> ["(b)", "((c))", "((d))"]

To recursively match any number of nested sets of parentheses, use a capture group and call it within the capture group itself. Since entire regexp needs to be called here, you can use the default zeroth capture group (this also helps to avoid having to use gsub+to_a trick). Comparing with level-two regexp, the only change is that \g<0> is used instead of the level-one regexp in the second alternation.

>> lvln = /\(               #literal (
             (?:            #start of non-capturing group
               [^()]++      #non-parentheses characters
               |            #OR
               \g<0>        #recursive call
             )++            #end of non-capturing group, 1 or more times
           \)               #literal )
          /x

>> eqn0.scan(lvln)
=> ["(b * c)", "(d / e)"]

>> eqn1.scan(lvln)
=> ["((f+x)^y-42)", "((3-g)^z+2)"]

>> eqn2.scan(lvln)
=> ["(b)", "((c))", "(((d)))"]

>> eqn3 = '(3+a) * ((r-2)*(t+2)/6) + 42 * (a(b(c(d(e)))))'
>> eqn3.scan(lvln)
=> ["(3+a)", "((r-2)*(t+2)/6)", "(a(b(c(d(e)))))"]

Named capture groups

Regexp can get cryptic and difficult to maintain, even for seasoned programmers. There are a few constructs to help add clarity. One such is naming the capture groups and using that name for backreferencing instead of plain numbers. The syntax is (?<name>pat) or (?'name'pat) for naming the capture groups and \k<name> for backreferencing. Any other normal capture group will be treated as if it was a non-capturing group.

# giving names to first and second captured words
>> 'good,bad 42,24'.gsub(/(?<fw>\w+),(?<sw>\w+)/, '\k<sw>,\k<fw>')
=> "bad,good 24,42"

# alternate syntax
>> 'good,bad 42,24'.gsub(/(?'fw'\w+),(?'sw'\w+)/, '\k<sw>,\k<fw>')
=> "bad,good 24,42"

Named capture group can be used for subexpression call using \g<name> syntax.

>> row = 'today,2008-03-24,food,2012-08-12,nice,5632'

>> row[/(?<date>\d{4}-\d{2}-\d{2}).*\g<date>/]
=> "2008-03-24,food,2012-08-12"

If you recall, both /pat/ =~ string and string =~ /pat/ forms can be used. One advantage of first syntax is that named capture groups will also create variables with those names and can be used instead of relying on global variables.

>> details = '2018-10-25,car'

>> /(?<date>[^,]+),(?<product>[^,]+)/ =~ details
=> 0

# same as: $1
>> date
=> "2018-10-25"
# same as: $2
>> product
=> "car"

You can use named_captures method on the MatchData object to extract the portions matched by named capture groups as a hash object. The capture group name will be the key and the portion matched by the group will be the value.

# single match
>> details = '2018-10-25,car,2346'
>> details.match(/(?<date>[^,]+),(?<product>[^,]+)/).named_captures
=> {"date"=>"2018-10-25", "product"=>"car"}

# normal groups won't be part of the output
>> details.match(/(?<date>[^,]+),([^,]+)/).named_captures
=> {"date"=>"2018-10-25"}

# multiple matches
>> s = 'good,bad 42,24'
>> s.gsub(/(?<fw>\w+),(?<sw>\w+)/).map { $~.named_captures }
=> [{"fw"=>"good", "sw"=>"bad"}, {"fw"=>"42", "sw"=>"24"}]

Negative backreferences

Another approach when there are multiple capture groups is to use negative backreference. The negative numbering starts with -1 to refer to capture group closest to the backreference that was defined before the use of negative backreference. In other words, the highest numbered capture group prior to the negative backreference will be -1, the second highest will be -2 and so on. The \k<N> syntax with negative N becomes a negative backreference. This can only be used in regexp definition section as \k in replacement section is reserved for named references.

# check if third and fourth columns have same data
# same as: match?(/\A([^,]+,){2}([^,]+),\2,/)
>> '1,2,3,3,5'.match?(/\A([^,]+,){2}([^,]+),\k<-1>,/)
=> true

Conditional groups

This special grouping allows you to add a condition that depends on whether a capture group succeeded in matching. You can also add an optional else condition. The syntax as per Onigmo doc is shown below.

(?(cond)yes-subexp|no-subexp)

Here cond is N for normal backreference and <name> or 'name' for named backreference. no-subexp is optional. Here's an example with yes-subexp alone being used. The task is to match elements containing word characters only or if it additionally starts with a < character, it must end with a > character.

>> words = %w[<hi> bye bad> <good> 42 <3]
>> words.grep(/\A(<)?\w+(?(1)>)\z/)
=> ["<hi>", "bye", "<good>", "42"]

# for this simple case, you can also expand it manually
# but for complex patterns, it is better to use conditional groups
# as it will avoid repeating the complex pattern
>> words.grep(/\A(?:<\w+>|\w+)\z/)
=> ["<hi>", "bye", "<good>", "42"]

# cannot simply use ? quantifier as they are independent, not constrained
>> words.grep(/\A(?:<?\w+>?)\z/)
=> ["<hi>", "bye", "bad>", "<good>", "42", "<3"]

Here's an example with no-subexp as well.

# filter elements containing word characters surrounded by ()
# or, containing word characters separated by a hyphen
>> words = ['(hi)', 'good-bye', 'bad', '(42)', '-oh', 'i-j', '(-)']

# same as: /\A(?:\(\w+\)|\w+-\w+)\z/ or /\A(?:\((\w+)\)|\g<1>-\g<1>)\z/
>> words.grep(/\A(?:(\()?\w+(?(1)\)|-\w+))\z/)
=> ["(hi)", "good-bye", "(42)", "i-j"]

Conditional groups have a very specific use case, and it is generally helpful for those cases. The main advantage is that it prevents pattern duplication, although that can also be achieved using Subexpression calls. Another advantage is that if the common pattern uses capture groups, then backreference numbers will not require adjustment unlike the duplication alternation method.

Cheatsheet and Summary

NoteDescription
\Nbackreference, gives matched portion of Nth capture group
applies to both search and replacement sections
possible values: \1, \2 up to \9 in replacement section
not limited to 9 in regexp definition
\0 or \& in replacement section gives entire matched portion
\` and \' are equivalents for $` and $' respectively
\k<N>N can be both positive/negative (applicable in regexp definition)
negative is relative, -1 indicates closest group to the left of reference
\g<N>subexpression call for Nth capture group
applicable only in regexp definition, recursion also possible
ex: /\((?:[^()]++|\g<0>)++\)/ matches nested parentheses
(?:pat)non-capturing group
useful wherever grouping is required, but not backreference
(?<name>pat)named capture group
(?'name'pat)another way to define named capture group
use \k<name> for backreferencing
use \g<name> for subexpression call
/pat/ =~ snamed capture groups here will create variables with those names
named_capturesmethod applied on a MatchData object
gives named capture group portions as a hash object
(?(cond)yes|no)conditional group
match yes-subexp if backreferenced group succeeded
else, match no-subexp which is optional

This chapter showed how to use backreferences and subexpression calls to refer to portions matched by capture groups in regexp definition and replacement section. When capture groups results in unwanted behavior change (ex: scan and split), you can use non-capturing groups instead. Named capture groups add clarity to patterns and you can use named_captures method on a MatchData object to get a hash of matched portions. Conditional groups allows you to take an action based on another capture group succeeding or failing to match. There are more special groups to be discussed in coming chapters.

Exercises

a) Replace the space character that occurs after a word ending with a or r with a newline character.

>> ip = 'area not a _a2_ roar took 22'

>> puts ip.gsub()       ##### add your solution here
area
not a
_a2_ roar
took 22

b) Add [] around words starting with s and containing e and t in any order.

>> ip = 'sequoia subtle exhibit asset sets tests site'

##### add your solution here
=> "sequoia [subtle] exhibit asset [sets] tests [site]"

c) Replace all whole words with X that start and end with the same word character. Single character word should get replaced with X too, as it satisfies the stated condition.

>> ip = 'oreo not a _a2_ roar took 22'

##### add your solution here
=> "X not X X X took X"

d) Convert the given markdown headers to corresponding anchor tag. Consider the input to start with one or more # characters followed by space and word characters. The name attribute is constructed by converting the header to lowercase and replacing spaces with hyphens. Can you do it without using a capture group?

>> header1 = '# Regular Expressions'
>> header2 = '## Named capture groups'

>> anchor =     ##### add your solution here

##### add your solution here for header1
=> "# <a name='regular-expressions'></a>Regular Expressions"
##### add your solution here for header2
=> "## <a name='named-capture-groups'></a>Named capture groups"

e) Convert the given markdown anchors to corresponding hyperlinks.

>> anchor1 = "# <a name='regular-expressions'></a>Regular Expressions"
>> anchor2 = "## <a name='subexpression-calls'></a>Subexpression calls"

>> hyperlink =      ##### add your solution here

##### add your solution here for anchor1
=> "[Regular Expressions](#regular-expressions)"
##### add your solution here for anchor2
=> "[Subexpression calls](#subexpression-calls)"

f) Count the number of whole words that have at least two occurrences of consecutive repeated alphabets. For example, words like stillness and Committee should be counted but not words like root or readable or rotational.

'> ip = %q{oppressed abandon accommodation bloodless
'> carelessness committed apparition innkeeper
'> occasionally afforded embarrassment foolishness
'> depended successfully succeeded
>> possession cleanliness suppress}

##### add your solution here
=> 13

g) For the given input string, replace all occurrences of digit sequences with only the unique non-repeating sequence. For example, 232323 should be changed to 23 and 897897 should be changed to 897. If there no repeats (for example 1234) or if the repeats end prematurely (for example 12121), it should not be changed.

>> ip = '1234 2323 453545354535 9339 11 60260260'

##### add your solution here
=> "1234 23 4535 9339 1 60260260"

h) Replace sequences made up of words separated by : or . by the first word of the sequence. Such sequences will end when : or . is not followed by a word character.

>> ip = 'wow:Good:2_two:five: hi-2 bye kite.777.water.'

##### add your solution here
=> "wow hi-2 bye kite"

i) Replace sequences made up of words separated by : or . by the last word of the sequence. Such sequences will end when : or . is not followed by a word character.

>> ip = 'wow:Good:2_two:five: hi-2 bye kite.777.water.'

##### add your solution here
=> "five hi-2 bye water"

j) Split the given input string on one or more repeated sequence of cat.

>> ip = 'firecatlioncatcatcatbearcatcatparrot'

##### add your solution here
=> ["fire", "lion", "bear", "parrot"]

k) For the given input string, find all occurrences of digit sequences with at least one repeating sequence. For example, 232323 and 897897. If the repeats end prematurely, for example 12121, it should not be matched.

>> ip = '1234 2323 453545354535 9339 11 60260260'

>> pat =        ##### add your solution here

# entire sequences in the output
##### add your solution here
=> ["2323", "453545354535", "11"]

# only the unique sequence in the output
##### add your solution here
=> ["23", "4535", "1"]

l) Convert the comma separated strings to corresponding hash objects as shown below. The keys are name, maths and phy for the three fields in the input strings.

>> row1 = 'rohan,75,89'
>> row2 = 'rose,88,92'

>> pat =        ##### add your solution here

##### add your solution here for row1
=> {"name"=>"rohan", "maths"=>"75", "phy"=>"89"}
##### add your solution here for row2
=> {"name"=>"rose", "maths"=>"88", "phy"=>"92"}

m) Surround all whole words with (). Additionally, if the whole word is imp or ant, delete them. Can you do it with single substitution?

>> ip = 'tiger imp goat eagle ant important'

##### add your solution here
=> "(tiger) () (goat) (eagle) () (important)"

n) Filter all elements that contains a sequence of lowercase alphabets followed by - followed by digits. They can be optionally surrounded by {{ and }}. Any partial match shouldn't be part of the output.

>> ip = %w[{{apple-150}} {{mango2-100}} {{cherry-200 grape-87 {{go-to}}]

##### add your solution here
=> ["{{apple-150}}", "grape-87"]

o) Extract all hex character sequences, with 0x optional prefix. Match the characters case insensitively, and the sequences shouldn't be surrounded by other word characters.

>> str1 = '128A foo 0xfe32 34 0xbar'
>> str2 = '0XDEADBEEF place 0x0ff1ce bad'

>> hex_seq =        ##### add your solution here

>> str1.scan(hex_seq)
=> ["128A", "0xfe32", "34"]
>> str2.scan(hex_seq)
=> ["0XDEADBEEF", "0x0ff1ce", "bad"]

p) Replace sequences made up of words separated by : or . by the first/last word of the sequence and the separator. Such sequences will end when : or . is not followed by a word character.

>> ip = 'wow:Good:2_two:five: hi-2 bye kite.777.water.'

# first word of the sequence
##### add your solution here
=> "wow: hi-2 bye kite."

# last word of the sequence
##### add your solution here
=> "five: hi-2 bye water."

q) For the given input strings, extract if followed by any number of nested parentheses. Assume that there will be only one such pattern per input string.

>> ip1 = 'for (((i*3)+2)/6) if(3-(k*3+4)/12-(r+2/3)) while()'
>> ip2 = 'if+while if(a(b)c(d(e(f)1)2)3) for(i=1)'

>> pat =        ##### add your solution here

>> ip1[pat]
=> "if(3-(k*3+4)/12-(r+2/3))"
>> ip2[pat]
=> "if(a(b)c(d(e(f)1)2)3)"