Groupings and backreferences

This chapter will show how to reuse portions matched by capture groups via backreferences. These can be used within the regexp definition as well as the replacement section. You'll also learn some special grouping syntax for cases where plain capture groups aren't enough.

Backreferences

First up, how to refer to capture group portions directly in the regexp definition and the replacement section. You have already seen how to refer to text captured by groups with the match() and matchAll() methods. You've also seen how to pass the captured portions to a function in the replace() method.

More directly, you can use a backreference \N (within the regexp definition) and $N (replacement section), where N is the capture group you want. What's more, you can also apply quantifiers to backreferences when used in the regexp definition. All the various forms are listed below:

in the replacement section, use $1, $2, etc to refer to the corresponding capture group
in the replacement section, use $& to refer to the entire matched portion
- $` gives the string before the matched portion
- $' gives the string after the matched portion
within the regexp definition, use \1, \2, etc to refer to the corresponding capture group

// remove square brackets that surround digit characters
> '[52] apples [and] [31] mangoes'.replace(/\[(\d+)\]/g, '$1')
< '52 apples [and] 31 mangoes'

// replace __ with _ and delete _ if it is alone
> '_apple_ __123__ _banana_'.replace(/(_)?_/g, '$1')
< 'apple _123_ banana'

// swap words that are separated by a comma
> 'good,bad 42,24 x,y'.replace(/(\w+),(\w+)/g, '$2,$1')
< 'bad,good 24,42 y,x'

Here are some examples for using backreferences available by default without needing capture groups.

// add something around the entire matched portion
> '52 apples and 31 mangoes'.replace(/\d+/g, '($&)')
< '(52) apples and (31) mangoes'
> 'Hello world'.replace(/.*/, 'Hi. $&. Have a nice day')
< 'Hi. Hello world. Have a nice day'

// capture the first field and duplicate it as the last field
> 'fork,42,nice,3.14'.replace(/,.+/, '$&,$`')
< 'fork,42,nice,3.14,fork'

And here are some examples for using backreferences within the regexp definition.

// elements that have at least one consecutive repeated word character
> let words = ['moon', 'mono', 'excellent', 'POLL', 'a22b']
> words.filter(w => /(\w)\1/.test(w))
< ['moon', 'excellent', 'POLL', 'a22b']

// remove any number of consecutive duplicate words separated by space
// note the use of quantifier on backreferences
// use \W+ instead of space to cover cases like 'a;a<-;a'
> 'aa a a a 42 f_1 f_1 f_13.14'.replace(/\b(\w+)( \1)+\b/g, '$1')
< 'aa a 42 f_1 f_13.14'

Backreference oddities

Since $ is special in the replacement section, there's an issue to represent it literally if followed by numbers. Usually, escaping is used for such purposes, but here you need to use $$.

// no capture group used, so '$1' is inserted literally
> 'cat'.replace(/a/, '{$1}')
< 'c{$1}t'
// capture group used, '\$1' is same as '$1' here
> 'cat'.replace(/(a)/, '{\$1}')
< 'c{a}t'

// use '$$' to insert '$' literally
> 'cat'.replace(/(a)/, '{$$1}')
< 'c{$1}t'

Another issue is how to avoid ambiguity when you have normal digits immediately following a backreference? It'll depend on how many backreferences are present in the pattern and whether you need to avoid ambiguity in the regexp definition or the replacement section. For example, if there are less than 10 groups, then something like $12 will refer to the first capture group and 2 as a character. If there are no capture groups, then something like $5 will get inserted literally as $ and 5.

// $15 here will backreference the 1st group and use 5 as a character
> '[52] apples and [31] mangoes'.replace(/\[(\d+)\]/g, '($15)')
< '(525) apples and (315) mangoes'

// $3 will be inserted literally since there is only one capture group
> '[52] apples and [31] mangoes'.replace(/\[(\d+)\]/g, '$3')
< '$3 apples and $3 mangoes'

// $1 will be inserted literally since there are no capture groups
> '[52] apples and [31] mangoes'.replace(/\[\d+\]/g, '$1')
< '$1 apples and $1 mangoes'

On the other hand, if you have more than 9 but less than 100 groups, then there would be an issue if you want to refer to a single digit group followed by literal digit characters. The workaround is to prefix 0 such that the number of digits after $ equals the number of digits required for the highest capture group. So, if you have more than 9 but less than 100 groups, $05 will refer to the 5th capture group and any digit after that will be treated literally.

// for illustration purposes, 12 capture groups have been defined here
// if you want to reference 2-digit group, there's no issue
> 'abcdefghijklmn'.replace(/(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)/, '$11')
< 'kmn'

// what if you wanted to reference 1st group followed by '1' as a character?
// using \x31 wouldn't work as it still results in $11
> 'abcdefghijklmn'.replace(/(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)/, '$1\x31')
< 'kmn'
// prefix a '0' so that '$01' becomes the reference and '1' becomes character
> 'abcdefghijklmn'.replace(/(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)/, '$011')
< 'a1mn'

The workaround is simpler in the regexp definition. The 0 prefix trick doesn't work, but using an ASCII code with \xhh works. For example, \1\x31 will refer to the first capture group followed by 1 as a character.

> 'abcdefghijklmna1d'.replace(/(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.).*\1\x31/, 'X')
< 'Xd'

Non-capturing groups

Grouping has many uses like applying quantifiers on a regexp portion, creating terse regexp by factoring common portions and so on. It also affects the behavior of the split() method as seen in the split() with capture groups section. Unlike similar methods in other languages, the match() method with the g flag isn't affected by capture groups and returns the entire matched portions.

// split method without capture group
> 'Sample123string42with777numbers'.split(/\d+/)
< ['Sample', 'string', 'with', 'numbers']
// split method with capture group
> 'Sample123string42with777numbers'.split(/(\d+)/)
< ['Sample', '123', 'string', '42', 'with', '777', 'numbers']

// match method example with the 'g' flag and capture groups
> 'effort flee facade oddball rat tool'.match(/\b\w*(\w)\1\w*\b/g)
< ['effort', 'flee', 'oddball', 'tool']
// here's another example
> 'hi 123123123 bye 456123456'.match(/(123)+/g)
< ['123123123', '123']

When backreferencing is not required, you can use a non-capturing group to avoid the behavior change of split() method. It also helps to avoid keeping a track of capture group numbers when that particular group is not needed for backreferencing. The syntax is (?:pat) to define a non-capturing group, where pat is an abbreviation for a portion of the regular expression pattern. More such special groups starting with (? syntax will be discussed later on.

// here, grouping is needed to take out common portion and apply quantifier
// but using capture group will not give expected output
> '123hand42handy777handful500'.split(/hand(y|ful)?/)
< ['123', undefined, '42', 'y', '777', 'ful', '500']
// non-capturing group to the rescue
> '123hand42handy777handful500'.split(/hand(?:y|ful)?/)
< ['123', '42', '777', '500']

// with normal grouping, need to keep track of all the groups
> '1,2,3,4,5,6,7'.replace(/^(([^,]+,){3})([^,]+)/, '$1($3)')
< '1,2,3,(4),5,6,7'
// using non-capturing groups, only the relevant groups have to be tracked
> '1,2,3,4,5,6,7'.replace(/^((?:[^,]+,){3})([^,]+)/, '$1($2)')
< '1,2,3,(4),5,6,7'

Referring to the text matched by a capture group with a quantifier will give only the last match, not the entire match. Use a capture group around the grouping and quantifier together to get the entire matching portion. In such cases, the inner grouping is an ideal candidate to be specified as non-capturing.

// '$1' here contains only the fourth field
> 'so:cat:rest:in:put:to'.replace(/^([^:]+:){4}/, '($1)')
< '(in:)put:to'

// '$1' will now contain the entire matching portion
> 'so:cat:rest:in:put:to'.replace(/^((?:[^:]+:){4})/, '($1)')
< '(so:cat:rest:in:)put:to'

Named capture groups

Regexp can get cryptic and difficult to maintain, even for seasoned programmers. There are a few constructs to help add clarity. One such is naming the capture groups and using that name for backreferencing instead of plain numbers. In addition, the named capture group portions can be extracted as key-value pairs using the groups property. The syntax is:

(?<name>pat) for naming the capture groups
\k<name> for backreferencing in the regexp definition
$<name> for backreferencing in the replacement section

> let row = 'today,2008-03-24,food,2008-03-24,nice,2018-10-25,5632'

// same as: /(\d{4}-\d{2}-\d{2}).*\1/
> row.match(/(?<date>\d{4}-\d{2}-\d{2}).*\k<date>/)[0]
< '2008-03-24,food,2008-03-24'

// giving names to the first and second captured words
// same as: replace(/(\w+),(\w+)/g, '$2,$1')
> 'good,bad 42,24 x,y'.replace(/(?<fw>\w+),(?<sw>\w+)/g, '$<sw>,$<fw>')
< 'bad,good 24,42 y,x'

Here's an example with the groups property.

> let m = '2018-10-25,car,2346'.match(/(?<date>[^,]+),(?<product>[^,]+)/)

> m.groups
< {date: '2018-10-25', product: 'car'}
> m.groups.date
< '2018-10-25'
> m.groups.product
< 'car'

Cheatsheet and Summary

Note	Description
Backreference	gives the matched portion of the Nth capture group
	use `$1`, `$2`, `$3`, etc in the replacement section
	`$&` gives the entire matched portion
	$` gives the string before the matched portion
	`$'` gives the string after the matched portion
	use `\1`, `\2`, `\3`, etc within the regexp definition
`$$`	insert `$` literally in the replacement section
`$0N`	same as `$N`, allows to separate backreference and other digits
`\N\xhh`	allows to separate backreference and digits in the regexp definition
`(?:pat)`	non-capturing group
`(?<name>pat)`	named capture group
	use `\k<name>` for backreferencing in the regexp definition
	use `$<name>` for backreferencing in the replacement section
	named captures are also accessible via `groups` property

This chapter covered many more features related to grouping — backreferencing to get the matched portion of capture groups and naming the groups to add clarity. When capture groups result in unwanted behavior change (for ex: split() method), you can use non-capturing groups instead. You'll see more such special groups in the Lookarounds chapter.

Exercises

1) Replace the space character that occurs after a word ending with a or r with a newline character.

> let ip = 'area not a _a2_ roar took 22'

> console.log()     // add your solution here
  area
  not a
  _a2_ roar
  took 22

2) Add [] around words starting with s and containing e and t in any order.

> let ip = 'sequoia subtle exhibit asset sets2 tests si_te'

// add your solution here
< 'sequoia [subtle] exhibit asset [sets2] tests [si_te]'

3) Replace all whole words with X that start and end with the same word character (irrespective of case). Single character word should get replaced with X too, as it satisfies the stated condition.

> let ip = 'oreo not a _a2_ Roar took 22'

// add your solution here
< 'X not X X X took X'

4) Convert the given markdown headers to corresponding anchor tags. Consider the input to start with one or more # characters followed by space and word characters. The name attribute is constructed by converting the header to lowercase and replacing spaces with hyphens. Can you do it without using a capture group?

> let header1 = '# Regular Expressions'
> let header2 = '## Named capture groups'

> function hyphenify(m) {
      // add your solution here
  }

> header1.replace()     // add your solution here
< "# <a name='regular-expressions'></a>Regular Expressions"
> header2.replace()     // add your solution here
< "## <a name='named-capture-groups'></a>Named capture groups"

5) Convert the given markdown anchors to corresponding hyperlinks.

> let anchor1 = "# <a name='regular-expressions'></a>Regular Expressions"
> let anchor2 = "## <a name='subexpression-calls'></a>Subexpression calls"

> const hyperlink =         // add your solution here

> anchor1.replace()         // add your solution here
< '[Regular Expressions](#regular-expressions)'
> anchor2.replace()         // add your solution here
< '[Subexpression calls](#subexpression-calls)'

6) Check if the given input strings have words with at least two consecutive repeated alphabets irrespective of case. For example, words like stillnesS and Committee should return true but words like root or readable or rotational should return false. Consider word to be as defined in regular expression parlance.

> let s1 = 'readable COMMItTEe'
> let s2 = 'rotational sti1lness _foot_'
> let s3 = 'needed repeated'
> let s4 = 'offsh00t'

> const pat1 =      // add your solution here

> pat1.test(s1)
true
> pat1.test(s2)
false
> pat1.test(s3)
false
> pat1.test(s4)
true

7) For the given input string, replace all occurrences of digit sequences with only the unique non-repeating sequence. For example, 232323 should be changed to 23 and 897897 should be changed to 897. If there are no repeats (for example 1234) or if the repeats end prematurely (for example 12121), it should not be changed.

> let ip = '1234 2323 453545354535 9339 11 60260260'

// add your solution here
< '1234 23 4535 9339 1 60260260'

8) Replace sequences made up of words separated by : or . by the first word of the sequence. Such sequences will end when : or . is not followed by a word character.

> let ip = 'wow:Good:2_two.five: hi-2 bye kite.777:water.'

// add your solution here
< 'wow hi-2 bye kite'

9) Replace sequences made up of words separated by : or . by the last word of the sequence. Such sequences will end when : or . is not followed by a word character.

> let ip = 'wow:Good:2_two.five: hi-2 bye kite.777:water.'

// add your solution here
< 'five hi-2 bye water'

10) Split the given input string on one or more repeated sequence of cat.

> let ip = 'firecatlioncatcatcatbearcatcatparrot'

// add your solution here
< ['fire', 'lion', 'bear', 'parrot']

11) For the given input string, find all occurrences of digit sequences with at least one repeating sequence. For example, 232323 and 897897. If the repeats end prematurely, for example 12121, it should not be matched.

> let ip = '1234 2323 453545354535 9339 11 60260260'

> const pat2 =      // add your solution here

// entire sequences in the output
// add your solution here
< ['2323', '453545354535', '11']

// only the unique sequence in the output
// add your solution here
< ['23', '4535', '1']

12) Convert the comma separated strings to corresponding key-value pair mapping as shown below. The keys are name, maths and phy for the three fields in the input strings.

> let row1 = 'rohan,75,89'
> let row2 = 'rose,88,92'

> const pat3 =      // add your solution here

// add your solution here for row1
< {name: 'rohan', maths: '75', phy: '89'}

// add your solution here for row2
< {name: 'rose', maths: '88', phy: '92'}

13) Surround all whole words with (). Additionally, if the whole word is imp or ant, delete them. Can you do it with just a single substitution?

> let ip = 'tiger imp goat eagle ant important'

// add your solution here
< '(tiger) () (goat) (eagle) () (important)'

Understanding JavaScript RegExp