Lookarounds

You've already seen how to create custom character classes and various forms of special groupings. In this chapter you'll learn about more groupings, known as lookarounds, that help to create custom anchors and add conditions within the regexp definition. These assertions are also known as zero-width patterns because they add restrictions similar to anchors and are not part of the matched portions. You will also learn how to negate a grouping, similar to negated character sets.

Conditional expressions

Before you get used to lookarounds too much, it is good to remember that JavaScript is a programming language. You have control structures and you can combine multiple conditions using logical operators, methods like every()/some(), etc. Also, do not forget that regexp is only one of the tools available for text processing.

> let items = ['1,2,3,4', 'a,b,c,d', '#apple 123']

// filter elements containing digit and '#' characters
> items.filter(s => /\d/.test(s) && s.includes('#'))
< ['#apple 123']
// modify elements only if they don't start with '#'
> items.filter(s => s[0] != '#').map(s => s.replace(/,.+,/, ' '))
< ['1 4', 'a d']
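
The every() and some() methods mentioned above can apply several independent regexps to each string. Here is a minimal sketch of that idea. The names all_of, any_of, has_all and has_any are hypothetical, chosen only for this illustration, and items repeats the sample array used above.

```javascript
const items = ['1,2,3,4', 'a,b,c,d', '#apple 123']

// hypothetical rule lists, chosen only for this illustration
const all_of = [/\d/, /[a-z]/]   // a digit AND a lowercase letter
const any_of = [/#/, /b,c/]      // '#' OR the sequence 'b,c'

// every() acts as a logical AND across the rules
const has_all = items.filter(s => all_of.every(r => r.test(s)))
// some() acts as a logical OR across the rules
const has_any = items.filter(s => any_of.some(r => r.test(s)))

console.log(has_all)   // ['#apple 123']
console.log(has_any)   // ['a,b,c,d', '#apple 123']
```

None of the regexps use the g flag here, so test() does not carry state from one string to the next.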

Negative lookarounds

Lookaround assertions can be added in two ways: lookbehind and lookahead. Each of these can be a positive or a negative assertion. Syntax-wise, lookbehind has an extra < compared to the lookahead version. Negative lookarounds can be identified by the use of ! whereas = is used for positive lookarounds. This section is about negative lookarounds, whose complete syntax is shown below.

  • (?!pat) for negative lookahead assertion
  • (?<!pat) for negative lookbehind assertion

As mentioned earlier, lookarounds are not part of the matched portions and do not affect capture group numbering.

// change 'cat' only if it is not followed by a digit character
// note that the end of string satisfies the given assertion
// 'catcat' has two matches as the assertion doesn't consume characters
> 'hey cats! cat42 cat_5 catcat'.replace(/cat(?!\d)/g, 'dog')
< 'hey dogs! cat42 dog_5 dogdog'

// change 'cat' only if it is not preceded by _
// note how 'cat' at the start of string is matched as well
> 'cat _cat 42catcat'.replace(/(?<!_)cat/g, 'dog')
< 'dog _cat 42dogdog'

// overlap example
// the final _ was replaced as part of a match and it also
// played a part in the assertion for the 'cat' that follows it
> 'cats _cater 42cat_cats'.replace(/(?<!_)cat./g, 'dog')
< 'dog _cater 42dogcats'
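
To illustrate the earlier point that lookarounds do not affect capture group numbering, here is a small sketch (the names ip and op are only for illustration). The lookbehind comes first in the pattern, yet the capture group that follows it is still referred to as $1.

```javascript
const ip = 'cats _cater 42cat_cats'

// (?<!_) is an assertion, not a capture group
// so (\w) is the first capture group and is available as $1
const op = ip.replace(/(?<!_)cat(\w)/g, 'dog[$1]')

console.log(op)   // 'dog[s] _cater 42dog[_]cats'
```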

Lookarounds can be mixed with already existing anchors and other features to define truly powerful restrictions:

// change whole word only if it is not preceded by : or --
> ':cart apple --rest ;tea'.replace(/(?<!:|--)\b\w+/g, 'X')
< ':cart X --rest ;X'

// add space to word boundaries, but not at the start or end of string
// similar to: replace(/\b/g, ' ').trim()
> 'output=num1+35*42/num2'.replace(/(?<!^)\b(?!$)/g, ' ')
< 'output = num1 + 35 * 42 / num2'

In all the examples so far, lookahead grouping was placed as a suffix and lookbehind as a prefix. This is how they are used most of the time, but not the only way to use them. Lookarounds can be placed anywhere and multiple lookarounds can be combined in any order. They do not consume characters nor do they play a role in matched portions. They just let you know whether the condition you want to test is satisfied from the current location in the input string.

// these two are equivalent
// replace a character as long as it is not preceded by 'p' or 'r'
> 'spare'.replace(/(?<![pr])./g, '*')
< '**a*e'
> 'spare'.replace(/.(?<![pr].)/g, '*')
< '**a*e'

// replace 'par' as long as 's' is not present later in the input
// this assumes that the lookaround doesn't conflict with the search pattern
// i.e. 's' doesn't conflict with 'par', but 'r' would, since 'par' contains 'r'
> 'par spare part party'.replace(/par(?!.*s)/g, '[$&]')
< 'par s[par]e [par]t [par]ty'
> 'par spare part party'.replace(/(?!.*s)par/g, '[$&]')
< 'par s[par]e [par]t [par]ty'

// since the three assertions used here are all zero-width,
// all of the 6 possible combinations will be equivalent
> 'output=num1+35*42/num2'.replace(/(?!$)\b(?<!^)/g, ' ')
< 'output = num1 + 35 * 42 / num2'

See this stackoverflow Q&A for a workaround if lookbehind isn't supported.
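
One commonly used workaround is sketched below. It is an assumption made for illustration and may differ from what the Q&A suggests. The idea is to optionally capture the unwanted prefix and decide inside a replacement function whether to keep the match as is. The name without_lookbehind is hypothetical.

```javascript
// emulates 'cat _cat 42catcat'.replace(/(?<!_)cat/g, 'dog') without lookbehind
// (_)? optionally captures the unwanted prefix as part of the match
const without_lookbehind = s =>
    s.replace(/(_)?cat/g, (m, prefix) => prefix ? m : 'dog')

console.log(without_lookbehind('cat _cat 42catcat'))   // 'dog _cat 42dogdog'
```

Unlike a real lookbehind, the prefix character is consumed by the match here, so this sketch is not a drop-in replacement in every case.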

Positive lookarounds

Unlike negative lookarounds, the absence of something will not satisfy positive lookarounds. Instead, for the condition to be satisfied, the pattern has to match actual characters and/or zero-width assertions. Positive lookarounds can be identified by the use of = in the grouping. The syntax is shown below:

  • (?=pat) for positive lookahead assertion
  • (?<=pat) for positive lookbehind assertion

// extract digits only if they are followed by ,
// note that end of string doesn't qualify as this is positive assertion
> '42 apple-5, fig3; x83, y-20; f12'.match(/\d+(?=,)/g)
< ['5', '83']
// extract digits only if they are preceded by - and followed by ; or :
> '42 apple-5, fig3; x-83, y-20: f12'.match(/(?<=-)\d+(?=[;:])/g)
< ['20']

// same as: match(/\b\w/g).join('')
> 'sea eat car rat eel tea'.replace(/(?<=\b\w)\w*\W*/g, '')
< 'secret'

// replace 'par' as long as 'part' occurs as a whole word later in the line
> 'par spare part party'.replace(/par(?=.*\bpart\b)/g, '[$&]')
< '[par] s[par]e part party'
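
The difference in end of string behavior between negative and positive lookaheads can be seen with a small comparison. This is only a sketch, with s and results as illustrative names. Note that (?!\d) and (?=\D) are not interchangeable.

```javascript
const s = 'cat'

const results = [
    /cat(?!\d)/,     // negative: satisfied at the end of string, no digit follows
    /cat(?=\D)/,     // positive: needs an actual non-digit character after 'cat'
    /cat(?=\D|$)/,   // positive, with the end of string as an alternative
].map(r => r.test(s))

console.log(results)   // [true, false, true]
```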

Lookarounds can be quite handy for simple field based processing.

// except the first and last fields
> '1,two,3,four,5'.match(/(?<=,)[^,]+(?=,)/g)
< ['two', '3', 'four']

// replace empty fields with NA
// note that in this case, order of lookbehind and lookahead doesn't matter
> ',1,,,two,3,,'.replace(/(?<=^|,)(?=,|$)/g, 'NA')
< 'NA,1,NA,NA,two,3,NA,NA'
// same thing with negative lookarounds
> ',1,,,two,3,,'.replace(/(?![^,])(?<![^,])/g, 'NA')
< 'NA,1,NA,NA,two,3,NA,NA'

// there is an extra empty string match at the end of non-empty columns
> ',cat,tiger'.replace(/[^,]*/g, '{$&}')
< '{},{cat}{},{tiger}{}'
// lookarounds to the rescue
> ',cat,tiger'.replace(/(?<=^|,)[^,]*/g, '{$&}')
< '{},{cat},{tiger}'

Capture groups inside positive lookarounds

Even though lookarounds are not part of the matched portions, capture groups can be used inside positive lookarounds. Can you reason out why it won't work for negative lookarounds?

> console.log('a b c d e'.replace(/(\S+\s+)(?=(\S+)\s)/g, '$1$2\n'))
< a b
  b c
  c d
  d e
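
The exec() method gives another view of the same behavior. In this sketch (the input string and the name m are chosen only for illustration), the text tested by the lookahead is absent from the matched portion but available through the capture group.

```javascript
const m = /\d+(?=(px|em))/.exec('width: 12px')

console.log(m[0])      // '12', the unit is not part of the matched portion
console.log(m[1])      // 'px', captured inside the lookahead
console.log(m.index)   // 7
```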

AND conditional with lookarounds

As promised earlier, here are some examples that show how lookarounds make it simpler to construct AND conditionals.

> let words = ['sequoia', 'subtle', 'questionable', 'exhibit', 'equation']

// words containing 'b' and 'e' and 't' in any order
// same as: /b.*e.*t|b.*t.*e|e.*b.*t|e.*t.*b|t.*b.*e|t.*e.*b/
> words.filter(w => /(?=.*b)(?=.*e).*t/.test(w))
< ['subtle', 'questionable', 'exhibit']

// words containing all lowercase vowels in any order
> words.filter(w => /(?=.*a)(?=.*e)(?=.*i)(?=.*o).*u/.test(w))
< ['sequoia', 'questionable', 'equation']

// words containing ('ab' or 'at') and 'q' but not 'n' at the end of the element
> words.filter(w => /(?!.*n$)(?=.*a[bt]).*q/.test(w))
< ['questionable']
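
For comparison, the first example above can also be written without lookarounds by using every() and includes(), as discussed in the Conditional expressions section. This is a sketch with illustrative names (with_regexp, with_every). Which version reads better will depend on the use case.

```javascript
const words = ['sequoia', 'subtle', 'questionable', 'exhibit', 'equation']

// lookahead based AND conditional
const with_regexp = words.filter(w => /(?=.*b)(?=.*e).*t/.test(w))
// same condition using every() and includes()
const with_every = words.filter(w => ['b', 'e', 't'].every(c => w.includes(c)))

console.log(with_regexp)   // ['subtle', 'questionable', 'exhibit']
console.log(with_every)    // ['subtle', 'questionable', 'exhibit']
```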

Variable length lookbehind

In some regexp engine implementations, lookbehind doesn't work if the pattern can match a varying number of characters. For example, (?<=fig\d+) looks behind for fig followed by one or more digit characters. There is no such restriction in JavaScript. Here are some examples:

// positive lookbehind examples
> '=314not :,2irk ,:3cool =42,error'.match(/(?<=[:=]\d+)[a-z]+/g)
< ['not', 'cool']
// replace only the third occurrence of 'cat'
> 'cat scatter cater scat'.replace(/(?<=(cat.*?){2})cat/, 'X')
< 'cat scatter Xer scat'

// negative lookbehind examples
// match only if 'cat' doesn't occur before 'dog'
> /(?<!cat.*)dog/.test('fox,cat,dog,parrot')
< false
// match only if 'parrot' doesn't occur before 'dog'
> /(?<!parrot.*)dog/.test('fox,cat,dog,parrot')
< true
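
Here is one more sketch of a variable length lookbehind. The input string and the name found are made up for illustration. The digits are extracted only if they come right after a word of any length followed by the = character.

```javascript
// \w+ inside the lookbehind can match a varying number of characters
const found = 'width=42 height=7 depth:3 =5'.match(/(?<=\b\w+=)\d+/g)

console.log(found)   // ['42', '7']
```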

Negated groups

You've seen a few cases where negated character classes were more useful than positive sets. For example, in field based processing, the field contents are matched by creating a negated character set of the delimiter character. In a similar manner, there are cases where you need to negate a regexp pattern. This is made possible by using a negative lookahead and advancing one character at a time as shown below.

// match if 'go' is not there between 'at' and 'par'
> /at((?!go).)*par/.test('fox,cat,dog,parrot')
< true
// match if 'do' is not there between 'at' and 'par'
> /at((?!do).)*par/.test('fox,cat,dog,parrot')
< false

// if it gets confusing, use the 'match' method to see the matching portions
> 'fox,cat,dog,parrot'.match(/at((?!go).)*par/)[0]
< 'at,dog,par'
> 'at,baz,a2z,bad-zoo'.match(/a((?!\d).)*z/g)
< ['at,baz', 'ad-z']
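
A negated group is not the same as a negated character class made from the same characters. The following sketch contrasts the two on the input used above. The names animals, negated_group and negated_class are only for illustration.

```javascript
const animals = 'fox,cat,dog,parrot'

// negated group: the sequence 'go' must not occur between 'at' and 'par'
const negated_group = /at((?!go).)*par/.test(animals)
// negated character class: neither 'g' nor 'o' may occur in between
// so the 'dog' field prevents a match
const negated_class = /at[^go]*par/.test(animals)

console.log(negated_group)   // true
console.log(negated_class)   // false
```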

Cheatsheet and Summary

Note                Description
lookarounds         custom assertions, zero-width like anchors
(?!pat)             negative lookahead assertion
(?<!pat)            negative lookbehind assertion
(?=pat)             positive lookahead assertion
(?<=pat)            positive lookbehind assertion
                    variable length lookbehind is allowed
(?!pat1)(?=pat2)    multiple assertions can be specified next to each other in any order
                    as they mark a matching location without consuming characters
((?!pat).)*         Negates a regexp pattern

In this chapter, you learnt how to use lookarounds to create custom restrictions and also how to use negated grouping. With this, most of the powerful features of regexp have been covered. The next chapter will give a brief introduction to working with Unicode characters.

Exercises

Use lookarounds for solving the following exercises even if they are not required.

1) Replace all whole words with X unless they are preceded by a ( character.

> let ip = '(apple) guava berry) apple (mango) (grape'

// add your solution here
< '(apple) X X) X (mango) (grape'

2) Replace all whole words with X unless they are followed by a ) character.

> let ip = '(apple) guava berry) apple (mango) (grape'

// add your solution here
< '(apple) X berry) X (mango) (X'

3) Replace all whole words with X unless they are preceded by ( or followed by ) characters.

> let ip = '(apple) guava berry) apple (mango) (grape'

// add your solution here
< '(apple) X berry) X (mango) (grape'

4) Extract all whole words that do not end with e or n.

> let ip = 'a_t row on Urn e note Dust n end a2-e|u'

// add your solution here
< ['a_t', 'row', 'Dust', 'end', 'a2', 'u']

5) Extract all whole words that do not start with a or d or n.

> let ip = 'a_t row on Urn e note Dust n end a2-e|u'

// add your solution here
< ['row', 'on', 'Urn', 'e', 'Dust', 'end', 'e', 'u']

6) Extract all whole words only if they are followed by : or , or -.

> let ip = 'Poke,on=-=so_good:ink.to/is(vast)ever2-sit'

// add your solution here
< ['Poke', 'so_good', 'ever2']

7) Extract all whole words only if they are preceded by = or / or -.

> let ip = 'Poke,on=-=so_good:ink.to/is(vast)ever2-sit'

// add your solution here
< ['so_good', 'is', 'sit']

8) Extract all whole words only if they are preceded by = or : and followed by : or the . character.

> let ip = 'Poke,on=-=so_good:ink.to/is(vast)ever2-sit'

// add your solution here
< ['so_good', 'ink']

9) Extract all whole words only if they are preceded by = or : or . or ( or - and not followed by . or /.

> let ip = 'Poke,on=-=so_good:ink.to/is(vast)ever2-sit'

// add your solution here
< ['so_good', 'vast', 'sit']

10) Remove the leading and trailing whitespaces from all the individual fields where , is the field separator.

> let csv1 = ' comma  ,separated ,values \t\r '
> let csv2 = 'good bad,nice  ice  , 42 , ,   stall   small'

> const trim_whitespace =       // add your solution here

> csv1.replace(trim_whitespace, '')
< 'comma,separated,values'
> csv2.replace(trim_whitespace, '')
< 'good bad,nice  ice,42,,stall   small'

11) Filter elements that satisfy all of these rules:

  • should have at least two alphabetic characters
  • should have at least three digits
  • should have at least one special character among % or * or # or $
  • should not end with a whitespace character

> let pwds = ['hunter2', 'F2h3u%9', '*X3Yz3.14\t', 'r2_d2_42', 'A $B C1234']

// add your solution here
< ['F2h3u%9', 'A $B C1234']

12) For the given string, surround all whole words with {} except for whole words par and cat and apple.

> let ip = 'part; cat {super} rest_42 par scatter apple spar'

// add your solution here
< '{part}; cat {{super}} {rest_42} par {scatter} apple {spar}'

13) Extract the integer portion of floating-point numbers for the given string. A number ending with . and no further digits should not be considered.

> let ip = '12 ab32.4 go 5 2. 46.42 5'

// add your solution here
< ['32', '46']

14) For the given input strings, extract all overlapping two character sequences.

> let s1 = 'apple'
> let s2 = '1.2-3:4'

> const pat1 =      // add your solution here

// add your solution here for s1
< ['ap', 'pp', 'pl', 'le']
// add your solution here for s2
< ['1.', '.2', '2-', '-3', '3:', ':4']

15) The given input strings contain fields separated by the : character. Delete : and the last field if there is a digit character anywhere before the last field.

> let s1 = '42:cat'
> let s2 = 'twelve:a2b'
> let s3 = 'we:be:he:0:a:b:bother'
> let s4 = 'apple:banana-42:cherry:'
> let s5 = 'dragon:unicorn:centaur'

> const pat2 =      // add your solution here

> s1.replace(pat2, '')
< '42'
> s2.replace(pat2, '')
< 'twelve:a2b'
> s3.replace(pat2, '')
< 'we:be:he:0:a:b'
> s4.replace(pat2, '')
< 'apple:banana-42:cherry'
> s5.replace(pat2, '')
< 'dragon:unicorn:centaur'

16) Extract all whole words unless they are preceded by : or <=> or ---- or #.

> let ip = '::very--at<=>row|in.a_b#b2c=>lion----east'

// add your solution here
< ['at', 'in', 'a_b', 'lion']

17) Match strings that contain qty followed by price, but not if there is any whitespace character or the string error between them.

> let str1 = '23,qty,price,42'
> let str2 = 'qty price,oh'
> let str3 = '3.14,qty,6,errors,9,price,3'
> let str4 = '42\nqty-6,apple-56,price-234,error'
> let str5 = '4,price,3.14,qty,4'
> let str6 = '(qtyprice) (hi-there)'

> const neg =       // add your solution here

> neg.test(str1)
< true
> neg.test(str2)
< false
> neg.test(str3)
< false
> neg.test(str4)
< true
> neg.test(str5)
< false
> neg.test(str6)
< true

18) Can you reason out why the following regular expressions behave differently?

> let ip = 'I have 12, he has 2!'

> ip.replace(/\b..\b/g, '{$&}')
< '{I }have {12}{, }{he} has{ 2}!'

> ip.replace(/(?<!\w)..(?!\w)/g, '{$&}')
< 'I have {12}, {he} has {2!}'

19) Simulate string partitioning to get an array of three elements — string before the separator, portion matched by the separator and string after the separator. For the first case, split the given input string on the first occurrence of digits. For the second case, split based on the last occurrence of digits.

> let w2 = 'Sample123string42with777numbers'

// add your solution here for splitting based on the first occurrence
< ['Sample', '123', 'string42with777numbers']

// add your solution here for splitting based on the last occurrence
< ['Sample123string42with', '777', 'numbers']

20) Find the starting index of the last occurrence of is or the or was or to for the given input strings using the search() method. Assume that there will be at least one match for each input string.

> let s1 = 'match after the last newline character'
> let s2 = 'and then you want to test'
> let s3 = 'this is good bye then'
> let s4 = 'who was there to see?'

> const pat3 =      // add your solution here

> s1.search(pat3)
< 12
> s2.search(pat3)
< 18
> s3.search(pat3)
< 17
> s4.search(pat3)
< 14