Anchors

Now that you're familiar with regexp syntax and some of the methods, the next step is to learn about the special features of regular expressions. In this chapter, you'll learn how to qualify a pattern. Instead of matching anywhere in the given input string, you can specify restrictions on where a match is allowed. For now, you'll see the restrictions that are built into regular expressions. In later chapters, you'll learn how to define custom rules.

These restrictions are made possible by assigning special meaning to certain characters and escape sequences. The characters with special meaning are known as metacharacters in regexp parlance. In case you need to match those characters literally, you need to escape them with a \ character (discussed in the Escaping with backslash section).

String anchors

This restriction is about qualifying a regexp to match only at the start or end of an input string. These provide functionality similar to the string methods start_with? and end_with?. There are three different escape sequences related to string level regexp anchors. First up is \A which restricts the matching to the start of string.

# \A is placed as a prefix to the search term
>> 'cater'.match?(/\Acat/)
=> true
>> 'concatenation'.match?(/\Acat/)
=> false

>> "hi hello\ntop spot".match?(/\Ahi/)
=> true
>> "hi hello\ntop spot".match?(/\Atop/)
=> false

To restrict the match to the end of string, \z (lowercase z) is used.

# \z is placed as a suffix to the search term
>> 'spare'.match?(/are\z/)
=> true
>> 'nearest'.match?(/are\z/)
=> false

>> words = %w[surrender unicorn newer door empty eel pest]
>> words.grep(/er\z/)
=> ["surrender", "newer"]
>> words.grep(/t\z/)
=> ["pest"]
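As a rough side-by-side of the comparison made earlier, here's a small sketch that runs the anchored regexps next to start_with? and end_with?. The variable names are only for illustration.

```ruby
# Anchored regexps next to the equivalent string methods.
word = 'cater'

starts_re = word.match?(/\Acat/)       # regexp anchored at the start of string
starts_str = word.start_with?('cat')   # plain string method
p [starts_re, starts_str]              # => [true, true]

ends_re = word.match?(/ter\z/)         # regexp anchored at the end of string
ends_str = word.end_with?('ter')
p [ends_re, ends_str]                  # => [true, true]

# A search term that occurs only in the middle fails both kinds of check.
mid_re = 'concatenation'.match?(/\Acat/)
mid_str = 'concatenation'.start_with?('cat')
p [mid_re, mid_str]                    # => [false, false]
```

The string methods are simpler for fixed text. The anchored regexp becomes useful once the search term itself needs regexp features.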

There is another end of string anchor \Z (uppercase). It is similar to \z but if newline is the last character, then \Z allows matching just before the newline character.

# same result for both \z and \Z
# as there is no newline character at the end of string
>> "spare\ndare".sub(/are\z/, 'X')
=> "spare\ndX"
>> "spare\ndare".sub(/are\Z/, 'X')
=> "spare\ndX"

# different results as there is a newline character at the end of string
>> "spare\ndare\n".sub(/are\z/, 'X')
=> "spare\ndare\n"
>> "spare\ndare\n".sub(/are\Z/, 'X')
=> "spare\ndX\n"
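One situation where the difference tends to matter is input that still carries its trailing newline, such as a line read from a file. The sketch below assumes a hypothetical line value for illustration.

```ruby
# Hypothetical input, as if read from a file with the newline kept.
line = "spare\n"

strict = line.match?(/are\z/)    # \z needs 'are' to be the very last thing
lenient = line.match?(/are\Z/)   # \Z tolerates a single trailing newline
p [strict, lenient]              # => [false, true]

# Only one newline at the very end is tolerated by \Z.
two_newlines = "spare\n\n".match?(/are\Z/)
p two_newlines                   # => false

# Removing the newline first makes \z sufficient.
chomped = line.chomp.match?(/are\z/)
p chomped                        # => true
```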

Combining both the start and end string anchors, you can restrict the matching to the whole string. This is similar to comparing strings using the == operator.

>> 'cat'.match?(/\Acat\z/)
=> true
>> 'cater'.match?(/\Acat\z/)
=> false
>> 'concatenation'.match?(/\Acat\z/)
=> false
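To see the similarity with == in action, here's a minimal sketch that filters the same hypothetical list both ways.

```ruby
candidates = %w[cat cater concatenation]

# Whole-string regexp match versus plain equality.
by_regexp = candidates.grep(/\Acat\z/)
by_equality = candidates.select { |s| s == 'cat' }

p by_regexp                  # => ["cat"]
p by_regexp == by_equality   # => true
```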

The anchors can be used by themselves as a pattern. This helps to insert text at the start or end of a string, emulating string concatenation operations. These might not feel like useful capabilities on their own, but combined with other regexp features they become quite a handy tool.

>> 'live'.sub(/\A/, 're')
=> "relive"
>> 'send'.sub(/\A/, 're')
=> "resend"

>> 'cat'.sub(/\z/, 'er')
=> "cater"
>> 'hack'.sub(/\z/, 'er')
=> "hacker"
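The sketch below checks that these substitutions give the same result as plain concatenation. It also shows that gsub inserts only once here, since \A and \z each match at a single position.

```ruby
prefixed = 'live'.sub(/\A/, 're')
suffixed = 'cat'.sub(/\z/, 'er')

# Same result as string concatenation.
p prefixed == 're' + 'live'   # => true
p suffixed == 'cat' + 'er'    # => true

# \A and \z match at exactly one position each, so gsub behaves like sub.
prefixed_all = 'live'.gsub(/\A/, 're')
suffixed_all = 'cat'.gsub(/\z/, 'er')
p [prefixed_all, suffixed_all]   # => ["relive", "cater"]
```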

Line anchors

A string input may contain single or multiple lines. The newline character \n is considered as the line separator. There are two line anchors, ^ metacharacter for matching the start of line and $ for matching the end of line. If there are no newline characters in the input string, these will behave exactly the same as \A and \z respectively.

>> pets = 'cat and dog'

>> pets.match?(/^cat/)
=> true
>> pets.match?(/^dog/)
=> false

>> pets.match?(/dog$/)
=> true
>> pets.match?(/^dog$/)
=> false

Here are some multiline examples to distinguish line anchors from string anchors.

# check if any line in the string starts with 'top'
>> "hi hello\ntop spot".match?(/^top/)
=> true

# check if any line in the string ends with 'er'
>> "spare\npar\nera\ndare".match?(/er$/)
=> false

# filter lines ending with 'are'
>> "spare\npar\ndare".each_line.grep(/are$/)
=> ["spare\n", "dare"]

# check if any whole line in the string is 'par'
>> "spare\npar\ndare".match?(/^par$/)
=> true
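The distinction matters when you want to validate an entire input rather than one of its lines. Here's a small sketch with a made-up input containing an embedded newline.

```ruby
# Hypothetical user input with an embedded newline.
input = "cat\nrm -rf"

line_check = input.match?(/^cat$/)       # true if any line is exactly 'cat'
string_check = input.match?(/\Acat\z/)   # true only if the whole string is 'cat'

p [line_check, string_check]             # => [true, false]
```

If the intention is to accept only the exact string, the string anchors are the safer choice.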

Just like string anchors, you can use the line anchors by themselves as a pattern. gsub and puts are used here to better illustrate the transformation. The gsub method returns an Enumerator if you neither specify a replacement string nor pass a block. That paves the way for using all those wonderful Enumerator and Enumerable methods.

>> str = "catapults\nconcatenate\ncat"

>> puts str.gsub(/^/, '1: ')
1: catapults
1: concatenate
1: cat
>> puts str.gsub(/^/).with_index(1) { "#{_2}: " }
1: catapults
2: concatenate
3: cat

>> puts str.gsub(/$/, '.')
catapults.
concatenate.
cat.
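For comparison, the line numbering can also be done without a regexp. The following sketch is one possible alternative, using each_line, and checks that both approaches agree. The block parameters are named explicitly here instead of using _2.

```ruby
str = "catapults\nconcatenate\ncat"

# gsub without a replacement returns an Enumerator; with_index(1) passes
# the matched text and a counter starting at 1 to the block.
via_gsub = str.gsub(/^/).with_index(1) { |_match, idx| "#{idx}: " }

# The same numbering without a regexp: walk the lines and prefix each one.
via_lines = str.each_line.with_index(1).map { |line, idx| "#{idx}: #{line}" }.join

puts via_gsub
puts via_gsub == via_lines   # prints true
```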

If there is a newline character at the end of the input string, there is an additional end of line match but no additional start of line match.

>> puts "1\n2\n".gsub(/^/, 'fig ')
fig 1
fig 2
>> puts "1\n\n".gsub(/^/, 'fig ')
fig 1
fig 

# note the number of lines in the output
>> puts "1\n2\n".gsub(/$/, ' banana')
1 banana
2 banana
 banana
>> puts "1\n\n".gsub(/$/, ' banana')
1 banana
 banana
 banana

Warning: If you are dealing with Windows-based text files, you'll have to convert the \r\n line endings to \n first. Many Ruby methods make this easy. For example, you can specify the line ending to use with the File.open method, and the split string method splits on any whitespace by default. Alternatively, you can handle \r as an optional character with quantifiers (see the Greedy quantifiers section for examples).
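Here's a minimal sketch of why the conversion is needed, using a made-up string with \r\n line endings. Replacing the line endings with gsub and stripping them with chomp are just two of several possible approaches.

```ruby
# Hypothetical text with Windows-style line endings.
dos_text = "spare\r\ndare\r\n"

# The \r sits between 'are' and \n, so the end of line anchor doesn't help.
before = dos_text.match?(/are$/)
p before   # => false

# Option 1: convert the line endings first.
unix_text = dos_text.gsub("\r\n", "\n")
after = unix_text.match?(/are$/)
p after    # => true

# Option 2: split into lines and remove the line endings with chomp.
hits = dos_text.lines.map(&:chomp).grep(/are\z/)
p hits     # => ["spare", "dare"]
```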

Word anchors

The third type of restriction is word anchors. Letters (irrespective of case), digits and the underscore character qualify as word characters. You might wonder why digits and underscores are included as well, why not just letters? This comes from variable and function naming conventions, where letters, digits and underscores are typically allowed. So, the definition is more oriented to programming languages than natural ones.

The escape sequence \b denotes a word boundary. This works for both the start and end of word anchoring. Start of word means either the character prior to the word is a non-word character or there is no character (start of string). Similarly, end of word means the character after the word is a non-word character or no character (end of string). This implies that you cannot have word boundaries without a word character.

>> words = 'par spar apparent spare part'

# replace 'par' irrespective of where it occurs
>> words.gsub(/par/, 'X')
=> "X sX apXent sXe Xt"

# replace 'par' only at the start of word
>> words.gsub(/\bpar/, 'X')
=> "X spar apparent spare Xt"

# replace 'par' only at the end of word
>> words.gsub(/par\b/, 'X')
=> "X sX apparent spare part"

# replace 'par' only if it is not part of another word
>> words.gsub(/\bpar\b/, 'X')
=> "X spar apparent spare part"
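To make the definition of word characters more concrete, this sketch marks every word boundary with a | character. The sample strings are made up for illustration and use ASCII characters only.

```ruby
# Underscores and digits are word characters, so 'hi_there' and '2x'
# are each treated as one word, whereas '-' splits 'a-b' into two words.
marked = 'hi_there 2x a-b'.gsub(/\b/, '|')
p marked     # => "|hi_there| |2x| |a|-|b|"

# No word characters means no word boundaries at all.
no_words = '--- ...'.gsub(/\b/, '|')
p no_words   # => "--- ..."
```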

Using word boundary as a pattern by itself can yield creative solutions:

# space separated words to double quoted csv
# note the use of the 'tr' string method
>> words = 'par spar apparent spare part'
>> puts words.gsub(/\b/, '"').tr(' ', ',')
"par","spar","apparent","spare","part"

>> '-----hello-----'.gsub(/\b/, ' ')
=> "----- hello -----"

# make a programming statement more readable
# shown for illustration purpose only, won't work for all cases
>> 'output=num1+35*42/num2'.gsub(/\b/, ' ')
=> " output = num1 + 35 * 42 / num2 "
# excess space at the start/end of string can be stripped off
# later you'll learn how to add a qualifier so that strip is not needed
>> 'output=num1+35*42/num2'.gsub(/\b/, ' ').strip
=> "output = num1 + 35 * 42 / num2"

Opposite word anchors

The word boundary has an opposite anchor too. \B matches wherever \b doesn't match. This duality will be seen with some other escape sequences too. Negative logic is handy in many text processing situations. But use it with care, as you might end up matching things you didn't intend!

>> words = 'par spar apparent spare part'

# replace 'par' if it is not at the start of word
>> words.gsub(/\Bpar/, 'X')
=> "par sX apXent sXe part"
# replace 'par' at the end of word but not the whole word 'par'
>> words.gsub(/\Bpar\b/, 'X')
=> "par sX apparent spare part"
# replace 'par' if it is not at the end of word
>> words.gsub(/par\B/, 'X')
=> "par spar apXent sXe Xt"
# replace 'par' if it is surrounded by word characters
>> words.gsub(/\Bpar\B/, 'X')
=> "par spar apXent sXe part"

Here are some examples of standalone pattern usage to compare and contrast the two word anchors.

>> 'copper'.gsub(/\b/, ':')
=> ":copper:"
>> 'copper'.gsub(/\B/, ':')
=> "c:o:p:p:e:r"

>> '-----hello-----'.gsub(/\b/, ' ')
=> "----- hello -----"
>> '-----hello-----'.gsub(/\B/, ' ')
=> " - - - - -h e l l o- - - - - "
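Since \B matches wherever \b doesn't, every position in a string (including the start and the end) is matched by exactly one of them. The sketch below counts the matches with the scan method to check this for one of the strings above.

```ruby
str = '-----hello-----'

boundaries = str.scan(/\b/).length       # positions where \b matches
non_boundaries = str.scan(/\B/).length   # positions where \B matches
positions = str.length + 1               # gaps between characters plus both ends

p [boundaries, non_boundaries, positions]       # => [2, 14, 16]
p boundaries + non_boundaries == positions      # => true
```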

Cheatsheet and Summary

| Note | Description |
| ---- | ----------- |
| \A | restricts the match to the start of string |
| \z | restricts the match to the end of string |
| \Z | restricts the match to end or just before a newline at the end of string |
| \n | line separator; DOS-style files need special attention |
| metacharacter | characters with special meaning in regexp |
| ^ | restricts the match to the start of line |
| $ | restricts the match to the end of line |
| \b | restricts the match to the start and end of words; word characters: letters, digits, underscore |
| \B | matches wherever \b doesn't match |

In this chapter, you've begun to see the building blocks of regular expressions and how they can be used in interesting ways. At the same time, regular expressions are just another tool for text processing problems. Often, you'll get a simpler solution by combining regular expressions with other string and Enumerable methods. Practice, experience and imagination will help you construct creative solutions. In the coming chapters, you'll see more applications of anchors as well as the \G anchor, which is best understood in combination with other regexp features.

Exercises

1) Check if the given strings start with be.

>> line1 = 'be nice'
>> line2 = '"best!"'
>> line3 = 'better?'
>> line4 = "oh no\nbear spotted"

>> pat =        ##### add your solution here

>> pat.match?(line1)
=> true
>> pat.match?(line2)
=> false
>> pat.match?(line3)
=> true
>> pat.match?(line4)
=> false

2) For the given input string, change only the whole word red to brown.

>> words = 'bred red spread credible red.'

>> words.gsub()     ##### add your solution here
=> "bred brown spread credible brown."

3) For the given input array, filter elements that contain 42 surrounded by word characters.

>> items = ['hi42bye', 'nice1423', 'bad42', 'cool_42a', '42fake', '_42_']

>> items.grep()     ##### add your solution here
=> ["hi42bye", "nice1423", "cool_42a", "_42_"]

4) For the given input array, filter elements that start with den or end with ly.

>> items = ['lovely', "1\ndentist", '2 lonely', 'eden', "fly\n", 'dent']

>> items.filter { }     ##### add your solution here
=> ["lovely", "2 lonely", "dent"]

5) For the given input string, change whole word mall to 1234 only if it is at the start of a line.

'> para = %q{(mall) call ball pall
'> ball fall wall tall
'> mall call ball pall
'> wall mall ball fall
'> mallet wallet malls
>> mall:call:ball:pall}

>> puts para.gsub()     ##### add your solution here
(mall) call ball pall
ball fall wall tall
1234 call ball pall
wall mall ball fall
mallet wallet malls
1234:call:ball:pall

6) For the given array, filter elements having a line starting with den or ending with ly.

>> items = ['lovely', "1\ndentist", '2 lonely', 'eden', "fly\nfar", 'dent']

>> items.filter { }     ##### add your solution here
=> ["lovely", "1\ndentist", "2 lonely", "fly\nfar", "dent"]

7) For the given input array, filter all whole elements 12\nthree irrespective of case.

>> items = ["12\nthree\n", "12\nThree", "12\nthree\n4", "12\nthree"]

>> items.grep()     ##### add your solution here
=> ["12\nThree", "12\nthree"]

8) For the given input array, replace hand with X for all elements that start with hand followed by at least one word character.

>> items = %w[handed hand handy unhanded handle hand-2]

>> items.map { }        ##### add your solution here
=> ["Xed", "hand", "Xy", "unhanded", "Xle", "hand-2"]

9) For the given input array, filter all elements starting with h. Additionally, replace e with X for these filtered elements.

>> items = %w[handed hand handy unhanded handle hand-2]

>> items.filter_map { }     ##### add your solution here
=> ["handXd", "hand", "handy", "handlX", "hand-2"]