Regular Expressions

Regular expressions are a versatile tool for text processing. They help you to precisely define matching criteria. For learning and understanding purposes, you can view regular expressions as a mini programming language in itself, specialized for text processing. Parts of a regular expression can be saved for future use, analogous to variables and functions. There are ways to perform AND, OR and NOT conditionals, features to concisely define repetition to avoid manual replication, and so on.

Here are some common use cases.

  • Sanitizing a string to ensure that it satisfies a known set of rules. For example, to check if a given string matches password rules.
  • Filtering or extracting portions at an abstract level, such as letters, numbers, punctuation and so on.
  • Qualified string replacement. For example, at the start or the end of a string, only whole words, based on surrounding text, etc.
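
As a rough sketch of the first use case, the filter below checks each input line against a made-up set of password rules. The rules themselves (minimum length of 8, at least one digit, one lowercase letter and one uppercase letter) and the variable name valid are assumptions for illustration only. The character class syntax used here is explained later in this chapter.

```shell
# hypothetical password rules, chosen only for illustration:
# at least 8 characters, with a digit, a lowercase and an uppercase letter
valid=$(printf 'secret\nPassw0rdOK\nalllowercase1\nABC123xyz\n' |
    awk 'length($0) >= 8 && /[0-9]/ && /[a-z]/ && /[A-Z]/')

# prints the two lines that satisfy every rule:
# Passw0rdOK
# ABC123xyz
printf '%s\n' "$valid"
```

Each rule is a separate condition joined with &&, which tends to be easier to read than one large regexp.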

This chapter will cover regular expressions as implemented in awk. Most of awk's regular expression syntax is similar to Extended Regular Expression (ERE) found with grep -E and sed -E. Unless otherwise indicated, examples and descriptions will assume ASCII input.

info See also POSIX specification for regular expressions. And unix.stackexchange: Why does my regular expression work in X but not in Y?

Syntax and variable assignment

As seen in the previous chapter, the syntax string ~ /regexp/ checks whether the given string satisfies the rules specified by the regexp, and string !~ /regexp/ inverts the condition. If the string isn't specified, $0 is checked by default. You can also save a regexp literal in a variable by prefixing it with the @ symbol. The prefix is needed because /regexp/ by itself would mean $0 ~ /regexp/.

$ printf 'spared no one\ngrasped\nspar\n' | awk '/ed/'
spared no one
grasped

$ printf 'spared no one\ngrasped\nspar\n' | awk 'BEGIN{r = @/ed/} $0 ~ r'
spared no one
grasped
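
The inverted form isn't shown in the examples above, so here is a minimal sketch of it using the same sample input. The variable names no_ed and no_ed_alt are placeholders for this illustration.

```shell
# lines that do NOT contain 'ed', using the !~ operator explicitly
no_ed=$(printf 'spared no one\ngrasped\nspar\n' | awk '$0 !~ /ed/')

# same filter, with ! applied to the default match against $0
no_ed_alt=$(printf 'spared no one\ngrasped\nspar\n' | awk '!/ed/')

# both commands print: spar
printf '%s\n' "$no_ed" "$no_ed_alt"
```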

String Anchors

In the examples seen so far, the regexp was a simple string value without any special characters. Also, the regexp pattern evaluated to true if it was found anywhere in the string. Instead of matching anywhere in the string, restrictions can be specified. These restrictions are made possible by assigning special meaning to certain characters and escape sequences. The characters with special meaning are known as metacharacters in regular expressions parlance. In case you need to match those characters literally, you need to escape them with a \ (discussed in Matching the metacharacters section).

There are two string anchors:

  • ^ metacharacter restricts the matching to the start of string
  • $ metacharacter restricts the matching to the end of string

$ # string starting with 'sp'
$ printf 'spared no one\ngrasped\nspar\n' | awk '/^sp/'
spared no one
spar

$ # string ending with 'ar'
$ printf 'spared no one\ngrasped\nspar\n' | awk '/ar$/'
spar

$ # change only whole string 'spar'
$ # can also use: awk '/^spar$/{$0 = 123} 1' or awk '$0=="spar"{$0 = 123} 1'
$ printf 'spared no one\ngrasped\nspar\n' | awk '{sub(/^spar$/, "123")} 1'
spared no one
grasped
123

The anchors can be used by themselves as a pattern. This helps to insert text at the start or end of a string, emulating string concatenation operations. This might not seem like a useful capability, but combined with other features it becomes quite a handy tool.

$ printf 'spared no one\ngrasped\nspar\n' | awk '{gsub(/^/, "* ")} 1'
* spared no one
* grasped
* spar

$ # append only if string doesn't contain space characters
$ printf 'spared no one\ngrasped\nspar\n' | awk '!/ /{gsub(/$/, ".")} 1'
spared no one
grasped.
spar.

info See also Behavior of ^ and $ when string contains newline section.

Word Anchors

The second type of restriction is word anchors. A word character is any letter (irrespective of case), digit or the underscore character. You might wonder why digits and underscores are included, why not only letters? This comes from variable and function naming conventions, which typically allow letters, digits and underscores. So, the definition is more programming oriented than natural language.

Use \< to indicate the start of word anchor and \> to indicate the end of word anchor. Alternatively, you can use \y to indicate both the start of word and end of word anchors.

info Typically \b is used to represent word anchor (for example, in grep, sed, perl, etc), but in awk the escape sequence \b refers to the backspace character.

$ cat word_anchors.txt
sub par
spar
apparent effort
two spare computers
cart part tart mart

$ # words starting with 'par'
$ awk '/\<par/' word_anchors.txt
sub par
cart part tart mart

$ # words ending with 'par'
$ awk '/par\>/' word_anchors.txt
sub par
spar

$ # only whole word 'par'
$ # note that only lines where substitution succeeded will be printed
$ # as return value of sub/gsub is number of substitutions made
$ awk 'gsub(/\<par\>/, "***")' word_anchors.txt
sub ***

warning See also Word boundary differences section.

\y has an opposite too. \B matches locations other than those places where the word anchor would match.

$ # match 'par' if it is surrounded by word characters
$ awk '/\Bpar\B/' word_anchors.txt
apparent effort
two spare computers

$ # match 'par' but not as start of word
$ awk '/\Bpar/' word_anchors.txt
spar
apparent effort
two spare computers

$ # match 'par' but not as end of word
$ awk '/par\B/' word_anchors.txt
apparent effort
two spare computers
cart part tart mart

Here's an example for using word boundaries by themselves as a pattern. It also neatly shows the opposite functionality of \y and \B.

$ echo 'copper' | awk '{gsub(/\y/, ":")} 1'
:copper:
$ echo 'copper' | awk '{gsub(/\B/, ":")} 1'
c:o:p:p:e:r

warning Negative logic is handy in many text processing situations. But use it with care, as you might end up matching things you didn't intend.

Combining conditions

Before seeing the next regexp feature, it is good to note that sometimes using logical operators is easier to read and maintain compared to doing everything with regexp.

$ # string starting with 'b' but not containing 'at'
$ awk '/^b/ && !/at/' table.txt
blue cake mug shirt -7

$ # if the first field contains 'low' or the last field is less than 0
$ awk '$1 ~ /low/ || $NF<0' table.txt
blue cake mug shirt -7
yellow banana window shoes 3.14

Alternation

Often, you'd want to search for multiple terms. In a conditional expression, you can use the logical operators to combine multiple conditions. With regular expressions, the | metacharacter is similar to logical OR. The regular expression will match if any of the expressions separated by | is satisfied. Each alternative can have its own independent anchors as well.

Alternation is similar to using || operator between two regexps. Having a single regexp helps to write terser code and || cannot be used when substitution is required.

$ # match whole word 'par' or string ending with 's'
$ # same as: awk '/\<par\>/ || /s$/'
$ awk '/\<par\>|s$/' word_anchors.txt
sub par
two spare computers

$ # replace 'cat' or 'dog' or 'fox' with '--'
$ echo 'cats dog bee parrot foxed' | awk '{gsub(/cat|dog|fox/, "--")} 1'
--s -- bee parrot --ed

There are some tricky situations when using alternation. If it is used for filtering a line, there is no ambiguity. However, for use cases like substitution, it depends on a few factors. Say, you want to replace are or spared. Which one should get precedence? The bigger word spared, the substring are inside it, or something else?

The alternative which matches earliest in the input gets precedence.

$ # note that 'sub' is used here, so only first match gets replaced
$ echo 'cats dog bee parrot foxed' | awk '{sub(/bee|parrot|at/, "--")} 1'
c--s dog bee parrot foxed
$ echo 'cats dog bee parrot foxed' | awk '{sub(/parrot|at|bee/, "--")} 1'
c--s dog bee parrot foxed

In case of matches starting from the same location, for example spar and spared, the longest matching portion gets precedence. Unlike some other regular expression implementations (perl, for example), left-to-right priority for alternation comes into play only if the lengths of the matches are the same. See Longest match wins and Backreferences sections for more examples.

$ echo 'spared party parent' | awk '{sub(/spa|spared/, "**")} 1'
** party parent
$ echo 'spared party parent' | awk '{sub(/spared|spa/, "**")} 1'
** party parent

$ # other implementations like 'perl' have left-to-right priority
$ echo 'spared party parent' | perl -pe 's/spa|spared/**/'
**red party parent

Grouping

Often, there are some common things among the regular expression alternatives. It could be common characters or qualifiers like the anchors. In such cases, you can group them using a pair of parentheses metacharacters. Similar to a(b+c)d = abd+acd in maths, you get a(b|c)d = abd|acd in regular expressions.

$ # without grouping
$ printf 'red\nreform\nread\narrest\n' | awk '/reform|rest/'
reform
arrest
$ # with grouping
$ printf 'red\nreform\nread\narrest\n' | awk '/re(form|st)/'
reform
arrest

$ # without grouping
$ printf 'sub par\nspare\npart time\n' | awk '/\<par\>|\<part\>/'
sub par
part time
$ # taking out common anchors
$ printf 'sub par\nspare\npart time\n' | awk '/\<(par|part)\>/'
sub par
part time
$ # taking out common characters as well
$ # you'll later learn a better technique instead of using empty alternate
$ printf 'sub par\nspare\npart time\n' | awk '/\<par(|t)\>/'
sub par
part time

Matching the metacharacters

You have seen a few metacharacters and escape sequences that help to compose a regular expression. To match the metacharacters literally, i.e. to remove their special meaning, prefix those characters with a \ character. To indicate a literal \ character, use \\.

Unlike grep and sed, the string anchors always have to be escaped to match them literally, as there is no BRE mode in awk. They do not lose their special meaning when not used in their customary positions.

$ # awk '/b^2/' will not work even though ^ isn't being used as anchor
$ # b^2 will work for both grep and sed if you use BRE syntax
$ echo 'a^2 + b^2 - C*3' | awk '/b\^2/'
a^2 + b^2 - C*3

$ # note that ')' doesn't need to be escaped
$ echo '(a*b) + c' | awk '{gsub(/\(|)/, "")} 1'
a*b + c

$ echo '\learn\by\example' | awk '{gsub(/\\/, "/")} 1'
/learn/by/example
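
The same escaping rule applies to the dot and quantifier metacharacters covered later in this chapter. Here is a small sketch contrasting escaped and unescaped versions; the sample string and the variable names loose and strict are made up for this illustration.

```shell
# unescaped: . matches any character, so every three-character item matches
loose=$(echo 'a.b a*b axb aab' | awk '{gsub(/a.b/, "X")} 1')

# escaped: \. and \* match only the literal characters
strict=$(echo 'a.b a*b axb aab' | awk '{gsub(/a\.b/, "X"); gsub(/a\*b/, "Y")} 1')

# prints:
# X X X X
# X Y axb aab
printf '%s\n' "$loose" "$strict"
```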

info The Backreferences section will discuss how to handle metacharacters in the replacement section.

Using string literal as regexp

The first argument to the sub and gsub functions can be a string as well; awk will handle converting it to a regexp. This has a few advantages. For example, if you have many / characters in the search pattern, it might be easier to use a string instead of a regexp literal.

$ p='/home/learnbyexample/reports'
$ echo "$p" | awk '{sub(/\/home\/learnbyexample\//, "~/")} 1'
~/reports
$ echo "$p" | awk '{sub("/home/learnbyexample/", "~/")} 1'
~/reports

$ # example with line matching instead of substitution
$ printf '/foo/bar/1\n/foo/baz/1\n' | awk '/\/foo\/bar\//'
/foo/bar/1
$ printf '/foo/bar/1\n/foo/baz/1\n' | awk '$0 ~ "/foo/bar/"'
/foo/bar/1

In the above examples, the string literal was supplied directly. But any other expression or variable can be used as well, examples for which will be shown later in this chapter. The reason why string isn't always used as the first argument is that the special meaning for \ character will clash. For example:

$ awk 'gsub("\<par\>", "X")' word_anchors.txt
awk: cmd. line:1: warning: escape sequence `\<' treated as plain `<'
awk: cmd. line:1: warning: escape sequence `\>' treated as plain `>'

$ # you'll need \\ to represent \
$ awk 'gsub("\\<par\\>", "X")' word_anchors.txt
sub X
$ # much more readable with regexp literal
$ awk 'gsub(/\<par\>/, "X")' word_anchors.txt
sub X

$ # another example
$ echo '\learn\by\example' | awk '{gsub("\\\\", "/")} 1'
/learn/by/example
$ echo '\learn\by\example' | awk '{gsub(/\\/, "/")} 1'
/learn/by/example

info See gawk manual: Gory details for more information than you'd want.

The dot metacharacter

The dot metacharacter serves as a placeholder to match any character (including the newline character). Later you'll learn how to define your own custom placeholder for limited set of characters.

$ # 3 character sequence starting with 'c' and ending with 't'
$ echo 'tac tin cot abc:tyz excited' | awk '{gsub(/c.t/, "-")} 1'
ta-in - ab-yz ex-ed

$ # any character followed by 3 and again any character
$ printf '4\t35x\n' | awk '{gsub(/.3./, "")} 1'
4x

$ # 'c' followed by any character followed by 'x'
$ awk 'BEGIN{s="abc\nxyz"; sub(/c.x/, " ", s); print s}'
ab yz

Quantifiers

As an analogy, alternation provides logical OR. Combining the dot metacharacter . and quantifiers (and alternation if needed) paves a way to perform logical AND. For example, to check if a string matches two patterns with any number of characters in between. Quantifiers can be applied to both characters and groupings. Apart from ability to specify exact quantity and bounded range, these can also match unbounded varying quantities.

First up, the ? metacharacter which quantifies a character or group to match 0 or 1 times. This helps to define optional patterns and build terser patterns compared to groupings for some cases.

$ # same as: awk '{gsub(/\<(fe.d|fed)\>/, "X")} 1'
$ echo 'fed fold fe:d feeder' | awk '{gsub(/\<fe.?d\>/, "X")} 1'
X fold X feeder

$ # same as: awk '/\<par(|t)\>/'
$ printf 'sub par\nspare\npart time\n' | awk '/\<part?\>/'
sub par
part time

$ # same as: awk '{gsub(/part|parrot/, "X")} 1'
$ echo 'par part parrot parent' | awk '{gsub(/par(ro)?t/, "X")} 1'
par X X parent
$ # same as: awk '{gsub(/part|parrot|parent/, "X")} 1'
$ echo 'par part parrot parent' | awk '{gsub(/par(en|ro)?t/, "X")} 1'
par X X X

$ # both '<' and '\<' are replaced with '\<'
$ echo 'blah \< foo bar < blah baz <' | awk '{gsub(/\\?</, "\\<")} 1'
blah \< foo bar \< blah baz \<

The * metacharacter quantifies a character or group to match 0 or more times. There is no upper bound; more details are discussed in the Longest match wins section.

$ # 'f' followed by zero or more of 'e' followed by 'd'
$ echo 'fd fed fod fe:d feeeeder' | awk '{gsub(/fe*d/, "X")} 1'
X X fod fe:d Xer

$ # zero or more of '1' followed by '2'
$ echo '3111111111125111142' | awk '{gsub(/1*2/, "-")} 1'
3-511114-

The + metacharacter quantifies a character or group to match 1 or more times. Similar to * quantifier, there is no upper bound.

$ # 'f' followed by one or more of 'e' followed by 'd'
$ echo 'fd fed fod fe:d feeeeder' | awk '{gsub(/fe+d/, "X")} 1'
fd X fod fe:d Xer

$ # 'f' followed by at least one of 'e' or 'o' or ':' followed by 'd'
$ echo 'fd fed fod fe:d feeeeder' | awk '{gsub(/f(e|o|:)+d/, "X")} 1'
fd X X X Xer

$ # one or more of '1' followed by optional '4' and then '2'
$ echo '3111111111125111142' | awk '{gsub(/1+4?2/, "-")} 1'
3-5-

You can specify a range of integer numbers, both bounded and unbounded, using {} metacharacters. There are four ways to use this quantifier as listed below:

Pattern    Description
{m,n}      match m to n times
{m,}       match at least m times
{,n}       match up to n times (including 0 times)
{n}        match exactly n times

$ # note that inside {} space is not allowed
$ echo 'ac abc abbc abbbc abbbbbbbbc' | awk '{gsub(/ab{1,4}c/, "X")} 1'
ac X X X abbbbbbbbc

$ echo 'ac abc abbc abbbc abbbbbbbbc' | awk '{gsub(/ab{3,}c/, "X")} 1'
ac abc abbc X X

$ echo 'ac abc abbc abbbc abbbbbbbbc' | awk '{gsub(/ab{,2}c/, "X")} 1'
X X X abbbc abbbbbbbbc

$ echo 'ac abc abbc abbbc abbbbbbbbc' | awk '{gsub(/ab{3}c/, "X")} 1'
ac abc abbc X abbbbbbbbc

info The {} metacharacters have to be escaped to match them literally. Similar to () metacharacters, escaping { alone is enough.
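
As a quick illustration of the note above, the sketch below matches a literal a{5} by escaping only the opening brace. The sample input and the variable name braces are assumptions for this example.

```shell
# \{ matches a literal { and the unescaped } is then treated literally as well
braces=$(echo 'a{5} = 10' | awk '{sub(/a\{5}/, "x")} 1')

# prints: x = 10
printf '%s\n' "$braces"
```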

Next up, how to construct conditional AND using dot metacharacter and quantifiers.

$ # match 'Error' followed by zero or more characters followed by 'valid'
$ echo 'Error: not a valid input' | awk '/Error.*valid/'
Error: not a valid input

To allow matching in any order, you'll have to bring in alternation as well. But the number of orderings grows quickly (3 patterns already need 6 alternatives), so this becomes hard to write and maintain beyond a couple of patterns.

$ # 'cat' followed by 'dog' or 'dog' followed by 'cat'
$ echo 'two cats and a dog' | awk '{gsub(/cat.*dog|dog.*cat/, "pets")} 1'
two pets
$ echo 'two dogs and a cat' | awk '{gsub(/cat.*dog|dog.*cat/, "pets")} 1'
two pets
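
When you only need to filter lines (no substitution), the logical operators from the Combining conditions section avoid having to list every ordering. The following is a minimal sketch with three terms; the sample lines and the variable name all_three are made up for this illustration.

```shell
# lines containing 'cat', 'dog' and 'bee' in any order
all_three=$(printf 'cat dog bee\nbee and dog\ndog bee cat\ncats only\n' |
    awk '/cat/ && /dog/ && /bee/')

# prints:
# cat dog bee
# dog bee cat
printf '%s\n' "$all_three"
```

A single regexp for the same check would need six alternatives, one for each ordering of the three terms.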

Longest match wins

You've already seen an example with alternation, where the longest matching portion was chosen if two alternatives started from same location. For example spar|spared will result in spared being chosen over spar. The same applies whenever there are two or more matching possibilities from same starting location. For example, f.?o will match foo instead of fo if the input string to match is foot.

$ # longest match among 'foo' and 'fo' wins here
$ echo 'foot' | awk '{sub(/f.?o/, "X")} 1'
Xt
$ # everything will match here
$ echo 'car bat cod map scat dot abacus' | awk '{sub(/.*/, "X")} 1'
X

$ # longest match happens when (1|2|3)+ matches up to '1233' only
$ # so that '12baz' can match as well
$ echo 'foo123312baz' | awk '{sub(/o(1|2|3)+(12baz)?/, "X")} 1'
foX
$ # in other implementations like 'perl', that is not the case
$ # quantifiers match as much as possible, but precedence is left to right
$ echo 'foo123312baz' | perl -pe 's/o(1|2|3)+(12baz)?/X/'
foXbaz

While determining the longest match, overall regular expression matching is also considered. That's how Error.*valid example worked. If .* had consumed everything after Error, there wouldn't be any more characters to try to match valid. So, among the varying quantity of characters to match for .*, the longest portion that satisfies the overall regular expression is chosen. Something like a.*b will match from first a in the input string to the last b in the string. In other implementations, like perl, this is achieved through a process called backtracking. Both approaches have their own advantages and disadvantages and have cases where the regexp can result in exponential time consumption.

$ # from start of line to last 'm' in the line
$ echo 'car bat cod map scat dot abacus' | awk '{sub(/.*m/, "-")} 1'
-ap scat dot abacus

$ # from first 'b' to last 't' in the line
$ echo 'car bat cod map scat dot abacus' | awk '{sub(/b.*t/, "-")} 1'
car - abacus

$ # from first 'b' to last 'at' in the line
$ echo 'car bat cod map scat dot abacus' | awk '{sub(/b.*at/, "-")} 1'
car - dot abacus

$ # here 'm*' will match 'm' zero times as that gives the longest match
$ echo 'car bat cod map scat dot abacus' | awk '{sub(/a.*m*/, "-")} 1'
c-

Character classes

To create a custom placeholder for limited set of characters, enclose them inside [] metacharacters. It is similar to using single character alternations inside a grouping, but with added flexibility and features. Character classes have their own versions of metacharacters and provide special predefined sets for common use cases. Quantifiers are also applicable to character classes.

$ # same as: awk '/cot|cut/' and awk '/c(o|u)t/'
$ printf 'cute\ncat\ncot\ncoat\ncost\nscuttle\n' | awk '/c[ou]t/'
cute
cot
scuttle

$ # same as: awk '/.(a|e|o)+t/'
$ printf 'meeting\ncute\nboat\nat\nfoot\n' | awk '/.[aeo]+t/'
meeting
boat
foot

$ # same as: awk '{gsub(/\<(s|o|t)(o|n)\>/, "X")} 1'
$ echo 'no so in to do on' | awk '{gsub(/\<[sot][on]\>/, "X")} 1'
no X in X do X

$ # strings made up of letters 'o' and 'n', string length at least 2
$ # /usr/share/dict/words contains dictionary words, one word per line
$ awk '/^[on]{2,}$/' /usr/share/dict/words
no
non
noon
on

Character classes have their own metacharacters to help define the sets succinctly. Metacharacters outside of character classes like ^, $, () etc either don't have special meaning or have a completely different one inside the character classes.

First up, the - metacharacter that helps to define a range of characters instead of having to specify them all individually.

$ # same as: awk '{gsub(/[0123456789]+/, "-")} 1'
$ echo 'Sample123string42with777numbers' | awk '{gsub(/[0-9]+/, "-")} 1'
Sample-string-with-numbers

$ # whole words made up of lowercase alphabets and digits only
$ echo 'coat Bin food tar12 best' | awk '{gsub(/\<[a-z0-9]+\>/, "X")} 1'
X Bin X X X

$ # whole words made up of lowercase alphabets, starting with 'p' to 'z'
$ echo 'road i post grip read eat pit' | awk '{gsub(/\<[p-z][a-z]*\>/, "X")} 1'
X i X grip X eat X

Character classes can also be used to construct numeric ranges. However, it is easy to miss corner cases and some ranges are complicated to design. See also regular-expressions: Matching Numeric Ranges with a Regular Expression.

$ # numbers from 10 to 29
$ echo '23 154 12 26 34' | awk '{gsub(/\<[12][0-9]\>/, "X")} 1'
X 154 X X 34

$ # numbers >= 100 with optional leading zeros
$ echo '0501 035 154 12 26 98234' | awk '{gsub(/\<0*[1-9][0-9]{2,}\>/, "X")} 1'
X 035 X 12 26 X
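
Since awk can compare numbers directly, a numeric range is sometimes easier to express without a regexp at all. This is a different technique from the character class approach above, shown here only as a hedged alternative. The sketch loops over the fields and assumes they are separated by single spaces; the variable name ranged is a placeholder.

```shell
# replace fields whose numeric value is from 10 to 29
# assigning to a field rebuilds $0 using single spaces as the separator
ranged=$(echo '23 154 12 26 34 7' |
    awk '{for (i = 1; i <= NF; i++) if ($i >= 10 && $i <= 29) $i = "X"} 1')

# prints: X 154 X X 34 7
printf '%s\n' "$ranged"
```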

The next metacharacter is ^, which has to be specified as the first character of the character class. It negates the set of characters, so all characters other than those specified will be matched. Handle negative logic with care though, as you might end up matching more than you wanted.

$ # replace all non-digits
$ echo 'Sample123string42with777numbers' | awk '{gsub(/[^0-9]+/, "-")} 1'
-123-42-777-

$ # delete last two columns based on a delimiter
$ echo 'foo:123:bar:baz' | awk '{sub(/(:[^:]+){2}$/, "")} 1'
foo:123

$ # sequence of characters surrounded by unique character
$ echo 'I like "mango" and "guava"' | awk '{gsub(/"[^"]+"/, "X")} 1'
I like X and X

$ # sometimes it is simpler to positively define a set than negation
$ # same as: awk '/^[^aeiou]*$/'
$ printf 'tryst\nfun\nglyph\npity\nwhy\n' | awk '!/[aeiou]/'
tryst
glyph
why

Some commonly used character sets have predefined escape sequences:

  • \w matches all word characters [a-zA-Z0-9_] (recall the description for word boundaries)
  • \W matches all non-word characters (recall duality seen earlier, like \y and \B)
  • \s matches all whitespace characters: tab, newline, vertical tab, form feed, carriage return and space
  • \S matches all non-whitespace characters

$ # match all non-word characters
$ echo 'load;err_msg--\/ant,r2..not' | awk '{gsub(/\W+/, "-")} 1'
load-err_msg-ant-r2-not

$ # replace all sequences of whitespaces with single space
$ printf 'hi  \v\f  there.\thave   \ra nice\t\tday\n' | awk '{gsub(/\s+/, " ")} 1'
hi there. have a nice day

These escape sequences cannot be used inside character classes.

$ # \w would simply match w inside character classes
$ echo 'w=y\x+9*3' | awk '{gsub(/[\w=]/, "")} 1'
y\x+9*3

warning awk doesn't support \d and \D, commonly featured in other implementations as a shortcut for all the digits and non-digits.

A named character set is defined by a name enclosed between [: and :] and has to be used within a character class [], along with any other characters as needed.

Named set    Description
[:digit:]    [0-9]
[:lower:]    [a-z]
[:upper:]    [A-Z]
[:alpha:]    [a-zA-Z]
[:alnum:]    [0-9a-zA-Z]
[:xdigit:]   [0-9a-fA-F]
[:cntrl:]    control characters: first 32 ASCII characters and 127th (DEL)
[:punct:]    all the punctuation characters
[:graph:]    [:alnum:] and [:punct:]
[:print:]    [:alnum:], [:punct:] and space
[:blank:]    space and tab characters
[:space:]    whitespace characters, same as \s

$ s='err_msg xerox ant m_2 P2 load1 eel'
$ echo "$s" | awk '{gsub(/\<[[:lower:]]+\>/, "X")} 1'
err_msg X X m_2 P2 load1 X

$ echo "$s" | awk '{gsub(/\<[[:lower:]_]+\>/, "X")} 1'
X X X m_2 P2 load1 X

$ echo "$s" | awk '{gsub(/\<[[:alnum:]]+\>/, "X")} 1'
err_msg X X m_2 X X X

$ echo ',pie tie#ink-eat_42' | awk '{gsub(/[^[:punct:]]+/, "")} 1'
,#-_

Specific placement is needed to match character class metacharacters literally. Or, they can be escaped by prefixing \ to avoid having to remember the different rules. As \ is special inside character class, use \\ to represent it literally.

$ # - should be first or last character within []
$ echo 'ab-cd gh-c 12-423' | awk '{gsub(/[a-z-]{2,}/, "X")} 1'
X X 12-423
$ # or escaped with \
$ echo 'ab-cd gh-c 12-423' | awk '{gsub(/[a-z\-0-9]{2,}/, "X")} 1'
X X X

$ # ] should be first character within []
$ printf 'int a[5]\nfoo\n1+1=2\n' | awk '/[=]]/'
$ printf 'int a[5]\nfoo\n1+1=2\n' | awk '/[]=]/'
int a[5]
1+1=2

$ # to match [ use [ anywhere in the character set
$ # [][] will match both [ and ]
$ printf 'int a[5]\nfoo\n1+1=2\n' | awk '/[][]/'
int a[5]

$ # ^ should be other than first character within []
$ echo 'f*(a^b) - 3*(a+b)/(a-b)' | awk '{gsub(/a[+^]b/, "c")} 1'
f*(c) - 3*(c)/(a-b)
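
The examples above don't include a literal \ inside a character class, so here is a minimal sketch of that case. The sample string and the variable name escaped are assumptions for illustration. printf is used instead of echo so that the backslashes in the input reach awk unchanged regardless of the shell.

```shell
# \\ inside [] represents a single literal backslash
# this class matches either a backslash or a colon
escaped=$(printf '%s\n' 'a\b:c\d' | awk '{gsub(/[\\:]/, "-")} 1')

# prints: a-b-c-d
printf '%s\n' "$escaped"
```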

warning Combinations like [. or [: cannot be used together to mean two individual characters, as they have special meaning within []. See gawk manual: Using Bracket Expressions for more details.

$ echo 'int a[5]' | awk '/[x[.y]/'
awk: cmd. line:1: error: Unmatched [, [^, [:, [., or [=: /[x[.y]/
$ echo 'int a[5]' | awk '/[x[y.]/'
int a[5]

Escape sequences

Certain ASCII characters like tab \t, carriage return \r, newline \n, etc have escape sequences to represent them. Additionally, any character can be represented using its ASCII value in octal \NNN or hexadecimal \xNN formats. Unlike character set escape sequences like \w, these can be used inside character classes.

$ # using \t to represent tab character
$ printf 'foo\tbar\tbaz\n' | awk '{gsub(/\t/, " ")} 1'
foo bar baz

$ # these escape sequences work inside character class too
$ printf 'a\t\r\fb\vc\n' | awk '{gsub(/[\t\v\f\r]+/, ":")} 1'
a:b:c

$ # representing single quotes
$ # use \047 for octal format
$ echo "universe: '42'" | awk '{gsub(/\x27/, "")} 1'
universe: 42
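
For comparison, here is the octal form mentioned in the comment above, as a small sketch. It also uses the same escape inside a string constant, which is one way to print single quotes from an awk program that is itself wrapped in single quotes. The variable names unquoted and quoted are placeholders.

```shell
# \047 is the octal escape for the single quote character
unquoted=$(echo "universe: '42'" | awk '{gsub(/\047/, "")} 1')

# the same escape inside a string constant, to wrap the input in single quotes
quoted=$(echo 'universe' | awk '{print "\047" $0 "\047"}')

# prints:
# universe: 42
# 'universe'
printf '%s\n' "$unquoted" "$quoted"
```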

info If a metacharacter is specified by ASCII value, it will still act as the metacharacter. Undefined sequences will result in a warning and will be treated as the character they escape.

$ # \x5e is ^ character, acts as string anchor here
$ printf 'cute\ncot\ncat\ncoat\n' | awk '/\x5eco/'
cot
coat

$ # & metacharacter in replacement will be discussed in a later section
$ # it represents entire matched portion
$ echo 'hello world' | awk '{sub(/.*/, "[&]")} 1'
[hello world]
$ # \x26 is & character
$ echo 'hello world' | awk '{sub(/.*/, "[\x26]")} 1'
[hello world]

$ echo 'read' | awk '{sub(/a/, "\.")} 1'
awk: cmd. line:1: warning: escape sequence `\.' treated as plain `.'
re.d

info See gawk manual: Escape Sequences for full list and other details.

Replace specific occurrence

The third substitution function is gensub, which can be used instead of both the sub and gsub functions. Syntax wise, gensub needs at least three arguments. The third argument indicates whether you want to replace all occurrences (with "g") or a specific occurrence (by giving a number). Another difference is that gensub returns a string value (irrespective of whether the substitution succeeded) instead of modifying the input.

$ # same as: sed 's/:/-/2'
$ # replace only second occurrence of ':' with '-'
$ # note that output of gensub is passed to print here
$ echo 'foo:123:bar:baz' | awk '{print gensub(/:/, "-", 2)}'
foo:123-bar:baz

$ # same as: sed -E 's/[^:]+/X/3'
$ # replace only third field with 'X'
$ echo 'foo:123:bar:baz' | awk '{print gensub(/[^:]+/, "X", 3)}'
foo:123:X:baz

The fourth argument for gensub function allows you to specify the input string or variable on which the substitution has to be performed. Default is $0, as seen in previous examples.

$ # replace vowels with 'X' only for fourth field
$ # same as: awk '{gsub(/[aeiou]/, "X", $4)} 1'
$ echo '1 good 2 apples' | awk '{$4 = gensub(/[aeiou]/, "X", "g", $4)} 1'
1 good 2 XpplXs

Backreferences

The grouping metacharacters () are also known as capture groups. They are like variables, the string captured by () can be referred later using backreference \N where N is the capture group you want. Leftmost ( in the regular expression is \1, next one is \2 and so on up to \9. As a special case, & metacharacter represents entire matched string. As \ is special inside double quotes, you'll have to use "\\1" to represent \1.

info Backreferences of the form \N can only be used with gensub function. & can be used with sub, gsub and gensub functions. \0 can also be used instead of & with gensub function.

$ # reduce \\ to single \ and delete if it is a single \
$ s='\[\] and \\w and \[a-zA-Z0-9\_\]'
$ echo "$s" | awk '{print gensub(/(\\?)\\/, "\\1", "g")}'
[] and \w and [a-zA-Z0-9_]

$ # duplicate first column value as final column
$ echo 'one,2,3.14,42' | awk '{print gensub(/^([^,]+).*/, "&,\\1", 1)}'
one,2,3.14,42,one

$ # add something at start and end of string, gensub isn't needed here
$ echo 'hello world' | awk '{sub(/.*/, "Hi. &. Have a nice day")} 1'
Hi. hello world. Have a nice day

$ # here {N} refers to last but Nth occurrence
$ s='456:foo:123:bar:789:baz'
$ echo "$s" | awk '{print gensub(/(.*):((.*:){2})/, "\\1[]\\2", 1)}'
456:foo:123[]bar:789:baz

warning See unix.stackexchange: Why doesn't this sed command replace the 3rd-to-last "and"? for a bug related to use of word boundaries in the ((){N}) generic case.

warning Unlike other regular expression implementations, like grep or sed or perl, backreferences cannot be used in search section in awk. See also unix.stackexchange: backreference in awk.

If a quantifier is applied on a pattern grouped inside () metacharacters, you'll need an outer () group to capture the matching portion. Some regular expression engines provide non-capturing groups to handle such cases. In awk, you'll have to work around the extra capture group.

$ # note the numbers used in replacement section
$ s='one,2,3.14,42'
$ echo "$s" | awk '{$0=gensub(/^(([^,]+,){2})([^,]+)/, "[\\1](\\3)", 1)} 1'
[one,2,](3.14),42

Here's an example where alternation order matters when the matching portions have the same length. The aim is to delete all whole words unless they start with g or p and contain y.

$ s='tryst,fun,glyph,pity,why,group'

$ # all words get deleted because \w+ gets priority here
$ echo "$s" | awk '{print gensub(/\<\w+\>|(\<[gp]\w*y\w*\>)/, "\\1", "g")}'
,,,,,

$ # capture group gets priority here, thus words matching the group are retained
$ echo "$s" | awk '{print gensub(/(\<[gp]\w*y\w*\>)|\<\w+\>/, "\\1", "g")}'
,,glyph,pity,,

As \ and & are special characters inside double quotes in replacement section, use \\ and \\& respectively for literal representation.

$ echo 'foo and bar' | awk '{sub(/and/, "[&]")} 1'
foo [and] bar
$ echo 'foo and bar' | awk '{sub(/and/, "[\\&]")} 1'
foo [&] bar

$ echo 'foo and bar' | awk '{sub(/and/, "\\")} 1'
foo \ bar

Case insensitive matching

Unlike sed or perl, regular expressions in awk do not directly support the use of flags to change certain behaviors. For example, there is no flag to force the regexp to ignore case while matching.

The IGNORECASE special variable controls case sensitivity, which is 0 by default. By changing it to some other value (which would mean true in conditional expression), you can match case insensitively. The -v command line option allows you to assign a variable before input is read. The BEGIN block is also often used to change such settings.

$ printf 'Cat\ncOnCaT\nscatter\ncot\n' | awk -v IGNORECASE=1 '/cat/'
Cat
cOnCaT
scatter

$ # for small enough string, can also use character class
$ printf 'Cat\ncOnCaT\nscatter\ncot\n' | awk '{gsub(/[cC][aA][tT]/, "dog")} 1'
dog
cOndog
sdogter
cot

Another way is to use built-in string function tolower to change the input to lowercase first.

$ printf 'Cat\ncOnCaT\nscatter\ncot\n' | awk 'tolower($0) ~ /cat/'
Cat
cOnCaT
scatter

Dynamic regexp

As seen earlier, you can use a string literal instead of a regexp literal to specify the pattern to be matched. This implies that you can use any expression or variable as well. This is helpful if you need to compute the regexp based on some conditions or if you are getting the pattern externally, such as user input.

The -v command line option comes in handy to get user input, say from a bash variable.

$ r='cat.*dog|dog.*cat'
$ echo 'two cats and a dog' | awk -v ip="$r" '{gsub(ip, "pets")} 1'
two pets

$ awk -v s='ow' '$0 ~ s' table.txt
brown bread mat hair 42
yellow banana window shoes 3.14

$ # you'll have to make sure to use \\ instead of \
$ r='\\<[12][0-9]\\>'
$ echo '23 154 12 26 34' | awk -v ip="$r" '{gsub(ip, "X")} 1'
X 154 X X 34

info See Using shell variables chapter for a way to avoid having to escape backslashes.
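
As a minimal sketch of one such approach, the pattern can be passed as an environment variable and read through the ENVIRON array. The variable name ip is chosen here only for illustration. Values in ENVIRON don't go through the escape sequence processing applied to -v assignments, so a single \ is enough.

```shell
# ENVIRON values are used as is, so \. reaches the regexp
# engine intact and matches only a literal dot
echo '3.14 314 3x14' | ip='3\.14' awk '{gsub(ENVIRON["ip"], "X")} 1'
# X 314 3x14
```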

Sometimes, you need to get user input and then treat it literally instead of as a regexp pattern. In such cases, you'll need to escape the metacharacters first before using the input in substitution functions. The example below shows how to do it for the search section. For the replace section, you only have to escape the \ and & characters.

$ awk -v s='(a.b)^{c}|d' 'BEGIN{gsub(/[{[(^$*?+.|\\]/, "\\\\&", s); print s}'
\(a\.b)\^\{c}\|d

$ echo 'f*(a^b) - 3*(a^b)' |
     awk -v s='(a^b)' '{gsub(/[{[(^$*?+.|\\]/, "\\\\&", s); gsub(s, "c")} 1'
f*c - 3*c

$ # match given input string literally, but only at the end of string
$ echo 'f*(a^b) - 3*(a^b)' |
     awk -v s='(a^b)' '{gsub(/[{[(^$*?+.|\\]/, "\\\\&", s); gsub(s "$", "c")} 1'
f*(a^b) - 3*c

info See my blog post for more details about escaping metacharacters.

info If you only need to match the input literally instead of performing a substitution, you can use the index function. See index section for details.
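
Here's a rough sketch of the index based approach, shown under the assumption that the search text contains no \ characters (since -v would process them). The second command goes a step further and replaces only the first occurrence by combining index with substr. This is an alternative to escaping and not a replacement for the gsub examples above. The variable names s, r and i are arbitrary.

```shell
# index returns the position of a literal substring, 0 if not found
printf 'f*(a^b)\nf*a^b\n' | awk -v s='(a^b)' 'index($0, s)'
# f*(a^b)

# literal replacement of the first occurrence using index and substr
# neither the search text nor the replacement text is treated specially
echo 'f*(a^b) - 3*(a^b)' | awk -v s='(a^b)' -v r='[&]' '
    { i = index($0, s) }
    i { $0 = substr($0, 1, i-1) r substr($0, i+length(s)) }
    1'
# f*[&] - 3*(a^b)
```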

Summary

Regular expressions are a feature that you'll encounter in multiple command line programs and programming languages. They are a versatile tool for text processing. Although awk offers fewer regexp features than those found in programming languages, they are sufficient for most of the tasks you'll need for command line usage. It takes a lot of time to get used to the syntax and features of regular expressions, so I'd encourage you to practice a lot and maintain notes. It'd also help to consider it as a mini programming language in itself for its flexibility and complexity.

Exercises

a) For the given input, print all lines that start with den or end with ly.

$ lines='lovely\n1 dentist\n2 lonely\neden\nfly away\ndent\n'
$ printf '%b' "$lines" | awk ##### add your solution here
lovely
2 lonely
dent

b) Replace all occurrences of 42 with [42] unless it is at the edge of a word. Note that a word in these exercises has the same meaning as defined in regular expressions.

$ echo 'hi42bye nice421423 bad42 cool_42a 42c' | awk ##### add your solution here
hi[42]bye nice[42]1[42]3 bad42 cool_[42]a 42c

c) Add [] around words starting with s and containing e and t in any order.

$ words='sequoia subtle exhibit asset sets tests site'
$ echo "$words" | awk ##### add your solution here
sequoia [subtle] exhibit asset [sets] tests [site]

d) Replace the space character that occurs after a word ending with a or r with a newline character.

$ echo 'area not a _a2_ roar took 22' | awk ##### add your solution here
area
not a
_a2_ roar
took 22

e) Replace all occurrences of [4]|* with 2 for the given input.

$ echo '2.3/[4]|*6 foo 5.3-[4]|*9' | awk ##### add your solution here
2.3/26 foo 5.3-29

f) awk '/\<[a-z](on|no)[a-z]\>/' is the same as awk '/\<[a-z][on]{2}[a-z]\>/'. True or False? The sample input shown below might help to understand the differences, if any.

$ printf 'known\nmood\nknow\npony\ninns\n'
known
mood
know
pony
inns

g) Print all lines that start with hand and end with s or y or le or have no further characters. For example, handed shouldn't be printed even though it starts with hand.

$ lines='handed\nhand\nhandy\nunhand\nhands\nhandle\n'
$ printf '%b' "$lines" | awk ##### add your solution here
hand
handy
hands
handle

h) Replace 42//5 or 42/5 with 8 for the given input.

$ echo 'a+42//5-c pressure*3+42/5-14256' | awk ##### add your solution here
a+8-c pressure*3+8-14256

i) For the given quantifiers, what would be the equivalent form using {m,n} representation?

  • ? is same as
  • * is same as
  • + is same as

j) True or False? (a*|b*) is the same as (a|b)*

k) For the given input, construct two different regexps to get the outputs as shown below.

$ # delete from '(' till next ')'
$ echo 'a/b(division) + c%d() - (a#(b)2(' | awk ##### add your solution here
a/b + c%d - 2(

$ # delete from '(' till next ')' but not if there is '(' in between
$ echo 'a/b(division) + c%d() - (a#(b)2(' | awk ##### add your solution here
a/b + c%d - (a#2(

l) For the input file anchors.txt, convert markdown anchors to corresponding hyperlinks.

$ cat anchors.txt
# <a name="regular-expressions"></a>Regular Expressions
## <a name="subexpression-calls"></a>Subexpression calls

$ awk ##### add your solution here
[Regular Expressions](#regular-expressions)
[Subexpression calls](#subexpression-calls)

m) Display all lines that satisfy both of these conditions:

  • professor matched irrespective of case
  • quip or this matched case sensitively

The input is a file downloaded from the internet, as shown below.

$ wget https://www.gutenberg.org/files/345/345.txt -O dracula.txt

$ awk ##### add your solution here
equipment of a professor of the healing craft. When we were shown in,
should be. I could see that the Professor had carried out in this room,
"Not up to this moment, Professor," she said impulsively, "but up to
and sprang at us. But by this time the Professor had gained his feet,
this time the Professor had to ask her questions, and to ask them pretty

n) The given sample strings have fields separated by , and field values cannot be empty. Replace the third field with 42.

$ echo 'lion,ant,road,neon' | awk ##### add your solution here
lion,ant,42,neon

$ echo '_;3%,.,=-=,:' | awk ##### add your solution here
_;3%,.,42,:

o) For the given strings, replace the fourth occurrence of so from the end (the last but third) with X. Only print the lines that are changed by the substitution.

$ printf 'so and so also sow and soup' | awk ##### add your solution here
so and X also sow and soup

$ printf 'sososososososo\nso and so\n' | awk ##### add your solution here
sososoXsososo

p) Surround all whole words with (). Additionally, if the whole word is imp or ant, delete it. Can you do it with a single substitution?

$ words='tiger imp goat eagle ant important'
$ echo "$words" | awk ##### add your solution here
(tiger) () (goat) (eagle) () (important)