GNU BRE/ERE cheatsheet

This post covers Basic Regular Expressions (BRE) and Extended Regular Expressions (ERE) syntax supported by GNU grep, sed and awk. You'll also learn the differences between these tools — for example, awk doesn't support backreferences within regexp definition (i.e. the search portion).


BRE and ERE🔗

info From GNU grep manual:

In basic regular expressions the meta-characters ?, +, {, |, (, and ) lose their special meaning; instead use the backslashed versions \?, \+, \{, \|, \(, and \).

grep and sed support BRE by default and enables ERE when -E option is used. awk supports only ERE. Assume ERE for descriptions in this post unless otherwise mentioned.

This post is intended as a reference for BRE/ERE flavor of regular expressions. For a more detailed explanation with examples and exercises, see these chapters from my ebooks:


Anchors🔗

PatternDescription
^restricts the match to the start of the string
$restricts the match to the end of the string
\<restricts the match to the start of word
\>restricts the match to the end of word

The -x cli option in grep is equivalent to ^pattern$.

Word characters include alphabets, digits and underscore. Here's some more alternate ways to specify word anchors:

PatternDescription
\brestricts the match to the start/end of words, applicable for grep and sed
\yrestricts the match to the start/end of words, applicable for awk (\b means backspace)
\Bmatches wherever \b (or \y) doesn't match

grep also supports -w cli option. It is equivalent to (?<!\w)pattern(?!\w). The three different ways to specify word anchors are not exactly equivalent though, see Word boundary differences section from my book for details and examples.


Alternation and Grouping🔗

PatternDescription
pat1|pat2|pat3match pat1 or pat2 or pat3
use \| in BRE mode
()group pattern(s), a(b|c)d is same as abd|acd
use \(\) in BRE mode

The alternative patterns can have their own independent anchors. Alternative which matches earliest in the input gets precedence. Longest matching portion wins if multiple alternatives start from the same location (irrespective of the order of alternatives). In case of a tie with same lengths, leftmost alternative wins (see stackoverflow: Non greedy matching in sed for a practical use case).


Escaping metacharacters🔗

PatternDescription
\prefix metacharacters with \ to match them literally
\\to match \ literally
  • With grep and sed, switching between ERE and BRE can reduce the number of escapes needed for some cases. For fixed string matching, grep has -F option and awk has string comparison operators (whole string) and the index function (partial string).
  • sed requires both ( and ) characters to be escaped (in ERE mode), whereas grep and awk don't require ) to be escaped.
  • sed requires { to be escaped (in ERE mode) even if it isn't part of a valid quantifier syntax, whereas grep and awk don't require escaping. For example, you'd need \{a} in sed whereas {a} is enough for the other two.
  • In BRE mode, grep and sed don't require ^ and $ to be escaped if they are used away from their customary positions.

Dot metacharacter and Quantifiers🔗

PatternDescription
.match any character, including the newline character
?match 0 or 1 times
use \? in BRE mode
*match 0 or more times
+match 1 or more times
use \+ in BRE mode
{m,n}match m to n times
{m,}match at least m times
{,n}match up to n times (including 0 times)
{n}match exactly n times
use \{\} in BRE mode
pat1.*pat2any number of characters between pat1 and pat2
pat1.*pat2|pat2.*pat1match both pat1 and pat2 in any order

Precedence rule is longest match wins, which is mostly similar but not exactly same as greedy quantifiers. For example, with foo123312baz as input string, o[123]+(12baz)? will match o123312baz with these tools, whereas it will match o123312 with greedy quantifiers.


Character class🔗

PatternDescription
[set123]match any of these characters once
[^set123]match except any of these characters once
[3-7AM-X]range of characters from 3 to 7, A, another range from M to X
[.open collating symbol
.]close collating symbol
[=open equivalence class
=]close equivalence class

Specific placement will help to match character class metacharacters literally.

PatternDescription
[a-z-]- should be first/last character to match literally
[+^]^ shouldn't be first character
[]=]] should be first character (second if ^ is used to invert the set)
  • \ isn't special within character class in grep.
  • \ can be used to escape character class metacharacters in awk.

Some commonly used character sets have predefined escape sequences:

PatternDescription
\wsimilar to [a-zA-Z0-9_] for matching word characters
\ssimilar to [ \t\n\r\f\v] for matching whitespace characters
\Wmatch non-word characters
\Smatch non-whitespace characters
  • Undefined escape sequences will be treated as the character it escapes. For example, \e will match e (not \ and e).
    • in addition, awk gives a "not a known regexp operator" warning.
  • The above escape sequences cannot be used inside character classes and behavior varies between the tools.
    • For example, using [\w] will match \ or w characters in grep and sed whereas it will match only w in awk.
  • These escape sequences are also locale aware, for example αλεπού and \u2028 (line separator) will be considered as word and whitespace characters respectively in appropriate locales.
  • These tools do not support \d and \D, commonly featured in other regexp implementations for digits and non-digits.

Escape sequences🔗

This section is applicable only for sed and awk unless otherwise specified and can be used within character classes too. See also ASCII Codes Table Standard characters.

Escape sequenceDescription
\aalert
\bbackspace in awk, word boundary in grep and sed
\b inside a character class in sed will act as a backspace
\fformfeed
\nnewline
\rcarriage return
\thorizontal tab
\vvertical tab
\cxCONTROL-x in sed

You can also represent ASCII characters using their codepoint values.

Escape sequenceDescription
\xNNhexadecimal digits
\NNNoctal digits in awk
\oNNNoctal digits in sed
\dNNNdecimal digits in sed
  • In search section, a metacharacter specified by escape sequences will still act as the metacharacter. For example, /\x5eco/ will match co only at the start of the string.
  • In replacement section,
    • escape sequences in sed produces literal character. For example, s/.*/"\x26"/ will have "&" as the replacement value.
    • escape sequences in awk is treated as metacharacter. For example, sub(/.*/, "[&]") and sub(/.*/, "[\x26]") are equivalent.

info Ways to use escape sequences with grep:


Named character sets🔗

The below table lists named sets and their equivalent character class in ASCII encoding. These can be used inside character classes only. For example, [[:digit:]] is same as [0-9] and [[:alnum:]_] is equivalent to \w.

Named setDescription
[:digit:][0-9]
[:lower:][a-z]
[:upper:][A-Z]
[:alpha:][a-zA-Z]
[:alnum:][0-9a-zA-Z]
[:xdigit:][0-9a-fA-F]
[:cntrl:]control characters — first 32 ASCII characters and 127th (DEL)
[:punct:]all the punctuation characters
[:graph:][:alnum:] and [:punct:]
[:print:][:alnum:], [:punct:] and space
[:blank:]space and tab characters
[:space:]whitespace characters, same as \s

info From grep manual:

Their interpretation depends on the LC_CTYPE locale; for example, [[:alnum:]] means the character class of numbers and letters in the current locale.


Backreferences🔗

PatternDescription
\Nbackreference, gives matched portion of Nth capture group
possible values: \1, \2 up to \9
&represents entire matched string in the replacement section
\0equivalent to & in sed

Notes for awk:

  • backreferences can be used only in replacement section, not allowed in search section.
  • sub and gsub functions allow only the & backreference.
  • gensub function allows \N form of backreference as well.
    • but need to use \\0, \\1, \\2 etc since they are specified using string syntax.

sed flags🔗

This section discusses flags (also known as modifiers) that change the regexp behavior. When used with regexp addressing:

FlagDescription
Imatch case insensitively

When used with substitution command:

FlagDescription
i or Imatch case insensitively
greplace all occurrences instead of just the first match
Na number will cause only the Nth match to be replaced
Ngreplace from Nth match to the end
m or Mmultiline mode
. will not match the newline character
^ and $ will match every line's start and end locations (line separator is \n by default and NUL when -z option is used)
\`always match the start of string irrespective of multiline mode
\'always match the end of string irrespective of multiline mode

Flags are not supported by grep or awk. But these equivalent/alternative options can be used:

  • -i cli option in grep and setting IGNORECASE to non-zero value in awk will match case insensitively.
  • tolower or toupper functions can be used in awk to convert input to single case.
  • you can also use character classes for small strings, for example [cC][aA][tT] will match cat case insensitively.
  • sub function in awk replaces only the first matching occurrence and gsub function is equivalent to using the g flag.
  • third argument of gensub function in awk supports replacing only the Nth match as well as the g flag.

The behavior of sed and awk differs for Nth match if the pattern can match empty string:

$ echo 'a,,c,d,,f' | sed 's/[^,]*/b/2'
a,b,c,d,,f
$ echo 'a,,c,d,,f' | sed 's/[^,]*/e/5'
a,,c,d,e,f

$ echo 'a,,c,d,,f' | awk '{print gensub(/[^,]*/, "b", 2)}'
ab,,c,d,,f
$ echo 'a,,c,d,,f' | awk '{print gensub(/[^,]*/, "e", 5)}'
a,,ce,d,,f

sed case conversion🔗

Escape sequenceDescription
\Eindicates end of case conversion in replacement section
\lconvert next character to lowercase
\uconvert next character to uppercase
\Lconvert following characters to lowercase, stops if \U or \E is found
\Uconvert following characters to uppercase, stops if \L or \E is found

sed delimiters🔗

  • / is idiomatically used as the delimiter.
  • Any character except \ and newline character can also be used. For example: s#/home/learnbyexample/#~/# is same as s/\/home\/learnbyexample\//~\//.
  • For regexp addressing, the first delimiter has to be escaped. For example: \;/foo/bar/;p is same as /foo\/bar\//p.