BRE/ERE Regular Expressions
This chapter covers Basic and Extended Regular Expressions as implemented in GNU grep. Unless otherwise indicated, examples and descriptions will assume ASCII input. GNU grep also supports Perl Compatible Regular Expressions, which will be discussed in a later chapter.
By default, grep treats the search pattern as Basic Regular Expression (BRE). Here are the various options available to choose a particular flavor:
- -G option can be used to specify explicitly that BRE is needed
- -E option will enable Extended Regular Expression (ERE)
    - in GNU grep, BRE and ERE only differ in how metacharacters are specified, no difference in features
- -F option will cause the search patterns to be treated literally
- -P if available, this option will enable Perl Compatible Regular Expression (PCRE)
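As a rough illustration of how the chosen flavor changes the meaning of the same pattern string, here's a small sketch. The `sample` variable and its contents are made up for this illustration and aren't part of the example files.

```shell
# hypothetical sample input, one item per line
sample='a.b|c
axb|c
axb
c'

# BRE (the default, same as -G): '.' is a metacharacter, '|' is literal
printf '%s\n' "$sample" | grep -G 'a.b|c'
# prints: a.b|c and axb|c

# ERE: both '.' and '|' are metacharacters, so this means 'a.b' OR 'c'
printf '%s\n' "$sample" | grep -E 'a.b|c'
# prints all four lines

# fixed string: neither character is special
printf '%s\n' "$sample" | grep -F 'a.b|c'
# prints: a.b|c
```

The details of each metacharacter are covered in the rest of this chapter.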
The example_files directory has all the files used in the examples.
See grep manual: Problematic Regular Expressions if you are working on portable scripts. See also POSIX specification for BRE and ERE.
Line Anchors
Instead of matching anywhere in the line, restrictions can be specified. For now, you'll see the ones that are already part of BRE/ERE. In later sections and chapters, you'll get to know how to define your own rules for restriction. These restrictions are made possible by assigning special meaning to certain characters and escape sequences.
The characters with special meaning are known as metacharacters in regular expressions parlance. In case you need to match those characters literally, you need to escape them with a \
(discussed in the Escaping metacharacters section).
There are two line anchors:
- ^ metacharacter restricts the matching to the start of the line
- $ metacharacter restricts the matching to the end of the line
Here are some examples:
$ cat anchors.txt
sub par
spar
apparent effort
two spare computers
cart part tart mart
# lines starting with 's'
$ grep '^s' anchors.txt
sub par
spar
# lines ending with 'rt'
$ grep 'rt$' anchors.txt
apparent effort
cart part tart mart
You can combine these two anchors to match only whole lines. Or, use the -x
option.
$ printf 'spared no one\npar\nspar\ndare' | grep '^par$'
par
$ printf 'spared no one\npar\nspar\ndare' | grep -x 'par'
par
Word Anchors
The second type of restriction is word anchors. A word character is any alphabetic character (irrespective of case), a digit or the underscore character. Word anchors are similar to using the -w option, with the added flexibility of anchoring only at the start or the end of a word.
The escape sequence \b
denotes a word boundary. This works for both the start of word and the end of word anchoring. Start of word means either the character prior to the word is a non-word character or there is no character (start of line). Similarly, end of word means the character after the word is a non-word character or no character (end of line). This implies that you cannot have word boundaries without a word character. Here are some examples:
$ cat anchors.txt
sub par
spar
apparent effort
two spare computers
cart part tart mart
# match words starting with 'par'
$ grep '\bpar' anchors.txt
sub par
cart part tart mart
# match words ending with 'par'
$ grep 'par\b' anchors.txt
sub par
spar
# match only whole word 'par'
$ grep '\bpar\b' anchors.txt
sub par
$ grep -w 'par' anchors.txt
sub par
Word boundaries behave a bit differently than the -w option. See the Word boundary differences section for details.
Alternatively, you can use \< to indicate the start of word anchor and \> to indicate the end of word anchor. Using \b is preferred as it is more commonly used in other regular expression implementations and has \B as its opposite.
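Here's a short sketch of these alternate anchors. The `sample` function is a hypothetical helper that supplies a few lines from anchors.txt inline, so that the commands can be run on their own.

```shell
# hypothetical helper: a few lines borrowed from anchors.txt
sample() {
    printf '%s\n' 'sub par' 'spar' 'apparent effort' 'cart part tart mart'
}

# words starting with 'par'
sample | grep '\<par'
# prints: sub par and cart part tart mart

# words ending with 'par'
sample | grep 'par\>'
# prints: sub par and spar

# only the whole word 'par'
sample | grep '\<par\>'
# prints: sub par
```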
Opposite Word Anchor
The word boundary has an opposite anchor too. \B
matches wherever \b
doesn't match. This duality will be seen with some other escape sequences too.
# match 'par' if it is surrounded by word characters
$ grep '\Bpar\B' anchors.txt
apparent effort
two spare computers
# match 'par' but not as start of word
$ grep '\Bpar' anchors.txt
spar
apparent effort
two spare computers
# match 'par' but not as end of word
$ grep 'par\B' anchors.txt
apparent effort
two spare computers
cart part tart mart
Negative logic is handy in many text processing situations. But use it with care: you might end up matching things you didn't intend.
Alternation
Often, you'd want to search for multiple terms. In a conditional expression, you can use the logical operators to combine multiple conditions. With regular expressions, the | metacharacter is similar to logical OR. The regular expression will match if any of the patterns separated by | is satisfied.
Alternation is similar to using multiple -e options, but provides more flexibility when combined with grouping. The | metacharacter syntax varies between BRE and ERE. Quoting from the manual:
In basic regular expressions the meta-characters ?, +, {, |, (, and ) lose their special meaning; instead use the backslashed versions \?, \+, \{, \|, \(, and \).
Here are some examples:
$ cat pets.txt
I like cats
I like parrots
I like dogs
# three different ways to match either 'cat' or 'dog'
$ grep 'cat\|dog' pets.txt
I like cats
I like dogs
$ grep -E 'cat|dog' pets.txt
I like cats
I like dogs
$ grep -e 'cat' -e 'dog' pets.txt
I like cats
I like dogs
# extract either 'cat' or 'dog' or 'fox' case insensitively
$ printf 'CATs dog bee parrot FoX' | grep -ioE 'cat|dog|fox'
CAT
dog
FoX
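The quote from the manual applies to the other listed metacharacters as well, not just |. As a minimal sketch (the two input lines are made up for this illustration), here's how + switches between literal and quantifier meaning depending on the flavor and the backslash. Quantifiers themselves are discussed later in this chapter.

```shell
# BRE: unescaped '+' is a literal character
printf 'a+b\naab\n' | grep 'a+b'
# prints: a+b

# BRE: backslashed '\+' is the "one or more" quantifier
printf 'a+b\naab\n' | grep 'a\+b'
# prints: aab

# ERE: the roles are swapped
printf 'a+b\naab\n' | grep -E 'a+b'
# prints: aab
printf 'a+b\naab\n' | grep -E 'a\+b'
# prints: a+b
```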
Here's an example of alternate patterns with their own anchors:
# match lines starting with 't' or a line containing a word ending with 'ar'
$ grep -E '^t|ar\b' anchors.txt
sub par
spar
two spare computers
Sometimes, you want to view the entire input file with only the required search patterns highlighted. You can use an empty alternation to match any line.
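A minimal sketch of this idea is shown below, assuming a terminal that supports color output. The input lines are supplied inline for illustration. Since the empty alternative matches every line, all the input lines are printed, and only the 'spar' portions get highlighted.

```shell
# the empty alternative after '|' matches any line, so nothing is filtered out
# --color=auto highlights the 'spar' portions when the output is a terminal
printf 'sub par\nspar\napparent effort\n' | grep --color=auto -E 'spar|'
# prints all three input lines
```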
Alternation precedence
There are some tricky corner cases when using alternation. If it is used for filtering a line, there is no ambiguity. However, for matching portion extraction with -o
option, it depends on a few factors. Say, you want to extract are
or spared
— which one should get precedence? The bigger word spared
or the substring are
inside it or based on something else?
The alternative which matches earliest in the input gets precedence.
$ echo 'car spared spar' | grep -oE 'are|spared'
spared
$ echo 'car spared spar' | grep -oE 'spared|are'
spared
In case of matches starting from same location, for example party
and par
, the longest matching portion gets precedence. See Longest match wins section for more examples. See regular-expressions: alternation for more information on this topic.
# same output irrespective of alternation order
$ echo 'pool party 2' | grep -oE 'party|par'
party
$ echo 'pool party 2' | grep -oE 'par|party'
party
# other implementations like PCRE have left-to-right priority
$ echo 'pool party 2' | grep -oP 'par|party'
par
Grouping
Often, there are some common things among the regular expression alternatives. It could be common characters or qualifiers like the anchors. In such cases, you can group them using a pair of parentheses metacharacters. Similar to a(b+c)d = abd+acd
in maths, you get a(b|c)d = abd|acd
in regular expressions.
# without grouping
$ printf 'red\nreform\nread\ncrest' | grep -E 'reform|rest'
reform
crest
# with grouping
$ printf 'red\nreform\nread\ncrest' | grep -E 're(form|st)'
reform
crest
# without grouping
$ grep -E '\bpar\b|\bpart\b' anchors.txt
sub par
cart part tart mart
# taking out common anchors
$ grep -E '\b(par|part)\b' anchors.txt
sub par
cart part tart mart
# taking out common characters as well
# you'll later learn a better technique instead of using empty alternate
$ grep -E '\bpar(|t)\b' anchors.txt
sub par
cart part tart mart
Escaping metacharacters
You have already seen a few metacharacters and escape sequences that help compose a regular expression. To match the metacharacters literally, i.e. to remove their special meaning, prefix those characters with a \
character. To indicate a literal \
character, use \\
. Some of the metacharacters, like the line anchors, lose their special meaning when not used in their customary positions with BRE syntax.
If there are many metacharacters to be escaped, try to work out alternate solutions by using -F (paired with regular expression like options such as -e, -f, -i, -w, -x, etc.) or by switching between ERE and BRE. Another option is to use PCRE (covered later), which has special constructs to mark the whole or a portion of the pattern to be matched literally. This is especially useful when using shell variables.
# line anchors aren't special away from customary positions with BRE
$ echo 'a^2 + b^2 - C*3' | grep 'b^2'
a^2 + b^2 - C*3
$ echo '$a = $b + $c' | grep '$b'
$a = $b + $c
# escape line anchors to match literally if you are using ERE
# or if you want to match them at customary positions with BRE
$ echo '$a = $b + $c' | grep -o '\$' | wc -l
3
# or use -F where possible
$ echo '$a = $b + $c' | grep -oF '$' | wc -l
3
Here's another example to show differences between BRE and ERE:
# cannot use -F here as line anchor is needed
$ printf '(a/b) + c\n3 + (a/b) - c' | grep '^(a/b)'
(a/b) + c
$ printf '(a/b) + c\n3 + (a/b) - c' | grep -E '^\(a/b)'
(a/b) + c
Matching characters like tabs
GNU grep
doesn't support escape sequences like \t
(tab) and \n
(newline). Neither does it support formats like \xNN
(specifying a character by its codepoint value in hexadecimal format). Shells like Bash support ANSI-C Quoting as an alternate way to use such escape sequences.
# $'..' is ANSI-C quoting syntax
$ printf 'go\tto\ngo to' | grep $'go\tto'
go to
# \x20 in hexadecimal represents the space character
$ printf 'go\tto\ngo to' | grep $'go\x20to'
go to
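If your shell doesn't support ANSI-C quoting, one possible workaround is to let printf build the pattern via command substitution. This is only a sketch, and the variable name `tab_pat` is made up for this illustration.

```shell
# printf interprets \t, so the variable holds 'go', a tab character and 'to'
tab_pat=$(printf 'go\tto')

# count the lines matching 'go<tab>to'
printf 'go\tto\ngo to\n' | grep -c "$tab_pat"
# prints: 1
```

Note that the pattern is still treated as a regular expression, so any metacharacters present in it would need escaping as usual.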
Undefined escape sequences are treated as the character they escape. Newer versions of GNU grep will generate a warning for such escapes, and they might become errors in future versions.
$ echo 'sea eat car rat eel tea' | grep 's\ea'
grep: warning: stray \ before e
sea eat car rat eel tea
The dot metacharacter
The dot metacharacter serves as a placeholder to match any character. Later you'll learn how to define your own custom placeholders for a limited set of characters.
# extract 'c', followed by any character and then 't'
$ echo 'tac tin cot abc:tuv excite' | grep -o 'c.t'
c t
cot
c:t
cit
$ printf '42\t33\n'
42 33
# extract '2', followed by any character and then '3'
$ printf '42\t33\n' | grep -o '2.3'
2 3
If you are using a Unix-like distribution, you'll likely have the /usr/share/dict/words
dictionary file. This will be used as an input file to illustrate regular expression examples in this chapter. This file is included in the learn_gnugrep_ripgrep repo as words.txt
file (modified to make it ASCII only).
$ wc -l words.txt
98927 words.txt
# 5 character lines starting with 'du' and ending with 'ts' or 'ky'
$ grep -xE 'du.(ky|ts)' words.txt
ducts
duets
dusky
dusts
Quantifiers
Alternation helps you match one among multiple patterns. Combining the dot metacharacter with quantifiers (and alternation if needed) paves a way to perform logical AND between patterns, for example, to check if a string matches two patterns with any number of characters in between. Quantifiers can be applied to characters, groupings and some more constructs that'll be discussed later. Apart from the ability to specify exact quantity and bounded range, these can also match unbounded varying quantities.
BRE/ERE support only one type of quantifier, whereas PCRE supports three types. Quantifiers in GNU grep behave mostly like greedy quantifiers supported by PCRE, but there are subtle differences, which will be discussed with examples later on.
First up, the ?
metacharacter which quantifies a character or group to match 0
or 1
times. This helps to define optional patterns and build terser patterns compared to alternation and groupings for some cases.
# same as: grep -E '\b(fe.d|fed)\b'
# BRE version: grep -w 'fe.\?d'
$ printf 'fed\nfod\nfe:d\nfeed' | grep -wE 'fe.?d'
fed
fe:d
feed
# same as: grep -E '\bpar(|t)\b'
$ printf 'sub par\nspare\npart time' | grep -wE 'part?'
sub par
part time
# same as: grep -oE 'part|parrot'
$ echo 'par part parrot parent' | grep -oE 'par(ro)?t'
part
parrot
# same as: grep -oE 'part|parrot|parent'
$ echo 'par part parrot parent' | grep -oE 'par(en|ro)?t'
part
parrot
parent
The *
metacharacter quantifies a character or group to match 0
or more times.
# extract 'f' followed by zero or more of 'e' followed by 'd'
$ echo 'fd fed fod fe:d feeeeder' | grep -o 'fe*d'
fd
fed
feeeed
# extract zero or more of '1' followed by '2'
$ echo '3111111111125111142' | grep -o '1*2'
11111111112
2
The +
metacharacter quantifies a character or group to match 1
or more times.
# extract 'f' followed by one or more of 'e' followed by 'd'
# BRE version: grep -o 'fe\+d'
$ echo 'fd fed fod fe:d feeeeder' | grep -oE 'fe+d'
fed
feeeed
# extract 'f' followed by at least one of 'e' or 'o' or ':' followed by 'd'
$ echo 'fd fed fod fe:d feeeeder' | grep -oE 'f(e|o|:)+d'
fed
fod
fe:d
feeeed
# extract one or more of '1' followed by '2'
$ echo '3111111111125111142' | grep -oE '1+2'
11111111112
# extract one or more of '1' followed by optional '4' and then '2'
$ echo '3111111111125111142' | grep -oE '1+4?2'
11111111112
111142
You can specify a range of integer numbers, both bounded and unbounded, using {}
metacharacters. There are four ways to use this quantifier as listed below:
Quantifier | Description |
---|---|
{m,n} | match m to n times |
{m,} | match at least m times |
{,n} | match up to n times (including 0 times) |
{n} | match exactly n times |
# note that stray characters like spaces are not allowed anywhere within {}
# BRE version: grep -o 'ab\{1,4\}c'
$ echo 'abc ac adc abbc xabbbcz bbb bc abbbbbc' | grep -oE 'ab{1,4}c'
abc
abbc
abbbc
$ echo 'abc ac adc abbc xabbbcz bbb bc abbbbbc' | grep -oE 'ab{3,}c'
abbbc
abbbbbc
$ echo 'abc ac adc abbc xabbbcz bbb bc abbbbbc' | grep -oE 'ab{,2}c'
abc
ac
abbc
$ echo 'abc ac adc abbc xabbbcz bbb bc abbbbbc' | grep -oE 'ab{3}c'
abbbc
To match {} metacharacters literally (assuming ERE), escaping { alone is enough. Or if it doesn't conform strictly to any of the four forms listed above, escaping is not needed at all.
$ echo 'a{5} = 10' | grep -E 'a\{5}'
a{5} = 10
$ echo 'report_{a,b}.txt' | grep -E '_{a,b}'
report_{a,b}.txt
Conditional AND
Next up, constructing an AND conditional using the dot metacharacter and quantifiers. To allow matching in any order, you'll have to bring in alternation as well. That is somewhat manageable for 2 or 3 patterns. With PCRE, you can use lookarounds for a comparatively easier approach.
# match 'Error' followed by zero or more characters followed by 'valid'
$ echo 'Error: not a valid input' | grep -o 'Error.*valid'
Error: not a valid
$ echo 'cat and dog and parrot' | grep -oE 'cat.*dog|dog.*cat'
cat and dog
$ echo 'dog and cat and parrot' | grep -oE 'cat.*dog|dog.*cat'
dog and cat
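When you only need to filter lines (rather than extract the matching portion with -o), chaining multiple grep commands is another way to get the AND behavior in any order. This is a sketch of that alternative, not something specific to BRE/ERE, and the `sample` helper below is made up for this illustration.

```shell
# hypothetical helper providing sample input
sample() {
    printf '%s\n' 'cat and dog' 'dog only' 'parrot, dog and cat' 'cat only'
}

# lines containing both 'cat' and 'dog', in any order
sample | grep 'cat' | grep 'dog'
# prints: cat and dog
#         parrot, dog and cat

# adding a third condition needs just one more command in the pipeline
sample | grep 'cat' | grep 'dog' | grep 'parrot'
# prints: parrot, dog and cat
```

The trade-off is that each condition spawns a separate process, and you get whole lines instead of the matched portions.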
Longest match wins
You've already seen an example where the longest matching portion was chosen if the alternatives started from the same location. For example spar|spared
will result in spared
being chosen over spar
. The same applies whenever there are two or more matching possibilities from same starting location. For example, f.?o
will match foo
instead of fo
if the input string to match is foot
.
# longest match among 'foo' and 'fo' wins here
$ echo 'foot' | grep -oE 'f.?o'
foo
# everything will match here
$ echo 'car bat cod map scat dot abacus' | grep -o '.*'
car bat cod map scat dot abacus
# longest match happens when (1|2|3)+ matches up to '1233' only
# so that '12apple' can match as well
$ echo 'fig123312apple' | grep -oE 'g(1|2|3)+(12apple)?'
g123312apple
# in other implementations like PCRE, that is not the case
# precedence is left to right for greedy quantifiers
$ echo 'fig123312apple' | grep -oP 'g(1|2|3)+(12apple)?'
g123312
While determining the longest match, the overall regular expression matching is also considered. That's how Error.*valid
example worked. If .*
had consumed everything after Error
, there wouldn't be any more characters to try to match valid
. So, among the varying quantity of characters to match for .*
, the longest portion that satisfies the overall regular expression is chosen. Something like a.*b
will match from the first a
in the input string to the last b
. In other implementations, like PCRE, this is achieved through a process called backtracking. These approaches have their own advantages and disadvantages and have cases where the pattern can result in exponential time consumption.
# extract from the start of the line to the last 'm' in the line
$ echo 'car bat cod map scat dot abacus' | grep -o '.*m'
car bat cod m
# extract from the first 'c' to the last 't' in the line
$ echo 'car bat cod map scat dot abacus' | grep -o 'c.*t'
car bat cod map scat dot
# extract from the first 'c' to the last 'at' in the line
$ echo 'car bat cod map scat dot abacus' | grep -o 'c.*at'
car bat cod map scat
# here 'm*' will match 'm' zero times as that gives the longest match
$ echo 'car bat cod map scat dot abacus' | grep -o 'b.*m*'
bat cod map scat dot abacus
Character classes
To create a custom placeholder for a limited set of characters, enclose them inside [] metacharacters. It is similar to using single character alternations inside a grouping, but with added flexibility and features. Character classes have their own versions of metacharacters and provide special predefined sets for common use cases. Quantifiers are also applicable to character classes.
# same as: grep -E 'cot|cut' or grep -E 'c(o|u)t'
$ printf 'cute\ncat\ncot\ncoat\ncost\nscuttle' | grep 'c[ou]t'
cute
cot
scuttle
# same as: grep -E '(a|e|o)+t'
$ printf 'meeting\ncute\nboat\nsite\nfoot' | grep -E '[aeo]+t'
meeting
boat
foot
# same as: grep -owE '(s|o|t)(o|n)'
$ echo 'do so in to no on' | grep -ow '[sot][on]'
so
to
on
# lines made up of letters 'o' and 'n', line length at least 2
$ grep -xE '[on]{2,}' words.txt
no
non
noon
on
Character class metacharacters
Character classes have their own metacharacters to help define the sets succinctly. Metacharacters outside of character classes like ^
, $
, ()
etc either don't have special meaning or have a completely different one inside the character classes.
First up, the -
metacharacter that helps to define a range of characters instead of having to specify them all individually.
# same as: grep -oE '[0123456789]+'
$ echo 'Sample123string42with777numbers' | grep -oE '[0-9]+'
123
42
777
# whole words made up of lowercase letters only
$ echo 'coat Bin food tar12 best' | grep -owE '[a-z]+'
coat
food
best
# whole words made up of lowercase letters and digits only
$ echo 'coat Bin food tar12 best' | grep -owE '[a-z0-9]+'
coat
food
tar12
best
# whole words made up of lowercase letters, starting with 'p' to 'z'
$ echo 'go no u grip read eat pit' | grep -owE '[p-z][a-z]*'
u
read
pit
Character classes can also be used to construct numeric ranges. However, it is easy to miss corner cases and some ranges are complicated to construct.
# numbers from 10 to 29
$ echo '23 154 12 26 34' | grep -ow '[12][0-9]'
23
12
26
# numbers >= 100
$ echo '23 154 12 26 98234' | grep -owE '[0-9]{3,}'
154
98234
# numbers >= 100 if there are leading zeros
$ echo '0501 035 154 12 26 98234' | grep -owE '0*[1-9][0-9]{2,}'
0501
154
98234
The next metacharacter is ^, which has to be specified as the first character of the character class. It negates the set of characters, so all characters other than those specified will be matched. As highlighted earlier, handle negative logic with care: you might end up matching more than you wanted.
# all non-digits
$ echo 'Sample123string42with777numbers' | grep -oE '[^0-9]+'
Sample
string
with
numbers
# extract characters from the start of string based on a delimiter
$ echo 'apple:123:banana:cherry' | grep -o '^[^:]*'
apple
# extract last two columns based on a delimiter
$ echo 'apple:123:banana:cherry' | grep -oE '(:[^:]+){2}$'
:banana:cherry
# get all sequence of characters surrounded by double quotes
$ echo 'I like "mango" and "guava"' | grep -oE '"[^"]+"'
"mango"
"guava"
Sometimes, it is easier to use a positive character class and the -v option instead of using negated character classes.
# lines not containing vowel characters
# note that this will match empty lines too
$ printf 'tryst\nfun\nglyph\npity\nwhy' | grep -xE '[^aeiou]*'
tryst
glyph
why
# easier to write and maintain
$ printf 'tryst\nfun\nglyph\npity\nwhy' | grep -v '[aeiou]'
tryst
glyph
why
Escape sequence sets
Some commonly used character sets have predefined escape sequences:
- \w matches all word characters [a-zA-Z0-9_] (recall -w definition)
- \W matches all non-word characters (recall duality seen earlier, like \b and \B)
- \s matches all whitespace characters: tab, newline, vertical tab, form feed, carriage return and space
- \S matches all non-whitespace characters
These escape sequences cannot be used inside character classes (unlike PCRE). Also, as mentioned earlier, these definitions assume ASCII input.
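To illustrate the character class restriction, here's a small sketch with a made-up input string. Inside [], the backslash is just a literal character, so [\w=] matches a backslash, 'w' or '='. A named character set (discussed in the next section) can be used instead to get the intended behavior.

```shell
# '\w' inside a character class means the two characters '\' and 'w'
printf '%s\n' 'w=y\x+9*3' | grep -oE '[\w=]+'
# prints: w=
#         \

# use a named character set plus '_' to get word characters inside []
printf '%s\n' 'w=y\x+9*3' | grep -oE '[[:alnum:]_=]+'
# prints: w=y
#         x
#         9
#         3
```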
# extract all word character sequences
$ printf 'load;err_msg--\nant,r2..not\n' | grep -o '\w*'
load
err_msg
ant
r2
not
$ echo 'sea eat car rat eel tea' | grep -o '\b\w' | paste -sd ''
secret
# extract all non-whitespace character sequences
$ printf ' 1..3 \v\f fig_tea 42\tzzz \r\n1-2-3\n\n' | grep -o '\S*'
1..3
fig_tea
42
zzz
1-2-3
Named character sets
A named character set is defined by a name enclosed between [:
and :]
and has to be used within a character class []
, along with other characters as needed.
Named set | Description |
---|---|
[:digit:] | [0-9] |
[:lower:] | [a-z] |
[:upper:] | [A-Z] |
[:alpha:] | [a-zA-Z] |
[:alnum:] | [0-9a-zA-Z] |
[:xdigit:] | [0-9a-fA-F] |
[:cntrl:] | control characters — first 32 ASCII characters and 127th (DEL) |
[:punct:] | all the punctuation characters |
[:graph:] | [:alnum:] and [:punct:] |
[:print:] | [:alnum:] , [:punct:] and space |
[:blank:] | space and tab characters |
[:space:] | whitespace characters, same as \s |
Here are some examples:
$ printf 'err_msg\nxerox\nant\nm_2\nP2\nload1\neel' | grep -x '[[:lower:]]*'
xerox
ant
eel
$ printf 'err_msg\nxerox\nant\nm_2\nP2\nload1\neel' | grep -x '[[:lower:]_]*'
err_msg
xerox
ant
eel
$ printf 'err_msg\nxerox\nant\nm_2\nP2\nload1\neel' | grep -x '[[:alnum:]]*'
xerox
ant
P2
load1
eel
$ echo 'pie tie#ink-eat_42;' | grep -o '[^[:punct:]]*'
pie tie
ink
eat
42
Matching character class metacharacters literally
Specific placement is needed to match the character class metacharacters literally.
-
should be the first or the last character.
# same as: grep -owE '[-a-z]{2,}'
$ echo 'ab-cd gh-c 12-423' | grep -owE '[a-z-]{2,}'
ab-cd
gh-c
]
should be the first character.
# no match
$ printf 'int a[5]\nfig\n1+1=2\n' | grep '[=]]'
# correct usage
$ printf 'int a[5]\nfig\n1+1=2\n' | grep '[]=]'
int a[5]
1+1=2
[
can be used anywhere in the character set, but not combinations like [.
or [:
. Using [][]
will match both [
and ]
.
$ echo 'int a[5]' | grep '[x[.y]'
grep: Unmatched [, [^, [:, [., or [=
$ echo 'int a[5]' | grep '[x[y.]'
int a[5]
^
should be other than the first character.
$ echo 'f*(a^b) - 3*(a+b)/(a-b)' | grep -o 'a[+^]b'
a^b
a+b
Characters like \
and $
are not special.
$ echo '5ba\babc2' | grep -o '[a\b]*'
ba\bab
As seen in the examples above, combinations like [. or [: cannot be used together to mean two individual characters, as they have special meaning within []. See Character Classes and Bracket Expressions section in info grep for more details.
Backreferences
The grouping metacharacters () are also known as capture groups. Similar to variables in programming languages, the portion captured by () can be referred to later using backreferences. The syntax is \N where N is the capture group you want. The leftmost ( in the regular expression is \1, the next one is \2 and so on up to \9.
# 8 character lines having same 3 lowercase letters at the start and end
$ grep -xE '([a-z]{3})..\1' words.txt
mesdames
respires
restores
testates
# different than: grep -xE '([a-d]..){2}'
$ grep -xE '([a-d]..)\1' words.txt
bonbon
cancan
chichi
# whole words that have at least one consecutive repeated character
$ echo 'effort flee facade oddball rat tool' | grep -owE '\w*(\w)\1\w*'
effort
flee
oddball
tool
# spot repeated words
# use \s instead of \W if only whitespaces are allowed between words
$ printf 'spot the the error\nno issues here' | grep -wE '(\w+)\W+\1'
spot the the error
A backreference provides the string that was matched, not the pattern that was inside the capture group. For example, if ([0-9][a-f]) matches 3b, then backreferencing will give 3b and not any other valid match like 8f, 0a etc. This is akin to how variables behave in programming: only the result of the expression stays after variable assignment, not the expression itself.
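The difference can be seen by comparing a backreference with a quantifier applied to the same group. The input lines below are made up for this sketch.

```shell
# \1 must repeat the exact string captured by the group
printf '3b3b\n3b8f\n0a0a\n' | grep -xE '([0-9][a-f])\1'
# prints: 3b3b
#         0a0a

# {2} repeats the pattern, so each repetition can match a different string
printf '3b3b\n3b8f\n0a0a\n' | grep -xE '([0-9][a-f]){2}'
# prints all three lines
```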
Known Bugs
Visit grep bug list for a list of known issues. See GNU grep manual: Known Bugs for a list of backreference related bugs.
Large repetition counts in the {n,m} construct may cause grep to use lots of memory. In addition, certain other obscure regular expressions require exponential time and space, and may cause grep to run out of memory.
Back-references can greatly slow down matching, as they can generate exponentially many matching possibilities that can consume both time and memory to explore. Also, the POSIX specification for back-references is at times unclear. Furthermore, many regular expression implementations have back-reference bugs that can cause programs to return incorrect answers or even crash, and fixing these bugs has often been low-priority.
Here's an issue for a certain usage of backreferences and quantifiers that was filed by yours truly.
# takes some time and results in no output
# aim is to get words having two occurrences of repeated characters
$ grep -m5 -xiE '([a-z]*([a-z])\2[a-z]*){2}' words.txt
# works when the nesting is unrolled
$ grep -m5 -xiE '[a-z]*([a-z])\1[a-z]*([a-z])\2[a-z]*' words.txt
Abbott
Annabelle
Annette
Appaloosa
Appleseed
# no problem if PCRE is used
$ grep -m5 -xiP '([a-z]*([a-z])\2[a-z]*){2}' words.txt
Abbott
Annabelle
Annette
Appaloosa
Appleseed
unix.stackexchange: Why doesn't this sed command replace the 3rd-to-last "and"? shows another interesting bug when word boundaries and group repetitions are involved. Some examples are shown below. Again, workaround is to use PCRE or expand the group.
# wrong output
$ echo 'cocoa' | grep -E '(\bco){2}'
cocoa
# correct behavior, no output
$ echo 'cocoa' | grep -E '\bco\bco'
$ echo 'cocoa' | grep -P '(\bco){2}'
# wrong output
$ echo 'it line with it here sit too' | grep -oE 'with(.*\bit\b){2}'
with it here sit
# correct behavior, no output
$ echo 'it line with it here sit too' | grep -oE 'with.*\bit\b.*\bit\b'
$ echo 'it line with it here sit too' | grep -oP 'with(.*\bit\b){2}'
Changing word boundaries to \<
and \>
results in a different issue:
# this correctly gives no output
$ echo 'it line with it here sit too' | grep -oE 'with(.*\<it\>){2}'
# this correctly gives output
$ echo 'it line with it here it too' | grep -oE 'with(.*\<it\>){2}'
with it here it
# but this one fails
$ echo 'it line with it here it too sit' | grep -oE 'with(.*\<it\>){2}'
# correct behavior
$ echo 'it line with it here it too sit' | grep -oP 'with(.*\bit\b){2}'
with it here it
Summary
Mastering regular expressions is not only important for using grep effectively, but also comes in handy for text processing with other CLI tools like sed and awk and programming languages like Python and Ruby. These days, some of the GUI applications also support regular expressions. One main thing to remember is that syntax and features will vary. This book itself discusses four variations: BRE, ERE, PCRE and ripgrep regex. However, core concepts are likely to be the same and having a handy reference sheet would go a long way in reducing misuse.
Exercises
The exercises directory has all the files used in this section.
1) For the input file patterns.txt
, extract from (
to the next occurrence of )
unless they contain parentheses characters in between.
##### add your solution here
(division)
(#modulo)
(9-2)
()
(j/k-3)
(greeting)
(b)
2) For the input file patterns.txt
, match all lines that start with den
or end with ly
.
##### add your solution here
2 lonely
dent
lovely
3) For the input file patterns.txt
, extract all whole words containing 42
surrounded by word characters on both sides.
##### add your solution here
Hi42Bye
nice1423
cool_42a
_42_
4) For the input file patterns.txt
, match all lines containing car
but not as a whole word.
##### add your solution here
scar
care
a huge discarded pile of books
scare
part cart mart
5) Count the total number of times the whole words removed
or rested
or received
or replied
or refused
or retired
are present in the patterns.txt
file.
##### add your solution here
9
6) For the input file patterns.txt
, match lines starting with s
and containing e
and t
in any order.
##### add your solution here
sets tests
site cite kite bite
subtle sequoia
7) From the input file patterns.txt
, extract all whole lines having the same first and last word character.
##### add your solution here
sets tests
Not a pip DOWN
y
1 dentist 1
_42_
8) For the input file patterns.txt
, match all lines containing *[5]
literally.
##### add your solution here
(9-2)*[5]
9) For the given quantifiers, what would be the equivalent form using the {m,n}
representation?
- ? is same as
- * is same as
- + is same as
10) In ERE, (a*|b*)
is same as (a|b)*
— True or False?
11) grep -wE '[a-z](on|no)[a-z]'
is same as grep -wE '[a-z][on]{2}[a-z]'
. True or False? Sample input shown below might help to understand the differences, if any.
$ printf 'known\nmood\nknow\npony\ninns\n'
known
mood
know
pony
inns
12) For the input file patterns.txt
, display all lines starting with hand
and ending immediately with s
or y
or le
or no further characters.
##### add your solution here
handle
handy
hands
hand
13) For the input file patterns.txt
, display matching lines based on the patterns (one per line) present in the regex_terms.txt
file.
$ cat regex_terms.txt
^[c-k].*\W$
ly.
[A-Z].*[0-9]
##### add your solution here
Hi42Bye nice1423 bad42
fly away
def factorial()
hand
14) Will the ERE pattern ^a\w+([0-9]+:fig)?
match the same characters for the input apple42:banana314
and apple42:fig100
? If not, why not?
15) For the input file patterns.txt
, match all lines starting with [5]
.
##### add your solution here
[5]*3
16) What characters will the pattern \t
match? A tab character or \
followed by a t
or something else? Does the behavior change inside a character class? What alternatives are there to match a tab character?
17) From the input file patterns.txt
, extract all hexadecimal sequences with a minimum of four characters. Match 0x
as an optional prefix, but shouldn't be counted for determining the length. Match the characters case insensitively, and the sequences shouldn't be surrounded by other word characters.
##### add your solution here
0XdeadBEEF
bad42
0x0ff1ce
18) From the input file patterns.txt
, extract from -
till the end of the line, provided the characters after the hyphen are all word characters only.
##### add your solution here
-handy
-icy
19) For the input file patterns.txt
, count the total number of lines containing e
or i
followed by l
or n
and vice versa.
##### add your solution here
18
20) For the input file patterns.txt
, match lines starting with 4
or -
or u
or sub
or care
.
##### add your solution here
care
4*5]
-handy
subtle sequoia
unhand