Perl Compatible Regular Expressions

The -P option lets you use Perl Compatible Regular Expressions (PCRE) instead of BRE/ERE. PCRE is mostly similar to, but not exactly the same as, the regular expressions in the Perl programming language.

PCRE is handy when you need advanced features like lookarounds, non-greedy quantifiers, possessive quantifiers, Unicode character sets, subexpression calls and so on.

Only some of the commonly used features are presented in this chapter. See man pcrepattern or the online manual for complete details.

The example_files directory has all the files used in the examples.

BRE/ERE vs PCRE subtle differences

There are several subtle differences between PCRE and BRE/ERE for the same feature. This section lists some of them, along with examples.

Escaping metacharacters

$ echo 'a^2 + b^2 - C*3' | grep 'b^2'
a^2 + b^2 - C*3

# line anchors have to be always escaped to match literally
$ echo 'a^2 + b^2 - C*3' | grep -P 'b^2'
$ echo 'a^2 + b^2 - C*3' | grep -P 'b\^2'
a^2 + b^2 - C*3

Character class metacharacters

$ echo 'int a[5]' | grep '[x[.y]'
grep: Unmatched [, [^, [:, [., or [=
# [. and [= aren't special
$ echo 'int a[5]' | grep -P '[x[.y]'
int a[5]

$ echo '5ba\babc2' | grep -o '[a\b]*'
ba\bab
# \ is special inside character class
$ echo '5ba\babc2' | grep -oP '[a\b]*'
a
a
$ echo '5ba\babc2' | grep -oP '[a\\b]*'
ba\bab

Backslash sequences inside character class

# \w here matches \ and w
$ echo 'w=y\x+9' | grep -oE '[\w=]+'
w=
\

# \w here matches word characters
$ echo 'w=y\x+9' | grep -oP '[\w=]+'
w=y
x
9

Backreferences greater than \9

# no match as '\10' will be treated as '\1' and '0'
$ echo '123456789abc42az' | grep -E '(.)(.)(.)(.)(.)(.)(.)(.)(.)(.).*\10'

# no such limitation for PCRE
# use '\g{1}0' if you need to represent '\1' and '0'
$ echo '123456789abc42az' | grep -P '(.)(.)(.)(.)(.)(.)(.)(.)(.)(.).*\10'
123456789abc42az
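
The \g{1}0 form mentioned in the comment above can be tried on a smaller input. This is only a sketch with a made-up sample string; the variable name brace_ref is an assumption for illustration.

```shell
# hypothetical input: only 'aa0' has a repeated character followed by 0
# \g{1}0 means backreference 1 followed by a literal 0
brace_ref=$(echo 'aa0 ab0 bb1' | grep -oP '(\w)\g{1}0')
printf '%s\n' "$brace_ref"
```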

Dot metacharacter

# dot metacharacter will match any character
$ printf 'blue green\nteal brown' | grep -oz 'g.*n'
green
teal brown

# by default dot metacharacter won't match newline characters
$ printf 'blue green\nteal brown' | grep -ozP 'g.*n'
green
# can be changed using (?s) modifier (covered later)
$ printf 'blue green\nteal brown' | grep -ozP '(?s)g.*n'
green
teal brown

Alternation precedence

# order doesn't matter, longest match wins
$ printf 'spared PARTY PaReNt' | grep -ioE 'par|pare|spare'
spare
PAR
PaRe

# left to right precedence if alternatives match from the same index
$ printf 'spared PARTY PaReNt' | grep -ioP 'par|pare|spare'
spare
PAR
PaR

# workaround is to sort alternations based on length, longest first
$ printf 'spared PARTY PaReNt' | grep -ioP 'spare|pare|par'
spare
PAR
PaRe

Quantifier precedence

# longest match wins
$ echo 'fig123312apple' | grep -oE 'g[123]+(12apple)?'
g123312apple

# precedence is left-to-right
$ echo 'fig123312apple' | grep -oP 'g[123]+(12apple)?'
g123312

{,n} quantifier

$ echo 'abc ac adc abbc xabbbcz bbb bc abbbbbc' | grep -oE 'ab{,2}c'
abc
ac
abbc

# '0' has to be explicitly mentioned as the lower limit
$ echo 'abc ac adc abbc xabbbcz bbb bc abbbbbc' | grep -oP 'ab{,2}c'
$ echo 'abc ac adc abbc xabbbcz bbb bc abbbbbc' | grep -oP 'ab{0,2}c'
abc
ac
abbc

-f and -e options

$ cat five_words.txt
sequoia
subtle
questionable
exhibit
equation

$ printf 'sub\nbit' | grep -f- five_words.txt
subtle
exhibit
$ grep -e 'sub' -e 'bit' five_words.txt
subtle
exhibit

$ printf 'sub\nbit' | grep -P -f- five_words.txt
grep: the -P option only supports a single pattern
$ grep -P -e 'sub' -e 'bit' five_words.txt
grep: the -P option only supports a single pattern
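
One possible workaround is to combine the patterns into a single alternation. The sketch below feeds the sample words inline instead of reading five_words.txt so that it is self-contained; the variable names are assumptions.

```shell
# sample words supplied inline (same content as five_words.txt shown above)
words=$(printf 'sequoia\nsubtle\nquestionable\nexhibit\nequation')

# a single alternation stands in for multiple -e or -f patterns
alt_result=$(printf '%s\n' "$words" | grep -P 'sub|bit')
printf '%s\n' "$alt_result"
```

This works only when the patterns can be safely joined with |, so it is not a general replacement for -f.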

String anchors

This restriction is about qualifying a pattern to match only at the start or end of an input string. A string can contain zero or more newline characters. This is helpful if you want to distinguish between start/end of string and start/end of line (see Modifiers section for examples).

\A restricts the match to the start of string and \z restricts the match to the end of string. There is another end of string anchor \Z which is similar to \z but if newline is the last character, then \Z allows matching just before this newline character.

# start of string
$ echo 'hi-hello;top-spot' | grep -oP '\A\w+'
hi
# end of string
# note that grep strips newline from each input line
# and adds it back for matching lines
$ echo 'hi-hello;top-spot' | grep -oP '\w+\z'
spot

# here, newline is not stripped as -z is used
# \z matches the exact end of string
# \Z matches just before newline (if present) at the end of string
$ echo 'hi-hello;top-spot' | grep -zoP '\w+\z'
$ echo 'hi-hello;top-spot' | grep -zoP '\w+\Z'
spot
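
To see the difference between start of line and start of string, here is a rough sketch with a two-line input. It assumes the (?m) modifier (covered later in the Modifiers section) and uses tr to turn the NUL separators printed by -z into newlines; the variable names are made up.

```shell
# two-line sample input; -z makes grep treat it as a single string
input=$(printf 'hi there\nhi all')

# (?m) lets ^ match at the start of every line within the string
line_starts=$(printf '%s\n' "$input" | grep -zoP '(?m)^hi \w+' | tr '\0' '\n')
printf '%s\n' "$line_starts"

# \A matches only at the very start of the string
string_start=$(printf '%s\n' "$input" | grep -zoP '\Ahi \w+' | tr '\0' '\n')
printf '%s\n' "$string_start"
```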

Escape sequences

Apart from \w, \s and their opposites, PCRE provides more such handy sequences.

\d for digits [0-9]
\h for horizontal blank characters [ \t]
\n for newline character
\D, \H and \N respectively for their opposites

# same as: grep -oE '[0-9]+'
$ echo 'Sample123string42with777numbers' | grep -oP '\d+'
123
42
777

# same as: grep -oE '[^0-9]+'
$ echo 'Sample123string42with777numbers' | grep -oP '\D+'
Sample
string
with
numbers
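
The \H sequence from the list above can be used in a similar way. Here is a small sketch with a hypothetical input that mixes a tab and spaces as separators.

```shell
# hypothetical input with a tab and multiple spaces between words
line=$(printf 'one\ttwo  three')

# \H+ extracts runs of characters other than horizontal whitespace
fields=$(printf '%s\n' "$line" | grep -oP '\H+')
printf '%s\n' "$fields"
```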

PCRE supports escape sequences like \t to represent the tab character. You can also represent a character using the format \xNN where NN are exactly two hexadecimal characters. See pcre: escape sequences for full list and other details.

$ printf 'blue green\nteal\n' | grep -z $'n\nt'
blue green
teal

$ printf 'blue green\nteal\n' | grep -zP 'n\nt'
blue green
teal

Non-greedy quantifiers

As the name implies, these quantifiers will try to match as minimally as possible. Also known as lazy or reluctant quantifiers. Appending a ? to greedy quantifiers makes them non-greedy.

# greedy
$ echo 'foot' | grep -oP 'f.?o'
foo

# non-greedy
$ echo 'foot' | grep -oP 'f.??o'
fo

Here's an example using the {m,n} format:

$ echo 'apple 314' | grep -oP '\d{2,5}'
314

$ echo 'apple 314' | grep -oP '\d{2,5}?'
31

Like greedy quantifiers, lazy quantifiers will try to satisfy the overall pattern. For example, .*? will first start with an empty match and then move forward one character at a time until a match is found.

# ':.*:' will match from the first ':' to the last ':'
$ echo 'green:3.14:teal::brown:oh!:blue' | grep -oP ':.*:'
:3.14:teal::brown:oh!:

# ':.*?:' will match from ':' to the very next ':'
$ echo 'green:3.14:teal::brown:oh!:blue' | grep -oP ':.*?:'
:3.14:
::
:oh!:
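
The same contrast shows up with other delimiters. The following sketch uses a made-up string with angle brackets; it is meant as an illustration and not as a way to parse markup.

```shell
s='<b>bold</b> and <i>italic</i>'

# greedy: from the first < to the last > in the line
greedy=$(printf '%s\n' "$s" | grep -oP '<.*>')
printf '%s\n' "$greedy"

# non-greedy: from each < to the very next >
lazy=$(printf '%s\n' "$s" | grep -oP '<.*?>')
printf '%s\n' "$lazy"
```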

Possessive quantifiers

The difference between greedy and possessive quantifiers is that possessive will not backtrack to find a match. In other words, possessive quantifiers will always consume every character that matches the pattern on which it is applied. Syntax wise, you need to append + to greedy quantifiers to make them possessive, similar to adding ? for the non-greedy case.

Unlike greedy and non-greedy quantifiers, a pattern like :.*+apple will never result in a match because .*+ will consume rest of the line, leaving no way to match apple.

# greedy quantifiers will backtrack to allow overall pattern to succeed
$ echo 'fig:mango:pineapple:guava' | grep -oP ':.*apple'
:mango:pineapple

# possessive quantifiers will never backtrack
$ echo 'fig:mango:pineapple:guava' | grep -oP ':.*+apple'

Here's a more practical example. Suppose you want to match integer numbers greater than or equal to 100 where these numbers can optionally have leading zeros.

# same as: grep -woP '0*[1-9]\d{2,}'
$ echo '0501 035 154 12 26 98234' | grep -woP '0*+\d{3,}'
0501
154
98234
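
To see why the possessive quantifier matters here, the sketch below runs the same input with a plain greedy 0* for comparison. With the greedy version, 0* can give back the leading zero so that \d{3,} matches 035 as a whole. The variable names are assumptions.

```shell
nums='0501 035 154 12 26 98234'

# greedy 0* backtracks, so 035 is matched by \d{3,} alone
greedy=$(printf '%s\n' "$nums" | grep -woP '0*\d{3,}')
printf '%s\n' "$greedy"

# possessive 0*+ keeps the leading zeros, so 035 no longer qualifies
possessive=$(printf '%s\n' "$nums" | grep -woP '0*+\d{3,}')
printf '%s\n' "$possessive"
```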

Atomic grouping

(?>pattern) is an atomic group which safeguards the pattern from further backtracking. You can think of it as a special group that is isolated from the rest of the regular expression.

Here's an example with greedy quantifier:

# 0* is greedy and the (?>) grouping prevents backtracking
$ echo '0501 035 154 12 26 98234' | grep -woP '(?>0*)\d{3,}'
0501
154
98234

Here's an example with non-greedy quantifier:

$ s='fig::mango::pineapple::guava::apples::orange'

# this matches from the first '::' to the first occurrence of '::apple'
$ echo "$s" | grep -oP '::.*?::apple'
::mango::pineapple::guava::apple

# '(?>::.*?::)' will match only from '::' to the very next '::'
# '::mango::' fails because 'apple' isn't found afterwards
# similarly '::pineapple::' fails
# '::guava::' succeeds because it is followed by 'apple'
$ echo "$s" | grep -oP '(?>::.*?::)apple'
::guava::apple

Non-capturing group

You can use non-capturing groups (?:pattern) to avoid keeping track of groups not needed for backreferencing.

# lines containing same content in the 3rd and 4th fields
# the first group is needed to apply quantifier, not backreferencing
$ printf 'a,b,c,d,e\n1,2,3,3,5' | grep -P '^([^,]+,){2}([^,]+),\2,'
1,2,3,3,5

# you can use non-capturing groups in such cases
$ printf 'a,b,c,d,e\n1,2,3,3,5' | grep -P '^(?:[^,]+,){2}([^,]+),\1,'
1,2,3,3,5

Named capture groups

Regular expressions can get cryptic and difficult to maintain, even for seasoned programmers. There are a few constructs to help add clarity. Named capture groups enable descriptive names for backreferencing instead of plain numbers. The naming can be specified in multiple ways:

(?<name>pattern) — Perl style
(?P<name>pattern) — Python style
(?'name'pattern) — not suited for CLI usage, as single quotes are usually used around the entire regular expression

Any of these can be used for backreferencing:

\k<name>
\k{name}
\g{name}
(?P=name)
\N or \g{N} numbering can also be used

# one of the combinations to use named capture groups
$ echo '1,2,3,3,5' | grep -P '^(?:[^,]+,){2}(?<col3>[^,]+),\k<col3>,'
1,2,3,3,5

# here's another combination
$ echo '1,2,3,3,5' | grep -P '^(?:[^,]+,){2}(?P<col3>[^,]+),(?P=col3),'
1,2,3,3,5

Negative backreferences

Another useful approach when there are numerous capture groups is to use negative backreferences. The negative numbering starts with -1 to refer to the capture group closest to the backreference that was defined before the backreference. In other words, the highest numbered capture group prior to the backreference will be -1, the second highest will be -2 and so on.

# \g{-1} here is same as using \2
$ echo '1,2,3,3,5' | grep -P '^([^,]+,){2}([^,]+),\g{-1},'
1,2,3,3,5

# {} is optional if there is no ambiguity
$ echo '1,2,3,3,5' | grep -P '^([^,]+,){2}([^,]+),\g-1,'
1,2,3,3,5
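
Negative numbering beyond -1 works the same way. In this sketch with made-up rows, \g{-2} refers to the first capture group and \g{-1} to the second, since both groups are defined before the backreferences.

```shell
rows=$(printf 'a,b,a,b\na,b,b,a')

# 3rd field same as the 1st and 4th field same as the 2nd
pairs=$(printf '%s\n' "$rows" | grep -P '^([^,]+),([^,]+),\g{-2},\g{-1}$')
printf '%s\n' "$pairs"
```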

Subexpression calls

If backreferences are like variables, then subexpression calls are like functions. Backreferences allow you to reuse the portion matched by the capture group. Subexpression calls allow you to reuse the pattern that was used inside the capture group. You can call subexpressions recursively too; see the Recursive matching section for examples.

The syntax is (?N) to refer to that particular capture group by number (relative numbering is allowed as well). Named capture groups can be called in various ways as (?&name) or (?P>name) or \g<name> or \g'name'.

$ row='today,2008-03-24,food,2012-08-12,nice,5632'

# subexpression call using the capture group number
$ echo "$row" | grep -oP '(\d{4}-\d{2}-\d{2}).*(?1)'
2008-03-24,food,2012-08-12

# named capture group
$ echo "$row" | grep -oP '(?<date>\d{4}-\d{2}-\d{2}).*(?&date)'
2008-03-24,food,2012-08-12
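
The sketch below puts a subexpression call and a backreference side by side on the same input. The backreference needs the exact text of the first date to appear again, which isn't the case here, so it finds nothing. The || true part only keeps the script going when grep reports no match.

```shell
row='today,2008-03-24,food,2012-08-12,nice,5632'

# (?1) reuses the pattern, so any second date qualifies
call=$(printf '%s\n' "$row" | grep -oP '(\d{4}-\d{2}-\d{2}).*(?1)')
printf '%s\n' "$call"

# \1 reuses the matched text, so the same date has to repeat
backref=$(printf '%s\n' "$row" | grep -oP '(\d{4}-\d{2}-\d{2}).*\1' || true)
printf '[%s]\n' "$backref"
```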

Lookarounds

Lookarounds help to create custom anchors and add conditions to a pattern. These assertions are also known as zero-width patterns because they add restrictions similar to anchors and are not part of matched portions (especially helpful with the -o option). These can also be used to negate a grouping similar to negated character sets.

Lookaround assertions can be added to a pattern in two ways — lookbehind and lookahead. Syntax wise, these two ways are differentiated by adding a < for the lookbehind version. The assertion can be negative (!) or positive (=).

Syntax	Lookaround type
`(?!pattern)`	Negative lookahead
`(?<!pattern)`	Negative lookbehind
`(?=pattern)`	Positive lookahead
`(?<=pattern)`	Positive lookbehind

Here are some examples for negative lookarounds:

# extract whole words only if not preceded by : or -
# note that the start of the string satisfies the given assertion
$ echo 'fig:cart<apple-rest;tea' | grep -oP '(?<![:-])\b\w+'
fig
apple
tea

# match 'cat' only if it is not followed by a digit character
$ printf 'hey cats!\ncat42\ncat_5\ncatcat' | grep -P 'cat(?!\d)'
hey cats!
cat_5
catcat

# extract whole words only if NOT preceded by : or -
# and not followed by - or end of line
$ echo 'fig:cart<apple-rest;tea' | grep -woP '(?<![:-])\w+(?!-|$)'
fig

And here are some examples for positive lookarounds:

# extract digits only if it is followed by ,
# note that the end of string doesn't qualify
$ echo '42 apple-5, fig3; x-83, y-20: f12' | grep -oP '\d+(?=,)'
5
83

# extract digits only if they are preceded by a lowercase letter
$ echo '42 apple-5, fig3; x-83, y-20: f12' | grep -oP '(?<=[a-z])\d+'
3
12

# extract words containing 'par'
# as long as 'part' occurs as a whole word later in the line
$ echo 'par spare part party' | grep -oP '\b\w*par\w*\b(?=.*\bpart\b)'
par
spare

# extract digits only if it is preceded by - and not followed by ,
# possessive quantifier here prevents digits from being part of the assertion
$ echo '42 apple-5, fig3; x-83, y-20: f12' | grep -oP '(?<=-)\d++(?!,)'
20

In all the examples so far, lookahead grouping was placed as a suffix and lookbehind as a prefix. This is how they are used most of the time, but not the only way to use them. Lookarounds can be placed anywhere and multiple lookarounds can be combined in any order. They do not consume characters nor do they play a role in matched portions. They just let you know whether the condition you want to test is satisfied from the current location in the input string.

# extract whole words that don't end with 'r' or 't'
$ echo 'par spare part party' | grep -oP '\b\w++(?<![rt])'
spare
party
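
Similarly, a lookahead can be placed before the portion to be matched. Here's a sketch on the same sample string that filters based on how a word starts.

```shell
# extract whole words that don't start with 'par'
# (?!par) is checked at the word boundary, before \w+ consumes anything
no_par=$(echo 'par spare part party' | grep -oP '\b(?!par)\w+')
printf '%s\n' "$no_par"
```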

Conditional AND with lookarounds

As promised earlier, here are some examples that show how lookarounds make it simpler to construct AND conditionals.

# words containing 'b' and 'e' and 't' in any order
# same as: 'b.*e.*t|b.*t.*e|e.*b.*t|e.*t.*b|t.*b.*e|t.*e.*b'
# or: grep 'b' five_words.txt | grep 'e' | grep 't'
$ grep -P '(?=.*b)(?=.*e).*t' five_words.txt
subtle
questionable
exhibit

# words containing all lowercase vowels in any order
$ grep -P '(?=.*a)(?=.*e)(?=.*i)(?=.*o).*u' five_words.txt
sequoia
questionable
equation

# words containing ('ab' or 'at') and 'q' but not 'n' at the end
$ grep -P '(?!.*n$)(?=.*a[bt]).*q' five_words.txt
questionable

Variable length lookbehind

With lookbehind (both positive and negative), the pattern used for the assertion cannot imply matching variable length of text. Using fixed length quantifier or alternations of different lengths (but each alternative being fixed length) is allowed. Here are some examples to clarify these points:

$ s='pore42 tar3 dare7 care5'

# allowed
$ echo "$s" | grep -oP '(?<=(?:po|da)re)\d+'
42
7
$ echo "$s" | grep -oP '(?<=\b[a-z]{4})\d+'
42
7
5
$ echo "$s" | grep -oP '(?<=tar|dare)\d+'
3
7

# not allowed
$ echo "$s" | grep -oP '(?<=\b[a-z]+)\d+'
grep: lookbehind assertion is not fixed length
$ echo "$s" | grep -oP '(?<=\b[a-z]{1,3})\d+'
grep: lookbehind assertion is not fixed length
$ echo 'cat scatter cater scat' | grep -oP '(?<=(cat.*?){2})cat[a-z]*'
grep: lookbehind assertion is not fixed length

Set start of matching portion with \K

Some of the positive lookbehind cases can be solved by adding \K as a suffix to the pattern to be asserted. The text consumed until \K won't be part of the matching portion. In other words, \K determines the starting point. The pattern before \K can be variable length too.

# extract digits that follow =
# same as: grep -oP '(?<==)\d+'
$ echo 'apple=42, fig=314' | grep -oP '=\K\d+'
42
314

$ s='cat scatter cater scat concatenate catastrophic catapult duplicate'
# extract 3rd occurrence of 'cat' followed by optional lowercase letters
$ echo "$s" | grep -oP '^(.*?cat.*?){2}\Kcat[a-z]*'
cater
# extract occurrences at multiples of 3
$ echo "$s" | grep -oP '(.*?cat.*?){2}\Kcat[a-z]*'
cater
catastrophic

# extract digits only if preceded by 1 to 3 lowercase letters at word boundary
$ echo 'or42 pare7 cat3 cared5' | grep -oP '\b[a-z]{1,3}\K\d+'
42
3
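
Here's one more sketch, assuming a made-up key=value style input where the amount of whitespace around = varies. A lookbehind wouldn't be allowed for this case since \s* is variable length.

```shell
cfg='name =   fig, color=red'

# everything up to and including the optional whitespace after = is discarded
values=$(printf '%s\n' "$cfg" | grep -oP '\w+\s*=\s*\K\w+')
printf '%s\n' "$values"
```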

Negated groups

Some of the variable length negative lookbehind cases can be simulated by using a negative lookahead (which doesn't have restriction on variable length). The trick is to assert negative lookahead one character at a time and apply quantifiers on such a grouping to satisfy the variable requirement. This will only work if you have well defined conditions before the negated group.

$ s='fox,cat,dog,parrot'

# match 'dog' only if it is not preceded by 'cat' anywhere before
# note the use of anchor to force matching all characters up to 'dog'
$ echo "$s" | grep -qP '^((?!cat).)*dog' || echo 'no match'
no match

# match 'dog' only if it is not preceded by 'parrot' anywhere before
$ echo "$s" | grep -qP '^((?!parrot).)*dog' && echo 'match found'
match found

# match if 'go' is not there between 'at' and 'par'
$ echo "$s" | grep -qP 'at((?!go).)*par' && echo 'match found'
match found

You can extract the matched portion to understand negated grouping better:

$ s='fox,cat,dog,parrot'

$ echo "$s" | grep -oP '^((?!cat).)*'
fox,
$ echo "$s" | grep -oP '^((?!parrot).)*'
fox,cat,dog,
$ echo "$s" | grep -oP '^((?!(.)\2).)*'
fox,cat,dog,pa
$ echo "$s" | grep -oP '^((?!lion).)*'
fox,cat,dog,parrot

Conditional groups

This special grouping allows you to add a condition that depends on whether a capture group succeeded in matching. You can also add an optional else condition. The main advantage of conditional groups is that they prevent pattern duplication. The syntax as per the docs is shown below.

(?(condition)yes-pattern)
(?(condition)yes-pattern|no-pattern)

Here's an example. The task is to match whole lines containing word characters surrounded by [] or containing word characters separated by a hyphen.

$ cat conditional.txt
[hi]
good-bye
bad
[42]
-oh
i-j
[-]
[oh-no]
[apple banana]
1-2-3

# ?(1) condition refers to the first capture group succeeding
# in this example, ?(1) checks if '[' was matched
$ grep -xP '(\[)?\w+(?(1)]|-\w+)' conditional.txt
[hi]
good-bye
[42]
i-j

The above command is equivalent to grep -xP '\[\w+]|\w+-\w+', which seems simpler than the conditional group syntax. But if the first \w+ were a complicated pattern, a conditional group would be better suited.
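
As a quick check of that equivalence, the sketch below recreates the contents of conditional.txt inline and compares the output of both patterns. The variable names are assumptions.

```shell
# same lines as conditional.txt, supplied inline
lines=$(printf '[hi]\ngood-bye\nbad\n[42]\n-oh\ni-j\n[-]\n[oh-no]\n[apple banana]\n1-2-3')

# conditional group version
cond=$(printf '%s\n' "$lines" | grep -xP '(\[)?\w+(?(1)]|-\w+)')

# plain alternation version
alt=$(printf '%s\n' "$lines" | grep -xP '\[\w+]|\w+-\w+')

printf '%s\n' "$cond"
[ "$cond" = "$alt" ] && echo 'same output'
```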

Modifiers

Modifiers are like CLI options to change the default behavior of a pattern. The -i option is an example for a modifier. However, unlike -i, these modifiers can be applied selectively to a portion of the pattern. In regular expression parlance, modifiers are also known as flags.

Modifier	Description
`i`	case sensitivity
`m`	multiline for line anchors
`s`	matching newline with `.` metacharacter
`x`	readable pattern with whitespace and comments

To apply modifiers selectively, specify them inside a special grouping syntax. This will override the modifiers applied to the entire pattern, if any. The syntax variations are:

(?modifiers:pattern) will apply modifiers only for this portion
(?-modifiers:pattern) will negate modifiers only for this portion
(?modifiers-modifiers:pattern) will apply and negate particular modifiers only for this portion
(?modifiers) when pattern is not given, modifiers (including negation) will be applied from this point onwards

In these ways, modifiers can be specified precisely only where they are needed. This is especially useful for constructing patterns programmatically. Here are some examples:

# same as: grep -i 'cat'
$ printf 'Cat\ncOnCaT\nscatter\ncut' | grep -P '(?i)cat'
Cat
cOnCaT
scatter

# override -i option
$ printf 'Cat\ncOnCaT\nscatter\ncut' | grep -iP '(?-i)cat'
scatter

# same as: grep -ioP '(?-i:Cat)[a-z]*\b' or grep -oP 'Cat(?i)[a-z]*\b'
$ echo 'Cat SCatTeR CATER cAts' | grep -oP 'Cat(?i:[a-z]*)\b'
Cat
CatTeR

# allow . metacharacter to match newline character as well
$ printf 'Hi there\nHave a Nice Day' | grep -zoP '(?s)the.*ice'
there
Have a Nice

Here's an example with multiple modifiers used together:

# whole word 'python3' in 1st line and a line starting with 'import'
# note the use of string anchor \A to match only the start of file
# \N is used instead of . to match non-newline characters as 's' flag is active
$ grep -zlP '(?ms)\A\N*\bpython3\b.*^import' five_words.txt script
script

The x modifier allows you to use literal unescaped whitespace for readability purposes and add comments after an unescaped # character. This modifier has limited usage for CLI applications as multiline patterns cannot be specified.

# same as: grep -oP '^((?!(.)\2).)*'
$ echo 'fox,cat,dog,parrot' | grep -oP '(?x) ^( (?! (.)\2 ) . )*'
fox,cat,dog,pa

$ echo 'fox,cat,dog,parrot' | grep -oP '(?x) (,[^,]+){2}$ #last 2 columns'
,dog,parrot

Comments can also be added using the (?#comment) special group:

$ echo 'fox,cat,dog,parrot' | grep -oP '(,[^,]+){2}$(?#last 2 columns)'
,dog,parrot

You'll have to escape whitespace or use them inside character classes to match them literally when (?x) is active:

$ echo 'a cat and a dog' | grep -P '(?x)t a'
$ echo 'a cat and a dog' | grep -P '(?x)t\ a'
a cat and a dog
$ echo 'a cat and a dog' | grep -P '(?x)t[ ]a'
a cat and a dog

$ echo 'food a#b 123' | grep -oP '(?x)a#.'
a
$ echo 'food a#b 123' | grep -oP '(?x)a\#.'
a#b

\Q and \E

A pattern surrounded by \Q and \E will be matched literally, just like how the -F option behaves. If \E is not specified, the effect will be applicable until the end of the pattern. These escapes can be used inside a character class too, but you'll get a syntax error if \Q alone is used.

# same as: grep -F 'a[5]'
$ echo 'int a[5]' | grep -P '\Qa[5]'
int a[5]

# same as: grep -oP '[a\\\-b]*'
$ echo '5b-a\b-abc2' | grep -oP '[\Q\-\Eab]*'
b-a\b-ab

Here's an example with shell variables:

$ expr='(a^b)'

$ echo '\S*\Q'"$expr"'\E\S*'
\S*\Q(a^b)\E\S*

$ echo 'f*(2-a/b) - 3*(a^b)-42' | grep -oP '\S*\Q'"$expr"'\E\S*'
3*(a^b)-42

When you are working with external data (such as shell arguments), the data itself might have \Q and \E and might thus lead to conflicting behavior.
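
One way to guard against this, sketched below, is to rewrite every \E in the data as \E\\E\Q before building the pattern (end the quoting, match a literal backslash followed by E, then resume quoting). This is an assumption-laden illustration and not something the -P option does for you; the variable names and the use of sed are made up for this example.

```shell
# data containing \E, which would end the \Q quoting early if used as-is
data='a\Eb+'

# assumption: sed is available; each \E in the data becomes \E\\E\Q
safe=$(printf '%s' "$data" | sed 's/\\E/\\E\\\\E\\Q/g')

# show the final pattern that will be passed to grep
printf '%s\n' '\Q'"$safe"'\E'

# only the portion identical to the data is extracted
literal=$(printf '%s\n' 'ab a\Eb+' | grep -oP '\Q'"$safe"'\E')
printf '%s\n' "$literal"
```

If the data is entirely literal, using the -F option instead avoids the issue altogether.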

\G anchor

The \G anchor matches the start of the input string, just like the \A anchor. In addition, it will also match at the end of the previous match. This helps you to mark a particular location in the input string and continue from there instead of having the pattern always check for the specific location. This is best understood with examples.

# all digits and optional hyphen combo from the start of string
$ echo '123-87-593 42 apple-12-345' | grep -oP '\G\d+-?'
123-
87-
593

In the above example, \G will first match the start of the string. So, the first four characters 123- will be matched since they satisfy the \d+-? pattern. The ending of this matched portion (fourth character) will now be considered as the new anchor for \G. The next three characters 87- will then match and \G assertion is satisfied due to the previous match. Same for 593. When the next character is considered, \G assertion is still satisfied but \d+-? fails due to the space character. Because the matching failed, \G will not be satisfied when the next digit sequence 42 is considered. So, no more characters can match since this particular example doesn't provide an alternate way for \G to be reactivated.

Here's another example of using \G without alternations:

# all word characters from the start of string
# only if it is followed by a word character
$ echo 'at_2 bat_100 kite_42' | grep -oP '\G\w(?=\w)'
a
t
_

Next, using \G as part of alternations so that it can be activated anywhere in the input string. Suppose you need to extract one or more numbers that follow a particular name. Here's one way to solve it:

$ marks='Joe 75 88 Mina 89 85 84 John 90'

$ echo "$marks" | grep -oP '(?:Mina|\G) \K\d+'
89
85
84

$ echo "$marks" | grep -oP '(?:John|\G) \K\d+'
90

\G matches the start of the string but the input string doesn't start with a space character. So the regular expression can be satisfied only after the other alternative is matched. Consider the first pattern where Mina is the other alternative. Once that string is found, a space and digit characters will satisfy the rest of the pattern. Ending of the match, i.e. Mina 89 in this case, will now be the \G anchoring position. This will allow 85 and 84 to be matched subsequently. After that, J fails the \d pattern and no more matches are possible (as Mina isn't found another time).

In some cases, \G anchoring at the start of the string will cause issues. One workaround is to add a negative lookaround assertion. Here's an example. The goal is to extract non-whitespace characters after : only for the given name.

$ p='Jo:x2 Mina:56 Rohit:abcdef'

# issue due to \G matching at the start of the string
# the first space separated field is also getting extracted
$ echo "$p" | grep -oP '(?:Mina:\K|\G)\S'
J
o
:
x
2
5
6

# adding a negative assertion helps
$ echo "$p" | grep -oP '(?:Mina:\K|\G(?!\A))\S'
5
6
$ echo "$p" | grep -oP '(?:Jo:\K|\G(?!\A))\S'
x
2
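
As a smaller illustration of how \G chains matches together, this sketch extracts only the zeros at the start of a made-up string. Once a non-zero character is reached, \G can no longer be satisfied, so the later zeros are ignored.

```shell
# each match has to begin exactly where the previous one ended
leading=$(echo '000120300' | grep -oP '\G0')
printf '%s\n' "$leading"
```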

Skipping matches

Sometimes, you want to work with all matches except particular portions. Usually, there are common characteristics between the two types of matches that make it hard to define a pattern only for the required matches. For example, extracting field values unless it is a particular name, or leaving double quoted values untouched. To use the skipping feature, define the matches to be ignored suffixed by (*SKIP)(*FAIL) and then put the required matches as part of an alternation list. (*F) can also be used instead of (*FAIL).

# all whole words except 'imp' or 'ant'
$ words='tiger imp eagle ant important imp2 Cat'
$ echo "$words" | grep -oP '\b(?:imp|ant)\b(*SKIP)(*F)|\w+'
tiger
eagle
important
imp2
Cat

# all words except those surrounded by double quotes
# do you think grep -oP '(?<!")\w++(?!")' will work the same for all cases?
$ echo 'I like2 "mango" and "guava"' | grep -oP '"[^"]+"(*SKIP)(*F)|\w+'
I
like2
and
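
The same idea applies to other kinds of delimiters. In this sketch with a made-up input, anything inside a non-nested pair of parentheses is skipped and only the remaining numbers are extracted.

```shell
s='12 (34) 56 (7 8) 90'

# parenthesized portions are matched first and then discarded
outside=$(printf '%s\n' "$s" | grep -oP '\([^()]*\)(*SKIP)(*F)|\d+')
printf '%s\n' "$outside"
```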

See also rexegg: The Greatest Regex Trick Ever and rexegg: Backtracking Control Verbs

Recursive matching

The subexpression call special group was introduced as analogous to function calls. And similar to functions, it does support recursion. This is useful for matching nested patterns, a task that is usually not recommended for regular expressions. Indeed, you should use a proper parser tool or library for file formats like html, xml, json, csv, etc. But for some cases, a parser might not be available and using regular expressions might be simpler than writing one from scratch.

First up, a pattern to match a set of parentheses that is not nested (termed as level-one for reference).

$ eqn0='a + (b * c) - (d / e)'
$ eqn1='((f+x)^y-42)*((3-g)^z+2)'

# literal ( followed by non () characters followed by literal )
# use *+ instead of ++ if you want to match empty pairs as well
$ echo "$eqn0" | grep -oP '\([^()]++\)'
(b * c)
(d / e)
$ echo "$eqn1" | grep -oP '\([^()]++\)'
(f+x)
(3-g)

Next, matching a set of parentheses which may optionally contain any number of non-nested sets of parentheses (termed as level-two for reference). Breaking down the pattern, you can see ( and ) have to be matched literally. Inside that, valid string is made up of either non-parentheses characters or a non-nested parentheses sequence — i.e. level-one.

# x modifier used for readability
$ echo "$eqn1" | grep -oP '(?x) \( (?: [^()]++ | \([^()]++\) )++ \)'
((f+x)^y-42)
((3-g)^z+2)

$ eqn2='a + (b) + ((c)) + (((d)))'
$ echo "$eqn2" | grep -oP '(?x) \( (?: [^()]++ | \([^()]++\) )++ \)'
(b)
((c))
((d))

To recursively match any number of nested sets of parentheses, use a capture group and call it within the capture group itself. Since the entire pattern needs to be called here, you can use the default zeroth capture group. Comparing with level-two, the only change is that the subexpression call (?0) is used instead of the level-one pattern in the second alternation.

# (?R) can also be used instead of (?0)
$ echo "$eqn0" | grep -oP '(?x) \( (?: [^()]++ | (?0) )++ \)'
(b * c)
(d / e)
$ echo "$eqn1" | grep -oP '(?x) \( (?: [^()]++ | (?0) )++ \)'
((f+x)^y-42)
((3-g)^z+2)
$ echo "$eqn2" | grep -oP '(?x) \( (?: [^()]++ | (?0) )++ \)'
(b)
((c))
(((d)))

$ eqn3='(3+a) * ((r-2)*(t+2)/6) + 42 * (a(b(c(d(e)))))'
$ echo "$eqn3" | grep -oP '(?x) \( (?: [^()]++ | (?0) )++ \)'
(3+a)
((r-2)*(t+2)/6)
(a(b(c(d(e)))))
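
The recursive pattern can be adapted for other paired delimiters by swapping the literal characters. Here's a sketch for curly braces with a made-up input, written without the x modifier.

```shell
s='a {b {c} d} e {f}'

# same structure as the parentheses version, with { and } as the delimiters
nested=$(printf '%s\n' "$s" | grep -oP '\{(?:[^{}]++|(?0))++\}')
printf '%s\n' "$nested"
```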

Unicode

Similar to named character classes and escape sequences, the \p{} construct offers various predefined sets to work with Unicode strings.

# assumes that the current locale supports unicode
# extract all consecutive letters
$ echo 'fox:αλεπού,eagle:αετός' | grep -oP '\p{L}+'
fox
αλεπού
eagle
αετός

# extract all consecutive Greek letters
$ echo 'fox:αλεπού,eagle:αετός' | grep -oP '\p{Greek}+'
αλεπού
αετός

# extract all words
$ echo 'φοο12,βτ_4,bat' | grep -oP '\p{Xwd}+'
φοο12
βτ_4
bat

# extract all characters other than letters
# \p{^L} can also be used instead of \P{L}
$ echo 'φοο12,βτ_4,bat' | grep -oP '\P{L}+'
12,
_4,

Characters can be specified using octal \o and hexadecimal \x formats as well.

# \x{20} and \o{40} can be used instead of literal space character
$ echo 'a cat and a dog' | grep -P 't\x20a'
a cat and a dog

# {} are optional if only two hexadecimal characters are needed
$ echo 'fox:αλεπού,eagle:αετός' | grep -oP '[\x61-\x7a]+'
fox
eagle

$ echo 'fox:αλεπού,eagle:αετός' | grep -oP '[\x{3b1}-\x{3bb}]+'
αλε
αε

See pcre manual under topic Unicode character properties and regular-expressions: Unicode for more details.

Summary

PCRE is one of the most feature-rich regular expression libraries. Apart from use in command line tools like GNU grep, pcregrep and ripgrep, it is also used in programming languages such as Nim. There are many more complex constructs that have not been presented here. However, I feel I've covered most of the features that might come up for command line usage with grep.

Exercises

The exercises directory has all the files used in this section.

1) From the sample.txt input file, extract from the start of a line to the first occurrence of he.

##### add your solution here
Hi the
He he

2) For the input file terms.txt, display lines that do not contain a digit character.

##### add your solution here
are
not
go

3) From the pcre.txt input file, extract consecutive repeated occurrences of abc followed by a provided that the final a isn't part of abc. For example, abcabcadef should give abcabca as the output and abcabcabcd shouldn't match.

##### add your solution here
abcabcabca

4) What's the syntax for a non-capturing group? Name a use case for such a grouping.

5) What is negative backreferencing?

6) What's the difference between backreference and subexpression calls?

7) From the pcre.txt input file, extract from S: followed by a digit character to the very next occurrence of E: followed by two or more digits. For example, S:12 E:5 fig S:4 and E:123 should give S:4 and E:123 as the output and S:1 - E:2 shouldn't match.

##### add your solution here
S:4 and E:123
S:42 E:43
S:100 & E:10

8) From the sample.txt input file, extract all sequences made up of lowercase letters except those that start with a or h or i or t. Such sequences should not be surrounded by other word characters.

##### add your solution here
you
do
banana
papaya
mango
nothing

9) From the sample.txt input file, extract all sequences made up of lowercase letters except those that end with letters from g to z. Such sequences should not be surrounded by other word characters.

##### add your solution here
there
are
banana
papaya
he
he

10) From the pcre.txt input file, extract integer portion of floating-point numbers. Integers and numbers ending with . and no further digits should not be considered. For example, output for ab32.4 should be 32 and numbers like 2. and 456 should not be matched.

$ grep -oP '\d+\.\d+' pcre.txt
32.4
46.42

##### add your solution here
32
46

11) For the input file pcre.txt, filter lines that satisfy all of these rules:

at least 2 alphabetic characters
at least 3 digits
at least 1 special character among % or * or # or $
should not contain Yz or if

##### add your solution here
F2H3u#9
A $ C1234

12) From the pcre.txt input file, extract from the second field to the second last field from rows having at least two columns considering ; as the delimiter. For example, b;c should be extracted from a;b;c;d and a line containing less than two ; characters shouldn't produce any output.

##### add your solution here
in;awe;b2b;3list
be;he;0;a;b

13) For the input file pcre.txt, match lines that contain qty followed by price, but not if there is any whitespace character or the string error between them.

##### add your solution here
23,qty,price,42
(qtyprice) (hi-there)
42\nqty-6,apple-56,price-234,error

14) From the pcre.txt input file, extract if followed by content within any number of nested parentheses.

##### add your solution here
if(3-(k*3+4)/12-(r+2/3))
if(a(b)c(d(e(f)1)2)3)

15) What does the \G anchor do?

16) From the patterns.txt input file, extract from car at the start of a line to the very next occurrence of book or lie in the file. Perform additional transformation to convert ASCII NUL characters, if any, to the newline character.

##### add your solution here
care
4*5]
a huge discarded pile of book
car
eden
rested replie

17) For the input file patterns.txt, match lines having the content present in the p shell variable literally at the end of lines. For example, if p='*[5]', then (9-2)*[5] would be a valid match, but not [4]*[5]+[6].

$ p='*[5]'
##### add your solution here
(9-2)*[5]

$ p='*4)'
##### add your solution here
12- (e+(j/k-3)*4)

$ p='42'
##### add your solution here
Hi42Bye nice1423 bad42

18) From the patterns.txt input file, extract all whole words if a line also contains car. But, any word occupying the first five characters in the line shouldn't be part of the output. For example, no scar shouldn't produce any output since both words have all/some characters within the first five characters in the line. part cart mart should produce cart and mart as output. two sets tests would fail the car condition, and thus shouldn't produce any output.

$ grep 'car' patterns.txt
scar
par car tar far Cart
care
a huge discarded pile of books
scare
car
part cart mart

##### add your solution here
tar
far
Cart
discarded
pile
of
books
cart
mart

19) What do the following unicode character sets match?

\p{L}
\P{L}
\p{Greek}
\p{Xwd}
\p{P}

20) What do the following escape sequences do?

\A
\z
\Z

CLI text processing with GNU grep and ripgrep