Gotchas and Tricks

Shell quoting

Always use single quotes for the search string/pattern, unless other forms of shell expansion are needed and you know what you are doing.

$ # spaces separate command arguments
$ echo 'a cat and a dog' | grep and a
grep: a: No such file or directory
$ echo 'a cat and a dog' | grep 'and a'
a cat and a dog

$ # use of # indicates start of comment
$ printf 'foo\na#2\nb#3\n' | grep #2
Usage: grep [OPTION]... PATTERNS [FILE]...
Try 'grep --help' for more information.
$ printf 'foo\na#2\nb#3\n' | grep '#2'
a#2

$ # assumes 'bre_ere' as CWD
$ ls *.txt
word_anchors.txt  words.txt
$ # will get expanded to: grep -F word_anchors.txt words.txt
$ echo '*.txt' | grep -F *.txt
$ echo '*.txt' | grep -F '*.txt'
*.txt
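To see what the shell actually hands over to a command, here is a minimal sketch that uses a scratch directory instead of the 'bre_ere' directory. The directory and the filenames a.txt and b.txt are made up for this illustration.

```shell
# scratch directory with two hypothetical files
tmp=$(mktemp -d)
touch "$tmp/a.txt" "$tmp/b.txt"

# printf repeats its format for every argument it receives,
# so each [...] pair below is one argument as seen by the command
unquoted=$(cd "$tmp" && printf '[%s]' *.txt)
quoted=$(cd "$tmp" && printf '[%s]' '*.txt')

printf 'unquoted: %s\nquoted:   %s\n' "$unquoted" "$quoted"

# clean up
rm -r "$tmp"
```

The unquoted form reaches the command as two filenames, while the quoted form arrives as the single literal string *.txt.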

When double quotes are needed, try to use them only for the portion that requires them. See mywiki.wooledge Quotes for a detailed discussion of the various quoting and expansion mechanisms in the bash shell.

$ expr='(a^b)'
$ # in bash, strings placed next to each other will be concatenated
$ echo '\S*\Q'"$expr"'\E\S*'
\S*\Q(a^b)\E\S*

$ echo 'f*(2-a/b) - 3*(a^b)-42' | grep -oP '\S*\Q'"$expr"'\E\S*'
3*(a^b)-42
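The same idea works without PCRE. The sketch below is only an illustration: the variable name ext and the sample filenames are hypothetical, and the variable is assumed to contain no regular expression metacharacters.

```shell
# 'ext' is a hypothetical variable used only for this illustration
ext='txt'

# single quotes keep \. and the trailing $ away from the shell
# double quotes are used only around the variable so that it gets expanded
matches=$(printf 'notes.txt\nnotes.txt.bak\ntxt\n' | grep '\.'"$ext"'$')

printf '%s\n' "$matches"
```

Only notes.txt is printed, since the other two lines either do not end with .txt or lack the dot.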

Patterns starting with hyphen

Patterns cannot start with - as it will be treated as a command line option. Either escape it or use -- as an option before the pattern to indicate that no more options will follow (handy if the pattern is programmatically constructed). This problem and its solution are not unique to the grep command.

$ # command assumes - is start of an option, hence the errors
$ printf '-2+3=1\n'
bash: printf: -2: invalid option
printf: usage: printf [-v var] format [arguments]
$ echo '5*3-2=13' | grep '-2'
Usage: grep [OPTION]... PATTERNS [FILE]...
Try 'grep --help' for more information.

$ # escape it (won't work if -F option is also needed)
$ echo '5*3-2=13' | grep '\-2'
5*3-2=13

$ # or use --
$ echo '5*3-2=13' | grep -- '-2'
5*3-2=13
$ printf -- '-2+3=1\n'
-2+3=1
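For a programmatically constructed pattern, a minimal sketch could look like the one below. The variable pat is hypothetical. The -e option, which explicitly marks the next argument as a pattern, is an addition not used in the examples above. Both forms also work with -F, where escaping is not possible.

```shell
# hypothetical pattern, for example one built by a script
pat='-2'

# -- marks the end of options, so "$pat" is taken as the pattern
r1=$(echo '5*3-2=13' | grep -F -- "$pat")

# -e says that the next argument is a pattern, even if it starts with -
r2=$(echo '5*3-2=13' | grep -F -e "$pat")

printf '%s\n%s\n' "$r1" "$r2"
```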

On a related note, GNU grep accepts options even after the pattern and filename arguments. This is useful if you forgot some option(s) and want to edit the previous command from history.

$ printf 'boat\nsite\nfoot' | grep '[aeo]+t'
$ printf 'boat\nsite\nfoot' | grep '[aeo]+t' -E
boat
foot

Word boundary differences

The -w option is not exactly the same as using word boundaries in regular expressions. The \b anchor by definition requires word characters to be present, but this is not the case with -w as described in the manual:

-w, --word-regexp Select only those lines containing matches that form whole words. The test is that the matching substring must either be at the beginning of the line, or preceded by a non-word constituent character. Similarly, it must be either at the end of the line or followed by a non-word constituent character. Word-constituent characters are letters, digits, and the underscore. This option has no effect if -x is also specified.

$ # no output because there are no word characters
$ echo '*$' | grep '\b\$\b'
$ # matches because $ is preceded by non-word character
$ # and followed by end of line
$ echo '*$' | grep -w '\$'
*$

Consider I have 12, he has 2! as sample text, shown below with vertical bars marking the word boundaries. The last character ! has no end of word boundary after it, as it is not a word character. This should help clarify the differences between \b, -w and \<\>.

|I| |have| |12|, |he| |has| |2|!

$ # \b matches both start and end of word boundaries
$ # 1st and 3rd line have space as second character
$ echo 'I have 12, he has 2!' | grep -o '\b..\b'
I 
12
, 
he
 2

$ # \< and \> strictly match only start and end word boundaries respectively
$ echo 'I have 12, he has 2!' | grep -o '\<..\>'
12
he

$ # -w ensures there are no word characters around the matching text
$ # same as: grep -oP '(?<!\w)..(?!\w)'
$ echo 'I have 12, he has 2!' | grep -ow '..'
12
he
2!
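As a compact recap, the sketch below counts the matches produced by each of the three approaches on the same sample text. It assumes GNU grep, since \b, \< and \> are GNU extensions. The variable names are made up for this illustration.

```shell
text='I have 12, he has 2!'

# number of two-character matches for each approach
# tr removes the padding that some wc implementations add
n_b=$(printf '%s\n' "$text" | grep -o '\b..\b' | wc -l | tr -d ' ')
n_angle=$(printf '%s\n' "$text" | grep -o '\<..\>' | wc -l | tr -d ' ')
n_w=$(printf '%s\n' "$text" | grep -ow '..' | wc -l | tr -d ' ')

printf '\\b: %s, \\<\\>: %s, -w: %s\n' "$n_b" "$n_angle" "$n_w"
```

The three counts come out different (5, 2 and 3 respectively), matching the outputs shown above.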

Faster execution for ASCII input

Changing the locale to ASCII (assuming the default is not an ASCII locale) can give a significant speed boost.

$ # assumes 'bre_ere' as CWD
$ # time shown is best result from multiple runs
$ # speed benefit will vary depending on computing resources, input, etc
$ time grep -xE '([a-d][r-z]){3}' words.txt > f1
real    0m0.032s

$ # LC_ALL=C will give ASCII locale, active only for this command
$ time LC_ALL=C grep -xE '([a-d][r-z]){3}' words.txt > f2
real    0m0.007s

$ # check that results are same for both versions of the command
$ diff -s f1 f2
Files f1 and f2 are identical

Here's another example.

$ time grep -xE '([a-z]..)\1' words.txt > f1
real    0m0.126s
$ time LC_ALL=C grep -xE '([a-z]..)\1' words.txt > f2
real    0m0.074s

$ # clean up temporary files
$ rm f[12]
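Before relying on LC_ALL=C in a script, it helps to confirm that the results do not change for your input. Here is a self-contained sketch of that check. It uses a small made-up sample file instead of words.txt, and the strings in it are arbitrary ASCII text chosen to exercise the pattern.

```shell
tmp=$(mktemp -d)
# hypothetical ASCII-only sample input
printf 'asbtcu\nabcdef\nasbtcuv\ndzczbz\n' > "$tmp/sample.txt"

# same command, with and without the ASCII locale
grep -xE '([a-d][r-z]){3}' "$tmp/sample.txt" > "$tmp/f1"
LC_ALL=C grep -xE '([a-d][r-z]){3}' "$tmp/sample.txt" > "$tmp/f2"

# cmp -s is silent and reports the comparison via exit status
if cmp -s "$tmp/f1" "$tmp/f2"; then same=yes; else same=no; fi
matched=$(cat "$tmp/f2")

echo "identical: $same"
printf '%s\n' "$matched"

# clean up
rm -r "$tmp"
```

This check is only meaningful for ASCII input. If the data has non-ASCII characters, results can legitimately differ between locales.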

info There have been plenty of speed improvements in recent versions, see the release notes for details.

Speed benefits with PCRE

PCRE is usually faster when the search pattern has backreferences.

As mentioned earlier, here is the relevant text from man grep under the Known Bugs section (applies to BRE/ERE):

Large repetition counts in the {n,m} construct may cause grep to use lots of memory. In addition, certain other obscure regular expressions require exponential time and space, and may cause grep to run out of memory. Back-references are very slow, and may require exponential time.

$ time LC_ALL=C grep -xE '([a-z]..)\1' words.txt > f1
real    0m0.073s
$ time grep -xP '([a-z]..)\1' words.txt > f2
real    0m0.010s

$ # clean up
$ rm f[12]

Parallel execution

While searching huge code bases, you could consider using more than one processing resource (if available) to speed up results. This example dataset will be used again in the ripgrep chapter.

warning xargs -P may return mangled output unlike parallel, see unix.stackexchange: xargs vs parallel for details.

$ # assumes 'gotchas_tricks' as CWD
$ # note that the download size is 154M
$ wget https://github.com/torvalds/linux/archive/v4.19.tar.gz
$ tar -zxf v4.19.tar.gz
$ du -sh linux-4.19
908M    linux-4.19

$ cd linux-4.19
$ # note that the time is significantly different from first run to next
$ # due to caching, in this case 0m34.174s to 0m0.285s
$ time grep -rl 'include' . > ../f1
real    0m0.285s
$ # sometimes find+grep may be faster than grep -r, so try that first
$ # note the use of -print0 and -0 to handle filenames correctly
$ time find -type f -print0 | xargs -0 grep -l 'include' > ../f2
real    0m0.291s
$ # much better performance as xargs will use as many processes as possible
$ # assuming the output order of results does not matter
$ time find -type f -print0 | xargs -0 -P0 grep -l 'include' > ../f3
real    0m0.138s

$ # however, output is not usable for the third case
$ diff -sq <(sort ../f1) <(sort ../f2)
Files /dev/fd/63 and /dev/fd/62 are identical
$ diff -sq <(sort ../f1) <(sort ../f3)
Files /dev/fd/63 and /dev/fd/62 differ

$ # clean up
$ rm ../f[1-3]
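If you'd like to try the find+xargs approach without downloading the kernel source, here is a small sketch on a scratch directory. The directory layout and file contents are made up. -n1 -P2 is used here (one file per grep, at most two processes) instead of -P0, and -P is assumed to be supported by your xargs (it is in GNU findutils). With such tiny output the mangling issue mentioned in the warning is not expected to show up, so this is not a substitute for checking results on real data.

```shell
tmp=$(mktemp -d)
mkdir -p "$tmp/src/sub dir"
# hypothetical files, one of them under a directory name with a space
printf '#include <stdio.h>\n' > "$tmp/src/a.c"
printf 'int x;\n' > "$tmp/src/b.c"
printf '#include "a.h"\n' > "$tmp/src/sub dir/c.c"

# serial: a single grep process gets all the filenames
ser_out=$(cd "$tmp" && find . -type f -print0 | xargs -0 grep -l 'include' | sort)

# parallel: one file per grep invocation, up to two at a time
# results are sorted since the completion order is not predictable
par_out=$(cd "$tmp" && find . -type f -print0 | xargs -0 -n1 -P2 grep -l 'include' | sort)

printf 'serial:\n%s\nparallel:\n%s\n' "$ser_out" "$par_out"

# clean up
rm -r "$tmp"
```

The -print0 and -0 pair is what allows the filename containing a space to pass through intact.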

With this, the chapters on GNU grep are done. I highly recommend maintaining your own list of frequently used grep commands, tips and tricks, etc. The next chapter is on ripgrep, which is rapidly gaining popularity, mainly due to its speed, recursive options and customization features. Also, do check out the various resources linked in the Further Reading chapter.