Gotchas and Tricks

This chapter discusses some common beginner mistakes, corner cases, and a few tricks to improve performance. Some of the examples were already covered in the previous chapters.

The example_files directory has all the files used in the examples.

Shell quoting

Use single quotes to enclose the script on the command line to avoid potential conflict with shell metacharacters.

# space is a shell metacharacter, hence the error
$ echo 'a sunny day' | sed s/sunny day/cloudy day/
sed: -e expression #1, char 7: unterminated `s' command

# shell treats characters inside single quotes literally
$ echo 'a sunny day' | sed 's/sunny day/cloudy evening/'
a cloudy evening

On the other hand, beginners often do not realize the difference between single and double quotes and expect shell substitutions to work from within single quotes. See wooledge: Quotes and unix.stackexchange: Why does my shell script choke on whitespace or other special characters? for details about various quoting mechanisms.

# $USER won't get expanded within single quotes
$ echo 'User name: ' | sed 's/$/$USER/'
User name: $USER

# use double quotes for such cases
$ echo 'User name: ' | sed "s/$/$USER/"
User name: learnbyexample

When shell substitution is needed, surrounding the entire command with double quotes may lead to issues due to conflicts between sed and bash special characters. So, use double quotes only for the portion of the command where they are required.

# ! is one of special shell characters within double quotes
$ word='at'
# !d got expanded to 'date -Is' from my history and hence the error
$ printf 'sea\neat\ndrop\n' | sed "/${word}/!d"
printf 'sea\neat\ndrop\n' | sed "/${word}/date -Is"
sed: -e expression #1, char 6: extra characters after command

# works correctly when only the required portion is double quoted
$ printf 'sea\neat\ndrop\n' | sed '/'"${word}"'/!d'
eat
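Here is one more minimal sketch of quoting only the required portion. The variable name suffix and the sample input are made up for this illustration. The $ anchor and & stay inside single quotes, so the shell leaves them alone.

```shell
# 'suffix' is a hypothetical variable used only for this illustration
suffix='px'

# $ (end of line anchor) and & (matched text) are inside single quotes
# only the shell variable is double quoted
printf 'width 42\nheight 7\n' | sed 's/[0-9]$/&'"$suffix"'/'
# expected output:
# width 42px
# height 7px
```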

Escaping metacharacters

Another gotcha when applying variable or command substitution is the conflict between sed metacharacters and the value of the substituted string.

# variable being substituted cannot have the delimiter character
$ printf 'path\n' | sed 's/$/: '"$HOME"'/'
sed: -e expression #1, char 8: unknown option to `s'

# use a different delimiter that won't conflict with the variable value
$ printf 'path\n' | sed 's|$|: '"$HOME"'|'
path: /home/learnbyexample

But you might not have the luxury of choosing a delimiter that won't conflict with characters in the shell variable. Also, for literal search and replace, you'll have to preprocess the variable content to escape metacharacters. See the Shell substitutions chapter for details and examples for such cases.
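As a rough sketch of such preprocessing, the snippet below escapes a value so that it can be used in the replacement section of the s command with / as the delimiter. It assumes GNU sed and a value without newline characters. The variable names repl and escaped and the sample text are hypothetical, and the search section needs a different set of characters to be escaped.

```shell
# hypothetical replacement text containing \, & and /
repl='C:\dir & /tmp'

# prefix each \, & and / with a backslash
# '#' is the delimiter here, so / can appear as is inside the bracket expression
escaped=$(printf '%s' "$repl" | sed 's#[\\&/]#\\&#g')

echo 'path: VALUE' | sed 's/VALUE/'"$escaped"'/'
# expected output:
# path: C:\dir & /tmp
```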

Options at the end of the command

You can specify command line options even at the end of the command. This is useful if you forgot some options and want to edit the previous command from the shell history.

# no output, as + is not special with default BRE
$ printf 'boat\nsite\nfoot\n' | sed -n 's/[aeo]+t/(&)/p'

# pressing up arrow will bring up the last command from history
# then you can add the option needed at the end of the command
$ printf 'boat\nsite\nfoot\n' | sed -n 's/[aeo]+t/(&)/p' -E
b(oat)
f(oot)

As a corollary, if a filename starts with -, you need to either escape it or use -- as an option to indicate that no more options will be used. The -- feature is not unique to the sed command. It is applicable to many other commands as well and is typically used when filenames are obtained from another source or expanded by shell globs such as *.txt.

$ echo 'hi hello' > -dash.txt
$ sed 's/hi/HI/' -dash.txt
sed: invalid option -- 'd'

$ sed -- 's/hi/HI/' -dash.txt
HI hello

# clean up temporary file
$ rm -- -dash.txt
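One way to escape such a filename is to prefix it with ./ so that it no longer starts with the - character. The sketch below uses a temporary directory (the variable name tmpdir is made up for this illustration) so that no existing file is touched.

```shell
# work in a temporary directory, inside a subshell so that cd has no lasting effect
tmpdir=$(mktemp -d)
(
    cd "$tmpdir" || exit 1
    echo 'hi hello' > ./-dash.txt

    # ./-dash.txt doesn't start with -, so it cannot be mistaken for an option
    sed 's/hi/HI/' ./-dash.txt

    # ./*.txt gives the same protection for names expanded by a glob
    sed 's/hi/HI/' ./*.txt
)
# expected output (once for each sed command):
# HI hello

# clean up
rm -r -- "$tmpdir"
```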

DOS style line endings

Your command might not work and/or produce weird output if your input content has DOS style line endings.

# substitution doesn't work here because of DOS style line ending
$ printf 'hi there\r\ngood day\r\n' | sed -E 's/\w+$/123/'
hi there
good day
# matching \r optionally is one way to solve this issue
# that way, it'll work for both \r\n and \n line endings
$ printf 'hi there\r\ngood day\r\n' | sed -E 's/\w+(\r?)$/123\1/'
hi 123
good 123

# swapping every two columns, works well with \n line ending
$ printf 'good,bad,42,24\n' | sed -E 's/([^,]+),([^,]+)/\2,\1/g'
bad,good,24,42
# output gets mangled with \r\n line ending
$ printf 'good,bad,42,24\r\n' | sed -E 's/([^,]+),([^,]+)/\2,\1/g'
,42,good,24

I use these bash functions (as part of .bashrc configuration) to easily switch between DOS and Unix style line endings. Some Linux distributions may come with these commands installed by default. See also stackoverflow: Why does my tool output overwrite itself and how do I fix it?

unix2dos() { sed -i 's/$/\r/' "$@" ; }
dos2unix() { sed -i 's/\r$//' "$@" ; }
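If you'd rather not modify the input file, another option is to strip the carriage return as the first step of the same sed command. This is only a sketch based on the earlier examples in this section, assuming GNU sed. Note that the output will have Unix style line endings.

```shell
# remove a trailing \r (if any) before applying the actual substitution
printf 'hi there\r\ngood day\r\n' | sed -E 's/\r$//; s/\w+$/123/'
# expected output:
# hi 123
# good 123

# the column swapping example is no longer mangled
printf 'good,bad,42,24\r\n' | sed -E 's/\r$//; s/([^,]+),([^,]+)/\2,\1/g'
# expected output:
# bad,good,24,42
```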

No newline at the end of the last input line

Unlike grep, sed will not add a newline if the last line of input didn't have one.

# grep adds a newline even though 'drop' doesn't end with newline
$ printf 'sea\neat\ndrop' | grep -v 'at'
sea
drop

# sed will not do so
# note how the prompt appears after 'drop'
$ printf 'sea\neat\ndrop' | sed '/at/d'
sea
drop$
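Counting the output bytes is one way to confirm the difference when the prompt isn't a reliable clue. The last command in this sketch relies on a GNU sed behavior: $a\ appends empty text after the last line, which has the side effect of adding the newline when it is missing. Treat it as an assumption to verify on your own setup.

```shell
# 'sea\ndrop' is 8 bytes, grep adds a newline to make it 9
printf 'sea\neat\ndrop' | grep -v 'at' | wc -c
# expected output: 9

# sed keeps the last line as is
printf 'sea\neat\ndrop' | sed '/at/d' | wc -c
# expected output: 8

# $a\ with GNU sed adds the missing newline
printf 'sea\neat\ndrop' | sed -e '/at/d' -e '$a\' | wc -c
# expected output: 9
```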

Command grouping and -e option

Some commands (for example, the s command) can be terminated with a semicolon or } (command grouping). But commands like a and r will treat them as part of the string argument. You can use a literal newline to terminate such commands or use the -e option as shown below.

# } gets treated as part of the string argument, hence the error
$ seq 3 | sed '2{s/^/*/; a hi}'
sed: -e expression #1, char 0: unmatched `{'

# -e to the rescue
# note the use of -e for the first portion of the command as well
$ seq 3 | sed -e '2{s/^/*/; a hi' -e '}'
1
*2
hi
3
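The literal newline alternative mentioned above can be sketched as follows. The newline ends the text given to the a command, so the } on the next line is parsed as the end of the command group.

```shell
# the string argument of the a command ends at the newline
seq 3 | sed '2{s/^/*/; a hi
}'
# expected output:
# 1
# *2
# hi
# 3
```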

Longest match wins

sed doesn't support non-greedy quantifiers, which are found in other flavors like Perl and Python. See also the Longest match wins section for more details.

$ s='food land bark sand band cue combat'
# this will always match from the first 'foo' to the last 'ba'
$ echo "$s" | sed 's/foo.*ba/X/'
Xt

# if you need to match from the first 'foo' to the first 'ba', then
# use a tool which supports non-greedy quantifiers
$ echo "$s" | perl -pe 's/foo.*?ba/X/'
Xrk sand band cue combat

For certain cases, a character class can help in matching only the relevant characters. And in some cases, adding more qualifiers instead of just .* can help. See stackoverflow: How to replace everything until the first occurrence for an example.

# first { to the last }
$ echo '{52} apples and {31} mangoes' | sed 's/{.*}/42/g'
42 mangoes

# matches from { to the very next }
$ echo '{52} apples and {31} mangoes' | sed 's/{[^}]*}/42/g'
42 apples and 42 mangoes
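When the end marker is a string instead of a single character, one possible workaround is to first convert the marker to a character that cannot otherwise be present in the line and then use a negated character class. The sketch below uses the newline character for this purpose and assumes GNU sed (for \n inside a bracket expression). It works only if 'ba' doesn't occur before 'foo', and the newline is left behind if the second substitution fails to match.

```shell
s='food land bark sand band cue combat'

# step 1: replace the first 'ba' with a newline character
# step 2: [^\n]* cannot go past that newline, so the match ends at the first 'ba'
echo "$s" | sed 's/ba/\n/; s/foo[^\n]*\n/X/'
# expected output:
# Xrk sand band cue combat
```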

Empty matches with * quantifier

Beware of empty matches when using the * quantifier.

# * matches zero or more times
$ echo '42,,,,,hello,bye,,,hi' | sed 's/,*/,/g'
,4,2,h,e,l,l,o,b,y,e,h,i,

# + matches one or more times
$ echo '42,,,,,hello,bye,,,hi' | sed -E 's/,+/,/g'
42,hello,bye,hi
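If you want to stick with BRE, here are two ways to require at least one match, shown as a small sketch using the same sample input. The \+ form is a GNU extension.

```shell
# one comma followed by zero or more commas
echo '42,,,,,hello,bye,,,hi' | sed 's/,,*/,/g'

# \+ is the BRE equivalent of + in GNU sed
echo '42,,,,,hello,bye,,,hi' | sed 's/,\+/,/g'

# expected output for both commands:
# 42,hello,bye,hi
```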

BRE vs ERE

Unlike other implementations of sed, there are no feature differences between BRE and ERE flavors in GNU sed. Quoting from the manual:

In GNU sed, the only difference between basic and extended regular expressions is in the behavior of a few special characters: ?, +, parentheses, braces ({}), and |.

# no match as + is not special with default BRE
$ echo '52 apples and 31234 mangoes' | sed 's/[0-9]+/[&]/g'
52 apples and 31234 mangoes
# so, either use \+ with BRE or use + with ERE
$ echo '52 apples and 31234 mangoes' | sed 's/[0-9]\+/[&]/g'
[52] apples and [31234] mangoes

# the reverse is also a common beginner mistake
$ echo 'get {} set' | sed 's/\{\}/[]/'
sed: -e expression #1, char 10: Invalid preceding regular expression
$ echo 'get {} set' | sed 's/{}/[]/'
get [] set
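To illustrate the quote from the manual, here is the same substitution written in both flavors. The sample input is made up for this sketch.

```shell
# BRE: parentheses and + need a backslash prefix to be special
echo 'a1b22' | sed 's/\([a-z]\)\([0-9]\+\)/\2\1/g'

# ERE: the same characters are special without the backslash
echo 'a1b22' | sed -E 's/([a-z])([0-9]+)/\2\1/g'

# expected output for both commands:
# 1a22b
```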

Using online regexp tools

Online tools like regex101 and debuggex can be very useful, especially for debugging purposes. However, their popularity has led to users trying out their patterns on these sites and expecting them to work as is for command line tools like grep, sed and awk. The issue arises when features like non-greedy quantifiers and lookarounds are used, as they wouldn't work with BRE/ERE. See also unix.stackexchange: Why does my regular expression work in X but not in Y?

$ echo '1,,,two,,3' | sed -E 's/,\K(?=,)/NA/g'
sed: -e expression #1, char 15: Invalid preceding regular expression
$ echo '1,,,two,,3' | perl -pe 's/,\K(?=,)/NA/g'
1,NA,NA,two,NA,3

# \d is not available as a character set escape sequence
# will match 'd' instead
$ echo '52 apples and 31234 mangoes' | sed -E 's/\d+/[&]/g'
52 apples an[d] 31234 mangoes
$ echo '52 apples and 31234 mangoes' | perl -pe 's/\d+/[$&]/g'
[52] apples and [31234] mangoes
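Some of these patterns can be rewritten using features that sed does support. As a sketch, the empty field example above can be handled with a loop that repeats the substitution until there are no more consecutive commas. The label name a is arbitrary, and separate -e options are used so that the label is terminated unambiguously.

```shell
# keep inserting NA between consecutive commas until none are left
echo '1,,,two,,3' | sed -e ':a' -e 's/,,/,NA,/' -e 'ta'
# expected output:
# 1,NA,NA,two,NA,3
```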

End of line matching

If you are facing issues with end of line matching, it is often due to DOS style line endings (discussed earlier in this chapter) or whitespace characters at the end of the line.

# there's no visual clue to indicate whitespace characters at the end of line
$ printf 'food bark \n1234 6789\t\n'
food bark 
1234 6789	
# no match
$ printf 'food bark \n1234 6789\t\n' | sed -E 's/\w+$/xyz/'
food bark 
1234 6789	

# cat command has options to indicate end of line, tabs, etc
$ printf 'food bark \n1234 6789\t\n' | cat -A
food bark $
1234 6789^I$
# works now, as whitespace characters at the end are matched too
$ printf 'food bark \n1234 6789\t\n' | sed -E 's/\w+\s*$/xyz/'
food xyz
1234 xyz
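The l command of sed is another way to spot such characters without leaving the tool. As a small sketch, it prints the pattern space in an unambiguous form, with $ marking the end of the line and escape sequences like \t for tab.

```shell
# -n avoids printing the line a second time in its normal form
printf 'food bark \n1234 6789\t\n' | sed -n 'l'
# expected output:
# food bark $
# 1234 6789\t$
```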

Word boundary differences

The word boundary \b matches both the start and end of word locations, whereas \< and \> match exactly the start and end of word locations respectively. This leads to cases where you have to choose which of these word boundaries to use depending on the results desired. Consider I have 12, he has 2! as a sample text, shown below as an image with vertical bars marking the word boundaries. The last character ! doesn't have the end of word boundary marker as it is not a word character.

[Image: word boundaries in the sample text marked with vertical bars]

# \b matches both the start and end of word boundaries
# the first match here used starting boundary of 'I' and 'have'
$ echo 'I have 12, he has 2!' | sed 's/\b..\b/[&]/g'
[I ]have [12][, ][he] has[ 2]!

# \< and \> match only the start and end of word boundaries respectively
$ echo 'I have 12, he has 2!' | sed 's/\<..\>/[&]/g'
I have [12], [he] has 2!

Here's another example to show the difference between the two types of word boundaries.

# add something to both the start/end of word
$ echo 'hi log_42 12b' | sed 's/\b/:/g'
:hi: :log_42: :12b:

# add something only at the start of word
$ echo 'hi log_42 12b' | sed 's/\</:/g'
:hi :log_42 :12b

# add something only at the end of word
$ echo 'hi log_42 12b' | sed 's/\>/:/g'
hi: log_42: 12b:

Filter and then substitute

For some cases, you can simplify and improve readability of a substitution command by adding a filter condition instead of using substitution only.

# insert 'Error: ' at the start of a line if it contains '42'
# also, remove any leading whitespace for such lines
$ printf '1423\n214\n   425\n' | sed -E 's/^\s*(.*42)/Error: \1/'
Error: 1423
214
Error: 425

# simpler and easier to understand
# also note that -E is no longer required
$ printf '1423\n214\n   425\n' | sed '/42/ s/^\s*/Error: /'
Error: 1423
214
Error: 425

Addressing input that only has a single line

Both 1 and $ will match as an address if the input has only one line of data.

$ printf '3.14\nhi\n42\n' | sed '1 s/^/start: /; $ s/$/ :end/'
start: 3.14
hi
42 :end
$ echo '3.14' | sed '1 s/^/start: /; $ s/$/ :end/'
start: 3.14 :end

# you can use control structures as a workaround
# this prevents ending address match if input has only one line
$ echo '3.14' | sed '1{s/^/start: /; b}; $ s/$/ :end/'
start: 3.14
# this prevents starting address match if input has only one line
$ echo '3.14' | sed '${s/$/ :end/; b}; 1 s/^/start: /'
3.14 :end

Behavior of n and N commands at the end of input

The n and N commands will not execute further commands if there are no more input lines to fetch.

# last line matches the filtering condition
# but substitution didn't work for the last line
$ printf 'red\nblue\ncredible\n' | sed '/red/{N; s/e.*e/2/}'
r2
credible

# $!N will avoid executing the N command for the last line of input
$ printf 'red\nblue\ncredible\n' | sed '/red/{$!N; s/e.*e/2/}'
r2
cr2
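The same behavior shows up whenever lines are processed in pairs and the input has an odd number of lines. Here is a minimal sketch with made up input, assuming GNU sed (other implementations may discard the last line instead of printing it).

```shell
# N cannot fetch a line after 3
# GNU sed prints the pattern space and quits without running the s command
seq 3 | sed 'N; s/$/./'
# expected output:
# 1
# 2.
# 3

# $!N skips the N command for the last line, so the s command is executed
seq 3 | sed '$!N; s/$/./'
# expected output:
# 1
# 2.
# 3.
```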

Faster execution for ASCII input

Changing locale to ASCII (assuming that the default is not ASCII) can give a significant speed boost.

# time shown is the best result from multiple runs
# speed benefit will vary depending on computing resources, input, etc
$ time sed -nE '/^([a-d][r-z]){3}$/p' words.txt > f1
real    0m0.023s

# LC_ALL=C will give ASCII locale, active only for this command
$ time LC_ALL=C sed -nE '/^([a-d][r-z]){3}$/p' words.txt > f2
real    0m0.012s

# check if the results are identical for both commands
$ diff -s f1 f2
Files f1 and f2 are identical

Here's another example.

$ time sed -nE '/^([a-z]..)\1$/p' words.txt > f1
real    0m0.050s

$ time LC_ALL=C sed -nE '/^([a-z]..)\1$/p' words.txt > f2
real    0m0.029s

# clean up temporary files
$ rm f[12]
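Keep in mind that this trick is meant for ASCII input only. As a rough sketch of what can go wrong otherwise, the sample below has three Greek characters, each stored as two bytes in UTF-8. In the ASCII locale, GNU sed treats every byte as a separate character.

```shell
# 3 characters, but 6 bytes in UTF-8 encoding
printf 'αβγ\n' | LC_ALL=C sed 's/./x/g'
# expected output:
# xxxxxx
```

With a UTF-8 locale, you'd expect three replacements for the same input.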

Substitution with ripgrep command

ripgrep (command name rg) is primarily an alternative to the grep command, but it also supports search and replace functionality. It has more regular expression features compared to BRE/ERE, supports unicode, multiline and fixed string matching and is generally faster than sed. You can use rg --passthru -N 'search' -r 'replace' file to emulate sed 's/search/replace/g' file. There are plenty of reasons to recommend learning rg even though substitution features are limited (no in-place support, no address filtering, no control structures, etc).

# same as: sed 's/e/E/g' greeting.txt
# --passthru is needed to print lines that don't match the search pattern
$ rg --passthru -N 'e' -r 'E' greeting.txt
Hi thErE
HavE a nicE day

# non-greedy quantifier
$ s='food land bark sand band cue combat'
$ echo "$s" | rg --passthru 'foo.*?ba' -r 'X'
Xrk sand band cue combat

# Multiline search and replacement
$ printf '42\nHi there\nHave a Nice Day' | rg --passthru -U '(?s)the.*ice' -r ''
42
Hi  Day

# fixed string matching example, this one replaces [4]* with 2
$ printf '2.3/[4]*6\nfig\n5.3-[4]*9\n' | rg --passthru -F '[4]*' -r '2'
2.3/26
fig
5.3-29

# unicode support
$ echo 'fox:αλεπού,eagle:αετός' | rg '\p{L}+' -r '($0)'
(fox):(αλεπού),(eagle):(αετός)

# -P option enables PCRE if you need even more advanced features
$ echo 'car bat cod map' | rg -Pw '(bat|map)(*SKIP)(*F)|\w+' -r '[$0]'
[car] bat [cod] map

See my ebook CLI text processing with GNU grep and ripgrep for more examples and details.

Compiling sed script

Quoting from sed-bin: POSIX sed to C translator:

This project allows to translate sed to C to be able to compile the result and generate a binary that will have the exact same behavior as the original sed script

It could help in debugging a complex sed script, obfuscation, better speed, etc.
