Emulating regexp lookarounds in GNU sed
This stackoverflow Q&A got me thinking about various ways to construct a solution in GNU sed if lookarounds are needed.
Only single line (with newline as the line separator) processing is presented here. Equivalent lookaround syntax with grep -P or perl is also shown for comparison. Cases where multiple lines and/or ASCII NUL characters are present in the pattern space are left as an exercise.
Filtering
Here, you only need to decide whether the input line has to be matched or not. sed supports grouping commands inside {} that should be executed only if a filtering condition is matched. The condition can be negated by adding a ! character. In this way, you can emulate chaining of multiple positive and/or negative lookaround conditions.
$ cat items.txt
1,2,3,4
apple=50 ;per kg
a,b,c,d
;foo xyz3
# lines containing a digit character followed by a ; character anywhere after
# lookaround isn't needed here
# same as: grep '[0-9].*;' or grep -P '\d(?=.*;)'
$ sed -n '/[0-9].*;/p' items.txt
apple=50 ;per kg
# lines containing both digit and ; characters in any order
# same as: grep -P '^(?=.*;).*\d'
$ sed -n '/;/{ /[0-9]/p }' items.txt
apple=50 ;per kg
;foo xyz3
# lines containing both digit and ; characters
# but not if the line also contains character a
# same as: grep -P '^(?!.*a)(?=.*;).*\d'
$ sed -n '/a/!{ /;/{ /[0-9]/p } }' items.txt
;foo xyz3
For some cases, chaining multiple conditions as in the previous examples is not enough. For example, filter a line if it contains par as long as cart isn't present later in the line. Presence of cart earlier in the line shouldn't affect the outcome. In such cases, you can first change the input line to add a newline character before every occurrence of cart and then construct a condition that depends on the newline character instead of cart. If a match is found, delete all the newline characters and then print the line.
$ s='par carted spare cart park city\na parking cart\n'
# same as: grep -P 'par(?!.*cart)'
$ printf '%b' "$s" | sed -n 's/cart/\n&/g; /par[^\n]*$/{ s/\n//g; p }'
par carted spare cart park city
Newline is a safe character to choose for default line by line processing, as sed removes it from the pattern space. If you are processing a pattern space that contains newline characters (for example: -z option, N command, etc), then you can still perform this trick as long as you know a character that is guaranteed to be absent from the input data.
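As a rough sketch of that last point, the filter below reads the whole input as a single record with the -z option and uses @ as the marker instead of newline. The sample inputs, the variable names and the choice of @ are assumptions made for illustration. Any character known to be absent from the data would do.

```shell
# Assumption: '@' never occurs in the input, so it can act as the marker.
# With -z, the whole input (no NUL bytes here) is one record with embedded newlines.
# roughly same as: perl -0777 -ne 'print if /par(?!.*cart)/s'
filter='s/cart/@&/g; /par[^@]*$/{ s/@//g; p }'

# 'park' occurs after the last 'cart', so the record is printed
kept=$(printf 'par cart\npark\n' | sed -zn "$filter")
printf 'kept: <%s>\n' "$kept"

# every 'par' has a 'cart' somewhere after it, so nothing is printed
dropped=$(printf 'park\nmy cart\n' | sed -zn "$filter")
printf 'dropped: <%s>\n' "$dropped"
```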
Substitution
In the previous section, you saw how to modify the input line with a newline character to make it easier to construct a lookaround condition. This trick comes in handy for substitution as well. However, for search and replace cases, you also need to emulate the zero-width nature of lookarounds. To achieve this, you can make use of the t command to construct a loop that performs substitution as long as a match is found. See my chapter on Control structures for more details about branching commands in GNU sed.
Here's an example of looping. The aim is to delete fin from the given input recursively.
# manual repetition, assuming count is known
$ echo 'coffining' | sed 's/fin//'
cofing
$ echo 'coffining' | sed 's/fin//; s///'
cog
# :loop marks the 's' command with label 'loop'
# tloop will jump to label 'loop' as long as the substitution succeeds
$ echo 'coffining' | sed ':loop s/fin//; tloop'
cog
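To see why a loop is used here rather than the g flag, consider this small comparison (the variable names are only for illustration). The g flag replaces the matches found in a single left-to-right pass, so a fin that forms only after an earlier deletion is missed.

```shell
# single pass: the second 'fin' only appears after the first one is removed
once=$(echo 'coffining' | sed 's/fin//g')
echo "$once"      # cofing

# loop: the substitution is retried until it no longer succeeds
looped=$(echo 'coffining' | sed ':loop s/fin//; tloop')
echo "$looped"    # cog
```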
Negative lookarounds
Some cases can be solved by performing substitution only if a condition is first satisfied. For this example, first select lines that don't start with a ; character. Then, for such lines, remove everything from the first space or comma character onwards. Note that {} grouping is optional here.
# same as: perl -ne 'print if s/^(?!;).*?\K[ ,].*//'
$ sed -n '/^;/! s/[ ,].*//p' items.txt
1
apple=50
a
For this example, foo has to be changed to [baz] only if it is not followed by a digit character. Note that foo at the end of string also satisfies this assertion. foofoo has two matches as the assertion is zero-width in nature, i.e. it doesn't consume characters. Here, the first step is inserting a newline character between foo and a digit character. Then change all foo to [baz] as long as it is at the end of string or if it isn't followed by a newline character. Once the loop ends, remove all the newline characters.
$ s='hey food! foo42 foot5 foofoo'
# same as: perl -pe 's/foo(?!\d)/[baz]/g'
$ echo "$s" | sed -E 's/(foo)([0-9])/\1\n\2/g;
:a s/foo([^\n]|$)/[baz]\1/; ta;
s/\n//g'
hey [baz]d! foo42 [baz]t5 [baz][baz]
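The loop matters here because ([^\n]|$) consumes the character after foo, unlike a real lookahead. A quick sketch with the g flag alone (sample input and variable names chosen for illustration) shows the second foo in foofoo being skipped, since its f was already used up by the first match.

```shell
# g flag only: 'foof' is consumed by the first match, leaving 'oo' unmatched
single=$(echo 'foofoo' | sed -E 's/foo([^\n]|$)/[baz]\1/g')
echo "$single"    # [baz]foo

# loop: each retry starts again from the beginning of the line
looped=$(echo 'foofoo' | sed -E ':a s/foo([^\n]|$)/[baz]\1/; ta')
echo "$looped"    # [baz][baz]
```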
Change foo to [baz] only if it is not preceded by a _ character. foo at the start of string is matched as well.
$ s='foo _foo 42foofoo'
# same as: perl -pe 's/(?<!_)foo/[baz]/g'
$ echo "$s" | sed -E 's/(_)(foo)/\1\n\2/g;
:a s/(^|[^\n])foo/\1[baz]/; ta;
s/\n//g'
[baz] _foo 42[baz][baz]
Replace par with [xyz] as long as the s character is not present later in the input. This assumes that the assertion doesn't conflict with the search pattern. For example, s will not conflict with par, but r would, since r is part of par.
$ s='par spare part party'
# same as: perl -pe 's/par(?!.*s)/[xyz]/g'
$ echo "$s" | sed -E 's/s/&\n/g;
:a s/par([^\n]*)$/[xyz]\1/; ta;
s/\n//g'
par s[xyz]e [xyz]t [xyz]ty
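To illustrate the conflict mentioned above, here is a sketch where the assertion character is r, which is itself part of par (the input string and variable names are made up for this illustration). Marking every r puts a newline right after every par, so the marker approach never matches. Since par contains r, only the last par can possibly satisfy the assertion, which allows a plain negated character class as a workaround in this particular case. This is not a general fix for conflicting patterns.

```shell
s='par spare part'

# naive marker approach: every 'par' becomes 'par\n', so nothing is replaced
naive=$(echo "$s" | sed -E 's/r/&\n/g; :a s/par([^\n]*)$/[xyz]\1/; ta; s/\n//g')
echo "$naive"    # par spare part

# workaround for this specific case
# same as: perl -pe 's/par(?!.*r)/[xyz]/g'
fixed=$(echo "$s" | sed -E 's/par([^r]*)$/[xyz]\1/')
echo "$fixed"    # par spare [xyz]t
```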
Replace all empty fields with NA for csv input (assuming no embedded comma, newline characters, etc).
$ s=',1,,,two,3,,,'
# same as: perl -lpe 's/(?<![^,])(?![^,])/NA/g'
$ echo "$s" | sed -E ':a s/,,/,NA,/g; ta; s/^,/NA,/; s/,$/,NA/'
NA,1,NA,NA,two,3,NA,NA,NA
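The loop is what handles consecutive empty fields: the comma that ends one match cannot also start the next match within a single g pass. A small sketch of the difference, reusing the same sample input (the variable names are only for illustration):

```shell
s=',1,,,two,3,,,'

# single pass leaves every other empty field untouched
once=$(echo "$s" | sed -E 's/,,/,NA,/g')
echo "$once"      # ,1,NA,,two,3,NA,,

# repeating the substitution fills the rest
looped=$(echo "$s" | sed -E ':a s/,,/,NA,/g; ta')
echo "$looped"    # ,1,NA,NA,two,3,NA,NA,
```

The empty fields at the start and end of the line still need the two anchored substitutions, since they have a comma on one side only.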
Replace the text from at to par with [xyz], provided go is not present between them.
$ s='fox,cat,dog,parrot,dot,park,bat,go,spare,sat-in-a-park'
# same as: perl -pe 's/at((?!go).)*par/[xyz]/g'
$ echo "$s" | sed 's/go/\n&/g; s/at[^\n]*par/[xyz]/g; s/\n//g'
fox,c[xyz]k,bat,go,spare,s[xyz]k
Positive lookarounds
In this example, fields have to be surrounded with [] except the first and last fields for csv input (assuming no embedded comma, newline characters, etc). With positive lookaround emulation, the modified string may continue to satisfy the matching condition, resulting in infinite looping. In this example, the fields themselves may contain [] characters, so you cannot use them to prevent an infinite loop. The newline character trick comes in handy again.
$ s='1,t[w]o,[3],f[ou]r,5'
# same as: perl -pe 's/(?<=,)[^,]+(?=,)/[$&]/g'
$ echo "$s" | sed -E ':a s/,([^,\n]+),/,\n[\1],/g; ta; s/\n//g'
1,[t[w]o],[[3]],[f[ou]r],5
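Without the loop, the trailing comma consumed by one match is no longer available as the leading comma of the next field, so alternate fields are skipped. A hedged sketch with a simpler made-up input (no [] inside the fields) shows both the single pass and the marker-based loop. Looping without the marker is not shown, because an already wrapped field would keep matching forever.

```shell
s='1,2,3,4,5'

# g flag only: ',2,' is consumed, so ',3,' cannot match in the same pass
once=$(echo "$s" | sed -E 's/,([^,]+),/,[\1],/g')
echo "$once"      # 1,[2],3,[4],5

# loop with the newline marker: already wrapped fields are skipped on retries
looped=$(echo "$s" | sed -E ':a s/,([^,\n]+),/,\n[\1],/g; ta; s/\n//g')
echo "$looped"    # 1,[2],[3],[4],5
```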
Add a space at word boundaries, but not at the start or end of string. Also, don't add a space if one is already present. Here, a negated character class for the space character is enough to emulate the assertion.
$ s='total= num1+35*42/num2'
# same as: perl -lpe 's/(?<=[^ ])\b(?=[^ ])/ /g'
$ echo "$s" | sed -E ':a s/([^ ])\b([^ ])/\1 \2/; ta;'
total = num1 + 35 * 42 / num2
Replace par with [xyz] as long as part occurs as a whole word later in the line. Here, the nature of the modified string itself prevents the possibility of an infinite loop.
$ s='par spare part party'
# same as: perl -pe 's/par(?=.*\bpart\b)/[xyz]/g'
$ echo "$s" | sed -E ':a s/par(.*\bpart\b)/[xyz]\1/; ta'
[xyz] s[xyz]e part party
Summary
Branching commands and some creative preprocessing of the input can be combined to emulate lookaround assertions in sed. Given that the Unix utility sed is Turing complete, it's perhaps not a big surprise. Now, please excuse me, I'll be busy reaping points on stackoverflow/unix.stackexchange for this edge case ;)