Escaping madness to get literal field separators in awk
I'm building a tool called rcut that allows you to use cut-like syntax with features like regexp-based delimiters. The solution uses awk inside a bash script.
The latest feature creep is fixed-string field splitting. I thought it would be simple enough to add.
I was wrong.
How many escapes for a single backslash?
For reference, these are the versions I have on my machine:
$ gawk --version
GNU Awk 5.1.0, API: 3.0
$ mawk -W version
mawk 1.3.4 20200120
mawk and gawk differ when it comes to escaping backslashes. You'll later see the rule that'll work correctly for both implementations.
$ echo 'apple\bake\cake' | mawk -F'e\' '{print $2}'
bak
$ echo 'apple\bake\cake' | gawk -F'e\' '{print $2}'
gawk: fatal: invalid regexp: Trailing backslash: /e\/
$ echo 'apple\bake\cake' | gawk -F'e\\' '{print $2}'
gawk: fatal: invalid regexp: Trailing backslash: /e\/
$ echo 'apple\bake\cake' | gawk -F'e\\\' '{print $2}'
bak
The value assigned to FS is treated as a string and then converted to a regexp. \ is a metacharacter in both strings and regexps. So, \\ in a string means a single backslash and \\\\ means two backslashes. Two backslashes in a regexp mean a single literal backslash.
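The two stages can be looked at separately. Here's a minimal sketch, assuming awk on your machine resolves to gawk or mawk (the variable names are only for illustration). As far as I can tell, a -v assignment goes through the same string-level processing as -F, so printing the variable shows what survives the first stage, and split shows what the regexp stage then matches.

```shell
# Stage 1 (string): four backslashes typed, two left in the string.
string_level=$(awk -v s='\\\\' 'BEGIN { print s }')
printf '%s\n' "$string_level"

# Stage 2 (regexp): those two backslashes match one literal backslash,
# so 'a\b' splits into two pieces.
pieces=$(printf '%s\n' 'a\b' | awk -v s='\\\\' '{ print split($0, parts, s) }')
printf '%s\n' "$pieces"
```

I've used printf instead of echo here because some shells let echo interpret backslashes on its own, which would add yet another layer to think about.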
Conclusion: For consistent behavior across both mawk and gawk, and to avoid trailing backslash errors, you need to use 4 backslashes for every backslash.
# both 2 and 4 backslashes here get treated as a single backslash
# hence the empty fields in the output
$ echo '1\\2\\3' | mawk -F'\\' -v OFS=, '{$1=$1} 1'
1,,2,,3
$ echo '1\\2\\3' | mawk -F'\\\\' -v OFS=, '{$1=$1} 1'
1,,2,,3
$ echo '1\\2\\3' | gawk -F'\\' -v OFS=, '{$1=$1} 1'
1,,2,,3
$ echo '1\\2\\3' | gawk -F'\\\\' -v OFS=, '{$1=$1} 1'
1,,2,,3
# 5-8 backslashes give expected results
$ echo '1\\2\\3' | mawk -F'\\\\\' -v OFS=, '{$1=$1} 1'
1,2,3
$ echo '1\\2\\3' | mawk -F'\\\\\\' -v OFS=, '{$1=$1} 1'
1,2,3
$ echo '1\\2\\3' | mawk -F'\\\\\\\' -v OFS=, '{$1=$1} 1'
1,2,3
$ echo '1\\2\\3' | mawk -F'\\\\\\\\' -v OFS=, '{$1=$1} 1'
1,2,3
# 5-6 backslashes give an error, 7-8 backslashes give expected results
$ echo '1\\2\\3' | gawk -F'\\\\\' -v OFS=, '{$1=$1} 1'
gawk: fatal: invalid regexp: Trailing backslash: /\\\/
$ echo '1\\2\\3' | gawk -F'\\\\\\' -v OFS=, '{$1=$1} 1'
gawk: fatal: invalid regexp: Trailing backslash: /\\\/
$ echo '1\\2\\3' | gawk -F'\\\\\\\' -v OFS=, '{$1=$1} 1'
1,2,3
$ echo '1\\2\\3' | gawk -F'\\\\\\\\' -v OFS=, '{$1=$1} 1'
1,2,3
As an alternative, you can use the codepoint of the backslash character. This removes one level of escaping. See the ASCII code table for codepoint reference.
Conclusion: You need \x5c\x5c for every backslash.
$ echo 'apple\bake\cake' | mawk -F'e\x5c\x5c' '{print $2}'
bak
$ echo 'apple\bake\cake' | gawk -F'e\x5c\x5c' '{print $2}'
bak
$ echo '1\\2\\3' | mawk -F'\x5c\x5c\x5c\x5c' -v OFS=, '{$1=$1} 1'
1,2,3
$ echo '1\\2\\3' | gawk -F'\x5c\x5c\x5c\x5c' -v OFS=, '{$1=$1} 1'
1,2,3
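To see what the codepoint form turns into, you can print it after the string stage. This is only a sketch under the assumption that awk is gawk or mawk and that -v processes escapes the same way -F does; the variable names are made up for the example.

```shell
# \x5c is turned into a backslash at the string level, so two of them
# leave the same two-character string as four typed backslashes.
from_codepoint=$(awk -v s='\x5c\x5c' 'BEGIN { print s }')
from_backslashes=$(awk -v s='\\\\' 'BEGIN { print s }')
printf '%s\n%s\n' "$from_codepoint" "$from_backslashes"
```

Both lines should come out identical, which is why \x5c\x5c and \\\\ end up matching the same single backslash.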
Using awk to generate an escaped string
Suppose you want to use \. literally for field splitting. Here are some ways to do it that work for both mawk and gawk:
$ echo 'x\2\.y\.z' | gawk -F'\\\\\\.' -v OFS=, '{$1=$1} 1'
x\2,y,z
$ echo 'x\2\.y\.z' | gawk -F'\\\\[.]' -v OFS=, '{$1=$1} 1'
x\2,y,z
$ echo 'x\2\.y\.z' | gawk -F'\x5c\x5c[.]' -v OFS=, '{$1=$1} 1'
x\2,y,z
Now, the task is to generate one of the above strings passed to the -F option from \. as input. Using sed would be better, but for rcut, I didn't want to depend on another external tool.
Case 1: backslash madness
You need to convert \ to 4 backslashes and escape regexp metacharacters with 2 backslashes. Note that you cannot blindly put 2 backslashes before every character other than \, for example \\t will become a tab character! Also, you need to escape \ first and then escape the other metacharacters, otherwise the backslashes added for the metacharacters would themselves get escaped again.
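Here's a small sketch of the tab pitfall, again assuming awk is gawk or mawk (variable names are just for the example):

```shell
# '\\t' is the string \t, which the regexp stage reads as a tab.
tab_split=$(printf 'a\tb\n' | awk -F'\\t' '{ print $2 }')
printf '%s\n' "$tab_split"

# So it does not split on a literal letter t.
letter_fields=$(printf '%s\n' 'atb' | awk -F'\\t' '{ print NF }')
printf '%s\n' "$letter_fields"
```

The first command prints b because the tab was used as the separator, and the second prints 1 because the letter t was left alone.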
Ready for the solution? I'm not even going to try explaining this; I found it by experimenting.
# replacement string for the first gsub has 16 backslashes
# replacement string for the second gsub has 8 backslashes
$ echo 'a.b\c^d' | gawk '{gsub(/\\/, "\\\\\\\\\\\\\\\\");
gsub(/[{[(^$*?+.|]/, "\\\\\\\\&")} 1'
a\\.b\\\\c\\^d
gawk manual: Gory details might help you understand the above solution.
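One way to read the backslash counts, as far as I can work out, is that they get halved twice: once when awk parses the string literal and once more when gsub processes the replacement. A minimal sketch that shows each halving on its own, assuming gawk or mawk behind awk:

```shell
# 16 typed backslashes become 8 once awk parses the string literal.
after_string=$(awk 'BEGIN { print "\\\\\\\\\\\\\\\\" }')
printf '%s\n' "$after_string"

# gsub then halves them again, so one input backslash becomes 4.
after_gsub=$(printf '%s\n' '\' | awk '{ gsub(/\\/, "\\\\\\\\\\\\\\\\") } 1')
printf '%s\n' "$after_gsub"
```

The same reasoning would take the 8 backslashes in the second gsub down to 2 in front of each metacharacter.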
Case 2: character class
One characteristic of a character class is that you can enclose any character except \ and ^ in it to match that character literally. The \ character is special both inside and outside a character class, and [^] is invalid since ^ is special when used as the first character.
$ echo 'a.b\c^d' | gawk '{gsub(/\\/, "\\\\\\\\\\\\\\\\");
gsub(/[^^\\]/, "[&]");
gsub(/\^/, "\\\\^")} 1'
[a][.][b]\\\\[c]\\^[d]
Case 3: codepoint to represent backslash
Finally, my preferred solutions, which use the codepoint instead of escaping backslashes.
# case 1 alternate
$ echo 'a.b\c^d' | gawk '{gsub(/\\/, "\\x5c\\x5c");
gsub(/[{[(^$*?+.|]/, "\\x5c&")} 1'
a\x5c.b\x5c\x5cc\x5c^d
# case 2 alternate
$ echo 'a.b\c^d' | gawk '{gsub(/[^^\\]/, "[&]");
gsub(/\\/, "\\x5c\\x5c");
gsub(/\^/, "\\x5c^")} 1'
[a][.][b]\x5c\x5c[c]\x5c^[d]
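To tie it back to the original goal, here's a rough sketch of feeding the generated string to -F. The fs_escape function name is hypothetical (it isn't taken from rcut), and I'm assuming awk is gawk or mawk.

```shell
# fs_escape is a hypothetical helper name, not something taken from rcut.
# It applies the "case 1 alternate" substitutions to its first argument.
fs_escape() {
    printf '%s\n' "$1" |
        awk '{ gsub(/\\/, "\\x5c\\x5c"); gsub(/[{[(^$*?+.|]/, "\\x5c&") } 1'
}

fs=$(fs_escape '\.')
printf '%s\n' "$fs"

# Use the generated string as the field separator.
joined=$(printf '%s\n' 'x\2\.y\.z' | awk -F"$fs" -v OFS=, '{ $1 = $1 } 1')
printf '%s\n' "$joined"
```

For the \. input, the generated separator should be \x5c\x5c\x5c. and the second command should give x\2,y,z, same as the hand-written versions earlier.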
Sanity check
I probably lost my sanity trying to come up with a solution and again while writing this post. I did try a few sanity checks for the solutions presented here, but there's a chance I messed up or missed some corner case. If you spot an issue, do let me know.