GNU awk idioms explained
Do you find awk one-liners cryptic? Stuff like !a[$0]++, 1, $1=$1, NR==FNR and -v RS=? You'll find examples and brief explanations for such idioms in this post.
awk command structure
awk 'cond1{action1} cond2{action2} ... condN{actionN}'
When a conditional expression isn't provided, the action is always executed. When an action isn't provided, the $0 variable (which has the contents of the current record being processed) is printed if the conditional expression evaluates to true.
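Here's a small sketch of those defaults. The sample input and the variable names are made up for illustration, and the expected results in the comments assume a POSIX-compatible awk.

```shell
# sample input invented for this illustration
input=$(printf 'apple 5\nfig 12\nkiwi 8\n')

# condition only: $0 is printed whenever the condition is true
cond_only=$(printf '%s\n' "$input" | awk '$2 > 6')
# fig 12
# kiwi 8

# action only: the action runs for every record
action_only=$(printf '%s\n' "$input" | awk '{print $1}')
# apple
# fig
# kiwi

# condition and action together
both=$(printf '%s\n' "$input" | awk '$2 > 6{print $1}')
# fig
# kiwi

printf '%s\n---\n%s\n---\n%s\n' "$cond_only" "$action_only" "$both"
```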
Regexp filtering
# same as: grep 'at' and sed -n '/at/p'
$ printf 'gate\napple\nwhat\nkite\n' | awk '/at/'
gate
what
# same as: grep -v 'e' and sed -n '/e/!p'
$ printf 'gate\napple\nwhat\nkite\n' | awk '!/e/'
what
The generic syntax is string ~ /regexp/ to check if the given string matches the regexp and string !~ /regexp/ to invert the condition.
/regexp/ is a shortcut for $0 ~ /regexp/{print $0}
!/regexp/ is a shortcut for $0 !~ /regexp/{print $0}
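The generic form is handy when you want to match a particular field instead of the whole record. A minimal sketch, with sample input made up for this illustration (note that a plain /at/ would match all three lines here):

```shell
# match the regexp against the second field only, not the whole record
field_match=$(printf 'cat gate\nmat kite\nhut what\n' | awk '$2 ~ /at/')
# cat gate
# hut what

# inverted test: records whose second field does NOT contain "at"
field_nomatch=$(printf 'cat gate\nmat kite\nhut what\n' | awk '$2 !~ /at/')
# mat kite

printf '%s\n---\n%s\n' "$field_match" "$field_nomatch"
```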
Idiomatic use of 1
Non-zero numeric values and non-empty strings are truthy (zero and empty strings are falsy). Idiomatically, 1 is used as a conditional expression to print the contents of $0.
$ echo 'ring amazing jar' | awk '{sub(/ing/, "ed", $2)} 1'
ring amazed jar
$ seq 2 | awk 'BEGIN{print "---"} 1; END{print "==="}'
---
1
2
===
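To see the truthiness rule by itself, here's a rough sketch that uses constants as the only conditional expression. The variable names are just for illustration.

```shell
# a non-zero number is truthy: every record is printed
truthy_num=$(printf '1\n2\n' | awk '1')

# zero is falsy: nothing is printed
falsy_num=$(printf '1\n2\n' | awk '0')

# a non-empty string constant is truthy
truthy_str=$(printf '1\n2\n' | awk '"a"')

# an empty string constant is falsy
falsy_str=$(printf '1\n2\n' | awk '""')

printf '[%s] [%s] [%s] [%s]\n' "$truthy_num" "$falsy_num" "$truthy_str" "$falsy_str"
```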
Special variables
$0: the current record being processed
$1: first field
$2: second field, and so on
FS: input field separator
OFS: output field separator
NF: number of fields
RS: input record separator
ORS: output record separator
NR: number of records read so far across the entire input (i.e. the line number when records are lines)
FNR: number of records read so far in the current file
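A quick sketch that puts a few of these variables to work. The input data and the temporary file names f1 and f2 are assumptions made up for this example.

```shell
# -F sets FS; OFS is used between the comma-separated arguments of print
fields=$(printf 'a:b:c\nd:e\n' | awk -F: -v OFS=- '{print NR, NF, $1, $NF}')
# 1-3-a-c
# 2-2-d-e

# NR keeps counting across files, FNR restarts for each file
tmp=$(mktemp -d)
printf 'x\ny\n' > "$tmp/f1"
printf 'z\n' > "$tmp/f2"
counts=$(awk '{print NR, FNR, $0}' "$tmp/f1" "$tmp/f2")
# 1 1 x
# 2 2 y
# 3 1 z
rm -r "$tmp"

printf '%s\n---\n%s\n' "$fields" "$counts"
```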
Removing duplicates
awk '!a[$0]++' is one of the most famous awk one-liners. It eliminates line-based duplicates while retaining the input order.
$ cat purchases.txt
coffee
tea
washing powder
coffee
tea
coffee milkshake
soap
tea
washing soda
$ awk '{print +a[$0] "\t" $0; a[$0]++}' purchases.txt
0 coffee
0 tea
0 washing powder
1 coffee
1 tea
0 coffee milkshake
0 soap
2 tea
0 washing soda
# only the entries with zero in the first column will be retained
$ awk '!a[$0]++' purchases.txt
coffee
tea
washing powder
coffee milkshake
soap
washing soda
a[$0] creates an uninitialized element in array a with $0 as the key (if the key doesn't exist yet). Thus, !a[$0] will succeed only on the first occurrence of an item (since an uninitialized value is falsy) and the post-increment operator will ensure that further instances of an item will fail the conditional expression.
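The same counting trick isn't limited to $0. As a sketch (the sample data and the array name seen are made up here), you can use a field as the key, or compare the counter against a specific value:

```shell
# sample data invented for this illustration
data=$(printf 'tea 10\ncoffee 20\ntea 30\nsoap 5\ncoffee 7\n')

# keep only the first record for each distinct value of the first field
first_per_key=$(printf '%s\n' "$data" | awk '!seen[$1]++')
# tea 10
# coffee 20
# soap 5

# post-increment returns the old count, so this is true only on the
# second occurrence: each duplicated key is reported exactly once
dup_keys=$(printf '%s\n' "$data" | awk 'seen[$1]++ == 1{print $1}')
# tea
# coffee

printf '%s\n---\n%s\n' "$first_per_key" "$dup_keys"
```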
Rebuild $0
Sometimes you just want to change the field separator, or perform some record-level text processing and then print it with a new field separator. In such cases, you'll have to explicitly fake a field assignment such as $1=$1. Otherwise, $0 won't be rebuilt using the new output field separator.
$ s='sample123string42with777numbers'
$ echo "$s" | awk -F'[0-9]+' -v OFS=, '{$1=$1} 1'
sample,string,with,numbers
$ echo "$s" | awk -F'[0-9]+' -v OFS=- '{gsub(/[aeiou]/, ""); $1=$1} 1'
smpl-strng-wth-nmbrs
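For contrast, here's a sketch of what happens with and without the assignment. The whitespace example at the end is an extra illustration, not something from the examples above.

```shell
s='sample123string42with777numbers'

# no field is modified, so $0 is printed as is and OFS is never applied
unchanged=$(echo "$s" | awk -F'[0-9]+' -v OFS=, '1')
# sample123string42with777numbers

# assigning to a field rebuilds $0 by joining all the fields with OFS
rebuilt=$(echo "$s" | awk -F'[0-9]+' -v OFS=, '{$1=$1} 1')
# sample,string,with,numbers

# with default FS and OFS, rebuilding trims and squeezes whitespace
squeezed=$(echo '   a   b  c ' | awk '{$1=$1} 1')
# a b c

printf '%s\n%s\n%s\n' "$unchanged" "$rebuilt" "$squeezed"
```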
Paragraph mode
When RS is set to an empty string, one or more consecutive empty lines are used as the input record separator.
$ cat para.txt
hello world
hi there
how are you

just doing
believe it

banana
papaya
mango

much ado about nothing
he he he
adios amigo
# uninitialized variable 's' will be empty for the first match
# afterwards, 's' will provide the empty line separation
$ awk -v RS= '/do/{print s $0; s="\n"}' para.txt
just doing
believe it

much ado about nothing
he he he
adios amigo
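Paragraph mode pairs nicely with a newline as the field separator, so that each line of a paragraph becomes a field. A rough sketch, with input made up for this illustration (note the two consecutive empty lines acting as a single separator):

```shell
# two paragraphs separated by two empty lines
paras=$(printf 'a b\nc\n\n\nd e f\n')

# RS= enables paragraph mode; -F'\n' makes each line a separate field
summary=$(printf '%s\n' "$paras" |
    awk -v RS= -F'\n' '{print NR ": " NF " line(s), first: " $1}')
# 1: 2 line(s), first: a b
# 2: 1 line(s), first: d e f

printf '%s\n' "$summary"
```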
Two file processing
For two files as input, NR==FNR will be true only when the first file is being processed. The next statement will skip the rest of the code for the current record.
$ cat marks.txt
dept name marks
ece raj 53
ece joel 72
eee moi 68
cse surya 81
eee tia 59
ece om 92
cse amy 67
$ cat dept_mark.txt
ece 70
eee 65
cse 80
# match dept and minimum marks specified in dept_mark.txt
$ awk 'NR==FNR{d[$1]=$2; next}
$1 in d && $3 >= d[$1]' dept_mark.txt marks.txt
ece joel 72
eee moi 68
cse surya 81
ece om 92
Note that the NR==FNR logic will fail if the first file is empty, since NR wouldn't get a chance to increment. You can set a flag after the first file has been processed to avoid this issue. For example: awk '!f{a[$0]; next} !($0 in a)' file1 f=1 file2. See this unix.stackexchange thread for more workarounds.
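Here's a sketch of that failure and the flag-based workaround. The temporary file names empty.txt and items.txt are made up for this example.

```shell
tmp=$(mktemp -d)
: > "$tmp/empty.txt"                 # first file has no records
printf 'a\nb\n' > "$tmp/items.txt"

# NR==FNR is still true while reading the second file, so every line
# is consumed by the first block and nothing gets printed
nr_fnr=$(awk 'NR==FNR{a[$0]; next} !($0 in a)' "$tmp/empty.txt" "$tmp/items.txt")

# f=1 is assigned only after the first file argument has been handled,
# so the first block is skipped for the second file
flagged=$(awk '!f{a[$0]; next} !($0 in a)' "$tmp/empty.txt" f=1 "$tmp/items.txt")
# a
# b

rm -r "$tmp"
printf '[%s]\n---\n%s\n' "$nr_fnr" "$flagged"
```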
Forcing string and numeric context
Strings are automatically converted to a number when used in an arithmetic expression (for example, "42" + 5). You can use the unary + and - operators to force numeric context. If the string doesn't start with a valid number (ignoring any leading whitespace), it will be treated as 0.
$ seq 3 | awk '{sum += $0} END{print sum}'
6
$ awk '{sum += $0} END{print sum}' /dev/null

$ awk '{sum += $0} END{print +sum}' /dev/null
0
Similarly, you can concatenate a string to a number to force string context.
$ awk 'BEGIN{n1="5.0"; n2=5; if(n1==n2) print "equal"}'
$ awk 'BEGIN{n1="5.0"; n2=5; if(n1==n2".0") print "equal"}'
equal
See gawk manual: How awk Converts Between Strings and Numbers for more details.
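A short sketch of both conversions. The variable names and values are made up for illustration.

```shell
# unary + and - convert the leading numeric portion of a string, if any
coerced=$(awk 'BEGIN{s1="42abc"; s2="  3.5kg"; s3="abc"; s4="7"
                     print +s1, +s2, +s3, -s4}')
# 42 3.5 0 -7

# concatenating an empty string forces string comparison:
# 10 < 9 is false for numbers, but "10" < "9" is true for strings
compared=$(awk 'BEGIN{x=10; y=9
                      r1 = (x < y) ? "less" : "not less"
                      r2 = ((x "") < (y "")) ? "less" : "not less"
                      print "numeric: " r1; print "string: " r2}')
# numeric: not less
# string: less

printf '%s\n%s\n' "$coerced" "$compared"
```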