awk idioms explained

Do you find awk one-liners cryptic? Stuff like !a[$0]++, 1, $1=$1, NR==FNR and -v RS=? You'll find examples and brief explanations for such idioms in this post.

The examples presented here have been tested with GNU awk. These are likely to work with most other implementations of awk as well.

awk command structure

awk 'cond1{action1} cond2{action2} ... condN{actionN}'

When a conditional expression isn't provided, the action is always executed. When an action isn't provided, the $0 variable (which has the contents of the current record being processed) is printed if the conditional expression evaluates to true.
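As a small sketch of these rules (the sample input is made up for illustration), the first command below has only a condition, the second has only an action, and the third has both:

```shell
# condition only: $0 is printed when the condition is true
printf 'apple\nbanana\ncherry\n' | awk 'NR==2'
# banana

# action only: the action runs for every record
printf 'apple\nbanana\ncherry\n' | awk '{print length($0)}'
# 5
# 6
# 6

# condition and action together
printf 'apple\nbanana\ncherry\n' | awk 'NR>1{print NR ": " $0}'
# 2: banana
# 3: cherry
```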

Regexp filtering

# same as: grep 'at' and sed -n '/at/p'
$ printf 'gate\napple\nwhat\nkite\n' | awk '/at/'
gate
what

# same as: grep -v 'e' and sed -n '/e/!p'
$ printf 'gate\napple\nwhat\nkite\n' | awk '!/e/'
what

The generic syntax is string ~ /regexp/ to check if the given string matches the regexp and string !~ /regexp/ to invert the condition.

/regexp/ is a shortcut for $0 ~ /regexp/{print $0}
!/regexp/ is a shortcut for $0 !~ /regexp/{print $0}
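The generic form is handy when the match should apply to a single field instead of the whole record. Here's a minimal sketch, using made-up two-column input, that filters on the second field only:

```shell
# keep records whose second field contains a digit
printf 'pen x1\ncup yy\nmug z9\n' | awk '$2 ~ /[0-9]/'
# pen x1
# mug z9

# invert the condition with !~
printf 'pen x1\ncup yy\nmug z9\n' | awk '$2 !~ /[0-9]/'
# cup yy
```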

Idiomatic use of 1

Non-zero numeric values and non-empty strings are truthy (zero and empty strings are falsy). Idiomatically, 1 is used as a conditional expression to print the contents of $0.

$ echo 'ring amazing jar' | awk '{sub(/ing/, "ed", $2)} 1'
ring amazed jar

$ seq 2 | awk 'BEGIN{print "---"} 1; END{print "==="}'
---
1
2
===
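Only the truthiness of the condition matters, so a constant condition either prints everything or nothing. A quick sketch with made-up input:

```shell
# 1 is always true, so every record is printed (similar to cat)
printf 'a\nb\n' | awk '1'
# a
# b

# 0 is always false, so nothing is printed
printf 'a\nb\n' | awk '0'
```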

Special variables

$0 contains the current record being processed
$1 first field
$2 second field and so on
FS input field separator
OFS output field separator
NF number of fields
RS input record separator
ORS output record separator
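NR number of records read so far across the entire input (i.e. the overall line number with the default RS)
FNR number of records read so far from the current file (resets to 1 for each new input file)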
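To see a few of these variables in action, here's a small sketch. The file names f1.txt and f2.txt and their contents are made up for illustration, and a temporary directory is used so that nothing is left behind.

```shell
tmp=$(mktemp -d)
printf 'a b c\nd e\n' > "$tmp/f1.txt"
printf 'x y\n' > "$tmp/f2.txt"

# NR keeps counting across files, FNR restarts for each file
awk '{print NR, FNR, NF, $1}' "$tmp/f1.txt" "$tmp/f2.txt"
# 1 1 3 a
# 2 2 2 d
# 3 1 2 x

# FS splits the input on ':' and OFS joins the printed fields with '-'
echo 'a:b:c' | awk -F: -v OFS=- '{print $1, $3}'
# a-c

rm -r "$tmp"
```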

Removing duplicates

awk '!a[$0]++' is one of the most famous awk one-liners. It eliminates line-based duplicates while retaining the input order.

$ cat purchases.txt
coffee
tea
washing powder
coffee
tea
coffee milkshake
soap
tea
washing soda

$ awk '{print +a[$0] "\t" $0; a[$0]++}' purchases.txt
0	coffee
0	tea
0	washing powder
1	coffee
1	tea
0	coffee milkshake
0	soap
2	tea
0	washing soda

# only the entries with zero in the first column will be retained
$ awk '!a[$0]++' purchases.txt
coffee
tea
washing powder
coffee milkshake
soap
washing soda

a[$0] creates an uninitialized element in array a with $0 as the key (if the key doesn't exist yet). Thus, !a[$0] will succeed only on the first occurrence of an item (since an uninitialized value is falsy) and the post-increment operator will ensure that further instances of an item will fail the conditional expression.
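The same logic works with any key, not just $0. As a sketch (the input and the array name seen are made up for illustration), using $1 as the key retains only the first record for each distinct first field:

```shell
# duplicates are decided by the first field alone
printf 'tea 5\ncoffee 3\ntea 9\nsoap 2\ncoffee 1\n' | awk '!seen[$1]++'
# tea 5
# coffee 3
# soap 2
```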

Rebuild $0

Sometimes you just want to change the field separator, or perform some record-level text processing and then print the result with a new field separator. In such cases, you have to assign to a field explicitly (for example, $1=$1), even if the value doesn't change. Otherwise, $0 won't be rebuilt using OFS.

$ s='sample123string42with777numbers'

$ echo "$s" | awk -F'[0-9]+' -v OFS=, '{$1=$1} 1'
sample,string,with,numbers

$ echo "$s" | awk -F'[0-9]+' -v OFS=- '{gsub(/[aeiou]/, ""); $1=$1} 1'
smpl-strng-wth-nmbrs
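For contrast, here's a sketch of what happens with and without the field assignment, reusing the same sample string:

```shell
s='sample123string42with777numbers'

# no field is assigned, so $0 is printed unchanged and OFS has no effect
echo "$s" | awk -F'[0-9]+' -v OFS=, '1'
# sample123string42with777numbers

# assigning a field rebuilds $0 with OFS between the fields
echo "$s" | awk -F'[0-9]+' -v OFS=, '{$1=$1} 1'
# sample,string,with,numbers

# OFS also applies between comma separated arguments to print
echo "$s" | awk -F'[0-9]+' -v OFS=, '{print $1, $2}'
# sample,string
```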

Paragraph mode

When RS is set to an empty string, one or more consecutive empty lines are used as the input record separator.

$ cat para.txt
hello world

hi there
how are you

just doing
believe it

banana
papaya
mango

much ado about nothing
he he he
adios amigo

# uninitialized variable 's' will be empty for the first match
# afterwards, 's' will provide the empty line separation
$ awk -v RS= '/do/{print s $0; s="\n"}' para.txt
just doing
believe it

much ado about nothing
he he he
adios amigo
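Here's another sketch of paragraph mode, with made-up input that has two consecutive empty lines between the first two paragraphs. Each paragraph is a single record, and setting the field separator to a newline turns each line of the paragraph into a field.

```shell
# record number and number of whitespace separated fields per paragraph
printf 'a b\nc\n\n\nd\n\ne f\ng h\n' | awk -v RS= '{print NR, NF}'
# 1 3
# 2 1
# 3 4

# with newline as the field separator, $1 is the first line of each paragraph
printf 'a b\nc\n\n\nd\n\ne f\ng h\n' | awk -v RS= -F'\n' '{print $1}'
# a b
# d
# e f
```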

Two file processing

For two files as input, NR==FNR will be true only when the first file is being processed. The next statement will skip the rest of the code for the current record.

$ cat marks.txt
dept    name    marks
ece     raj     53
ece     joel    72
eee     moi     68
cse     surya   81
eee     tia     59
ece     om      92
cse     amy     67

$ cat dept_mark.txt
ece 70
eee 65
cse 80

# match dept and minimum marks specified in dept_mark.txt
$ awk 'NR==FNR{d[$1]=$2; next}
       $1 in d && $3 >= d[$1]' dept_mark.txt marks.txt
ece     joel    72
eee     moi     68
cse     surya   81
ece     om      92

Warning: the NR==FNR logic will fail if the first file is empty. NR never gets a chance to increment, so NR==FNR remains true while the second file is being processed. You can set a flag after the first file has been processed to avoid this issue, for example: awk '!f{a[$0]; next} !($0 in a)' file1 f=1 file2. See this unix.stackexchange thread for more workarounds.
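The empty file issue can be reproduced with a small sketch. The file names empty.txt and f2.txt are made up for illustration and are created in a temporary directory.

```shell
tmp=$(mktemp -d)
: > "$tmp/empty.txt"
printf 'a\nb\n' > "$tmp/f2.txt"

# NR==FNR is true for f2.txt as well, so its lines are stored instead of printed
awk 'NR==FNR{a[$0]; next} !($0 in a)' "$tmp/empty.txt" "$tmp/f2.txt"
# (no output)

# f=1 is assigned only after the first file is done, even if it is empty
awk '!f{a[$0]; next} !($0 in a)' "$tmp/empty.txt" f=1 "$tmp/f2.txt"
# a
# b

rm -r "$tmp"
```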

Forcing string and numeric context

Strings are automatically converted to a number when used in an arithmetic expression (for example, "42" + 5). You can use the unary + and - operators to force numeric context. If the string doesn't start with a valid number (ignoring any leading whitespace), it will be treated as 0.

$ seq 3 | awk '{sum += $0} END{print sum}'
6
$ awk '{sum += $0} END{print sum}' /dev/null

$ awk '{sum += $0} END{print +sum}' /dev/null
0
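The conversion rule for strings can be sketched with a few made-up values: the leading numeric portion is used, leading whitespace is ignored, and a string without a leading number becomes 0.

```shell
awk 'BEGIN{print +"42abc"; print +"  3.5kg"; print +"abc"; print -"5"}'
# 42
# 3.5
# 0
# -5
```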

Similarly, you can concatenate a string to a number to force string context.

$ awk 'BEGIN{n1="5.0"; n2=5; if(n1==n2) print "equal"}'
$ awk 'BEGIN{n1="5.0"; n2=5; if(n1==n2".0") print "equal"}'
equal
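Forcing string context also matters when comparing input fields, since fields that look like numbers are compared numerically. In this sketch (the input and the variable names num and str are made up for illustration), concatenating an empty string forces a string comparison:

```shell
# 10 < 9 is false as numbers, but "10" < "9" is true as strings
echo '10 9' | awk '{num = ($1 < $2); str = ($1"" < $2""); print num, str}'
# 0 1
```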

See gawk manual: How awk Converts Between Strings and Numbers for more details.

Programming ebooks

Check out my ebooks on Regular Expressions, Linux CLI tools, Python and Vim. You can get them all as a single bundle via leanpub or gumroad.
