Built-in functions

You've already seen some built-in functions in detail, such as the sub, gsub and gensub functions. This chapter will discuss many more built-ins that are often used in one-liners. You'll also see more examples with arrays.

See gawk manual: Functions for details about all the built-in functions as well as how to define your own functions.

The example_files directory has all the files used in the examples.

length

The length function returns the number of characters for the given string argument. By default, it acts on the $0 variable. Numeric arguments will be automatically converted to strings.

$ awk 'BEGIN{print length("road"); print length(123456)}'
4
6

# recall that the record separator isn't part of $0
# so, line ending won't be counted here
$ printf 'fox\ntiger\n' | awk '{print length()}'
3
5

$ awk 'length($1) < 6' table.txt
brown bread mat hair 42
blue cake mug shirt -7

The -b command line option is handy if you need the number of bytes, instead of the number of characters. Locale also plays a role.

$ echo 'αλεπού' | awk '{print length()}'
6
$ echo 'αλεπού' | awk -b '{print length()}'
12
$ echo 'αλεπού' | LC_ALL=C awk '{print length()}'
12

For the above illustration, you can also use match($0, /$/)-1 to get the byte count, irrespective of the locale or the use of the -b option. This solution was suggested in this issue.

Array sorting

By default, array looping with the for(key in array) format gives you elements in random order. By setting a special value to PROCINFO["sorted_in"], you can control the order in which you wish to retrieve the elements. See gawk manual: Using Predefined Array Scanning Orders for other options and details.

# by default, array is traversed in random order
$ awk 'BEGIN{a["z"]=1; a["x"]=12; a["b"]=42; for(i in a) print i, a[i]}'
x 12
z 1
b 42

# index (i.e. keys) sorted in ascending order as strings
$ awk 'BEGIN{PROCINFO["sorted_in"] = "@ind_str_asc";
       a["z"]=1; a["x"]=12; a["b"]=42; for(i in a) print i, a[i]}'
b 42
x 12
z 1

# value sorted in ascending order as numbers
$ awk 'BEGIN{PROCINFO["sorted_in"] = "@val_num_asc";
       a["z"]=1; a["x"]=12; a["b"]=42; for(i in a) print i, a[i]}'
z 1
x 12
b 42

Here's an example of sorting input lines in ascending order based on the second column, treating the data as strings.

$ awk 'BEGIN{PROCINFO["sorted_in"] = "@ind_str_asc"}
       {a[$2]=$0} END{for(k in a) print a[k]}' table.txt
yellow banana window shoes 3.14
brown bread mat hair 42
blue cake mug shirt -7

split

The split function provides the same features as the record splitting done using FS. This is helpful when you need the results as an array for some reason, for example to use array sorting features. Or, when you need to further split a field content. split accepts four arguments, the last two being optional:

First argument is the string to be split
Second argument is the array variable that saves the results
Third argument is the separator, whose default is FS

The return value of the split function is number of fields, similar to the NF variable. The array gets indexed starting from 1 for the first element, 2 for the second element and so on. If the array already had some value, it gets overwritten with the new result.

# same as: awk '{print $2}'
$ printf '     one \t two\t\t\tthree  ' | awk '{split($0, a); print a[2]}'
two

# example with both FS and split in action
$ s='Joe,1996-10-25,64,78'
$ echo "$s" | awk -F, '{split($2, d, "-"); print $1 " was born in " d[1]}'
Joe was born in 1996

# single row to multiple rows based on splitting the last field
$ s='air,water,12:42:3'
$ echo "$s" | awk -F, '{n=split($NF, a, ":");
                       for(i=1; i<=n; i++) print $1, $2, a[i]}'
air water 12
air water 42
air water 3

Similar to FS, you can use a regular expression as the separator.

$ s='Sample123string42with777numbers'
$ echo "$s" | awk '{split($0, s, /[0-9]+/); print s[2], s[4]}'
string numbers

The fourth argument provides a feature not present with FS splitting. It allows you to save the portions matched by the separator in an array.

$ s='Sample123string42with777numbers'
$ echo "$s" | awk '{n=split($0, s, /[0-9]+/, seps);
                   for(i=1; i<n; i++) print seps[i]}'
123
42
777

Quoting from gawk manual: split():
If fieldsep is a single space, then any leading whitespace goes into seps[0] and any trailing whitespace goes into seps[n], where n is the return value of split() (i.e., the number of elements in array).

Here's an example where split helps to initialize an array using an empty separator. Unlike $N syntax where an expression resulting in a floating-point number is acceptable, array index has to be an integer only. Hence, the int function is used to convert the floating-point result to an integer in the example below.

$ cat marks.txt
Dept    Name    Marks
ECE     Raj     53
ECE     Joel    72
EEE     Moi     68
CSE     Surya   81
EEE     Tia     59
ECE     Om      92
CSE     Amy     67

# adds a new grade column based on marks in the third column
$ awk 'BEGIN{OFS="\t"; split("DCBAS", g, //)}
       {$(NF+1) = NR==1 ? "Grade" : g[int($NF/10)-4]} 1' marks.txt
Dept    Name    Marks   Grade
ECE     Raj     53      D
ECE     Joel    72      B
EEE     Moi     68      C
CSE     Surya   81      A
EEE     Tia     59      D
ECE     Om      92      S
CSE     Amy     67      C

patsplit

The patsplit function will give you the features provided by FPAT. The argument order and optional arguments is same as the split function, with FPAT as the default separator. The return value is number of fields obtained from the split.

$ s='eagle,"fox,42",bee,frog'

$ echo "$s" | awk '{patsplit($0, a, /"[^"]*"|[^,]*/); print a[2]}'
"fox,42"

substr

The substr function helps to extract a specified number of characters from an input string based on indexing. The argument order is:

First argument is the input string
Second argument is the starting position
Third argument is the number of characters to extract

The index starts from 1. If the third argument is not specified, by default all characters until the end of the string is extracted. If the second argument is greater than the length of the string or if the third argument is less than or equal to 0, then an empty string is returned. The second argument will be converted to 1 if a number less than one is specified.

$ echo 'abcdefghij' | awk '{print substr($0, 1, 5)}'
abcde
$ echo 'abcdefghij' | awk '{print substr($0, 4, 3)}'
def

$ echo 'abcdefghij' | awk '{print substr($0, 6)}'
fghij

$ echo 'abcdefghij' | awk -v OFS=: '{print substr($0, 2, 3), substr($0, 6, 3)}'
bcd:fgh

If only a few characters are needed from the input record, you can also use empty FS.

$ echo 'abcdefghij' | awk -v FS= '{print $3}'
c
$ echo 'abcdefghij' | awk -v FS= '{print $3, $5}'
c e

match

The match function is useful to extract portion of an input string matched by a regexp. There are two ways to get the matched portion:

by using the substr function along with special variables RSTART (starting position of the match) and RLENGTH (length of the match)
by passing a third argument to match so that the results are available from an array

The first argument to match is the input string and the second one is the regexp. If the match fails, then RSTART gets 0 and RLENGTH gets -1. Return value is same as RSTART.

$ s='051 035 154 12 26 98234 3'

# using substr and RSTART/RLENGTH
# match a number with >= 4 digits
$ echo "$s" | awk 'match($0, /[0-9]{4,}/){print substr($0, RSTART, RLENGTH)}'
98234

# using array, note that index 0 is used here, not 1
# match a number >= 100 (with optional leading zeros)
$ echo "$s" | awk 'match($0, /0*[1-9][0-9]{2,}/, m){print m[0]}'
154

Both the above examples can also be easily solved using FPAT or patsplit. match has an advantage when it comes to getting portions matched only within capture groups. The first element of the array will still have the entire match. The second element will contain the portion matched by the first group, the third one will contain the portion matched by the second group and so on. See also stackoverflow: arithmetic replacement in a text file.

# entire matched portion
$ echo 'apple=42, fig=314' | awk 'match($0, /fig=([0-9]+)/, m){print m[0]}'
fig=314
# matched portion of the first capture group
$ echo 'apple=42, fig=314' | awk 'match($0, /fig=([0-9]+)/, m){print m[1]}'
314

If you need to get matching portions for all the matches instead of just the first match, you can use a loop and adjust the input string every iteration.

# extract numbers only if it is followed by a comma
$ s='42 apple-5, fig3; x-83, y-20: f12'
$ echo "$s" | awk '{ while( match($0, /([0-9]+),/, m) ){print m[1];
                   $0=substr($0, RSTART+RLENGTH)} }'
5
83

index

The index function is useful when you need to match a string literally. This is similar to the grep -F functionality of matching fixed strings. The first argument to this function is the input string and the second one is the string to be matched literally. The return value is the index of the matching location and 0 if there is no match.

$ cat eqns.txt
a=b,a-b=c,c*d
a+b,pi=3.14,5e12
i*(t+9-g)/8,4-a+b

# no output because the metacharacters aren't escaped
$ awk '/i*(t+9-g)/' eqns.txt
# same as: grep -F 'i*(t+9-g)' eqns.txt
$ awk 'index($0, "i*(t+9-g)")' eqns.txt
i*(t+9-g)/8,4-a+b

# check only the last field
$ awk -F, 'index($NF, "a+b")' eqns.txt
i*(t+9-g)/8,4-a+b
# index not needed if the entire field/line is being compared
$ awk -F, '$1=="a+b"' eqns.txt
a+b,pi=3.14,5e12

The return value is useful to ensure that the match is found at specific positions only. For example, the start or end of the string.

# start of string
$ awk 'index($0, "a+b")==1' eqns.txt
a+b,pi=3.14,5e12

# end of string
$ awk -v s="a+b" 'index($0, s)==length()-length(s)+1' eqns.txt
i*(t+9-g)/8,4-a+b

Recall that the -v option gets parsed by awk's string processing rules. So, if you need to pass a literal string without falling in backslash hell, use ENVIRON instead.

$ printf '%s\n' 'a\b\c\d' | awk -v s='a\b' 'index($0, s)'
$ printf '%s\n' 'a\b\c\d' | awk -v s='a\\b' 'index($0, s)'
a\b\c\d
$ printf '%s\n' 'a\b\c\d' | s='a\b' awk 'index($0, ENVIRON["s"])'
a\b\c\d

system

External commands can be issued using the system function. Any output generated by the external command would be as usual on stdout unless redirected while calling the command.

$ awk 'BEGIN{system("echo Hello World")}'
Hello World

$ wc table.txt
 3 15 79 table.txt
$ awk 'BEGIN{system("wc table.txt")}'
 3 15 79 table.txt

$ awk 'BEGIN{system("seq 10 | paste -sd, > out.txt")}'
$ cat out.txt
1,2,3,4,5,6,7,8,9,10

$ cat t2.txt
I bought two balls and 3 bats
$ echo 'f1,t2,f3' | awk -F, '{system("cat " $2 ".txt")}'
I bought two balls and 3 bats

The return value of system depends on the exit status of the executed command. See gawk manual: Input/Output Functions for details.

$ ls xyz.txt
ls: cannot access 'xyz.txt': No such file or directory
$ echo $?
2

$ awk 'BEGIN{s=system("ls xyz.txt"); print "Exit status: " s}'
ls: cannot access 'xyz.txt': No such file or directory
Exit status: 2

printf and sprintf

The printf function is useful over the print function when you need to format the data before printing. Another difference is that OFS and ORS do not affect the printf function. The formatting features are similar to those found in the C programming language and the printf shell built-in command.

# OFMT controls the formatting for numbers displayed with the print function
$ awk 'BEGIN{print OFMT}'
%.6g
$ awk 'BEGIN{sum = 3.1428 + 100; print sum}'
103.143
$ awk 'BEGIN{OFMT="%.5f"; sum = 3.1428 + 100; print sum}'
103.14280

# using printf function
# note the use of \n as ORS isn't appended unlike print
$ awk 'BEGIN{sum = 3.1428 + 10; printf "%f\n", sum}'
13.142800
$ awk 'BEGIN{sum = 3.1428 + 10; printf "%.3f\n", sum}'
13.143

Here are some more formatting examples for floating-point numbers.

# total length is 10, filled with space if needed
# [ and ] are used here for visualization purposes
$ awk 'BEGIN{pi = 3.14159; printf "[%10.3f]\n", pi}'
[     3.142]
$ awk 'BEGIN{pi = 3.14159; printf "[%-10.3f]\n", pi}'
[3.142     ]

# zero filled
$ awk 'BEGIN{pi = 3.14159; printf "%010.3f\n", pi}'
000003.142

# scientific notation
$ awk 'BEGIN{pi = 3.14159; printf "%e\n", pi}'
3.141590e+00

Here are some formatting examples for integers.

# note that there is no rounding
$ awk 'BEGIN{printf "%d\n", 1.99}'
1

# ensure there's always a sign prefixed for integers
$ awk 'BEGIN{printf "%+d\n", 100}'
+100
$ awk 'BEGIN{printf "%+d\n", -100}'
-100

Here are some formatting examples for strings.

# prefix remaining width with spaces
$ awk 'BEGIN{printf "|%10s|\n", "mango"}'
|     mango|

# suffix remaining width with spaces
$ awk 'BEGIN{printf "|%-10s|\n", "mango"}'
|mango     |

# truncate
$ awk '{printf "%.4s\n", $0}' table.txt
brow
blue
yell

You can also refer to an argument using N$ format, where N is the positional number of argument. One advantage with this method is that you can reuse an argument any number of times. You cannot mix this format with the normal way.

$ awk 'BEGIN{printf "%1$d + %2$d * %1$d = %3$d\n", 3, 4, 15}'
3 + 4 * 3 = 15
# remove # if you do not need the prefix
$ awk 'BEGIN{printf "hex=%1$#x\noct=%1$#o\ndec=%1$d\n", 15}'
hex=0xf
oct=017
dec=15

You can pass variables by specifying a * instead of a number in the formatting string.

# same as: awk 'BEGIN{pi = 3.14159; printf "%010.3f\n",  pi}'
$ awk 'BEGIN{d=10; p=3; pi = 3.14159; printf "%0*.*f\n", d, p, pi}'
000003.142

Passing a variable directly to printf without using a format specifier can result in an error depending upon the contents of the variable.
$ awk 'BEGIN{s="solve: 5 % x = 1"; printf s}'
awk: cmd. line:1: fatal: not enough arguments to satisfy format string
        `solve: 5 % x = 1'
                   ^ ran out for this one

So, as a good practice, always use variables with an appropriate format instead of passing it directly to printf.

$ awk 'BEGIN{s="solve: 5 % x = 1"; printf "%s\n", s}'
solve: 5 % x = 1

If % has to be used literally inside the format specifier, use %%. This is similar to using \\ in regexps to represent \ literally.

$ awk 'BEGIN{printf "n%%d gives the remainder\n"}'
n%d gives the remainder

To save the results of the formatting in a variable instead of printing, use the sprintf function. Unlike printf, parentheses are always required to use this function.

$ awk 'BEGIN{pi = 3.14159; s = sprintf("%010.3f", pi); print s}'
000003.142

See gawk manual: printf for complete list of formatting options and other details.

Redirecting print output

The results from the print and printf functions can be redirected to a shell command or a file instead of stdout. There's nothing special about it, you could have done it using shell redirections as well. The use case arises when you need to redirect only a specific portion or if you need multiple redirections within the same awk command. Here are some examples of redirecting to multiple files.

$ seq 6 | awk 'NR%2{print > "odd.txt"; next} {print > "even.txt"}'
$ cat odd.txt
1
3
5
$ cat even.txt
2
4
6

# dynamically creating filenames
$ awk -v OFS='\t' 'NR>1{print $2, $3 > $1".txt"}' marks.txt
# output for one of the departments
$ cat ECE.txt
Raj     53
Joel    72
Om      92

Note that the use of > doesn't mean that the file will get overwritten everytime. That happens only once if the file already existed prior to executing the awk command. Use >> if you wish to append to already existing files.

As seen in the above examples, the filenames are passed as string expressions. To redirect to a shell command, again you need to pass a string expression after the | pipe symbol. Here's an example:

$ awk '{print $2 | "paste -sd,"}' table.txt
bread,cake,banana

And here are some examples with multiple redirections.

$ awk '{print $2 | "sort | paste -sd,"}' table.txt
banana,bread,cake

# sort the output before writing to files
$ awk -v OFS='\t' 'NR>1{print $2, $3 | "sort > "$1".txt"}' marks.txt
# output for one of the departments
$ cat ECE.txt
Joel    72
Om      92
Raj     53

See gawk manual: Redirecting Output of print and printf for more details and operators on redirections. And see gawk manual: Closing Input and Output Redirections if you have too many redirections.

Summary

This chapter covered some of the built-in functions provided by awk. Do check the manual for more of them, for example math and time related functions.

Next chapter will cover features related to processing multiple files passed as input to awk.

Exercises

The exercises directory has all the files used in this section.

Exercises will also include functions and features not discussed in this chapter. Refer to gawk manual: Functions for details.

1) For the input file scores.csv, sort the rows in descending order based on the values in the Physics column. Header should be retained as the first line in the output.

$ awk ##### add your solution here
Name,Maths,Physics,Chemistry
Ith,100,100,100
Cy,97,98,95
Lin,78,83,80
Er,56,79,92
Ort,68,72,66
Blue,67,46,99

2) For the input file nums3.txt, calculate the square root of numbers and display the results in two different formats as shown below. First, with four digits after the fractional point and then in the scientific notation, again with four digits after the fractional point. Assume that the input has only a single column of positive numbers.

$ cat nums3.txt
3.14
4201
777
0323012

$ awk ##### add your solution here
1.7720
64.8151
27.8747
568.3414

$ awk ##### add your solution here
1.7720e+00
6.4815e+01
2.7875e+01
5.6834e+02

3) For the input file items.txt, assume space as the field separator. From the second field, remove the second : character and the number that follows. Modify the last field by multiplying it by the number that was deleted from the second field.

$ cat items.txt
apple rxg:12:-425 og 6.2
fig zwt:3.64:12.89e2 ljg 5
banana ysl:42:3.14 vle 45

$ awk ##### add your solution here
apple rxg:12 og -2635
fig zwt:3.64 ljg 6445
banana ysl:42 vle 141.3

4) For the input file sum.txt, assume space as the field separator. Replace the second field with the sum of the two numbers embedded in it. The numbers can be positive/negative integers or floating-point numbers but not scientific notation.

$ cat sum.txt
f2:z3 kt//-42\\3.14//tw 5y6
t5:x7 qr;wq<=>+10{-8764.124}yb u9
apple:fig 100:32 9j4

$ awk ##### add your solution here
f2:z3 -38.86 5y6
t5:x7 -8754.12 u9
apple:fig 132 9j4

5) For the given input strings, extract portion of the line starting from the matching location specified by the shell variable s till the end of the line. If there is no match, do not print that line. The contents of s should be matched literally.

$ s='(a^b)'
$ echo '3*f + (a^b) - 45' | ##### add your solution here
(a^b) - 45

$ s='\&/'
# should be no output for this input
$ printf '%s\n' 'f\&z\&2.14' | ##### add your solution here
# but this one has a match
$ printf '%s\n' 'f\&z\&/2.14' | ##### add your solution here
\&/2.14

6) Extract all positive integers preceded by - and followed by : or ;. Display the matching portions separated by a newline character.

$ s='42 apple-5; fig3; x-83, y-20:-34; f12'
$ echo "$s" | awk ##### add your solution here
5
20
34

7) For the input file scores.csv, calculate the average score for each row. Those with average greater than or equal to 80 should be saved in pass.csv and the rest in fail.csv. The output files should have the names followed by a tab character, and finally the average score (two decimal points).

$ awk ##### add your solution here

$ cat fail.csv
Blue    70.67
Er      75.67
Ort     68.67
$ cat pass.csv
Lin     80.33
Cy      96.67
Ith     100.00

8) For the input file files.txt, replace lines starting with a space with the output of that line executed as a shell command.

$ cat files.txt
 sed -n '2p' addr.txt
-----------
 wc -w sample.txt
===========
 awk '{print $1}' table.txt
-----------

$ awk ##### add your solution here
How are you
-----------
31 sample.txt
===========
brown
blue
yellow
-----------

9) For the input file fw.txt, format the last column in scientific notation with two digits after the decimal point.

$ awk ##### add your solution here
1.3  rs   90  1.35e-01
3.8           6.00e+00
5.2  ye       8.24e+00
4.2  kt   32  4.51e+01

10) For the input file addr.txt, display all lines containing e or u but not both.

Hint — gawk manual: Bit-Manipulation Functions.

$ awk ##### add your solution here
Hello World
This game is good
Today is sunny

11) For the input file patterns.txt, filter lines containing [5] at the start of a line. The search term should be matched literally.

$ awk ##### add your solution here
[5]*3

12) For the input file table.txt, uppercase the third field.

$ awk ##### add your solution here
brown bread MAT hair 42
blue cake MUG shirt -7
yellow banana WINDOW shoes 3.14

13) For the input files patterns.txt and sum.txt, match lines containing the literal value stored in the s variable. Assume that the s variable has regexp metacharacters.

$ s='[5]'
##### add your solution here
(9-2)*[5]
[5]*3

$ s='\\'
##### add your solution here
f2:z3 kt//-42\\3.14//tw 5y6

CLI text processing with GNU awk