Built-in functions

You've already seen some built-in functions in detail, such as sub, gsub and gensub. This chapter will discuss many more built-ins that are often used in one-liners. You'll also see more examples with arrays.

info See gawk manual: Functions for details about all the built-in functions as well as how to define your own functions.

length

The length function returns the number of characters in the given string argument. By default, it acts on $0, and a numeric argument is automatically converted to a string.

$ awk 'BEGIN{print length("road"); print length(123456)}'
4
6

$ # recall that record separator isn't part of $0
$ # so, line ending won't be counted here
$ printf 'fox\ntiger\n' | awk '{print length()}'
3
5

$ awk 'length($1) < 6' table.txt
brown bread mat hair 42
blue cake mug shirt -7
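
Since length returns a number, it can also be used in comparisons and assignments. Here's a minimal sketch that reports the longest input line along with its length. The input is made up for illustration, and the names max, line and result are arbitrary.

```shell
# made-up input: three short lines
result=$(printf 'fox\ntiger\nant\n' |
    awk 'length() > max {max = length(); line = $0} END{print max, line}')
echo "$result"
```

For this input, the output is 5 tiger. If two lines share the maximum length, this sketch keeps the first one.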

If you need the number of bytes instead of the number of characters, use the -b command line option. The locale also plays a role.

$ echo 'αλεπού' | awk '{print length()}'
6
$ echo 'αλεπού' | awk -b '{print length()}'
12
$ echo 'αλεπού' | LC_ALL=C awk '{print length()}'
12

Array sorting

By default, looping over an array with the for(key in array) format gives you the elements in no particular order (the order depends on awk's internal implementation). By assigning a special value to PROCINFO["sorted_in"], you can control the order in which the elements are retrieved. See gawk manual: Using Predefined Array Scanning Orders for other options and details.

$ # by default, array is traversed in no particular order
$ awk 'BEGIN{a["z"]=1; a["x"]=12; a["b"]=42; for(i in a) print i, a[i]}'
x 12
z 1
b 42

$ # index (i.e. keys) sorted in ascending order as strings
$ awk 'BEGIN{PROCINFO["sorted_in"] = "@ind_str_asc";
       a["z"]=1; a["x"]=12; a["b"]=42; for(i in a) print i, a[i]}'
b 42
x 12
z 1

$ # value sorted in ascending order as numbers
$ awk 'BEGIN{PROCINFO["sorted_in"] = "@val_num_asc";
       a["z"]=1; a["x"]=12; a["b"]=42; for(i in a) print i, a[i]}'
z 1
x 12
b 42

Here's an example of sorting input lines in ascending order based on the second column, treating the data as strings.

$ awk 'BEGIN{PROCINFO["sorted_in"] = "@ind_str_asc"}
       {a[$2]=$0} END{for(k in a) print a[k]}' table.txt
yellow banana window shoes 3.14
brown bread mat hair 42
blue cake mug shirt -7
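
One thing to keep in mind with the a[$2]=$0 approach is that array keys are unique, so lines sharing the same second column value overwrite each other and only the last such line survives. Here's a minimal sketch of that effect, using made-up input in which two lines have cake as the second field.

```shell
# made-up input: two lines share "cake" as the second field
result=$(printf 'blue cake\nred cake\nbrown bread\n' |
    awk '{a[$2] = $0}
         END{n = 0; for(k in a) n++; print n " keys; cake -> " a["cake"]}')
echo "$result"
```

Three input lines produce only two keys here, and the output is 2 keys; cake -> red cake.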

split

The split function provides the same features as the record splitting done using FS. This is helpful when you need the results as an array, for example to use the array sorting features, or when you need to further split the contents of a field. split accepts four arguments, the last two being optional.

  • First argument is the string to be split
  • Second argument is the array variable to save results
  • Third argument is the separator, whose default is FS
  • Fourth argument is the array variable to save the portions matched by the separator

The return value of the split function is the number of fields, similar to the NF variable. The array is indexed starting from 1 for the first element, 2 for the second element and so on. Any existing contents of the array are cleared before the new elements are saved.

$ # same as: awk '{print $2}'
$ printf '     one \t two\t\t\tthree  ' | awk '{split($0, a); print a[2]}'
two

$ # example with both FS and split in action
$ s='Joe,1996-10-25,64,78'
$ echo "$s" | awk -F, '{split($2, d, "-"); print $1 " was born in " d[1]}'
Joe was born in 1996

$ # single row to multiple rows based on splitting last field
$ s='air,water,12:42:3'
$ echo "$s" | awk -F, '{n=split($NF, a, ":");
                       for(i=1; i<=n; i++) print $1, $2, a[i]}'
air water 12
air water 42
air water 3
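
The following sketch illustrates the return value and what happens when the same array is reused: elements left over from an earlier split are not retained. The strings and the names arr, status and result are made up for illustration.

```shell
result=$(awk 'BEGIN{
    n = split("a b c", arr)       # default FS: arr[1]="a", arr[2]="b", arr[3]="c"
    n = split("x-y", arr, "-")    # arr is emptied first, then arr[1]="x", arr[2]="y"
    status = (3 in arr) ? "arr[3] kept" : "arr[3] cleared"
    print n, arr[1], arr[2], status
}')
echo "$result"
```

The output is 2 x y arr[3] cleared, showing that the third element from the first call is gone.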

Similar to FS, you can use a regular expression as the separator.

$ s='Sample123string42with777numbers'
$ echo "$s" | awk '{split($0, s, /[0-9]+/); print s[2], s[4]}'
string numbers

The fourth argument provides a feature not present with FS splitting. It allows you to save the portions matched by the separator in an array. Quoting from gawk manual: split():

If fieldsep is a single space, then any leading whitespace goes into seps[0] and any trailing whitespace goes into seps[n], where n is the return value of split() (i.e., the number of elements in array).

$ s='Sample123string42with777numbers'
$ echo "$s" | awk '{n=split($0, s, /[0-9]+/, seps);
                   for(i=1; i<n; i++) print seps[i]}'
123
42
777

Here's an example where split is used merely to initialize an array, with an empty separator splitting the string into individual characters. With the $N syntax, an expression resulting in a floating-point number is acceptable. Array subscripts, however, are converted to strings, so a value like 1.3 won't match the integer keys created by split. Hence, the int function is used to convert the floating-point result to an integer in the example below.

$ cat marks.txt
Dept    Name    Marks
ECE     Raj     53
ECE     Joel    72
EEE     Moi     68
CSE     Surya   81
EEE     Tia     59
ECE     Om      92
CSE     Amy     67

$ # adds a new grade column based on marks in 3rd column
$ awk 'BEGIN{OFS="\t"; split("DCBAS", g, //)}
       {$(NF+1) = NR==1 ? "Grade" : g[int($NF/10)-4]} 1' marks.txt
Dept    Name    Marks   Grade
ECE     Raj     53      D
ECE     Joel    72      B
EEE     Moi     68      C
CSE     Surya   81      A
EEE     Tia     59      D
ECE     Om      92      S
CSE     Amy     67      C

patsplit

The patsplit function gives you the features provided by FPAT. The argument order and the optional arguments are the same as for the split function, with FPAT as the default separator. The return value is the number of fields obtained from the split.

$ s='eagle,"fox,42",bee,frog'

$ echo "$s" | awk '{patsplit($0, a, /"[^"]*"|[^,]*/); print a[2]}'
"fox,42"

substr

The substr function allows you to extract a specified number of characters from the given string based on indexing. The argument order is:

  • First argument is the input string
  • Second argument is starting position
  • Third argument is number of characters to extract

The index starts from 1. If the third argument is not specified, all characters until the end of the input string are extracted. If the second argument is greater than the length of the string, or if the third argument is less than or equal to 0, an empty string is returned. The second argument will use 1 if a number less than one is specified.

$ echo 'abcdefghij' | awk '{print substr($0, 1, 5)}'
abcde
$ echo 'abcdefghij' | awk '{print substr($0, 4, 3)}'
def

$ echo 'abcdefghij' | awk '{print substr($0, 6)}'
fghij

$ echo 'abcdefghij' | awk -v OFS=: '{print substr($0, 2, 3), substr($0, 6, 3)}'
bcd:fgh
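
substr combines well with length when the position has to be counted from the end of the string. Here's a minimal sketch that extracts the last n characters, where n is an arbitrary variable chosen for illustration.

```shell
# start position = total length - n + 1, third argument omitted
result=$(echo 'abcdefghij' | awk '{n = 3; print substr($0, length() - n + 1)}')
echo "$result"
```

For this input, the start position works out to 8 and the output is hij.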

If only a few characters are needed from the input record, you can also use an empty FS.

$ echo 'abcdefghij' | awk -v FS= '{print $3}'
c
$ echo 'abcdefghij' | awk -v FS= '{print $3, $5}'
c e

match

The match function is useful to extract the portion of an input string matched by a regexp. There are two ways to get the matched portion:

  • by using substr function along with special variables RSTART and RLENGTH
  • by passing a third argument to match so that the results are available from an array

The first argument to match is the input string and the second is the regexp. If the match fails, RSTART is set to 0 and RLENGTH to -1. The return value is the same as RSTART.

$ s='051 035 154 12 26 98234'

$ # using substr and RSTART/RLENGTH
$ echo "$s" | awk 'match($0, /[0-9]{4,}/){print substr($0, RSTART, RLENGTH)}'
98234

$ # using array, note that index 0 is used here, not 1
$ echo "$s" | awk 'match($0, /0*[1-9][0-9]{2,}/, m){print m[0]}'
154
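
To make the values of RSTART, RLENGTH and the return value concrete, here's a small sketch that prints them for a successful and a failed match. The string and the variable names are made up for illustration.

```shell
result=$(awk 'BEGIN{
    s = "road 4521 km"
    r = match(s, /[0-9]+/)     # digits start at position 6 and span 4 characters
    print r, RSTART, RLENGTH, substr(s, RSTART, RLENGTH)
    r = match(s, /xyz/)        # no match
    print r, RSTART, RLENGTH
}')
echo "$result"
```

The first line of output is 6 6 4 4521 and the second is 0 0 -1.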

Both the above examples can also be easily solved using FPAT or patsplit. match has an advantage when you need only the portions matched within capture groups. The element at index 0 of the array has the entire match. Index 1 has the portion matched by the first group, index 2 has the portion matched by the second group and so on. See also stackoverflow: arithmetic replacement in a text file.

$ # entire matched portion
$ echo 'foo=42, baz=314' | awk 'match($0, /baz=([0-9]+)/, m){print m[0]}'
baz=314
$ # matched portion of first capture group
$ echo 'foo=42, baz=314' | awk 'match($0, /baz=([0-9]+)/, m){print m[1]}'
314

If you need to get matching portions for all the matches instead of just the first match, you can use a loop and adjust the input string every iteration.

$ # extract numbers only if they are followed by a comma
$ s='42 foo-5, baz3; x-83, y-20: f12'
$ echo "$s" | awk '{ while( match($0, /([0-9]+),/, m) ){print m[1];
                   $0=substr($0, RSTART+RLENGTH)} }'
5
83
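
When capture groups aren't needed, the same looping idea can be written with just RSTART and RLENGTH. The sketch below works on a copy of the record (the variable s, an arbitrary name) so that $0 is left untouched, and the input is made up for illustration.

```shell
result=$(echo 'a1b22c333' |
    awk '{s = $0
          while (match(s, /[0-9]+/)) {
              print substr(s, RSTART, RLENGTH)    # current match
              s = substr(s, RSTART + RLENGTH)     # keep only the text after it
          }}')
echo "$result"
```

This prints 1, 22 and 333 on separate lines.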

index

The index function is useful when you need to match a string literally in the given input string. This is similar to grep -F functionality of matching fixed strings. The first argument to this function is the input string and the second is the string to be matched literally. The return value is the index of matching location and 0 if there is no match.

$ cat eqns.txt
a=b,a-b=c,c*d
a+b,pi=3.14,5e12
i*(t+9-g)/8,4-a+b

$ # no output because the metacharacters aren't escaped
$ awk '/i*(t+9-g)/' eqns.txt
$ # same as: grep -F 'i*(t+9-g)' eqns.txt
$ awk 'index($0, "i*(t+9-g)")' eqns.txt
i*(t+9-g)/8,4-a+b

$ # check only the last field
$ awk -F, 'index($NF, "a+b")' eqns.txt
i*(t+9-g)/8,4-a+b
$ # index not needed if entire field/line is being compared
$ awk -F, '$1=="a+b"' eqns.txt
a+b,pi=3.14,5e12

The return value is also useful to ensure that the match is found only at a specific position, for example at the start or end of the input string.

$ # start of string
$ awk 'index($0, "a+b")==1' eqns.txt
a+b,pi=3.14,5e12
$ # end of string
$ awk -v s="a+b" 'index($0, s)==length()-length(s)+1' eqns.txt
i*(t+9-g)/8,4-a+b
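
Note that index reports only the first occurrence. So, the end of string check shown above can miss a line where the search string also occurs earlier in the line. One possible alternative, sketched here with a made-up input line, is to compare the tail of the line extracted using substr.

```shell
line='a+b-a+b'
# index finds the first a+b at position 1, so this comparison fails
via_index=$(echo "$line" | awk -v s='a+b' 'index($0, s) == length() - length(s) + 1')
# comparing the last length(s) characters works regardless of earlier occurrences
via_substr=$(echo "$line" | awk -v s='a+b' 'substr($0, length() - length(s) + 1) == s')
echo "index:  [$via_index]"
echo "substr: [$via_substr]"
```

Only the substr version prints the line for this input.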

Recall that values passed with the -v option are subject to awk's escape sequence processing. So, if you need to pass a literal string without falling into backslash hell, use ENVIRON instead of the -v option.

$ echo 'a\b\c\d' | awk -v s='a\b' 'index($0, s)'
$ echo 'a\b\c\d' | awk -v s='a\\b' 'index($0, s)'
a\b\c\d
$ echo 'a\b\c\d' | s='a\b' awk 'index($0, ENVIRON["s"])'
a\b\c\d

system

External commands can be issued using the system function. Any output generated by the external command goes to stdout as usual, unless it is redirected as part of the command.

$ awk 'BEGIN{system("echo Hello World")}'
Hello World

$ wc table.txt
 3 15 79 table.txt
$ awk 'BEGIN{system("wc table.txt")}'
 3 15 79 table.txt

$ awk 'BEGIN{system("seq 10 | paste -sd, > out.txt")}'
$ cat out.txt
1,2,3,4,5,6,7,8,9,10

$ cat t2.txt
I bought two balls and 3 bats
$ echo 'f1,t2,f3' | awk -F, '{system("cat " $2 ".txt")}'
I bought two balls and 3 bats

The return value of system depends on the exit status of the executed command. See gawk manual: Input/Output Functions for details.

$ ls xyz.txt
ls: cannot access 'xyz.txt': No such file or directory
$ echo $?
2

$ awk 'BEGIN{s=system("ls xyz.txt"); print "Exit status: " s}'
ls: cannot access 'xyz.txt': No such file or directory
Exit status: 2
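
The return value can be used directly in a condition. Here's a minimal sketch using the test shell command. It assumes that / exists and that /no/such/dir (a made-up path) does not exist on your system.

```shell
result=$(awk 'BEGIN{
    # test -d exits with 0 when the directory exists, non-zero otherwise
    if (system("test -d /") == 0)
        print "/ exists"
    if (system("test -d /no/such/dir") != 0)
        print "/no/such/dir is missing"
}')
echo "$result"
```

Under those assumptions, both messages are printed.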

printf and sprintf

The printf function is preferred over print when you need to format the data before printing. Another difference is that OFS and ORS do not affect printf. The formatting features are similar to those of the printf function in the C programming language and the shell built-in command.

$ # OFMT controls the formatting for numbers displayed with print function
$ awk 'BEGIN{print OFMT}'
%.6g
$ awk 'BEGIN{sum = 3.1428 + 100; print sum}'
103.143
$ awk 'BEGIN{OFMT="%.5f"; sum = 3.1428 + 100; print sum}'
103.14280

$ # using printf function
$ # note the use of \n as ORS isn't appended unlike print
$ awk 'BEGIN{sum = 3.1428 + 10; printf "%f\n", sum}'
13.142800
$ awk 'BEGIN{sum = 3.1428 + 10; printf "%.3f\n", sum}'
13.143

Here are some more formatting options for floating-point numbers.

$ # total length is 10, filled with space if needed
$ # [ and ] are used here for visualization purposes
$ awk 'BEGIN{pi = 3.14159; printf "[%10.3f]\n", pi}'
[     3.142]
$ awk 'BEGIN{pi = 3.14159; printf "[%-10.3f]\n", pi}'
[3.142     ]

$ # zero filled
$ awk 'BEGIN{pi = 3.14159; printf "%010.3f\n", pi}'
000003.142

$ # scientific notation
$ awk 'BEGIN{pi = 3.14159; printf "%e\n", pi}'
3.141590e+00

Here are some formatting options for integers.

$ # note that there is no rounding
$ awk 'BEGIN{printf "%d\n", 1.99}'
1

$ # ensure there's always a sign prefixed to integer
$ awk 'BEGIN{printf "%+d\n", 100}'
+100
$ awk 'BEGIN{printf "%+d\n", -100}'
-100

Here are some formatting options for strings.

$ # prefix remaining width with spaces
$ awk 'BEGIN{printf "|%10s|\n", "mango"}'
|     mango|

$ # suffix remaining width with spaces
$ awk 'BEGIN{printf "|%-10s|\n", "mango"}'
|mango     |

$ # truncate
$ awk '{printf "%.4s\n", $0}' table.txt
brow
blue
yell
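
String and number formats can be combined in a single format string to produce aligned columns. Here's a small sketch with made-up input, where the | characters are used only to make the column widths visible.

```shell
# first field: left aligned in 6 columns
# second field: right aligned in 7 columns with 2 digits after the decimal point
result=$(printf 'apple 3.5\nfig 12\n' | awk '{printf "%-6s|%7.2f|\n", $1, $2}')
echo "$result"
```

The two output lines are apple |   3.50| and fig   |  12.00|.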

You can also refer to an argument using N$ format, where N is the positional number of argument. One advantage with this method is that you can reuse an argument any number of times. You cannot mix this format with the normal way.

$ awk 'BEGIN{printf "%1$d + %2$d * %1$d = %3$d\n", 3, 4, 15}'
3 + 4 * 3 = 15
$ # remove # if you do not need the prefix
$ awk 'BEGIN{printf "hex=%1$#x\noct=%1$#o\ndec=%1$d\n", 15}'
hex=0xf
oct=017
dec=15

The field width and precision can be supplied as arguments by specifying a * instead of a number in the format string.

$ # same as: awk 'BEGIN{pi = 3.14159; printf "%010.3f\n",  pi}'
$ awk 'BEGIN{d=10; p=3; pi = 3.14159; printf "%0*.*f\n", d, p, pi}'
000003.142

warning Passing a variable directly to printf without using a format specifier can result in an error depending upon the contents of the variable.

$ awk 'BEGIN{s="solve: 5 % x = 1"; printf s}'
awk: cmd. line:1: fatal: not enough arguments to satisfy format string
    `solve: 5 % x = 1'
               ^ ran out for this one

So, as a good practice, always use an appropriate format for variables instead of passing them directly to printf.

$ awk 'BEGIN{s="solve: 5 % x = 1"; printf "%s\n", s}'
solve: 5 % x = 1

If % has to be used literally inside the format string, use %%. This is similar to using \\ in regexp to represent \ literally.

$ awk 'BEGIN{printf "n%%d gives the remainder\n"}'
n%d gives the remainder

To save the results of the formatting in a variable instead of printing, use the sprintf function. Unlike printf, parentheses are always required when using sprintf.

$ awk 'BEGIN{pi = 3.14159; s = sprintf("%010.3f", pi); print s}'
000003.142
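
A common use of sprintf is building strings such as zero padded names that are needed later in the program. Here's a minimal sketch, where the part_NNN.txt naming scheme is made up for illustration.

```shell
result=$(awk 'BEGIN{
    for (i = 9; i <= 11; i++) {
        f = sprintf("part_%03d.txt", i)    # pad the number to 3 digits
        print f
    }
}')
echo "$result"
```

This prints part_009.txt, part_010.txt and part_011.txt on separate lines.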

info See gawk manual: printf for complete list of formatting options and other details.

Redirecting print output

The results from print and printf can be redirected to a shell command or a file instead of stdout. There's nothing special about this, since you could redirect the output of the whole awk command from the shell as well. The use case arises when you need to redirect only a specific portion of the output or when you need multiple redirections within the same awk command. Here are some examples of redirecting to multiple files.

$ seq 6 | awk 'NR%2{print > "odd.txt"; next} {print > "even.txt"}'
$ cat odd.txt
1
3
5
$ cat even.txt
2
4
6

$ # dynamically creating filenames
$ awk -v OFS='\t' 'NR>1{print $2, $3 > $1".txt"}' marks.txt
$ # output for one of the departments
$ cat ECE.txt
Raj     53
Joel    72
Om      92

Note that the use of > doesn't mean that the file gets overwritten every time print is executed. The file is truncated only once, when it is first opened by the awk command (which matters only if the file already existed), and subsequent output is added after that. Use >> if you wish to append to already existing files.
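
Here's a small sketch of the difference between > and >>. It uses a scratch directory created with mktemp so that no existing files are touched, and the file name and contents are made up for illustration.

```shell
tmp=$(mktemp -d)                  # scratch directory for this demo
f="$tmp/out.txt"
printf 'old content\n' > "$f"     # file exists before awk runs

# > truncates the file once, then all three lines are written
printf '1\n2\n3\n' | awk -v f="$f" '{print > f}'
# >> keeps the existing contents and adds to them
printf '4\n' | awk -v f="$f" '{print >> f}'

result=$(cat "$f")
echo "$result"
rm -r "$tmp"
```

The file ends up with the lines 1 to 4. The old content is gone, but the three lines written by the first awk command didn't overwrite each other.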

As seen in above examples, the file names are passed as string expressions. To redirect to a shell command, again you need to pass a string expression after | pipe symbol. Here's an example.

$ awk '{print $2 | "paste -sd,"}' table.txt
bread,cake,banana

And here are some examples of multiple redirections.

$ awk '{print $2 | "sort | paste -sd,"}' table.txt
banana,bread,cake

$ # sort the output before writing to files
$ awk -v OFS='\t' 'NR>1{print $2, $3 | "sort > "$1".txt"}' marks.txt
$ # output for one of the departments
$ cat ECE.txt
Joel    72
Om      92
Raj     53

info See gawk manual: Redirecting Output of print and printf for more details and operators on redirections. And see gawk manual: Closing Input and Output Redirections if you have too many redirections.

Summary

This chapter covered some of the built-in functions provided by awk. Do check the manual for more of them, for example math and time related functions.

The next chapter will cover features related to processing multiple files passed as input to awk.

Exercises

info Exercises will also include functions and features not discussed in this chapter. Refer to gawk manual: Functions for details.

a) For the input file scores.csv, sort the rows based on Physics values in descending order. Header should be retained as the first line in output.

$ awk ##### add your solution here
Name,Maths,Physics,Chemistry
Ith,100,100,100
Cy,97,98,95
Lin,78,83,80
Er,56,79,92
Ort,68,72,66
Blue,67,46,99

b) For the input file nums3.txt, calculate the square root of the numbers and display the results in two different formats: first with four digits after the decimal point and next in scientific notation, again with four digits after the decimal point. Assume the input has only a single column of positive numbers.

$ awk ##### add your solution here
1.7720
64.8151
27.8747
568.3414

$ awk ##### add your solution here
1.7720e+00
6.4815e+01
2.7875e+01
5.6834e+02

c) Transform the given input strings to the corresponding output shown. Assume space as the field separator. From the second field, remove the second : and the number that follows. Modify the last field by multiplying it by the number that was deleted from the second field. The numbers can be positive/negative integers or floating-point numbers (including scientific notation).

$ echo 'go x:12:-425 og 6.2' | awk ##### add your solution here
go x:12 og -2635

$ echo 'rx zwt:3.64:12.89e2 ljg 5' | awk ##### add your solution here
rx zwt:3.64 ljg 6445

d) Transform the given input strings to the corresponding output shown. Assume space as the field separator. Replace the second field with the sum of the two numbers embedded in it. The numbers can be positive/negative integers or floating-point numbers (but not scientific notation).

$ echo 'f2:z3 kt//-42\\3.14//tw 5y6' | awk ##### add your solution here
f2:z3 -38.86 5y6

$ echo 't5:x7 qr;wq<=>+10{-8764.124}yb u9' | awk ##### add your solution here
t5:x7 -8754.12 u9

e) For the given input strings, extract portion of the line starting from the matching location specified by shell variable s till the end of the line. If there is no match, do not print that line. The contents of s should be matched literally.

$ s='(a^b)'
$ echo '3*f + (a^b) - 45' | ##### add your solution here
(a^b) - 45

$ s='\&/'
$ # should be no output for this input
$ echo 'f\&z\&2.14' | ##### add your solution here
$ # but this one has a match
$ echo 'f\&z\&/2.14' | ##### add your solution here
\&/2.14

f) Extract all positive integers preceded by - and followed by : or ; and display all such matches separated by a newline character.

$ s='42 foo-5; baz3; x-83, y-20:-34; f12'
$ echo "$s" | awk ##### add your solution here
5
20
34

g) For the input file scores.csv, calculate the average of the three marks for each Name. Those with an average greater than or equal to 80 should be saved in pass.csv and the rest in fail.csv. The format is Name and average score (with two decimal places) separated by a tab character.

$ awk ##### add your solution here

$ cat fail.csv
Blue    70.67
Er      75.67
Ort     68.67
$ cat pass.csv
Lin     80.33
Cy      96.67
Ith     100.00

h) For the input file files.txt, replace lines starting with a space with the output of that line executed as a shell command.

$ cat files.txt
 sed -n '2p' addr.txt
-----------
 wc -w sample.txt
===========
 awk '{print $1}' table.txt
-----------

$ awk ##### add your solution here
How are you
-----------
31 sample.txt
===========
brown
blue
yellow
-----------

i) For the input file fw.txt, format the last column of numbers in scientific notation with two digits after the decimal point.

$ awk ##### add your solution here
1.3  rs   90  1.35e-01
3.8           6.00e+00
5.2  ye       8.24e+00
4.2  kt   32  4.51e+01

j) For the input file addr.txt, display all lines containing e or u but not both.

info Hint — gawk manual: Bit-Manipulation Functions.

$ awk ##### add your solution here
Hello World
This game is good
Today is sunny