Field separators

Now that you are familiar with basic awk syntax and regular expressions, this chapter will dive deep into field processing. You'll learn how to set input and output field separators, how to use regexps for defining fields and how to work with fixed length fields.

The example_files directory has all the files used in the examples.

Default field separation

As seen earlier, awk automatically splits input into fields which are accessible using $N where N is the field number you need. You can also pass an expression instead of a numeric literal to specify the field required.

$ cat table.txt
brown bread mat hair 42
blue cake mug shirt -7
yellow banana window shoes 3.14

# print the fourth field if the first field starts with 'b'
$ awk '$1 ~ /^b/{print $4}' table.txt
hair
shirt

# print the field as specified by the value stored in the 'f' variable
$ awk -v f=3 '{print $f}' table.txt
mat
mug
window

The NF special variable will give you the number of fields for each input line. This is useful when you don't know how many fields are present in the input and you need to process fields from the end.

# print the last field of each input line
$ awk '{print $NF}' table.txt
42
-7
3.14

# print the last but one field
$ awk '{print $(NF-1)}' table.txt
hair
shirt
shoes

# don't forget the parentheses!
# this will subtract 1 from the last field and print it
$ awk '{print $NF-1}' table.txt
41
-8
2.14
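
Since NF holds the field count, it can also serve as a loop bound. Here's a minimal sketch that prints the fields of each line in reverse order. The reverse_fields function name is made up for this illustration.

```shell
# hypothetical helper: print the fields of each input line in reverse order
reverse_fields() {
    awk '{
        out = $NF
        # walk from the last-but-one field down to the first
        for (i = NF - 1; i >= 1; i--)
            out = out " " $i
        print out
    }'
}

# prints: 42 hair mat bread brown
printf 'brown bread mat hair 42\n' | reverse_fields
```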

By default, awk does more than split the input on spaces. It splits on sequences of one or more space, tab, or newline characters. In addition, any of these three characters at the start or end of the input are trimmed and won't be part of the field contents. Input containing newline characters will be covered in the Record separators chapter.

$ echo '   a   b   c   ' | awk '{print NF}'
3
# note that the leading spaces aren't part of the field content
$ echo '   a   b   c   ' | awk '{print $1}'
a
# note that the trailing spaces aren't part of the field content
$ echo '   a   b   c   ' | awk '{print $NF "."}'
c.

# here's another example with tab characters thrown in
$ printf '     one \t two\t\t\tthree  ' | awk '{print NF}'
3
$ printf '     one \t two\t\t\tthree  ' | awk '{print $2 "."}'
two.

When passing an expression for the field number, a floating-point result is acceptable too. The fractional portion is ignored. However, as precision is limited, a value very close to the next integer could get rounded up instead of truncated.

$ awk 'BEGIN{printf "%.16f\n", 2.999999999999999}'
2.9999999999999991
$ awk 'BEGIN{printf "%.16f\n", 2.9999999999999999}'
3.0000000000000000

# same as: awk '{print $2}' table.txt
$ awk '{print $2.999999999999999}' table.txt
bread
cake
banana

# same as: awk '{print $3}' table.txt
$ awk '{print $2.9999999999999999}' table.txt
mat
mug
window
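
If the field number comes from a calculation, one option is to truncate it explicitly with the int function instead of relying on the implicit conversion. Here's a small sketch, where the f variable and its value are assumptions for illustration:

```shell
# int() makes the truncation explicit: 2.7 becomes 2
# prints: bread
printf 'brown bread mat hair 42\n' | awk -v f=2.7 '{print $(int(f))}'
```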

Input field separator

The most common way to change the default field separator is to use the -F command line option. The value passed to the option will be treated as a string literal and then converted to a regexp. For now, here are some examples without any special regexp characters.

# use ':' as the input field separator
$ echo 'goal:amazing:whistle:kwality' | awk -F: '{print $1}'
goal
$ echo 'goal:amazing:whistle:kwality' | awk -F: '{print $NF}'
kwality

# use quotes to avoid clashes with shell special characters
$ echo 'one;two;three;four' | awk -F';' '{print $3}'
three

# first and last fields will have empty string as their values
$ echo '=a=b=c=' | awk -F= '{print $1 "[" $NF "]"}'
[]

# difference between empty lines and lines without field separator
$ printf '\nhello\napple,banana\n' | awk -F, '{print NF}'
0
1
2
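
The NF value seen in the last example can also serve as a filter. The sketch below reuses the same sample input and prints the second field only for lines that actually contain the separator:

```shell
# NF > 1 is true only when at least one ',' was found in the line
# prints: banana
printf '\nhello\napple,banana\n' | awk -F, 'NF > 1 {print $2}'
```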

You can also directly set the special FS variable to change the input field separator. This can be done from the command line using the -v option or within the code blocks.

$ echo 'goal:amazing:whistle:kwality' | awk -v FS=: '{print $2}'
amazing

# field separator can be multiple characters too
$ echo '1e4SPT2k6SPT3a5SPT4z0' | awk 'BEGIN{FS="SPT"} {print $3}'
3a5
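
One thing to keep in mind when assigning FS within the code: a new FS value is used from the next input line onwards, not for the line currently being processed. The sketch below, with made-up input, shows why BEGIN (or the -v option) is the usual place for such assignments:

```shell
# FS is changed too late for the first line, so $1 is the whole line there
# prints 'a:b' and then 'c'
printf 'a:b\nc:d\n' | awk '{FS=":"} {print $1}'

# setting FS in BEGIN applies to every line
# prints 'a' and then 'c'
printf 'a:b\nc:d\n' | awk 'BEGIN{FS=":"} {print $1}'
```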

If you wish to split the input as individual characters, use an empty string as the field separator.

# note that the space between -F and '' is necessary here
$ echo 'apple' | awk -F '' '{print $1}'
a
$ echo 'apple' | awk -v FS= '{print $NF}'
e

# depending upon the locale, you can work with multibyte characters too
$ echo 'αλεπού' | awk -v FS= '{print $3}'
ε
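
With each character available as a field, you can loop over them just like any other fields. Here's a rough sketch that counts how many times a given character occurs in each line. The count_char function and the c variable are hypothetical names used only for this example.

```shell
# hypothetical helper: count occurrences of a single character per line
count_char() {
    # $1: the character to count, passed to awk as the variable c
    awk -v FS= -v c="$1" '{
        n = 0
        for (i = 1; i <= NF; i++)
            if ($i == c) n++
        print n
    }'
}

# prints: 3
echo 'banana' | count_char a
```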

Here are some examples with regexp based field separators. The value passed to -F or FS is treated as a string and then converted to a regexp. So, you'll need \\ instead of \ to mean a backslash character. The good news is that a single character that is also a regexp metacharacter will be treated literally, so you do not need to escape it.

$ echo 'Sample123string42with777numbers' | awk -F'[0-9]+' '{print $2}'
string
$ echo 'Sample123string42with777numbers' | awk -F'[a-zA-Z]+' '{print $2}'
123

# note the use of \\W to indicate \W
$ printf '%s\n' 'load;err_msg--\ant,r2..not' | awk -F'\\W+' '{print $3}'
ant

# same as: awk -F'\\.' '{print $2}'
$ echo 'hi.bye.hello' | awk -F. '{print $2}'
bye

# count the number of vowels for each input line
# note that empty lines will give -1 in the output
$ printf 'cool\nnice car\n' | awk -F'[aeiou]' '{print NF-1}'
2
3
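
A regexp separator is also handy when the delimiter comes with optional padding. In this sketch (the input string is made up), a comma along with any surrounding spaces is treated as a single separator:

```shell
# ' *, *' matches a comma plus any spaces around it
# prints: [banana]
echo 'apple ,  banana,cherry' | awk -F' *, *' '{print "[" $2 "]"}'
```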

The default value of FS is a single space character. So, if you set the input field separator to a single space, you get the same default splitting behavior discussed in the previous section. If you want to split on individual space characters instead, put the space inside a character class.

# same as: awk '{print NF}'
$ echo '   a   b   c   ' | awk -F' ' '{print NF}'
3

# there are 12 space characters, thus 13 fields
$ echo '   a   b   c   ' | awk -F'[ ]' '{print NF}'
13

If IGNORECASE is set, it will affect field separation as well, except when the field separator is a single character. You can work around that exception by using a character class.

$ echo 'RECONSTRUCTED' | awk -F'[aeiou]+' -v IGNORECASE=1 '{print $NF}'
D

# when FS is a single character
$ echo 'RECONSTRUCTED' | awk -F'e' -v IGNORECASE=1 '{print $1}'
RECONSTRUCTED
$ echo 'RECONSTRUCTED' | awk -F'[e]' -v IGNORECASE=1 '{print $1}'
R

Output field separator

The OFS special variable controls the output field separator. OFS is used as the string between multiple arguments passed to the print statement. It is also used whenever $0 has to be reconstructed as a result of field contents being modified. The default value for OFS is a single space character, just like FS. There is no equivalent command line option though, so you'll have to change OFS directly.

# print the first and third fields, OFS is used to join these values
# note the use of , to separate print arguments
$ awk '{print $1, $3}' table.txt
brown mat
blue mug
yellow window

# same FS and OFS
$ echo 'goal:amazing:whistle:kwality' | awk -F: -v OFS=: '{print $2, $NF}'
amazing:kwality
$ echo 'goal:amazing:whistle:kwality' | awk 'BEGIN{FS=OFS=":"} {print $2, $NF}'
amazing:kwality

# different values for FS and OFS
$ echo 'goal:amazing:whistle:kwality' | awk -F: -v OFS=- '{print $2, $NF}'
amazing-kwality
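
It is easy to forget the comma between print arguments. Without it, the values are concatenated into a single argument and OFS doesn't come into play, as this small sketch shows:

```shell
# no comma: plain string concatenation, OFS is not used
# prints: brownmat
echo 'brown bread mat hair 42' | awk -v OFS=, '{print $1 $3}'

# with comma: two arguments joined by OFS
# prints: brown,mat
echo 'brown bread mat hair 42' | awk -v OFS=, '{print $1, $3}'
```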

Here are some examples for changing field contents and then printing $0.

$ echo 'goal:amazing:whistle:kwality' | awk -F: -v OFS=: '{$2 = 42} 1'
goal:42:whistle:kwality
$ echo 'goal:amazing:whistle:kwality' | awk -F: -v OFS=, '{$2 = 42} 1'
goal,42,whistle,kwality

# recall that spaces at the start/end get trimmed for default FS
$ echo '   a   b   c   ' | awk '{$NF = "last"} 1'
a b last

Sometimes you want to print the contents of $0 with the new OFS value but field contents aren't being changed. In such cases, you can assign a field value to itself to force the reconstruction of $0.

# no change because there was no trigger to rebuild $0
$ echo 'Sample123string42with777numbers' | awk -F'[0-9]+' -v OFS=, '1'
Sample123string42with777numbers

# assign a field to itself in such cases
$ echo 'Sample123string42with777numbers' | awk -F'[0-9]+' -v OFS=, '{$1=$1} 1'
Sample,string,with,numbers
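
The same trick is a convenient way to normalize whitespace. The sketch below, using a made-up input, converts space separated fields to tab separated ones:

```shell
# '\t' in a -v assignment is interpreted as a tab character
# prints a, b and c separated by single tabs
echo '  a   b   c  ' | awk -v OFS='\t' '{$1=$1} 1'
```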

If you need to set the same input and output field separator, you can write a more concise one-liner using the brace expansion feature of shells such as Bash. Here are some examples:

$ echo -v{,O}FS=:
-vFS=: -vOFS=:

$ echo 'goal:amazing:whistle:kwality' | awk -v{,O}FS=: '{$2 = 42} 1'
goal:42:whistle:kwality

$ echo 'goal:amazing:whistle:kwality' | awk '{$2 = 42} 1' {,O}FS=:
goal:42:whistle:kwality

However, this is not commonly used and doesn't save enough characters to be preferred over explicit assignment.

Manipulating NF

Changing the value of NF will rebuild $0 as well. Here are some examples:

# reducing fields
$ echo 'goal:amazing:whistle:kwality' | awk -F: -v OFS=, '{NF=2} 1'
goal,amazing
# increasing fields
$ echo 'goal:amazing:whistle:kwality' | awk -F: -v OFS=: '{$(NF+1)="sea"} 1'
goal:amazing:whistle:kwality:sea

# empty fields will be created as needed
$ echo 'goal:amazing:whistle:kwality' | awk -F: -v OFS=: '{$8="go"} 1'
goal:amazing:whistle:kwality::::go
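
Assigning to a field beyond NF can also be used to pad records to a minimum number of fields. Here's a minimal sketch, where the variable n and the sample input are assumptions for illustration:

```shell
# pad shorter records with empty fields so that each has at least n fields
# prints 'a,b,,' and then '1,2,3,4,5'
printf 'a,b\n1,2,3,4,5\n' | awk -F, -v OFS=, -v n=4 'NF < n {$n = ""} 1'
```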

Assigning 0 to NF will delete all the fields. However, a negative value will result in an error.

$ echo 'goal:amazing:whistle:kwality' | awk -F: -v OFS=: '{NF=-1} 1'
awk: cmd. line:1: (FILENAME=- FNR=1) fatal: NF set to negative value

FPAT

The FS variable allows you to define the input field separator. In contrast, FPAT (field pattern) allows you to define what the fields should be made up of.

$ s='Sample123string42with777numbers'
# one or more consecutive digits
$ echo "$s" | awk -v FPAT='[0-9]+' '{print $2}'
42
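
Once FPAT has picked out the pieces you are interested in, they behave like ordinary fields. As a rough sketch, assuming awk is GNU awk (FPAT is not available in other implementations), here's one way to add up all the numbers embedded in a line. The sum_numbers name is made up for this example.

```shell
# hypothetical helper: sum all the digit sequences found in each line
# assumes GNU awk, since FPAT is a gawk extension
sum_numbers() {
    awk -v FPAT='[0-9]+' '{
        total = 0
        for (i = 1; i <= NF; i++)
            total += $i
        print total
    }'
}

# 123 + 42 + 777, prints: 942
echo 'Sample123string42with777numbers' | sum_numbers
```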

$ s='coat Bin food tar12 best Apple fig_42'
# whole words made up of lowercase letters and digits only
$ echo "$s" | awk -v FPAT='\\<[a-z0-9]+\\>' -v OFS=, '{$1=$1} 1'
coat,food,tar12,best

$ s='items: "apple" and "mango"'
# get the first double quoted item
$ echo "$s" | awk -v FPAT='"[^"]+"' '{print $1}'
"apple"

If IGNORECASE is set, it will affect field matching as well. Unlike FS, there is no different behavior for a single character pattern.

# count the number of character 'e'
$ echo 'Read Eat Sleep' | awk -v FPAT='e' '{print NF}'
3
$ echo 'Read Eat Sleep' | awk -v IGNORECASE=1 -v FPAT='e' '{print NF}'
4
$ echo 'Read Eat Sleep' | awk -v IGNORECASE=1 -v FPAT='[e]' '{print NF}'
4

CSV processing with FPAT

FPAT can be effective for processing CSV (Comma Separated Values) input even when the fields contain embedded delimiter characters. First, consider the issue shown below:

$ s='eagle,"fox,42",bee,frog'

# simply using , as separator isn't sufficient
$ echo "$s" | awk -F, '{print $2}'
"fox

For such cases, FPAT helps to define fields as starting and ending with double quotes or containing non-comma characters.

# * is used instead of + to allow empty fields
$ echo "$s" | awk -v FPAT='"[^"]*"|[^,]*' '{print $2}'
"fox,42"

CSV processing with --csv

The solution presented in the last section will not work for all kinds of CSV files — for example, if the fields contain escaped double quotes, newline characters, etc.

$ s='"toy,eagle\"s","fox,42",bee,frog'

# the FPAT solution won't work if there are escaped quotes
$ printf '%b' "$s" | awk -v FPAT='"[^"]*"|[^,]*' '{print $2}'
s"

GNU awk now has native support for handling CSV files, which is activated with the --csv (or -k) option. You cannot customize the field separator with this feature. Also, the quotes around a field will not be retained. See gawk manual: Working With Comma Separated Value Files for more details.

# --csv or -k can be used instead
# however, you cannot customize the field separator
$ printf '%b' "$s" | awk -k '{print $2}'
fox,42
# and quotes around a field will be lost
$ printf '%b' "$s" | awk -k -v OFS=: '{$1=$1} 1'
toy,eagle\"s:fox,42:bee:frog

Here's an example with embedded newline characters:

$ cat newline.csv
apple,"1
2
3",good
fig,guava,"32
54",nice
$ awk -k 'NR==1{print $2}' newline.csv
1
2
3

See stackoverflow: What's the most robust way to efficiently parse CSV using awk? and csvquote for alternate solutions. You could also use other programming languages such as Perl, Python, Ruby, etc., which come with standard CSV parsing libraries or have easy access to third-party solutions. There are also specialized command line tools such as xsv.

You can also check out frawk, which is mostly similar to the awk command but also supports CSV parsing. goawk is another implementation with CSV support.

FIELDWIDTHS

FIELDWIDTHS is another feature where you get to define field contents. As indicated by the name, you have to specify the number of characters for each field. This method is useful to process fixed width data.

$ cat items.txt
apple   fig banana
50      10  200

# here field widths have been assigned such that
# extra spaces are placed at the end of each field
$ awk -v FIELDWIDTHS='8 4 6' '{print $2}' items.txt
fig 
10  
# note that the field contents will include the spaces as well
$ awk -v FIELDWIDTHS='8 4 6' '{print "[" $2 "]"}' items.txt
[fig ]
[10  ]
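
If the padding isn't wanted, one option is to trim it from a copy of the field. This is only a sketch, assuming awk is GNU awk (FIELDWIDTHS is a gawk extension). The f variable is a made-up name and the input is supplied inline with the same content as items.txt.

```shell
# copy the field to a variable and trim trailing spaces from the copy,
# so that the original field and $0 are left untouched
# prints '[fig]' and then '[10]'
printf 'apple   fig banana\n50      10  200\n' |
    awk -v FIELDWIDTHS='8 4 6' '{f = $2; sub(/ +$/, "", f); print "[" f "]"}'
```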

You can optionally prefix a field width with the number of characters to be ignored.

# first field is 5 characters
# then 3 characters are ignored and 3 characters for the second field
# then 1 character is ignored and 6 characters for the third field
$ awk -v FIELDWIDTHS='5 3:3 1:6' '{print "[" $1 "]"}' items.txt
[apple]
[50   ]
$ awk -v FIELDWIDTHS='5 3:3 1:6' '{print "[" $2 "]"}' items.txt
[fig]
[10 ]

If an input line length exceeds the total width specified, the extra characters will simply be ignored. If you wish to access those characters, you can use * to represent the last field. See gawk manual: FIELDWIDTHS for more such corner cases.

$ awk -v FIELDWIDTHS='5 *' '{print "[" $1 "]"}' items.txt
[apple]
[50   ]

$ awk -v FIELDWIDTHS='5 *' '{print "[" $2 "]"}' items.txt
[   fig banana]
[   10  200]

Summary

Working with fields is the most popular feature of awk. This chapter discussed various ways in which you can split the input into fields and manipulate them. There are many more examples to be discussed related to fields in the coming chapters. I'd highly recommend reading through gawk manual: Fields as well for more details regarding field processing.

The next chapter will discuss various ways to use record separators and related special variables.

Exercises

The exercises directory has all the files used in this section.

1) For the input file brackets.txt, extract only the contents between () or )( from each input line. Assume that the () characters will be present only once per line.

$ cat brackets.txt
foo blah blah(ice) 123 xyz$ 
(almond-pista) choco
yo )yoyo( yo

$ awk ##### add your solution here
ice
almond-pista
yoyo

2) For the input file scores.csv, extract Name and Physics fields in the format shown below.

$ cat scores.csv
Name,Maths,Physics,Chemistry
Blue,67,46,99
Lin,78,83,80
Er,56,79,92
Cy,97,98,95
Ort,68,72,66
Ith,100,100,100

$ awk ##### add your solution here
Name:Physics
Blue:46
Lin:83
Er:79
Cy:98
Ort:72
Ith:100

3) For the input file scores.csv, display names of those who've scored above 70 in Maths.

$ awk ##### add your solution here
Lin
Cy
Ith

4) Display the number of word characters for the given inputs. The definition of a word character here is the same as the one used in regular expressions. Can you construct a solution with gsub and one without substitution functions?

$ echo 'hi there' | awk ##### add your solution here
7

$ echo 'u-no;co%."(do_12:as' | awk ##### add your solution here
12

5) For the input file quoted.txt, extract the first and third sequence of characters surrounded by double quotes and display them in the format shown below. Solution shouldn't use substitution functions.

$ cat quoted.txt
1 "grape" and "mango" and "guava"
("a 1""b""c-2""d")

$ awk ##### add your solution here
"grape","guava"
"a 1","c-2"

6) For the input file varying_fields.txt, construct a solution to get the output shown below. Solution shouldn't use substitution functions.

$ cat varying_fields.txt
hi,bye,there,was,here,to
1,2,3,4,5

$ awk ##### add your solution here
hi,bye,to
1,2,5

7) Transform the given input file fw.txt to get the output as shown below. If a field is empty (i.e. contains only space characters), replace it with NA.

$ cat fw.txt
1.3  rs   90  0.134563
3.8           6
5.2  ye       8.2387
4.2  kt   32  45.1

$ awk ##### add your solution here
1.3,rs,0.134563
3.8,NA,6
5.2,ye,8.2387
4.2,kt,45.1

8) Display only the third and fifth characters from each input line as shown below.

$ printf 'restore\ncat one\ncricket' | awk ##### add your solution here
so
to
ik

9) The fields.txt file has fields separated by the : character. Delete : and the last field if there is a digit character anywhere before the last field. Solution shouldn't use substitution functions.

$ cat fields.txt
42:cat
twelve:a2b
we:be:he:0:a:b:bother
apple:banana-42:cherry:
dragon:unicorn:centaur

$ awk ##### add your solution here
42
twelve:a2b
we:be:he:0:a:b
apple:banana-42:cherry
dragon:unicorn:centaur

10) Retain only the first three fields for the given sample string that uses ^ as the input field separator. Use , as the output field separator.

$ echo 'sit^eat^very^eerie^near' | awk ##### add your solution here
sit,eat,very

11) The sample string shown below uses cat as the field separator (irrespective of case). Use space as the output field separator and add 42 as the last field.

$ s='applecatfigCaT12345cAtbanana'
$ echo "$s" | awk ##### add your solution here
apple fig 12345 banana 42

12) For the input file sample.txt, filter lines containing 6 or more lowercase vowels.

$ awk ##### add your solution here
No doubt you like it too
Much ado about nothing

13) The input file concat.txt has contents of various files preceded by a line starting with ###. Replace each such sequence of characters with an incrementing integer value (starting with 1) in the format shown below.

$ awk ##### add your solution here
1) addr.txt
How are you
This game is good
Today is sunny
2) broken.txt
top
1234567890
bottom
3) sample.txt
Just do-it
Believe it
4) mixed_fs.txt
pink blue white yellow
car,mat,ball,basket

14) The newline.csv file has fields with embedded newline characters. Display only the first and last fields as shown below.

$ cat newline.csv
apple,"1
2
3",good
fig,guava,"32
54",nice

$ awk ##### add your solution here
apple,good
fig,nice

15) The newline.csv file has fields with embedded newline characters, but no fields with escaped double quotes. Change the embedded newline characters to : without removing the double quotes around such fields.

$ cat newline.csv
apple,"1
2
3",good
fig,guava,"32
54",nice

$ awk ##### add your solution here
apple,"1:2:3",good
fig,guava,"32:54",nice
