Field separators
Now that you are familiar with basic awk
syntax and regular expressions, this chapter will dive deep into field processing. You'll learn how to set input and output field separators, how to use regexps for defining fields, and how to work with fixed-length fields.
The example_files directory has all the files used in the examples.
Default field separation
As seen earlier, awk
automatically splits input into fields which are accessible using $N
where N
is the field number you need. You can also pass an expression instead of a numeric literal to specify the field required.
$ cat table.txt
brown bread mat hair 42
blue cake mug shirt -7
yellow banana window shoes 3.14
# print the fourth field if the first field starts with 'b'
$ awk '$1 ~ /^b/{print $4}' table.txt
hair
shirt
# print the field as specified by the value stored in the 'f' variable
$ awk -v f=3 '{print $f}' table.txt
mat
mug
window
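Any numeric expression works as the field index, not just a plain variable. Here's a quick sketch of my own (reusing table.txt from above) with an arithmetic expression:
# f+1 evaluates to 3, so this prints the third field
$ awk -v f=2 '{print $(f+1)}' table.txt
mat
mug
window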
The NF
special variable will give you the number of fields for each input line. This is useful when you don't know how many fields are present in the input and you need to process fields from the end.
# print the last field of each input line
$ awk '{print $NF}' table.txt
42
-7
3.14
# print the last but one field
$ awk '{print $(NF-1)}' table.txt
hair
shirt
shoes
# don't forget the parentheses!
# this will subtract 1 from the last field and print it
$ awk '{print $NF-1}' table.txt
41
-8
2.14
By default, awk
does more than split the input on spaces. It splits on sequences of one or more space, tab or newline characters. In addition, any of these three characters at the start or end of input gets trimmed and won't be part of the field contents. Input containing newline characters will be covered in the Record separators chapter.
$ echo '   a   b   c   ' | awk '{print NF}'
3
# note that the leading spaces aren't part of the field content
$ echo '   a   b   c   ' | awk '{print $1}'
a
# note that the trailing spaces aren't part of the field content
$ echo '   a   b   c   ' | awk '{print $NF "."}'
c.
# here's another example with tab characters thrown in
$ printf ' one \t two\t\t\tthree ' | awk '{print NF}'
3
$ printf ' one \t two\t\t\tthree ' | awk '{print $2 "."}'
two.
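Note that this trimming applies only to the field contents. The whole input record is untouched, as this small example of my own shows ($0 holds the entire record):
# leading and trailing whitespace is still present in $0
$ echo '   a   b   c   ' | awk '{print "[" $0 "]"}'
[   a   b   c   ]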
When passing an expression for the field number, a floating-point result is acceptable too. The fractional portion is ignored. However, as precision is limited, it could result in rounding instead of truncation.
$ awk 'BEGIN{printf "%.16f\n", 2.999999999999999}'
2.9999999999999991
$ awk 'BEGIN{printf "%.16f\n", 2.9999999999999999}'
3.0000000000000000
# same as: awk '{print $2}' table.txt
$ awk '{print $2.999999999999999}' table.txt
bread
cake
banana
# same as: awk '{print $3}' table.txt
$ awk '{print $2.9999999999999999}' table.txt
mat
mug
window
Input field separator
The most common way to change the default field separator is to use the -F
command line option. The value passed to the option will be treated as a string literal and then converted to a regexp. For now, here are some examples without any special regexp characters.
# use ':' as the input field separator
$ echo 'goal:amazing:whistle:kwality' | awk -F: '{print $1}'
goal
$ echo 'goal:amazing:whistle:kwality' | awk -F: '{print $NF}'
kwality
# use quotes to avoid clashes with shell special characters
$ echo 'one;two;three;four' | awk -F';' '{print $3}'
three
# first and last fields will have empty string as their values
$ echo '=a=b=c=' | awk -F= '{print $1 "[" $NF "]"}'
[]
# difference between empty lines and lines without field separator
$ printf '\nhello\napple,banana\n' | awk -F, '{print NF}'
0
1
2
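As an aside (my own addition, not from the example files), since NF is 0 for lines without any fields, NF by itself is a handy condition to filter out empty lines:
# print only lines that have at least one field
$ printf '\nhello\napple,banana\n' | awk -F, 'NF'
hello
apple,banana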
You can also directly set the special FS
variable to change the input field separator. This can be done from the command line using -v
option or within the code blocks.
$ echo 'goal:amazing:whistle:kwality' | awk -v FS=: '{print $2}'
amazing
# field separator can be multiple characters too
$ echo '1e4SPT2k6SPT3a5SPT4z0' | awk 'BEGIN{FS="SPT"} {print $3}'
3a5
If you wish to split the input into individual characters, use an empty string as the field separator.
# note that the space between -F and '' is necessary here
$ echo 'apple' | awk -F '' '{print $1}'
a
$ echo 'apple' | awk -v FS= '{print $NF}'
e
# depending upon the locale, you can work with multibyte characters too
$ echo 'αλεπού' | awk -v FS= '{print $3}'
ε
Here are some examples with regexp based field separators. The value passed to -F
or FS
is treated as a string and then converted to a regexp. So, you'll need \\
instead of \
to mean a backslash character. The good news is that for single characters that are also regexp metacharacters, they'll be treated literally and you do not need to escape them.
$ echo 'Sample123string42with777numbers' | awk -F'[0-9]+' '{print $2}'
string
$ echo 'Sample123string42with777numbers' | awk -F'[a-zA-Z]+' '{print $2}'
123
# note the use of \\W to indicate \W
$ echo 'load;err_msg--\ant,r2..not' | awk -F'\\W+' '{print $3}'
ant
# same as: awk -F'\\.' '{print $2}'
$ echo 'hi.bye.hello' | awk -F. '{print $2}'
bye
# count the number of vowels for each input line
# note that empty lines will give -1 in the output
$ printf 'cool\nnice car\n' | awk -F'[aeiou]' '{print NF-1}'
2
3
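One more sketch of my own to round off the escaping rule: to split on a literal backslash character, you'd pass \\\\ since that string becomes \\ after escape processing, which is then the regexp for a single backslash.
# string '\\\\' -> regexp \\ -> matches one literal backslash
$ printf 'a\\b\\c\n' | awk -F'\\\\' '{print $2}'
b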
The default value of FS
is a single space character. So, if you set the input field separator to a single space, then it will be the same as if you are using the default split discussed in the previous section. If you want to override this behavior, you can use space inside a character class.
# same as: awk '{print NF}'
$ echo '   a   b   c   ' | awk -F' ' '{print NF}'
3
# there are 12 space characters, thus 13 fields
$ echo '   a   b   c   ' | awk -F'[ ]' '{print NF}'
13
If IGNORECASE
is set, it will affect field separation as well. The exception is when the field separator is a single character, which can be worked around by using a character class.
$ echo 'RECONSTRUCTED' | awk -F'[aeiou]+' -v IGNORECASE=1 '{print $NF}'
D
# when FS is a single character
$ echo 'RECONSTRUCTED' | awk -F'e' -v IGNORECASE=1 '{print $1}'
RECONSTRUCTED
$ echo 'RECONSTRUCTED' | awk -F'[e]' -v IGNORECASE=1 '{print $1}'
R
Output field separator
The OFS
special variable controls the output field separator. OFS
is used as the string between multiple arguments passed to the print
function. It is also used whenever $0
has to be reconstructed as a result of field contents being modified. The default value for OFS
is a single space character, just like FS
. There is no equivalent command line option though; you'll have to change OFS
directly.
# print the first and third fields, OFS is used to join these values
# note the use of , to separate print arguments
$ awk '{print $1, $3}' table.txt
brown mat
blue mug
yellow window
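# for contrast, a sketch of my own: without the comma, the values are
# concatenated directly and OFS is not used
$ awk '{print $1 $3}' table.txt
brownmat
bluemug
yellowwindow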
# same FS and OFS
$ echo 'goal:amazing:whistle:kwality' | awk -F: -v OFS=: '{print $2, $NF}'
amazing:kwality
$ echo 'goal:amazing:whistle:kwality' | awk 'BEGIN{FS=OFS=":"} {print $2, $NF}'
amazing:kwality
# different values for FS and OFS
$ echo 'goal:amazing:whistle:kwality' | awk -F: -v OFS=- '{print $2, $NF}'
amazing-kwality
Here are some examples for changing field contents and then printing $0
.
$ echo 'goal:amazing:whistle:kwality' | awk -F: -v OFS=: '{$2 = 42} 1'
goal:42:whistle:kwality
$ echo 'goal:amazing:whistle:kwality' | awk -F: -v OFS=, '{$2 = 42} 1'
goal,42,whistle,kwality
# recall that spaces at the start/end get trimmed for default FS
$ echo '   a   b   c   ' | awk '{$NF = "last"} 1'
a b last
Sometimes you want to print the contents of $0
with the new OFS
value but field contents aren't being changed. In such cases, you can assign a field value to itself to force the reconstruction of $0
.
# no change because there was no trigger to rebuild $0
$ echo 'Sample123string42with777numbers' | awk -F'[0-9]+' -v OFS=, '1'
Sample123string42with777numbers
# assign a field to itself in such cases
$ echo 'Sample123string42with777numbers' | awk -F'[0-9]+' -v OFS=, '{$1=$1} 1'
Sample,string,with,numbers
If you need to set the same input and output field separator, you can write a more concise one-liner using brace expansion. Here are some examples:
$ echo -v{,O}FS=:
-vFS=: -vOFS=:
$ echo 'goal:amazing:whistle:kwality' | awk -v{,O}FS=: '{$2 = 42} 1'
goal:42:whistle:kwality
$ echo 'goal:amazing:whistle:kwality' | awk '{$2 = 42} 1' {,O}FS=:
goal:42:whistle:kwality
However, this is not commonly used and doesn't save enough characters to be preferred over explicit assignment.
Manipulating NF
Changing the value of NF
will rebuild $0
as well. Here are some examples:
# reducing fields
$ echo 'goal:amazing:whistle:kwality' | awk -F: -v OFS=, '{NF=2} 1'
goal,amazing
# increasing fields
$ echo 'goal:amazing:whistle:kwality' | awk -F: -v OFS=: '{$(NF+1)="sea"} 1'
goal:amazing:whistle:kwality:sea
# empty fields will be created as needed
$ echo 'goal:amazing:whistle:kwality' | awk -F: -v OFS=: '{$8="go"} 1'
goal:amazing:whistle:kwality::::go
Assigning NF
to 0
will delete all the fields. However, a negative value will result in an error.
$ echo 'goal:amazing:whistle:kwality' | awk -F: -v OFS=: '{NF=-1} 1'
awk: cmd. line:1: (FILENAME=- FNR=1) fatal: NF set to negative value
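The NF=0 case isn't shown above, so here's a minimal sketch of my own. Since all the fields are deleted, $0 becomes an empty string:
$ echo 'goal:amazing:whistle:kwality' | awk -F: '{NF=0; print "[" $0 "]"}'
[]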
FPAT
The FS
variable allows you to define the input field separator. In contrast, FPAT
(field pattern) allows you to define what the fields should be made up of.
$ s='Sample123string42with777numbers'
# one or more consecutive digits
$ echo "$s" | awk -v FPAT='[0-9]+' '{print $2}'
42
$ s='coat Bin food tar12 best Apple fig_42'
# whole words made up of lowercase alphabets and digits only
$ echo "$s" | awk -v FPAT='\\<[a-z0-9]+\\>' -v OFS=, '{$1=$1} 1'
coat,food,tar12,best
$ s='items: "apple" and "mango"'
# get the first double quoted item
$ echo "$s" | awk -v FPAT='"[^"]+"' '{print $1}'
"apple"
FPAT
is often used for CSV input where fields can contain embedded delimiter characters. For example, a field with the content "fox,42"
when ,
is the delimiter.
$ s='eagle,"fox,42",bee,frog'
# simply using , as separator isn't sufficient
$ echo "$s" | awk -F, '{print $2}'
"fox
For such simpler CSV input, FPAT
helps to define fields as starting and ending with double quotes or containing non-comma characters.
# * is used instead of + to allow empty fields
$ echo "$s" | awk -v FPAT='"[^"]*"|[^,]*' '{print $2}'
"fox,42"
The above will not work for all kinds of CSV files, for example if fields contain escaped double quotes, newline characters, etc. See stackoverflow: What's the most robust way to efficiently parse CSV using awk? and csvquote for such cases. You could also use other programming languages such as Perl, Python, Ruby, etc which come with standard CSV parsing libraries or have easy access to third party solutions. There are also specialized command line tools such as xsv.
Proper CSV support is planned for a future version. You can also check out frawk, which is mostly similar to the awk
command but also supports CSV parsing. goawk is another implementation with CSV support.
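For instance, here's a rough sketch of what goawk's CSV mode looks like, assuming its -i csv input option (see the goawk documentation for the details):
$ echo 'eagle,"fox,42",bee,frog' | goawk -i csv '{print $2}'
fox,42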
If IGNORECASE
is set, it will affect field matching as well. Unlike FS
, there is no different behavior for a single character pattern.
# count number of 'e' in the input string
$ echo 'Read Eat Sleep' | awk -v FPAT='e' '{print NF}'
3
$ echo 'Read Eat Sleep' | awk -v IGNORECASE=1 -v FPAT='e' '{print NF}'
4
$ echo 'Read Eat Sleep' | awk -v IGNORECASE=1 -v FPAT='[e]' '{print NF}'
4
FIELDWIDTHS
FIELDWIDTHS
is another feature where you get to define the field contents. As indicated by the name, you have to specify the number of characters for each field. This method is useful for processing fixed-width data.
$ cat items.txt
apple   fig banana
50      10  200
# here field widths have been assigned such that
# extra spaces are placed at the end of each field
$ awk -v FIELDWIDTHS='8 4 6' '{print $2}' items.txt
fig
10
# note that the field contents will include the spaces as well
$ awk -v FIELDWIDTHS='8 4 6' '{print "[" $2 "]"}' items.txt
[fig ]
[10  ]
You can optionally prefix a field width with the number of characters to be ignored.
# first field is 5 characters
# then 3 characters are ignored and 3 characters for the second field
# then 1 character is ignored and 6 characters for the third field
$ awk -v FIELDWIDTHS='5 3:3 1:6' '{print "[" $1 "]"}' items.txt
[apple]
[50   ]
$ awk -v FIELDWIDTHS='5 3:3 1:6' '{print "[" $2 "]"}' items.txt
[fig]
[10 ]
If an input line length exceeds the total width specified, the extra characters will simply be ignored. If you wish to access those characters, you can use *
to represent the last field. See gawk manual: FIELDWIDTHS for more such corner cases.
$ awk -v FIELDWIDTHS='5 *' '{print "[" $1 "]"}' items.txt
[apple]
[50   ]
$ awk -v FIELDWIDTHS='5 *' '{print "[" $2 "]"}' items.txt
[   fig banana]
[   10  200]
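And here's a small sketch of my own for the earlier point that characters beyond the specified widths are simply ignored:
# ' banana' falls outside the 2+3 characters specified, so it is discarded
$ echo 'apple banana' | awk -v FIELDWIDTHS='2 3' -v OFS=, '{$1=$1} 1'
ap,ple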
Summary
Working with fields is the most popular feature of awk
. This chapter discussed various ways in which you can split the input into fields and manipulate them. There are many more field related examples to be discussed in the coming chapters. I'd also highly suggest reading through gawk manual: Fields for more details regarding field processing.
The next chapter will discuss various ways to use record separators and related special variables.
Exercises
The exercises directory has all the files used in this section.
1) For the input file brackets.txt
, extract only the contents between ()
or )(
from each input line. Assume that ()
characters will be present only once per line.
$ cat brackets.txt
foo blah blah(ice) 123 xyz$
(almond-pista) choco
yo )yoyo( yo
$ awk ##### add your solution here
ice
almond-pista
yoyo
2) For the input file scores.csv
, extract Name
and Physics
fields in the format shown below.
$ cat scores.csv
Name,Maths,Physics,Chemistry
Blue,67,46,99
Lin,78,83,80
Er,56,79,92
Cy,97,98,95
Ort,68,72,66
Ith,100,100,100
$ awk ##### add your solution here
Name:Physics
Blue:46
Lin:83
Er:79
Cy:98
Ort:72
Ith:100
3) For the input file scores.csv
, display names of those who've scored above 70
in Maths.
$ awk ##### add your solution here
Lin
Cy
Ith
4) Display the number of word characters for the given inputs. The word definition here is the same as used in regular expressions. Can you construct a solution with gsub
and one without substitution functions?
$ echo 'hi there' | awk ##### add your solution here
7
$ echo 'u-no;co%."(do_12:as' | awk ##### add your solution here
12
5) For the input file quoted.txt
, extract the first and third sequence of characters surrounded by double quotes and display them in the format shown below. Solution shouldn't use substitution functions.
$ cat quoted.txt
1 "grape" and "mango" and "guava"
("a 1""b""c-2""d")
$ awk ##### add your solution here
"grape","guava"
"a 1","c-2"
6) For the input file varying_fields.txt
, construct a solution to get the output shown below. Solution shouldn't use substitution functions.
$ cat varying_fields.txt
hi,bye,there,was,here,to
1,2,3,4,5
$ awk ##### add your solution here
hi,bye,to
1,2,5
7) Transform the given input file fw.txt
to get the output as shown below. If a field is empty (i.e. contains only space characters), replace it with NA
.
$ cat fw.txt
1.3 rs 90 0.134563
3.8 6
5.2 ye 8.2387
4.2 kt 32 45.1
$ awk ##### add your solution here
1.3,rs,0.134563
3.8,NA,6
5.2,ye,8.2387
4.2,kt,45.1
8) Display only the third and fifth characters from each input line as shown below.
$ printf 'restore\ncat one\ncricket' | awk ##### add your solution here
so
to
ik
9) The fields.txt
file has fields separated by the :
character. Delete :
and the last field if there is a digit character anywhere before the last field. Solution shouldn't use substitution functions.
$ cat fields.txt
42:cat
twelve:a2b
we:be:he:0:a:b:bother
apple:banana-42:cherry:
dragon:unicorn:centaur
$ awk ##### add your solution here
42
twelve:a2b
we:be:he:0:a:b
apple:banana-42:cherry
dragon:unicorn:centaur
10) Retain only the first three fields for the given sample string that uses ^
as the input field separator. Use ,
as the output field separator.
$ echo 'sit^eat^very^eerie^near' | awk ##### add your solution here
sit,eat,very
11) The sample string shown below uses cat
as the field separator (irrespective of case). Use space as the output field separator and add 42
as the last field.
$ s='applecatfigCaT12345cAtbanana'
$ echo "$s" | awk ##### add your solution here
apple fig 12345 banana 42
12) For the input file sample.txt
, filter lines containing 6 or more lowercase vowels.
$ awk ##### add your solution here
No doubt you like it too
Much ado about nothing
13) The input file concat.txt
has contents of various files preceded by a line starting with ###
. Replace such sequence of characters with an incrementing integer value (starting with 1
) in the format shown below.
$ awk ##### add your solution here
1) addr.txt
How are you
This game is good
Today is sunny
2) broken.txt
top
1234567890
bottom
3) sample.txt
Just do-it
Believe it
4) mixed_fs.txt
pink blue white yellow
car,mat,ball,basket