Field separators

Now that you are familiar with basic awk syntax and regular expressions, this chapter will dive deep into field processing. You'll learn how to set input and output field separators, how to use regexps to define fields, and how to work with fixed-length fields.

info The example_files directory has all the files used in the examples.

Default field separation

As seen earlier, awk automatically splits input into fields which are accessible using $N where N is the field number you need. You can also pass an expression instead of a numeric literal to specify the field required.

$ cat table.txt
brown bread mat hair 42
blue cake mug shirt -7
yellow banana window shoes 3.14

# print the fourth field if the first field starts with 'b'
$ awk '$1 ~ /^b/{print $4}' table.txt
hair
shirt

# print the field as specified by the value stored in the 'f' variable
$ awk -v f=3 '{print $f}' table.txt
mat
mug
window

The NF special variable will give you the number of fields for each input line. This is useful when you don't know how many fields are present in the input and you need to process fields from the end.

# print the last field of each input line
$ awk '{print $NF}' table.txt
42
-7
3.14

# print the last but one field
$ awk '{print $(NF-1)}' table.txt
hair
shirt
shoes

# don't forget the parentheses!
# this will subtract 1 from the last field and print it
$ awk '{print $NF-1}' table.txt
41
-8
2.14

By default, awk does more than split the input on spaces. It splits based on one or more sequences of space, tab, or newline characters. In addition, any of these three characters at the start or end of input gets trimmed and won't be part of the field contents. Input containing newline characters will be covered in the Record separators chapter.

$ echo '   a   b   c   ' | awk '{print NF}'
3
# note that the leading spaces aren't part of the field content
$ echo '   a   b   c   ' | awk '{print $1}'
a
# note that the trailing spaces aren't part of the field content
$ echo '   a   b   c   ' | awk '{print $NF "."}'
c.

# here's another example with tab characters thrown in
$ printf ' one \t two\t\t\tthree ' | awk '{print NF}'
3
$ printf ' one \t two\t\t\tthree ' | awk '{print $2 "."}'
two.
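As a quick sanity check (an extra illustration, not from the original examples), you can print the default value of FS. It is a single space character, which is what signals this special whitespace-based splitting:

```shell
# FS defaults to a single space character
awk 'BEGIN{print "[" FS "]"}'
```

This also explains the behavior discussed later in this chapter, where explicitly setting FS to a single space still gives you the default splitting.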

warning When passing an expression as the field number, a floating-point result is acceptable too. The fractional portion is ignored. However, as precision is limited, it could result in rounding instead of truncation.

$ awk 'BEGIN{printf "%.16f\n", 2.999999999999999}'
2.9999999999999991
$ awk 'BEGIN{printf "%.16f\n", 2.9999999999999999}'
3.0000000000000000

# same as: awk '{print $2}' table.txt
$ awk '{print $2.999999999999999}' table.txt
bread
cake
banana

# same as: awk '{print $3}' table.txt
$ awk '{print $2.9999999999999999}' table.txt
mat
mug
window

Input field separator

The most common way to change the default field separator is to use the -F command line option. The value passed to the option will be treated as a string literal and then converted to a regexp. For now, here are some examples without any special regexp characters.

# use ':' as the input field separator
$ echo 'goal:amazing:whistle:kwality' | awk -F: '{print $1}'
goal
$ echo 'goal:amazing:whistle:kwality' | awk -F: '{print $NF}'
kwality

# use quotes to avoid clashes with shell special characters
$ echo 'one;two;three;four' | awk -F';' '{print $3}'
three

# first and last fields will have empty string as their values
$ echo '=a=b=c=' | awk -F= '{print $1 "[" $NF "]"}'
[]

# difference between empty lines and lines without the field separator
$ printf '\nhello\napple,banana\n' | awk -F, '{print NF}'
0
1
2

You can also directly set the special FS variable to change the input field separator. This can be done from the command line using the -v option or within the code blocks.

$ echo 'goal:amazing:whistle:kwality' | awk -v FS=: '{print $2}'
amazing

# field separator can be multiple characters too
$ echo '1e4SPT2k6SPT3a5SPT4z0' | awk 'BEGIN{FS="SPT"} {print $3}'
3a5
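One subtlety worth noting when assigning FS inside the code blocks: the current record has already been split by the time your action runs, so a new FS value takes effect only from the next record onwards. Here's a small sketch (not from the original examples) to illustrate the timing:

```shell
# FS=":" is assigned while processing the first record,
# so the first record was already split using the default FS
printf 'a:b\nc:d\n' | awk 'NR==1{FS=":"} {print $1}'
# a:b
# c
```

This is why FS is usually assigned in the BEGIN block, as shown above, when it should apply to every record.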

If you wish to split the input as individual characters, use an empty string as the field separator.

# note that the space between -F and '' is necessary here
$ echo 'apple' | awk -F '' '{print $1}'
a
$ echo 'apple' | awk -v FS= '{print $NF}'
e

# depending upon the locale, you can work with multibyte characters too
$ echo 'αλεπού' | awk -v FS= '{print $3}'
ε

Here are some examples with regexp based field separators. The value passed to -F or FS is treated as a string and then converted to a regexp, so you'll need \\ instead of \ to mean a backslash character. The good news is that single-character separators which happen to be regexp metacharacters are treated literally, so you do not need to escape them.

$ echo 'Sample123string42with777numbers' | awk -F'[0-9]+' '{print $2}'
string
$ echo 'Sample123string42with777numbers' | awk -F'[a-zA-Z]+' '{print $2}'
123

# note the use of \\W to indicate \W
$ printf '%s\n' 'load;err_msg--\ant,r2..not' | awk -F'\\W+' '{print $3}'
ant

# same as: awk -F'\\.' '{print $2}'
$ echo 'hi.bye.hello' | awk -F. '{print $2}'
bye

# count the number of vowels for each input line
# note that empty lines will give -1 in the output
$ printf 'cool\nnice car\n' | awk -F'[aeiou]' '{print NF-1}'
2
3

warning The default value of FS is a single space character. So, if you set the input field separator to a single space, it will behave the same as the default splitting discussed in the previous section. If you want to override this behavior, put the space inside a character class.

# same as: awk '{print NF}'
$ echo '   a   b   c   ' | awk -F' ' '{print NF}'
3

# there are 12 space characters, thus 13 fields
$ echo '   a   b   c   ' | awk -F'[ ]' '{print NF}'
13

If IGNORECASE is set, it will affect field separation as well, except when the field separator is a single character. That case can be worked around by using a character class.

$ echo 'RECONSTRUCTED' | awk -F'[aeiou]+' -v IGNORECASE=1 '{print $NF}'
D

# when FS is a single character
$ echo 'RECONSTRUCTED' | awk -F'e' -v IGNORECASE=1 '{print $1}'
RECONSTRUCTED
$ echo 'RECONSTRUCTED' | awk -F'[e]' -v IGNORECASE=1 '{print $1}'
R

Output field separator

The OFS special variable controls the output field separator. OFS is used as the string between multiple arguments passed to the print function. It is also used whenever $0 has to be reconstructed as a result of field contents being modified. The default value for OFS is a single space character, just like FS. There is no equivalent command line option though; you'll have to change OFS directly.

# print the first and third fields, OFS is used to join these values
# note the use of , to separate print arguments
$ awk '{print $1, $3}' table.txt
brown mat
blue mug
yellow window

# same FS and OFS
$ echo 'goal:amazing:whistle:kwality' | awk -F: -v OFS=: '{print $2, $NF}'
amazing:kwality
$ echo 'goal:amazing:whistle:kwality' | awk 'BEGIN{FS=OFS=":"} {print $2, $NF}'
amazing:kwality

# different values for FS and OFS
$ echo 'goal:amazing:whistle:kwality' | awk -F: -v OFS=- '{print $2, $NF}'
amazing-kwality

Here are some examples of changing field contents and then printing $0.

$ echo 'goal:amazing:whistle:kwality' | awk -F: -v OFS=: '{$2 = 42} 1'
goal:42:whistle:kwality
$ echo 'goal:amazing:whistle:kwality' | awk -F: -v OFS=, '{$2 = 42} 1'
goal,42,whistle,kwality

# recall that spaces at the start/end get trimmed for the default FS
$ echo '   a   b   c   ' | awk '{$NF = "last"} 1'
a b last

Sometimes you want to print the contents of $0 with the new OFS value but field contents aren't being changed. In such cases, you can assign a field value to itself to force the reconstruction of $0.

# no change because there was no trigger to rebuild $0
$ echo 'Sample123string42with777numbers' | awk -F'[0-9]+' -v OFS=, '1'
Sample123string42with777numbers

# assign a field to itself in such cases
$ echo 'Sample123string42with777numbers' | awk -F'[0-9]+' -v OFS=, '{$1=$1} 1'
Sample,string,with,numbers
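A handy side effect of this trick, shown here as an extra illustration: with the default FS and OFS, forcing a rebuild collapses runs of whitespace into single spaces and trims the ends, since the fields are rejoined using a single space.

```shell
# rebuilding $0 joins the fields with OFS (a single space by default)
printf '  a \t b   c  \n' | awk '{$1=$1} 1'
# a b c
```

This makes `awk '{$1=$1} 1'` a common one-liner for squeezing whitespace.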

info If you need to set the same input and output field separator, you can write a more concise one-liner using brace expansion (assuming your shell supports it). Here are some examples:

$ echo -v{,O}FS=:
-vFS=: -vOFS=:

$ echo 'goal:amazing:whistle:kwality' | awk -v{,O}FS=: '{$2 = 42} 1'
goal:42:whistle:kwality

# var=value assignments can also be passed after the awk code
$ echo 'goal:amazing:whistle:kwality' | awk '{$2 = 42} 1' {,O}FS=:
goal:42:whistle:kwality

However, this is not commonly used and doesn't save enough characters to be preferred over explicit assignment.

Manipulating NF

Changing the value of NF will rebuild $0 as well. Here are some examples:

# reducing fields
$ echo 'goal:amazing:whistle:kwality' | awk -F: -v OFS=, '{NF=2} 1'
goal,amazing

# increasing fields
$ echo 'goal:amazing:whistle:kwality' | awk -F: -v OFS=: '{$(NF+1)="sea"} 1'
goal:amazing:whistle:kwality:sea

# empty fields will be created as needed
$ echo 'goal:amazing:whistle:kwality' | awk -F: -v OFS=: '{$8="go"} 1'
goal:amazing:whistle:kwality::::go
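You can also delete fields from the end by decrementing NF. This works in GNU awk (POSIX leaves the effect of decreasing NF unspecified, so behavior may vary in other implementations); a quick sketch:

```shell
# remove the last field by decrementing NF
echo 'goal:amazing:whistle:kwality' | awk -F: -v OFS=: '{NF--} 1'
# goal:amazing:whistle
```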

warning Assigning 0 to NF will delete all the fields. However, a negative value will result in an error.

$ echo 'goal:amazing:whistle:kwality' | awk -F: -v OFS=: '{NF=-1} 1'
awk: cmd. line:1: (FILENAME=- FNR=1) fatal: NF set to negative value

FPAT

The FS variable allows you to define the input field separator. In contrast, FPAT (field pattern) allows you to define what the fields should be made up of.

$ s='Sample123string42with777numbers'
# one or more consecutive digits
$ echo "$s" | awk -v FPAT='[0-9]+' '{print $2}'
42

$ s='coat Bin food tar12 best Apple fig_42'
# whole words made up of lowercase alphabets and digits only
$ echo "$s" | awk -v FPAT='\\<[a-z0-9]+\\>' -v OFS=, '{$1=$1} 1'
coat,food,tar12,best

$ s='items: "apple" and "mango"'
# get the first double quoted item
$ echo "$s" | awk -v FPAT='"[^"]+"' '{print $1}'
"apple"
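To make the FS versus FPAT distinction concrete, here's a side-by-side sketch on the same input (an extra illustration, not from the original examples). FS describes what separates the fields, while FPAT describes what a field itself looks like:

```shell
s='a, b ,c'

# FS: everything between commas is a field, surrounding spaces included
echo "$s" | awk -F, '{print "[" $2 "]"}'
# [ b ]

# FPAT: a field is a run of non-comma, non-space characters (GNU awk only)
echo "$s" | awk -v FPAT='[^, ]+' '{print "[" $2 "]"}'
# [b]
```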

If IGNORECASE is set, it will affect field matching as well. Unlike FS, there is no special behavior for a single character pattern.

# count the number of 'e' characters
$ echo 'Read Eat Sleep' | awk -v FPAT='e' '{print NF}'
3
$ echo 'Read Eat Sleep' | awk -v IGNORECASE=1 -v FPAT='e' '{print NF}'
4
$ echo 'Read Eat Sleep' | awk -v IGNORECASE=1 -v FPAT='[e]' '{print NF}'
4

CSV processing with FPAT

FPAT is effective for processing CSV (Comma Separated Values) input even when the fields contain embedded delimiter characters. First, consider the issue shown below:

$ s='eagle,"fox,42",bee,frog'

# simply using , as the separator isn't sufficient
$ echo "$s" | awk -F, '{print $2}'
"fox

For such cases, FPAT helps to define fields as starting and ending with double quotes or containing non-comma characters.

# * is used instead of + to allow empty fields
$ echo "$s" | awk -v FPAT='"[^"]*"|[^,]*' '{print $2}'
"fox,42"

CSV processing with --csv

The solution presented in the last section will not work for all kinds of CSV files. For example, it fails if the fields contain escaped double quotes, newline characters, and so on.

$ s='"toy,eagle\"s","fox,42",bee,frog'

# the FPAT solution won't work if there are escaped quotes
$ printf '%b' "$s" | awk -v FPAT='"[^"]*"|[^,]*' '{print $2}'
s"

GNU awk (version 5.3 and above) has native support for handling CSV files, which is activated with the --csv (or -k) option. You cannot customize the field separator with this feature. Also, the quotes around a field will not be retained. See gawk manual: Working With Comma Separated Value Files for more details.

# --csv or -k can be used instead
# however, you cannot customize the field separator
$ printf '%b' "$s" | awk -k '{print $2}'
fox,42

# and quotes around a field will be lost
$ printf '%b' "$s" | awk -k -v OFS=: '{$1=$1} 1'
toy,eagle\"s:fox,42:bee:frog

Here's an example with embedded newline characters:

$ cat newline.csv
apple,"1
2
3",good
fig,guava,"32
54",nice

$ awk -k 'NR==1{print $2}' newline.csv
1
2
3

See stackoverflow: What's the most robust way to efficiently parse CSV using awk? and csvquote for alternate solutions. You could also use other programming languages such as Perl, Python, Ruby, etc., which come with standard CSV parsing libraries or have easy access to third party solutions. There are also specialized command line tools such as xsv.

You can also check out frawk, which is mostly similar to the awk command but also supports CSV parsing. goawk is another implementation with CSV support.

FIELDWIDTHS

FIELDWIDTHS is another feature where you get to define field contents. As indicated by the name, you have to specify the number of characters for each field. This method is useful for processing fixed-width data.

$ cat items.txt
apple   fig banana
50      10  200

# here field widths have been assigned such that
# extra spaces are placed at the end of each field
$ awk -v FIELDWIDTHS='8 4 6' '{print $2}' items.txt
fig 
10  

# note that the field contents will include the spaces as well
$ awk -v FIELDWIDTHS='8 4 6' '{print "[" $2 "]"}' items.txt
[fig ]
[10  ]
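If you're stuck with an awk that doesn't support FIELDWIDTHS, substr can serve as a portable (if more verbose) substitute. This sketch (not from the original examples) extracts the same second column by character position:

```shell
# portable rough equivalent of FIELDWIDTHS='8 4 6' for the second field:
# take 4 characters starting at position 9
printf 'apple   fig banana\n50      10  200\n' |
    awk '{print "[" substr($0, 9, 4) "]"}'
# [fig ]
# [10  ]
```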

You can optionally prefix a field width with the number of characters to be ignored.

# first field is 5 characters
# then 3 characters are ignored and 3 characters for the second field
# then 1 character is ignored and 6 characters for the third field
$ awk -v FIELDWIDTHS='5 3:3 1:6' '{print "[" $1 "]"}' items.txt
[apple]
[50   ]
$ awk -v FIELDWIDTHS='5 3:3 1:6' '{print "[" $2 "]"}' items.txt
[fig]
[10 ]

If an input line length exceeds the total width specified, the extra characters will simply be ignored. If you wish to access those characters, you can use * to represent the last field. See gawk manual: FIELDWIDTHS for more such corner cases.

$ awk -v FIELDWIDTHS='5 *' '{print "[" $1 "]"}' items.txt
[apple]
[50   ]
$ awk -v FIELDWIDTHS='5 *' '{print "[" $2 "]"}' items.txt
[   fig banana]
[   10  200]

Summary

Working with fields is the most popular feature of awk. This chapter discussed various ways in which you can split the input into fields and manipulate them. There are many more field-related examples to be discussed in the coming chapters. I'd highly suggest also reading through gawk manual: Fields for more details regarding field processing.

The next chapter will discuss various ways to use record separators and related special variables.

Exercises

info The exercises directory has all the files used in this section.

1) For the input file brackets.txt, extract only the contents between () or )( from each input line. Assume that () characters will be present only once per line.

$ cat brackets.txt
foo blah blah(ice) 123 xyz$ 
(almond-pista) choco
yo )yoyo( yo

$ awk ##### add your solution here
ice
almond-pista
yoyo

2) For the input file scores.csv, extract Name and Physics fields in the format shown below.

$ cat scores.csv
Name,Maths,Physics,Chemistry
Blue,67,46,99
Lin,78,83,80
Er,56,79,92
Cy,97,98,95
Ort,68,72,66
Ith,100,100,100

$ awk ##### add your solution here
Name:Physics
Blue:46
Lin:83
Er:79
Cy:98
Ort:72
Ith:100

3) For the input file scores.csv, display names of those who've scored above 70 in Maths.

$ awk ##### add your solution here
Lin
Cy
Ith

4) Display the number of word characters for the given inputs. The word definition here is the same as used in regular expressions. Can you construct a solution with gsub and one without substitution functions?

$ echo 'hi there' | awk ##### add your solution here
7
$ echo 'u-no;co%."(do_12:as' | awk ##### add your solution here
12

5) For the input file quoted.txt, extract the first and third sequence of characters surrounded by double quotes and display them in the format shown below. Solution shouldn't use substitution functions.

$ cat quoted.txt
1 "grape" and "mango" and "guava"
("a 1""b""c-2""d")

$ awk ##### add your solution here
"grape","guava"
"a 1","c-2"

6) For the input file varying_fields.txt, construct a solution to get the output shown below. Solution shouldn't use substitution functions.

$ cat varying_fields.txt
hi,bye,there,was,here,to
1,2,3,4,5

$ awk ##### add your solution here
hi,bye,to
1,2,5

7) Transform the given input file fw.txt to get the output as shown below. If a field is empty (i.e. contains only space characters), replace it with NA.

$ cat fw.txt
1.3  rs   90  0.134563
3.8           6
5.2  ye       8.2387
4.2  kt   32  45.1

$ awk ##### add your solution here
1.3,rs,0.134563
3.8,NA,6
5.2,ye,8.2387
4.2,kt,45.1

8) Display only the third and fifth characters from each input line as shown below.

$ printf 'restore\ncat one\ncricket' | awk ##### add your solution here
so
to
ik

9) The fields.txt file has fields separated by the : character. Delete : and the last field if there is a digit character anywhere before the last field. Solution shouldn't use substitution functions.

$ cat fields.txt
42:cat
twelve:a2b
we:be:he:0:a:b:bother
apple:banana-42:cherry:
dragon:unicorn:centaur

$ awk ##### add your solution here
42
twelve:a2b
we:be:he:0:a:b
apple:banana-42:cherry
dragon:unicorn:centaur

10) Retain only the first three fields for the given sample string that uses ^ as the input field separator. Use , as the output field separator.

$ echo 'sit^eat^very^eerie^near' | awk ##### add your solution here
sit,eat,very

11) The sample string shown below uses cat as the field separator (irrespective of case). Use space as the output field separator and add 42 as the last field.

$ s='applecatfigCaT12345cAtbanana'
$ echo "$s" | awk ##### add your solution here
apple fig 12345 banana 42

12) For the input file sample.txt, filter lines containing 6 or more lowercase vowels.

$ awk ##### add your solution here
No doubt you like it too
Much ado about nothing

13) The input file concat.txt has contents of various files preceded by a line starting with ###. Replace such sequence of characters with an incrementing integer value (starting with 1) in the format shown below.

$ awk ##### add your solution here
1) addr.txt
How are you
This game is good
Today is sunny
2) broken.txt
top
1234567890
bottom
3) sample.txt
Just do-it
Believe it
4) mixed_fs.txt
pink blue white yellow
car,mat,ball,basket

14) The newline.csv file has fields with embedded newline characters. Display only the first and last fields as shown below.

$ cat newline.csv
apple,"1
2
3",good
fig,guava,"32
54",nice

$ awk ##### add your solution here
apple,good
fig,nice

15) The newline.csv file has fields with embedded newline characters, but no fields with escaped double quotes. Change the embedded newline characters to : without removing the double quotes around such fields.

$ cat newline.csv
apple,"1
2
3",good
fig,guava,"32
54",nice

$ awk ##### add your solution here
apple,"1:2:3",good
fig,guava,"32:54",nice