Multiple file input

You have already seen control structures like BEGIN, END and next. This chapter discusses control structures that help you make decisions on a per-file basis when multiple files are passed as input.

BEGINFILE, ENDFILE and FILENAME

  • BEGINFILE — this block gets executed before start of each input file
  • ENDFILE — this block gets executed after processing each input file
  • FILENAME — special variable having file name of current input file

$ awk 'BEGINFILE{print "--- " FILENAME " ---"} 1' greeting.txt table.txt
--- greeting.txt ---
Hi there
Have a nice day
Good bye
--- table.txt ---
brown bread mat hair 42
blue cake mug shirt -7
yellow banana window shoes 3.14

$ # same as: tail -q -n1 greeting.txt table.txt
$ awk 'ENDFILE{print $0}' greeting.txt table.txt
Good bye
yellow banana window shoes 3.14

nextfile

nextfile will skip remaining records from the current file being processed and move on to the next file.

$ # print filename if it contains 'I' anywhere in the file
$ # same as: grep -l 'I' f[1-3].txt greeting.txt
$ awk '/I/{print FILENAME; nextfile}' f[1-3].txt greeting.txt
f1.txt
f2.txt

$ # print filename if it contains both 'o' and 'at' anywhere in the file
$ awk 'BEGINFILE{m1=m2=0} /o/{m1=1} /at/{m2=1}
       m1 && m2{print FILENAME; nextfile}' f[1-3].txt greeting.txt
f2.txt
f3.txt

$ # print filename if it contains 'at' but not 'o'
$ awk 'BEGINFILE{m1=m2=0} /o/{m1=1; nextfile} /at/{m2=1}
       ENDFILE{if(!m1 && m2) print FILENAME}' f[1-3].txt greeting.txt
f1.txt
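
Another way to picture nextfile is as a per-file "stop here" switch. The sketch below prints each file only up to a marker line. The file names p1.txt and p2.txt and the STOP marker are assumptions made for this illustration.

```shell
# Sample files with a marker line (hypothetical names and contents).
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1
printf 'keep 1\nkeep 2\nSTOP\nskipped\n' > p1.txt
printf 'keep 3\nSTOP\nskipped\nskipped\n' > p2.txt

# When the marker is seen, abandon the rest of the current file;
# otherwise the pattern 1 prints the record.
kept=$(awk '/^STOP$/{nextfile} 1' p1.txt p2.txt)
printf '%s\n' "$kept"
```

Only the three keep lines should be printed. Because nextfile stops reading the file immediately, the records after the marker are never processed, which can also save time on large files.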

Warning: nextfile cannot be used in BEGIN, END or ENDFILE blocks. See the gawk manual section on nextfile for more details, including how it affects ENDFILE and other special cases.

ARGC and ARGV

The ARGC special variable contains the total number of arguments passed to the awk command, including awk itself as an argument. Command-line options and the awk code itself are not counted. The ARGV special array contains the arguments themselves.

$ # note that index starts with '0' here
$ awk 'BEGIN{for(i=0; i<ARGC; i++) print ARGV[i]}' f[1-3].txt greeting.txt
awk
f1.txt
f2.txt
f3.txt
greeting.txt

Similar to manipulating NF and modifying $N field contents, you can change the values of ARGC and ARGV to control how the arguments should be processed.
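
As a minimal sketch of this idea, an ARGV element can be set to an empty string in the BEGIN block so that awk skips that argument. The file names a.txt, b.txt and c.txt are hypothetical.

```shell
# Three one-line sample files (hypothetical names).
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1
printf 'from a\n' > a.txt
printf 'from b\n' > b.txt
printf 'from c\n' > c.txt

# ARGV[0] is awk, ARGV[1] is a.txt, ARGV[2] is b.txt, ARGV[3] is c.txt.
# Emptying ARGV[2] before any input is read makes awk skip b.txt.
shown=$(awk 'BEGIN{ARGV[2]=""} {print FILENAME ": " $0}' a.txt b.txt c.txt)
printf '%s\n' "$shown"
```

This should print lines only from a.txt and c.txt. Using delete ARGV[2] instead of the empty-string assignment is another option with the same effect.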

However, not all arguments are necessarily filenames. awk allows assigning variable values without the -v option if it is done in the place where you usually provide file arguments. For example:

$ awk 'BEGIN{for(i=0; i<ARGC; i++) print ARGV[i]}' table.txt n=5 greeting.txt
awk
table.txt
n=5
greeting.txt

In the above example, the variable n gets the value 5 only after awk has finished processing table.txt, so it is set while greeting.txt is being processed. Here's an example where FS is changed between two files.

$ cat table.txt
brown bread mat hair 42
blue cake mug shirt -7
yellow banana window shoes 3.14
$ cat books.csv
Harry Potter,Mistborn,To Kill a Mocking Bird
Matilda,Castle Hangnail,Jane Eyre

$ # for table.txt, FS will be default value
$ # for books.csv, FS will be comma character
$ # OFS is comma for both files
$ awk -v OFS=, 'NF=2' table.txt FS=, books.csv
brown,bread
blue,cake
yellow,banana
Harry Potter,Mistborn
Matilda,Castle Hangnail
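
The timing of such assignments can be checked with a small sketch that prints a variable for every record. The file names first.txt and second.txt and the variable n are made up for illustration.

```shell
# Two small sample files (hypothetical names).
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1
printf '1\n2\n' > first.txt
printf '3\n' > second.txt

# n=5 sits between the file arguments, so the assignment happens
# after first.txt has been read and before second.txt is opened.
seen=$(awk '{print FILENAME ": n=[" n "]"}' first.txt n=5 second.txt)
printf '%s\n' "$seen"
```

The two records from first.txt should show an empty n, while the record from second.txt should show n=[5].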

Info: See stackoverflow: extract positions 2-7 from a fasta sequence for a practical example of changing field and record separators between the files being processed.

Summary

This chapter introduced a few more special blocks and variables that are handy for processing multiple file inputs. These will show up in examples in the coming chapters as well.

The next chapter will discuss use cases where you need to make decisions based on multiple input records.

Exercises

a) Print the last field of the first two lines for the input files table.txt, scores.csv and fw.txt. The field separators for these files are space, comma and fixed width respectively. To make the output more informative, print filenames and a separator as shown in the output below. Assume the input files will have at least two lines.

$ awk ##### add your solution here
>table.txt<
42
-7
----------
>scores.csv<
Chemistry
99
----------
>fw.txt<
0.134563
6
----------

b) For the given list of input files, display all filenames that contain at or fun in the third field. Assume space as the field separator.

$ awk ##### add your solution here sample.txt secrets.txt addr.txt table.txt
secrets.txt
addr.txt
table.txt