Two file processing

This chapter focuses on solving problems which depend upon contents of two files. These are usually based on comparing records and fields. Sometimes, record number plays a role too. You'll also learn about the getline built-in function.

Comparing records

Consider the following input files which will be compared line wise to get common lines and unique lines.

$ cat color_list1.txt
teal
light blue
green
yellow
$ cat color_list2.txt
light blue
black
dark green
yellow

The key features used in the solution below:

  • For two files as input, NR==FNR will be true only when the first file is being processed
  • next will skip rest of code and fetch next record
  • a[$0] by itself is a valid statement. It will create an uninitialized element in array a with $0 as the key (assuming the key doesn't exist yet)
  • $0 in a checks if the given string ($0 here) exists as a key in array a
$ # common lines
$ # same as: grep -Fxf color_list1.txt color_list2.txt
$ awk 'NR==FNR{a[$0]; next} $0 in a' color_list1.txt color_list2.txt
light blue
yellow

$ # lines from color_list2.txt not present in color_list1.txt
$ # same as: grep -vFxf color_list1.txt color_list2.txt
$ awk 'NR==FNR{a[$0]; next} !($0 in a)' color_list1.txt color_list2.txt
black
dark green

$ # reversing the order of input files gives
$ # lines from color_list1.txt not present in color_list2.txt
$ awk 'NR==FNR{a[$0]; next} !($0 in a)' color_list2.txt color_list1.txt
teal
green

warning Note that the NR==FNR logic will fail if the first file is empty. See this unix.stackexchange thread for workarounds.

Comparing fields

In the previous section, you saw how to compare whole contents of records between two files. This section will focus on comparing only specific field(s). The below sample file will be one of the two file inputs for examples in this section.

$ cat marks.txt
Dept    Name    Marks
ECE     Raj     53
ECE     Joel    72
EEE     Moi     68
CSE     Surya   81
EEE     Tia     59
ECE     Om      92
CSE     Amy     67

To start with, here's a single field comparison. The problem statement is to fetch all records from marks.txt if the first field matches any of the departments listed in dept.txt file.

$ cat dept.txt
CSE
ECE

$ # note that dept.txt is used to build the array keys first
$ awk 'NR==FNR{a[$1]; next} $1 in a' dept.txt marks.txt
ECE     Raj     53
ECE     Joel    72
CSE     Surya   81
ECE     Om      92
CSE     Amy     67

$ # if header is needed as well
$ awk 'NR==FNR{a[$1]; next} FNR==1 || $1 in a' dept.txt marks.txt
Dept    Name    Marks
ECE     Raj     53
ECE     Joel    72
CSE     Surya   81
ECE     Om      92
CSE     Amy     67

For multiple field comparison, you need to construct the key robustly. Simply concatenating field values can lead to false matches. For example, field values abc and 123 will wrongly match ab and c123. To avoid this, you may introduce some string between the field values, say "_" (if you know the field themselves cannot have this character) or FS (safer option). You could also allow awk to bail you out. If you use , symbol (not "," as a string) between field values, the value of special variable SUBSEP is inserted. SUBSEP has a default value of the non-printing character \034 which is usually not used as part of text files.

$ cat dept_name.txt
EEE Moi
CSE Amy
ECE Raj

$ # uses SUBSEP as separator between field values to construct the key
$ # note the use of parentheses for key testing
$ awk 'NR==FNR{a[$1,$2]; next} ($1,$2) in a' dept_name.txt marks.txt
ECE     Raj     53
EEE     Moi     68
CSE     Amy     67

In this example, one of the field is used for numerical comparison.

$ cat dept_mark.txt
ECE 70
EEE 65
CSE 80

$ # match Dept and minimum marks specified in dept_mark.txt
$ awk 'NR==FNR{d[$1]=$2; next}
       $1 in d && $3 >= d[$1]' dept_mark.txt marks.txt
ECE     Joel    72
EEE     Moi     68
CSE     Surya   81
ECE     Om      92

Here's an example of adding a new field.

$ cat role.txt
Raj class_rep
Amy sports_rep
Tia placement_rep

$ awk -v OFS='\t' 'NR==FNR{r[$1]=$2; next}
         {$(NF+1) = FNR==1 ? "Role" : r[$2]} 1' role.txt marks.txt
Dept    Name    Marks   Role
ECE     Raj     53      class_rep
ECE     Joel    72
EEE     Moi     68
CSE     Surya   81
EEE     Tia     59      placement_rep
ECE     Om      92
CSE     Amy     67      sports_rep

getline

As the name indicates, getline function allows you to read a line from a file on demand. This is most useful when you need something based on line number. The following example shows how you can replace mth line from a file with nth line from another file. There are many syntax variations with getline, here the line read is saved in a variable.

$ # return value handling is not shown here, but should be done ideally
$ awk -v m=3 -v n=2 'BEGIN{while(n-- > 0) getline s < "greeting.txt"}
                     FNR==m{$0=s} 1' table.txt
brown bread mat hair 42
blue cake mug shirt -7
Have a nice day

Here's an example where two files are processed simultaneously. In this case, the return value of getline is also used. It will be 1 if line was read successfully, 0 if there's no more input to be read as end of file has already been reached and -1 if something went wrong. The ERRNO special variable will have details regarding the issues.

$ # print line from greeting.txt if last column of corresponding line
$ # from table.txt is +ve number
$ awk -v file='table.txt' '(getline line < file)==1{n=split(line, a);
                           if(a[n]>0) print}' greeting.txt
Hi there
Good bye

If a file is passed as argument to awk command and cannot be opened, you get an error. For example:

$ awk '{print $2}' xyz.txt
awk: fatal: cannot open file `xyz.txt' for reading: No such file or directory

It is recommended to always check for return value when using getline or perhaps use techniques from previous sections to avoid getline altogether.

$ # xyz.txt doesn't exist, but output doesn't show something went wrong
$ awk '{getline line < "xyz.txt"; print $NF, line}' table.txt
42 
-7 
3.14 

$ awk -v file='xyz.txt' '{ e=(getline line < file);
                           if(e<0){print file ": " ERRNO; exit}
                           print $NF, line }' table.txt
xyz.txt: No such file or directory

info See gawk manual: getline for details, especially about corner cases and errors. See also awk.freeshell: getline caveats.

Summary

This chapter discussed a few cases where you need to compare contents of two files. The NR==FNR trick is handy for such cases. The getline function is helpful for line number based comparisons.

Next chapter will discuss how to handle duplicate contents.

Exercises

a) Use contents of match_words.txt file to display matching lines from jumbled.txt and sample.txt. The matching criteria is that the second word of lines from these files should match the third word of lines from match_words.txt.

$ cat match_words.txt
%whole(Hello)--{doubt}==ado==
just,\joint*,concession<=nice

$ # 'concession' is one of the third words from 'match_words.txt'
$ # and second word from 'jumbled.txt'
$ awk ##### add your solution here
wavering:concession/woof\retailer
No doubt you like it too

b) Interleave contents of secrets.txt with the contents of a file passed via -v option as shown below.

$ awk -v f='table.txt' ##### add your solution here
stag area row tick
brown bread mat hair 42
---
deaf chi rate tall glad
blue cake mug shirt -7
---
Bi tac toe - 42
yellow banana window shoes 3.14
---

c) The file search_terms.txt contains one search string per line (these have no regexp metacharacters). Construct an awk command that reads this file and displays search terms (matched case insensitively) that were found in all of the other file arguments. Note that these terms should be matched with any part of the line, not just whole words.

$ cat search_terms.txt
hello
row
you
is
at

$ awk ##### add your solution here
##file list## search_terms.txt jumbled.txt mixed_fs.txt secrets.txt table.txt
at
row
$ awk ##### add your solution here
##file list## search_terms.txt addr.txt sample.txt
is
you
hello