# Two file processing

This chapter focuses on solving problems which depend upon the contents of two or more files. These are usually based on comparing records and fields. Sometimes, record number plays a role too. You'll also learn about the `getline` built-in function.

The example_files directory has all the files used in the examples.

## Comparing records

Consider the following input files which will be compared line wise to get the common and unique lines.

``````\$ cat colors_1.txt
teal
light blue
green
yellow
\$ cat colors_2.txt
light blue
black
dark green
yellow
``````

The key features used in the solution below:

• For two files as input, `NR==FNR` will be `true` only when the first file is being processed
• `next` will skip rest of the script and fetch the next record
• `a[\$0]` by itself is a valid statement. It will create an uninitialized element in array `a` with `\$0` as the key (assuming the key doesn't exist yet)
• `\$0 in a` checks if the given string (`\$0` here) exists as a key in the array `a`
``````# common lines
# same as: grep -Fxf colors_1.txt colors_2.txt
\$ awk 'NR==FNR{a[\$0]; next} \$0 in a' colors_1.txt colors_2.txt
light blue
yellow

# lines from colors_2.txt not present in colors_1.txt
# same as: grep -vFxf colors_1.txt colors_2.txt
\$ awk 'NR==FNR{a[\$0]; next} !(\$0 in a)' colors_1.txt colors_2.txt
black
dark green

# reversing the order of input files gives
# lines from colors_1.txt not present in colors_2.txt
\$ awk 'NR==FNR{a[\$0]; next} !(\$0 in a)' colors_2.txt colors_1.txt
teal
green
``````

Note that the `NR==FNR` logic will fail if the first file is empty, since `NR` wouldn't get a chance to increment. You can set a flag after the first file has been processed to avoid this issue. See this unix.stackexchange thread for more workarounds.

``````# no output
\$ awk 'NR==FNR{a[\$0]; next} !(\$0 in a)' /dev/null greeting.txt

# gives the expected output
\$ awk '!f{a[\$0]; next} !(\$0 in a)' /dev/null f=1 greeting.txt
Hi there
Have a nice day
Good bye
``````

## Comparing fields

In the previous section, you saw how to compare the contents of whole records between two files. This section will focus on comparing only specific fields. The below sample file will be one of the two file inputs for examples in this section.

``````\$ cat marks.txt
Dept    Name    Marks
ECE     Raj     53
ECE     Joel    72
EEE     Moi     68
CSE     Surya   81
EEE     Tia     59
ECE     Om      92
CSE     Amy     67
``````

To start with, here's a single field comparison. The problem statement is to fetch all records from `marks.txt` if the first field matches any of the departments listed in the `dept.txt` file.

``````\$ cat dept.txt
CSE
ECE

# note that dept.txt is used to build the array keys first
\$ awk 'NR==FNR{a[\$1]; next} \$1 in a' dept.txt marks.txt
ECE     Raj     53
ECE     Joel    72
CSE     Surya   81
ECE     Om      92
CSE     Amy     67

# if the header is needed as well
\$ awk 'NR==FNR{a[\$1]; next} FNR==1 || \$1 in a' dept.txt marks.txt
Dept    Name    Marks
ECE     Raj     53
ECE     Joel    72
CSE     Surya   81
ECE     Om      92
CSE     Amy     67
``````

For multiple field comparison, you need to construct the key robustly. Simply concatenating field values can lead to false matches. For example, field values `abc` and `123` will wrongly match `ab` and `c123`. To avoid this, you may introduce some string between the field values, say `"_"` (if you know the field themselves cannot have this character) or `FS` (safer option). You could also allow `awk` to bail you out. If you use the `,` symbol (not `","` as a string) between the field values, the value of the special variable `SUBSEP` is inserted. `SUBSEP` has a default value of the non-printing character `\034` which is usually not used as part of text files.

``````\$ cat dept_name.txt
EEE Moi
CSE Amy
ECE Raj

# uses SUBSEP as a separator between the field values to construct the key
# note the use of parentheses for key testing
\$ awk 'NR==FNR{a[\$1,\$2]; next} (\$1,\$2) in a' dept_name.txt marks.txt
ECE     Raj     53
EEE     Moi     68
CSE     Amy     67
``````

In this example, one of the field is used for numerical comparison.

``````\$ cat dept_mark.txt
ECE 70
EEE 65
CSE 80

# match Dept and minimum marks specified in dept_mark.txt
\$ awk 'NR==FNR{d[\$1]=\$2; next}
\$1 in d && \$3 >= d[\$1]' dept_mark.txt marks.txt
ECE     Joel    72
EEE     Moi     68
CSE     Surya   81
ECE     Om      92
``````

Here's an example of adding a new field.

``````\$ cat role.txt
Raj class_rep
Amy sports_rep
Tia placement_rep

\$ awk -v OFS='\t' 'NR==FNR{r[\$1]=\$2; next}
{\$(NF+1) = FNR==1 ? "Role" : r[\$2]} 1' role.txt marks.txt
Dept    Name    Marks   Role
ECE     Raj     53      class_rep
ECE     Joel    72
EEE     Moi     68
CSE     Surya   81
EEE     Tia     59      placement_rep
ECE     Om      92
CSE     Amy     67      sports_rep
``````

## getline

As the name indicates, the `getline` function allows you to read a line from a file on demand. This is easiest to use when you need something based on line numbers. The following example shows how you can replace the `m`th line from a file with the `n`th line from another file. There are many syntax variations with `getline`, here the line read is saved in a variable.

``````# return value handling is not shown here, but should be done ideally
\$ awk -v m=3 -v n=2 'BEGIN{while(n-- > 0) getline s < "greeting.txt"}
FNR==m{\$0=s} 1' table.txt
blue cake mug shirt -7
Have a nice day
``````

Here's an example where two files are processed simultaneously. In this case, the return value of `getline` is also used. It will be `1` if the line was read successfully, `0` if there's no more input to be read as end of file has already been reached and `-1` if something went wrong. The `ERRNO` special variable will have the error details.

``````# print line from greeting.txt if the last column of the corresponding line
# from table.txt is a positive number
\$ awk -v file='table.txt' '(getline line < file)==1{n=split(line, a);
if(a[n]>0) print}' greeting.txt
Hi there
Good bye
``````

If a file is passed as an argument to the `awk` command that cannot be opened, you get an error. For example:

``````\$ awk '{print \$2}' xyz.txt
awk: fatal: cannot open file 'xyz.txt' for reading: No such file or directory
``````

It is recommended to always check for the return value when using `getline` or perhaps use techniques from the previous sections to avoid `getline` altogether.

``````# xyz.txt doesn't exist, but output doesn't show something went wrong
\$ awk '{getline line < "xyz.txt"; print \$NF, line}' table.txt
42
-7
3.14

\$ awk -v file='xyz.txt' '{ e=(getline line < file);
if(e<0){print file ": " ERRNO; exit}
print \$NF, line }' table.txt
xyz.txt: No such file or directory
``````

See gawk manual: getline for details, especially about corner cases and errors. See also awk.freeshell: getline caveats.

## Summary

This chapter discussed a few cases where you need to compare contents between two files. The `NR==FNR` trick is handy for such cases. You also saw a few examples with the `getline` function.

Next chapter will discuss how to handle duplicate contents.

## Exercises

The exercises directory has all the files used in this section.

1) Use the contents of `match_words.txt` file to display matching lines from `jumbled.txt` and `sample.txt`. The matching criteria is that the second word of lines from these files should match the third word of lines from `match_words.txt`.

``````\$ cat match_words.txt
just,\joint*,concession<=nice

# 'concession' is one of the third words from 'match_words.txt'
# and second word from 'jumbled.txt'
wavering:concession/woof\retailer
No doubt you like it too
``````

2) Interleave the contents of `secrets.txt` with the contents of a file passed via the `-v` option as shown below.

``````\$ awk -v f='table.txt' ##### add your solution here
stag area row tick
---
blue cake mug shirt -7
---
Bi tac toe - 42
yellow banana window shoes 3.14
---
``````

3) The file `search_terms.txt` contains one search string per line, and these terms have no regexp metacharacters. Construct an `awk` command that reads this file and displays the search terms (matched case insensitively) that were found in every file passed as the arguments after `search_terms.txt`. Note that these terms should be matched anywhere in the line (so, don't use word boundaries).

``````\$ cat search_terms.txt
hello
row
you
is
at

##file list## search_terms.txt jumbled.txt mixed_fs.txt secrets.txt table.txt
at
row

is
you
hello
``````

4) Display lines from `scores.csv` by matching the first field based on a list of names from the `names.txt` file. Also, change the output field separator to a space character.

``````\$ cat names.txt
Lin
Cy
Ith

Lin 78 83 80
Cy 97 98 95
Ith 100 100 100
``````

5) What's the default value of the special variable `SUBSEP`? Where is it commonly used?

6) The `result.csv` file has three columns — name, subject and mark. The `criteria.txt` file has two columns — name and subject. Match lines from `result.csv` based on the two columns from `criteria.txt` provided the mark column is greater than 80.

``````\$ cat result.csv
Amy,maths,89
Amy,physics,75
Joe,maths,79
John,chemistry,77
John,physics,91
Moe,maths,81
Ravi,physics,84
Ravi,chemistry,70
Yui,maths,92

\$ cat criteria.txt
Amy maths
John chemistry
John physics
Ravi chemistry
Yui maths