Two file processing

This chapter focuses on solving problems that depend on the contents of two or more files. These are usually based on comparing records and fields. Sometimes the record number plays a role, and some cases need the entire file content. You'll also see examples with the -r command line option and the gets method.

Comparing records

Consider the following input files, which will be compared line by line in this section.

$ cat color_list1.txt
teal
light blue
green
yellow

$ cat color_list2.txt
light blue
black
dark green
yellow

The -r command line option lets you specify a library required by the script. The Set class is handy for two file processing cases. See ruby-doc: Set for documentation.

$ # common lines
$ # same as: grep -Fxf color_list1.txt color_list2.txt
$ # ARGV.size==1 will be true only for the first file (for two file input)
$ ruby -rset -ne 'BEGIN{s=Set.new}; (s.add($_); next) if ARGV.size==1;
                  print if s.include?($_)' color_list1.txt color_list2.txt
light blue
yellow

$ # lines from color_list2.txt not present in color_list1.txt
$ # same as: grep -vFxf color_list1.txt color_list2.txt
$ ruby -rset -ne 'BEGIN{s=Set.new}; (s.add($_); next) if ARGV.size==1;
                  print if !s.include?($_)' color_list1.txt color_list2.txt
black
dark green

$ # reversing the order of input files gives
$ # lines from color_list1.txt not present in color_list2.txt
$ ruby -rset -ne 'BEGIN{s=Set.new}; (s.add($_); next) if ARGV.size==1;
                  print if !s.include?($_)' color_list2.txt color_list1.txt
teal
green
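
To see why ARGV.size==1 identifies the first of two files, here is a rough script-level sketch of the loop that -n builds around the one-liner. It assumes that ARGF removes a file name from ARGV at the moment it opens that file, which is the behavior the trick relies on. The temp_file_with helper and the temporary files are made up for illustration so that the sketch doesn't depend on the sample files.

```ruby
require 'set'
require 'tempfile'

# Hypothetical helper for this sketch: store text in a temporary file
def temp_file_with(text)
  file = Tempfile.new('two_file_demo')
  file.write(text)
  file.close
  file
end

list1 = temp_file_with("teal\nlight blue\ngreen\nyellow\n")
list2 = temp_file_with("light blue\nblack\ndark green\nyellow\n")

# Simulate: ruby -rset -ne '...' color_list1.txt color_list2.txt
ARGV.replace([list1.path, list2.path])

seen = Set.new
common = []
sizes = []   # value of ARGV.size observed for each input line

while (line = ARGF.gets)
  sizes << ARGV.size
  if ARGV.size == 1
    seen.add(line)        # still reading the first file
  elsif seen.include?(line)
    common << line        # second file: keep lines already seen
  end
end

puts common
p sizes
```

The sizes array should hold 1 for every line of the first file and 0 for every line of the second, since each file name leaves ARGV as soon as its file is opened.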

Alternatively, you can store the contents of the two input files as arrays and use set operations between them.

$ # common lines, output order is based on array to the left of & operator
$ # note that only -e option is used and one of the files is passed as stdin
$ ruby -e 'f1=STDIN.readlines; f2=readlines;
           puts f1 & f2' <color_list1.txt color_list2.txt
light blue
yellow

$ # lines from color_list1.txt not present in color_list2.txt
$ ruby -e 'f1=STDIN.readlines; f2=readlines;
           puts f1 - f2' <color_list1.txt color_list2.txt
teal
green

$ # union of the two files, same as f1 | f2 if read as separate arrays
$ ruby -e 'puts readlines.uniq' color_list1.txt color_list2.txt
teal
light blue
green
yellow
black
dark green
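
The array operators differ slightly in how they treat repeated lines, which matters if your input files can contain duplicates. Here's a small sketch with made-up inline data (newlines left out for brevity) instead of the sample files.

```ruby
# Inline stand-ins for lines read from two files
f1 = %w[teal blue teal green blue]
f2 = %w[blue black]

common  = f1 & f2   # intersection: unique items, ordered as in f1
only_f1 = f1 - f2   # difference: duplicates in f1 are kept
union   = f1 | f2   # union: unique items, f1 first, then new items from f2

p common    # ["blue"]
p only_f1   # ["teal", "teal", "green"]
p union     # ["teal", "blue", "green", "black"]
```

So the & version prints a repeated common line only once, whereas the Set based one-liner shown earlier prints every matching line of the second file.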

Comparing fields

In the previous section, you saw how to compare the whole contents of records between two files. This section focuses on comparing only specific field(s). The sample file below will be one of the two file inputs for examples in this section. Consider whitespace as the field separator, so the -a option is enough to get the fields.

$ cat marks.txt
Dept    Name    Marks
ECE     Raj     53
ECE     Joel    72
EEE     Moi     68
CSE     Surya   81
EEE     Tia     59
ECE     Om      92
CSE     Amy     67

To start with, here's a single field comparison. The problem statement is to fetch all the records from marks.txt whose first field matches any of the departments listed in the dept.txt file.

$ cat dept.txt
CSE
ECE

$ # note that dept.txt is used to build the set first
$ ruby -rset -ane 'BEGIN{s=Set.new}; (s.add($F[0]); next) if ARGV.size==1;
                   print if s.include?($F[0])' dept.txt marks.txt
ECE     Raj     53
ECE     Joel    72
CSE     Surya   81
ECE     Om      92
CSE     Amy     67

For multiple field comparison, use a subset of the fields array for comparison.

$ cat dept_name.txt
EEE Moi
CSE Amy
ECE Raj

$ ruby -rset -ane 'BEGIN{s=Set.new}; (s.add($F); next) if ARGV.size==1;
                   print if s.include?($F[0..1])' dept_name.txt marks.txt
ECE     Raj     53
EEE     Moi     68
CSE     Amy     67
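
Storing arrays in a Set works because Ruby arrays are compared and hashed by their content, so two separately created arrays with the same fields count as the same member. The sketch below uses inline strings in place of the two files, and split stands in for what the -a option does.

```ruby
require 'set'

# Build the lookup set from lines shaped like dept_name.txt
pairs = Set.new
["EEE Moi", "CSE Amy", "ECE Raj"].each { |line| pairs.add(line.split) }

# Lines shaped like marks.txt
marks = ["ECE     Raj     53", "ECE     Joel    72", "EEE     Moi     68"]

# Keep lines whose first two fields form a pair present in the set
matched = marks.select { |line| pairs.include?(line.split[0..1]) }
puts matched
```

Note that the one-liner adds the whole $F array for the first file, which matches only because dept_name.txt has exactly two fields. If that file had extra columns, adding $F[0..1] on both sides would probably be the safer choice.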

In this example, one of the fields is used for numerical comparison. A Hash is needed here instead of a Set, since each department has to be mapped to its minimum marks.

$ cat dept_mark.txt
ECE 70
EEE 65
CSE 80

$ # match Dept and minimum marks specified in dept_mark.txt
$ # since the marks are consistently 2-digits, string comparison is enough
$ # otherwise, you need to convert to numbers before comparison
$ ruby -ane 'BEGIN{d={}}; (d[$F[0]]=$F[1]; next) if ARGV.size==1;
             print if d.key?($F[0]) && $F[2] >= d[$F[0]]' dept_mark.txt marks.txt
ECE     Joel    72
EEE     Moi     68
CSE     Surya   81
ECE     Om      92
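
Here's a short sketch of what can go wrong when the marks are not all the same width, and how converting with to_i avoids it. The meets_minimum? helper is a made-up name for illustration, and the three-digit and one-digit marks are hypothetical values that don't appear in marks.txt.

```ruby
# Minimum marks per department, stored as strings like $F[1] in the one-liner
min_marks = { "ECE" => "70", "EEE" => "65", "CSE" => "80" }

# Hypothetical helper: numeric version of the condition used above
def meets_minimum?(table, dept, marks)
  table.key?(dept) && marks.to_i >= table[dept].to_i
end

p "72" >= "70"    # true, same-width strings compare like numbers
p "100" >= "70"   # false, because "1" sorts before "7"
p "9" >= "70"     # true, because "9" sorts after "7"

p meets_minimum?(min_marks, "ECE", "100")   # true
p meets_minimum?(min_marks, "ECE", "9")     # false
```

In the one-liner, this would mean writing $F[2].to_i >= d[$F[0]].to_i (or converting while building the hash).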

Here's an example of adding a new field.

$ # adds a new grade column based on marks in 3rd column
$ ruby -lane 'BEGIN{g = %w[D C B A S]};
              $F.append($.==1 ? "Grade" : g[$F[-1].to_i/10 - 5]);
              print $F * "\t"' marks.txt
Dept    Name    Marks   Grade
ECE     Raj     53      D
ECE     Joel    72      B
EEE     Moi     68      C
CSE     Surya   81      A
EEE     Tia     59      D
ECE     Om      92      S
CSE     Amy     67      C
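
The index arithmetic works for the marks in this file, all of which lie between 50 and 99. The sketch below shows what the same expression does outside that range, along with one possible guarded variant. The grade_guarded name and the "F" label are assumptions made for illustration, since the source defines no failing grade.

```ruby
GRADES = %w[D C B A S].freeze

# Same arithmetic as the one-liner: 50-59 maps to index 0, 90-99 to index 4
def grade(marks)
  GRADES[marks.to_i / 10 - 5]
end

# Hypothetical variant that guards both ends of the range.
# The "F" label is an assumption, not something defined by the source.
def grade_guarded(marks)
  m = marks.to_i
  return "F" if m < 50
  GRADES[[m / 10 - 5, GRADES.size - 1].min]
end

p grade("53")    # "D"
p grade("45")    # "S", index -1 wraps around to the last element
p grade("100")   # nil, index 5 is past the end of the array

p grade_guarded("45")    # "F"
p grade_guarded("100")   # "S"
```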

gets

The gets (or readline) method allows you to read a record from a file on demand. This is most useful when you need something based on the record number. The following example shows how you can replace the mth line of a file with the nth line of another file.

$ ruby -pe 'BEGIN{m=3; n=2; n.times {$s = STDIN.gets}};
            $_ = $s if $. == m' <greeting.txt table.txt
brown bread mat hair 42
blue cake mug shirt -7
Have a nice day
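
The same idea can be sketched as a small script, with StringIO objects standing in for STDIN and the file argument. The replace_line helper is a made-up name, and unlike the one-liner it leaves the line unchanged if the other input has fewer than n lines.

```ruby
require 'stringio'

# Hypothetical helper: returns the lines of main_io with line m replaced
# by line n of other_io (both counted from 1)
def replace_line(main_io, other_io, m, n)
  replacement = nil
  n.times { replacement = other_io.gets }   # the last read is the nth line
  main_io.each_line.with_index(1).map do |line, num|
    num == m && replacement ? replacement : line
  end
end

greeting = StringIO.new("Hi there\nHave a nice day\nGood bye\n")
table = StringIO.new("brown bread mat hair 42\nblue cake mug shirt -7\n" \
                     "yellow banana window shoes 3.14\n")

result = replace_line(table, greeting, 3, 2)
puts result
```

Each call to gets consumes one line, so calling it n times and keeping only the last result is what selects the nth line.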

Here's an example where two files are processed simultaneously. This doesn't implement error detection for difference in number of lines between the two files though.

$ # print line from greeting.txt if last column of corresponding line
$ # from table.txt is a positive number
$ # STDIN.gets will override $_ which is why $_ is saved to another variable
$ ruby -ne 'a=$_; n = STDIN.gets.split[-1].to_f;
            print a if n > 0' <table.txt greeting.txt
Hi there
Good bye
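
If you do want to detect a mismatch in the number of lines, one possible approach is to check for the nil that gets returns at end of input. This is only a sketch: positive_rows is a made-up helper, StringIO objects stand in for the two inputs, and raising ArgumentError is just one way to report the problem.

```ruby
require 'stringio'

# Hypothetical helper: keep lines of text_io whose partner line in number_io
# ends with a positive number, raising if the inputs differ in length
def positive_rows(text_io, number_io)
  kept = []
  text_io.each_line do |line|
    other = number_io.gets   # nil once number_io is exhausted
    raise ArgumentError, "second input ran out of lines" if other.nil?
    kept << line if other.split[-1].to_f > 0
  end
  raise ArgumentError, "second input has extra lines" unless number_io.eof?
  kept
end

greeting = StringIO.new("Hi there\nHave a nice day\nGood bye\n")
table = StringIO.new("brown bread mat hair 42\nblue cake mug shirt -7\n" \
                     "yellow banana window shoes 3.14\n")

rows = positive_rows(greeting, table)
puts rows
```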

Multiline fixed string substitution

You can use file slurping for fixed string multiline search and replace requirements. Both the sub and gsub methods match a fixed string if the first argument is a string instead of a regexp object. Since \ is special in the replacement section, you'll have to escape it to provide a fixed string.

The example below substitutes complete lines. The solution will work for partial lines as well, provided there is no newline character at the end of the search.txt and repl.txt files.

$ head -n2 table.txt > search.txt
$ cat repl.txt
2$1$&3\0x\\yz
wise ice go goa

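$ # "\\\\\\\\" is a string of four backslashes, which gsub inserts as two,
$ # so every \ from repl.txt is doubled before the outer gsub reads it
$ ruby -0777 -ne 'ARGV.size==2 ? s=$_ : ARGV.size==1 ? r=$_ :
                  print(gsub(s, r.gsub(/\\/, "\\\\\\\\")))
                 ' search.txt repl.txt table.txt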
2$1$&3\0x\\yz
wise ice go goa
yellow banana window shoes 3.14
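
To see why the backslashes need attention, here's a sketch that compares an unescaped replacement with two ways of getting a literal one. It uses inline strings in place of the three files. The block form of gsub is an alternative not used in the one-liner above: the value returned by the block is inserted as is.

```ruby
table = "brown bread mat hair 42\nblue cake mug shirt -7\n" \
        "yellow banana window shoes 3.14\n"
search = "brown bread mat hair 42\nblue cake mug shirt -7\n"

# Raw heredoc: backslashes are stored exactly as typed, like file content
repl = <<~'EOS'
  2$1$&3\0x\\yz
  wise ice go goa
EOS

# Unescaped: gsub expands \0 to the matched text and \\ to a single \
naive = table.gsub(search, repl)

# Escaped: double every backslash first (a block's result is inserted as is)
escaped = table.gsub(search, repl.gsub("\\") { "\\\\" })

# Block form: the replacement is never scanned for backslash sequences
literal = table.gsub(search) { repl }

expected = repl + "yellow banana window shoes 3.14\n"
p naive == expected     # false
p escaped == expected   # true
p literal == expected   # true
```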

Warning: Don't save the contents of search.txt and repl.txt in shell variables for passing them to the ruby script. Trailing newlines and ASCII NUL characters will cause issues. See stackoverflow: pitfalls of reading file into shell variable for details.

Add file content conditionally

Case 1: replace each matching line with entire contents of STDIN.

$ # same as: sed -e '/[ot]/{r dept.txt' -e 'd}' greeting.txt
$ ruby -pe 'BEGIN{r = STDIN.read}; $_ = r if /[ot]/' <dept.txt greeting.txt
CSE
ECE
Have a nice day
CSE
ECE

Case 2: insert entire contents of STDIN before each matching line.

$ # same as: sed '/nice/e cat dept.txt' greeting.txt
$ ruby -pe 'BEGIN{r = STDIN.read}; print r if /nice/' <dept.txt greeting.txt
Hi there
CSE
ECE
Have a nice day
Good bye

Case 3: append entire contents of STDIN after each matching line.

$ # same as: sed '/nice/r dept.txt' greeting.txt
$ ruby -pe 'BEGIN{r = STDIN.read}; $_ << r if /nice/' <dept.txt greeting.txt
Hi there
Have a nice day
CSE
ECE
Good bye
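
The three cases differ only in how the matching line and the extra content are combined. Here's a side by side sketch, using an inline array for greeting.txt and a string for the content that the one-liners read with STDIN.read.

```ruby
greeting = ["Hi there\n", "Have a nice day\n", "Good bye\n"]
extra = "CSE\nECE\n"   # stands in for the content read with STDIN.read

# Case 1: a matching line is replaced by the extra content
replaced = greeting.map { |line| line.match?(/[ot]/) ? extra : line }.join

# Case 2: the extra content goes before a matching line
before = greeting.map { |line| line.match?(/nice/) ? extra + line : line }.join

# Case 3: the extra content goes after a matching line
after = greeting.map { |line| line.match?(/nice/) ? line + extra : line }.join

puts replaced, "---", before, "---", after
```

If the extra content doesn't end with a newline, the line that follows it would be joined to its last line in Cases 1 and 3, so you may want to add the newline yourself in such cases.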

Summary

This chapter discussed use cases where you need to process the contents of two or more files based on entire records, entire files or specific field(s). The ARGV.size==1 trick is handy for such cases (in general, the number to compare against is n-1 to match the first file among n input files). The gets method is helpful for record number based comparisons.

Exercises

a) Use the contents of the match_words.txt file to display matching lines from jumbled.txt and sample.txt. The matching criterion is that the second word of lines from these files should match the third word of lines from match_words.txt.

$ cat match_words.txt
%whole(Hello)--{doubt}==ado==
just,\joint*,concession<=nice

$ # 'concession' is one of the third words from 'match_words.txt'
$ # and second word from 'jumbled.txt'
##### add your solution here
wavering:concession/woof\retailer
No doubt you like it too

b) Interleave contents of secrets.txt with the contents of a file passed as stdin in the format as shown below.

##### add your solution here, use 'table.txt' as stdin
stag area row tick
brown bread mat hair 42
---
deaf chi rate tall glad
blue cake mug shirt -7
---
Bi tac toe - 42
yellow banana window shoes 3.14

c) The file search_terms.txt contains one search string per line (these have no regexp metacharacters). Construct a solution that reads this file and displays search terms (matched case insensitively) that were found in all of the other input file arguments. Note that these terms should be matched with any part of the line, not just whole words.

$ cat search_terms.txt
hello
row
you
is
at

$ # ip: search_terms.txt jumbled.txt mixed_fs.txt secrets.txt table.txt oops.txt
##### add your solution here
row
at

$ # ip: search_terms.txt ip.txt sample.txt oops.txt
##### add your solution here
hello
you
is

d) For the input file ip.txt, print all lines that contain 'are' and the line that comes after such a line, if any. Use the gets method to construct the solution.

$ # note that there shouldn't be an empty line at the end of the output
##### add your solution here
How are you
This game is good
You are funny

Bonus: Will grep -A1 'is' ip.txt give identical results for your solution with 'is' as the search term? If not, why?

e) Replace the third to fifth lines of the input file ip.txt with the second to fourth lines from the file para.txt.

##### add your solution here
Hello World
How are you
Start working on that
project you always wanted
to, do not let it end
You are funny

f) Insert one line from jumbled.txt before every two lines of idx.txt

##### add your solution here
overcoats;furrowing-typeface%pewter##hobby
match after the last newline character
and then you want to test
wavering:concession/woof\retailer
this is good bye then
you were there to see?

g) Use entire contents of match.txt to search error.txt and replace with contents of jumbled.txt. Partial lines should NOT be matched.

$ cat match.txt
print+this
but not that
$ cat error.txt
print+this
but not that or this
print+this
but not that
if print+this
but not that
print+this
but not that

##### add your solution here
print+this
but not that or this
overcoats;furrowing-typeface%pewter##hobby
wavering:concession/woof\retailer
if print+this
but not that
overcoats;furrowing-typeface%pewter##hobby
wavering:concession/woof\retailer