Two file processing

This chapter focuses on problems that depend on the contents of two or more files. These are usually solved by comparing records or fields. Sometimes the record number plays a role, and some cases need the entire file content. You'll also see examples that use the -r command line option and the gets method.

info The example_files directory has all the files used in the examples.

Comparing records

Consider the following input files, which will be compared line by line in this section.

$ cat colors_1.txt
teal
light blue
green
yellow

$ cat colors_2.txt
light blue
black
dark green
yellow

The -r command line option loads the named library before the script runs, so -rset has the same effect as require 'set'. The Set class is handy for two file processing cases. See ruby-doc: Set for documentation.

# common lines
# same as: grep -Fxf colors_1.txt colors_2.txt
# ARGV.size==1 will be true only for the first file (for two file input)
$ ruby -rset -ne 'BEGIN{s=Set.new}; (s.add($_); next) if ARGV.size==1;
                  print if s.include?($_)' colors_1.txt colors_2.txt
light blue
yellow

# lines from colors_2.txt not present in colors_1.txt
# same as: grep -vFxf colors_1.txt colors_2.txt
$ ruby -rset -ne 'BEGIN{s=Set.new}; (s.add($_); next) if ARGV.size==1;
                  print if !s.include?($_)' colors_1.txt colors_2.txt
black
dark green

# reversing the order of input files gives
# lines from colors_1.txt not present in colors_2.txt
$ ruby -rset -ne 'BEGIN{s=Set.new}; (s.add($_); next) if ARGV.size==1;
                  print if !s.include?($_)' colors_2.txt colors_1.txt
teal
green
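
If the ARGV.size==1 idiom feels terse, here's a minimal script-style sketch of the same logic. It uses inline arrays instead of files so that it can run on its own. The filter_by_first method and the constant names are made up for illustration.

```ruby
require 'set'

# Hypothetical helper: mimics the one-liners above using arrays of lines.
# 'first' plays the role of the file that is read while ARGV.size==1
def filter_by_first(first, second, keep_common: true)
  seen = Set.new(first)   # same effect as s.add($_) for every line
  second.select { |line| seen.include?(line) == keep_common }
end

# Inline stand-ins for colors_1.txt and colors_2.txt
COLORS_1 = ["teal\n", "light blue\n", "green\n", "yellow\n"]
COLORS_2 = ["light blue\n", "black\n", "dark green\n", "yellow\n"]

# common lines
print filter_by_first(COLORS_1, COLORS_2).join
# lines from the second array not present in the first
print filter_by_first(COLORS_1, COLORS_2, keep_common: false).join
```

Swapping the first two arguments corresponds to reversing the order of the input files on the command line.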

Alternatively, you can store the contents of the two input files as arrays and use set operations between them.

# common lines, output order is based on the array to the left of & operator
# note that only the -e option is used and one of the files is passed as stdin
$ ruby -e 'f1=STDIN.readlines; f2=readlines;
           puts f1 & f2' <colors_1.txt colors_2.txt
light blue
yellow

# lines from colors_1.txt not present in colors_2.txt
$ ruby -e 'f1=STDIN.readlines; f2=readlines;
           puts f1 - f2' <colors_1.txt colors_2.txt
teal
green

# union of the two files, same as f1 | f2 if read as separate arrays
$ ruby -e 'puts readlines.uniq' colors_1.txt colors_2.txt
teal
light blue
green
yellow
black
dark green
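
The array operators used above can be tried out without any input files. The sketch below uses inline arrays as stand-ins for the two files. The array_set_ops method is a hypothetical helper, not something provided by Ruby.

```ruby
# Inline stand-ins for the chomped lines of colors_1.txt and colors_2.txt
SET_A = ["teal", "light blue", "green", "yellow"]
SET_B = ["light blue", "black", "dark green", "yellow"]

# Hypothetical helper collecting the three operations in one place
def array_set_ops(f1, f2)
  { common: f1 & f2, only_first: f1 - f2, union: f1 | f2 }
end

ops = array_set_ops(SET_A, SET_B)
p ops[:common]                          # ["light blue", "yellow"]
p ops[:only_first]                      # ["teal", "green"]
p ops[:union] == (SET_A + SET_B).uniq   # true

# duplicates: & and | return unique elements,
# whereas - keeps repeated elements from the left array
p ["a", "a", "b"] & ["a"]   # ["a"]
p ["a", "a", "b"] - ["b"]   # ["a", "a"]
```

The difference in duplicate handling matters if your input files can have repeated lines.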

Comparing fields

In the previous section, you saw how to compare whole records between two files. This section focuses on comparing only specific fields. The sample file below will be one of the two input files for the examples in this section. Whitespace is the field separator here, so the -a option alone is enough to split the fields.

$ cat marks.txt
Dept    Name    Marks
ECE     Raj     53
ECE     Joel    72
EEE     Moi     68
CSE     Surya   81
EEE     Tia     59
ECE     Om      92
CSE     Amy     67

To start with, here's an example with a single field comparison. The problem statement is to fetch all records from marks.txt if the first field matches any of the departments listed in the dept.txt file.

$ cat dept.txt
CSE
ECE

# note that dept.txt is used to build the set first
$ ruby -rset -ane 'BEGIN{s=Set.new}; (s.add($F[0]); next) if ARGV.size==1;
                   print if s.include?($F[0])' dept.txt marks.txt
ECE     Raj     53
ECE     Joel    72
CSE     Surya   81
ECE     Om      92
CSE     Amy     67

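For multiple field comparison, use a slice of the $F array as the key. Here, each line of dept_name.txt has exactly two fields, so the whole $F array is stored in the set and $F[0..1] is used for the lookup.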

$ cat dept_name.txt
EEE Moi
CSE Amy
ECE Raj

$ ruby -rset -ane 'BEGIN{s=Set.new}; (s.add($F); next) if ARGV.size==1;
                   print if s.include?($F[0..1])' dept_name.txt marks.txt
ECE     Raj     53
EEE     Moi     68
CSE     Amy     67
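
The lookup above works because arrays with the same elements are treated as equal set members. Here's a small sketch of that idea, with inline data standing in for the two files. The match_fields method and its count parameter are hypothetical names.

```ruby
require 'set'

# Hypothetical helper: keep lines from 'data_lines' whose first 'count'
# fields match the fields of any line in 'key_lines'
def match_fields(key_lines, data_lines, count)
  keys = Set.new(key_lines.map(&:split))   # each member is an array of fields
  data_lines.select { |line| keys.include?(line.split[0, count]) }
end

# Inline stand-ins for dept_name.txt and part of marks.txt
DEPT_NAME = ["EEE Moi", "CSE Amy", "ECE Raj"]
MARKS = ["ECE     Raj     53", "ECE     Joel    72",
         "EEE     Moi     68", "CSE     Amy     67"]

puts match_fields(DEPT_NAME, MARKS, 2)
```

The output follows the order of the data lines, not the order of the keys, just like the one-liner.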

In this example, one of the fields is used for numerical comparison.

$ cat dept_mark.txt
ECE 70
EEE 65
CSE 80

# match Dept and minimum marks specified in dept_mark.txt
# since the marks are consistently 2-digits, string comparison is enough
# otherwise, you'd need to convert to numbers before comparison
$ ruby -ane 'BEGIN{d={}}; (d[$F[0]]=$F[1]; next) if ARGV.size==1;
             print if d.key?($F[0]) && $F[2] >= d[$F[0]]' dept_mark.txt marks.txt
ECE     Joel    72
EEE     Moi     68
CSE     Surya   81
ECE     Om      92
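
To see why conversion matters when the number of digits can vary, here's a minimal sketch. The meets_minimum? method and the row with 100 marks are made up for illustration and aren't part of the sample files.

```ruby
# Hypothetical helper: 'minimums' maps a department to its minimum mark,
# with both stored as strings (as they would be after field splitting)
def meets_minimum?(fields, minimums)
  dept, _name, mark = fields
  minimums.key?(dept) && mark.to_i >= minimums[dept].to_i
end

MINIMUMS = { "ECE" => "70", "EEE" => "65", "CSE" => "80" }

# string comparison goes character by character, so "1" < "7" decides it
p "100" >= "70"                                   # false
# numeric comparison after conversion
p meets_minimum?(%w[ECE Zed 100], MINIMUMS)       # true
p meets_minimum?(%w[ECE Raj 53], MINIMUMS)        # false
# the header line is skipped since "Dept" isn't a key
p meets_minimum?(%w[Dept Name Marks], MINIMUMS)   # false
```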

Here's an example of adding a new field based on the contents of another field.

$ ruby -lane 'BEGIN{g = %w[D C B A S]};
              $F.append($.==1 ? "Grade" : g[$F[-1].to_i/10 - 5]);
              print $F * "\t"' marks.txt
Dept    Name    Marks   Grade
ECE     Raj     53      D
ECE     Joel    72      B
EEE     Moi     68      C
CSE     Surya   81      A
EEE     Tia     59      D
ECE     Om      92      S
CSE     Amy     67      C
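
The grade lookup relies on integer division to turn marks into an array index. The sketch below isolates that step in a hypothetical grade method, which also shows what happens for marks outside the 50 to 99 range used in marks.txt.

```ruby
GRADES = %w[D C B A S]

# Hypothetical helper: same index calculation as the one-liner above
def grade(marks)
  GRADES[marks.to_i / 10 - 5]
end

p grade("53")    # "D"
p grade("92")    # "S"
# outside the 50 to 99 range, the lookup is no longer meaningful
p grade("45")    # "S" (index -1 counts from the end of the array)
p grade("100")   # nil (index 5 is past the end of the array)
```

If your data can have such values, you'd want an explicit range check before the lookup.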

gets

The gets method (or readline) lets you read a record from a file on demand. This is most useful when the solution depends on the record number. The following example shows how you can replace the mth line of a file with the nth line of another file.

# replace the 3rd line of table.txt with the 2nd line of greeting.txt
$ ruby -pe 'BEGIN{m=3; n=2; n.times {$s = STDIN.gets}};
            $_ = $s if $. == m' <greeting.txt table.txt
brown bread mat hair 42
blue cake mug shirt -7
Have a nice day
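
Here's a script-style sketch of the same idea, using StringIO objects in place of the two files so that it runs on its own. The replace_line method is a hypothetical helper.

```ruby
require 'stringio'

# Hypothetical helper: 'target' and 'source' are IO-like objects
def replace_line(target, source, m, n)
  nth = nil
  n.times { nth = source.gets }   # after the loop, nth holds line n of source
  # nth will be nil if source has fewer than n lines
  target.each_line.with_index(1).map do |line, num|
    num == m ? nth : line
  end
end

table = StringIO.new("brown bread mat hair 42\n" \
                     "blue cake mug shirt -7\n" \
                     "yellow banana window shoes 3.14\n")
greeting = StringIO.new("Hi there\nHave a nice day\nGood bye\n")

# replace the 3rd line of table with the 2nd line of greeting
print replace_line(table, greeting, 3, 2).join
```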

Here's an example where two files are processed simultaneously. Note that it doesn't detect a difference in the number of lines between the two files. If stdin runs out of lines first, STDIN.gets returns nil and the split call raises NoMethodError.

# print line from greeting.txt if the last column of the corresponding line
# from table.txt is a positive number
# STDIN.gets will override $_ which is why $_ is saved to another variable
$ ruby -ne 'a=$_; n = STDIN.gets.split[-1].to_f;
            print a if n > 0' <table.txt greeting.txt
Hi there
Good bye
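
One possible way to cope with files of different lengths is to stop as soon as gets returns nil. The sketch below assumes that stopping early is acceptable for your use case. The lines_with_positive method is a hypothetical helper and StringIO objects stand in for the files.

```ruby
require 'stringio'

# Hypothetical helper: 'main' and 'other' are IO-like objects
def lines_with_positive(main, other)
  result = []
  main.each_line do |line|
    paired = other.gets      # nil once 'other' runs out of lines
    break if paired.nil?     # stop instead of raising NoMethodError
    result << line if paired.split[-1].to_f > 0
  end
  result
end

greeting = StringIO.new("Hi there\nHave a nice day\nGood bye\n")
table = StringIO.new("brown bread mat hair 42\n" \
                     "blue cake mug shirt -7\n" \
                     "yellow banana window shoes 3.14\n")

print lines_with_positive(greeting, table).join
```

Depending on the requirement, you might prefer to raise an error or print a warning instead of stopping silently.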

Multiline fixed string substitution

You can use file slurping for fixed string multiline search and replace requirements. Both the sub and gsub methods match a fixed string if the first argument is a string instead of a regexp object. Since \ is special in the replacement string, use the block form to supply the replacement literally.

The example below substitutes complete lines. The solution also works for partial lines, provided there is no newline character at the end of the search.txt and repl.txt files.

$ head -n2 table.txt > search.txt
$ cat repl.txt
2$1$&3\0x\\yz
wise ice go goa

$ ruby -0777 -ne 'ARGV.size==2 ? s=$_ : ARGV.size==1 ? r=$_ :
                  print(gsub(s) {r})
                 ' search.txt repl.txt table.txt
2$1$&3\0x\\yz
wise ice go goa
yellow banana window shoes 3.14
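
The difference between the string form and the block form of the replacement is easier to see with a short input. In this sketch, replace_fixed is a hypothetical wrapper around the block form used above.

```ruby
# Hypothetical helper: fixed string search, literal replacement
def replace_fixed(text, search, repl)
  text.gsub(search) { repl }
end

text = "a.b.c"

# string replacement: \0 is expanded to the matched text
puts text.gsub(".", '[\0]')             # a[.]b[.]c
# block form: the replacement is inserted as is
puts replace_fixed(text, ".", '[\0]')   # a[\0]b[\0]c
```

Note that the . in the search string matched only a literal dot character in both cases, since a string was passed as the first argument.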

warning Don't save the contents of search.txt and repl.txt in shell variables for passing them to the Ruby script. Trailing newlines and ASCII NUL characters will cause issues. See stackoverflow: pitfalls of reading file into shell variable for details.

Add file content conditionally

Case 1: replace each matching line with the entire contents of STDIN.

# same as: sed -e '/[ot]/{r dept.txt' -e 'd}' greeting.txt
$ ruby -pe 'BEGIN{r = STDIN.read}; $_ = r if /[ot]/' <dept.txt greeting.txt
CSE
ECE
Have a nice day
CSE
ECE

Case 2: insert the entire contents of STDIN before each matching line.

# same as: sed '/nice/e cat dept.txt' greeting.txt
$ ruby -pe 'BEGIN{r = STDIN.read}; print r if /nice/' <dept.txt greeting.txt
Hi there
CSE
ECE
Have a nice day
Good bye

Case 3: append the entire contents of STDIN after each matching line.

# same as: sed '/nice/r dept.txt' greeting.txt
$ ruby -pe 'BEGIN{r = STDIN.read}; $_ << r if /nice/' <dept.txt greeting.txt
Hi there
Have a nice day
CSE
ECE
Good bye
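
The three cases differ only in what each matching line is turned into. Here's a sketch that puts them side by side, with inline data standing in for greeting.txt and dept.txt. The merge_content method and its mode names are made up for illustration.

```ruby
# Hypothetical helper: 'extra' is the content to add for matching lines
def merge_content(lines, extra, pattern, mode)
  lines.flat_map do |line|
    next [line] unless line.match?(pattern)
    case mode
    when :replace then [extra]         # Case 1
    when :before  then [extra, line]   # Case 2
    when :after   then [line, extra]   # Case 3
    else [line]                        # unknown mode: leave the line as is
    end
  end
end

# Inline stand-ins for greeting.txt and dept.txt
GREETING_LINES = ["Hi there\n", "Have a nice day\n", "Good bye\n"]
DEPT_CONTENT = "CSE\nECE\n"

print merge_content(GREETING_LINES, DEPT_CONTENT, /nice/, :before).join
```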

Summary

This chapter discussed use cases where you need to process the contents of two or more files based on entire record/file or field(s). The ARGV.size==1 trick is handy for such cases (where the number is n-1 to match the first file passed among n input files). The gets method is helpful for record number based comparisons.

Exercises

info The exercises directory has all the files used in this section.

1) Use the contents of the match_words.txt file to display matching lines from jumbled.txt and sample.txt. The matching criterion is that the second word of lines from these files should match the third word of lines from match_words.txt.

$ cat match_words.txt
%whole(Hello)--{doubt}==ado==
just,\joint*,concession<=nice

# 'concession' is one of the third words from 'match_words.txt'
# and second word from 'jumbled.txt'
##### add your solution here
wavering:concession/woof\retailer
No doubt you like it too

2) Interleave the contents of secrets.txt with the contents of a file passed as stdin in the format as shown below.

##### add your solution here, use 'table.txt' as stdin
stag area row tick
brown bread mat hair 42
---
deaf chi rate tall glad
blue cake mug shirt -7
---
Bi tac toe - 42
yellow banana window shoes 3.14

3) The file search_terms.txt contains one search string per line, and these terms have no regexp metacharacters. Construct a solution that reads this file and displays the search terms (matched case insensitively) that were found in every file passed as the arguments after search_terms.txt. Note that these terms should be matched anywhere in the line (so, don't use word boundaries).

$ cat search_terms.txt
hello
row
you
is
at

# ip: search_terms.txt jumbled.txt mixed_fs.txt secrets.txt table.txt oops.txt
##### add your solution here
row
at

# ip: search_terms.txt ip.txt sample.txt oops.txt
##### add your solution here
hello
you
is

4) For the input file ip.txt, print all lines that contain are and the line that comes after such a line, if any. Use the gets method to construct the solution.

# note that there shouldn't be an empty line at the end of the output
##### add your solution here
How are you
This game is good
You are funny

Bonus: Will grep -A1 'is' ip.txt give identical results for your solution with is as the search term? If not, why?

5) Replace the third to fifth lines of the input file ip.txt with the second to fourth lines from the file para.txt.

##### add your solution here
Hello World
How are you
Start working on that
project you always wanted
to, do not let it end
You are funny

6) Insert one line from jumbled.txt before every two lines of copyright.txt.

##### add your solution here
overcoats;furrowing-typeface%pewter##hobby
bla bla 2015 bla
blah 2018 blah
wavering:concession/woof\retailer
bla bla bla
copyright: 2018

7) Use the entire contents of match.txt to search error.txt and replace matching portions with the contents of jumbled.txt. Partial lines should NOT be matched.

$ cat match.txt
print+this
but not that
$ cat error.txt
print+this
but not that or this
print+this
but not that
if print+this
but not that
print+this
but not that

##### add your solution here
print+this
but not that or this
overcoats;furrowing-typeface%pewter##hobby
wavering:concession/woof\retailer
joint[]seer{intuition}titanic
if print+this
but not that
overcoats;furrowing-typeface%pewter##hobby
wavering:concession/woof\retailer
joint[]seer{intuition}titanic

8) Display lines from scores.csv by matching the first field based on a list of names from the names.txt file. Also, change the output field separator to a space character.

$ cat names.txt
Lin
Cy
Ith

##### add your solution here
Lin 78 83 80
Cy 97 98 95
Ith 100 100 100

9) The result.csv file has three columns: name, subject and mark. The criteria.txt file has two columns: name and subject. Display lines from result.csv that match both columns from criteria.txt, provided the mark column is greater than 80.

$ cat result.csv
Amy,maths,89
Amy,physics,75
Joe,maths,79
John,chemistry,77
John,physics,91
Moe,maths,81
Ravi,physics,84
Ravi,chemistry,70
Yui,maths,92

$ cat criteria.txt
Amy maths
John chemistry
John physics
Ravi chemistry
Yui maths

##### add your solution here
Amy,maths,89
John,physics,91
Yui,maths,92

10) Insert the contents of hex.txt before the line matching cake in table.txt.

##### add your solution here
brown bread mat hair 42
start: 0xA0, func1: 0xA0
end: 0xFF, func2: 0xB0
restart: 0xA010, func3: 0x7F
blue cake mug shirt -7
yellow banana window shoes 3.14

11) For the input file ip.txt, replace lines containing are with the contents of hex.txt.

##### add your solution here
Hello World
start: 0xA0, func1: 0xA0
end: 0xFF, func2: 0xB0
restart: 0xA010, func3: 0x7F
This game is good
Today is sunny
12345
start: 0xA0, func1: 0xA0
end: 0xFF, func2: 0xB0
restart: 0xA010, func3: 0x7F