Two file processing
This chapter focuses on solving problems which depend upon the contents of two or more files. These are usually based on comparing records and fields. Sometimes, record number plays a role too. You'll also see some examples where the entire file content is used.
The example_files directory has all the files used in the examples.
Comparing records
Consider the following input files, which will be compared line by line to get the common and unique lines.
$ cat colors_1.txt
teal
light blue
green
yellow
$ cat colors_2.txt
light blue
black
dark green
yellow
If you do not wish to use modules, you can make use of a hash to compare the records.
# common lines
# same as: grep -Fxf colors_1.txt colors_2.txt
# for two file input, $#ARGV will be 0 only for the first file
# note that 'exists' isn't strictly necessary here
$ perl -ne 'if(!$#ARGV){$h{$_}=1; next}
print if exists $h{$_}' colors_1.txt colors_2.txt
light blue
yellow
# lines from colors_2.txt not present in colors_1.txt
# same as: grep -vFxf colors_1.txt colors_2.txt
$ perl -ne 'if(!$#ARGV){$h{$_}=1; next}
print if !exists $h{$_}' colors_1.txt colors_2.txt
black
dark green
# reversing the order of input files gives
# lines from colors_1.txt not present in colors_2.txt
$ perl -ne 'if(!$#ARGV){$h{$_}=1; next}
print if !exists $h{$_}' colors_2.txt colors_1.txt
teal
green
Here are some alternate ways to construct a solution for the above examples.
# using if-else instead of next
$ perl -ne 'if(!$#ARGV){ $h{$_}=1 }
else{ print if exists $h{$_} }' colors_1.txt colors_2.txt
light blue
yellow
# read all lines from the first file passed as STDIN in the BEGIN block
$ perl -ne 'BEGIN{ $h{$_}=1 while <STDIN> }
print if exists $h{$_}' <colors_1.txt colors_2.txt
light blue
yellow
Using modules for set operations
You can use the uniq function from the List::Util module to preserve only one copy of duplicates from one or more input files. See the Dealing with duplicates chapter for field based duplicate processing.
# input order of lines is preserved
# this is same as performing union between two sets
$ perl -MList::Util=uniq -e 'print uniq <>' colors_1.txt colors_2.txt
teal
light blue
green
yellow
black
dark green
The metacpan: List::Compare module supports set operations like union, intersection, symmetric difference etc. See also metacpan: Array::Utils.
# union, input order of lines is NOT preserved
# note that only the -e option is used and one of the files is passed as stdin
$ perl -MList::Compare -e '@a1=<STDIN>; @a2=<>;
print List::Compare->new(\@a1, \@a2)->get_union
' <colors_1.txt colors_2.txt
black
dark green
green
light blue
teal
yellow
# intersection (common lines)
$ perl -MList::Compare -e '@a1=<STDIN>; @a2=<>;
print List::Compare->new(\@a1, \@a2)->get_intersection
' <colors_1.txt colors_2.txt
light blue
yellow
# lines from colors_1.txt not present in colors_2.txt
$ perl -MList::Compare -e '@a1=<STDIN>; @a2=<>;
print List::Compare->new(\@a1, \@a2)->get_unique
' <colors_1.txt colors_2.txt
green
teal
Comparing fields
In the previous sections, you saw how to compare the contents of whole records between two files. This section will focus on comparing only specific fields. The below sample file will be one of the two file inputs for examples in this section. Consider whitespace as the field separator, so the -a option is enough to get the fields.
$ cat marks.txt
Dept Name Marks
ECE Raj 53
ECE Joel 72
EEE Moi 68
CSE Surya 81
EEE Tia 59
ECE Om 92
CSE Amy 67
To start with, here's an example with a single field comparison. The problem statement is to fetch all records from marks.txt if the first field matches any of the departments listed in the dept.txt file.
$ cat dept.txt
CSE
ECE
$ perl -ane 'if(!$#ARGV){ $h{$F[0]}=1 }
else{ print if exists $h{$F[0]} }' dept.txt marks.txt
ECE Raj 53
ECE Joel 72
CSE Surya 81
ECE Om 92
CSE Amy 67
For multiple field comparison, you can use comma separated values to construct the hash keys. The special variable $; (whose default is \034) will be used to join these values. The \034 character is usually not present in text files. If you cannot guarantee the absence of this character, you can use some other character or use a hash of hashes. See also stackoverflow: using array as hash key.
$ cat dept_name.txt
EEE Moi
CSE Amy
ECE Raj
# don't use array slice as hash keys
$ perl -anE '$h{@F[0..1]}=1; say join ",", keys %h' dept_name.txt | cat -v
Moi
Moi,Amy
Moi,Raj,Amy
# default value of $; is \034, same as SUBSEP in awk
$ perl -anE '$h{$F[0],$F[1]}=1; say join ",", keys %h' dept_name.txt | cat -v
EEE^\Moi
CSE^\Amy,EEE^\Moi
ECE^\Raj,CSE^\Amy,EEE^\Moi
$ perl -ane 'if(!$#ARGV){ $h{$F[0],$F[1]}=1 }
else{ print if exists $h{$F[0],$F[1]} }' dept_name.txt marks.txt
ECE Raj 53
EEE Moi 68
CSE Amy 67
Here's an alternate solution with a hash of hashes. See also perldoc: REFERENCES.
$ perl -ane 'if(!$#ARGV){ $h{$F[0]}{$F[1]}=1 }
else{ print if exists $h{$F[0]}{$F[1]} }' dept_name.txt marks.txt
ECE Raj 53
EEE Moi 68
CSE Amy 67
In this example, one of the fields is used for numerical comparison.
$ cat dept_mark.txt
ECE 70
EEE 65
CSE 80
# match Dept and minimum marks specified in dept_mark.txt
$ perl -ane 'if(!$#ARGV){ $h{$F[0]}=$F[1] }
else{ print if exists $h{$F[0]} && $F[2]>=$h{$F[0]} }
' dept_mark.txt marks.txt
ECE Joel 72
EEE Moi 68
CSE Surya 81
ECE Om 92
Here's an example of adding a new field.
$ cat role.txt
Raj class_rep
Amy sports_rep
Tia placement_rep
# $.=0 is needed to allow header line comparison for the second file
$ perl -lane 'if(!$#ARGV){ $r{$F[0]}=$F[1]; $.=0 }
else{ print join "\t", @F, $.==1 ? "Role" : $r{$F[1]} }
' role.txt marks.txt
Dept Name Marks Role
ECE Raj 53 class_rep
ECE Joel 72
EEE Moi 68
CSE Surya 81
EEE Tia 59 placement_rep
ECE Om 92
CSE Amy 67 sports_rep
Based on line numbers
Here's an example that shows how you can replace the mth line from a file with the nth line from another file.
# replace 3rd line of table.txt with 2nd line of greeting.txt
$ perl -pe 'BEGIN{ $m=3; $n=2; $s = <STDIN> for 1..$n }
$_ = $s if $. == $m' <greeting.txt table.txt
brown bread mat hair 42
blue cake mug shirt -7
Have a nice day
Here's an example where two files are processed simultaneously.
# print line from greeting.txt if the last column of corresponding line
# from table.txt is a positive number
$ perl -ne 'print if (split " ", <STDIN>)[-1] > 0' <table.txt greeting.txt
Hi there
Good bye
Multiline fixed string substitution
You can use file slurping for fixed string multiline search and replace requirements. The below example is substituting complete lines. The solution will work for partial lines as well, provided there is no newline character at the end of search.txt and repl.txt files.
$ head -n2 table.txt > search.txt
$ cat repl.txt
2$1$&3
wise ice go goa
$ perl -0777 -ne '$#ARGV==1 ? $s=$_ : $#ARGV==0 ? $r=$_ :
print s/\Q$s/$r/gr' search.txt repl.txt table.txt
2$1$&3
wise ice go goa
yellow banana window shoes 3.14
Don't save the contents of search.txt and repl.txt in shell variables for passing them to the Perl script. Trailing newlines and ASCII NUL characters will cause issues. See stackoverflow: pitfalls of reading file into shell variable for details.
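Here's a rough illustration of the trailing newline issue, using a hypothetical file s.txt that ends with two newline characters. The length seen by Perl when slurping the file is compared with the length of a shell variable filled through command substitution.

```shell
cd "$(mktemp -d)"    # scratch directory, so no existing files are touched
printf 'mat hair 42\n\n' > s.txt

# Perl slurps the file as is, including both newline characters
file_len=$(perl -0777 -ne 'print length' s.txt)

# command substitution strips all the trailing newline characters
s=$(cat s.txt)
var_len=${#s}

printf '%s %s\n' "$file_len" "$var_len"
```

The two lengths are 13 and 11 respectively, so the shell variable no longer holds the same content as the file.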
Add file content conditionally
Case 1: replace each matching line with the entire contents of STDIN.
# same as: sed -e '/[ot]/{r dept.txt' -e 'd}' greeting.txt
$ perl -pe 'BEGIN{$r = join "", <STDIN>} $_=$r if /[ot]/' <dept.txt greeting.txt
CSE
ECE
Have a nice day
CSE
ECE
Case 2: insert the entire contents of STDIN before each matching line.
# same as: sed '/nice/e cat dept.txt' greeting.txt
$ perl -pe 'BEGIN{$r = join "", <STDIN>}
print $r if /nice/' <dept.txt greeting.txt
Hi there
CSE
ECE
Have a nice day
Good bye
Case 3: append the entire contents of STDIN after each matching line.
# same as: sed '/nice/r dept.txt' greeting.txt
$ perl -pe 'BEGIN{$r = join "", <STDIN>}
$_ .= $r if /nice/' <dept.txt greeting.txt
Hi there
Have a nice day
CSE
ECE
Good bye
Summary
This chapter discussed use cases where you need to process the contents of two or more files based on entire record/file or fields. The value of $#ARGV is handy for such cases (formula is n-2 to match the first file passed among n input files). The next chapter discusses more such examples, based solely on occurrences of duplicate values.
Exercises
The exercises directory has all the files used in this section.
1) Use the contents of the match_words.txt file to display matching lines from jumbled.txt and sample.txt. The matching criterion is that the second word of lines from these files should match the third word of lines from match_words.txt.
$ cat match_words.txt
%whole(Hello)--{doubt}==ado==
just,\joint*,concession<=nice
# 'concession' is one of the third words from 'match_words.txt'
# and second word from 'jumbled.txt'
##### add your solution here
wavering:concession/woof\retailer
No doubt you like it too
2) Interleave the contents of secrets.txt with the contents of a file passed as stdin in the format as shown below.
##### add your solution here, use 'table.txt' for stdin data
stag area row tick
brown bread mat hair 42
---
deaf chi rate tall glad
blue cake mug shirt -7
---
Bi tac toe - 42
yellow banana window shoes 3.14
3) The file search_terms.txt contains one search string per line, and these terms have no regexp metacharacters. Construct a solution that reads this file and displays the search terms (matched case insensitively) that were found in every file passed as the arguments after search_terms.txt. Note that these terms should be matched anywhere in the line (so, don't use word boundaries).
$ cat search_terms.txt
hello
row
you
is
at
# ip: search_terms.txt jumbled.txt mixed_fs.txt secrets.txt table.txt oops.txt
##### add your solution here
row
at
# ip: search_terms.txt ip.txt sample.txt oops.txt
##### add your solution here
hello
you
is
4) Replace the third to fifth lines of the input file ip.txt with the second to fourth lines from the file para.txt.
##### add your solution here
Hello World
How are you
Start working on that
project you always wanted
to, do not let it end
You are funny
5) Insert one line from jumbled.txt before every two lines of copyright.txt.
##### add your solution here
overcoats;furrowing-typeface%pewter##hobby
bla bla 2015 bla
blah 2018 blah
wavering:concession/woof\retailer
bla bla bla
copyright: 2020
6) Use the entire contents of match.txt to search error.txt and replace matching portions with the contents of jumbled.txt. Partial lines should NOT be matched.
$ cat match.txt
print this
but not that
$ cat error.txt
print this
but not that or this
print this
but not that
if print this
but not that
print this
but not that
##### add your solution here
print this
but not that or this
overcoats;furrowing-typeface%pewter##hobby
wavering:concession/woof\retailer
joint[]seer{intuition}titanic
if print this
but not that
overcoats;furrowing-typeface%pewter##hobby
wavering:concession/woof\retailer
joint[]seer{intuition}titanic
7) Display lines from scores.csv by matching the first field based on a list of names from the names.txt file. Also, change the output field separator to a space character.
$ cat names.txt
Lin
Cy
Ith
##### add your solution here
Lin 78 83 80
Cy 97 98 95
Ith 100 100 100
8) The result.csv file has three columns: name, subject and mark. The criteria.txt file has two columns: name and subject. Match lines from result.csv based on the two columns from criteria.txt provided the mark column is greater than 80.
$ cat result.csv
Amy,maths,89
Amy,physics,75
Joe,maths,79
John,chemistry,77
John,physics,91
Moe,maths,81
Ravi,physics,84
Ravi,chemistry,70
Yui,maths,92
$ cat criteria.txt
Amy maths
John chemistry
John physics
Ravi chemistry
Yui maths
##### add your solution here
Amy,maths,89
John,physics,91
Yui,maths,92
9) Insert the contents of hex.txt before a line matching cake of the input file table.txt.
##### add your solution here
brown bread mat hair 42
start: 0xA0, func1: 0xA0
end: 0xFF, func2: 0xB0
restart: 0xA010, func3: 0x7F
blue cake mug shirt -7
yellow banana window shoes 3.14
10) For the input file ip.txt, replace lines containing are with the contents of hex.txt.
##### add your solution here
Hello World
start: 0xA0, func1: 0xA0
end: 0xFF, func2: 0xB0
restart: 0xA010, func3: 0x7F
This game is good
Today is sunny
12345
start: 0xA0, func1: 0xA0
end: 0xFF, func2: 0xB0
restart: 0xA010, func3: 0x7F