Dealing with duplicates
Often, you need to eliminate duplicates from an input file. This could be based on the entire line content or on certain fields. Such tasks are typically solved with the sort and uniq commands. Advantages with Ruby include a regexp based field separator, record separators other than newline, no need for sorted input and, in general, more flexibility because it is a programming language.
The example_files directory has all the files used in the examples.
Whole line duplicates
Using the uniq method on readlines is the easiest and most compact solution if memory isn't an issue.
$ cat purchases.txt
coffee
tea
washing powder
coffee
toothpaste
tea
soap
tea
# note that the -n or -p options aren't needed here
# this will work for multiple input files as well
$ ruby -e 'puts readlines.uniq' purchases.txt
coffee
tea
washing powder
toothpaste
soap
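If you also want to know how often each line repeats, the tally method can help. Here's a sketch (not part of the original examples) that is similar in spirit to piping sorted input to uniq -c, but preserves the input order:
# count of each line, in the order of first occurrence
$ ruby -e 'readlines.map(&:chomp).tally.each {|k,v| puts "#{v} #{k}"}' purchases.txt
2 coffee
3 tea
1 washing powder
1 toothpaste
1 soap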
If there are lots of duplicate lines and having the whole input file in an array can cause memory issues, then using a Set might help.
# add? returns nil if element already exists, else adds to the set
$ ruby -rset -ne 'BEGIN{s=Set.new}; print if s.add?($_)' purchases.txt
coffee
tea
washing powder
toothpaste
soap
See also huniq, a faster alternative for removing line based duplicates.
Column wise duplicates
The Set based solution is easy to adapt for removing field based duplicates. Just change $_ to the required fields after setting the appropriate field separator. With uniq, you can use blocks to specify the condition.
$ cat duplicates.txt
brown,toy,bread,42
dark red,ruby,rose,111
blue,ruby,water,333
dark red,sky,rose,555
yellow,toy,flower,333
white,sky,bread,111
light red,purse,rose,333
# based on the last field
# same as: ruby -e 'puts readlines.uniq {_1.split(",")[-1]}'
$ ruby -rset -F, -ane 'BEGIN{s=Set.new}; print if s.add?($F[-1])' duplicates.txt
brown,toy,bread,42
dark red,ruby,rose,111
blue,ruby,water,333
dark red,sky,rose,555
Here's an example based on multiple fields.
# based on the first and third fields
# same as: ruby -e 'puts readlines.uniq {_1.split(",").values_at(0,2)}'
$ ruby -rset -F, -ane 'BEGIN{s=Set.new};
print if s.add?($F.values_at(0,2))' duplicates.txt
brown,toy,bread,42
dark red,ruby,rose,111
blue,ruby,water,333
yellow,toy,flower,333
white,sky,bread,111
light red,purse,rose,333
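The field separator can be a regexp too, which is one of the advantages mentioned at the start of this chapter. Here's a sketch with made-up data (not from the example_files directory), where the separator is a colon surrounded by optional spaces:
# retain the first copy based on the first field
$ printf 'tea : 4\ntea:2\ncoffee : 5\n' |
  ruby -rset -F'\s*:\s*' -ane 'BEGIN{s=Set.new}; print if s.add?($F[0])'
tea : 4
coffee : 5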
Duplicate count
In this section, how many times a duplicate record is found plays a role in determining the output.
First up, printing only a specific numbered duplicate. As seen before, Hash.new(0) will initialize the value of a new key to 0.
# print only the second occurrence of duplicates based on the second field
$ ruby -F, -ane 'BEGIN{h=Hash.new(0)};
print if (h[$F[1]]+=1)==2' duplicates.txt
blue,ruby,water,333
yellow,toy,flower,333
white,sky,bread,111
# print only the third occurrence of duplicates based on the last field
$ ruby -F, -ane 'BEGIN{h=Hash.new(0)};
print if (h[$F[-1]]+=1)==3' duplicates.txt
light red,purse,rose,333
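A related variation (not part of the original examples) is to retain at most a certain number of copies instead of picking one specific occurrence. For example, keeping up to two copies per last-field value:
# retain a maximum of two copies of duplicates based on the last field
$ ruby -F, -ane 'BEGIN{h=Hash.new(0)};
print if (h[$F[-1]]+=1)<=2' duplicates.txt
brown,toy,bread,42
dark red,ruby,rose,111
blue,ruby,water,333
dark red,sky,rose,555
yellow,toy,flower,333
white,sky,bread,111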
Next, printing only the last copy of duplicates. Since the count isn't known, the tac command comes in handy again.
# reverse the input line-wise, retain the first copy and then reverse again
$ tac duplicates.txt | ruby -rset -F, -ane 'BEGIN{s=Set.new};
print if s.add?($F[-1])' | tac
brown,toy,bread,42
dark red,sky,rose,555
white,sky,bread,111
light red,purse,rose,333
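If you'd rather not depend on the tac command, here's a sketch that does the reversing within Ruby itself, assuming the input is small enough to be read into an array:
# same idea using readlines, without the tac command
$ ruby -e 'puts readlines.reverse.uniq {_1.split(",")[-1]}.reverse' duplicates.txt
brown,toy,bread,42
dark red,sky,rose,555
white,sky,bread,111
light red,purse,rose,333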
To get all the records based on a duplicate count, you can pass the input file twice. Then use the two file processing trick to make decisions.
# all duplicates based on the last column
$ ruby -F, -ane 'BEGIN{h=Hash.new(0)}; ARGV.size==1 ? h[$F[-1]]+=1 :
h[$F[-1]]>1 && print' duplicates.txt duplicates.txt
dark red,ruby,rose,111
blue,ruby,water,333
yellow,toy,flower,333
white,sky,bread,111
light red,purse,rose,333
# all duplicates based on the last column, minimum 3 duplicates
$ ruby -F, -ane 'BEGIN{h=Hash.new(0)}; ARGV.size==1 ? h[$F[-1]]+=1 :
h[$F[-1]]>2 && print' duplicates.txt duplicates.txt
blue,ruby,water,333
yellow,toy,flower,333
light red,purse,rose,333
# only unique lines based on the third column
$ ruby -F, -ane 'BEGIN{h=Hash.new(0)}; ARGV.size==1 ? h[$F[2]]+=1 :
h[$F[2]]==1 && print' duplicates.txt duplicates.txt
blue,ruby,water,333
yellow,toy,flower,333
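If the input fits in memory, a single-pass alternative to passing the file twice is to group the lines by the key and filter the groups. A sketch is shown below; note that the output is ordered by group, not by the original line order.
# all duplicates based on the last column, single pass
$ ruby -e 'readlines.group_by {_1.split(",")[-1]}.each_value {|v|
           puts v if v.size>1}' duplicates.txt
dark red,ruby,rose,111
white,sky,bread,111
blue,ruby,water,333
yellow,toy,flower,333
light red,purse,rose,333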
Summary
This chapter showed how to work with duplicates based on whole lines as well as specific fields. If you don't need regexp based separators and your input is too big to hold in memory, then specialized command line tools like sort, uniq and huniq will be better suited.
Exercises
The exercises directory has all the files used in this section.
1) Retain only the first copy of a line for the input file lines.txt. Case should be ignored while comparing the lines. For example, hi there and HI TheRE should be considered as duplicates.
$ cat lines.txt
Go There
come on
go there
---
2 apples and 5 mangoes
come on!
---
2 Apples
COME ON
##### add your solution here
Go There
come on
---
2 apples and 5 mangoes
come on!
2 Apples
2) Retain only the first copy of a line for the input file twos.txt. Assume space as the field separator with exactly two fields per line. Compare the lines irrespective of the order of the fields. For example, hehe haha and haha hehe should be considered as duplicates.
$ cat twos.txt
hehe haha
door floor
haha hehe
6;8 3-4
true blue
hehe bebe
floor door
3-4 6;8
tru eblue
haha hehe
##### add your solution here
hehe haha
door floor
6;8 3-4
true blue
hehe bebe
tru eblue
3) For the input file twos.txt, display only the unique lines. Assume space as the field separator with exactly two fields per line. Compare the lines irrespective of the order of the fields. For example, hehe haha and haha hehe should be considered as duplicates.
##### add your solution here
true blue
hehe bebe
tru eblue