Dealing with duplicates

Often, you need to eliminate duplicates from input file(s), based on the entire line content, particular field(s), etc. These tasks are typically solved with the sort and uniq commands. Advantages of using perl include regexp based field separators, record separators other than newline, input that doesn't have to be sorted, and in general more flexibility because it is a programming language.

Whole line duplicates

You can use the uniq function from the List::Util module or use a hash to retain only the first copy of duplicates from one or more input files.

$ cat purchases.txt
coffee
tea
washing powder
coffee
toothpaste
tea
soap
tea

$ # same as: perl -MList::Util=uniq -e 'print uniq <>' purchases.txt
$ # can also use: perl -ne 'print if !exists $h{$_}; $h{$_}=1'
$ perl -ne 'print if !$h{$_}++' purchases.txt
coffee
tea
washing powder
toothpaste
soap
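
Since %h isn't reset between files, the same solution removes duplicates across multiple input files too. As a quick illustration, passing purchases.txt twice still gives each line only once:

$ # %h is shared across all the input files
$ perl -ne 'print if !$h{$_}++' purchases.txt purchases.txt
coffee
tea
washing powder
toothpaste
soap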

Column wise duplicates

The hash based solution is easy to adapt for removing field based duplicates. Just change $_ to the required field(s) after setting the appropriate field separator.

$ cat duplicates.txt
brown,toy,bread,42
dark red,ruby,rose,111
blue,ruby,water,333
dark red,sky,rose,555
yellow,toy,flower,333
white,sky,bread,111
light red,purse,rose,333

$ # based on last field
$ # -l isn't needed if all the lines end with a newline character
$ perl -F, -ane 'print if !$h{$F[-1]}++' duplicates.txt
brown,toy,bread,42
dark red,ruby,rose,111
blue,ruby,water,333
dark red,sky,rose,555
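
Inverting the condition gives the complement, i.e. only the lines that were dropped above:

$ # print only the duplicate copies based on last field
$ perl -F, -ane 'print if $h{$F[-1]}++' duplicates.txt
yellow,toy,flower,333
white,sky,bread,111
light red,purse,rose,333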

Here's an example involving multiple fields. As seen in the Comparing fields section, you can either use comma separated values to construct the hash key or use a hash of hashes.

$ # based on first and third field
$ # can also use: perl -F, -ane 'print if !$h{$F[0]}{$F[2]}++'
$ perl -F, -ane 'print if !$h{$F[0],$F[2]}++' duplicates.txt
brown,toy,bread,42
dark red,ruby,rose,111
blue,ruby,water,333
yellow,toy,flower,333
white,sky,bread,111
light red,purse,rose,333
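
As an aside, the comma inside the hash subscript is Perl's multidimensional hash emulation: the given keys are joined with the $; variable (\034 by default) to form a single string key. So, the above solution is equivalent to the explicit join shown below. If a field value can itself contain the \034 character, the keys could collide, which the hash of hashes alternative avoids.

$ # explicit version of $h{$F[0],$F[2]}
$ perl -F, -ane 'print if !$h{join $;, $F[0], $F[2]}++' duplicates.txt
brown,toy,bread,42
dark red,ruby,rose,111
blue,ruby,water,333
yellow,toy,flower,333
white,sky,bread,111
light red,purse,rose,333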

Duplicate count

In this section, the number of times a record is found determines the output. First up, printing only a specific numbered occurrence of duplicates.

$ # print only the second occurrence of duplicates based on 2nd field
$ perl -F, -ane 'print if ++$h{$F[1]} == 2' duplicates.txt
blue,ruby,water,333
yellow,toy,flower,333
white,sky,bread,111

$ # print only the third occurrence of duplicates based on last field
$ perl -F, -ane 'print if ++$h{$F[-1]} == 3' duplicates.txt
light red,purse,rose,333
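
As a variation, using <= instead of == will retain duplicates only up to a particular count. For example, keeping at most two occurrences based on the last field:

$ # retain at most two occurrences of duplicates based on last field
$ perl -F, -ane 'print if ++$h{$F[-1]} <= 2' duplicates.txt
brown,toy,bread,42
dark red,ruby,rose,111
blue,ruby,water,333
dark red,sky,rose,555
yellow,toy,flower,333
white,sky,bread,111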

Next, printing only the last copy of duplicates. Since the count isn't known in advance, the tac command comes in handy again.

$ # reverse the input line-wise, retain first copy and then reverse again
$ tac duplicates.txt | perl -F, -ane 'print if !$h{$F[-1]}++' | tac
brown,toy,bread,42
dark red,sky,rose,555
white,sky,bread,111
light red,purse,rose,333
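
If tac isn't available, here's a perl-only sketch: record the last occurrence of each key, buffer the input lines and print the matching ones in the END block. Note that the entire input is kept in memory.

$ # note the last occurrence per key, then filter the buffered lines
$ perl -F, -ane '$last{$F[-1]} = $.; push @lines, [$., $F[-1], $_];
                 END{ print map {$$_[2]}
                            grep {$last{$$_[1]} == $$_[0]} @lines }' duplicates.txt
brown,toy,bread,42
dark red,sky,rose,555
white,sky,bread,111
light red,purse,rose,333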

To get all the records based on a duplicate count, you can pass the input file twice. Then use the two file processing tricks to make decisions: while the first file is being read, @ARGV still holds the second filename, so !$#ARGV is true only for the first pass ($#ARGV becomes -1 once the second file is being processed).

$ # all duplicates based on last column
$ perl -F, -ane '!$#ARGV ? $h{$F[-1]}++ :
                 $h{$F[-1]}>1 && print' duplicates.txt duplicates.txt
dark red,ruby,rose,111
blue,ruby,water,333
yellow,toy,flower,333
white,sky,bread,111
light red,purse,rose,333

$ # all duplicates based on last column, minimum 3 duplicates
$ perl -F, -ane '!$#ARGV ? $h{$F[-1]}++ :
                 $h{$F[-1]}>2 && print' duplicates.txt duplicates.txt
blue,ruby,water,333
yellow,toy,flower,333
light red,purse,rose,333

$ # only unique lines based on 3rd column
$ perl -F, -ane '!$#ARGV ? $h{$F[2]}++ :
                 $h{$F[2]}==1 && print' duplicates.txt duplicates.txt
blue,ruby,water,333
yellow,toy,flower,333
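
The same hash can also be used to report how many times each value occurs, similar to sort | uniq -c. Here's a minimal sketch for the last column:

$ # occurrence count for each value in the last column
$ perl -F, -lane '$h{$F[-1]}++; END{ print "$_: $h{$_}" for sort keys %h }' duplicates.txt
111: 2
333: 3
42: 1
555: 1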

Summary

This chapter showed how to work with duplicate contents, both record and field based. If you don't need regexp based separators and your input is too big to be held in memory, then the specialized command line tools sort and uniq will be better suited.

Exercises

a) Retain only the first copy of a line for the input file lines.txt. Case should be ignored while comparing the lines. For example, hi there and HI TheRE will be considered as duplicates.

$ cat lines.txt
Go There
come on
go there
---
2 apples and 5 mangoes
come on!
---
2 Apples
COME ON

##### add your solution here
Go There
come on
---
2 apples and 5 mangoes
come on!
2 Apples

b) Retain only the first copy of a line for the input file twos.txt. Assume space as the field separator, with two fields on each line. Compare the lines irrespective of the order of the fields. For example, hehe haha and haha hehe will be considered as duplicates.

$ cat twos.txt
hehe haha
door floor
haha hehe
6;8 3-4
true blue
hehe bebe
floor door
3-4 6;8
tru eblue
haha hehe

##### add your solution here
hehe haha
door floor
6;8 3-4
true blue
hehe bebe
tru eblue

c) For the input file twos.txt, display only the unique lines. Assume space as the field separator, with two fields on each line. Compare the lines irrespective of the order of the fields. For example, hehe haha and haha hehe will be considered as duplicates.

##### add your solution here
true blue
hehe bebe
tru eblue