Dealing with duplicates
Often, you need to eliminate duplicates from an input file, based either on the entire line content or on certain fields. Such tasks are typically solved with the sort and uniq commands. Advantages of awk include regexp based field and record separators, no requirement for sorted input, and in general more flexibility because it is a programming language.
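For instance, a regexp field separator makes it easy to deduplicate input where the delimiter varies. Here is a small hypothetical sketch (the input is made up for illustration and uses the dedup idiom explained below):
$ # hypothetical input where the delimiter is either ',' or ':'
$ printf 'a,1\nb:2\na:1\n' | awk -F'[,:]' '!seen[$2]++'
a,1
b:2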
Whole line duplicates
awk '!a[$0]++' is one of the most famous awk one-liners. It eliminates line based duplicates while retaining the input order. The following example shows it in action, along with an illustration of how the logic works.
$ cat purchases.txt
coffee
tea
washing powder
coffee
toothpaste
tea
soap
tea
$ awk '{print +a[$0] "\t" $0; a[$0]++}' purchases.txt
0 coffee
0 tea
0 washing powder
1 coffee
0 toothpaste
1 tea
0 soap
2 tea
$ # only those entries with zero in first column will be retained
$ awk '!a[$0]++' purchases.txt
coffee
tea
washing powder
toothpaste
soap
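For reference, here is a more verbose sketch of the same logic, with the condition and the bookkeeping written out separately. The output is identical to the one-liner above.
$ # print a line only if its key hasn't been seen before, then mark it as seen
$ awk '{if (!($0 in a)) print; a[$0]=1}' purchases.txt
coffee
tea
washing powder
toothpaste
soap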
Column-wise duplicates
Removing duplicates based on a single field is simple. Just change $0 to the required field number after setting the appropriate field separator.
$ cat duplicates.txt
brown,toy,bread,42
dark red,ruby,rose,111
blue,ruby,water,333
dark red,sky,rose,555
yellow,toy,flower,333
white,sky,bread,111
light red,purse,rose,333
$ # based on last field
$ awk -F, '!seen[$NF]++' duplicates.txt
brown,toy,bread,42
dark red,ruby,rose,111
blue,ruby,water,333
dark red,sky,rose,555
For multiple field comparisons, separate the field numbers with , so that SUBSEP is used to combine the field values into the array key. As mentioned before, SUBSEP has a default value of the \034 non-printing character, which is typically not used in text files.
$ # based on first and third field
$ awk -F, '!seen[$1,$3]++' duplicates.txt
brown,toy,bread,42
dark red,ruby,rose,111
blue,ruby,water,333
yellow,toy,flower,333
white,sky,bread,111
light red,purse,rose,333
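If you'd rather not rely on SUBSEP, a common alternative is to join the fields with an explicit separator string. This is only a sketch and assumes the chosen separator (here _) never occurs within the compared fields:
$ # same comparison with an explicit separator in the key
$ awk -F, '!seen[$1 "_" $3]++' duplicates.txt
brown,toy,bread,42
dark red,ruby,rose,111
blue,ruby,water,333
yellow,toy,flower,333
white,sky,bread,111
light red,purse,rose,333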
Duplicate count
In this section, the number of times a record occurs plays a role in determining the output.
First up, printing only a specific numbered occurrence of duplicates.
$ # print only the second occurrence of duplicates based on 2nd field
$ awk -F, '++seen[$2]==2' duplicates.txt
blue,ruby,water,333
yellow,toy,flower,333
white,sky,bread,111
$ # print only the third occurrence of duplicates based on last field
$ awk -F, '++seen[$NF]==3' duplicates.txt
light red,purse,rose,333
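As a related variation not covered above, changing == to <= keeps everything up to a given occurrence count. For example, to keep at most the first two copies based on the last field:
$ # keep at most the first two occurrences based on the last field
$ awk -F, '++seen[$NF]<=2' duplicates.txt
brown,toy,bread,42
dark red,ruby,rose,111
blue,ruby,water,333
dark red,sky,rose,555
yellow,toy,flower,333
white,sky,bread,111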
Next, printing only the last copy of duplicates. Since the number of occurrences isn't known in advance, the tac command comes in handy again.
$ # reverse the input line-wise, retain first copy and then reverse again
$ tac duplicates.txt | awk -F, '!seen[$NF]++' | tac
brown,toy,bread,42
dark red,sky,rose,555
white,sky,bread,111
light red,purse,rose,333
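If tac isn't available, a possible awk-only sketch is to record the line number of the last occurrence of each key in a first pass and print only those lines in a second pass. This relies on the two file processing trick discussed next:
$ # first pass notes the last line number per key, second pass prints only it
$ awk -F, 'NR==FNR{last[$NF]=FNR; next} FNR==last[$NF]' duplicates.txt duplicates.txt
brown,toy,bread,42
dark red,sky,rose,555
white,sky,bread,111
light red,purse,rose,333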
To filter records based on how many times they are duplicated, you can pass the input file twice and then use the two file processing trick to make decisions.
$ # all duplicates based on last column
$ awk -F, 'NR==FNR{a[$NF]++; next} a[$NF]>1' duplicates.txt duplicates.txt
dark red,ruby,rose,111
blue,ruby,water,333
yellow,toy,flower,333
white,sky,bread,111
light red,purse,rose,333
$ # all duplicates based on last column, minimum 3 duplicates
$ awk -F, 'NR==FNR{a[$NF]++; next} a[$NF]>2' duplicates.txt duplicates.txt
blue,ruby,water,333
yellow,toy,flower,333
light red,purse,rose,333
$ # only unique lines based on 3rd column
$ awk -F, 'NR==FNR{a[$3]++; next} a[$3]==1' duplicates.txt duplicates.txt
blue,ruby,water,333
yellow,toy,flower,333
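Another possible variation, shown here only as a sketch, is to print just one representative line per duplicated value by combining the count from the first pass with the dedup idiom:
$ # one line per duplicated value in the last column
$ awk -F, 'NR==FNR{a[$NF]++; next} a[$NF]>1 && !b[$NF]++' duplicates.txt duplicates.txt
dark red,ruby,rose,111
blue,ruby,water,333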
Summary
This chapter showed how to work with duplicate content, both record and field based. If you don't need regexp based separators and your input is too big for awk's in-memory arrays, then the specialized command line tools sort and uniq will be better suited.
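For example, whole line duplicates could be handled with sort alone, at the cost of losing the input order:
$ # whole line duplicates with sort (input order is not preserved)
$ sort -u purchases.txt
coffee
soap
tea
toothpaste
washing powder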
The next chapter will show how to write awk scripts instead of the usual one-liners.
Exercises
a) Retain only the first copy of each line for the input file lines.txt. Case should be ignored while comparing the lines. For example, hi there and HI TheRE should be considered as duplicates.
$ cat lines.txt
Go There
come on
go there
---
2 apples and 5 mangoes
come on!
---
2 Apples
COME ON
$ awk ##### add your solution here
Go There
come on
---
2 apples and 5 mangoes
come on!
2 Apples
b) Retain only the first copy of each line for the input file twos.txt. Assume space as the field separator, with two fields on each line. Compare the lines irrespective of the order of the fields. For example, hehe haha and haha hehe should be considered as duplicates.
$ cat twos.txt
hehe haha
door floor
haha hehe
6;8 3-4
true blue
hehe bebe
floor door
3-4 6;8
tru eblue
haha hehe
$ awk ##### add your solution here
hehe haha
door floor
6;8 3-4
true blue
hehe bebe
tru eblue
c) For the input file twos.txt, create a file uniq.txt with all the unique lines and a file dupl.txt with all the duplicate lines. Assume space as the field separator, with two fields on each line. Compare the lines irrespective of the order of the fields. For example, hehe haha and haha hehe should be considered as duplicates.
$ awk ##### add your solution here
$ cat uniq.txt
true blue
hehe bebe
tru eblue
$ cat dupl.txt
hehe haha
door floor
haha hehe
6;8 3-4
floor door
3-4 6;8
haha hehe