CLI tip 26: removing duplicate lines with GNU awk
awk '!a[$0]++'
is one of the most famous Awk one-liners. It eliminates line-based duplicates while retaining the input order. The following example shows it in action, along with an illustration of how the logic works.
$ cat purchases.txt
coffee
tea
washing powder
coffee
toothpaste
tea
soap
tea
$ awk '{print +a[$0] "\t" $0; a[$0]++}' purchases.txt
0 coffee
0 tea
0 washing powder
1 coffee
0 toothpaste
1 tea
0 soap
2 tea
# only those entries with zero in the first column will be retained
$ awk '!a[$0]++' purchases.txt
coffee
tea
washing powder
toothpaste
soap
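As an aside, dropping the ! inverts the selection. a[$0]++ is true only when a line has already been seen, so this prints just the repeated occurrences:
$ awk 'a[$0]++' purchases.txt
coffee
tea
tea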
Removing field-based duplicates is simple for single-field comparison. Just change $0
to the required field number after setting the appropriate field separator.
$ cat duplicates.txt
brown,toy,bread,42
dark red,ruby,rose,111
blue,ruby,water,333
dark red,sky,rose,555
yellow,toy,flower,333
white,sky,bread,111
light red,purse,rose,333
# based on the last field
$ awk -F, '!seen[$NF]++' duplicates.txt
brown,toy,bread,42
dark red,ruby,rose,111
blue,ruby,water,333
dark red,sky,rose,555
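The above retains the first copy among the duplicates. If you want the last copy instead, a common workaround is to reverse the input with tac, remove the duplicates and reverse again. Here's a sketch of that pipeline:
# based on the last field, retaining the last occurrence
$ tac duplicates.txt | awk -F, '!seen[$NF]++' | tac
brown,toy,bread,42
dark red,sky,rose,555
white,sky,bread,111
light red,purse,rose,333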
For multiple-field comparison, separate the field numbers with a comma inside the array subscript. Awk then uses SUBSEP to combine the field values into a single key. SUBSEP has a default value of \034, a non-printing character that is rarely used in text files.
# based on the first and third fields
$ awk -F, '!seen[$1,$3]++' duplicates.txt
brown,toy,bread,42
dark red,ruby,rose,111
blue,ruby,water,333
yellow,toy,flower,333
white,sky,bread,111
light red,purse,rose,333
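If you're curious how the combined key looks, here's a sketch that counts occurrences per $1,$3 pair and substitutes the otherwise invisible SUBSEP character with | before printing (cnt is just an illustrative array name; only the key seen more than once shows up):
$ awk -F, '{cnt[$1,$3]++} END {for (k in cnt) if (cnt[k] > 1) {gsub(SUBSEP, "|", k); print cnt[k], k}}' duplicates.txt
2 dark red|rose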
huniq is a faster alternative for removing line-based duplicates.
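Assuming you have huniq installed, the basic usage reads lines from stdin and writes the deduplicated lines to stdout, same as the awk filter:
$ huniq < purchases.txt
coffee
tea
washing powder
toothpaste
soap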
See also my CLI text processing with GNU awk ebook.