uniq
The uniq command identifies duplicate lines that are adjacent to each other. There are various options to help you filter unique or duplicate lines, count them, group them, etc.
Retain single copy of duplicates
This is the default behavior of the uniq command. If adjacent lines are the same, only the first copy will be displayed in the output.
# only the adjacent lines are compared to determine duplicates
# which is why you get 'red' twice in the output for this input
$ printf 'red\nred\nred\ngreen\nred\nblue\nblue' | uniq
red
green
red
blue
You'll need sorted input to make sure all the input lines are considered when determining duplicates. For some cases, sort -u is enough, like the example shown below:
# same as sort -u for this case
$ printf 'red\nred\nred\ngreen\nred\nblue\nblue' | sort | uniq
blue
green
red
Sometimes though, you may need to sort based on some specific criteria and then identify duplicates based on the entire line contents. Here's an example:
# can't use sort -n -u here
$ printf '2 balls\n13 pens\n2 pins\n13 pens\n' | sort -n | uniq
2 balls
2 pins
13 pens
sort+uniq won't be suitable if you need to preserve the input order as well. You can use alternatives like awk, perl and huniq for such cases.
# retain only the first copy of duplicates, maintain input order
$ printf 'red\nred\nred\ngreen\nred\nblue\nblue' | awk '!seen[$0]++'
red
green
blue
Duplicates only
The -d option will display only the duplicate entries. That is, only if a line is seen more than once.
$ cat purchases.txt
coffee
tea
washing powder
coffee
toothpaste
tea
soap
tea
$ sort purchases.txt | uniq -d
coffee
tea
To display all the copies of duplicates, use the -D option.
$ sort purchases.txt | uniq -D
coffee
coffee
tea
tea
tea
Unique only
The -u option will display only the unique entries. That is, only if a line doesn't occur more than once.
$ sort purchases.txt | uniq -u
soap
toothpaste
washing powder
# reminder that uniq works based on adjacent lines only
$ printf 'red\nred\nred\ngreen\nred\nblue\nblue' | uniq -u
green
red
Grouping similar lines
The --group option allows you to visually separate groups of similar lines with an empty line. This option can accept four values: separate, prepend, append and both. The default is separate, which adds a newline character between the groups. prepend will add a newline before the first group as well and append will add a newline after the last group. both combines the prepend and append behaviors.
$ sort purchases.txt | uniq --group
coffee
coffee

soap

tea
tea
tea

toothpaste

washing powder
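Here's a small sketch of the both value, using made-up input instead of the sample file. An empty line is added before the first group and after the last group, in addition to the separators between groups:

```shell
$ printf 'red\nred\nblue\n' | uniq --group=both

red
red

blue

```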
The --group option cannot be used with the -c, -d, -D or -u options. The --all-repeated alias for the -D option uses none as the default grouping. You can change that to separate or prepend.
$ sort purchases.txt | uniq --all-repeated=prepend

coffee
coffee

tea
tea
tea
Prefix count
If you want to know how many times a line has been repeated, use the -c option. The count will be added as a prefix.
$ sort purchases.txt | uniq -c
      2 coffee
      1 soap
      3 tea
      1 toothpaste
      1 washing powder
$ sort purchases.txt | uniq -dc
      2 coffee
      3 tea
The output of this option is usually piped to sort for ordering the output based on the count.
$ sort purchases.txt | uniq -c | sort -n
      1 soap
      1 toothpaste
      1 washing powder
      2 coffee
      3 tea
$ sort purchases.txt | uniq -c | sort -nr
      3 tea
      2 coffee
      1 washing powder
      1 toothpaste
      1 soap
Ignoring case
Use the -i option to ignore case while determining duplicates.
# depending on your locale, sort and sort -f can give the same results
$ printf 'hat\nbat\nHAT\ncar\nbat\nmat\nmoat' | sort -f | uniq -iD
bat
bat
hat
HAT
Partial match
uniq has three options to change the matching criteria to partial parts of the input line. These aren't as powerful as the sort -k option, but they do come in handy for some use cases.
The -f option allows you to skip the first N fields. Field separation is based on one or more space/tab characters only. Note that these separators will still be part of the field contents, so this will not work with a variable number of blanks.
# skip the first field, works as expected since the no. of blanks is consistent
$ printf '2 cars\n5 cars\n10 jeeps\n5 jeeps\n3 trucks\n' | uniq -f1 --group
2 cars
5 cars
10 jeeps
5 jeeps
3 trucks
# example with variable number of blanks
# 'cars' entries were identified as duplicates, but not 'jeeps'
$ printf '2 cars\n5 cars\n1 jeeps\n5 jeeps\n3 trucks\n' | uniq -f1
2 cars
1 jeeps
5 jeeps
3 trucks
The -s option allows you to skip the first N characters (calculated as bytes).
# skip the first character
$ printf '* red\n- green\n* green\n* blue\n= blue' | uniq -s1
* red
- green
* blue
The -w option restricts the comparison to the first N characters (calculated as bytes).
# compare only the first 2 characters
$ printf '1) apple\n1) almond\n2) banana\n3) cherry' | uniq -w2
1) apple
2) banana
3) cherry
When these options are used simultaneously, the priority is -f first, then -s and finally the -w option. Remember that blanks are part of the field content.
# skip the first field
# then skip the first two characters (including the blank character)
# use the next two characters for comparison ('bl' and 'ch' in this example)
$ printf '2 @blue\n10 :black\n5 :cherry\n3 @chalk' | uniq -f1 -s2 -w2
2 @blue
5 :cherry
If a line doesn't have enough fields or characters to satisfy the -f and -s options respectively, a null string is used for comparison.
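Here's a small sketch (made-up input, not from the chapter's examples) to illustrate the null string behavior. Both lines below have only one field, so after skipping the first field the comparison string is empty for both, and the lines are treated as duplicates:

```shell
# 'red' and 'blue' both reduce to a null string after the first field is skipped
$ printf 'red\nblue\n' | uniq -f1
red
```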
Specifying output file
uniq can accept a filename as the source of input contents, but only a single file at most. If you specify a second file, it will be used as the output file.
$ printf 'apple\napple\nbanana\ncherry\ncherry\ncherry' > ip.txt
$ uniq ip.txt op.txt
$ cat op.txt
apple
banana
cherry
NUL separator
Use the -z option if you want to use the NUL character as the line separator. In this scenario, uniq will add a final NUL character even if it is not present in the input.
$ printf 'cherry\0cherry\0cherry\0apple\0banana' | uniq -z | cat -v
cherry^@apple^@banana^@
If grouping is specified, NUL will be used as the separator instead of the newline character.
Alternatives
Here are some alternate commands you can explore if uniq isn't enough to solve your task.
- Dealing with duplicates chapter from my GNU awk ebook
- Dealing with duplicates chapter from my Perl one-liners ebook
- huniq — remove duplicates from entire input contents, input order is maintained, supports count option as well
Exercises
The exercises directory has all the files used in this section.
1) Will uniq throw an error if the input is not sorted? What do you think will be the output for the following input?
$ printf 'red\nred\nred\ngreen\nred\nblue\nblue' | uniq
2) Are there differences between sort -u file and sort file | uniq?
3) What are the differences between the sort -u and uniq -u options, if any?
4) Filter the third column items from duplicates.csv. Construct three solutions to display only unique items, duplicate items and all duplicates.
$ cat duplicates.csv
brown,toy,bread,42
dark red,ruby,rose,111
blue,ruby,water,333
dark red,sky,rose,555
yellow,toy,flower,333
white,sky,bread,111
light red,purse,rose,333
# unique
##### add your solution here
flower
water
# duplicates
##### add your solution here
bread
rose
# all duplicates
##### add your solution here
bread
bread
rose
rose
rose
5) What does the --group option do? What customization features are available?
6) Count the number of times input lines are repeated and display the results in the format shown below.
$ s='brown\nbrown\nbrown\ngreen\nbrown\nblue\nblue'
$ printf '%b' "$s" | ##### add your solution here
      1 green
      2 blue
      4 brown
7) For the input file f1.txt, retain only unique entries based on the first two characters of each line. For example, abcd and ab12 should be considered as duplicates and neither of them will be part of the output.
$ cat f1.txt
3) cherry
1) apple
2) banana
1) almond
4) mango
2) berry
3) chocolate
1) apple
5) cherry
##### add your solution here
4) mango
5) cherry
8) For the input file f1.txt, display only the duplicate items without considering the first two characters of each line. For example, abcd and 12cd should be considered as duplicates. Assume that the third character of each line is always a space character.
##### add your solution here
1) apple
3) cherry
9) What does the -s option do?
10) Filter only unique lines, but ignore differences due to case.
$ printf 'cat\nbat\nCAT\nCar\nBat\nmat\nMat' | ##### add your solution here
Car