uniq
The uniq command identifies duplicate lines that are adjacent to each other. There are various options to help you filter unique or duplicate lines, count them, group them, etc.
Retain single copy of duplicates
This is the default behavior of the uniq command. If adjacent lines are the same, only the first copy will be displayed in the output.
# only the adjacent lines are compared to determine duplicates
# which is why you get 'red' twice in the output for this input
$ printf 'red\nred\nred\ngreen\nred\nblue\nblue' | uniq
red
green
red
blue
You'll need sorted input to make sure all the input lines are considered when determining duplicates. For some cases, sort -u is enough, like the example shown below:
# same as sort -u for this case
$ printf 'red\nred\nred\ngreen\nred\nblue\nblue' | sort | uniq
blue
green
red
Sometimes though, you may need to sort based on some specific criteria and then identify duplicates based on the entire line contents. Here's an example:
# can't use sort -n -u here
$ printf '2 balls\n13 pens\n2 pins\n13 pens\n' | sort -n | uniq
2 balls
2 pins
13 pens
sort+uniq won't be suitable if you need to preserve the input order as well. You can use alternatives like awk, perl and huniq for such cases.
# retain only the first copy of duplicates, maintain input order
$ printf 'red\nred\nred\ngreen\nred\nblue\nblue' | awk '!seen[$0]++'
red
green
blue
Duplicates only
The -d option will display only the duplicate entries. That is, only if a line is seen more than once.
$ cat purchases.txt
coffee
tea
washing powder
coffee
toothpaste
tea
soap
tea
$ sort purchases.txt | uniq -d
coffee
tea
To display all the copies of duplicates, use the -D option.
$ sort purchases.txt | uniq -D
coffee
coffee
tea
tea
tea
Unique only
The -u option will display only the unique entries. That is, only if a line doesn't occur more than once.
$ sort purchases.txt | uniq -u
soap
toothpaste
washing powder
# reminder that uniq works based on adjacent lines only
$ printf 'red\nred\nred\ngreen\nred\nblue\nblue' | uniq -u
green
red
Grouping similar lines
The --group option allows you to visually separate groups of similar lines with an empty line. This option can accept four values: separate, prepend, append and both. The default is separate, which adds a newline character between the groups. prepend will add a newline before the first group as well and append will add a newline after the last group. both combines the prepend and append behaviors.
$ sort purchases.txt | uniq --group
coffee
coffee

soap

tea
tea
tea

toothpaste

washing powder
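Here's a small sketch of the both value, using made-up input instead of the sample file. An empty line is added before the first group and after the last group, in addition to the separators between groups:

```shell
$ printf 'red\nred\nblue\n' | uniq --group=both

red
red

blue

```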
The --group option cannot be used with the -c, -d, -D or -u options. The --all-repeated alias for the -D option uses none as the default grouping. You can change that to separate or prepend.
$ sort purchases.txt | uniq --all-repeated=prepend

coffee
coffee

tea
tea
tea
Prefix count
If you want to know how many times a line has been repeated, use the -c option. The count will be added as a prefix.
$ sort purchases.txt | uniq -c
      2 coffee
      1 soap
      3 tea
      1 toothpaste
      1 washing powder
$ sort purchases.txt | uniq -dc
      2 coffee
      3 tea
The output of this option is usually piped to sort for ordering the output based on the count.
$ sort purchases.txt | uniq -c | sort -n
      1 soap
      1 toothpaste
      1 washing powder
      2 coffee
      3 tea
$ sort purchases.txt | uniq -c | sort -nr
      3 tea
      2 coffee
      1 washing powder
      1 toothpaste
      1 soap
Ignoring case
Use the -i option to ignore case while determining duplicates.
# depending on your locale, sort and sort -f can give the same results
$ printf 'hat\nbat\nHAT\ncar\nbat\nmat\nmoat' | sort -f | uniq -iD
bat
bat
hat
HAT
Partial match
uniq has three options to change the matching criteria to partial parts of the input line. These aren't as powerful as the sort -k option, but they do come in handy for some use cases.
The -f option allows you to skip the first N fields. Field separation is based on one or more space/tab characters only. Note that these separators will still be part of the field contents, so this will not work with a variable number of blanks.
# skip the first field, works as expected since the no. of blanks is consistent
$ printf '2 cars\n5 cars\n10 jeeps\n5 jeeps\n3 trucks\n' | uniq -f1 --group
2 cars
5 cars
10 jeeps
5 jeeps
3 trucks
# example with variable number of blanks
# 'cars' entries were identified as duplicates, but not 'jeeps'
$ printf '2 cars\n5 cars\n1 jeeps\n5 jeeps\n3 trucks\n' | uniq -f1
2 cars
1 jeeps
5 jeeps
3 trucks
The -s option allows you to skip the first N characters (calculated as bytes).
# skip the first character
$ printf '* red\n- green\n* green\n* blue\n= blue' | uniq -s1
* red
- green
* blue
The -w option restricts the comparison to the first N characters (calculated as bytes).
# compare only the first 2 characters
$ printf '1) apple\n1) almond\n2) banana\n3) cherry' | uniq -w2
1) apple
2) banana
3) cherry
When these options are used simultaneously, the priority is -f first, then -s and finally the -w option. Remember that blanks are part of the field content.
# skip the first field
# then skip the first two characters (including the blank character)
# use the next two characters for comparison ('bl' and 'ch' in this example)
$ printf '2 @blue\n10 :black\n5 :cherry\n3 @chalk' | uniq -f1 -s2 -w2
2 @blue
5 :cherry
If a line doesn't have enough fields or characters to satisfy the -f and -s options respectively, a null string is used for comparison.
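Here's a small sketch (made-up input, not from the chapter's examples) to illustrate the null string behavior. Both lines below have only one field, so after skipping the first field the comparison string is empty for both, and the lines are treated as duplicates:

```shell
# 'red' and 'blue' both reduce to a null string after the first field is skipped
$ printf 'red\nblue\n' | uniq -f1
red
```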
Specifying output file
uniq can accept a filename as the source of input contents, but only a single file at most. If you specify a second file, it will be used as the output file.
$ printf 'apple\napple\nbanana\ncherry\ncherry\ncherry' > ip.txt
$ uniq ip.txt op.txt
$ cat op.txt
apple
banana
cherry
NUL separator
Use the -z option if you want to use the NUL character as the line separator. In this scenario, uniq will add a final NUL character even if it is not present in the input.
$ printf 'cherry\0cherry\0cherry\0apple\0banana' | uniq -z | cat -v
cherry^@apple^@banana^@
If grouping is specified, NUL will be used as the separator instead of the newline character.
Alternatives
Here are some alternate commands you can explore if uniq isn't enough to solve your task.
- Dealing with duplicates chapter from my GNU awk ebook
- Dealing with duplicates chapter from my Perl one-liners ebook
- huniq — remove duplicates from entire input contents, input order is maintained, supports count option as well
Exercises
The exercises directory has all the files used in this section.
1) Will uniq throw an error if the input is not sorted? What do you think will be the output for the following input?
$ printf 'red\nred\nred\ngreen\nred\nblue\nblue' | uniq
2) Are there differences between sort -u file and sort file | uniq?
3) What are the differences between the sort -u and uniq -u options, if any?
4) Filter the third column items from duplicates.csv. Construct three solutions to display only unique items, duplicate items and all duplicates.
$ cat duplicates.csv
brown,toy,bread,42
dark red,ruby,rose,111
blue,ruby,water,333
dark red,sky,rose,555
yellow,toy,flower,333
white,sky,bread,111
light red,purse,rose,333
# unique
##### add your solution here
flower
water
# duplicates
##### add your solution here
bread
rose
# all duplicates
##### add your solution here
bread
bread
rose
rose
rose
5) What does the --group option do? What customization features are available?
6) Count the number of times input lines are repeated and display the results in the format shown below.
$ s='brown\nbrown\nbrown\ngreen\nbrown\nblue\nblue'
$ printf '%b' "$s" | ##### add your solution here
      1 green
      2 blue
      4 brown
7) For the input file f1.txt, retain only unique entries based on the first two characters of each line. For example, abcd and ab12 should be considered as duplicates and neither of them will be part of the output.
$ cat f1.txt
3) cherry
1) apple
2) banana
1) almond
4) mango
2) berry
3) chocolate
1) apple
5) cherry
##### add your solution here
4) mango
5) cherry
8) For the input file f1.txt, display only the duplicate items without considering the first two characters of each line. For example, abcd and 12cd should be considered as duplicates. Assume that the third character of each line is always a space character.
##### add your solution here
1) apple
3) cherry
9) What does the -s option do?
10) Filter only unique lines, but ignore differences due to case.
$ printf 'cat\nbat\nCAT\nCar\nBat\nmat\nMat' | ##### add your solution here
Car