Sorting Stuff

In this chapter, you'll learn how to sort input based on various criteria. You'll then learn about tools that typically require sorted input for operations like finding unique entries, comparing two files line by line, and so on.

The example_files directory has the sample input files used in this chapter.

sort

As the name implies, this command is used to sort the contents of input files. Alphabetic sort and numeric sort? Possible. How about sorting a particular column? Possible. Prioritized multiple sorting order? Possible. Randomize? Unique? Lots of features supported by this powerful command.

Common options

Commonly used options are shown below. Examples will be discussed in the later sections.

-n sort numerically
-g general numeric sort
-V version sort (aware of numbers within text)
-h sort human readable numbers (ex: 4K, 3M, 12G, etc)
-k sort via key (column sorting)
-t single byte character as the field separator (default is non-blank to blank transition)
-u sort uniquely
-R random sort
-r reverse the sort output
-o redirect sorted result to a specified filename (ex: for inplace sorting)
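
As a small sketch of the -o option, the following sorts a file in place. The temporary file created with mktemp and its contents are made up for illustration; they are not part of the example_files directory.

```shell
# hypothetical scratch file, not one of the chapter's sample files
tmpfile=$(mktemp)
printf 'banana\ncherry\napple\n' > "$tmpfile"

# sort reads all of the input before opening the output file,
# so the same filename can be given for both
sort -o "$tmpfile" "$tmpfile"

inplace_result=$(cat "$tmpfile")
printf '%s\n' "$inplace_result"
# apple
# banana
# cherry

rm -f "$tmpfile"
```

Plain shell redirection (sort file > file) would not work here, since the shell truncates the file before sort gets to read it.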

Default sort

By default, sort orders the input lexicographically in ascending order. You can use the -r option to reverse the results.

# default sort
$ printf 'banana\ncherry\napple' | sort
apple
banana
cherry

# sort and then display the results in reversed order
$ printf 'peace\nrest\nquiet' | sort -r
rest
quiet
peace

Use the -f option if you want to ignore case. See also the coreutils FAQ entry "Sort does not sort in normal order!".
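
Here's a minimal sketch of the effect of -f. The input words are made up, and LC_ALL=C is set so that the plain sort compares raw byte values (uppercase before lowercase); other locales may order mixed case input differently.

```shell
# byte-wise comparison: uppercase letters sort before lowercase ones
bytewise=$(printf 'cherry\nBanana\napple\n' | LC_ALL=C sort)
printf '%s\n' "$bytewise"
# Banana
# apple
# cherry

# -f folds lowercase to uppercase before comparing
folded=$(printf 'cherry\nBanana\napple\n' | LC_ALL=C sort -f)
printf '%s\n' "$folded"
# apple
# Banana
# cherry
```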

Numerical sort

There are several ways to deal with input containing different kinds of numbers:

$ printf '20\n2\n-3\n111\n3.14' | sort -n
-3
2
3.14
20
111

# sorting human readable numbers
$ sort -hr file_size.txt
1.4G    games
316M    projects
746K    report.log
104K    power.log
20K     sample.txt

# version sort
$ sort -V timings.txt
3m20.058s
3m42.833s
4m3.083s
4m11.130s
5m35.363s

Unique sort

The -u option will keep only the first copy of lines that are deemed to be equal.

# -f option ignores case differences
$ printf 'CAT\nbat\ncat\ncar\nbat\n' | sort -fu
bat
car
CAT

Column sort

The -k option allows you to sort based on specific columns instead of the entire input line. By default, the empty string between non-blank and blank characters is considered as the separator. This option accepts arguments in various ways. You can specify starting and ending column numbers separated by a comma. If you specify only the starting column, the last column will be used as the ending column. Usually you just want to sort by a single column, in which case the same number is specified as both the starting and ending columns. Here's an example:

$ cat shopping.txt
apple   50
toys    5
Pizza   2
mango   25
Banana  10

# sort based on the 2nd column numbers
$ sort -k2,2n shopping.txt
Pizza   2
toys    5
Banana  10
mango   25
apple   50

You can use the -t option to specify a single byte character as the field separator. Use \0 to specify ASCII NUL as the separator.
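
For instance, comma separated input can be sorted by a chosen field as sketched below. The sample data is made up for illustration.

```shell
# -t, sets comma as the field separator
# -k2,2n sorts numerically using only the 2nd field
by_qty=$(printf 'car,20\nbat,3\napple,100\n' | sort -t, -k2,2n)
printf '%s\n' "$by_qty"
# bat,3
# car,20
# apple,100

# -k1,1 sorts using only the 1st field
by_name=$(printf 'car,20\nbat,3\napple,100\n' | sort -t, -k1,1)
printf '%s\n' "$by_name"
# apple,100
# bat,3
# car,20
```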

Use the -s option to retain the original order of input lines when two or more lines are deemed equal. You can still use multiple keys to specify your own tie breakers, -s only prevents the last resort comparison.
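
The sketch below contrasts the default behavior, the -s option and an explicit tie breaker. The input is made up for illustration, and LC_ALL=C is used so that the last resort comparison is a plain byte-wise comparison of the whole line.

```shell
# default: lines with the same 2nd column value are ordered
# by a last resort comparison of the entire line
default_ties=$(printf 'b 2\na 1\nc 2\na 2\n' | LC_ALL=C sort -k2,2n)
printf '%s\n' "$default_ties"
# a 1
# a 2
# b 2
# c 2

# -s: lines with the same 2nd column value keep their input order
stable_ties=$(printf 'b 2\na 1\nc 2\na 2\n' | LC_ALL=C sort -s -k2,2n)
printf '%s\n' "$stable_ties"
# a 1
# b 2
# c 2
# a 2

# explicit tie breaker: 2nd column numerically,
# then 1st column in reverse order
custom_ties=$(printf 'b 2\na 1\nc 2\na 2\n' | LC_ALL=C sort -k2,2n -k1,1r)
printf '%s\n' "$custom_ties"
# a 1
# c 2
# b 2
# a 2
```

Modifiers such as n and r attached to a key apply only to that key, which is what allows mixing numeric and reversed text ordering in a single command.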

uniq

This command helps you to identify and remove duplicates. It is usually used with sorted input, since the comparison is made between adjacent lines only.

Common options

Commonly used options are shown below. Examples will be discussed in the later sections.

-u display only the unique entries
-d display only the duplicate entries
-D display all the copies of duplicates
-c prefix count
-i ignore case while determining duplicates
-f skip the first N fields (separator is space/tab characters)
-s skip the first N characters
-w restricts the comparison to the first N characters

Default uniq

By default, uniq retains only one copy of duplicate lines:

# same as sort -u for this case
$ printf 'brown\nbrown\nbrown\ngreen\nbrown\nblue\nblue' | sort | uniq
blue
brown
green

# can't use sort -n -u here
$ printf '2 balls\n13 pens\n2 pins\n13 pens\n' | sort -n | uniq
2 balls
2 pins
13 pens
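
To see why sort -n -u isn't suitable for the second example above, here's a sketch of what it produces. With -n, lines are deemed equal when their leading numbers match, so -u drops '2 pins' even though the rest of the line differs from '2 balls'.

```shell
# only the leading number is compared,
# so '2 balls' and '2 pins' are deemed equal
numeric_unique=$(printf '2 balls\n13 pens\n2 pins\n13 pens\n' | sort -n -u)
printf '%s\n' "$numeric_unique"
# 2 balls
# 13 pens

# uniq compares entire lines,
# so only the exact duplicate '13 pens' is removed
whole_line_unique=$(printf '2 balls\n13 pens\n2 pins\n13 pens\n' | sort -n | uniq)
printf '%s\n' "$whole_line_unique"
# 2 balls
# 2 pins
# 13 pens
```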

Unique and duplicate entries

The -u option will display only the unique entries, that is, lines that don't occur more than once.

$ cat purchases.txt
coffee
tea
washing powder
coffee
toothpaste
tea
soap
tea

$ sort purchases.txt | uniq -u
soap
toothpaste
washing powder

The -d option will display only the duplicate entries, that is, one copy of each line that is seen more than once. To display all the copies of duplicates, use the -D option.

$ sort purchases.txt | uniq -d
coffee
tea

$ sort purchases.txt | uniq -D
coffee
coffee
tea
tea
tea

Prefix count

If you want to know how many times a line has been repeated, use the -c option. The count is added as a prefix to each output line.

$ sort purchases.txt | uniq -c
      2 coffee
      1 soap
      3 tea
      1 toothpaste
      1 washing powder

$ sort purchases.txt | uniq -dc
      2 coffee
      3 tea

# sorting by number of occurrences
$ sort purchases.txt | uniq -c | sort -nr
      3 tea
      2 coffee
      1 washing powder
      1 toothpaste
      1 soap
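
The -i option can be combined with -c to count entries irrespective of case. Here's a small sketch with made-up input. sort -f is used so that entries differing only by case end up adjacent, and LC_ALL=C makes the order among such entries predictable (uppercase first). The width of the count column shown in the comments is what GNU uniq prints; other implementations may pad differently.

```shell
# uniq -i keeps the first line of each case-insensitive group
counts=$(printf 'Tea\ncoffee\ntea\nTEA\nCoffee\n' | LC_ALL=C sort -f | LC_ALL=C uniq -ic)
printf '%s\n' "$counts"
#       2 Coffee
#       3 TEA
```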

Partial match

uniq has three options (-f, -s and -w) that restrict the comparison to parts of the input line. These aren't as powerful as the sort -k option, but they do come in handy for some use cases.

# compare only the first 2 characters
$ printf '1) apple\n1) almond\n2) banana\n3) cherry\n3) cup' | uniq -w2
1) apple
2) banana
3) cherry

# -f1 skips the first field
# -s2 then skips two characters (including the blank character)
# -w2 uses the next two characters for comparison ('bl' and 'ch' in this example)
$ printf '2 @blue\n10 :black\n5 :cherry\n3 @chalk' | uniq -f1 -s2 -w2
2 @blue
5 :cherry

comm

The comm command finds common and unique lines between two sorted files. By default, you'll get a tabular output with three columns:

first column has lines unique to the first file
second column has lines unique to the second file
third column has lines common to both the files

# side by side view of already sorted sample files
$ paste c1.txt c2.txt
Blue    Black
Brown   Blue
Orange  Green
Purple  Orange
Red     Pink
Teal    Red
White   White

# default three column output
$ comm c1.txt c2.txt
        Black
                Blue
Brown
        Green
                Orange
        Pink
Purple
                Red
Teal
                White

You can use one or more of the following options to suppress columns:

-1 to suppress lines unique to the first file
-2 to suppress lines unique to the second file
-3 to suppress lines common to both the files

# only the common lines
$ comm -12 c1.txt c2.txt
Blue
Orange
Red
White

# lines unique to the second file
$ comm -13 c1.txt c2.txt
Black
Green
Pink
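
If the inputs aren't already sorted, sort them first. Here's a rough sketch using made-up data and temporary files (the filenames are hypothetical). In bash, process substitution such as comm -12 <(sort a.txt) <(sort b.txt) avoids the intermediate files. LC_ALL=C is applied to both sort and comm so that they agree on the collating order.

```shell
# hypothetical unsorted inputs in a scratch directory
tmpdir=$(mktemp -d)
printf 'Red\nBlue\nWhite\n' > "$tmpdir/a.txt"
printf 'White\nGreen\nBlue\n' > "$tmpdir/b.txt"

# sort each file before handing it to comm
LC_ALL=C sort -o "$tmpdir/a_sorted.txt" "$tmpdir/a.txt"
LC_ALL=C sort -o "$tmpdir/b_sorted.txt" "$tmpdir/b.txt"

# only the lines common to both files
common=$(LC_ALL=C comm -12 "$tmpdir/a_sorted.txt" "$tmpdir/b_sorted.txt")
printf '%s\n' "$common"
# Blue
# White

rm -rf "$tmpdir"
```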

join

By default, the join command combines two files based on the content of the first field (also referred to as the key). Only the lines with common keys will be part of the output.

The key field will be displayed first in the output (this distinction will come into play if the first field isn't the key). The rest of the line will have the remaining fields from the first and second files, in that order. One or more blanks (space or tab) will be considered as the input field separator and a single space will be used as the output field separator. If present, blank characters at the start of the input lines will be ignored.

# sample sorted input files
$ cat shopping_jan.txt
apple   10
banana  20
soap    3
tshirt  3
$ cat shopping_feb.txt
banana  15
fig     100
pen     2
soap    1

# combine common lines based on the first field
$ join shopping_jan.txt shopping_feb.txt
banana 20 15
soap 3 1

Note that the collating order used for join should be the same as the one used to sort the input files. Use join -i to ignore case, similar to sort -f usage.

If a key is present multiple times in the same input file, all possible combinations will be present in the output. As shown below, join will also add a final newline character even if it isn't present in the input.

$ join <(printf 'a f1_x\na f1_y') <(printf 'a f2_x\na f2_y')
a f1_x f2_x
a f1_x f2_y
a f1_y f2_x
a f1_y f2_y

There are many more features such as specifying field delimiter, selecting specific fields from each input file in a particular order, filling fields for non-matching lines and so on. See the join chapter from my CLI text processing with GNU Coreutils ebook for explanations and examples.
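
As a small taste of those features, here's a sketch of the -t and -a options with made-up comma separated data (the temporary filenames are hypothetical). -t sets the field separator for both input and output, and -a1 additionally prints the lines from the first file that have no match in the second.

```shell
# hypothetical sorted inputs in a scratch directory
tmpdir=$(mktemp -d)
printf 'apple,10\nbanana,20\nsoap,3\n' > "$tmpdir/jan.csv"
printf 'banana,15\nfig,100\nsoap,1\n' > "$tmpdir/feb.csv"

# comma as the field separator, only lines with common keys
paired=$(join -t, "$tmpdir/jan.csv" "$tmpdir/feb.csv")
printf '%s\n' "$paired"
# banana,20,15
# soap,3,1

# also include unpaired lines from the first file
with_unpaired=$(join -t, -a1 "$tmpdir/jan.csv" "$tmpdir/feb.csv")
printf '%s\n' "$with_unpaired"
# apple,10
# banana,20,15
# soap,3,1

rm -rf "$tmpdir"
```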

Exercises

Use the example_files/text_files directory for input files used in the following exercises.

1) Default sort doesn't work for numbers. Correct the command used below:

# wrong output
$ printf '100\n10\n20\n3000\n2.45\n' | sort
10
100
20
2.45
3000

# expected output
$ printf '100\n10\n20\n3000\n2.45\n' | sort # ???
2.45
10
20
100
3000

2) Which sort option will help you ignore case?

$ printf 'Super\nover\nRUNE\ntea\n' | LC_ALL=C sort # ???
over
RUNE
Super
tea

3) Go through the sort manual and use appropriate options to get the output shown below.

# wrong output
$ printf '+120\n-1.53\n3.14e+4\n42.1e-2' | sort -n
-1.53
+120
3.14e+4
42.1e-2

# expected output
$ printf '+120\n-1.53\n3.14e+4\n42.1e-2' | sort # ???
-1.53
42.1e-2
+120
3.14e+4

4) Sort the scores.csv file numerically in ascending order using the contents of the second field. Header line should be preserved as the first line as shown below. Hint: see the Shell Features chapter.

# ???
Name,Maths,Physics,Chemistry
Lin,78,83,80
Cy,97,98,95
Ith,100,100,100

5) Sort the contents of duplicates.txt by the fourth column numbers in descending order. Retain only the first copy of lines with the same number.

# ???
dark red,sky,rose,555
blue,ruby,water,333
dark red,ruby,rose,111
brown,toy,bread,42

6) Will uniq throw an error if the input is not sorted? What do you think will be the output for the following input?

$ printf 'red\nred\nred\ngreen\nred\nblue\nblue' | uniq
# ???

7) Retain only the unique entries based on the first two characters of the input lines. Sort the input if necessary.

$ printf '3) cherry\n1) apple\n2) banana\n1) almond\n'
3) cherry
1) apple
2) banana
1) almond

$ printf '3) cherry\n1) apple\n2) banana\n1) almond\n' | # ???
2) banana
3) cherry

8) Count the number of times input lines are repeated and display the results in the format shown below.

$ printf 'brown\nbrown\nbrown\ngreen\nbrown\nblue\nblue' | # ???
      1 green
      2 blue
      4 brown

9) Display lines present in c1.txt but not in c2.txt using the comm command. Assume that the input files are already sorted.

# ???
Brown
Purple
Teal

10) Use appropriate options to get the expected output shown below.

# wrong usage, no output
$ join <(printf 'apple 2\nfig 5') <(printf 'Fig 10\nmango 4')

# expected output
# ???
fig 5 10

11) What are the differences between sort -u and uniq -u options, if any?

Linux Command Line Computing