Sorting Stuff
In this chapter, you'll learn how to sort input based on various criteria. Then, you'll learn about tools that typically require sorted input for performing operations like finding unique entries, comparing two files line by line and so on.
The example_files directory has the sample input files used in this chapter.
sort
As the name implies, this command is used to sort the contents of input files. Alphabetic sort and numeric sort? Possible. How about sorting a particular column? Possible. Prioritized multiple sorting order? Possible. Randomize? Unique? Lots of features supported by this powerful command.
Common options
Commonly used options are shown below. Examples will be discussed in the later sections.
-n      sort numerically
-g      general numeric sort
-V      version sort (aware of numbers within text)
-h      sort human readable numbers (ex: 4K, 3M, 12G, etc.)
-k      sort via key (column sorting)
-t      single byte character as the field separator (default is non-blank to blank transition)
-u      sort uniquely
-R      random sort
-r      reverse the sort output
-o      redirect sorted result to a specified filename (ex: for in-place sorting)
Default sort
By default, sort orders the input lexicographically in ascending order. You can use the -r option to reverse the results.
# default sort
$ printf 'banana\ncherry\napple' | sort
apple
banana
cherry
# sort and then display the results in reversed order
$ printf 'peace\nrest\nquiet' | sort -r
rest
quiet
peace
Use the -f option if you want to ignore case. See also the coreutils FAQ: Sort does not sort in normal order!
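For example, here's a made-up sample (LC_ALL=C is used so that the difference is easy to see):
# uppercase letters sort before lowercase in the C locale
$ printf 'Cherry\napple\nBanana' | LC_ALL=C sort
Banana
Cherry
apple
# -f folds lowercase to uppercase for comparison purposes
$ printf 'Cherry\napple\nBanana' | LC_ALL=C sort -f
apple
Banana
Cherry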
Numerical sort
There are several ways to deal with input containing different kinds of numbers:
$ printf '20\n2\n-3\n111\n3.14' | sort -n
-3
2
3.14
20
111
# sorting human readable numbers
$ sort -hr file_size.txt
1.4G games
316M projects
746K report.log
104K power.log
20K sample.txt
# version sort
$ sort -V timings.txt
3m20.058s
3m42.833s
4m3.083s
4m11.130s
5m35.363s
Unique sort
The -u option will keep only the first copy of lines that are deemed to be equal.
# -f option ignores case differences
$ printf 'CAT\nbat\ncat\ncar\nbat\n' | sort -fu
bat
car
CAT
Column sort
The -k option allows you to sort based on specific columns instead of the entire input line. By default, the empty string between non-blank and blank characters is considered as the separator. This option accepts arguments in various ways. You can specify the starting and ending column numbers separated by a comma. If you specify only the starting column, the last column will be used as the ending column. Usually you just want to sort by a single column, in which case the same number is specified as both the starting and ending columns. Here's an example:
$ cat shopping.txt
apple 50
toys 5
Pizza 2
mango 25
Banana 10
# sort based on the 2nd column numbers
$ sort -k2,2n shopping.txt
Pizza 2
toys 5
Banana 10
mango 25
apple 50
You can use the -t option to specify a single byte character as the field separator. Use \0 to specify the ASCII NUL character as the separator.
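Here's a sketch with a made-up comma-separated input, sorting numerically by the second column:
# -t, sets comma as the field separator
$ printf 'tea,10\ncoffee,5\nsoap,45' | sort -t, -k2,2n
coffee,5
tea,10
soap,45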
Use the -s option to retain the original order of input lines when two or more lines are deemed equal. You can still use multiple keys to specify your own tie-breakers; -s only prevents the last-resort comparison.
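As an illustration (sample input made up for this sketch), here's a stable sort based on the second column:
# without -s, the last-resort comparison would put 'apple 5' before 'peach 5'
$ printf 'peach 5\nlemon 2\napple 5\nfig 2' | sort -s -k2,2n
lemon 2
fig 2
peach 5
apple 5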
uniq
This command helps you identify and remove duplicates. It is usually used with sorted input, as the comparison is made between adjacent lines only.
Common options
Commonly used options are shown below. Examples will be discussed in the later sections.
-u      display only the unique entries
-d      display only the duplicate entries
-D      display all the copies of duplicates
-c      prefix count
-i      ignore case while determining duplicates
-f      skip the first N fields (separator is space/tab characters)
-s      skip the first N characters
-w      restrict the comparison to the first N characters
Default uniq
By default, uniq retains only one copy of duplicate lines:
# same as sort -u for this case
$ printf 'brown\nbrown\nbrown\ngreen\nbrown\nblue\nblue' | sort | uniq
blue
brown
green
# can't use sort -n -u here
$ printf '2 balls\n13 pens\n2 pins\n13 pens\n' | sort -n | uniq
2 balls
2 pins
13 pens
Unique and duplicate entries
The -u option will display only the unique entries. That is, a line is displayed only if it doesn't occur more than once.
$ cat purchases.txt
coffee
tea
washing powder
coffee
toothpaste
tea
soap
tea
$ sort purchases.txt | uniq -u
soap
toothpaste
washing powder
The -d option will display only the duplicate entries. That is, only if a line is seen more than once. To display all the copies of duplicates, use the -D option.
$ sort purchases.txt | uniq -d
coffee
tea
$ sort purchases.txt | uniq -D
coffee
coffee
tea
tea
tea
Prefix count
If you want to know how many times a line has been repeated, use the -c option. The count will be added as a prefix.
$ sort purchases.txt | uniq -c
2 coffee
1 soap
3 tea
1 toothpaste
1 washing powder
$ sort purchases.txt | uniq -dc
2 coffee
3 tea
# sorting by number of occurrences
$ sort purchases.txt | uniq -c | sort -nr
3 tea
2 coffee
1 washing powder
1 toothpaste
1 soap
Partial match
uniq has three options that restrict the comparison to parts of the input line. These aren't as powerful as the sort -k option, but they do come in handy for some use cases.
# compare only the first 2 characters
$ printf '1) apple\n1) almond\n2) banana\n3) cherry\n3) cup' | uniq -w2
1) apple
2) banana
3) cherry
# -f1 skips the first field
# -s2 then skips two characters (including the blank character)
# -w2 uses the next two characters for comparison ('bl' and 'ch' in this example)
$ printf '2 @blue\n10 :black\n5 :cherry\n3 @chalk' | uniq -f1 -s2 -w2
2 @blue
5 :cherry
comm
The comm command finds common and unique lines between two sorted files. By default, you'll get a tabular output with three columns:
- first column has lines unique to the first file
- second column has lines unique to the second file
- third column has lines common to both the files
# side by side view of already sorted sample files
$ paste c1.txt c2.txt
Blue Black
Brown Blue
Orange Green
Purple Orange
Red Pink
Teal Red
White White
# default three column output
$ comm c1.txt c2.txt
        Black
                Blue
Brown
        Green
                Orange
        Pink
Purple
                Red
        Teal
                White
You can use one or more of the following options to suppress columns:
-1      suppress lines unique to the first file
-2      suppress lines unique to the second file
-3      suppress lines common to both the files
# only the common lines
$ comm -12 c1.txt c2.txt
Blue
Orange
Red
White
# lines unique to the second file
$ comm -13 c1.txt c2.txt
Black
Green
Pink
join
By default, the join command combines two files based on the first field content (also referred to as the key). Only the lines with common keys will be part of the output.
The key field will be displayed first in the output (this distinction will come into play if the first field isn't the key). The rest of the line will have the remaining fields from the first and second files, in that order. One or more blanks (space or tab) will be considered as the input field separator and a single space will be used as the output field separator. If present, blank characters at the start of the input lines will be ignored.
# sample sorted input files
$ cat shopping_jan.txt
apple 10
banana 20
soap 3
tshirt 3
$ cat shopping_feb.txt
banana 15
fig 100
pen 2
soap 1
# combine common lines based on the first field
$ join shopping_jan.txt shopping_feb.txt
banana 20 15
soap 3 1
Note that the collating order used for join should be the same as the one used to sort the input files. Use join -i to ignore case, similar to sort -f usage.
If a field value is present multiple times in the same input file, all possible combinations will be present in the output. As shown below, join will also add a final newline character even if it is not present in the input.
$ join <(printf 'a f1_x\na f1_y') <(printf 'a f2_x\na f2_y')
a f1_x f2_x
a f1_x f2_y
a f1_y f2_x
a f1_y f2_y
There are many more features such as specifying field delimiter, selecting specific fields from each input file in a particular order, filling fields for non-matching lines and so on. See the join chapter from my CLI text processing with GNU Coreutils ebook for explanations and examples.
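As a quick, made-up taste of two of those features:
# -t, sets comma as the delimiter for both input and output
# -a1 also prints lines from the first file that didn't find a match
$ join -t, -a1 <(printf 'apple,10\nfig,5') <(printf 'apple,2\nguava,6')
apple,10,2
fig,5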
Exercises
Use the example_files/text_files directory for input files used in the following exercises.
1) Default sort doesn't work for numbers. Correct the command used below:
# wrong output
$ printf '100\n10\n20\n3000\n2.45\n' | sort
10
100
20
2.45
3000
# expected output
$ printf '100\n10\n20\n3000\n2.45\n' | sort # ???
2.45
10
20
100
3000
2) Which sort option will help you ignore case?
$ printf 'Super\nover\nRUNE\ntea\n' | LC_ALL=C sort # ???
over
RUNE
Super
tea
3) Go through the sort manual and use appropriate options to get the output shown below.
# wrong output
$ printf '+120\n-1.53\n3.14e+4\n42.1e-2' | sort -n
-1.53
+120
3.14e+4
42.1e-2
# expected output
$ printf '+120\n-1.53\n3.14e+4\n42.1e-2' | sort # ???
-1.53
42.1e-2
+120
3.14e+4
4) Sort the scores.csv file numerically in ascending order using the contents of the second field. The header line should be preserved as the first line, as shown below. Hint: see the Shell Features chapter.
# ???
Name,Maths,Physics,Chemistry
Lin,78,83,80
Cy,97,98,95
Ith,100,100,100
5) Sort the contents of duplicates.txt by the fourth column numbers in descending order. Retain only the first copy of lines with the same number.
# ???
dark red,sky,rose,555
blue,ruby,water,333
dark red,ruby,rose,111
brown,toy,bread,42
6) Will uniq throw an error if the input is not sorted? What do you think will be the output for the following input?
$ printf 'red\nred\nred\ngreen\nred\nblue\nblue' | uniq
# ???
7) Retain only the unique entries based on the first two characters of the input lines. Sort the input if necessary.
$ printf '3) cherry\n1) apple\n2) banana\n1) almond\n'
3) cherry
1) apple
2) banana
1) almond
$ printf '3) cherry\n1) apple\n2) banana\n1) almond\n' | # ???
2) banana
3) cherry
8) Count the number of times input lines are repeated and display the results in the format shown below.
$ printf 'brown\nbrown\nbrown\ngreen\nbrown\nblue\nblue' | # ???
1 green
2 blue
4 brown
9) Display lines present in c1.txt but not in c2.txt using the comm command. Assume that the input files are already sorted.
# ???
Brown
Purple
Teal
10) Use appropriate options to get the expected output shown below.
# wrong usage, no output
$ join <(printf 'apple 2\nfig 5') <(printf 'Fig 10\nmango 4')
# expected output
# ???
fig 5 10
11) What are the differences between the sort -u and uniq -u options, if any?