sort
The sort command provides a wide variety of features. In addition to lexicographic ordering, it supports various numerical formats. You can also sort based on particular columns. And there are nifty features like merging already sorted input, debugging, determining whether the input is already sorted and so on.
Default sort and Collating order
By default, sort orders the input in ascending order. If you know about ASCII codepoints, do you agree that the following two examples are showing the correct expected output?
$ cat greeting.txt
Hi there
Have a nice day
# extract and sort space separated words
$ <greeting.txt tr ' ' '\n' | sort
a
day
Have
Hi
nice
there
$ printf '(banana)\n{cherry}\n[apple]' | sort
[apple]
(banana)
{cherry}
From the sort manual:
Unless otherwise specified, all comparisons use the character collating sequence specified by the LC_COLLATE locale.
If you use a non-POSIX locale (e.g., by setting LC_ALL to en_US), then sort may produce output that is sorted differently than you're accustomed to. In that case, set the LC_ALL environment variable to C. Note that setting only LC_COLLATE has two problems. First, it is ineffective if LC_ALL is also set. Second, it has undefined behavior if LC_CTYPE (or LANG, if LC_CTYPE is unset) is set to an incompatible value. For example, you get undefined behavior if LC_CTYPE is ja_JP.PCK but LC_COLLATE is en_US.UTF-8.
My locale settings are based on en_IN, which is different from the POSIX sorting order. So, the fact to remember is that sort obeys the rules of the current locale. If you want POSIX sorting, one option is to use LC_ALL=C as shown below.
$ <greeting.txt tr ' ' '\n' | LC_ALL=C sort
Have
Hi
a
day
nice
there
$ printf '(banana)\n{cherry}\n[apple]' | LC_ALL=C sort
(banana)
[apple]
{cherry}
Another benefit of the C locale is that it will be significantly faster compared to Unicode parsing and sorting rules.
Use the -f option if you want to explicitly ignore case. See also GNU Core Utilities FAQ: Sort does not sort in normal order!.
See this unix.stackexchange thread if you want to create your own custom sort order.
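As a minimal sketch of the -f option, here are the words from greeting.txt again, supplied inline via printf so that no file is assumed. LC_ALL=C is used so that the result doesn't depend on your locale settings.

```shell
# plain POSIX ordering would place 'Have' and 'Hi' before the lowercase words
# -f compares as if every lowercase letter were uppercase
printf 'Hi\nthere\nHave\na\nnice\nday\n' | LC_ALL=C sort -f
# a
# day
# Have
# Hi
# nice
# there
```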
Ignoring headers
You can use sed -u to consume only the header lines and leave the rest of the input for the sort command. Note that this unbuffered option is supported by GNU sed, but it might not be available with other implementations.
$ cat scores.csv
Name,Maths,Physics,Chemistry
Ith,100,100,100
Cy,97,98,95
Lin,78,83,80
# 1q is used to quit after the first line
$ ( sed -u '1q' ; sort ) <scores.csv
Name,Maths,Physics,Chemistry
Cy,97,98,95
Ith,100,100,100
Lin,78,83,80
See this unix.stackexchange thread for more ways of ignoring headers. See bash manual: Grouping Commands for more details about the () grouping used in the above example.
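If sed -u isn't available, one alternative sketch is to let the shell's read builtin consume the header line. This assumes that read doesn't consume more than one line when reading from a pipe, which is my understanding of how POSIX shells behave. The sample data is a shortened, made-up version of scores.csv passed inline via printf.

```shell
# read consumes only the first line, sort gets the rest of the input
printf 'Name,Maths\nIth,100\nCy,97\nLin,78\n' |
    { IFS= read -r header; printf '%s\n' "$header"; LC_ALL=C sort; }
# Name,Maths
# Cy,97
# Ith,100
# Lin,78
```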
Dictionary sort
The -d option will consider only letters, digits and blanks for sorting. Space and tab characters are considered as blanks, but this would also depend on the locale.
$ printf '(banana)\n{cherry}\n[apple]' | LC_ALL=C sort -d
[apple]
(banana)
{cherry}
Use the -i option if you want to ignore only the non-printing characters.
Reversed order
The -r option will reverse the output order. Note that this doesn't change how sort performs comparisons, only the output is reversed. You'll see an example later where this distinction becomes clearer.
$ printf 'peace\nrest\nquiet' | sort -r
rest
quiet
peace
In case you haven't noticed yet, sort adds a newline character to the final line even if it wasn't present in the input.
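One way to check the trailing newline behavior is to count newline characters with wc -l before and after sorting. This is only a quick sketch. The exact spacing of the wc output can vary between implementations.

```shell
# the input has two newline characters, the last line is unterminated
printf 'peace\nrest\nquiet' | wc -l
# the sorted output has three, one for each line
printf 'peace\nrest\nquiet' | sort | wc -l
```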
Numeric sort
The sort command provides various options to work with numeric formats. For most cases, the -n option is enough. Here's an example:
# lexicographic ordering isn't suited for numbers
$ printf '20\n2\n3\n111\n314' | sort
111
2
20
3
314
# -n helps in this case
$ printf '20\n2\n3\n111\n314' | sort -n
2
3
20
111
314
The -n option can handle negative and floating-point numbers as well. The decimal point and the thousands separator characters will depend on the locale settings.
$ cat mixed_numbers.txt
12,345
42
31.24
-100
42
5678
# , is the thousands separator in en_IN
# . is the decimal point in en_IN
$ sort -n mixed_numbers.txt
-100
31.24
42
42
5678
12,345
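To see the locale dependence, here's a sketch that sorts a few of the same numbers under the C locale. My understanding is that the C locale has no thousands separator, so 12,345 is read as 12 followed by extra characters.

```shell
# ',' is not a thousands separator in the C locale
printf '12,345\n42\n5678\n' | LC_ALL=C sort -n
# 12,345
# 42
# 5678
```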
Use the -g option if your input can have the + prefix for positive numbers or follows the E-scientific notation.
$ cat e_notation.txt
+120
-1.53
3.14e+4
42.1e-2
$ sort -g e_notation.txt
-1.53
42.1e-2
+120
3.14e+4
Unless otherwise specified, sort will break ties by using the entire input line content. In the case of -n, sorting will work even if there are extra characters after the number. Those extra characters will affect the output order if the numbers are equal. If a line doesn't start with a number (excluding blanks), it will be treated as 0.
# 'b' comes before 'p'
$ printf '2 pins\n13 pens\n2 balls' | sort -n
2 balls
2 pins
13 pens
# 'z' and 'a2p' will be treated as '0'
# 'a' comes before 'z'
$ printf 'z\na2p\n13p\n2b\n-1\n 10' | sort -n
-1
a2p
z
2b
 10
13p
Human numeric sort
Commands like du (disk usage) have the -h and --si options to display numbers with SI suffixes like k, K, M, G and so on. In such cases, you can use sort -h to order them.
$ cat file_size.txt
104K    power.log
316M    projects
746K    report.log
20K     sample.txt
1.4G    games
$ sort -hr file_size.txt
1.4G    games
316M    projects
746K    report.log
104K    power.log
20K     sample.txt
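One caveat, based on my reading of the GNU sort manual: -h compares the suffix before the numeric value, so it expects values that are consistently scaled, as in the output of a single du -h invocation. Here's a sketch with deliberately unscaled, made-up input:

```shell
# 2000K is larger than 1M, but the K suffix sorts before M
printf '1M\n2000K\n' | sort -h
# 2000K
# 1M
```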
Version sort
The -V option is useful when you have a mix of alphabets and digits. It also helps when you want to treat digits after a decimal point as whole numbers, for example 1.10 should be greater than 1.2.
$ printf '1.10\n1.2' | sort -n
1.10
1.2
$ printf '1.10\n1.2' | sort -V
1.2
1.10
$ cat versions.txt
file2
cmd5.2
file10
cmd1.6
file5
cmd5.10
$ sort -V versions.txt
cmd1.6
cmd5.2
cmd5.10
file2
file5
file10
Here's an example of dealing with numbers reported by the time command (assuming all the entries have the same format).
$ cat timings.txt
5m35.363s
3m20.058s
4m11.130s
3m42.833s
4m3.083s
$ sort -V timings.txt
3m20.058s
3m42.833s
4m3.083s
4m11.130s
5m35.363s
See Version sort ordering for more details. Note that the ls command uses lowercase -v for this task.
Random sort
The -R option will display the output in random order. Unlike shuf, this option will always place identical lines next to each other due to the implementation.
# the two lines with '42' will always be next to each other
# use 'shuf' if you don't want this behavior
$ sort -R mixed_numbers.txt
31.24
5678
42
42
12,345
-100
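A quick way to observe this difference is to pass the shuffled output through uniq, which removes only adjacent duplicates. This is just a sketch with made-up input. The shuf result varies from run to run.

```shell
# identical lines end up adjacent, so uniq always reduces them to one copy
# this reports 3 lines on every run
printf '42\n31.24\n42\n5678\n' | sort -R | uniq | wc -l

# with shuf, the two '42' lines can get separated
# so this can report either 3 or 4 lines
printf '42\n31.24\n42\n5678\n' | shuf | uniq | wc -l
```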
Unique sort
The -u option will keep only the first copy of lines that are deemed equal.
# (10) and [10] are deemed equal with dictionary sorting
$ printf '(10)\n[20]\n[10]' | sort -du
(10)
[20]
$ cat purchases.txt
coffee
tea
washing powder
coffee
toothpaste
tea
soap
tea
$ sort -u purchases.txt
coffee
soap
tea
toothpaste
washing powder
As seen earlier, the -n option will work even if there are extra characters after the number. When the -u option is also used, only the first such copy will be retained. Use the uniq command if you want to remove duplicates based on the whole line.
$ printf '2 balls\n13 pens\n2 pins\n13 pens\n' | sort -nu
2 balls
13 pens
# note that only the output order is reversed
# use tac if you want the last duplicate to be preserved instead of the first
$ printf '2 balls\n13 pens\n2 pins\n13 pens\n' | sort -r -nu
13 pens
2 balls
# use uniq when the entire line contents should be compared
$ printf '2 balls\n13 pens\n2 pins\n13 pens\n' | sort -n | uniq
2 balls
2 pins
13 pens
You can use the -f option to ignore case while determining duplicates.
$ printf 'mat\nbat\nMAT\ncar\nbat\n' | sort -u
bat
car
mat
MAT
# the first copy between 'mat' and 'MAT' is retained
$ printf 'mat\nbat\nMAT\ncar\nbat\n' | sort -fu
bat
car
mat
Column sort
The -k option allows you to sort based on specific columns instead of the entire input line. By default, the empty string between non-blank and blank characters is considered as the separator and thus the blanks are also part of the field contents. The effect of blanks and mitigation will be discussed later.
The -k option accepts arguments in various ways. You can specify the starting and ending column numbers separated by a comma. If you specify only the starting column, the last column will be used as the ending column. Usually you just want to sort by a single column, in which case the same number is specified as both the starting and ending columns. Here's an example:
$ cat shopping.txt
apple   50
toys    5
Pizza   2
mango   25
Banana  10
# sort based on the 2nd column numbers
$ sort -k2,2n shopping.txt
Pizza   2
toys    5
Banana  10
mango   25
apple   50
Note that in the above example, the -n option was also appended to the -k option. This makes it specific to that column and overrides global options, if any. Also, remember that the entire line will be used to break ties, unless otherwise specified.
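Here's a small sketch of how key-specific and global options interact, using made-up two-column input. My reading of the GNU manual is that a key with its own ordering options doesn't inherit the global ones, while the last resort comparison still honors a global -r.

```shell
# the key has its own 'n', so the global -r is not applied to the key
# 'a 2' and 'c 2' are tied, and the entire line tie-breaker is reversed by -r
printf 'a 2\nb 1\nc 2\n' | LC_ALL=C sort -r -k2,2n
# b 1
# c 2
# a 2
```

If you want the numbers in descending order instead, append r to the key itself, as in -k2,2nr.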
You can use the -t option to specify a single byte character as the field separator. Use \0 to specify NUL as the separator. Depending on your shell you can use ANSI-C quoting to use escapes like \t instead of a literal tab character. When the -t option is used, the field separator won't be part of the field contents.
# department,name,marks
$ cat marks.csv
ECE,Raj,53
ECE,Joel,72
EEE,Moi,68
CSE,Surya,81
EEE,Raj,88
CSE,Moi,62
EEE,Tia,72
ECE,Om,92
CSE,Amy,67
# name column is the primary sort key
# entire line content will be used for breaking ties
$ sort -t, -k2,2 marks.csv
CSE,Amy,67
ECE,Joel,72
CSE,Moi,62
EEE,Moi,68
ECE,Om,92
ECE,Raj,53
EEE,Raj,88
CSE,Surya,81
EEE,Tia,72
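For tab separated input, here's a sketch that obtains a literal tab character in a portable way. The variable name tab and the sample data are made up for illustration. With shells that support ANSI-C quoting (bash, for example), -t $'\t' should be equivalent.

```shell
# command substitution strips only trailing newlines, so the tab is preserved
tab=$(printf '\t')

# numeric sort based on the second tab separated column
printf 'apple\t50\ntoys\t5\nPizza\t2\n' | sort -t "$tab" -k2,2n
```

The lines come out in the order Pizza, toys, apple.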
You can use the -k option multiple times to specify your own order of tie breakers. The entire line will still be used to break ties if needed.
# second column is the primary key
# reversed numeric sort on the third column is the secondary key
# entire line will be used only if there are still tied entries
$ sort -t, -k2,2 -k3,3nr marks.csv
CSE,Amy,67
ECE,Joel,72
EEE,Moi,68
CSE,Moi,62
ECE,Om,92
EEE,Raj,88
ECE,Raj,53
CSE,Surya,81
EEE,Tia,72
# sort by month first and then the day
# -M option sorts based on abbreviated month names
$ printf 'Aug-20\nMay-5\nAug-3' | sort -t- -k1,1M -k2,2n
May-5
Aug-3
Aug-20
Use the -s option to retain the original order of input lines when two or more lines are deemed equal. You can still use multiple keys to specify your own tie breakers, -s only prevents the last resort comparison.
# -s prevents last resort comparison
# so, lines having the same value in the 2nd column will retain input order
$ sort -t, -s -k2,2 marks.csv
CSE,Amy,67
ECE,Joel,72
EEE,Moi,68
CSE,Moi,62
ECE,Om,92
ECE,Raj,53
EEE,Raj,88
CSE,Surya,81
EEE,Tia,72
The -u option, as discussed earlier, will retain only the first copy of lines that are deemed equal.
# only the first copy of duplicates in the 2nd column will be retained
$ sort -t, -u -k2,2 marks.csv
CSE,Amy,67
ECE,Joel,72
EEE,Moi,68
ECE,Om,92
ECE,Raj,53
CSE,Surya,81
EEE,Tia,72
Character positions within columns
The -k option also accepts starting and ending character positions within the columns. These are specified after the column number, separated by a . character. If the character position is not specified for the ending column, the last character of that column is assumed.
The character positions start with 1 for the first character. Recall that when the -t option is used, the field separator is not part of the field contents.
# based on the second column number
# 2.2 helps to ignore first character, otherwise -n won't have any effect here
$ printf 'car,(20)\njeep,[10]\ntruck,(5)\nbus,[3]' | sort -t, -k2.2,2n
bus,[3]
truck,(5)
jeep,[10]
car,(20)
# first character of the second column is the primary key
# entire line acts as the last resort tie breaker
$ printf 'car,(20)\njeep,[10]\ntruck,(5)\nbus,[3]' | sort -t, -k2.1,2.1
car,(20)
truck,(5)
bus,[3]
jeep,[10]
The default separation based on blank characters works differently. The empty string between non-blank and blank characters is considered as the separator and thus the blanks are also part of the field contents. You can use the -b option to ignore such leading blanks of field contents.
# the second column here starts with blank characters
# adjusting the character position isn't feasible due to varying blanks
$ printf 'car   (20)\njeep  [10]\ntruck (5)\nbus   [3]' | sort -k2.2,2n
bus   [3]
car   (20)
jeep  [10]
truck (5)
# use -b in such cases to ignore the leading blanks
$ printf 'car   (20)\njeep  [10]\ntruck (5)\nbus   [3]' | sort -k2.2b,2n
bus   [3]
truck (5)
jeep  [10]
car   (20)
Debugging
The --debug option can help you identify issues if the output isn't what you expected. Here's the previously seen -b example, now with --debug enabled. The underscores in the debug output show which portions of the input are used as primary key, secondary key and so on. The collating order being used is also shown in the output.
$ printf 'car (20)\njeep [10]\ntruck (5)\nbus [3]' | sort -k2.2,2n --debug
sort: text ordering performed using ‘en_IN’ sorting rules
sort: leading blanks are significant in key 1; consider also specifying 'b'
sort: note numbers use ‘.’ as a decimal point in this locale
bus [3]
    ^ no match for key
_______
car (20)
    ^ no match for key
________
jeep [10]
     ^ no match for key
_________
truck (5)
      ^ no match for key
_________
$ printf 'car (20)\njeep [10]\ntruck (5)\nbus [3]' | sort -k2.2b,2n --debug
sort: text ordering performed using ‘en_IN’ sorting rules
sort: note numbers use ‘.’ as a decimal point in this locale
bus [3]
     _
_______
truck (5)
       _
_________
jeep [10]
      __
_________
car (20)
     __
________
Check if sorted
The -c option helps you spot the first unsorted entry in the given input. The uppercase -C option is similar but only affects the exit status. Note that these options will not work for multiple inputs.
$ cat shopping.txt
apple   50
toys    5
Pizza   2
mango   25
Banana  10
$ sort -c shopping.txt
sort: shopping.txt:3: disorder: Pizza   2
$ echo $?
1
$ sort -C shopping.txt
$ echo $?
1
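Since -C produces no output, it fits naturally in shell conditionals. Here's a minimal sketch with made-up input passed via printf. The messages are only for illustration.

```shell
# the exit status is 0 when the input is already sorted
if printf '1\n2\n10\n' | sort -nC; then
    echo 'sorted numerically'
fi

# the same input is not sorted lexicographically, since '10' comes before '2'
if ! printf '1\n2\n10\n' | LC_ALL=C sort -C; then
    echo 'not sorted lexicographically'
fi
```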
Specifying output file
The -o option can be used to specify the output file to be used for saving the results.
$ sort -R nums.txt -o rand_nums.txt
$ cat rand_nums.txt
1000
3.14
42
You can use -o for in-place editing as well, but the documentation gives this warning:
However, it is often safer to output to an otherwise-unused file, as data may be lost if the system crashes or sort encounters an I/O or other serious error while a file is being sorted in place. Also, sort with --merge (-m) can open the output file before reading all input, so a command like cat F | sort -m -o F - G is not safe as sort might start writing F before cat is done reading it.
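Here's a sketch of in-place sorting with -o, using a temporary file created by mktemp (the file contents are made up for illustration). Redirecting the output to the same file with > isn't an alternative, because the shell truncates the file before sort gets to read it.

```shell
tmp=$(mktemp)
printf '3\n1\n2\n' > "$tmp"

# without -m, sort reads all the input before opening the output file
sort -n -o "$tmp" "$tmp"
cat "$tmp"
# 1
# 2
# 3

rm -f "$tmp"
```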
Merge sort
The -m option is useful if you have one or more sorted input files and need a single sorted output file. Typically the use case is that you want to add newly obtained data to existing sorted data. In such cases, you can sort only the new data separately and then combine all the sorted inputs using the -m option. Here's a sample timing comparison between different combinations of sorted/unsorted inputs.
$ shuf -n1000000 -i1-999999999999 > n1.txt
$ shuf -n1000000 -i1-999999999999 > n2.txt
$ sort -n n1.txt > n1_sorted.txt
$ sort -n n2.txt > n2_sorted.txt
$ time sort -n n1.txt n2.txt > op1.txt
real    0m1.010s
$ time sort -mn n1_sorted.txt <(sort -n n2.txt) > op2.txt
real    0m0.535s
$ time sort -mn n1_sorted.txt n2_sorted.txt > op3.txt
real    0m0.218s
$ diff -sq op1.txt op2.txt
Files op1.txt and op2.txt are identical
$ diff -sq op1.txt op3.txt
Files op1.txt and op3.txt are identical
$ rm n{1,2}{,_sorted}.txt op{1..3}.txt
You might wonder if you can improve the performance of sorting a single large file using the -m option. By default, sort already uses the available processors to split the input and merge. You can use the --parallel option to customize this behavior.
NUL separator
Use the -z option if you want to use the NUL character as the line separator. In this scenario, sort will add a final NUL character even if it isn't present in the input.
$ printf 'cherry\0apple\0banana' | sort -z | cat -v
apple^@banana^@cherry^@
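NUL separated records are handy when the items themselves can contain newline characters, which is possible with file names (find -print0 is a typical source of such input). Here's a minimal sketch with made-up records, where tr converts NUL to : only for display purposes.

```shell
# 'b\nc' is a single record containing a newline character
printf 'b\nc\0a\0' | LC_ALL=C sort -z | tr '\0' ':'
# a:b
# c:
```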
Further Reading
A few options like --compress-program and --files0-from aren't covered in this book. See the sort manual for details and examples. See also:
- unix.stackexchange: Scalability of sort for gigantic files
- stackoverflow: Sort by last field when the number of fields varies
- Arch wiki: locale
- ShellHacks: locale and language settings
Exercises
The exercises directory has all the files used in this section.
1) Default sort doesn't work for numbers. Which option would you use to get the expected output shown below?
$ printf '100\n10\n20\n3000\n2.45\n' | sort ##### add your solution here
2.45
10
20
100
3000
2) Which sort option will help you ignore case? LC_ALL=C is used here to avoid differences due to locale.
$ printf 'Super\nover\nRUNE\ntea\n' | LC_ALL=C sort ##### add your solution here
over
RUNE
Super
tea
3) The -n option doesn't work for all sorts of numbers. Which sort option would you use to get the expected output shown below?
# wrong output
$ printf '+120\n-1.53\n3.14e+4\n42.1e-2' | sort -n
-1.53
+120
3.14e+4
42.1e-2
# expected output
$ printf '+120\n-1.53\n3.14e+4\n42.1e-2' | sort ##### add your solution here
-1.53
42.1e-2
+120
3.14e+4
4) What do the -V and -h options do?
5) Is there a difference between shuf and sort -R?
6) Sort the scores.csv file numerically in ascending order using the contents of the second field. Header line should be preserved as the first line as shown below.
$ cat scores.csv
Name,Maths,Physics,Chemistry
Ith,100,100,100
Cy,97,98,95
Lin,78,83,80
##### add your solution here
Name,Maths,Physics,Chemistry
Lin,78,83,80
Cy,97,98,95
Ith,100,100,100
7) Sort the contents of duplicates.csv by the fourth column numbers in descending order. Retain only the first copy of lines with the same number.
$ cat duplicates.csv
brown,toy,bread,42
dark red,ruby,rose,111
blue,ruby,water,333
dark red,sky,rose,555
yellow,toy,flower,333
white,sky,bread,111
light red,purse,rose,333
##### add your solution here
dark red,sky,rose,555
blue,ruby,water,333
dark red,ruby,rose,111
brown,toy,bread,42
8) Sort the contents of duplicates.csv by the third column item. Use the fourth column numbers as the tie-breaker.
##### add your solution here
brown,toy,bread,42
white,sky,bread,111
yellow,toy,flower,333
dark red,ruby,rose,111
light red,purse,rose,333
dark red,sky,rose,555
blue,ruby,water,333
9) What does the -s option provide?
10) Sort the given input based on the numbers inside the brackets.
$ printf '(-3.14)\n[45]\n(12.5)\n{14093}' | ##### add your solution here
(-3.14)
(12.5)
[45]
{14093}
11) What do the -c, -C and -m options do?