CLI computation with GNU datamash
I'm hoping this post will serve as a quick reference for some of the use cases and tickle your curiosity if you haven't come across this nifty CLI text processing tool yet. There are also links for further reading at the end.
Installation and Documentation🔗
See download page for source code and instructions to install the software on various platforms. This blog post is based on the 1.8 version.
See datamash manual for links to documentation in HTML, plain text, PDF, etc.
Sum🔗
# file with a single number per line
$ cat nums.txt
42
-2
10101
-3.14
-75
$ datamash sum 1 <nums.txt
10062.86
$ echo '3.14 42 1000 -51' | tr ' ' '\n' | datamash sum 1
994.14
# summing a particular column
# tab is the default field separator
$ cat table.txt
brown bread mat hair 42
blue cake mug shirt -7
yellow banana window shoes 3.14
$ datamash -t' ' sum 5 <table.txt
38.14
Other such operations include count
, min
, max
, mean
, median
, sstdev
(standard deviation), etc.
Transpose🔗
$ cat scores.csv
Name,Maths,Physics,Chemistry
Ith,100,100,100
Cy,97,98,95
Lin,78,83,80
Er,60,70,90
$ datamash -t, transpose <scores.csv
Name,Ith,Cy,Lin,Er
Maths,100,97,78,60
Physics,100,98,83,70
Chemistry,100,95,80,90
Group by🔗
You can use the -g
option to group items based on one or more columns. You can specify an operation such as collapse
, sum
, mean
, count
and so on. See Grouping rows by categories avoiding repetition for an example with unique
.
# here, the first column items are already next to each other
# so, sorting is not needed
$ cat toys.txt
car blue
car red
car yellow
truck brown
bus green
bus maroon
rocket white
# by default a comma is used as the separator between collapsed items
# use 'unique' instead of 'collapse' to avoid duplicates
$ datamash -t' ' -g1 collapse 2 <toys.txt
car blue,red,yellow
truck brown
bus green,maroon
rocket white
# 'count' gives the number of items for the collapsed row
# 'rand' selects a random item for such collapsed rows
# 'first' and 'last' are other choices available
$ datamash -t' ' -g1 count 2 rand 2 <toys.txt
car 3 red
truck 1 brown
bus 2 green
rocket 1 white
Here's an example with header lines as well as having to sort the input (-s
). The -c
option helps to customize the separator for the grouped items. The -H
option is equivalent to using both --header-in
and --header-out
.
$ cat books.txt
Author,Title
Will Wight,Cradle
John Bierce,Mage Errant
Brandon Sanderson,Mistborn
Domagoj Kurmaic,Mother of Learning
Brandon Sanderson,The Stormlight Archive
Will Wight,The Last Horizon
Brandon Sanderson,Warbreaker
# not sure if there's an option to retain the original header line as is
# you can instead use: (sed -u 1q; datamash -st, -c: -g1 collapse 2) <books.txt
# use --header-in if you don't want the header line in the output
$ datamash -H -st, -c: -g1 collapse 2 <books.txt
GroupBy(Author),collapse(Title)
Brandon Sanderson,Mistborn:The Stormlight Archive:Warbreaker
Domagoj Kurmaic,Mother of Learning
John Bierce,Mage Errant
Will Wight,Cradle:The Last Horizon
Here's an example of summing values based on column 3 items:
$ cat duplicates.csv
brown,toy,bread,42
dark red,ruby,rose,111
blue,ruby,water,333
dark red,sky,rose,555
yellow,toy,flower,333
white,sky,bread,111
light red,purse,rose,333
$ datamash -st, -g3 sum 4 <duplicates.csv
bread,153
flower,333
rose,999
water,333
Average marks:
$ cat result.csv
Amy,maths,90
Amy,physics,75
Joe,maths,79
John,chemistry,77
John,physics,91
Moe,maths,81
Ravi,physics,84
Ravi,chemistry,70
Yui,maths,92
$ datamash -t, -g1 mean 3 <result.csv
Amy,82.5
Joe,79
John,84
Moe,81
Ravi,77
Yui,92
Further Reading🔗
- Alternative one-liners — examples with datamash compared to
awk
,perl
, etc - Documentation examples
- xsv — fast CSV command line toolkit