Warning: this is a work-in-progress draft version.



hck

From github: hck:

hck is a shortening of hack, a rougher form of cut.

A close to drop in replacement for cut that can use a regex delimiter instead of a fixed string.

No single feature of hck on its own makes it stand out over awk, cut, xsv or other such tools. Where hck excels is making common things easy, such as reordering output fields, or splitting records on a weird delimiter. It is meant to be simple and easy to use while exploring datasets.

Installation

See hck: install for installation details.

Field separators

By default, the input field separator option -d uses the regex \s+ to split the data. The default value for the output field separator -D is the tab character.

$ printf 'apple ball\t \r\v\fcat   dog' | hck -f2,4
ball    dog

# output field order is same as the order specified by -f
$ printf 'apple ball\t \r\v\fcat   dog' | hck -f3,1
cat     apple

Leading and trailing whitespace results in empty fields at the start and end of the record. If no field selection is specified, all the fields are printed.

$ printf '    fig   toy   net   ' | hck -D, -f2,1,3
fig,,toy

$ printf '    fig   toy   net   ' | hck -D,
,fig,toy,net,
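
The splitting rule described above can be modelled in a few lines of Python. This is only a sketch of the observed behavior, not hck's implementation: the names `split_fields` and `select` are made up for illustration, and silently skipping out-of-range field numbers is an assumption.

```python
import re

def split_fields(line, delim=r"\s+"):
    """Split a line on a regex delimiter; edge delimiters yield empty fields."""
    return re.split(delim, line)

def select(line, fields, delim=r"\s+", out_sep="\t"):
    """Join the requested 1-based fields, in the order they were requested."""
    parts = split_fields(line, delim)
    # Assumption: field numbers beyond the last field are skipped.
    return out_sep.join(parts[i - 1] for i in fields if i <= len(parts))

print(split_fields("    fig   toy   net   "))
print(select("    fig   toy   net   ", [2, 1, 3], out_sep=","))  # fig,,toy
```

The leading and trailing runs of spaces each match the delimiter once, which is why the first and last fields come out empty.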

Here are some examples of using custom field separators.

$ echo 'load;err_msg--\ant,r2..not' | hck -d'\W+' -D,
load,err_msg,ant,r2,not

$ echo 'Sample123string42with777numbers' | hck -d'\d+' -f1,4 -D,
Sample,numbers

$ echo 'apple:-:ball:-:cat' | hck -d:-: -f1,3 -D' : '
apple : cat

A particular field is displayed only once in the output, even if it is specified multiple times.

$ echo 'a,b,c,d,e' | hck -d, -f3,3,1,2,3,2,1 -D,
c,a,b
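
As a rough illustration of the "once only" rule, the Python sketch below keeps the first mention of each field number and ignores later repeats. The function name `select_once` is hypothetical, and the logic is inferred from the output shown above rather than taken from hck's source.

```python
import re

def select_once(line, fields, delim, out_sep):
    """Select 1-based fields in first-mention order, dropping repeats."""
    parts = re.split(delim, line)
    seen = set()
    picked = []
    for i in fields:
        if i in seen or not 1 <= i <= len(parts):
            continue  # repeated or out-of-range field numbers are ignored
        seen.add(i)
        picked.append(parts[i - 1])
    return out_sep.join(picked)

print(select_once("a,b,c,d,e", [3, 3, 1, 2, 3, 2, 1], ",", ","))  # c,a,b
print(select_once("Sample123string42with777numbers", [1, 4], r"\d+", ","))
```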

Literal field separator

Add the -L option to treat the argument passed to the -d option as a fixed string instead of a regex. As per the documentation, this can also result in a significant speedup.

# same as: hck -d'\\' -f1
$ echo 'apple\ball' | hck -Ld'\' -f1
apple

$ echo '123)(%)*#^&(*@#.[](\\){1}\xyz' | hck -Ld')(%)*#^&(*@#.[](\\){1}\' -f2
xyz
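
The difference between a literal and a regex delimiter can be sketched in Python, where `str.split` takes the delimiter character for character and `re.split` interprets it as a pattern. The helper names are illustrative only, and the shortened delimiter `)(%)*` is a made-up stand-in for the longer one above.

```python
import re

def split_literal(line, delim):
    """Split on the delimiter taken character for character."""
    return line.split(delim)

def split_regex(line, pattern):
    """Split on the delimiter interpreted as a regular expression."""
    return re.split(pattern, line)

line = "apple\\ball"  # the text apple\ball
print(split_literal(line, "\\")[0])           # apple
print(split_regex(line, re.escape("\\"))[0])  # apple

messy = ")(%)*"
print(split_literal("123" + messy + "xyz", messy))  # ['123', 'xyz']
try:
    re.compile(messy)  # the same text is not even a valid regex
except re.error as exc:
    print("not a valid regex:", exc)
```

Escaping every metacharacter by hand gives the same result as the literal split, but the literal form is easier to read and avoids regex matching altogether.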

Field ranges

A range of fields can be specified with the start and end field numbers separated by a - character. You'll get an error if the range is in descending order.

$ printf '1 2 3 4 5\na b c d e\n' | hck -f1-3 -D,
1,2,3
a,b,c

# multiple ranges can be specified
# as mentioned before, a particular field can only be printed once
$ printf '1 2 3 4 5\na b c d e\n' | hck -f2-4,1,3-5 -D,
2,3,4,1,5
b,c,d,a,e

The beginning or ending field of a range can be omitted. They default to the first and last fields respectively.

# up to first four fields
$ printf 'apple ball cat\na b c d e\n' | hck -f-4 -D,
apple,ball,cat
a,b,c,d

# all fields from the second field
$ printf 'apple ball cat\na b c d e\n' | hck -f2- -D,
ball,cat
b,c,d,e

$ printf 'apple ball cat\na b c d e\n' | hck -D,
apple,ball,cat
a,b,c,d,e
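
To make the range rules concrete, here is a small Python sketch that expands a field specification into field numbers for a record with a known number of fields. `expand_fields` is a hypothetical helper; it mirrors the behavior shown in the examples above (open-ended ranges, one appearance per field, error on descending ranges) and is not taken from hck itself.

```python
def expand_fields(spec, n_fields):
    """Expand a spec such as '2-4,1,3-5' into 1-based field numbers."""
    order = []
    for part in spec.split(","):
        if "-" in part:
            start, end = part.split("-", 1)
            lo = int(start) if start else 1        # missing start: first field
            hi = int(end) if end else n_fields     # missing end: last field
            if start and end and lo > hi:
                raise ValueError(f"descending range: {part}")
            numbers = range(lo, hi + 1)
        else:
            numbers = [int(part)]
        for n in numbers:
            # Skip fields the record does not have and fields already chosen.
            if n <= n_fields and n not in order:
                order.append(n)
    return order

print(expand_fields("2-4,1,3-5", 5))  # [2, 3, 4, 1, 5]
print(expand_fields("-4", 3))         # [1, 2, 3]
print(expand_fields("2-", 5))         # [2, 3, 4, 5]
```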

Header based field selection

You can pass a literal header name to the -F option to select a column based on its name.

$ cat scores.csv
Name,Maths,Physics,Chemistry
Ith,100,100,100
Cy,97,98,95
Lin,78,83,80

# you can also use: hck -d, -FMaths scores.csv
$ hck -d, -F 'Maths' scores.csv
Maths
100
97
78

# order of -F usage determines the output order as well
$ hck -d, -D: -F 'Chemistry' -F 'Maths' scores.csv
Chemistry:Maths
100:100
95:97
80:78

You can add the -r option to select headers based on a regex.

$ hck -d, -D: -rF '^[NP]' scores.csv
Name:Physics
Ith:100
Cy:98
Lin:83

If a given header selection doesn't match, you'll get an error.

$ hck -d, -F 'English' scores.csv
[2021-07-16T08:33:40Z ERROR hck] No headers matched

$ hck -d, -D: -rF '^[NP]z' -F 'at' scores.csv
[2021-07-16T08:32:38Z ERROR hck] Header not found: ^[NP]z

You can use both -f and -F options if you wish. As mentioned before, a particular field can only be printed once in the output.

$ hck -d, -F 'Name' -f3- -D: scores.csv
Name:Physics:Chemistry
Ith:100:100
Cy:98:95
Lin:83:80
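
Header-based selection amounts to mapping names or patterns to column positions using the first line of the input. The Python sketch below illustrates that idea; `match_headers` is a hypothetical name, the error text is invented, and treating a literal name as an exact, whole-header match is an assumption rather than documented hck behavior.

```python
import re

def match_headers(headers, patterns, use_regex=False):
    """Return 1-based column positions for each pattern, in pattern order."""
    picked = []
    for pat in patterns:
        hits = [
            i for i, name in enumerate(headers, start=1)
            # Assumption: literal names must equal the header exactly.
            if (re.search(pat, name) if use_regex else name == pat)
        ]
        if not hits:
            raise LookupError(f"no header matches {pat!r}")
        picked.extend(i for i in hits if i not in picked)
    return picked

headers = "Name,Maths,Physics,Chemistry".split(",")
print(match_headers(headers, ["Chemistry", "Maths"]))      # [4, 2]
print(match_headers(headers, ["^[NP]"], use_regex=True))   # [1, 3]
```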

Exclude fields

The -e and -E options can be used to exclude fields based on field numbers and header names respectively. You can continue to use the -f, -F, -L and -r options as needed.

# except second field
$ printf 'apple ball cat\n1 2 3 4 5' | hck -e2 -D:
apple:cat
1:3:4:5

# except first and third fields
$ printf 'apple ball cat\n1 2 3 4 5' | hck -e1,3 -D:
ball
2:4:5

# except first and third fields, but only among the fields specified by -f
$ printf 'apple ball cat\n1 2 3 4 5' | hck -e1,3 -D: -f2-4
ball
2:4

# except fields whose header ends with the 's' character
$ hck -d, -rE 's$' -D: scores.csv
Name:Chemistry
Ith:100
Cy:95
Lin:80
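
Exclusion by field number can be thought of as a filter applied on top of whatever selection is in effect. The following Python sketch models that reading of the examples above; `exclude` is a made-up helper and the precedence shown is inferred from the sample output, not from hck's source.

```python
def exclude(parts, excluded, selected=None):
    """Drop excluded 1-based fields from the selection (default: all fields)."""
    keep = selected if selected is not None else range(1, len(parts) + 1)
    return [parts[i - 1] for i in keep
            if i not in excluded and i <= len(parts)]

row = "1 2 3 4 5".split()
print(":".join(exclude(row, {2})))                    # 1:3:4:5
print(":".join(exclude(row, {1, 3})))                 # 2:4:5
print(":".join(exclude(row, {1, 3}, range(2, 5))))    # 2:4
```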

Mixing -f and -F selections

You can use a mix of both -f and -F options for field selections. The -f option can be used only once, whereas -F can be used multiple times. When both are present, whichever of the two refers to the lower field number is displayed first in the output. Here are some examples to understand this priority better:

# -f2 comes before Chemistry (4th field)
$ hck -d, -D, -f2 -F 'Chemistry' scores.csv
Maths,Chemistry
100,100
97,95
78,80

# Name (1st field) comes before -f3
$ hck -d, -D, -f3 -F 'Name' scores.csv
Name,Physics
Ith,100
Cy,98
Lin,83

# Name (1st field) comes before -f3
# Chemistry (4th field) comes after -f3
$ hck -d, -D, -f3 -F 'Name' -F 'Chemistry' scores.csv
Name,Physics,Chemistry
Ith,100,100
Cy,98,95
Lin,83,80

-f can have multiple fields, but only the first field passed to -f is considered for the comparison. Similarly, if there are multiple -F options, only the first -F will be considered.

# Maths (2nd field) comes before -f3
$ hck -d, -D, -f3,1 -F 'Maths' scores.csv
Maths,Physics,Name
100,100,Ith
97,98,Cy
78,83,Lin

Processing compressed input

You can use the -z option to work with compressed input files. The decompression method is chosen based on the filename extension.

$ xz scores.csv

$ hck -d, -f2 -z scores.csv.xz
Maths
100
97
78

The -z option is especially useful if you have multiple input files, and they can even be compressed differently. See the hck: Decompression section for the complete list of supported extensions and the command used to decompress each of them.
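
Choosing a decompressor from the filename extension can be sketched as a simple lookup table. Note the difference from hck, which according to its documentation runs external commands: the Python sketch below uses the standard library's `gzip`, `bz2` and `lzma` modules instead, and the table `OPENERS` and the function `open_text` are names invented for this illustration.

```python
import bz2
import gzip
import lzma
import os
import tempfile
from pathlib import Path

# Extension-to-opener table; anything else is read as plain text.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}

def open_text(path):
    """Open a possibly compressed file for reading as text."""
    opener = OPENERS.get(Path(path).suffix, open)
    return opener(path, "rt", encoding="utf-8")

# Demo on a throwaway file so the sketch runs without scores.csv.xz.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.csv.xz")
    with lzma.open(sample, "wt", encoding="utf-8") as f:
        f.write("Name,Maths\nIth,100\n")
    with open_text(sample) as f:
        for line in f:
            print(line.rstrip("\n").split(",")[1])
```

Because the lookup happens per file, differently compressed inputs can be mixed in a single run.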

Specifying output file

You can use the -o option to specify a file for the output instead of stdout. Don't use the same name as the input file, since that will result in an empty output file.

$ hck -d, -f2 scores.csv -o op.txt
$ cat op.txt
Maths
100
97
78

The -o option would become more useful when saving compressed output based on filename extension gets implemented.

DOS style line endings

If the input uses \r\n as line endings, you can use the --crlf option. The output will then use DOS style line endings as well.

# since \n is the default line separator,
# last field retains the \r character
$ printf 'a,b,c\r\n1,2,3\r\n' | hck -d, -f3,2,1 -D, | cat -v
c^M,b,a
3^M,2,1

# with --crlf the last field no longer has \r
# output line ending will be \r\n
$ printf 'a,b,c\r\n1,2,3\r\n' | hck --crlf -d, -f3,2,1 -D, | cat -v
c,b,a^M
3,2,1^M
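
The effect of the line separator on the last field can be reproduced with a short Python sketch. `reorder` is a hypothetical function that models the two cases above: splitting records on \n leaves the \r attached to the last field, while splitting on \r\n removes it and writes it back at the end of each output line.

```python
def reorder(data, fields, sep=",", crlf=False):
    """Reorder 1-based fields of each record, using \\n or \\r\\n as line ending."""
    eol = "\r\n" if crlf else "\n"
    lines = data.split(eol)
    if lines and lines[-1] == "":
        lines.pop()  # drop the empty piece after the final line ending
    out = []
    for line in lines:
        parts = line.split(sep)
        out.append(sep.join(parts[i - 1] for i in fields) + eol)
    return "".join(out)

data = "a,b,c\r\n1,2,3\r\n"
print(repr(reorder(data, [3, 2, 1])))             # 'c\r,b,a\n3\r,2,1\n'
print(repr(reorder(data, [3, 2, 1], crlf=True)))  # 'c,b,a\r\n3,2,1\r\n'
```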