Assorted Text Processing Tools
There are way too many specialized text processing tools. This chapter will discuss some of the commands that haven't been covered in the previous chapters.
The example_files directory has the sample input files used in this chapter.
seq
The seq
command is a handy tool to generate a sequence of numbers in ascending or descending order. Both integer and floating-point numbers are supported. You can also customize the formatting for numbers and the separator between them.
You need three numbers to generate an arithmetic progression: start, step and stop. When you pass only a single number, it is used as the stop value, and both start and step default to 1. When you pass two numbers, they are used as the start and stop values (in that order).
# start=1, step=1 and stop=3
$ seq 3
1
2
3
# start=25434, step=1 and stop=25437
$ seq 25434 25437
25434
25435
25436
25437
# start=-5, step=1 and stop=-3
$ seq -5 -3
-5
-4
-3
# start=0.25, step=0.33 and stop=1.12
$ seq 0.25 0.33 1.12
0.25
0.58
0.91
By using a negative step value, you can generate sequences in descending order.
$ seq 3 -1 1
3
2
1
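The sign of the step has to agree with the direction from start to stop. As a small sketch, assuming GNU seq (count_lines is only an illustrative helper name, not something from this chapter), nothing is printed when the step can never reach the stop value:

```shell
# hypothetical helper: count how many numbers a seq invocation produces
count_lines() {
    seq "$@" | wc -l
}

# the default step is 1, so counting from 3 to 1 produces no output
count_lines 3 1        # 0

# a negative step gives the descending sequence
seq 10 -5 0
# 10
# 5
# 0
```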
You can use the -s
option to change the separator between the numbers of a sequence. A single newline character is always added after the final number, regardless of the separator.
$ seq -s' - ' 4
1 - 2 - 3 - 4
$ seq -s: 1.2e2 0.752 1.22e2
120.000:120.752:121.504
The -w
option will equalize the width of the output numbers using leading zeros. The largest width between the start and stop values will be used.
$ seq -w 8 10
08
09
10
$ seq -w 0003
0001
0002
0003
You can use the -f
option for printf
style floating-point number formatting.
$ seq -f'%g' -s: 1 0.75 3
1:1.75:2.5
$ seq -f'%.4f' -s: 1 0.75 3
1.0000:1.7500:2.5000
$ seq -f'%.3e' 1.2e2 0.752 1.22e2
1.200e+02
1.208e+02
1.215e+02
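The format string may also contain literal text around the single floating-point directive, and printf-style flags such as zero padding are accepted. A minimal sketch, assuming GNU seq (the file_ prefix and .txt suffix are made up for illustration):

```shell
# literal text around %g is copied as-is for every number
seq -f 'file_%g.txt' 3
# file_1.txt
# file_2.txt
# file_3.txt

# zero padding to a width of three via the printf-style '0' flag
seq -f '%03g' 8 10
# 008
# 009
# 010
```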
shuf
By default, shuf
will randomize the order of input lines. You can use the -n
option to limit the number of output lines.
$ printf 'apple\nbanana\ncherry\nfig\nmango' | shuf
banana
cherry
mango
apple
fig
$ printf 'apple\nbanana\ncherry\nfig\nmango' | shuf -n2
mango
cherry
You can use the -e
option to specify multiple input lines as arguments to the command. The -r
option helps if you want to allow input lines to be repeated. This option is usually paired with -n
to limit the number of lines in the output.
$ shuf -n4 -r -e brown green blue
green
brown
blue
green
The -i
option treats each integer in the given range as an input line, which helps you generate random non-negative integers.
$ shuf -n3 -i 100-200
170
112
148
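Since the output differs on every run, it can help to check properties instead of exact values. The sketch below (variable names are illustrative) assumes GNU shuf and verifies two things: shuffling only rearranges lines, and numbers picked with -i without -r are distinct.

```shell
# shuf only rearranges lines, so the sorted output matches the sorted input
input=$(printf 'apple\nbanana\ncherry\nfig\nmango\n')
shuffled=$(printf '%s\n' "$input" | shuf)
[ "$(printf '%s\n' "$shuffled" | sort)" = "$(printf '%s\n' "$input" | sort)" ] &&
    echo 'same set of lines'

# without -r, the three numbers picked from the range are all different
picks=$(shuf -n3 -i 100-200)
printf '%s\n' "$picks" | sort -u | wc -l    # 3
```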
cut
cut
is a handy tool for many field processing use cases. The features are limited compared to the awk
and perl
commands, but the reduced scope also leads to faster processing.
By default, cut
splits the input content into fields based on the tab character, which you can change using the -d
option. The -f
option allows you to select a desired field from each input line. To extract multiple fields, specify the selections separated by the comma character. By default, lines not containing the input delimiter will still be part of the output. You can use the -s
option to suppress such lines.
# second field
$ printf 'apple\tbanana\tcherry\n' | cut -f2
banana
# first and third fields
$ printf 'apple\tbanana\tcherry\n' | cut -f1,3
apple cherry
# setting -d automatically changes the output delimiter as well
$ echo 'one;two;three;four;five' | cut -d';' -f2,5
two;five
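The effect of the -s option mentioned above is easiest to see with an input line that has no delimiter at all. A short sketch with made-up sample text:

```shell
# the second line has no ';' so cut passes it through unchanged
printf 'one;two\nno delimiter here\n' | cut -d';' -f1
# one
# no delimiter here

# -s suppresses lines that don't contain the delimiter
printf 'one;two\nno delimiter here\n' | cut -d';' -f1 -s
# one
```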
You can use the -
character to specify field ranges. The starting or ending field number can be skipped, but not both.
# 2nd, 3rd and 4th fields
$ printf 'apple\tbanana\tcherry\tdates\n' | cut -f2-4
banana cherry dates
# all fields from the start till the 3rd field
$ printf 'apple\tbanana\tcherry\tdates\n' | cut -f-3
apple banana cherry
# 1st field and all fields from the 3rd field till the end
$ printf 'apple\tbanana\tcherry\tdates\n' | cut -f1,3-
apple cherry dates
Use the --output-delimiter
option to customize the output separator to any string of your choice.
# same as: tr '\t' ','
$ printf 'apple\tbanana\tcherry\n' | cut --output-delimiter=, -f1-
apple,banana,cherry
# multicharacter example
$ echo 'one;two;three;four' | cut -d';' --output-delimiter=' : ' -f1,3-
one : three : four
The --complement
option allows you to invert the field selections.
# except the second field
$ printf 'apple ball cat\n1 2 3 4 5' | cut --complement -d' ' -f2
apple cat
1 3 4 5
# except the first and third fields
$ printf 'apple ball cat\n1 2 3 4 5' | cut --complement -d' ' -f1,3
ball
2 4 5
You can use the -b
or -c
options to select specified bytes from each input line. The syntax is the same as for the -f
option. The -c
option is intended for multibyte character selection, but for now it works exactly as the -b
option.
$ printf 'apple\tbanana\tcherry\n' | cut -c2,8,11
pan
$ printf 'apple\tbanana\tcherry\n' | cut -c2,8,11 --output-delimiter=-
p-a-n
$ printf 'apple\tbanana\tcherry\n' | cut --complement -c13-
apple banana
$ printf 'cat-bat\ndog:fog' | cut -c5-
bat
fog
column
The column
command is a nifty tool to align the input data column wise. By default, whitespace is used as the input delimiter. Space character is used to align the output columns, so whitespace characters like tab will get converted to spaces.
$ printf 'one two three\nfour five six\nseven eight nine\n'
one two three
four five six
seven eight nine
$ printf 'one two three\nfour five six\nseven eight nine\n' | column -t
one    two    three
four   five   six
seven  eight  nine
You can use the -s
option to customize the input delimiter. Note that the output delimiter will still be made up of spaces only.
$ cat scores.csv
Name,Maths,Physics,Chemistry
Ith,100,100,100
Cy,97,98,95
Lin,78,83,80
$ column -s, -t scores.csv
Name  Maths  Physics  Chemistry
Ith   100    100      100
Cy    97     98       95
Lin   78     83       80
$ printf '1:-:2:-:3\napple:-:banana:-:cherry\n' | column -s:-: -t
1      2       3
apple  banana  cherry
Input should have a newline at the end, otherwise you'll get an error:
$ printf '1 2 3\na b c' | column -t
column: line too long
1  2  3
tr
tr
helps you map one set of characters to another. Features like ranges, repeats, character sets, squeeze and complement make it a must-know text processing tool.
tr
works only on stdin
data, so you'll need to use shell input redirection for file input. Here are some basic examples:
# 'l' maps to '1', 'e' to '3', 't' to '7' and 's' to '5'
$ echo 'leet speak' | tr 'lets' '1375'
1337 5p3ak
# example with shell metacharacters
$ echo 'apple;banana;cherry' | tr ';' ':'
apple:banana:cherry
# swap case
$ echo 'Hello World' | tr 'a-zA-Z' 'A-Za-z'
hELLO wORLD
$ tr 'a-z' 'A-Z' <greeting.txt
HI THERE
HAVE A NICE DAY
You can use the -d
option to specify a set of characters to be deleted. The -c
option will invert the first set of characters. Here are some examples:
$ echo '2021-08-12' | tr -d '-'
20210812
$ s='"Hi", there! How *are* you? All fine here.'
$ echo "$s" | tr -d '[:punct:]'
Hi there How are you All fine here
# retain alphabets, whitespaces, period, exclamation and question mark
$ echo "$s" | tr -cd 'a-zA-Z.!?[:space:]'
Hi there! How are you? All fine here.
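Combining -c and -d is a common way to keep only a small set of characters. Here is a minimal sketch with a made-up input string; note that the newline is part of the retained set so that the output line stays terminated:

```shell
# delete everything except digits and the newline character
echo 'tel: +1 (555) 010-9999' | tr -cd '0-9\n'
# 15550109999

# same result using a character class instead of a range
echo 'tel: +1 (555) 010-9999' | tr -cd '[:digit:]\n'
# 15550109999
```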
The -s
option changes consecutive repeated characters to a single copy of that character.
# squeeze lowercase alphabets
$ echo 'HELLO... hhoowwww aaaaaareeeeee yyouuuu!!' | tr -s 'a-z'
HELLO... how are you!!
# translate and squeeze
$ echo 'hhoowwww aaaaaareeeeee yyouuuu!!' | tr -s 'a-z' 'A-Z'
HOW ARE YOU!!
# delete and squeeze
$ echo 'hhoowwww aaaaaareeeeee yyouuuu!!' | tr -sd '!' 'a-z'
how are you
# squeeze other than lowercase alphabets
$ echo 'apple noon banana!!!!!' | tr -cs 'a-z'
apple noon banana!
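Squeezing is also handy for normalizing whitespace. The sketch below uses a made-up input line: the first command collapses runs of spaces, and the second translates spaces to newlines before squeezing, which gives one word per line.

```shell
# squeeze runs of spaces down to a single space
echo 'too    many     spaces' | tr -s ' '
# too many spaces

# translate spaces to newlines, then squeeze the repeated newlines
echo 'too    many     spaces' | tr -s ' ' '\n'
# too
# many
# spaces
```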
paste
paste
is typically used to merge two or more files column wise. It also has a handy feature for serializing data. By default, paste
adds a tab character between the corresponding lines of input files.
$ cat colors_1.txt
Blue
Brown
Orange
Purple
$ cat colors_2.txt
Black
Blue
Green
Orange
$ paste colors_1.txt colors_2.txt
Blue Black
Brown Blue
Orange Green
Purple Orange
You can use the -d
option to change the delimiter between the columns. The separator is added even if the data has been exhausted for some of the input files.
$ paste -d'|' <(seq 3) <(seq 4 5) <(seq 6 8)
1|4|6
2|5|7
3||8
# note that the space between -d and the empty string is necessary here
$ paste -d '' <(seq 3) <(seq 6 8)
16
27
38
$ paste -d'\n' <(seq 11 12) <(seq 101 102)
11
101
12
102
You can use empty files to get multicharacter separation between the columns.
$ paste -d' : ' <(seq 3) /dev/null /dev/null <(seq 4 6)
1 : 4
2 : 5
3 : 6
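The empty file trick works because -d accepts a list of characters that are used one after another between the columns, starting over when the list runs out. A small sketch of that behavior using stdin columns:

```shell
# ',' goes between columns 1 and 2, '-' between columns 2 and 3
seq 6 | paste -d',-' - - -
# 1,2-3
# 4,5-6

# with four columns, the list is reused from the start for the third gap
seq 8 | paste -d',-' - - - -
# 1,2-3,4
# 5,6-7,8
```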
If you use -
multiple times, paste
will consume a line from stdin
data every time -
is encountered. This is different from using the same filename multiple times, in which case they are treated as separate inputs.
# five columns
$ seq 10 | paste -d: - - - - -
1:2:3:4:5
6:7:8:9:10
# use redirection for file input
$ <colors_1.txt paste -d: - - -
Blue:Brown:Orange
Purple::
The -s
option allows you to combine all the input lines from a file into a single line using the given delimiter. Multiple input files are treated separately. paste
always adds a final newline character, even if the input doesn't end with one.
# <colors_1.txt tr '\n' ',' will give you a trailing comma
$ paste -sd, colors_1.txt
Blue,Brown,Orange,Purple
# multiple file example
$ paste -sd: colors_1.txt colors_2.txt
Blue:Brown:Orange:Purple
Black:Blue:Green:Orange
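One possible use of serializing is building an expression out of a column of numbers. The following sketch (expr_str is just an illustrative variable name) joins the numbers with + and lets shell arithmetic evaluate the result:

```shell
# join the numbers with '+' to build an arithmetic expression
expr_str=$(seq 5 | paste -sd+ -)
echo "$expr_str"            # 1+2+3+4+5

# the expanded text is evaluated by the shell's arithmetic expansion
echo "$(( $expr_str ))"     # 15
```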
pr
Paginate or columnate FILE(s) for printing.
As stated in the above quote from the manual, the pr
command is mainly used for those two tasks. This section will discuss only the columnate features and some miscellaneous tasks. Here's a pagination example if you are interested in exploring further. The pr
command will add blank lines, a header and so on to make it suitable for printing.
$ pr greeting.txt | head -n8
2024-05-17 10:48 greeting.txt Page 1
Hi there
Have a nice day
The --columns
and -a
options can be used to merge the input lines in two different ways:
- split the input file and then merge them as columns
- merge consecutive lines, similar to the paste command
Here's an example to get started. Note that -N
is the same as using --columns=N
where N
is the number of columns you want in the output. The default page width is 72
, which means each column can only have a maximum of 72/N
characters (including the separator). Tab and space characters will be used to fill the columns as needed. You can use the -J
option to prevent pr
from truncating longer columns. The -t
option is used here to turn off the pagination features.
# split input into three parts
# each column width is 72/3 = 24 characters max
$ seq 9 | pr -3t
1 4 7
2 5 8
3 6 9
You can customize the separator using the -s
option. The default is a tab character which you can change to any other string value. The -s
option also turns off line truncation, so the -J
option isn't needed. Use the -a
option to merge consecutive lines, similar to the paste
command example seen earlier.
# tab is the default separator when no argument is passed to the -s option
$ seq 9 | pr -3ts
1 4 7
2 5 8
3 6 9
# multicharacter custom separator example
$ seq 9 | pr -3ats' : '
1 : 2 : 3
4 : 5 : 6
7 : 8 : 9
# unlike paste, pr doesn't add separators if the last row has fewer columns to fill
$ seq 10 | pr -4ats,
1,2,3,4
5,6,7,8
9,10
However, the default page width of 72
can still cause issues, which you can prevent by using the -w
option. The -w
option overrides the effect of the -s
option on line truncation, so use the -J
option as well unless you really need truncation.
$ seq 6 | pr -J -w10 -3ats'::::'
pr: page width too narrow
$ seq 6 | pr -J -w11 -3ats'::::'
1::::2::::3
4::::5::::6
Two or more input files can be merged column wise using the -m
option. As seen before, the -t
option is needed to ignore pagination features and -s
can be used to customize the separator.
# same as: paste -d' : ' <(seq 3) /dev/null /dev/null <(seq 4 6)
$ pr -mts' : ' <(seq 3) <(seq 4 6)
1 : 4
2 : 5
3 : 6
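To contrast the two ways of arranging a single input, here is a short sketch that runs the same data with and without the -a option. It assumes GNU pr, which balances the columns when the input is split into parts.

```shell
# without -a: the input is split into three parts that become the columns
seq 6 | pr -3ts,
# 1,3,5
# 2,4,6

# with -a: consecutive lines are placed across each row
seq 6 | pr -3ats,
# 1,2,3
# 4,5,6
```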
rev
The rev
command reverses each input line character wise. The newline character won't be added to the end if it wasn't present in the input. Here are some examples:
$ echo 'This is a sample text' | rev
txet elpmas a si sihT
$ printf 'apple\nbanana\ncherry\n' | rev
elppa
ananab
yrrehc
$ printf 'malayalam\nnoon\n' | rev
malayalam
noon
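As the last example hints, a line that is unchanged by rev is a palindrome. A minimal sketch of that check, where is_palindrome is a hypothetical helper written for this illustration:

```shell
# hypothetical helper: succeeds if the argument reads the same in both directions
is_palindrome() {
    [ "$1" = "$(printf '%s\n' "$1" | rev)" ]
}

for word in malayalam noon apple; do
    if is_palindrome "$word"; then
        echo "$word: palindrome"
    else
        echo "$word: not a palindrome"
    fi
done
# malayalam: palindrome
# noon: palindrome
# apple: not a palindrome
```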
split
The split
command is useful to divide the input into smaller parts based on the number of lines, bytes, file size, etc. You can also execute another command on the divided parts before saving the results. An example use case is sending a large file as multiple parts as a workaround for online transfer size limits.
By default, the split
command divides the input 1000
lines at a time. Newline character is the default line separator. You can pass a single file or stdin
data as the input. Use cat
if you need to concatenate multiple input sources. By default, the output files will be named xaa
, xab
, xac
and so on (where x
is the prefix). If the filenames are exhausted, two more letters will be appended and the pattern will continue as needed. If the number of input lines is not evenly divisible by 1000, the last file will contain fewer than 1000 lines.
# divide input 1000 lines at a time
$ seq 10000 | split
# output filenames
$ ls x*
xaa xab xac xad xae xaf xag xah xai xaj
# preview of some of the output files
$ head -n1 xaa xab xae xaj
==> xaa <==
1
==> xab <==
1001
==> xae <==
4001
==> xaj <==
9001
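Because the default suffixes sort alphabetically, concatenating the parts in glob order should give back the original input. The sketch below checks this in a scratch directory; nums.txt and the 2500 line count are made up for illustration.

```shell
# work in a scratch directory so the generated files don't clutter anything
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1

seq 2500 > nums.txt
split nums.txt

# 2500 lines give two full files and one partial file
echo xa?           # xaa xab xac
wc -l < xac        # 500

# the shell expands xa? in sorted order, so cat restores the original content
cat xa? | cmp - nums.txt && echo 'identical to the original'
```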
For more examples, customization options and other details, see the split chapter from my CLI text processing with GNU Coreutils ebook.
csplit
The csplit
command is useful to divide the input into smaller parts based on line numbers and regular expression patterns.
You can split the input into two based on a particular line number. To do so, specify the line number after the input source (filename or stdin
data). The first output file will have the input lines before the given line number and the second output file will have the rest of the contents. By default, the output files will be named xx00
, xx01
, xx02
and so on (where xx
is the prefix). The numerical suffix will automatically use more digits if needed.
# split input into two based on line number 2
# the -q option suppresses output showing number of bytes written for each file
$ seq 4 | csplit -q - 2
# first output file will have the first line
# second output file will have the rest
$ head xx*
==> xx00 <==
1
==> xx01 <==
2
3
4
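csplit also accepts more than one split point after the input source. As a small sketch under that assumption (run in a scratch directory, with a loop added only to print each file compactly), two line numbers produce three output files:

```shell
# work in a scratch directory so that only the new xx files are present
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1

# split before line 3 and again before line 5
seq 6 | csplit -q - 3 5

# show each output file on a single line
for f in xx*; do
    printf '%s: %s\n' "$f" "$(paste -sd' ' "$f")"
done
# xx00: 1 2
# xx01: 3 4
# xx02: 5 6
```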
You can also split the input based on a line matching the given regular expression. The output produced will vary based on the //
or %%
delimiters being used to surround the regexp. When /regexp/
is used, output is similar to the line number based splitting. The first output file will have the input lines before the first occurrence of a line matching the given regexp and the second output file will have the rest of the contents.
Consider this sample input file:
$ cat purchases.txt
coffee
tea
washing powder
coffee
toothpaste
tea
soap
tea
Here's an example of splitting the input file using the /regexp/
syntax:
# match a line containing 't' followed by zero or more characters and then 'p'
# 'toothpaste' is the only match for this input file
$ csplit -q purchases.txt '/t.*p/'
$ head xx*
==> xx00 <==
coffee
tea
washing powder
coffee
==> xx01 <==
toothpaste
tea
soap
tea
When %regexp%
is used, the lines occurring before the matching line won't be part of the output. Only the line matching the given regexp and the rest of the contents will be part of the single output file.
$ csplit -q purchases.txt '%t.*p%'
$ cat xx00
toothpaste
tea
soap
tea
For more examples, customization options and other details, see the csplit chapter from my CLI text processing with GNU Coreutils ebook.
xargs
By default, xargs
executes the echo
command for the arguments extracted from stdin
data (or file input via the -a
option). The -n
option helps to customize how many arguments should be passed at a time. Together, these features can be used to reshape whitespace separated data as shown in the examples below:
$ printf ' apple banana cherry\n\t\tdragon unicorn \n'
apple banana cherry
dragon unicorn
$ printf ' apple banana cherry\n\t\tdragon unicorn \n' | xargs -n2
apple banana
cherry dragon
unicorn
$ cat ip.txt
deep blue
light orange
blue delight
$ xargs -a ip.txt -n3
deep blue light
orange blue delight
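Any other command can take the place of the default echo. The following sketch, with made-up input, passes two arguments at a time to the external printf command:

```shell
# default behavior: echo receives two arguments at a time
printf 'apple banana cherry dragon\n' | xargs -n2
# apple banana
# cherry dragon

# here printf is run once per pair of arguments
printf 'apple banana cherry dragon\n' | xargs -n2 printf '%s-%s\n'
# apple-banana
# cherry-dragon
```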
You can use the -L
option to specify how many input lines should be combined at a time:
# same as: pr -3ats' ' or paste -d' ' - - -
$ seq 9 | xargs -L3
1 2 3
4 5 6
7 8 9
$ xargs -a ip.txt -L2
deep blue light orange
blue delight
# you can also use -l instead of -L1
$ printf ' apple banana cherry\n\t\tdragon unicorn \n' | xargs -L1
apple banana cherry
dragon unicorn
Note that xargs -L1 is not the same as awk '{$1=$1} 1' since xargs will discard blank lines. Also, trailing blank characters will cause the next line to be considered as part of the current line. For example:
# no trailing blanks
$ printf 'xerox apple\nregex go sea\n' | xargs -L1
xerox apple
regex go sea
# with trailing blanks
$ printf 'xerox apple \nregex go sea\n' | xargs -L1
xerox apple regex go sea
You can use the -d
option to specify a custom single character input delimiter. For example:
$ printf '1,2,3,4,5,6' | xargs -d, -n3
1 2 3
4 5 6
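The example above uses printf without a trailing newline for a reason: with -d, the newline is no longer treated as a separator, so it ends up as part of the last argument. A short sketch of the difference, assuming GNU xargs:

```shell
# echo adds a newline, which becomes part of the last argument ('3' + newline)
echo '1,2,3' | xargs -d, | wc -l                   # 2, there's an extra empty line

# removing the newline first avoids the stray empty line
echo '1,2,3' | tr -d '\n' | xargs -d, | wc -l      # 1
```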
Exercises
Use the example_files/text_files directory for input files used in the following exercises.
1) Generate the following sequence.
# ???
100
95
90
85
80
2) Is the sequence shown below possible to generate with seq
? If so, how?
# ???
01.5,02.5,03.5,04.5,05.5
3) Display three random words from /usr/share/dict/words
(or an equivalent dictionary word file) containing s
and e
and t
in any order. The output shown below is just an example.
# ???
supplemental
foresight
underestimates
4) Briefly describe the purpose of the shuf
command options -i
, -e
and -r
.
5) Why does the below command not work as expected? What other tools can you use in such cases?
# not working as expected
$ echo 'apple,banana,cherry,dates' | cut -d, -f3,1,3
apple,cherry
# expected output
# ???
cherry,apple,cherry
6) Display all the fields except the second one in the format shown below. Can you construct two different solutions?
$ echo 'apple,banana,cherry,dates' | cut # ???
apple cherry dates
$ echo '2,3,4,5,6,7,8' | cut # ???
2 4 5 6 7 8
7) Extract the first three characters from the input lines as shown below. Can you also use the head
command for this purpose? If not, why not?
$ printf 'apple\nbanana\ncherry\ndates\n' | cut # ???
app
ban
che
dat
8) Display only the first and third columns of the scores.csv
input file in the format as shown below. Note that only space characters are present between the two columns, not tab.
$ cat scores.csv
Name,Maths,Physics,Chemistry
Ith,100,100,100
Cy,97,98,95
Lin,78,83,80
# ???
Name Physics
Ith 100
Cy 98
Lin 83
9) Display the contents of table.txt
in the format shown below.
# ???
brown bread mat hair 42
blue cake mug shirt -7
yellow banana window shoes 3.14
10) Implement ROT13 cipher using the tr
command.
$ echo 'Hello World' | tr # ???
Uryyb Jbeyq
$ echo 'Uryyb Jbeyq' | tr # ???
Hello World
11) Retain only alphabets, digits and whitespace characters.
$ echo 'Apple_42 cool,blue Dragon:army' | # ???
Apple42 coolblue Dragonarmy
12) Use tr
to get the output shown below.
$ echo '!!hhoowwww !!aaaaaareeeeee!! yyouuuu!!' | tr # ???
how are you
13) paste -s
works separately for multiple input files. How would you work around this if you needed to treat all the input files as a single source?
# this works individually for each input file
$ paste -sd, fruits.txt ip.txt
banana,papaya,mango
deep blue,light orange,blue delight
# expected output
# ???
banana,papaya,mango,deep blue,light orange,blue delight
14) Use appropriate options to get the expected output shown below.
# default output
$ paste fruits.txt ip.txt
banana deep blue
papaya light orange
mango blue delight
# expected output
$ paste # ???
banana
deep blue
papaya
light orange
mango
blue delight
15) Use the pr
command to get the expected output shown below.
$ seq -w 16 | pr # ???
01,02,03,04
05,06,07,08
09,10,11,12
13,14,15,16
$ seq -w 16 | pr # ???
01,05,09,13
02,06,10,14
03,07,11,15
04,08,12,16
16) Use the pr
command to join the input files fruits.txt
and ip.txt
as shown below.
# ???
banana : deep blue
papaya : light orange
mango : blue delight
17) The cut
command doesn't support a way to choose the last N
fields. Which tool presented in this chapter can be combined to work with cut
to get the output shown below?
# last two characters from each line
$ printf 'apple\nbanana\ncherry\ndates\n' | # ???
le
na
ry
es
18) Go through the split
documentation and use appropriate options to get the output shown below for the input file purchases.txt
.
# split input by 3 lines (max) at a time
# ???
$ head xa?
==> xaa <==
coffee
tea
washing powder
==> xab <==
coffee
toothpaste
tea
==> xac <==
soap
tea
$ rm xa?
19) Go through the split
documentation and use appropriate options to get the output shown below.
$ echo 'apple,banana,cherry,dates' | split # ???
$ head xa?
==> xaa <==
apple,
==> xab <==
banana,
==> xac <==
cherry,
==> xad <==
dates
$ rm xa?
20) Split the input file purchases.txt
such that the text before a line containing powder
is part of the first file and the rest are part of the second file as shown below.
# ???
$ head xx0?
==> xx00 <==
coffee
tea
==> xx01 <==
washing powder
coffee
toothpaste
tea
soap
tea
$ rm xx0?
21) Write a generic solution that transposes comma delimited data. Example input/output is shown below. You can use any tool(s) presented in this book.
$ cat scores.csv
Name,Maths,Physics,Chemistry
Ith,100,100,100
Cy,97,98,95
Lin,78,83,80
# ???
Name,Ith,Cy,Lin
Maths,100,97,78
Physics,100,98,83
Chemistry,100,95,80
22) Reshape the contents of table.txt
to the expected output shown below.
$ cat table.txt
brown bread mat hair 42
blue cake mug shirt -7
yellow banana window shoes 3.14
# ???
brown bread mat hair
42 blue cake mug
shirt -7 yellow banana
window shoes 3.14