Assorted Text Processing Tools

There are far more specialized text processing tools than a single chapter can cover. This chapter discusses some of the commands that haven't been covered in the previous chapters.

info The example_files directory has the sample input files used in this chapter.

seq

The seq command is a handy tool to generate a sequence of numbers in ascending or descending order. Both integer and floating-point numbers are supported. You can also customize the formatting for numbers and the separator between them.

You need three numbers to generate an arithmetic progression: start, step and stop. When you pass only a single number as the stop value, the default start and step values are assumed to be 1. When you pass two numbers, they are treated as the start and stop values (in that order).

# start=1, step=1 and stop=3
$ seq 3
1
2
3

# start=25434, step=1 and stop=25437
$ seq 25434 25437
25434
25435
25436
25437

# start=-5, step=1 and stop=-3
$ seq -5 -3
-5
-4
-3

# start=0.25, step=0.33 and stop=1.12
$ seq 0.25 0.33 1.12
0.25
0.58
0.91

By using a negative step value, you can generate sequences in descending order.

$ seq 3 -1 1
3
2
1

You can use the -s option to change the separator between the numbers of a sequence. A single newline character is always added after the final number.

$ seq -s' - ' 4
1 - 2 - 3 - 4

$ seq -s: 1.2e2 0.752 1.22e2
120.000:120.752:121.504

The -w option will equalize the width of the output numbers using leading zeros. The largest width between the start and stop values will be used.

$ seq -w 8 10
08
09
10

$ seq -w 0003
0001
0002
0003

You can use the -f option for printf style floating-point number formatting.

$ seq -f'%g' -s: 1 0.75 3
1:1.75:2.5

$ seq -f'%.4f' -s: 1 0.75 3
1.0000:1.7500:2.5000

$ seq -f'%.3e' 1.2e2 0.752 1.22e2
1.200e+02
1.208e+02
1.215e+02
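
The format string passed to -f can also have literal text around the number, which is handy for generating names. Here's a minimal sketch, assuming GNU seq. The file_ prefix and the names variable are made up for illustration.

```shell
# %02g pads the number with zeros to a width of two
# literal text around the format specifier is printed as is
names=$(seq -f 'file_%02g.txt' 3)
printf '%s\n' "$names"
# file_01.txt
# file_02.txt
# file_03.txt
```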

shuf

By default, shuf will randomize the order of input lines. You can use the -n option to limit the number of output lines.

$ printf 'apple\nbanana\ncherry\nfig\nmango' | shuf
banana
cherry
mango
apple
fig

$ printf 'apple\nbanana\ncherry\nfig\nmango' | shuf -n2
mango
cherry

You can use the -e option to specify multiple input lines as arguments to the command. The -r option helps if you want to allow input lines to be repeated. This option is usually paired with -n to limit the number of lines in the output.

$ shuf -n4 -r -e brown green blue
green
brown
blue
green

The -i option treats each integer in the given range as an input line, which helps you generate random non-negative integers.

$ shuf -n3 -i 100-200
170
112
148
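
Without the -r option, shuf never repeats an input line, so shuffling a whole range gives you a random permutation of it. The sketch below (the perm variable is just for illustration) checks this by sorting the shuffled output back into order.

```shell
# shuffle the integers 1 to 5, each value appears exactly once
perm=$(shuf -i 1-5)

# the order differs from run to run, but sorting restores the range
printf '%s\n' "$perm" | sort -n | paste -sd' ' -
# 1 2 3 4 5
```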

cut

cut is a handy tool for many field processing use cases. The features are limited compared to awk and perl commands, but the reduced scope also leads to faster processing.

By default, cut splits the input content into fields based on the tab character, which you can change using the -d option. The -f option allows you to select a desired field from each input line. To extract multiple fields, specify the selections separated by the comma character. By default, lines not containing the input delimiter will still be part of the output. You can use the -s option to suppress such lines.

# second field
$ printf 'apple\tbanana\tcherry\n' | cut -f2
banana

# first and third field
$ printf 'apple\tbanana\tcherry\n' | cut -f1,3
apple   cherry

# setting -d automatically changes output delimiter as well
$ echo 'one;two;three;four;five' | cut -d';' -f2,5
two;five
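
The -s option mentioned above is easier to appreciate with a small example. In this sketch (the sample text is made up), the second line doesn't contain the : delimiter, so it is printed in full by default and dropped when -s is used.

```shell
# the line without the delimiter is passed through unchanged
default_out=$(printf 'a:b:c\nplain line\nx:y\n' | cut -d: -f2)
printf '%s\n' "$default_out"
# b
# plain line
# y

# -s suppresses lines that don't contain the delimiter
with_s=$(printf 'a:b:c\nplain line\nx:y\n' | cut -s -d: -f2)
printf '%s\n' "$with_s"
# b
# y
```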

You can use the - character to specify field ranges. The starting or ending field number can be skipped, but not both.

# 2nd, 3rd and 4th fields
$ printf 'apple\tbanana\tcherry\tdates\n' | cut -f2-4
banana  cherry  dates

# all fields from the start till the 3rd field
$ printf 'apple\tbanana\tcherry\tdates\n' | cut -f-3
apple   banana  cherry

# 1st field and all fields from the 3rd field till the end
$ printf 'apple\tbanana\tcherry\tdates\n' | cut -f1,3-
apple   cherry  dates

Use the --output-delimiter option to customize the output separator to any string of your choice.

# same as: tr '\t' ','
$ printf 'apple\tbanana\tcherry\n' | cut --output-delimiter=, -f1-
apple,banana,cherry

# multicharacter example
$ echo 'one;two;three;four' | cut -d';' --output-delimiter=' : ' -f1,3-
one : three : four

The --complement option allows you to invert the field selections.

# except second field
$ printf 'apple ball cat\n1 2 3 4 5' | cut --complement -d' ' -f2
apple cat
1 3 4 5

# except first and third fields
$ printf 'apple ball cat\n1 2 3 4 5' | cut --complement -d' ' -f1,3
ball
2 4 5

You can use the -b or -c options to select specified bytes from each input line. The syntax is the same as for the -f option. The -c option is intended for multibyte character selection, but for now it works exactly as the -b option.

$ printf 'apple\tbanana\tcherry\n' | cut -c2,8,11
pan

$ printf 'apple\tbanana\tcherry\n' | cut -c2,8,11 --output-delimiter=-
p-a-n

$ printf 'apple\tbanana\tcherry\n' | cut --complement -c13-
apple   banana

$ printf 'cat-bat\ndog:fog' | cut -c5-
bat
fog

column

The column command is a nifty tool to align input data column wise. By default, whitespace is used as the input delimiter. Space characters are used to align the output columns, so whitespace characters like tab will get converted to spaces.

$ printf 'one two three\nfour five six\nseven eight nine\n'
one two three
four five six
seven eight nine

$ printf 'one two three\nfour five six\nseven eight nine\n' | column -t
one    two    three
four   five   six
seven  eight  nine

You can use the -s option to customize the input delimiter. Note that the output delimiter will still be made up of spaces only.

$ cat scores.csv
Name,Maths,Physics,Chemistry
Ith,100,100,100
Cy,97,98,95
Lin,78,83,80

$ column -s, -t scores.csv
Name  Maths  Physics  Chemistry
Ith   100    100      100
Cy    97     98       95
Lin   78     83       80

$ printf '1:-:2:-:3\napple:-:banana:-:cherry\n' | column -s:-: -t
1      2       3
apple  banana  cherry

warning Input should have a newline at the end, otherwise you'll get an error:

$ printf '1 2 3\na   b   c' | column -t
column: line too long
1  2  3

tr

tr helps you map one set of characters to another set of characters. Features like ranges, repeats, character sets, squeeze, complement, etc. make it a must-know text processing tool.

tr works only on stdin data, so you'll need to use shell input redirection for file input. Here are some basic examples:

# 'l' maps to '1', 'e' to '3', 't' to '7' and 's' to '5'
$ echo 'leet speak' | tr 'lets' '1375'
1337 5p3ak

# example with shell metacharacters
$ echo 'apple;banana;cherry' | tr ';' ':'
apple:banana:cherry

# swap case
$ echo 'Hello World' | tr 'a-zA-Z' 'A-Za-z'
hELLO wORLD

$ tr 'a-z' 'A-Z' <greeting.txt
HI THERE
HAVE A NICE DAY
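
Character classes such as [:lower:] and [:upper:] can be used in place of explicit ranges. A minimal sketch (the upper variable is just for illustration):

```shell
# same effect as 'a-z' 'A-Z' for this input
upper=$(echo 'Hello World' | tr '[:lower:]' '[:upper:]')
echo "$upper"
# HELLO WORLD
```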

You can use the -d option to specify a set of characters to be deleted. The -c option will invert the first set of characters. Here are some examples:

$ echo '2021-08-12' | tr -d '-'
20210812

$ s='"Hi", there! How *are* you? All fine here.'
$ echo "$s" | tr -d '[:punct:]'
Hi there How are you All fine here

# retain alphabets, whitespaces, period, exclamation and question mark
$ echo "$s" | tr -cd 'a-zA-Z.!?[:space:]'
Hi there! How are you? All fine here.
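
Two everyday uses of these options are sketched below: removing carriage returns from DOS-style line endings and keeping only the digits from a string. The sample inputs and variable names are made up for illustration.

```shell
# delete carriage returns from DOS-style line endings
unix=$(printf 'hi\r\nbye\r\n' | tr -d '\r')
printf '%s\n' "$unix"
# hi
# bye

# -cd deletes everything except digits and the newline character
digits=$(echo 'Phone: +1 (555) 010-9999' | tr -cd '0-9\n')
echo "$digits"
# 15550109999
```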

The -s option will squeeze consecutive repeated characters to a single copy of that character.

# squeeze lowercase alphabets
$ echo 'hhoowwww aaaaaareeeeee yyouuuu!!' | tr -s 'a-z'
how are you!!

# translate and squeeze
$ echo 'hhoowwww aaaaaareeeeee yyouuuu!!' | tr -s 'a-z' 'A-Z'
HOW ARE YOU!!

# delete and squeeze
$ echo 'hhoowwww aaaaaareeeeee yyouuuu!!' | tr -sd '!' 'a-z'
how are you

# squeeze other than lowercase alphabets
$ echo 'apple    noon     banana!!!!!' | tr -cs 'a-z'
apple noon banana!
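
Squeezing is also a handy preprocessing step for tools like cut, which treat every single delimiter character as a field boundary. A minimal sketch with made-up input:

```shell
# cut sees empty fields between consecutive spaces,
# so the second field is empty and a blank line is printed
echo 'apple   banana    cherry' | cut -d' ' -f2

# squeeze the spaces first to get the second word
second=$(echo 'apple   banana    cherry' | tr -s ' ' | cut -d' ' -f2)
echo "$second"
# banana
```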

paste

paste is typically used to merge two or more files column wise. It also has a handy feature for serializing data. By default, paste adds a tab character between the corresponding lines of input files.

$ cat colors_1.txt
Blue
Brown
Orange
Purple
$ cat colors_2.txt
Black
Blue
Green
Orange

$ paste colors_1.txt colors_2.txt
Blue    Black
Brown   Blue
Orange  Green
Purple  Orange

You can use the -d option to change the delimiter between the columns. The separator is added even if the data has been exhausted for some of the input files.

$ paste -d'|' <(seq 3) <(seq 4 5) <(seq 6 8)
1|4|6
2|5|7
3||8

# note that the space between -d and empty string is necessary here
$ paste -d '' <(seq 3) <(seq 6 8)
16
27
38

$ paste -d'\n' <(seq 11 12) <(seq 101 102)
11
101
12
102

You can use empty files to get multicharacter separation between the columns.

$ paste -d' : ' <(seq 3) /dev/null /dev/null <(seq 4 6)
1 : 4
2 : 5
3 : 6
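
This works because paste uses the characters given to -d one after the other, going back to the first character when the list runs out. The sketch below illustrates the cycling with two delimiters and four columns.

```shell
# delimiters are used in turn: ',' then ';' then ',' again
rows=$(seq 8 | paste -d',;' - - - -)
printf '%s\n' "$rows"
# 1,2;3,4
# 5,6;7,8
```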

If you use - multiple times, paste will consume a line from stdin data every time - is encountered. This is different from using the same filename multiple times, in which case they are treated as separate inputs.

# five columns
$ seq 10 | paste -d: - - - - -
1:2:3:4:5
6:7:8:9:10

# use redirection for file input
$ <colors_1.txt paste -d: - - -
Blue:Brown:Orange
Purple::

The -s option allows you to combine all the input lines from a file into a single line using the given delimiter. Multiple input files are treated separately. paste ensures that a final newline character is added even if it isn't present in the input.

# <colors_1.txt tr '\n' ',' will give you a trailing comma
$ paste -sd, colors_1.txt
Blue,Brown,Orange,Purple

# multiple file example
$ paste -sd: colors_1.txt colors_2.txt
Blue:Brown:Orange:Purple
Black:Blue:Green:Orange
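
Serializing with + as the delimiter is a quick way to build an arithmetic expression. Here's a minimal sketch that sums a column of numbers using shell arithmetic. It assumes the input has only integers, and the expr_str variable is just for illustration.

```shell
# join the numbers with '+' to form an expression
expr_str=$(seq 5 | paste -sd+ -)
echo "$expr_str"
# 1+2+3+4+5

# let the shell evaluate the expression
total=$(( $expr_str ))
echo "$total"
# 15
```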

pr

Paginate or columnate FILE(s) for printing.

As stated in the above quote from the manual, the pr command is mainly used for those two tasks. This section will discuss only the columnate features and some miscellaneous tasks. Here's a pagination example if you are interested in exploring further. The pr command will add blank lines, a header and so on to make it suitable for printing.

$ pr greeting.txt | head -n8


2022-06-11 10:48                   greeting.txt                   Page 1


Hi there
Have a nice day

The --columns and -a options can be used to merge the input lines in two different ways:

  • split the input file and then merge them as columns
  • merge consecutive lines, similar to the paste command

Here's an example to get started. Note that -N is the same as using --columns=N where N is the number of columns you want in the output. The default page width is 72, which means each column can only have a maximum of 72/N characters (including the separator). Tab and space characters will be used to fill the columns as needed. You can use the -J option to prevent pr from truncating longer columns. The -t option is used here to turn off the pagination features.

# split input into three parts
# each column width is 72/3 = 24 characters max
$ seq 9 | pr -3t
1                       4                       7
2                       5                       8
3                       6                       9

You can customize the separator using the -s option. The default is a tab character, which you can change to any other string value. The -s option also turns off line truncation, so the -J option isn't needed.

# tab separator
$ seq 9 | pr -3ts
1       4       7
2       5       8
3       6       9

# custom separator
$ seq 9 | pr -3ts' : '
1 : 4 : 7
2 : 5 : 8
3 : 6 : 9

However, the default page width of 72 can still cause issues, which you can prevent by using the -w option. The -w option overrides the effect of the -s option on line truncation, so use the -J option as well unless you really need truncation.

$ seq 6 | pr -J -w10 -3ats'::::'
pr: page width too narrow

$ seq 6 | pr -J -w11 -3ats'::::'
1::::2::::3
4::::5::::6

Use the -a option to merge consecutive lines, similar to the paste command. One advantage is that the -s option supports a string value, whereas with paste you'd need to use workarounds to get multicharacter separation.

# same as: paste -d: - - - -
$ seq 8 | pr -4ats:
1:2:3:4
5:6:7:8

# unlike paste, pr doesn't add separators if the last row has fewer columns to fill
$ seq 10 | pr -4ats,
1,2,3,4
5,6,7,8
9,10
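
To see the two merging styles side by side, here's a sketch that applies both to the same input. A comma separator is used so that the result doesn't depend on the page width. The down and across variable names are made up for illustration.

```shell
# default: fill down each column before moving to the next
down=$(seq 6 | pr -2ts,)
printf '%s\n' "$down"
# 1,4
# 2,5
# 3,6

# -a: fill across each row
across=$(seq 6 | pr -2ats,)
printf '%s\n' "$across"
# 1,2
# 3,4
# 5,6
```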

Two or more input files can be merged column wise using the -m option. As seen before, -t is needed to ignore pagination features and -s can be used to customize the separator.

# same as: paste -d' : ' <(seq 3) /dev/null /dev/null <(seq 4 6)
$ pr -mts' : ' <(seq 3) <(seq 4 6)
1 : 4
2 : 5
3 : 6

rev

The rev command reverses each input line character wise. A newline character won't be added to the end if it wasn't present in the input. Here are some examples:

$ echo 'This is a sample text' | rev
txet elpmas a si sihT

$ printf 'apple\nbanana\ncherry\n' | rev
elppa
ananab
yrrehc

$ printf 'malayalam\nnoon\n' | rev
malayalam
noon
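
As the last example suggests, a palindrome is left unchanged by rev. Here's a small sketch of a palindrome check built on that idea. The is_palindrome function name is made up for illustration.

```shell
# succeeds if the argument reads the same in both directions
is_palindrome() {
    [ "$1" = "$(printf '%s\n' "$1" | rev)" ]
}

for word in noon apple malayalam; do
    if is_palindrome "$word"; then
        echo "$word: yes"
    else
        echo "$word: no"
    fi
done
# noon: yes
# apple: no
# malayalam: yes
```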

split

The split command is useful to divide the input into smaller parts based on number of lines, bytes, file size, etc. You can also execute another command on the divided parts before saving the results. An example use case is sending a large file as multiple parts as a workaround for online transfer size limits.

By default, the split command divides the input 1000 lines at a time. The newline character is the default line separator. You can pass a single file or stdin data as the input. Use cat if you need to concatenate multiple input sources. By default, the output files will be named xaa, xab, xac and so on (where x is the prefix). If the filenames are exhausted, two more letters will be appended and the pattern will continue as needed. If the number of input lines is not evenly divisible, the last file will contain fewer than 1000 lines.

# divide input 1000 lines at a time
$ seq 10000 | split

# output filenames
$ ls x*
xaa  xab  xac  xad  xae  xaf  xag  xah  xai  xaj

# preview of some of the output files
$ head -n1 xaa xab xae xaj
==> xaa <==
1

==> xab <==
1001

==> xae <==
4001

==> xaj <==
9001
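
Since the output filenames sort in the same order as the parts were created, concatenating them with a shell glob gives back the original input. The sketch below checks this in a temporary directory. The nums.txt filename and the part_ prefix are made up for illustration (split accepts a custom prefix as the second argument).

```shell
# work in a scratch directory to avoid clobbering existing files
tmp=$(mktemp -d)
cd "$tmp" || exit 1

seq 2500 > nums.txt
# 1000 lines at a time: part_aa, part_ab and part_ac
split nums.txt part_
parts=$(ls part_* | paste -sd' ' -)
echo "$parts"
# part_aa part_ab part_ac

# the glob expands in sorted order, so the parts line up correctly
if cat part_* | cmp -s - nums.txt; then
    echo 'parts match the original'
fi
```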

info For more examples, customization options and other details, see the split chapter from my Command line text processing with GNU Coreutils ebook.

csplit

The csplit command is useful to divide the input into smaller parts based on line numbers and regular expression patterns.

You can split the input into two based on a particular line number. To do so, specify the line number after the input source (filename or stdin data). The first output file will have the input lines before the given line number and the second output file will have the rest of the contents. By default, the output files will be named xx00, xx01, xx02 and so on (where xx is the prefix). The numerical suffix will automatically use more digits if needed.

# split input into two based on line number 2
# -q option suppresses output showing number of bytes written for each file
$ seq 4 | csplit -q - 2

# first output file will have the first line
# second output file will have the rest
$ head xx*
==> xx00 <==
1

==> xx01 <==
2
3
4
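
The xx prefix can be changed with the -f option. Here's a minimal sketch, run in a temporary directory, with part_ as a made-up prefix:

```shell
# work in a scratch directory to avoid clobbering existing files
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# lines before line number 3 go to part_00, the rest to part_01
seq 5 | csplit -q -f part_ - 3

paste -sd, part_00
# 1,2
paste -sd, part_01
# 3,4,5
```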

You can also split the input based on a line matching the given regular expression. The output produced will vary based on // or %% delimiters being used to surround the regexp. When /regexp/ is used, output is similar to the line number based splitting. The first output file will have the input lines before the first occurrence of a line matching the given regexp and the second output file will have the rest of the contents.

Consider this sample input file:

$ cat purchases.txt
coffee
tea
washing powder
coffee
toothpaste
tea
soap
tea

Here's an example of splitting the input file using the /regexp/ syntax:

# match a line containing 't' followed by zero or more characters and then 'p'
# 'toothpaste' is the only match for this input file
$ csplit -q purchases.txt '/t.*p/'

$ head xx*
==> xx00 <==
coffee
tea
washing powder
coffee

==> xx01 <==
toothpaste
tea
soap
tea

When %regexp% is used, the lines occurring before the matching line won't be part of the output. Only the line matching the given regexp and the rest of the contents will be part of the single output file.

$ csplit -q purchases.txt '%t.*p%'

$ cat xx00
toothpaste
tea
soap
tea

info For more examples, customization options and other details, see the csplit chapter from my Command line text processing with GNU Coreutils ebook.

xargs

By default, xargs executes the echo command for the arguments extracted from stdin data (or file input via the -a option). The -n option helps to customize how many arguments should be passed at a time. Together, these features can be used to reshape whitespace separated data as shown in the examples below:

$ printf '  apple   banana cherry\n\t\tdragon unicorn   \n'
  apple   banana cherry
                dragon unicorn   
$ printf '  apple   banana cherry\n\t\tdragon unicorn   \n' | xargs -n2
apple banana
cherry dragon
unicorn

$ cat ip.txt
deep blue
light orange
blue delight
$ xargs -a ip.txt -n3
deep blue light
orange blue delight
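
Any other command can be given in place of the default echo. For example, here's a sketch that uses printf to format two arguments at a time. The format string and the pairs variable are just for illustration.

```shell
# printf receives two arguments per invocation
pairs=$(printf 'a b c d' | xargs -n2 printf '%s-%s\n')
printf '%s\n' "$pairs"
# a-b
# c-d
```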

You can use the -L option to specify how many input lines should be combined at a time:

# same as: pr -3ats' ' or paste -d' ' - - -
$ seq 9 | xargs -L3
1 2 3
4 5 6
7 8 9

$ xargs -a ip.txt -L2
deep blue light orange
blue delight

# you can also use -l instead of -L1
$ printf '  apple   banana cherry\n\t\tdragon unicorn   \n' | xargs -L1
apple banana cherry
dragon unicorn

info Note that xargs -L1 is not the same as awk '{$1=$1} 1' since xargs will discard blank lines. Also, trailing blank characters will cause the next line to be considered as part of the current line. For example:

# no trailing blanks
$ printf 'xerox apple\nregex   go  sea\n' | xargs -L1
xerox apple
regex go sea

# with trailing blanks
$ printf 'xerox apple  \nregex   go  sea\n' | xargs -L1
xerox apple regex go sea

Use the -d option to change the input delimiter from whitespace to some other single character. For example:

$ printf '1,2,3,4,5,6' | xargs -d, -n3
1 2 3
4 5 6
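
Note that the input in the above example doesn't end with a newline. With -d, a newline is no longer treated as a separator, so a trailing newline would become part of the last argument and show up as an extra blank line in the output. One possible workaround, sketched here assuming GNU xargs, is to delete the newline with tr first.

```shell
# the trailing newline is part of the last argument '6\n'
# so echo prints an extra blank line
printf '1,2,3,4,5,6\n' | xargs -d, -n3 | wc -l
# 3

# removing the newline restores the expected two lines
fixed=$(printf '1,2,3,4,5,6\n' | tr -d '\n' | xargs -d, -n3)
printf '%s\n' "$fixed"
# 1 2 3
# 4 5 6
```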

Exercises

info Use example_files/text_files directory for input files used in the following exercises.

1) Generate the following sequence.

# ???
100
95
90
85
80

2) Is the sequence shown below possible to generate with seq? If so, how?

# ???
01.5,02.5,03.5,04.5,05.5

3) Display three random words from /usr/share/dict/words (or equivalent dictionary word file) containing s and e and t in any order. The output shown below is just an example.

# ???
supplemental
foresight
underestimates

4) Briefly describe the purpose of the shuf command options -i, -e and -r.

5) Why does the below command not work as expected? What other tools can you use in such cases?

# not working as expected
$ echo 'apple,banana,cherry,dates' | cut -d, -f3,1,3
apple,cherry

# expected output
# ???
cherry,apple,cherry

6) Display all the fields except the second one in the format shown below. Can you construct two different solutions?

$ echo 'apple,banana,cherry,dates' | cut # ???
apple cherry dates

$ echo '2,3,4,5,6,7,8' | cut # ???
2 4 5 6 7 8

7) Extract first three characters from the input lines as shown below. Can you also use the head command for this purpose? If not, why not?

$ printf 'apple\nbanana\ncherry\ndates\n' | cut # ???
app
ban
che
dat

8) Display only the first and third columns of the scores.csv input file in the format shown below. Note that only space characters are present between the two columns, not tab.

$ cat scores.csv
Name,Maths,Physics,Chemistry
Ith,100,100,100
Cy,97,98,95
Lin,78,83,80

# ???
Name  Physics
Ith   100
Cy    98
Lin   83

9) Display the contents of table.txt in the format shown below.

# ???
brown   bread   mat     hair   42
blue    cake    mug     shirt  -7
yellow  banana  window  shoes  3.14

10) Implement ROT13 cipher using the tr command.

$ echo 'Hello World' | tr # ???
Uryyb Jbeyq
$ echo 'Uryyb Jbeyq' | tr # ???
Hello World

11) Retain only alphabets, digits and whitespace characters.

$ echo 'Apple_42 cool,blue Dragon:army' | # ???
Apple42 coolblue Dragonarmy

12) Use tr to get the output shown below.

$ echo '!!hhoowwww !!aaaaaareeeeee!! yyouuuu!!' | tr # ???
how are you

13) paste -s works separately for multiple input files. How would you work around this if you needed to treat the input as a single source?

# this works individually for each input file
$ paste -sd, fruits.txt ip.txt
banana,papaya,mango
deep blue,light orange,blue delight

# expected output
# ???
banana,papaya,mango,deep blue,light orange,blue delight

14) Use appropriate options to get the expected output shown below.

# default output
$ paste fruits.txt ip.txt
banana  deep blue
papaya  light orange
mango   blue delight

# expected output
$ paste # ???
banana
deep blue
papaya
light orange
mango
blue delight

15) Use the pr command to get the expected output shown below.

$ seq -w 16 | pr # ???
01,02,03,04
05,06,07,08
09,10,11,12
13,14,15,16

$ seq -w 16 | pr # ???
01,05,09,13
02,06,10,14
03,07,11,15
04,08,12,16

16) Use the pr command to join the input files fruits.txt and ip.txt as shown below.

# ???
banana : deep blue
papaya : light orange
mango : blue delight

17) The cut command doesn't support a way to choose the last N fields. Which tool presented in this chapter can be combined with cut to get the output shown below?

# last two characters from each line
$ printf 'apple\nbanana\ncherry\ndates\n' | # ???
le
na
ry
es

18) Go through split documentation and use appropriate options to get the output shown below for the input file purchases.txt.

# split input by 3 lines (max) at a time
# ???

$ head xa?
==> xaa <==
coffee
tea
washing powder

==> xab <==
coffee
toothpaste
tea

==> xac <==
soap
tea

$ rm xa?

19) Go through split documentation and use appropriate options to get the output shown below.

$ echo 'apple,banana,cherry,dates' | split # ???

$ head xa?
==> xaa <==
apple,
==> xab <==
banana,
==> xac <==
cherry,
==> xad <==
dates

$ rm xa?

20) Split the input file purchases.txt such that the text before a line containing powder is part of the first file and the rest are part of the second file as shown below.

# ???

$ head xx0?
==> xx00 <==
coffee
tea

==> xx01 <==
washing powder
coffee
toothpaste
tea
soap
tea

$ rm xx0?

21) Write a generic solution that transposes comma delimited data. Example input/output is shown below. You can use any tool(s) presented in this book.

$ cat scores.csv 
Name,Maths,Physics,Chemistry
Ith,100,100,100
Cy,97,98,95
Lin,78,83,80

# ???
Name,Ith,Cy,Lin
Maths,100,97,78
Physics,100,98,83
Chemistry,100,95,80

22) Reshape the contents of table.txt to the expected output shown below.

$ cat table.txt
brown bread mat hair 42
blue cake mug shirt -7
yellow banana window shoes 3.14

# ???
brown   bread  mat     hair
42      blue   cake    mug
shirt   -7     yellow  banana
window  shoes  3.14