csplit

The csplit command is useful to divide the input into smaller parts based on line numbers and regular expression patterns. Similar to split, this command also supports customizing output filenames.

Since a lot of output files will be generated in this chapter (often with same filenames), remove these files after every illustration.

Split on Nth line

You can split the input into two based on a particular line number. To do so, specify the line number after the input source (filename or stdin data). The first output file will have the input lines before the given line number and the second output file will have the rest of the contents.

By default, the output files will be named xx00, xx01, xx02, and so on (where xx is the prefix). The numerical suffix will automatically use more digits if needed. You'll see examples with more than two output files later.

# split input into two based on line number 4
$ seq 10 | csplit - 4
6
15

# first output file will have the first 3 lines
# second output file will have the rest
$ head xx*
==> xx00 <==
1
2
3

==> xx01 <==
4
5
6
7
8
9
10

$ rm xx*

As seen in the example above, csplit will also display the number of bytes written for each output file. You can use the -q option to suppress this message.

As mentioned earlier, remove the output files after every illustration.

Split on regexp

You can also split the input based on a line matching the given regular expression. The output produced will vary based on the // or %% delimiters being used to surround the regexp.

When /regexp/ is used, output is similar to the line number based splitting. The first output file will have the input lines before the first occurrence of a line matching the given regexp and the second output file will have the rest of the contents.

# match a line containing 't' followed by zero or more characters and then 'p'
# 'toothpaste' is the only match for this input file
$ csplit -q purchases.txt '/t.*p/'

$ head xx*
==> xx00 <==
coffee
tea
washing powder
coffee

==> xx01 <==
toothpaste
tea
soap
tea

When %regexp% is used, the lines occurring before the matching line won't be part of the output. Only the line matching the given regexp and the rest of the contents will be part of the single output file.

$ csplit -q purchases.txt '%t.*p%'

$ cat xx00
toothpaste
tea
soap
tea

You'll get an error if the given regexp isn't found in the input.
$ csplit -q purchases.txt '/xyz/'
csplit: ‘/xyz/’: match not found

See the Regular Expressions chapter from my GNU grep ebook if you want to learn more about regexp syntax and features.

Regexp offset

You can also provide offset numbers that'll affect where the matching line and its surrounding lines should be placed. When the offset is greater than zero, the split will happen that many lines after the matching line. The default offset is zero.

# when the offset is '1', the matching line will be part of the first file
$ csplit -q purchases.txt '/t.*p/1'
$ head xx*
==> xx00 <==
coffee
tea
washing powder
coffee
toothpaste

==> xx01 <==
tea
soap
tea

# matching line and 1 line after won't be part of the output
$ csplit -q purchases.txt '%t.*p%2'
$ cat xx00
soap
tea

When the offset is less than zero, the split will happen that many lines before the matching line.

# 2 lines before the matching line will be part of the second file
$ csplit -q purchases.txt '/t.*p/-2'
$ head xx*
==> xx00 <==
coffee
tea

==> xx01 <==
washing powder
coffee
toothpaste
tea
soap
tea

You'll get an error if the offset goes beyond the number of lines available in the input.
$ csplit -q purchases.txt '/t.*p/5'
csplit: ‘/t.*p/5’: line number out of range

$ csplit -q purchases.txt '/t.*p/-5'
csplit: ‘/t.*p/-5’: line number out of range

Repeat split

You can perform line number and regexp based split more than once by adding the {N} argument after the pattern. Default behavior examples seen so far is same as specifying {0}. Any number greater than zero will result in that many more splits.

# {1} means split one time more than the default split
# so, two splits in total and three output files
# in this example, split happens on the 4th and 8th line numbers
$ seq 10 | csplit -q - 4 '{1}'

$ head xx*
==> xx00 <==
1
2
3

==> xx01 <==
4
5
6
7

==> xx02 <==
8
9
10

Here's an example with regexp:

$ cat log.txt
--> warning 1
a,b,c,d
42
--> warning 2
x,y,z
--> warning 3
4,3,1

# split on the third (2+1) occurrence of a line containing 'warning'
$ csplit -q log.txt '%warning%' '{2}'
$ cat xx00
--> warning 3
4,3,1

As a special case, you can use {*} to repeat the split until the input is exhausted. This is especially useful with the /regexp/ form of splitting. Here's an example:

# split on all lines matching 'paste' or 'powder'
$ csplit -q purchases.txt '/paste\|powder/' '{*}'
$ head xx*
==> xx00 <==
coffee
tea

==> xx01 <==
washing powder
coffee

==> xx02 <==
toothpaste
tea
soap
tea

You'll get an error if the repeat count goes beyond the number of matches possible with the given input.
$ seq 10 | csplit -q - 4 '{2}'
csplit: ‘4’: line number out of range on repetition 2

$ csplit -q purchases.txt '/tea/' '{4}'
csplit: ‘/tea/’: match not found on repetition 3

Keep files on error

By default, csplit will remove the created output files if there's an error or a signal that causes the command to stop. You can use the -k option to keep such files. One use case is line number based splitting with the {*} modifier.

$ seq 7 | csplit -q - 4 '{*}'
csplit: ‘4’: line number out of range on repetition 1
$ ls xx*
ls: cannot access 'xx*': No such file or directory

# -k option will allow you to retain the created files
$ seq 7 | csplit -qk - 4 '{*}'
csplit: ‘4’: line number out of range on repetition 1
$ head xx*
==> xx00 <==
1
2
3

==> xx01 <==
4
5
6
7

Suppress matched lines

The --suppress-matched option will suppress the lines matching the split condition.

$ seq 5 | csplit -q --suppress-matched - 3
# 3rd line won't be part of the output
$ head xx*
==> xx00 <==
1
2

==> xx01 <==
4
5

$ rm xx*

$ seq 10 | csplit -q --suppress-matched - 4 '{1}'
# 4th and 8th lines won't be part of the output
$ head xx*
==> xx00 <==
1
2
3

==> xx01 <==
5
6
7

==> xx02 <==
9
10

Here's an example with regexp based split:

$ csplit -q --suppress-matched purchases.txt '/soap\|powder/' '{*}'
# lines matching 'soap' or 'powder' won't be part of the output
$ head xx*
==> xx00 <==
coffee
tea

==> xx01 <==
coffee
toothpaste
tea

==> xx02 <==
tea

Here's another example:

$ seq 11 16 | csplit -q --suppress-matched - '/[35]/' '{1}'
# lines matching '3' or '5' won't be part of the output
$ head xx*
==> xx00 <==
11
12

==> xx01 <==
14

==> xx02 <==
16

$ rm xx*

Exclude empty files

There are various cases that can result in empty output files. For example, first or last line matching the given split condition. Another possibility is the --suppress-matched option combined with consecutive lines matching during multiple splits. Here's an example:

$ csplit -q --suppress-matched purchases.txt '/coffee\|tea/' '{*}'

$ head xx*
==> xx00 <==

==> xx01 <==

==> xx02 <==
washing powder

==> xx03 <==
toothpaste

==> xx04 <==
soap

==> xx05 <==

You can use the -z option to exclude empty files from the output. The suffix numbering will be automatically adjusted in such cases.

$ csplit -qz --suppress-matched purchases.txt '/coffee\|tea/' '{*}'

$ head xx*
==> xx00 <==
washing powder

==> xx01 <==
toothpaste

==> xx02 <==
soap

Customize filenames

As seen earlier, xx is the default prefix for output filenames. Use the -f option to change this prefix.

$ seq 4 | csplit -q -f'num_' - 3

$ head num_*
==> num_00 <==
1
2

==> num_01 <==
3
4

The -n option controls the length of the numeric suffix. The suffix length will automatically increment if filenames are exhausted.

$ seq 4 | csplit -q -n1 - 3
$ ls xx*
xx0  xx1
$ rm xx*

$ seq 4 | csplit -q -n3 - 3
$ ls xx*
xx000  xx001

The -b option allows you to control the suffix using the printf formatting. Quoting from the manual:

When this option is specified, the suffix string must include exactly one printf(3)-style conversion specification, possibly including format specification flags, a field width, a precision specifications, or all of these kinds of modifiers. The format letter must convert a binary unsigned integer argument to readable form. The format letters d and i are aliases for u, and the u, o, x, and X conversions are allowed.

Here are some examples:

# hexadecimal numbering
# minimum two digits, zero filled
$ seq 100 | csplit -q -b'%02x' - 3 '{20}'
$ ls xx*
xx00  xx02  xx04  xx06  xx08  xx0a  xx0c  xx0e  xx10  xx12  xx14
xx01  xx03  xx05  xx07  xx09  xx0b  xx0d  xx0f  xx11  xx13  xx15
$ rm xx*

# custom prefix and suffix around decimal numbering
# default minimum is a single digit
$ seq 20 | csplit -q -f'num_' -b'%d.txt' - 3 '{4}'
$ ls num_*
num_0.txt  num_1.txt  num_2.txt  num_3.txt  num_4.txt  num_5.txt

Note that the -b option will override the -n option. See man 3 printf for more details about the formatting options.

Exercises

The exercises directory has all the files used in this section.

Remove the output files after every exercise.

1) Split the blocks.txt file such that the first 7 lines are in the first file and the rest are in the second file as shown below.

##### add your solution here

$ head xx*
==> xx00 <==
----
apple--banana
mango---fig
----
3.14
-42
1000

==> xx01 <==
----
sky blue
dark green
----
hi hello

$ rm xx*

2) Split the input file items.txt such that the text before a line containing colors is part of the first file and the rest are part of the second file as shown below.

##### add your solution here

$ head xx*
==> xx00 <==
1) fruits
apple 5
banana 10

==> xx01 <==
2) colors
green
sky blue
3) magical beasts
dragon 3
unicorn 42

$ rm xx*

3) Split the input file items.txt such that the line containing magical and all the lines that come after are part of the single output file.

##### add your solution here

$ cat xx00
3) magical beasts
dragon 3
unicorn 42

$ rm xx00

4) Split the input file items.txt such that the line containing colors as well the line that comes after are part of the first output file.

##### add your solution here

$ head xx*
==> xx00 <==
1) fruits
apple 5
banana 10
2) colors
green

==> xx01 <==
sky blue
3) magical beasts
dragon 3
unicorn 42

$ rm xx*

5) Split the input file items.txt on the line that comes before a line containing magical. Generate only a single output file as shown below.

##### add your solution here

$ cat xx00
sky blue
3) magical beasts
dragon 3
unicorn 42

$ rm xx00

6) Split the input file blocks.txt on the 4th occurrence of a line starting with the - character. Generate only a single output file as shown below.

##### add your solution here

$ cat xx00
----
sky blue
dark green
----
hi hello

$ rm xx00

7) For the input file blocks.txt, determine the logic to produce the expected output shown below.

##### add your solution here

$ head xx*
==> xx00 <==
apple--banana
mango---fig

==> xx01 <==
3.14
-42
1000

==> xx02 <==
sky blue
dark green

==> xx03 <==
hi hello

$ rm xx*

8) What does the -k option do?

9) Split the books.txt file on every line as shown below.

##### add your solution here
csplit: ‘1’: line number out of range on repetition 3

$ head row_*
==> row_0 <==
Cradle:::Mage Errant::The Weirkey Chronicles

==> row_1 <==
Mother of Learning::Eight:::::Dear Spellbook:Ascendant

==> row_2 <==
Mark of the Fool:Super Powereds:::Ends of Magic

$ rm row_*

10) Split the items.txt file on lines starting with a digit character. Matching lines shouldn't be part of the output and the files should be named group_0.txt, group_1.txt and so on.

##### add your solution here

$ head group_*
==> group_0.txt <==
apple 5
banana 10

==> group_1.txt <==
green
sky blue

==> group_2.txt <==
dragon 3
unicorn 42

$ rm group_*

CLI text processing with GNU Coreutils