csplit

The csplit command divides the input into smaller parts based on line numbers and regular expression patterns. Like split, this command also supports customizing the output filenames.

info Since a lot of output files will be generated in this chapter (often with the same filenames), remove these files after every illustration.

Split on Nth line

You can split the input into two based on a particular line number. To do so, specify the line number after the input source (filename or stdin data). The first output file will have the input lines before the given line number and the second output file will have the rest of the contents.

By default, the output files will be named xx00, xx01, xx02 and so on (where xx is the prefix). The numerical suffix will automatically use more digits if needed. You'll see examples with more than two output files later.

# split input into two based on line number 4
$ seq 10 | csplit - 4
6
15

# first output file will have the first 3 lines
# second output file will have the rest
$ head xx*
==> xx00 <==
1
2
3

==> xx01 <==
4
5
6
7
8
9
10

$ rm xx*

info As seen in the example above, csplit will also display the number of bytes written for each output file. You can use the -q option to suppress this message.
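
Here's a minimal sketch of the same split run from a script, with the byte counts silenced by -q. The scratch directory created with mktemp and the nums.txt filename are assumptions made for this illustration.

```shell
# Work in a scratch directory so stray xx* files don't pile up
# (the mktemp directory and nums.txt are illustrative assumptions)
cd "$(mktemp -d)" || exit 1
seq 10 > nums.txt

# -q suppresses the per-file byte counts; the split itself is unchanged
csplit -q nums.txt 4

# xx00 gets lines 1-3 and xx01 gets lines 4-10
first=$(wc -l < xx00)
second=$(wc -l < xx01)
echo "xx00: $first lines, xx01: $second lines"
```

Counting lines with wc is just one way to check the result without printing the whole files.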

warning As mentioned earlier, remove the output files after every illustration.

Split on regexp

You can also split the input based on a line matching the given regular expression. The output depends on whether the regexp is surrounded by // or %% delimiters.

When /regexp/ is used, output is similar to the line number based splitting. The first output file will have the input lines before the first occurrence of a line matching the given regexp and the second output file will have the rest of the contents.

# match a line containing 't' followed by zero or more characters and then 'p'
# 'toothpaste' is the only match for this input file
$ csplit -q purchases.txt '/t.*p/'

$ head xx*
==> xx00 <==
coffee
tea
washing powder
coffee

==> xx01 <==
toothpaste
tea
soap
tea

When %regexp% is used, the lines occurring before the matching line won't be part of the output. Only the line matching the given regexp and the rest of the contents will be part of the single output file.

$ csplit -q purchases.txt '%t.*p%'

$ cat xx00
toothpaste
tea
soap
tea
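
The two delimiter forms can be compared side by side, as in the rough sketch below. It assumes purchases.txt holds the eight lines seen in the outputs above, and it uses the -f option (covered later in this chapter) with the made-up prefixes keep_ and skip_ so that the two runs don't overwrite each other.

```shell
cd "$(mktemp -d)" || exit 1
# Assumed contents of purchases.txt, reconstructed from the outputs above
printf '%s\n' coffee tea 'washing powder' coffee toothpaste tea soap tea > purchases.txt

# /regexp/ saves the lines before the match in a file of their own
csplit -q -f 'keep_' purchases.txt '/t.*p/'
# %regexp% discards the lines before the match
csplit -q -f 'skip_' purchases.txt '%t.*p%'

# three files in total: keep_00, keep_01 and skip_00
ls keep_* skip_*
# the file starting at the matching line is the same in both cases
cmp keep_01 skip_00 && echo 'same content'
```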

warning You'll get an error if the given regexp isn't found in the input.

$ csplit -q purchases.txt '/xyz/'
csplit: ‘/xyz/’: match not found

info See the Regular Expressions chapter from my GNU grep ebook if you want to learn about regexp syntax and features.

Regexp offset

You can also add an offset number after the closing delimiter to shift the split point relative to the matching line. When the offset is greater than zero, the split will happen that many lines after the matching line. The default offset is zero.

# when the offset is '1', matching line will be part of the first file
$ csplit -q purchases.txt '/t.*p/1'
$ head xx*
==> xx00 <==
coffee
tea
washing powder
coffee
toothpaste

==> xx01 <==
tea
soap
tea

$ rm xx*

# matching line and 1 line after won't be part of the output
$ csplit -q purchases.txt '%t.*p%2'
$ cat xx00
soap
tea

When the offset is less than zero, the split will happen that many lines before the matching line.

# 2 lines before the matching line will be part of the second file
$ csplit -q purchases.txt '/t.*p/-2'
$ head xx*
==> xx00 <==
coffee
tea

==> xx01 <==
washing powder
coffee
toothpaste
tea
soap
tea

warning You'll get an error if the offset goes beyond the number of lines available in the input.

$ csplit -q purchases.txt '/t.*p/5'
csplit: ‘/t.*p/5’: line number out of range

$ csplit -q purchases.txt '/t.*p/-5'
csplit: ‘/t.*p/-5’: line number out of range
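
One way to reason about offsets is to treat the split point as the matching line number plus the offset. The sketch below checks that idea for the '/t.*p/-2' example by comparing it with a plain line number split. The grep lookup, the re_ and ln_ prefixes and the assumed contents of purchases.txt are illustrative choices, not something csplit requires.

```shell
cd "$(mktemp -d)" || exit 1
# Assumed contents of purchases.txt, reconstructed from the outputs above
printf '%s\n' coffee tea 'washing powder' coffee toothpaste tea soap tea > purchases.txt

# line number of the first line matching the regexp (5 for this input)
match=$(grep -n -m1 't.*p' purchases.txt | cut -d: -f1)
# split point = matching line number + offset
split_at=$((match - 2))

csplit -q -f 're_' purchases.txt '/t.*p/-2'
csplit -q -f 'ln_' purchases.txt "$split_at"

echo "match on line $match, split at line $split_at"
cmp re_00 ln_00 && cmp re_01 ln_01 && echo 'identical pieces'
```

By the same arithmetic, the two failing commands above ask for a split at line 10 and line 0 respectively, neither of which exists in an eight line input.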

Repeat split

You can perform line number and regexp based splits more than once by adding a {N} argument after the pattern. The default behavior seen in the examples so far is the same as specifying {0}. Any number greater than zero will result in that many more splits.

# {1} means split one time more than the default split
# so, two splits in total and three output files
# in this example, the splits happen at line numbers 4 and 8
$ seq 10 | csplit -q - 4 '{1}'

$ head xx*
==> xx00 <==
1
2
3

==> xx01 <==
4
5
6
7

==> xx02 <==
8
9
10

Here's an example with regexp:

$ cat log.txt 
--> warning 1
a,b,c,d
42
--> warning 2
x,y,z
--> warning 3
4,3,1

# split on 3rd (2+1) occurrence of a line containing 'warning'
$ csplit -q log.txt '%warning%' '{2}'
$ cat xx00 
--> warning 3
4,3,1

As a special case, you can use {*} to repeat the split until the input is exhausted. This is especially useful with the /regexp/ form of splitting. Here's an example:

# split on all lines matching 'paste' or 'powder'
$ csplit -q purchases.txt '/paste\|powder/' '{*}'
$ head xx*
==> xx00 <==
coffee
tea

==> xx01 <==
washing powder
coffee

==> xx02 <==
toothpaste
tea
soap
tea

warning You'll get an error if the repeat count goes beyond the number of matches possible with the given input.

$ seq 10 | csplit -q - 4 '{2}'
csplit: ‘4’: line number out of range on repetition 2

$ csplit -q purchases.txt '/tea/' '{4}'
csplit: ‘/tea/’: match not found on repetition 3
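
If you'd rather not guess the repeat count, you could derive it from the number of matching lines. The sketch below is one possible approach using grep -c. It assumes the input has at least one matching line and that purchases.txt has the contents seen in earlier outputs.

```shell
cd "$(mktemp -d)" || exit 1
# Assumed contents of purchases.txt, reconstructed from earlier outputs
printf '%s\n' coffee tea 'washing powder' coffee toothpaste tea soap tea > purchases.txt

# number of lines matching the regexp (3 for this input)
matches=$(grep -c 'tea' purchases.txt)
# the default split already uses one match, so the largest usable count is matches-1
repeat=$((matches - 1))

csplit -q purchases.txt '/tea/' "{$repeat}" && echo "split ok with {$repeat}"
ls xx*
```

For the /regexp/ form, {*} achieves the same result without any counting. The explicit count is mainly of interest when you want to stop after a known number of splits.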

Keep files on error

By default, csplit will remove the created output files if there's an error or a signal that causes the command to stop. You can use the -k option to keep such files. One use case is line number based splitting with the {*} modifier.

$ seq 10 | csplit -q - 4 '{*}'
csplit: ‘4’: line number out of range on repetition 2
$ ls xx*
ls: cannot access 'xx*': No such file or directory

# -k option will allow you to retain the created files
$ seq 10 | csplit -qk - 4 '{*}'
csplit: ‘4’: line number out of range on repetition 2
$ head xx*
==> xx00 <==
1
2
3

==> xx01 <==
4
5
6
7

==> xx02 <==
8
9
10
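
In a script, keep in mind that the command still fails even though -k retains the files. Here's a small sketch that captures the exit status and counts the retained files. The status and count variable names are arbitrary choices for this illustration.

```shell
cd "$(mktemp -d)" || exit 1

# split at lines 4, 8, 12 and so on until the input runs out
# the last repetition fails, so csplit reports an error (hidden here)
seq 10 | csplit -qk - 4 '{*}' 2>/dev/null
status=$?

# -k keeps the files written before the error
count=$(ls xx* | wc -l)
echo "exit status: $status, files kept: $count"
```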

Suppress matched lines

The --suppress-matched option will suppress the lines matching the split condition.

$ seq 5 | csplit -q --suppress-matched - 3
# 3rd line won't be part of the output
$ head xx*
==> xx00 <==
1
2

==> xx01 <==
4
5

$ rm xx*

$ seq 10 | csplit -q --suppress-matched - 4 '{1}'
# 4th and 8th lines won't be part of the output
$ head xx*
==> xx00 <==
1
2
3

==> xx01 <==
5
6
7

==> xx02 <==
9
10

Here's an example with regexp based split:

$ csplit -q --suppress-matched purchases.txt '/soap\|powder/' '{*}'
# lines matching 'soap' or 'powder' won't be part of the output
$ head xx*
==> xx00 <==
coffee
tea

==> xx01 <==
coffee
toothpaste
tea

==> xx02 <==
tea
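
A typical use of this option is breaking up records that are separated by marker lines. The sketch below uses a hypothetical records.txt file with --- as the separator. Both the filename and the separator are assumptions made for this illustration.

```shell
cd "$(mktemp -d)" || exit 1
# hypothetical input: three records separated by '---' lines
printf '%s\n' a b --- c --- d e > records.txt

# split at every separator line and drop the separators themselves
csplit -q --suppress-matched records.txt '/^---$/' '{*}'

# xx00 has 'a' and 'b', xx01 has 'c', xx02 has 'd' and 'e'
head xx*
```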

warning Suppressing matched lines for a regexp based split gives unexpected results unless {*} is used. See this bug report for more details. This bug has been fixed in coreutils version 9.0. The examples below illustrate the buggy behavior.

$ seq 11 14 | csplit -q --suppress-matched - '/3/'
# the matching line wasn't suppressed
$ head xx*
==> xx00 <==
11
12

==> xx01 <==
13
14

$ rm xx*

$ seq 11 16 | csplit -q --suppress-matched - '/[35]/' '{1}'
# the first matching line was correctly suppressed
# but the second matching line wasn't suppressed
$ head xx*
==> xx00 <==
11
12

==> xx01 <==
14

==> xx02 <==
15
16

Exclude empty files

There are various cases that can result in empty output files. One example is the first or last line of the input matching the given split condition. Another is the --suppress-matched option combined with consecutive matching lines during multiple splits. Here's an example:

$ csplit -q --suppress-matched purchases.txt '/coffee\|tea/' '{*}'

$ head xx*
==> xx00 <==

==> xx01 <==

==> xx02 <==
washing powder

==> xx03 <==
toothpaste

==> xx04 <==
soap

==> xx05 <==

You can use the -z option to exclude empty files from the output. The suffix numbering will be automatically adjusted in such cases.

$ csplit -qz --suppress-matched purchases.txt '/coffee\|tea/' '{*}'

$ head xx*
==> xx00 <==
washing powder

==> xx01 <==
toothpaste

==> xx02 <==
soap
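
The first line matching the split condition is probably the most common source of an empty file. Here's a sketch of that case using the log.txt contents shown earlier in this chapter. The all_ and nz_ prefixes (set with the -f option) are made-up names to keep the two runs apart.

```shell
cd "$(mktemp -d)" || exit 1
# contents of log.txt as shown earlier in this chapter
printf '%s\n' '--> warning 1' 'a,b,c,d' '42' \
    '--> warning 2' 'x,y,z' '--> warning 3' '4,3,1' > log.txt

# the first line matches, so the piece before it (all_00) is empty
csplit -q -f 'all_' log.txt '/warning/' '{*}'
# -z drops that empty piece and the numbering starts from the first warning
csplit -qz -f 'nz_' log.txt '/warning/' '{*}'

# 4 files without -z, 3 files with -z
ls all_* | wc -l
ls nz_* | wc -l
head -n 1 nz_00
```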

Customize filenames

As seen earlier, xx is the default prefix for output filenames. Use the -f option to change this prefix.

$ seq 4 | csplit -q -f'num_' - 3

$ head num_*
==> num_00 <==
1
2

==> num_01 <==
3
4

The -n option controls the length of the numeric suffix. The suffix length will automatically increment if filenames are exhausted.

$ seq 4 | csplit -q -n1 - 3
$ ls xx*
xx0  xx1
$ rm xx*

$ seq 4 | csplit -q -n3 - 3
$ ls xx*
xx000  xx001

The -b option allows you to control the suffix using printf formatting. Quoting from the manual:

When this option is specified, the suffix string must include exactly one printf(3)-style conversion specification, possibly including format specification flags, a field width, a precision specifications, or all of these kinds of modifiers. The format letter must convert a binary unsigned integer argument to readable form. The format letters d and i are aliases for u, and the u, o, x, and X conversions are allowed.

Here are some examples:

# hexadecimal numbering
# minimum two digits, zero filled
$ seq 100 | csplit -q -b'%02x' - 3 '{20}'
$ ls xx*
xx00  xx02  xx04  xx06  xx08  xx0a  xx0c  xx0e  xx10  xx12  xx14
xx01  xx03  xx05  xx07  xx09  xx0b  xx0d  xx0f  xx11  xx13  xx15
$ rm xx*

# custom prefix and suffix around decimal numbering
# default minimum is single digit
$ seq 20 | csplit -q -f'num_' -b'%d.txt' - 3 '{4}'
$ ls num_*
num_0.txt  num_1.txt  num_2.txt  num_3.txt  num_4.txt  num_5.txt

info Note that the -b option will override the -n option. See man 3 printf for more details about the formatting options.
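
Here's one more sketch that combines a custom prefix with a zero padded decimal suffix and a file extension. The chunk_ prefix and the .log extension are arbitrary names chosen for this illustration.

```shell
cd "$(mktemp -d)" || exit 1

# prefix comes from -f
# three-digit zero padded number and the extension come from -b
seq 6 | csplit -q -f 'chunk_' -b '%03d.log' - 3 '{1}'

# chunk_000.log  chunk_001.log  chunk_002.log
ls chunk_*
```

Zero padded suffixes keep the output files in numeric order when they are listed or passed to other commands via a glob.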