csplit
The csplit command is useful for dividing the input into smaller parts based on line numbers and regular expression patterns. Similar to split, this command also supports customizing output filenames.
Since a lot of output files will be generated in this chapter (often with the same filenames), remove these files after every illustration.
Split on Nth line
You can split the input into two based on a particular line number. To do so, specify the line number after the input source (filename or stdin data). The first output file will have the input lines before the given line number and the second output file will have the rest of the contents.
By default, the output files will be named xx00, xx01, xx02, and so on (where xx is the prefix). The numerical suffix will automatically use more digits if needed. You'll see examples with more than two output files later.
# split input into two based on line number 4
$ seq 10 | csplit - 4
6
15
# first output file will have the first 3 lines
# second output file will have the rest
$ head xx*
==> xx00 <==
1
2
3
==> xx01 <==
4
5
6
7
8
9
10
$ rm xx*
As seen in the example above, csplit will also display the number of bytes written for each output file. You can use the -q option to suppress this message.
As mentioned earlier, remove the output files after every illustration.
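The line number split and the byte counts can also be checked from a script. The following is a minimal sketch, assuming GNU csplit is installed; the scratch directory and the variable names (workdir, sizes, first, second) are illustrative and not part of csplit itself.

```shell
# work in a scratch directory so leftover xx* files can't interfere
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# split five lines at line 3; without -q, csplit prints
# the number of bytes written to each output file
sizes=$(printf '%s\n' a b c d e | csplit - 3)

# xx00 has the lines before line 3, xx01 has line 3 onwards
first=$(cat xx00)
second=$(cat xx01)

# remove the output files by discarding the scratch directory
cd - >/dev/null && rm -r "$workdir"
```

Here, sizes holds 4 and 6 on separate lines, since each input line is one character plus a newline.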
Split on regexp
You can also split the input based on a line matching the given regular expression. The output produced will vary based on the // or %% delimiters being used to surround the regexp.
When /regexp/ is used, the output is similar to line number based splitting. The first output file will have the input lines before the first occurrence of a line matching the given regexp and the second output file will have the rest of the contents.
# match a line containing 't' followed by zero or more characters and then 'p'
# 'toothpaste' is the only match for this input file
$ csplit -q purchases.txt '/t.*p/'
$ head xx*
==> xx00 <==
coffee
tea
washing powder
coffee
==> xx01 <==
toothpaste
tea
soap
tea
When %regexp% is used, the lines occurring before the matching line won't be part of the output. Only the line matching the given regexp and the rest of the contents will be part of the single output file.
$ csplit -q purchases.txt '%t.*p%'
$ cat xx00
toothpaste
tea
soap
tea
You'll get an error if the given regexp isn't found in the input.
$ csplit -q purchases.txt '/xyz/'
csplit: ‘/xyz/’: match not found
See the Regular Expressions chapter from my GNU grep ebook if you want to learn more about regexp syntax and features.
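The difference between the two delimiter styles, and the failure on a missing match, can be sketched in a script as follows. This assumes GNU csplit; the purchases.txt contents are reconstructed from the outputs shown above, and the variable names are illustrative.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# sample input reconstructed from the outputs shown above
printf '%s\n' coffee tea 'washing powder' coffee toothpaste tea soap tea > purchases.txt

# /regexp/ keeps the lines before the match in xx00
csplit -q purchases.txt '/t.*p/'
before=$(cat xx00)
rm xx*

# %regexp% skips them, so xx00 starts at the matching line
csplit -q purchases.txt '%t.*p%'
skipped_first=$(head -n1 xx00)
rm xx*

# a regexp with no match makes csplit exit with a non-zero status
if csplit -q purchases.txt '/xyz/' 2>/dev/null; then
    no_match=ok
else
    no_match=failed
fi

cd - >/dev/null && rm -r "$workdir"
```

Checking the exit status this way lets a script react to a missing match instead of relying on the error message.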
Regexp offset
You can also provide offset numbers that'll affect where the matching line and its surrounding lines should be placed. When the offset is greater than zero, the split will happen that many lines after the matching line. The default offset is zero.
# when the offset is '1', the matching line will be part of the first file
$ csplit -q purchases.txt '/t.*p/1'
$ head xx*
==> xx00 <==
coffee
tea
washing powder
coffee
toothpaste
==> xx01 <==
tea
soap
tea
# matching line and 1 line after won't be part of the output
$ csplit -q purchases.txt '%t.*p%2'
$ cat xx00
soap
tea
When the offset is less than zero, the split will happen that many lines before the matching line.
# 2 lines before the matching line will be part of the second file
$ csplit -q purchases.txt '/t.*p/-2'
$ head xx*
==> xx00 <==
coffee
tea
==> xx01 <==
washing powder
coffee
toothpaste
tea
soap
tea
You'll get an error if the offset goes beyond the number of lines available in the input.
$ csplit -q purchases.txt '/t.*p/5'
csplit: ‘/t.*p/5’: line number out of range
$ csplit -q purchases.txt '/t.*p/-5'
csplit: ‘/t.*p/-5’: line number out of range
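As a further illustration of offsets, here is a small sketch using a made-up report.txt file (the filename, its contents and the variable names are assumptions for this example only). It assumes GNU csplit.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# hypothetical input: two lines of preamble, a header line, then data
printf '%s\n' intro note '== data ==' 1 2 > report.txt

# offset 1: the split happens one line after the match,
# so the header line ends up as the last line of xx00
csplit -q report.txt '/data/1'
last_of_first=$(tail -n1 xx00)
rm xx*

# offset -1: the split happens one line before the match,
# so xx01 starts with the line preceding the header
csplit -q report.txt '/data/-1'
first_of_second=$(head -n1 xx01)
rm xx*

cd - >/dev/null && rm -r "$workdir"
```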
Repeat split
You can perform line number and regexp based splits more than once by adding the {N} argument after the pattern. The default behavior seen in the examples so far is the same as specifying {0}. Any number greater than zero will result in that many more splits.
# {1} means split one time more than the default split
# so, two splits in total and three output files
# in this example, split happens on the 4th and 8th line numbers
$ seq 10 | csplit -q - 4 '{1}'
$ head xx*
==> xx00 <==
1
2
3
==> xx01 <==
4
5
6
7
==> xx02 <==
8
9
10
Here's an example with regexp:
$ cat log.txt
--> warning 1
a,b,c,d
42
--> warning 2
x,y,z
--> warning 3
4,3,1
# split on the third (2+1) occurrence of a line containing 'warning'
$ csplit -q log.txt '%warning%' '{2}'
$ cat xx00
--> warning 3
4,3,1
As a special case, you can use {*} to repeat the split until the input is exhausted. This is especially useful with the /regexp/ form of splitting. Here's an example:
# split on all lines matching 'paste' or 'powder'
$ csplit -q purchases.txt '/paste\|powder/' '{*}'
$ head xx*
==> xx00 <==
coffee
tea
==> xx01 <==
washing powder
coffee
==> xx02 <==
toothpaste
tea
soap
tea
You'll get an error if the repeat count goes beyond the number of matches possible with the given input.
$ seq 10 | csplit -q - 4 '{2}'
csplit: ‘4’: line number out of range on repetition 2
$ csplit -q purchases.txt '/tea/' '{4}'
csplit: ‘/tea/’: match not found on repetition 3
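The relation between the repeat count and the number of output files can be sketched as follows, assuming GNU csplit. The inputs are generated with seq and the variable names are illustrative.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# {2} with a line number: split at lines 3, 6 and 9, giving four files
seq 10 | csplit -q - 3 '{2}'
count_fixed=$(ls xx* | wc -l)
third=$(cat xx02)
rm xx*

# {*} with a regexp: split at every line containing 3, 6 or 9
seq 10 | csplit -q - '/[369]/' '{*}'
count_regexp=$(ls xx* | wc -l)
last=$(cat xx03)
rm xx*

cd - >/dev/null && rm -r "$workdir"
```

Both commands produce the same four pieces for this input, but only the {*} form adapts automatically when the number of matching lines changes.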
Keep files on error
By default, csplit will remove the created output files if there's an error or a signal that causes the command to stop. You can use the -k option to keep such files. One use case is line number based splitting with the {*} modifier.
$ seq 7 | csplit -q - 4 '{*}'
csplit: ‘4’: line number out of range on repetition 1
$ ls xx*
ls: cannot access 'xx*': No such file or directory
# -k option will allow you to retain the created files
$ seq 7 | csplit -qk - 4 '{*}'
csplit: ‘4’: line number out of range on repetition 1
$ head xx*
==> xx00 <==
1
2
3
==> xx01 <==
4
5
6
7
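Since the error is expected in this use case, a script may want to silence the message and inspect the exit status instead. A minimal sketch, assuming GNU csplit (variable names are illustrative):

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# split points are lines 3, 6, 9, ... so csplit eventually runs out of
# input; -k keeps the files written so far, stderr is discarded
seq 7 | csplit -qk - 3 '{*}' 2>/dev/null
status=$?

# the files survive despite the non-zero exit status
kept=$(ls xx* | wc -l)
last_piece=$(cat xx02)

cd - >/dev/null && rm -r "$workdir"
```

Note that the first file gets the two lines before line 3, while the later files start at each split point.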
Suppress matched lines
The --suppress-matched option will suppress the lines matching the split condition.
$ seq 5 | csplit -q --suppress-matched - 3
# 3rd line won't be part of the output
$ head xx*
==> xx00 <==
1
2
==> xx01 <==
4
5
$ rm xx*
$ seq 10 | csplit -q --suppress-matched - 4 '{1}'
# 4th and 8th lines won't be part of the output
$ head xx*
==> xx00 <==
1
2
3
==> xx01 <==
5
6
7
==> xx02 <==
9
10
Here's an example with regexp based split:
$ csplit -q --suppress-matched purchases.txt '/soap\|powder/' '{*}'
# lines matching 'soap' or 'powder' won't be part of the output
$ head xx*
==> xx00 <==
coffee
tea
==> xx01 <==
coffee
toothpaste
tea
==> xx02 <==
tea
Here's another example:
$ seq 11 16 | csplit -q --suppress-matched - '/[35]/' '{1}'
# lines matching '3' or '5' won't be part of the output
$ head xx*
==> xx00 <==
11
12
==> xx01 <==
14
==> xx02 <==
16
$ rm xx*
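One plausible use of this option is splitting text into paragraphs separated by single empty lines, dropping the separators. The sketch below assumes GNU csplit; the input and variable names are made up for illustration.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# three paragraphs separated by single empty lines
printf '%s\n' a b '' c '' d e > paras.txt

# split on every empty line and drop the empty lines themselves
csplit -q --suppress-matched paras.txt '/^$/' '{*}'
count=$(ls xx* | wc -l)
para1=$(cat xx00)
para2=$(cat xx01)
para3=$(cat xx02)

cd - >/dev/null && rm -r "$workdir"
```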
Exclude empty files
There are various cases that can result in empty output files. For example, when the first or last line matches the given split condition. Another possibility is the --suppress-matched option combined with consecutive lines matching during multiple splits. Here's an example:
$ csplit -q --suppress-matched purchases.txt '/coffee\|tea/' '{*}'
$ head xx*
==> xx00 <==
==> xx01 <==
==> xx02 <==
washing powder
==> xx03 <==
toothpaste
==> xx04 <==
soap
==> xx05 <==
You can use the -z option to exclude empty files from the output. The suffix numbering will be automatically adjusted in such cases.
$ csplit -qz --suppress-matched purchases.txt '/coffee\|tea/' '{*}'
$ head xx*
==> xx00 <==
washing powder
==> xx01 <==
toothpaste
==> xx02 <==
soap
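The first-line case mentioned above can be sketched like this, assuming GNU csplit. The notes.txt file and the variable names are hypothetical.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# hypothetical input where the very first line matches the regexp
printf '%s\n' '# a' 1 2 '# b' 3 > notes.txt

# without -z, the (empty) text before the first match becomes xx00
csplit -q notes.txt '/^#/' '{*}'
count_all=$(ls xx* | wc -l)
if [ -s xx00 ]; then empty_first=no; else empty_first=yes; fi
rm xx*

# with -z, the empty file is dropped and numbering starts at the first header
csplit -qz notes.txt '/^#/' '{*}'
count_nonempty=$(ls xx* | wc -l)
first_header=$(head -n1 xx00)
rm xx*

cd - >/dev/null && rm -r "$workdir"
```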
Customize filenames
As seen earlier, xx is the default prefix for output filenames. Use the -f option to change this prefix.
$ seq 4 | csplit -q -f'num_' - 3
$ head num_*
==> num_00 <==
1
2
==> num_01 <==
3
4
The -n option controls the length of the numeric suffix. The suffix length will automatically increment if filenames are exhausted.
$ seq 4 | csplit -q -n1 - 3
$ ls xx*
xx0 xx1
$ rm xx*
$ seq 4 | csplit -q -n3 - 3
$ ls xx*
xx000 xx001
The -b option allows you to control the suffix using printf formatting. Quoting from the manual:
When this option is specified, the suffix string must include exactly one printf(3)-style conversion specification, possibly including format specification flags, a field width, a precision specifications, or all of these kinds of modifiers. The format letter must convert a binary unsigned integer argument to readable form. The format letters d and i are aliases for u, and the u, o, x, and X conversions are allowed.
Here are some examples:
# hexadecimal numbering
# minimum two digits, zero filled
$ seq 100 | csplit -q -b'%02x' - 3 '{20}'
$ ls xx*
xx00 xx02 xx04 xx06 xx08 xx0a xx0c xx0e xx10 xx12 xx14
xx01 xx03 xx05 xx07 xx09 xx0b xx0d xx0f xx11 xx13 xx15
$ rm xx*
# custom prefix and suffix around decimal numbering
# default minimum is a single digit
$ seq 20 | csplit -q -f'num_' -b'%d.txt' - 3 '{4}'
$ ls num_*
num_0.txt num_1.txt num_2.txt num_3.txt num_4.txt num_5.txt
Note that the -b option will override the -n option. See man 3 printf for more details about the formatting options.
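The interaction of these options can be sketched as follows, assuming GNU csplit. The chunk_ prefix, the .log extension and the variable names are arbitrary choices for this example.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# custom prefix plus a zero-padded, three-digit suffix with an extension
seq 4 | csplit -q -f 'chunk_' -b '%03d.log' - 3
names=$(ls chunk_*)
rm chunk_*

# -n4 has no effect here because -b takes over the suffix format
seq 4 | csplit -q -n4 -b '%d' - 3
plain=$(ls xx*)

cd - >/dev/null && rm -r "$workdir"
```

Putting the width in the -b format (as with %03d) is the way to get fixed-width suffixes once -b is in use.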
Exercises
The exercises directory has all the files used in this section.
Remove the output files after every exercise.
1) Split the blocks.txt file such that the first 7 lines are in the first file and the rest are in the second file as shown below.
##### add your solution here
$ head xx*
==> xx00 <==
----
apple--banana
mango---fig
----
3.14
-42
1000
==> xx01 <==
----
sky blue
dark green
----
hi hello
$ rm xx*
2) Split the input file items.txt such that the text before a line containing colors is part of the first file and the rest is part of the second file as shown below.
##### add your solution here
$ head xx*
==> xx00 <==
1) fruits
apple 5
banana 10
==> xx01 <==
2) colors
green
sky blue
3) magical beasts
dragon 3
unicorn 42
$ rm xx*
3) Split the input file items.txt such that the line containing magical and all the lines that come after are part of the single output file.
##### add your solution here
$ cat xx00
3) magical beasts
dragon 3
unicorn 42
$ rm xx00
4) Split the input file items.txt such that the line containing colors as well as the line that comes after are part of the first output file.
##### add your solution here
$ head xx*
==> xx00 <==
1) fruits
apple 5
banana 10
2) colors
green
==> xx01 <==
sky blue
3) magical beasts
dragon 3
unicorn 42
$ rm xx*
5) Split the input file items.txt on the line that comes before a line containing magical. Generate only a single output file as shown below.
##### add your solution here
$ cat xx00
sky blue
3) magical beasts
dragon 3
unicorn 42
$ rm xx00
6) Split the input file blocks.txt on the 4th occurrence of a line starting with the - character. Generate only a single output file as shown below.
##### add your solution here
$ cat xx00
----
sky blue
dark green
----
hi hello
$ rm xx00
7) For the input file blocks.txt, determine the logic to produce the expected output shown below.
##### add your solution here
$ head xx*
==> xx00 <==
apple--banana
mango---fig
==> xx01 <==
3.14
-42
1000
==> xx02 <==
sky blue
dark green
==> xx03 <==
hi hello
$ rm xx*
8) What does the -k option do?
9) Split the books.txt file on every line as shown below.
##### add your solution here
csplit: ‘1’: line number out of range on repetition 3
$ head row_*
==> row_0 <==
Cradle:::Mage Errant::The Weirkey Chronicles
==> row_1 <==
Mother of Learning::Eight:::::Dear Spellbook:Ascendant
==> row_2 <==
Mark of the Fool:Super Powereds:::Ends of Magic
$ rm row_*
10) Split the items.txt file on lines starting with a digit character. Matching lines shouldn't be part of the output and the files should be named group_0.txt, group_1.txt and so on.
##### add your solution here
$ head group_*
==> group_0.txt <==
apple 5
banana 10
==> group_1.txt <==
green
sky blue
==> group_2.txt <==
dragon 3
unicorn 42
$ rm group_*