Record separators

So far, you've seen examples where Perl automatically splits input line by line based on the newline character. Just as you can control how those lines are further split into fields using the -a and -F options (among other features), Perl provides a way to control what constitutes a line in the first place. The term record is used to describe the contents that get placed in the $_ special variable with the -n or -p options.

The example_files directory has all the files used in the examples.

Input record separator

By default, the newline character is used as the input record separator. You can change the $/ special variable to specify a different input record separator. Unlike field separators, $/ accepts only a string value; a regexp isn't allowed. See perldoc faq: I put a regular expression into $/ but it didn't work. What's wrong? for workarounds.

# change the input record separator to a comma character
# note the content of the 2nd record where newline is just another character
# by default, the record separator stays with the record contents
$ printf 'this,is\na,sample,text' | perl -nE 'BEGIN{$/ = ","} say "$.)$_"'
1)this,
2)is
a,
3)sample,
4)text

# use the -l option to chomp the record separator
$ printf 'this,is\na,sample,text' | perl -lne 'BEGIN{$/ = ","} print "$.)$_"'
1)this
2)is
a
3)sample
4)text

Here's an example where the record separator has multiple characters:

$ cat report.log
blah blah Error: second record starts
something went wrong
some more details Error: third record
details about what went wrong

# uses 'Error:' as the input record separator
# print all the records containing 'something'
$ perl -lne 'BEGIN{$/ = "Error:"} print if /something/' report.log
 second record starts
something went wrong
some more details 

Single character separator with the -0 option

The -0 command line option can be used to specify a single character record separator, represented with zero to three octal digits. You can also use a hexadecimal value. Quoting from perldoc: -0 option:

You can also specify the separator character using hexadecimal notation: -0xHHH..., where the H are valid hexadecimal digits. Unlike the octal form, this one may be used to specify any Unicode character, even those beyond 0xFF. So if you really want a record separator of 0777, specify it as -0x1FF. (This means that you cannot use the -x option with a directory name that consists of hexadecimal digits, or else Perl will think you have specified a hex number to -0.)

$ s='this:is:a:sample:string'

# the : character is represented by 072 in octal
# -l is used here to chomp the separator
$ echo "$s" | perl -0072 -lnE 'say "$.) $_"'
1) this
2) is
3) a
4) sample
5) string

# print all records containing 'a'
$ echo "$s" | perl -0072 -lnE 'say $_ if /a/'
a
sample

The string that print appends when the -l option is used depends on the value of the input record separator at the point where -l is processed. Here are some examples to clarify this point.

$ s='this:is:a:sample:string'

# here, the record separator is still the default \n when -l is used
# so \n gets appended when 'print' is used
# note that chomp isn't affected by such differences in order
# same as: echo "$s" | perl -lne 'BEGIN{$/=":"} print if /a/'
$ echo "$s" | perl -l -0072 -ne 'print if /a/'
a
sample

# here -l is defined after -0, so : gets appended for 'print'
$ echo "$s" | perl -0072 -lne 'print if /a/'
a:sample:

By default, the -a option will split the input record based on whitespace and remove leading/trailing whitespace. Now that you've seen how the input record separator can be something other than newline, here's an example to show the full effect of the default record splitting.

# ':' character is the input record separator here
$ s='   a\t\tb:1000\n\n\t \n\n123 7777:x  y \n \n z  :apple banana cherry'
$ printf '%b' "$s" | perl -0072 -lanE 'say join ",", @F'
a,b
1000,123,7777
x,y,z
apple,banana,cherry

NUL separator

If the -0 option is used without an argument, the ASCII NUL character will be considered as the input record separator.

$ printf 'apple\0banana\0' | cat -v
apple^@banana^@

# can also be golfed to: perl -lp0e ''
# don't use -l0 as 0 will be treated as an argument to -l
$ printf 'apple\0banana\0' | perl -ln0e 'print'
apple
banana

Slurping entire input

Any octal value of 400 and above will cause the entire input to be slurped as a single string. Idiomatically, 777 is used. This is the same as setting $/ = undef. Slurping the entire file makes it easier to solve some problems, but be careful not to use it for large files, as that might cause memory issues.

$ cat paths.txt
/home/joe/report.log
/home/ram/power.log
/home/rambo/errors.log
$ perl -0777 -pe 's|(?<!\A)/.+/|/|s' paths.txt
/home/errors.log

# replicate entire input as many times as needed
$ seq 2 | perl -0777 -ne 'print $_ x 2'
1
2
1
2

As an alternative, Perl 5.36 introduced the -g option for slurping the entire input.

$ seq 2 | perl -gne 'print $_ x 2'
1
2
1
2

Paragraph mode

As a special case, using -00 or setting $/ to an empty string will invoke paragraph mode. Two or more consecutive newline characters will act as the record separator. Consider the below sample file:

$ cat para.txt
Hello World

Hi there
How are you

Just do-it
Believe it

banana
papaya
mango

Much ado about nothing
He he he
Adios amigo

Here are some examples of processing the input file paragraph wise.

# all paragraphs containing 'do'
# note that the record separator is preserved as there's no chomp
$ perl -00 -ne 'print if /do/' para.txt
Just do-it
Believe it

Much ado about nothing
He he he
Adios amigo

# all paragraphs containing exactly two lines
# note that there's an empty line after the last line
$ perl -F'\n' -00 -ane 'print if $#F == 1' para.txt
Hi there
How are you

Just do-it
Believe it

If the paragraphs are separated by more than two consecutive newlines, the extra newlines will not be part of the record content.

$ s='a\n\n\n\n\n\n\n\n12\n34\n\nhi\nhello\n'

# note that the -l option isn't being used here
$ printf '%b' "$s" | perl -00 -ne 'print if $. <= 2'
a

12
34

Any leading newlines (only newlines, not other whitespace characters) in the input data file will be trimmed and will not lead to empty records. This is similar to how -a treats whitespace for default field separation.

$ s='\n\n\na\nb\n\n12\n34\n\nhi\nhello\n\n\n\n'

# note that -l is used to chomp the record separator here
$ printf '%b' "$s" | perl -00 -lnE 'say "$_\n---" if $. == 1'
a
b
---

# max. of two trailing newlines will be preserved if -l isn't used
$ printf '%b' "$s" | perl -00 -lnE 'say "$_\n---" if eof'
hi
hello
---

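# paragraph mode: leading and extra newlines don't create empty records
$ printf '%b' "$s" | perl -00 -nE 'END{say $.}'
3
# a literal "\n\n" separator (set in BEGIN, overriding -00) gets no such treatment
$ printf '%b' "$s" | perl -00 -nE 'BEGIN{$/="\n\n"}; END{say $.}'
5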

The empty line at the end is a common problem when dealing with custom record separators. You could either process the output further to remove it or add extra logic to handle the issue. Here's one possible workaround:

# single paragraph output, no empty line at the end
$ perl -l -00 -ne 'if(/are/){print $s, $_; $s="\n"}' para.txt
Hi there
How are you

# multiple paragraph output with an empty line between the paragraphs
$ perl -l -00 -ne 'if(/are|an/){print $s, $_; $s="\n"}' para.txt
Hi there
How are you

banana
papaya
mango

Output record separator

Similar to the -0 option used for setting the input record separator, you can use the -l option to specify a single character output record separator by passing an octal value as the argument.

# comma as output record separator, won't have a newline at the end
# note that -l also chomps the input record separator
$ seq 8 | perl -l054 -ne 'print if /[24]/'
2,4,

# null separator
$ seq 8 | perl -l0 -ne 'print if /[24]/' | cat -v
2^@4^@

# adding a final newline to output
$ seq 8 | perl -l054 -nE 'print if /[24]/; END{say}'
2,4,

You can use the $\ special variable to specify a multicharacter string that gets appended to the output of the print function. This will override changes due to the -l option, if any.

# recall that the input record separator isn't removed by default
$ seq 2 | perl -ne 'print'
1
2
# this will add four more characters after the already present newline
# same as: perl -pe 'BEGIN{$\ = "---\n"}'
$ seq 2 | perl -ne 'BEGIN{$\ = "---\n"} print'
1
---
2
---

# change the NUL character to a dot and newline characters
# -l here helps to chomp the NUL character 
# -l also sets NUL to be added to print, but gets overridden in BEGIN block
$ printf 'apple\0banana\0' | perl -0lpe 'BEGIN{$\ = ".\n"}'
apple.
banana.

Often, you'd need to change the output record separator depending upon the contents of the input record or some other condition. The cond ? expr1 : expr2 ternary operator is often used in such scenarios. The below example assumes that the input is evenly divisible; you'll have to add more logic if that is not the case.

# same as: perl -pe 's/\n/-/ if $. % 3'
$ seq 6 | perl -lpe '$\ = $. % 3 ? "-" : "\n"'
1-2-3
4-5-6

Summary

This chapter showed you how to change the way the input content is split into records and how to set the string to be appended when print is used. The paragraph mode is useful for processing multiline records separated by one or more empty lines. You also learned how to set ASCII NUL as the record separator and how to slurp the entire input as a single string.

Exercises

The exercises directory has all the files used in this section.

1) The input file jumbled.txt consists of words separated by various delimiters. Display all words that contain an or at or in or it, one per line.

$ cat jumbled.txt
overcoats;furrowing-typeface%pewter##hobby
wavering:concession/woof\retailer
joint[]seer{intuition}titanic

##### add your solution here
overcoats
furrowing
wavering
joint
intuition
titanic

2) Emulate paste -sd, with Perl.

# this command joins all input lines with the ',' character
$ paste -sd, ip.txt
Hello World,How are you,This game is good,Today is sunny,12345,You are funny
# make sure there's no ',' at the end of the line
# and that there's a newline character at the end of the line
##### add your solution here
Hello World,How are you,This game is good,Today is sunny,12345,You are funny

# if there's only one line in input, again make sure there's no trailing ','
# and that there's a newline character at the end of the line
$ printf 'fig' | paste -sd,
fig
$ printf 'fig' | ##### add your solution here
fig

3) For the input file sample.txt, extract all paragraphs having words starting with do.

$ cat sample.txt
Hello World

Good day
How are you

Just do-it
Believe it

Today is sunny
Not a bit funny
No doubt you like it too

Much ado about nothing
He he he

# note that there's no extra empty line at the end of the output
##### add your solution here
Just do-it
Believe it

Today is sunny
Not a bit funny
No doubt you like it too

4) For the input file sample.txt, change each paragraph to a single line by joining lines using . and a space character as the separator. Also, add a final . to each paragraph.

# note that there's no extra empty line at the end of the output
##### add your solution here
Hello World.

Good day. How are you.

Just do-it. Believe it.

Today is sunny. Not a bit funny. No doubt you like it too.

Much ado about nothing. He he he.

5) For the given input, use ;; as the record separator and : as the field separator. Filter records whose second field is greater than 50.

$ s='mango:100;;apple:25;;grapes:75'

# note that the output has ;; at the end, not a newline character
$ printf "$s" | ##### add your solution here
mango:100;;grapes:75;; 

6) The input file f1.txt has a varying number of empty lines between the records. Change them to always be two empty lines. Also, remove the empty lines at the start and end of the file.

##### add your solution here
hello


world


apple
banana
cherry


tea coffee
chocolate

7) The sample string shown below uses cat as the record separator. Display only the even numbered records separated by a single empty line.

$ s='applecatfigcat12345catbananacatguava:cat:mangocat3'
$ echo "$s" | ##### add your solution here
fig

banana

:mango