Record separators
So far, you've seen examples where Ruby automatically splits input line by line based on the newline character. Just like you can control how those lines are further split into fields using the -a, -F options and other features, Ruby provides a way to control what constitutes a line in the first place. The term record is used to describe the content that gets placed in the $_ global variable.
The example_files directory has all the files used in the examples.
Input record separator
By default, the newline character is used as the input record separator. You can change the $/ global variable to specify a different input record separator. Unlike field separators, only a string value can be used here; a regexp isn't allowed.
# changing the input record separator to a comma character
# note the content of the second record where newline is just another character
# by default, the record separator stays with the record contents
$ printf 'this,is\na,sample,ok' | ruby -ne 'BEGIN{$/ = ","}; puts "#{$.})#{$_}"'
1)this,
2)is
a,
3)sample,
4)ok
# use the -l option to chomp the record separator
$ printf 'this,is\na,sample,ok' | ruby -lne 'BEGIN{$/ = ","}; puts "#{$.})#{$_}"'
1)this
2)is
a
3)sample
4)ok
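The same splitting can be sketched in a standalone script. This is only an illustration: split_records is a hypothetical helper name, and it relies on String#each_line accepting a separator string, which mirrors what changing $/ does for the one-liners above.

```ruby
# Hypothetical helper: split text into records on a custom separator.
# With strip_sep: true, the separator is removed from each record,
# which is similar in spirit to what the -l option does.
def split_records(text, sep, strip_sep: false)
  recs = text.each_line(sep).to_a
  strip_sep ? recs.map { |rec| rec.chomp(sep) } : recs
end

# the newline is just another character inside the second record
p split_records("this,is\na,sample,ok", ",")
p split_records("this,is\na,sample,ok", ",", strip_sep: true)
```

The first call keeps the trailing comma on each record, while the second one drops it.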
Here's an example where the record separator has multiple characters:
$ cat report.log
blah blah Error: second record starts
something went wrong
some more details Error: third record
details about what went wrong
# uses 'Error:' as the input record separator
# prints all the records containing 'something'
$ ruby -lne 'BEGIN{$/ = "Error:"}; print if /something/' report.log
second record starts
something went wrong
some more details
Single character separator with the -0 option
The -0 command line option can be used to specify a single character record separator, represented with zero to three octal digits.
$ s='this:is:a:sample:string'
# the : character is represented by 072 in octal
# -l is used here to chomp the separator
$ echo "$s" | ruby -0072 -lne 'puts "#{$.}) #{$_}"'
1) this
2) is
3) a
4) sample
5) string
# print all records containing 'a'
$ echo "$s" | ruby -0072 -lne 'puts $_ if /a/'
a
sample
The character that gets appended by the print method when the -l option is active is based on the value of the input record separator at that point. Here are some examples to clarify this point.
$ s='this:is:a:sample:string'
# here, the record separator is still the default \n when -l is used
# so \n gets appended when 'print' is used
# note that chomp isn't affected by such differences in order
# same as: echo "$s" | ruby -lne 'BEGIN{$/=":"}; print if /a/'
$ echo "$s" | ruby -l -0072 -ne 'print if /a/'
a
sample
# here -l is defined after -0, so : gets appended for 'print'
$ echo "$s" | ruby -0072 -lne 'print if /a/'
a:sample:
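The effect of the option order can be modeled with a small script. This is a rough sketch, not how Ruby implements the options: filter_records is a hypothetical helper, and the irs and ors parameters stand in for the values that $/ and $\ end up with.

```ruby
# Hypothetical helper modeling: ruby <options> -ne 'print if /a/'
# irs: input record separator, ors: string appended by print
def filter_records(text, irs:, ors:)
  text.each_line(irs)
      .map { |rec| rec.chomp(irs) }        # chomp done by -l
      .select { |rec| rec.match?(/a/) }    # the /a/ condition
      .map { |rec| rec + ors }             # print appends ors
      .join
end

s = "this:is:a:sample:string\n"

# -l -0072: -l copies the default "\n" before -0072 changes the separator
p filter_records(s, irs: ":", ors: "\n")

# -0072 -l: the separator is already ":" when -l copies it
p filter_records(s, irs: ":", ors: ":")
```

In both cases the records are chomped the same way; only the appended string differs.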
By default, the -a option will split the input record based on whitespace and remove the leading/trailing whitespace. Now that you've seen how the input record separator can be something other than newline, here's an example to show the full effect of the default record splitting.
# ':' character is the input record separator here
$ s=' a\t\tb:1000\n\n\t \n\n123 7777:x y \n \n z :apple banana cherry'
$ printf '%b' "$s" | ruby -0072 -lane 'puts $F * ","'
a,b
1000,123,7777
x,y,z
apple,banana,cherry
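In a script, String#split without arguments gives the same whitespace-based splitting. The sketch below reuses the sample string from the example above; csv_fields is a hypothetical helper name used only for illustration.

```ruby
# Hypothetical helper: split a record on whitespace and join with commas.
# String#split with no arguments ignores leading/trailing whitespace
# and treats runs of spaces, tabs and newlines as a single separator.
def csv_fields(record)
  record.split.join(",")
end

s = " a\t\tb:1000\n\n\t \n\n123 7777:x y \n \n z :apple banana cherry"

# ':' is the record separator, removed before splitting into fields
s.each_line(":") { |rec| puts csv_fields(rec.chomp(":")) }
```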
Note that by default chomp will remove \r\n line endings as well from the input record. But, you'll get only \n in the output if you are relying on the -l option.
$ printf 'apple\r\nfig\r\n' | cat -v
apple^M
fig^M
$ printf 'apple\r\nfig\r\n' | ruby -lne 'print' | cat -v
apple
fig
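The String#chomp method shows the same behavior when called directly. A minimal sketch, where to_unix_line is a hypothetical helper name:

```ruby
# Hypothetical helper: normalize a line ending to a single "\n".
# chomp without arguments removes a trailing "\n", "\r\n" or "\r".
def to_unix_line(line)
  line.chomp + "\n"
end

p to_unix_line("apple\r\n")
p to_unix_line("fig\n")
```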
NUL separator
If the -0 option is used without an argument, the ASCII NUL character will be considered as the input record separator.
$ printf 'apple\0banana\0' | cat -v
apple^@banana^@
# could also be golfed to: ruby -l0pe ''
$ printf 'apple\0banana\0' | ruby -l -0 -ne 'print'
apple
banana
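NUL separated data can be handled the same way in a script by passing "\0" as the separator. A small sketch, with nul_records as a hypothetical helper name:

```ruby
# Hypothetical helper: split NUL separated data into records
# and remove the NUL character from each record.
def nul_records(data)
  data.each_line("\0").map { |rec| rec.chomp("\0") }
end

puts nul_records("apple\0banana\0")
```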
Slurping entire input
Any octal value of 400 and above will cause the entire input to be slurped as a single string. Idiomatically, 777 is used. This is the same as setting $/ = nil. Slurping the entire file makes it easier to solve some problems, but be careful not to use it for large files, as that might cause memory issues.
$ cat paths.txt
/home/joe/report.log
/home/ram/power.log
/home/rambo/errors.log
$ ruby -0777 -pe 'sub(%r{(?<!\A)/.+/}m, "/")' paths.txt
/home/errors.log
# replicate entire input as many times as needed
$ seq 2 | ruby -0777 -ne 'print $_ * 2'
1
2
1
2
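Passing nil as the separator to gets has the same slurping effect in a script. The sketch below uses StringIO as a stand-in for the input file, and slurp is a hypothetical helper name.

```ruby
require 'stringio'

# Hypothetical helper: read everything that remains in the stream
# as a single string, similar to what -0777 does for one-liners.
def slurp(io)
  io.gets(nil)
end

paths = "/home/joe/report.log\n/home/ram/power.log\n/home/rambo/errors.log\n"
text = slurp(StringIO.new(paths))

# with the m flag, . matches newlines too, so the match can span lines
print text.sub(%r{(?<!\A)/.+/}m, "/")

# replicate the entire input
print slurp(StringIO.new("1\n2\n")) * 2
```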
Paragraph mode
As a special case, using -00 or setting $/ to an empty string will invoke paragraph mode. Two or more consecutive newline characters will act as the record separator. Consider the below sample file:
$ cat para.txt
Hello World

Hi there
How are you

Just do-it
Believe it

banana
papaya
mango

Much ado about nothing
He he he
Adios amigo
Here are some examples of processing the input file paragraph wise.
# all paragraphs containing 'do'
# note that the record separator is preserved as there's no chomp
$ ruby -00 -ne 'print if /do/' para.txt
Just do-it
Believe it

Much ado about nothing
He he he
Adios amigo
# all paragraphs containing exactly two lines
# note that there's an empty line at the end of the output
$ ruby -F'\n' -00 -ane 'print if $F.size == 2' para.txt
Hi there
How are you

Just do-it
Believe it

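Paragraph mode is also available in scripts by passing an empty string as the separator to each_line on an IO-like object. The sketch below uses StringIO with the first three paragraphs of the sample file (separated by exactly one empty line); paragraphs and two_line_paragraphs are hypothetical helper names.

```ruby
require 'stringio'

# Hypothetical helper: read the text paragraph by paragraph.
# An empty string separator selects paragraph mode.
def paragraphs(text)
  StringIO.new(text).each_line("").to_a
end

# Hypothetical helper: keep paragraphs with exactly two lines,
# similar to -F'\n' -00 -ane 'print if $F.size == 2'
def two_line_paragraphs(text)
  paragraphs(text).select { |para| para.split("\n").size == 2 }
end

text = "Hello World\n\nHi there\nHow are you\n\nJust do-it\nBelieve it\n"
p paragraphs(text)
print two_line_paragraphs(text).join
```

As with the one-liners, the separator stays with the record unless it is chomped.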
If the paragraphs are separated by more than two consecutive newlines, the extra newlines will not be part of the record content.
$ s='a\n\n\n\n\n\n\n\n12\n34\n\nhi\nhello\n'
# note that the -l option isn't being used here
$ printf '%b' "$s" | ruby -00 -ne 'print if $. <= 2'
a

12
34

Any leading newlines (only newlines, not other whitespace characters) in the input data will be trimmed and will not lead to empty records. This is similar to how -a treats whitespace for default field separation.
$ s='\n\n\na\nb\n\n12\n34\n\nhi\nhello\n\n\n\n'
# note that -l is used to chomp the record separator here
$ printf '%b' "$s" | ruby -00 -lne 'puts "#{$_}\n---" if $. <= 2'
a
b
---
12
34
---
# max. of two trailing newlines will be preserved if -l isn't used
$ printf '%b' "$s" | ruby -00 -lne 'puts "#{$_}\n---" if $<.eof'
hi
hello
---
$ printf '%b' "$s" | ruby -00 -ne 'END{puts $.}'
3
$ printf '%b' "$s" | ruby -00 -ne 'BEGIN{$/="\n\n"}; END{puts $.}'
5
The empty line at the end is a common problem when dealing with custom record separators. You could either process the output further to remove it or add extra logic to handle the issue. Here's one possible workaround:
# no empty line at the end
$ ruby -l -00 -ne '(print $s, $_; $s="\n") if /are/' para.txt
Hi there
How are you
# single empty line between the paragraphs
$ ruby -l -00 -ne '(print $s, $_; $s="\n") if /are|an/' para.txt
Hi there
How are you

banana
papaya
mango
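The idea behind this workaround is to emit the separator before every record except the first one. Here's a minimal script version of that idiom, assuming the paragraphs have already been chomped; join_paragraphs is a hypothetical helper name.

```ruby
# Hypothetical helper: join chomped paragraphs with a single empty
# line between them and no empty line at the end.
def join_paragraphs(paras)
  out = +""
  sep = nil                        # nothing before the first paragraph
  paras.each do |para|
    out << sep.to_s << para << "\n"
    sep = "\n"                     # empty line before later paragraphs
  end
  out
end

print join_paragraphs(["Hi there\nHow are you", "banana\npapaya\nmango"])
```

When all the records fit in memory, paras.join("\n\n") followed by a final newline would give the same result.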
Output record separator
You can use $\ to specify the string that gets appended by the print method. This will override changes due to the -l option, if any.
# recall that the input record separator isn't removed by default
$ seq 2 | ruby -ne 'print'
1
2
# this will add four more characters after the already present newline
# same as: ruby -pe 'BEGIN{$\ = "---\n"}'
$ seq 2 | ruby -ne 'BEGIN{$\ = "---\n"}; print'
1
---
2
---
# change the NUL character to a dot and newline characters
# -l here helps to chomp the NUL character
# -l also sets NUL to be added to print, but gets overridden in the BEGIN block
$ printf 'apple\0banana\0' | ruby -0 -lpe 'BEGIN{$\ = ".\n"}'
apple.
banana.
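The $\ variable has the same effect on print inside a script. The sketch below writes to a StringIO buffer so that the result can be inspected, and restores the previous value of $\ afterwards. print_records is a hypothetical helper name, and newer Ruby versions may warn about assigning to $\ when deprecation warnings are enabled.

```ruby
require 'stringio'

# Hypothetical helper: print each record to an in-memory buffer with
# $\ temporarily set to the given output record separator.
def print_records(records, ors)
  saved = $\
  out = StringIO.new
  $\ = ors
  records.each { |rec| out.print(rec) }
  out.string
ensure
  $\ = saved   # restore the previous value
end

# newline is still part of each record, so the separator comes after it
print print_records(["1\n", "2\n"], "---\n")

# records already chomped, as with -l
print print_records(["apple", "banana"], ".\n")
```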
Many times, you'd need to change the output record separator depending upon the contents of the input record or some other condition. The cond ? expr1 : expr2 ternary operator is often used in such scenarios. The below example assumes that the number of input lines is evenly divisible by three. You'll have to add more logic if that is not the case.
# same as: ruby -pe 'sub(/\n/, "-") if $. % 3 != 0'
$ seq 6 | ruby -lpe '$\ = $. % 3 != 0 ? "-" : "\n"'
1-2-3
4-5-6
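The same ternary-based choice of separator can be written without the $\ variable. This is just a sketch with a hypothetical helper named join_in_groups, and it has the same limitation when the number of items isn't evenly divisible by the group size.

```ruby
# Hypothetical helper: append "-" after each item, except that every
# size-th item gets a newline instead.
def join_in_groups(items, size)
  items.each_with_index.map do |item, idx|
    sep = (idx + 1) % size != 0 ? "-" : "\n"
    "#{item}#{sep}"
  end.join
end

print join_in_groups((1..6).to_a, 3)
```

With five items instead of six, the result would end with a dangling "-" and no newline.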
Summary
This chapter showed you how to change the way the input content is split into records and how to set the string to be appended when print is used. The paragraph mode is useful for processing multiline records separated by one or more empty lines. You also learned how to set ASCII NUL as the record separator and how to slurp the entire input as a single string.
Exercises
The exercises directory has all the files used in this section.
1) The input file jumbled.txt consists of words separated by various delimiters. Display the last four words that contain an or at or in or it, one per line.
$ cat jumbled.txt
overcoats;furrowing-typeface%pewter##hobby
wavering:concession/woof\retailer
joint[]seer{intuition}titanic
##### add your solution here
wavering
joint
intuition
titanic
2) Emulate paste -sd, with Ruby.
# this command joins all input lines with the ',' character
$ paste -sd, ip.txt
Hello World,How are you,This game is good,Today is sunny,12345,You are funny
# make sure there's no ',' at the end of the line
# and that there's a newline character at the end of the line
##### add your solution here
Hello World,How are you,This game is good,Today is sunny,12345,You are funny
# if there's only one line in input, again make sure there's no trailing ','
# and that there's a newline character at the end of the line
$ printf 'fig' | paste -sd,
fig
$ printf 'fig' | ##### add your solution here
fig
3) For the input file sample.txt, extract all paragraphs having words starting with do.
$ cat sample.txt
Hello World

Good day
How are you

Just do-it
Believe it

Today is sunny
Not a bit funny
No doubt you like it too

Much ado about nothing
He he he
# note that there's no extra empty line at the end of the output
##### add your solution here
Just do-it
Believe it

Today is sunny
Not a bit funny
No doubt you like it too
4) For the input file sample.txt, change each paragraph to a single line by joining lines using . and a space character as the separator. Also, add a final . to each paragraph.
# note that there's no extra empty line at the end of the output
##### add your solution here
Hello World.
Good day. How are you.
Just do-it. Believe it.
Today is sunny. Not a bit funny. No doubt you like it too.
Much ado about nothing. He he he.
5) For the given input, use ;; as the record separator and : as the field separator. Filter records whose second field is greater than 50.
$ s='mango:100;;apple:25;;grapes:75'
# note that the output has ;; at the end, not a newline character
$ printf "$s" | ##### add your solution here
mango:100;;grapes:75;;
6) The input file f1.txt has a varying number of empty lines between the records. Change them to be always two empty lines. Also, remove empty lines at the start and end of the file.
##### add your solution here
hello
world
apple
banana
cherry
tea coffee
chocolate
7) The sample string shown below uses cat as the record separator. Display only the even numbered records separated by a single empty line.
$ s='applecatfigcat12345catbananacatguava:cat:mangocat3'
$ echo "$s" | ##### add your solution here
fig

banana

:mango