Record separators

So far, you've seen examples where ruby automatically splits input line by line based on the \n newline character. Just as you can control how those lines are further split into fields using the -a and -F options and other features, ruby provides a way to control what constitutes a line in the first place. In ruby parlance, the term record is used to describe the content that gets placed in the $_ global variable.

Input record separator

By default, the newline character is used as the input record separator. You can change the $/ global variable to specify a different input record separator. Unlike field separators, only a string can be used here; a regexp isn't allowed.

$ # changing input record separator to comma
$ # note the content of second record, newline is just another character
$ # also note that by default record separator stays with the record contents
$ printf 'this,is\na,sample' | ruby -ne 'BEGIN{$/ = ","}; puts "#{$.})#{$_}"'
1)this,
2)is
a,
3)sample

$ # use -l option to chomp the record separator
$ printf 'this,is\na,sample' | ruby -lne 'BEGIN{$/ = ","}; puts "#{$.})#{$_}"'
1)this
2)is
a
3)sample
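
For comparison, roughly the same splitting can be sketched in a standalone script with String#each_line, which accepts a separator argument. This is only an illustration of the idea, not a description of how -n works internally, and split_records is a made-up helper name.

```ruby
# Hypothetical helper: split text into records on a string separator.
# With chomp: true, the trailing separator is removed from each record,
# similar to what -l does for one-liners.
def split_records(text, sep, chomp: false)
  records = text.each_line(sep).to_a
  chomp ? records.map { |r| r.chomp(sep) } : records
end

p split_records("this,is\na,sample", ",")
# => ["this,", "is\na,", "sample"]
p split_records("this,is\na,sample", ",", chomp: true)
# => ["this", "is\na", "sample"]
```

As with the one-liners, the newline inside the second record is treated as just another character.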

Here's a multicharacter example:

$ cat report.log
blah blah Error: second record starts
something went wrong
some more details Error: third record
details about what went wrong

$ # uses 'Error:' as the input record separator
$ # prints all the records that contain 'something'
$ ruby -lne 'BEGIN{$/ = "Error:"}; print if /something/' report.log
 second record starts
something went wrong
some more details 

Single character separator with -0 option

The -0 command line option can be used to specify a single character record separator, represented with zero to three octal digits.

$ s='this:is:a:sample:string'

$ # '072' is octal for : character
$ # -l is used to chomp the separator
$ echo "$s" | ruby -0072 -lne 'puts "#{$.}) #{$_}"'
1) this
2) is
3) a
4) sample
5) string

$ # print all records containing 'a'
$ echo "$s" | ruby -0072 -lne 'puts $_ if /a/'
a
sample

The character that gets appended by the print method when -l is used is based on the value of the input record separator at that point. Here are some examples to clarify this point.

$ s='this:is:a:sample:string'

$ # here record separator is still the default \n when -l is used
$ # so \n gets appended for 'print' method usage
$ # note that chomp doesn't depend on the order
$ # same as: echo "$s" | ruby -lne 'BEGIN{$/=":"}; print if /a/'
$ echo "$s" | ruby -l -0072 -ne 'print if /a/'
a
sample

$ # here -l is defined after -0, so : gets appended for 'print'
$ echo "$s" | ruby -0072 -lne 'print if /a/'
a:sample:

Recall that by default -a splits the input record on whitespace and removes leading/trailing whitespace. Now that you've seen how the input record separator can be something other than newline, here's an example to show the full effect of the default splitting of a record into fields.

$ # ':' character is the input record separator here
$ s='   a\t\tb\n\t\n:1000\n\n\n\n123 7777:x  y \n \n z  '
$ printf '%b' "$s" | ruby -0072 -lane 'puts $F * ","'
a,b
1000,123,7777
x,y,z
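
The three records from the example above can be checked one at a time in a script. This sketch assumes that -a behaves like calling split on the record with no arguments; default_fields is a hypothetical name used only here.

```ruby
# String#split with no arguments splits on runs of whitespace
# (spaces, tabs, newlines) and ignores leading/trailing whitespace.
def default_fields(record)
  record.split
end

p default_fields("   a\t\tb\n\t\n")        # => ["a", "b"]
p default_fields("1000\n\n\n\n123 7777")   # => ["1000", "123", "7777"]
p default_fields("x  y \n \n z  ")         # => ["x", "y", "z"]
```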

NUL separator and slurping

If the -0 option is used without an argument, the ASCII NUL character will be considered as the input record separator.

$ printf 'foo\0bar\0' | cat -v
foo^@bar^@

$ # could also be golfed to: ruby -l0pe ''
$ printf 'foo\0bar\0' | ruby -l -0 -ne 'print'
foo
bar

Any octal value of 400 and above will cause the entire input to be slurped as a single string. Idiomatically, 777 is used. Slurping the entire file makes it easier to solve some problems, but be careful not to use it for large files that might not fit in available memory.

$ cat paths.txt
/foo/a/report.log
/foo/y/power.log
/foo/abc/errors.log
$ ruby -0777 -pe 'sub(%r{(?<!\A)/.+/}m, "/")' paths.txt
/foo/errors.log

$ # replicate entire input as many times as needed
$ seq 2 | ruby -0777 -ne 'print $_ * 2'
1
2
1
2
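
In a script, the slurped input is simply one string (for example, what File.read or $stdin.read returns). Here's a minimal sketch of the paths.txt substitution applied to such a string; collapse_paths is a hypothetical helper name and the input is hard-coded for illustration.

```ruby
# Hypothetical helper: operate on the whole input as a single string,
# similar in spirit to what -0777 places in $_.
def collapse_paths(text)
  # The m flag lets . match newline characters, so .+ can span lines.
  # (?<!\A) stops the match from starting at the very first / character.
  text.sub(%r{(?<!\A)/.+/}m, "/")
end

input = "/foo/a/report.log\n/foo/y/power.log\n/foo/abc/errors.log\n"
print collapse_paths(input)   # prints /foo/errors.log
```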

Paragraph mode

As a special case, using -00 or setting $/ to an empty string will invoke paragraph mode. Two or more consecutive newline characters will act as the record separator. Consider the programming_quotes.txt sample file, shown here again for convenience:

$ cat programming_quotes.txt
Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it by Brian W. Kernighan

Some people, when confronted with a problem, think - I know, I will
use regular expressions. Now they have two problems by Jamie Zawinski

A language that does not affect the way you think about programming,
is not worth knowing by Alan Perlis

There are 2 hard problems in computer science: cache invalidation,
naming things, and off-by-1 errors by Leon Bambrick

Here are some examples of processing the input file paragraph-wise.

$ # all paragraphs containing 'you'
$ # note that the record separator is preserved as there's no chomp
$ ruby -00 -ne 'print if /you/' programming_quotes.txt
Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it by Brian W. Kernighan

A language that does not affect the way you think about programming,
is not worth knowing by Alan Perlis

$ # all paragraphs containing exactly two lines
$ ruby -F'\n' -00 -ane 'print if $F.size == 2' programming_quotes.txt
Some people, when confronted with a problem, think - I know, I will
use regular expressions. Now they have two problems by Jamie Zawinski

A language that does not affect the way you think about programming,
is not worth knowing by Alan Perlis

There are 2 hard problems in computer science: cache invalidation,
naming things, and off-by-1 errors by Leon Bambrick

If the paragraphs are separated by more than two consecutive newlines, the extra newlines will not be part of the record content.

$ s='a\n\n\n\n\n\n\n\n12\n34\n\nhi\nhello\n'

$ # note that -l option isn't being used here
$ printf '%b' "$s" | ruby -00 -ne 'print if $. <= 2'
a

12
34

Any leading newlines (only newlines, not other whitespace characters) in the input data file will be trimmed and will not lead to empty records. This is similar to how -a treats whitespace for default field separation.

$ s='\n\n\na\n\n12\n34\n\nhi\nhello\n\n\n\n'

$ # note that -l is used to chomp the record separator here
$ printf '%b' "$s" | ruby -00 -lne 'puts "#{$_}\n---" if $. <= 2'
a
---
12
34
---

$ # max. of two trailing newlines will be preserved if -l isn't used
$ printf '%b' "$s" | ruby -00 -lne 'puts "#{$_}\n---" if $<.eof'
hi
hello
---

$ printf '%b' "$s" | ruby -00 -ne 'END{puts $.}'
3
$ printf '%b' "$s" | ruby -00 -ne 'BEGIN{$/="\n\n"}; END{puts $.}'
5
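
The record counts above can be reproduced from a script as well. This is a minimal sketch, assuming that IO#each_line with an empty string separator follows the same paragraph rules as -00. The records_from_pipe helper is a hypothetical name, and the pipe is only there to stand in for stdin.

```ruby
# Hypothetical helper: feed text through a pipe and read it back
# record by record using the given separator.
def records_from_pipe(text, sep)
  reader, writer = IO.pipe
  writer.write(text)
  writer.close
  reader.each_line(sep).to_a
ensure
  reader&.close
end

sample = "\n\n\na\n\n12\n34\n\nhi\nhello\n\n\n\n"

# paragraph mode: leading and extra newlines do not create empty records
p records_from_pipe(sample, "").size       # => 3

# a literal "\n\n" separator counts every pair of newlines
p records_from_pipe(sample, "\n\n").size   # => 5
```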

If you wish to avoid the extra empty line at the end of the output for paragraph mode (or similar situations with other custom record separators), you can either post-process the output to remove the extra empty line or add some logic as shown below.

$ # single paragraph output, no empty line at the end
$ ruby -l -00 -ne '(print $s, $_; $s="\n") if /code/' programming_quotes.txt
Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it by Brian W. Kernighan

$ # multiple paragraph output with empty line between the paragraphs
$ ruby -l -00 -ne '(print $s, $_; $s="\n") if /you/' programming_quotes.txt
Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it by Brian W. Kernighan

A language that does not affect the way you think about programming,
is not worth knowing by Alan Perlis
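
If the records have already been collected in a script, another option is to strip the trailing newlines and join the records explicitly. This swaps the $s trick for a plain join and is only a sketch; join_paragraphs and the sample data are made up for illustration.

```ruby
# Hypothetical helper: join already-selected paragraph records with a
# single empty line between them and no empty line at the end.
def join_paragraphs(records)
  return "" if records.empty?
  # chomp("") removes all trailing newlines from a record
  records.map { |r| r.chomp("") }.join("\n\n") + "\n"
end

paras = ["a\nb\n\n", "c\n\n", "d\ne\n"]
# keeps the first and third paragraphs, separated by one empty line
print join_paragraphs(paras.grep(/[ad]/))
```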

Output record separator

You can use $\ to specify the string that gets appended by the print method. This will override changes due to the -l option, if any.

$ # recall that input record separator isn't removed by default
$ seq 2 | ruby -ne 'print'
1
2
$ # this will add four more characters after the already present newline
$ seq 2 | ruby -ne 'BEGIN{$\ = "---\n"}; print'
1
---
2
---

$ # change NUL record separator to dot and newline
$ # -l here helps to chomp the NUL character
$ # -l also sets NUL to be added to print, but gets overridden in BEGIN block
$ printf 'foo\0bar\0' | ruby -0 -lpe 'BEGIN{$\ = ".\n"}'
foo.
bar.

Many times, you need to change the output record separator depending upon the contents of the input record or some other condition. The cond ? expr1 : expr2 ternary operator is often used in such scenarios. The example below assumes that the number of input lines is evenly divisible by 3; you'll have to add more logic if that is not the case.

$ # same as: ruby -pe 'sub(/\n/, "-") if $. % 3 != 0'
$ seq 6 | ruby -lpe '$\ = $. % 3 != 0 ? "-" : "\n"'
1-2-3
4-5-6
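
When the input isn't evenly divisible, one possible approach in a script is to group the items first instead of toggling $\. The sketch below uses Enumerable#each_slice; join_in_groups is a hypothetical helper name, not something from the one-liners above.

```ruby
# Hypothetical helper: join every n items with sep, one group per element.
# A shorter final group is joined as is, so no trailing separator is left.
def join_in_groups(items, n, sep = "-")
  items.each_slice(n).map { |group| group.join(sep) }
end

puts join_in_groups((1..6).to_a, 3)
# 1-2-3
# 4-5-6

puts join_in_groups((1..7).to_a, 3)
# 1-2-3
# 4-5-6
# 7
```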

Summary

This chapter showed you how to change the way input content is split into records and how to set the string to be appended when print is used. Paragraph mode is useful for processing multiline records separated by one or more empty lines. You also learned how to set ASCII NUL as the record separator and how to slurp the entire input as a single string.

Exercises

a) The input file jumbled.txt consists of words separated by various delimiters. Display all words that contain an or at or in or it, one per line.

$ cat jumbled.txt
overcoats;furrowing-typeface%pewter##hobby
wavering:concession/woof\retailer

##### add your solution here
overcoats
furrowing
wavering

b) Emulate paste -sd, with ruby.

$ # this command joins all input lines with ',' character
$ paste -sd, ip.txt
Hello World,How are you,This game is good,Today is sunny,12345,You are funny
$ # make sure there's no ',' at end of the line
$ # and that there's a newline character at the end of the line
##### add your solution here
Hello World,How are you,This game is good,Today is sunny,12345,You are funny

$ # if there's only one line in input, again make sure there's no trailing ','
$ # and that there's a newline character at the end of the line
$ printf 'foo' | paste -sd,
foo
$ printf 'foo' | ##### add your solution here
foo

c) For the input file sample.txt, extract all paragraphs with words starting with do.

$ cat sample.txt
Hello World

Good day
How are you

Just do-it
Believe it

Today is sunny
Not a bit funny
No doubt you like it too

Much ado about nothing
He he he

$ # note that there's no extra empty line at the end of expected output
##### add your solution here
Just do-it
Believe it

Today is sunny
Not a bit funny
No doubt you like it too

d) For the input file sample.txt, change each paragraph into a single line by joining its lines using . and a space character as the separator. Also add a final . to each paragraph.

$ # note that there's no extra empty line at the end of expected output
##### add your solution here
Hello World.

Good day. How are you.

Just do-it. Believe it.

Today is sunny. Not a bit funny. No doubt you like it too.

Much ado about nothing. He he he.

e) For the given input, use ;; as the record separator and : as the field separator. Display all records whose second field is an integer greater than 50.

$ s='mango:100;;apple:25;;grapes:75'

$ # note that the output has ;; at the end but not newline character
$ printf "$s" | ##### add your solution here
mango:100;;grapes:75;;