Line processing

Now that you are familiar with basic Ruby CLI usage, this chapter will dive deeper into line processing examples. You'll learn various ways of matching lines based on regular expressions, fixed string matching, line numbers, etc. You'll also see how to group multiple statements and learn about the control flow constructs next and exit.

info The example_files directory has all the files used in the examples.

Regexp based filtering

As mentioned before, in a conditional context:

  • /regexp/ is a shortcut for $_ =~ /regexp/
  • !/regexp/ is a shortcut for $_ !~ /regexp/

Here are some examples:

$ cat table.txt
brown bread mat hair 42
blue cake mug shirt -7
yellow banana window shoes 3.14

$ ruby -ne 'print if /-|ow\b/' table.txt
blue cake mug shirt -7
yellow banana window shoes 3.14

$ ruby -ne 'print if !/[ksy]/' table.txt
brown bread mat hair 42

warning But, this is not applicable for all types of expressions. For example:

# /at$/ will be 'true' as it is treated as just a Regexp object here
$ printf 'gate\napple\nwhat\n' | ruby -ne '/at$/ && print'
gate
apple
what

# same as: ruby -ne 'print if /at$/'
$ printf 'gate\napple\nwhat\n' | ruby -ne '$_ =~ /at$/ && print'
what
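
The difference comes down to what each expression evaluates to. Here is a minimal standalone sketch (a plain Ruby script rather than a one-liner; the method names match_positions and truthy? are made up for illustration) showing that an explicit =~ gives a match index or nil, whereas a Regexp object by itself is always truthy.

```ruby
# Returns the result of =~ for each line: a match index (truthy) or nil (falsy).
def match_positions(lines, regexp)
  lines.map { |line| line =~ regexp }
end

# Reports whether a value counts as true in a boolean context.
def truthy?(value)
  value ? true : false
end

lines = ["gate\n", "apple\n", "what\n"]
p match_positions(lines, /at$/)   # [nil, nil, 2]

# Passed around as a plain object, the regexp is truthy regardless of any input.
p truthy?(/at$/)                  # true
```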

If required, you can also use different delimiters with %r. See ruby-doc: %r Regexp Literals for details.

$ cat paths.txt
/home/joe/report.log
/home/ram/power.log
/home/rambo/errors.log

# leaning toothpick syndrome
$ ruby -ne 'print if /\/home\/ram\//' paths.txt
/home/ram/power.log

$ ruby -ne 'print if %r{/home/ram/}' paths.txt
/home/ram/power.log

$ ruby -ne 'print if !%r#/home/ram/#' paths.txt
/home/joe/report.log
/home/rambo/errors.log

Extracting matched portions

You can use regexp related global variables to extract only the matching portions. Consider this input file.

$ cat ip.txt
it is a warm and cozy day
listen to what I say
go play in the park
come back before the sky turns dark

There are so many delights to cherish
Apple, Banana and Cherry
Bread, Butter and Jelly
Try them all before you perish

Here are some examples of extracting only the matched portions.

# note that this will print only the first match for each input line
$ ruby -ne 'puts $& if /\b[a-z]\w*[ty]\b/' ip.txt
it
what
play
sky
many

# extract the capture group portions
$ ruby -ne 'puts "#{$1}::#{$2}" if /(\b[bdp]\w+).*(\b[a-f]\w+)/i' ip.txt
back::dark
delights::cherish
Banana::Cherry
Bread::and

info See the Working with matched portions chapter from my ebook for examples that use the match method and regexp global variables.
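
If you are writing a script instead of a one-liner, the same extractions can be done without the global variables. The following is only a sketch: the helper names are hypothetical, and the use of scan to get every match (not just the first) goes beyond what the examples above show.

```ruby
# First match only, similar to printing $& in the one-liners above.
def first_match(line, regexp)
  line[regexp]
end

# Every non-overlapping match in the line.
def all_matches(line, regexp)
  line.scan(regexp)
end

# Capture groups via MatchData instead of $1 and $2; nil if there is no match.
def capture_pair(line, regexp)
  m = line.match(regexp)
  m && "#{m[1]}::#{m[2]}"
end

line = "it is a warm and cozy day"
p first_match(line, /\b[a-z]\w*[ty]\b/)   # "it"
p all_matches(line, /\b[a-z]\w*[ty]\b/)   # ["it", "cozy", "day"]
p capture_pair("come back before the sky turns dark",
               /(\b[bdp]\w+).*(\b[a-f]\w+)/i)   # "back::dark"
```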

match? method

As seen in the previous section, using $_ =~ /regexp/ also sets global variables. If you just need a true or false result, using the match? method is better suited for performance reasons. The difference would be more visible for large input files.

# same result as: ruby -ne 'print if /[AB]|the\b/'
$ ruby -ne 'print if $_.match?(/[AB]|the\b/)' ip.txt
go play in the park
come back before the sky turns dark
Apple, Banana and Cherry
Bread, Butter and Jelly
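
To see the difference in side effects, here is a small sketch (the method name last_match_after is made up for illustration) that inspects $~, the variable holding the last MatchData, after each kind of match.

```ruby
# Runs either match? or =~ and reports what ended up in $~.
# $~ is local to the method, so it starts out as nil on every call.
def last_match_after(line, regexp, predicate:)
  predicate ? line.match?(regexp) : line =~ regexp
  $~ && $~[0]
end

p last_match_after("go play in the park", /the\b/, predicate: true)    # nil
p last_match_after("go play in the park", /the\b/, predicate: false)   # "the"
```

Since match? leaves $~ untouched, the related variables such as $& and $1 are not populated either.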

Transliteration

The transliteration method tr helps you perform transformations character-wise. See ruby-doc: tr for documentation.

# rot13
$ echo 'Uryyb Jbeyq' | ruby -pe '$_.tr!("a-zA-Z", "n-za-mN-ZA-M")'
Hello World

# ^ at the start of the first argument complements the specified characters
$ echo 'apple:123:banana' | ruby -pe '$_.tr!("^0-9\n", "-")'
------123-------

# an empty second argument deletes the specified characters
$ echo 'apple:123:banana' | ruby -pe '$_.tr!("^0-9\n", "")'
123

# if the second list is shorter than the number of characters in the first list,
# the last character in the second list will be used to fill the gaps
$ s='orange apple appleseed cab'
$ echo "$s" | ruby -pe 'gsub(/\b(?!apple\b)\w++/) {$&.tr("a-z", "1-9")}'
991975 apple 199959554 312

You can use the tr_s method to squeeze repeated characters.

$ echo 'APPLESEED gobbledygook' | ruby -pe '$_.tr_s!("a-zA-Z", "a-zA-Z")'
APLESED gobledygok

# transliteration as well as squeeze
$ echo 'APPLESEED gobbledygook' | ruby -pe '$_.tr_s!("A-Z", "a-z")'
aplesed gobbledygook
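
As a standalone illustration, here is the rot13 example wrapped in a method (the name rot13 is just a label chosen for this sketch), along with the difference in return values between tr and tr!.

```ruby
# Rotates ASCII letters by 13 places; applying it twice restores the input.
def rot13(text)
  text.tr("a-zA-Z", "n-za-mN-ZA-M")
end

p rot13("Uryyb Jbeyq")          # "Hello World"
p rot13(rot13("Hello World"))   # "Hello World"

# tr returns a new string even if nothing changed,
# tr! modifies the receiver and returns nil if nothing changed
p "123".tr("a-z", "*")          # "123"
p "123".dup.tr!("a-z", "*")     # nil
```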

Conditional substitution

These examples combine line filtering and substitution in different ways. As noted before, the sub and gsub Kernel methods update $_ if the substitution succeeds and always return the value of $_.

# change commas to hyphens if the input line does NOT contain '2'
# prints all input lines even if the substitution fails
$ printf '1,2,3,4\na,b,c,d\n' | ruby -pe 'gsub(/,/, "-") if !/2/'
1,2,3,4
a-b-c-d

# perform substitution only for the filtered lines
# prints filtered input lines, even if the substitution fails
$ ruby -ne 'print gsub(/ark/, "[\\0]") if /the/' ip.txt
go play in the p[ark]
come back before the sky turns d[ark]
Try them all before you perish

# print only if the substitution succeeds
# $_.gsub! is required for this scenario
$ ruby -ne 'print if $_.gsub!(/\bw\w*t\b/, "{\\0}")' ip.txt
listen to {what} I say
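
The last example works because of what gsub! returns. Here is a sketch using the String methods (not the Kernel versions available with -n and -p); the helper name substituted_only is hypothetical.

```ruby
# Keeps only the lines where the substitution changed something.
# gsub! returns nil when nothing was replaced, so it doubles as the filter.
def substituted_only(lines, regexp, replacement)
  lines.map(&:dup).select { |line| line.gsub!(regexp, replacement) }
end

lines = ["listen to what I say\n", "go play in the park\n"]
p substituted_only(lines, /\bw\w*t\b/, '{\0}')   # ["listen to {what} I say\n"]

# gsub (without !) always returns a string, changed or not
p lines[1].gsub(/\bw\w*t\b/, '{\0}')             # "go play in the park\n"
```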

Multiple conditions

It is good to remember that Ruby is a programming language. You can make use of control structures and combine multiple conditions using logical operators, methods like all?, any?, etc. You don't have to create a single complex regexp.

$ ruby -ne 'print if /ark/ && !/sky/' ip.txt
go play in the park

$ ruby -ane 'print if /\bthe\b/ || $F.size == 5' ip.txt
listen to what I say
go play in the park
come back before the sky turns dark
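
When the list of conditions grows, the all? and any? methods mentioned above can keep things readable. A minimal sketch, with made-up helper names and a couple of lines borrowed from ip.txt:

```ruby
# True if every regexp in the list matches the line.
def all_match?(line, regexps)
  regexps.all? { |re| line.match?(re) }
end

# True if at least one regexp in the list matches the line.
def any_match?(line, regexps)
  regexps.any? { |re| line.match?(re) }
end

lines = ["go play in the park", "come back before the sky turns dark"]
wanted = lines.select do |line|
  all_match?(line, [/ark/, /\bthe\b/]) && !any_match?(line, [/sky/, /xyz/])
end
p wanted   # ["go play in the park"]
```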

next

When the next statement is executed, the rest of the code will be skipped and the next input line will be fetched for processing. It doesn't affect the BEGIN and END blocks as they are outside the file content loop.

$ ruby -ne '(puts "%% #{$_}"; next) if /\bpar/;
            puts /s/ ? "X" : "Y"' word_anchors.txt
%% sub par
X
Y
X
%% cart part tart mart

info () is used in the above example to group multiple statements to be executed for a single if condition. You'll see more such examples in the coming chapters.
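
The same control flow can be written as an explicit loop in a script. This is only a sketch: the method name tag_lines is made up, and the sample lines are an assumption chosen to be consistent with the output shown above (the contents of word_anchors.txt aren't listed in this chapter).

```ruby
# Collects the output lines instead of printing them, so the result is easy to inspect.
def tag_lines(lines)
  out = []
  lines.each do |line|
    if line.match?(/\bpar/)
      out << "%% #{line}"
      next   # skip the rest of the block for this line
    end
    out << (line.match?(/s/) ? "X" : "Y")
  end
  out
end

# assumed sample input, not necessarily the actual file contents
sample = ["sub par", "spar", "apparent effort",
          "two spare computers", "cart part tart mart"]
puts tag_lines(sample)
```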

exit

The exit method will cause the Ruby script to terminate immediately. This is useful to avoid processing unnecessary input content after a termination condition is reached.

# quits after an input line containing 'say' is found
$ ruby -ne 'print; exit if /say/' ip.txt
it is a warm and cozy day
listen to what I say

# the matching line won't be printed in this case
$ ruby -pe 'exit if /say/' ip.txt
it is a warm and cozy day

Use tac to get all lines starting from the last occurrence of the search string in the entire file.

$ tac ip.txt | ruby -ne 'print; exit if /an/' | tac
Bread, Butter and Jelly
Try them all before you perish

You can optionally provide a status code as an argument to the exit method.

$ printf 'sea\neat\ndrop\n' | ruby -ne 'print; exit(2) if /at/'
sea
eat
$ echo $?
2

Any code in the END block will still be executed before exiting. This doesn't apply if exit was called from the BEGIN block.

$ ruby -pe 'exit if /cake/' table.txt
brown bread mat hair 42

$ ruby -pe 'exit if /cake/; END{puts "bye"}' table.txt
brown bread mat hair 42
bye

$ ruby -pe 'BEGIN{puts "hi"; exit; puts "hello"}; END{puts "bye"}' table.txt
hi

warning Be careful if you want to use exit with multiple input files, as Ruby will stop even if there are other files remaining to be processed.
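
One way to understand why clean-up code such as the END block still gets a chance to run: the exit method raises a SystemExit exception instead of halting the interpreter on the spot. The sketch below (the method name exit_status_of is hypothetical) catches that exception to read the status code.

```ruby
# Runs the block and returns the status it tried to exit with,
# or 0 if the block finished without calling exit.
def exit_status_of
  yield
  0
rescue SystemExit => e
  e.status
end

p(exit_status_of { exit(2) })    # 2
p(exit_status_of { exit })       # 0
p(exit_status_of { :no_exit })   # 0
```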

Line number based processing

Line numbers can also be specified as a matching criterion by using the $. global variable.

# print only the third line
$ ruby -ne 'print if $. == 3' ip.txt
go play in the park

# print the second and sixth lines
$ ruby -ne 'print if $. == 2 || $. == 6' ip.txt
listen to what I say
There are so many delights to cherish

# transliterate only the second line
$ printf 'gates\nnot\nused\n' | ruby -pe '$_.tr!("a-z", "*") if $. == 2'
gates
***
used

# print from a particular line number to the end of the input
$ seq 14 25 | ruby -ne 'print if $. >= 10'
23
24
25

The global variable $< contains the file handle for the current file input being processed. Use the eof method to check for the end of the file condition. See ruby-doc: eof for documentation. You can also use ARGF instead of $< here, see the ARGV and ARGF section for details.

# same as: tail -n1 ip.txt
$ ruby -ne 'print if $<.eof' ip.txt
Try them all before you perish

$ ruby -ne 'puts "#{$.}:#{$_}" if $<.eof' ip.txt
9:Try them all before you perish

# multiple file example
# same as: tail -q -n1 ip.txt table.txt
$ ruby -ne 'print if $<.eof' ip.txt table.txt
Try them all before you perish
yellow banana window shoes 3.14
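
In a script, similar information is available from the IO object itself via the lineno and eof? methods. Here is a minimal sketch that uses StringIO as a stand-in for a file; the method name numbered_last_line is made up for illustration.

```ruby
require "stringio"

# Returns "lineno:line" for the final line of the given IO, or nil for empty input.
def numbered_last_line(io)
  io.each_line do |line|
    return "#{io.lineno}:#{line}" if io.eof?
  end
  nil
end

p numbered_last_line(StringIO.new("day\nsay\npark\n"))   # "3:park\n"
```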

For large input files, use the exit method to avoid processing unnecessary input lines.

$ seq 3542 4623452 | ruby -ne '(print; exit) if $. == 2452'
5993

$ seq 3542 4623452 | ruby -ne 'print if $. == 250; (print; exit) if $. == 2452'
3791
5993

# here is a sample time comparison
$ time seq 3542 4623452 | ruby -ne '(print; exit) if $. == 2452' > f1
real    0m0.055s
$ time seq 3542 4623452 | ruby -ne 'print if $. == 2452' > f2
real    0m1.130s
$ rm f1 f2

Flip-Flop operator

You can use the Flip-Flop operator to select between a pair of matching conditions like line numbers and regexp. See ruby-doc: Flip-Flop for documentation.

# the range is automatically compared against $. in this context
$ seq 14 25 | ruby -ne 'print if 3..5'
16
17
18

# 'print if 3...5' gives the same result as above,
# you can use the include? method to exclude the end of the range
$ seq 14 25 | ruby -ne 'print if (3...5).include?($.)'
16
17

# the range is automatically compared against $_ in this context
# note that all the matching ranges are printed
$ ruby -ne 'print if /to/../pl/' ip.txt
listen to what I say
go play in the park
There are so many delights to cherish
Apple, Banana and Cherry

info See the Records bounded by distinct markers section for an alternate solution.

Line numbers and regexp filtering can be mixed.

$ ruby -ne 'print if 6../utter/' ip.txt
There are so many delights to cherish
Apple, Banana and Cherry
Bread, Butter and Jelly

# same logic as: ruby -pe 'exit if /\bba/'
# inefficient, but this will work for multiple file inputs
$ ruby -ne 'print if !(/\bba/..$<.eof)' ip.txt table.txt
it is a warm and cozy day
listen to what I say
go play in the park
brown bread mat hair 42
blue cake mug shirt -7

Both conditions can match the same line too! Also, if the second condition doesn't match, lines starting from the first condition to the last line of the input will be matched.

# 'and' matches the 7th line
$ ruby -ne 'print if 7../and/' ip.txt
Apple, Banana and Cherry
# 'and' will be tested against 8th line onwards
$ ruby -ne 'print if 7.../and/' ip.txt
Apple, Banana and Cherry
Bread, Butter and Jelly

# there's a line containing 'Banana' but the matching pair isn't found
# so, all lines till the end of the input are printed
$ ruby -ne 'print if /Banana/../XYZ/' ip.txt
Apple, Banana and Cherry
Bread, Butter and Jelly
Try them all before you perish
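
The Flip-Flop operator can be used inside a script as well. In the following sketch, the comparisons are written out explicitly instead of relying on $. and $_. The method names are made up for illustration and IP_LINES reproduces the contents of ip.txt.

```ruby
IP_LINES = [
  "it is a warm and cozy day",
  "listen to what I say",
  "go play in the park",
  "come back before the sky turns dark",
  "",
  "There are so many delights to cherish",
  "Apple, Banana and Cherry",
  "Bread, Butter and Jelly",
  "Try them all before you perish"
].freeze

# All ranges of lines from a start_re match to an end_re match.
def between(lines, start_re, end_re)
  lines.select { |line| true if (line =~ start_re)..(line =~ end_re) }
end

# Two dots: the end condition is also tested on the line that starts the range.
def from_line_two_dot(lines, start_no, end_re)
  lines.select.with_index(1) { |line, n| true if (n == start_no)..(line =~ end_re) }
end

# Three dots: the end condition is tested only from the following line onwards.
def from_line_three_dot(lines, start_no, end_re)
  lines.select.with_index(1) { |line, n| true if (n == start_no)...(line =~ end_re) }
end

puts between(IP_LINES, /to/, /pl/)
puts from_line_two_dot(IP_LINES, 7, /and/)
puts from_line_three_dot(IP_LINES, 7, /and/)
```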

Working with fixed strings

To match strings literally, use the include? method for line filtering. Use a string argument instead of a regexp for fixed string matching with substitution methods.

$ printf 'int a[5]\nfig\n1+4=5\n' | ruby -ne 'print if /a[5]/'
$ printf 'int a[5]\nfig\n1+4=5\n' | ruby -ne 'print if $_.include?("a[5]")'
int a[5]

$ printf 'int a[5]\nfig\n1+4=5\n' | ruby -pe 'sub(/a[5]/, "b")'
int a[5]
fig
1+4=5
$ printf 'int a[5]\nfig\n1+4=5\n' | ruby -pe 'sub("a[5]", "b")'
int b
fig
1+4=5
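
Here is the same comparison as a standalone sketch. The helper literal_regexp is hypothetical, and the use of Regexp.escape is an addition not covered in the examples above: it builds a regexp that matches the given string literally, which may help when an API insists on a regexp argument.

```ruby
# Builds a regexp that matches the given string character for character.
def literal_regexp(str)
  Regexp.new(Regexp.escape(str))
end

line = "int a[5]"
p line.match?(/a[5]/)                   # false, since [5] is a character class
p line.include?("a[5]")                 # true
p line.match?(literal_regexp("a[5]"))   # true
p line.sub("a[5]", "b")                 # "int b"
```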

The above examples use double quotes for the string argument, which allows escape sequences like \t, \n, etc and interpolation with #{}. This isn't the case with single quoted string values. Using single quotes within the script from the command line requires messing with shell metacharacters. So, use %q instead or pass the fixed string to be matched as an environment variable.

# double quotes allow escape sequences and interpolation
$ ruby -e 'a=5; puts "value of a:\t#{a}"'
value of a:     5

# use %q as an alternative to specify single quoted strings
$ echo 'int #{a}' | ruby -ne 'print if $_.include?(%q/#{a}/)'
int #{a}
$ echo 'int #{a}' | ruby -pe 'sub(%q/#{a}/, "b")'
int b

# or pass the string as an environment variable
$ echo 'int #{a}' | s='#{a}' ruby -ne 'print if $_.include?(ENV["s"])'
int #{a}
# \\ is special within single quotes, so ENV is the better choice here
$ echo 'int #{a\\}' | s='#{a\\}' ruby -pe 'sub(ENV["s"], "b")'
int b

To provide a fixed string in the replacement section, environment variables come in handy again. You need to use the block form, since \ is special in the replacement section.

# \\ will be treated as \ and \0 will backreference the entire matched portion
$ echo 'int a' | s='x\\y\0z' ruby -pe 'sub(/a/, ENV["s"])'
int x\yaz

# use block form to avoid such issues
$ echo 'int a' | s='x\\y\0z' ruby -pe 'sub(/a/) {ENV["s"]}'
int x\\y\0z
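
The same behavior can be reproduced in a script. In this sketch, a quoted heredoc stands in for the environment variable, since it keeps every backslash as-is. The constant and method names are made up for illustration.

```ruby
# A quoted heredoc does no escape processing, similar to a value read from ENV.
REPLACEMENT = <<'EOS'.chomp
x\\y\0z
EOS

# String replacement: \\ and \0 are interpreted by sub.
def replace_with_escapes(line, regexp, replacement)
  line.sub(regexp, replacement)
end

# Block form: the block's result is inserted untouched.
def replace_literally(line, regexp, replacement)
  line.sub(regexp) { replacement }
end

puts replace_with_escapes("int a", /a/, REPLACEMENT)   # int x\yaz
puts replace_literally("int a", /a/, REPLACEMENT)      # int x\\y\0z
```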

Use the start_with? and end_with? methods to restrict the matching to the start or end of the input line. The line content in the $_ variable contains the \n line ending character as well. You can either use the chomp method explicitly or use the -l command line option (which will be discussed in detail in the Record separators chapter). For now, it is enough to know that -l will remove the line separator and add it back when print is used.

$ cat eqns.txt
a=b,a-b=c,c*d
a+b,pi=3.14,5e12
i*(t+9-g)/8,4-a+b

# start of the line
$ s='a+b' ruby -ne 'print if $_.start_with?(ENV["s"])' eqns.txt
a+b,pi=3.14,5e12

# end of the line
# -l option is needed here to remove \n from $_
$ s='a+b' ruby -lne 'print if $_.end_with?(ENV["s"])' eqns.txt
i*(t+9-g)/8,4-a+b
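
The effect of the trailing newline is easy to check in isolation. A minimal sketch using chomp explicitly (the helper name ends_with_fixed? is hypothetical):

```ruby
# Checks the end of a line after removing the trailing line separator.
def ends_with_fixed?(line, str)
  line.chomp.end_with?(str)
end

p "4-a+b\n".end_with?("a+b")           # false, the line ends with \n
p ends_with_fixed?("4-a+b\n", "a+b")   # true
p "a+b,pi\n".start_with?("a+b")        # true, the newline doesn't matter here
```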

Use the index method if you need more control over the location of the matching strings. You can use either the return value (which gives you the index of the matching string) or use the optional second argument to specify an offset to start searching. See ruby-doc: index for details.

# same as: $_.include?("a+b")
$ ruby -ne 'print if $_.index("a+b")' eqns.txt
a+b,pi=3.14,5e12
i*(t+9-g)/8,4-a+b

# same as: $_.start_with?("a+b")
$ ruby -ne 'print if $_.index("a+b")==0' eqns.txt
a+b,pi=3.14,5e12

# since 'index' returns 'nil' if there's no match,
# you need some more processing for < or <= comparison
$ ruby -ne '$i = $_.index("="); print if $i && $i < 6' eqns.txt
a=b,a-b=c,c*d

# for > or >= comparison, use the optional second argument
$ s='a+b' ruby -ne 'print if $_.index(ENV["s"], 1)' eqns.txt
i*(t+9-g)/8,4-a+b

If you need to match the entire input line or a particular field, you can use the comparison operators.

$ printf 'a.b\na+b\n' | ruby -lne 'print if /^a.b$/'
a.b
a+b
$ printf 'a.b\na+b\n' | ruby -lne 'print if $_ == %q/a.b/'
a.b

$ printf '1 a.b\n2 a+b\n' | ruby -lane 'print if $F[1] != %q/a.b/'
2 a+b

In-place file editing

You can use the -i option to write back the changes to the input file instead of displaying the output on the terminal. When an extension is provided as an argument to -i, the original contents of the input file get preserved as per the extension given. For example, if the input file is ip.txt and -i.orig is used, the backup file will be named as ip.txt.orig.

$ cat colors.txt
deep blue
light orange
blue delight

# no output on the terminal as -i option is used
# space is NOT allowed between -i and the extension
$ ruby -i.bkp -pe 'sub(/blue/, "-green-")' colors.txt
# changes are written back to 'colors.txt'
$ cat colors.txt
deep -green-
light orange
-green- delight

# original file is preserved in 'colors.txt.bkp'
$ cat colors.txt.bkp
deep blue
light orange
blue delight

Multiple input files are treated individually and the changes are written back to respective files.

$ cat t1.txt
have a nice day
bad morning
what a pleasant evening
$ cat t2.txt
worse than ever
too bad

$ ruby -i.bkp -pe 'sub(/bad/, "good")' t1.txt t2.txt
$ ls t?.*
t1.txt  t1.txt.bkp  t2.txt  t2.txt.bkp

$ cat t1.txt
have a nice day
good morning
what a pleasant evening
$ cat t2.txt
worse than ever
too good

Sometimes backups are not desirable. In such cases, you can use the -i option without an argument. Be careful though, as changes made cannot be undone. It is recommended to test the command with sample inputs before applying the -i option on the actual file. You could also use the option with backup, compare the differences with a diff program and then delete the backup.

$ cat fruits.txt
banana
papaya
mango

$ ruby -i -pe 'gsub(/(..)\1/) {$&.upcase}' fruits.txt
$ cat fruits.txt
bANANa
PAPAya
mango
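
If you need similar behavior inside a script, here is a rough sketch. It only approximates the visible effect of the -i option (it is not a description of how Ruby implements it), the method name edit_in_place is made up, and a temporary directory is used so that no real file is touched.

```ruby
require "fileutils"
require "tmpdir"

# Rewrites the file line by line using the given block. If backup_ext is given
# (for example ".bkp"), the original content is first copied to path + backup_ext.
def edit_in_place(path, backup_ext = nil)
  FileUtils.cp(path, path + backup_ext) if backup_ext
  new_content = File.readlines(path).map { |line| yield line }.join
  File.write(path, new_content)
end

Dir.mktmpdir do |dir|
  path = File.join(dir, "colors.txt")
  File.write(path, "deep blue\nlight orange\nblue delight\n")

  edit_in_place(path, ".bkp") { |line| line.sub(/blue/, "-green-") }
  puts File.read(path)            # modified content
  puts File.read(path + ".bkp")   # original content
end
```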

Summary

This chapter showed various examples of processing only the lines of interest instead of the entire input file. Filtering can be specified using a regexp, fixed string, line number or a combination of them. You also saw how to combine multiple statements inside () for compact CLI usage. The next statement and the exit method are useful to control the flow of code. The -i option is handy for in-place editing.

Exercises

info The exercises directory has all the files used in this section.

1) For the given input, display all the lines except the third one.

$ seq 34 37 | ##### add your solution here
34
35
37

2) Display only the fourth, fifth, sixth and seventh lines for the given input.

$ seq 65 78 | ##### add your solution here
68
69
70
71

3) For the input file ip.txt, replace all occurrences of are with are not and is with is not only from line number 4 till the end of file. Also, only the lines that were changed should be displayed in the output.

$ cat ip.txt
Hello World
How are you
This game is good
Today is sunny
12345
You are funny

##### add your solution here
Today is not sunny
You are not funny

4) For the given stdin, display only the first three lines. Avoid processing lines that are not relevant.

$ seq 14 25 | ##### add your solution here
14
15
16

5) For the input file ip.txt, display all lines from the start of the file till the first occurrence of game.

##### add your solution here
Hello World
How are you
This game is good

6) For the input file ip.txt, display all lines that contain is but not good.

##### add your solution here
Today is sunny

7) For the input file ip.txt, extract the word before the whole word is as well as the word after it. If such a match is found, display the two words around is in reversed order. For example, hi;1 is--234 bye should be converted to 234:1. Assume that the whole word is will not be present more than once in a single line.

##### add your solution here
good:game
sunny:Today

8) For the input file hex.txt, replace all occurrences of 0xA0 with 0x50 and 0xFF with 0x7F.

$ cat hex.txt
start: 0xA0, func1: 0xA0
end: 0xFF, func2: 0xB0
restart: 0xA010, func3: 0x7F

##### add your solution here
start: 0x50, func1: 0x50
end: 0x7F, func2: 0xB0
restart: 0x5010, func3: 0x7F

9) For the input file text.txt, replace all occurrences of in with an and write back the changes to text.txt itself. The original contents should get saved to text.txt.orig.

$ cat text.txt
can ran want plant
tin fin fit mine line

##### add your solution here

$ cat text.txt
can ran want plant
tan fan fit mane lane
$ cat text.txt.orig
can ran want plant
tin fin fit mine line

10) For the input file text.txt, replace all occurrences of an with in and write back the changes to text.txt itself. Do not create backups for this exercise. Note that you should have solved the previous exercise before starting this one.

$ cat text.txt
can ran want plant
tan fan fit mane lane

##### add your solution here

$ cat text.txt
cin rin wint plint
tin fin fit mine line
$ diff text.txt text.txt.orig
1c1
< cin rin wint plint
---
> can ran want plant

11) Find the starting index of the first occurrence of is or the or was or to for each input line of the file idx.txt. Assume that every input line will match at least one of these terms.

$ cat idx.txt
match after the last newline character
and then you want to test
this is good bye then
you were there to see?

##### add your solution here
12
4
2
9

12) Display all lines containing [4]* for the given stdin data.

$ printf '2.3/[4]*6\n2[4]5\n5.3-[4]*9\n' | ##### add your solution here
2.3/[4]*6
5.3-[4]*9

13) For the given input string, change all lowercase letters to x only for words starting with m.

$ s='ma2T3a a2p kite e2e3m meet'

$ echo "$s" | ##### add your solution here
xx2T3x a2p kite e2e3m xxxx

14) For the input file ip.txt, delete all characters other than lowercase vowels and the newline character. Perform this transformation only between a line containing you up to line number 4 (inclusive).

##### add your solution here
Hello World
oaeou
iaeioo
oaiu
12345
You are funny

15) For the input file sample.txt, display from the start of the file till the first occurrence of are, excluding the matching line.

$ cat sample.txt
Hello World

Good day
How are you

Just do-it
Believe it

Today is sunny
Not a bit funny
No doubt you like it too

Much ado about nothing
He he he

##### add your solution here
Hello World

Good day

16) For the input file sample.txt, display from the last occurrence of do till the end of the file.

##### add your solution here
Much ado about nothing
He he he

17) For the input file sample.txt, display from the 9th line till a line containing you.

##### add your solution here
Today is sunny
Not a bit funny
No doubt you like it too

18) Display only the odd numbered lines from ip.txt.

##### add your solution here
Hello World
This game is good
12345

19) For the table.txt file, print only the line number for lines containing air or win.

$ cat table.txt
brown bread mat hair 42
blue cake mug shirt -7
yellow banana window shoes 3.14

##### add your solution here
1
3

20) For the input file table.txt, calculate the sum of numbers in the last column, excluding the second line.

##### add your solution here
45.14

21) Print the second and fourth line for every block of five lines.

$ seq 15 | ##### add your solution here
2
4
7
9
12
14

22) For the input file ip.txt, display all lines containing e or u but not both.

##### add your solution here
Hello World
This game is good
Today is sunny