Line processing

Now that you are familiar with basic Ruby CLI usage, this chapter will dive deeper into line processing examples. You'll learn various ways to match lines based on regular expressions, fixed strings, line numbers, etc. You'll also see how to group multiple statements and learn about the control flow keywords next and exit.

Regexp based filtering

As mentioned before, in a conditional context:

  • /regexp/ is a shortcut for $_ =~ /regexp/
  • !/regexp/ is a shortcut for $_ !~ /regexp/

However, this shortcut does not apply to every kind of expression. For example:

$ # /at$/ will be 'true' as it is treated as just a Regexp object here
$ printf 'gate\napple\nwhat\n' | ruby -ne '/at$/ && print'
gate
apple
what

$ # same as: ruby -ne 'print if /at$/'
$ printf 'gate\napple\nwhat\n' | ruby -ne '$_ =~ /at$/ && print'
what
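
The same distinction can be reproduced in a standalone script. The following is only a sketch: the method names are made up for illustration, and the regexp is kept in a variable so that it is never treated as the $_ shortcut.

```ruby
LINES = %w[gate apple what]

# A Regexp held in a variable is just an object, and every object
# other than nil and false is truthy, so nothing is filtered out.
def keep_if_regexp_object(lines, re)
  lines.select { |_line| re ? true : false }
end

# =~ returns the index of the match or nil, so this really tests each line.
def keep_if_match(lines, re)
  lines.select { |line| line =~ re }
end

p keep_if_regexp_object(LINES, /at$/)  # ["gate", "apple", "what"]
p keep_if_match(LINES, /at$/)          # ["what"]
```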

If required, you can also use different delimiters with %r. Quoting from ruby-doc: Percent Strings:

If you are using (, [, {, < you must close it with ), ], }, > respectively. You may use most other non-alphanumeric characters for percent string delimiters such as %, |, ^, etc.

$ cat paths.txt
/foo/a/report.log
/foo/y/power.log
/foo/abc/errors.log

$ ruby -ne 'print if /\/foo\/a\//' paths.txt
/foo/a/report.log

$ ruby -ne 'print if %r{/foo/a/}' paths.txt
/foo/a/report.log

$ ruby -ne 'print if !%r#/foo/a/#' paths.txt
/foo/y/power.log
/foo/abc/errors.log
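
As a rough script-form counterpart, the sketch below filters the same three paths with Array#grep and Array#grep_v. The method names are hypothetical and the paths are copied from paths.txt above.

```ruby
PATHS = %w[/foo/a/report.log /foo/y/power.log /foo/abc/errors.log]

# %r{...} avoids escaping the / characters inside the pattern.
def under_foo_a(paths)
  paths.grep(%r{/foo/a/})
end

# grep_v keeps the elements that do NOT match, similar to the ! form above.
def outside_foo_a(paths)
  paths.grep_v(%r{/foo/a/})
end

puts under_foo_a(PATHS)
puts outside_foo_a(PATHS)
```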

Extracting matched portions

You can use regexp related global variables to extract only the matching portions instead of printing the entire matching line. Consider this input file.

$ cat programming_quotes.txt
Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it by Brian W. Kernighan

Some people, when confronted with a problem, think - I know, I will
use regular expressions. Now they have two problems by Jamie Zawinski

A language that does not affect the way you think about programming,
is not worth knowing by Alan Perlis

There are 2 hard problems in computer science: cache invalidation,
naming things, and off-by-1 errors by Leon Bambrick

Here are some examples with regexp global variables.

$ # note that this will print only the first match for each input line
$ ruby -ne 'puts $& if /\bt\w*[et]\b/' programming_quotes.txt
twice
the
that

$ # extract only capture group portions
$ ruby -ne 'puts $~.captures * "::" if /not (.+)y(.+)/i' programming_quotes.txt
smart enough to debug it b:: Brian W. Kernighan
affect the way ::ou think about programming,
worth knowing b:: Alan Perlis
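
If you prefer to avoid global variables in a longer script, the same extraction can be sketched with String#[] and MatchData#captures. The helper names below are illustrative only.

```ruby
# Return the first matching portion, like $& in the one-liner (nil if no match).
def first_match(line, re)
  line[re]
end

# Join all capture groups, like $~.captures * "::" (nil if no match).
def joined_captures(line, re, sep = "::")
  m = line.match(re)
  m && m.captures.join(sep)
end

puts first_match("writing the code in the first place.", /\bt\w*[et]\b/)
puts joined_captures("is not worth knowing by Alan Perlis", /not (.+)y(.+)/i)
```

Both helpers return nil when there is no match, so the caller decides what to do with non-matching lines.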

Info: See the Working with matched portions chapter from my book for examples with the match method and regexp global variables.

match? method

As seen in the previous section, using $_ =~ /regexp/ also sets global variables. If you just need a true or false result, the match? method is better suited for performance reasons. The difference would be more visible for large input files.

$ # same result as: ruby -ne 'print if /on\b/'
$ ruby -ne 'print if $_.match?(/on\b/)' programming_quotes.txt
by definition, not smart enough to debug it by Brian W. Kernighan
There are 2 hard problems in computer science: cache invalidation,
naming things, and off-by-1 errors by Leon Bambrick
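
The difference in side effects can be observed directly. This sketch uses a hypothetical helper that runs either form and then reports what $~ holds.

```ruby
# Run either match? or =~ and report what $~ holds afterwards.
# $~ is local to the method, so each call starts with it unset.
def last_match_after(line, predicate_only)
  if predicate_only
    line.match?(/on\b/)
  else
    line =~ /on\b/
  end
  $~
end

p last_match_after("by Leon Bambrick", true)      # nil
p last_match_after("by Leon Bambrick", false)[0]  # "on"
```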

tr method

The transliteration method tr allows you to specify a per-character transformation rule. See ruby-doc: tr for documentation.

$ # rot13
$ echo 'Uryyb Jbeyq' | ruby -pe '$_.tr!("a-zA-Z", "n-za-mN-ZA-M")'
Hello World

$ # use ^ at start of first argument to complement specified characters
$ echo 'foo:123:baz' | ruby -pe '$_.tr!("^0-9\n", "-")'
----123----

$ # use empty second argument to delete specified characters
$ echo 'foo:123:baz' | ruby -pe '$_.tr!("^0-9\n", "")'
123

$ # if second list is shorter than number of characters in the first list,
$ # the last character in the second list will be used to fill the gaps
$ s='orange apple appleseed cab'
$ echo "$s" | ruby -pe 'gsub(/\b(?!apple\b)\w++/) {$&.tr("a-z", "1-9")}'
991975 apple 199959554 312
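
For reference, here is a small script-form sketch of the same tr rules. The method names are made up for illustration, and the newline is left out of the character sets because the strings here have no line ending.

```ruby
# ROT13: each letter maps to the one 13 places later, wrapping around.
def rot13(text)
  text.tr("a-zA-Z", "n-za-mN-ZA-M")
end

# A leading ^ complements the set of characters to be replaced.
def mask_non_digits(text)
  text.tr("^0-9", "-")
end

# An empty second argument deletes the selected characters.
def only_digits(text)
  text.tr("^0-9", "")
end

puts rot13("Uryyb Jbeyq")            # Hello World
puts mask_non_digits("foo:123:baz")  # ----123----
puts only_digits("foo:123:baz")      # 123
puts "cab".tr("a-z", "1-9")          # 312
```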

Conditional substitution

These examples combine line filtering and substitution in different ways. As noted before, sub and gsub Kernel methods update $_ if substitution succeeds and always return the value of $_.

$ # change commas to hyphens if the input line does NOT contain '2'
$ # prints all input lines even if substitution fails
$ printf '1,2,3,4\na,b,c,d\n' | ruby -pe 'gsub(/,/, "-") if !/2/'
1,2,3,4
a-b-c-d

$ # prints filtered input lines even if substitution fails
$ # for example, the 2nd output line doesn't match 'by'
$ ruby -ne 'print gsub(/by/, "**") if /not/' programming_quotes.txt
** definition, not smart enough to debug it ** Brian W. Kernighan
A language that does not affect the way you think about programming,
is not worth knowing ** Alan Perlis

$ # print only if substitution succeeded
$ # $_.gsub! is required for this scenario
$ ruby -ne 'print if $_.gsub!(/1/, "one")' programming_quotes.txt
naming things, and off-by-one errors by Leon Bambrick
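
The last example relies on the different return values of gsub and gsub!. A minimal sketch of that difference follows, with hypothetical helper names.

```ruby
# gsub always returns a string, whether or not anything was replaced.
def replaced_copy(line, re, repl)
  line.gsub(re, repl)
end

# gsub! returns nil when nothing was replaced, so it doubles as a test.
# A copy is modified here so that the caller's string is left untouched.
def replaced_or_nil(line, re, repl)
  copy = line.dup
  copy.gsub!(re, repl) ? copy : nil
end

p replaced_copy("a,b", /1/, "one")                # "a,b"
p replaced_or_nil("a,b", /1/, "one")              # nil
p replaced_or_nil("off-by-1 errors", /1/, "one")  # "off-by-one errors"
```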

Multiple conditions

It is good to remember that Ruby is a programming language. You have control structures and you can combine multiple conditions using logical operators, methods like all?, any?, etc. You don't have to create a single complex regexp.

$ ruby -ne 'print if /not/ && !/it/' programming_quotes.txt
A language that does not affect the way you think about programming,
is not worth knowing by Alan Perlis

$ ruby -ane 'print if /twice/ || $F.size > 12' programming_quotes.txt
Debugging is twice as hard as writing the code in the first place.
Some people, when confronted with a problem, think - I know, I will

next

When next is executed, the rest of the code will be skipped and the next input line will be fetched for processing. It doesn't affect BEGIN or END blocks as they are outside the file content loop.

$ ruby -ne '(puts "%% #{$_}"; next) if /\bpar/;
            puts /s/ ? "X" : "Y"' word_anchors.txt
%% sub par
X
Y
X
%% cart part tart mart

Note that () is used in the above example to group multiple statements to be executed for a single if condition. You'll see many more examples with next in coming chapters.
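
Written as an explicit loop, the control flow looks roughly like this. It is a sketch only: the method name is hypothetical and the sample lines are assumed stand-ins, since the contents of word_anchors.txt are not listed here.

```ruby
# Tag lines that have a word starting with "par"; classify the rest.
def tag_lines(lines)
  out = []
  lines.each do |line|
    if line =~ /\bpar/
      out << "%% #{line}"
      next  # skip the classification below for this line
    end
    out << (line =~ /s/ ? "X" : "Y")
  end
  out
end

# assumed sample input
sample = ["sub par", "spar", "apparent effort",
          "two spare computers", "cart part tart mart"]
puts tag_lines(sample)
```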

exit

Using the exit method will cause the Ruby script to terminate immediately. This is useful to avoid processing unnecessary input content after a termination condition is reached.

$ # quits after an input line containing 'you' is found
$ ruby -ne 'print; exit if /you/' programming_quotes.txt
Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,

$ # matching line won't be printed in this case
$ ruby -pe 'exit if /you/' programming_quotes.txt
Debugging is twice as hard as writing the code in the first place.

Use tac to get all lines starting from the last occurrence of the search string in the entire file.

$ tac programming_quotes.txt | ruby -ne 'print; exit if /not/' | tac
is not worth knowing by Alan Perlis

There are 2 hard problems in computer science: cache invalidation,
naming things, and off-by-1 errors by Leon Bambrick

You can optionally provide a status code along with the exit method.

$ printf 'sea\neat\ndrop\n' | ruby -ne 'print; exit(2) if /at/'
sea
eat
$ echo $?
2
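
Inside a script, Kernel#exit works by raising a SystemExit exception that carries the status code. The sketch below uses a hypothetical helper to capture that status instead of letting the script terminate.

```ruby
# Run a block and report the status it tried to exit with
# (nil if the block finished without calling exit).
def exit_status_of
  yield
  nil
rescue SystemExit => e
  e.status
end

p exit_status_of { exit(2) }  # 2
p exit_status_of { exit }     # 0
p exit_status_of { :done }    # nil
```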

Any code in the END block will still be executed before exiting. This doesn't apply if exit was called from the BEGIN block.

$ ruby -pe 'exit if /cake/' table.txt
brown bread mat hair 42
$ ruby -pe 'exit if /cake/; END{puts "bye"}' table.txt
brown bread mat hair 42
bye

$ ruby -pe 'BEGIN{puts "hi"; exit; puts "hello"}; END{puts "bye"}' table.txt
hi

Warning: Be careful if you want to use exit with multiple input files, as ruby will stop even if there are other files remaining to be processed.

Line number based processing

Line numbers can also be used as a filtering criterion. The current line number can be accessed using the $. global variable.

$ # print only the 3rd line
$ ruby -ne 'print if $. == 3' programming_quotes.txt
by definition, not smart enough to debug it by Brian W. Kernighan

$ # print 2nd and 5th line
$ ruby -ne 'print if $. == 2 || $. == 5' programming_quotes.txt
Therefore, if you write the code as cleverly as possible, you are,
Some people, when confronted with a problem, think - I know, I will

$ # transliterate only 2nd line
$ printf 'gates\nnot\nused\n' | ruby -pe '$_.tr!("a-z", "*") if $. == 2'
gates
***
used

$ # selecting from particular line number to end of input
$ seq 14 25 | ruby -ne 'print if $. >= 10'
23
24
25
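
In a script without the -n loop, a similar selection can be sketched by pairing each line with a 1-based counter. The method name is an assumption made for this example.

```ruby
# Select lines by 1-based line number; `wanted` can be an Array or a Range.
def lines_numbered(text, wanted)
  text.each_line.with_index(1)
      .select { |_line, num| wanted.include?(num) }
      .map(&:first)
end

input = (14..25).map { |n| "#{n}\n" }.join
print lines_numbered(input, [2, 5]).join                 # 15 and 18
print lines_numbered(input, 10..Float::INFINITY).join    # 23, 24 and 25
```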

The global variable $< contains the file handle for the input file currently being processed. Use the eof method to process lines based on the end-of-file condition. See ruby-doc: eof for documentation. You can also use ARGF instead of $< here, see the ARGV and ARGF section for details.

$ # same as: tail -n1 programming_quotes.txt
$ ruby -ne 'print if $<.eof' programming_quotes.txt
naming things, and off-by-1 errors by Leon Bambrick

$ ruby -ne 'puts "#{$.}:#{$_}" if $<.eof' programming_quotes.txt
12:naming things, and off-by-1 errors by Leon Bambrick

$ # multiple file example
$ # same as: tail -q -n1 programming_quotes.txt table.txt
$ ruby -ne 'print if $<.eof' programming_quotes.txt table.txt
naming things, and off-by-1 errors by Leon Bambrick
yellow banana window shoes 3.14
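
The per-file behavior can be imitated with any IO-like objects. This sketch uses StringIO stand-ins instead of real files, and the helper name is hypothetical.

```ruby
require "stringio"

# Collect the last line of each input by checking eof? after every read
# (similar in spirit to $<.eof, which is true at the end of each file).
def last_lines(ios)
  ios.flat_map { |io| io.each_line.select { io.eof? } }
end

first = StringIO.new("one\ntwo\n")
second = StringIO.new("three\nfour\n")
print last_lines([first, second]).join   # two and four
```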

For large input files, use the exit method to avoid processing unnecessary input lines.

$ seq 3542 4623452 | ruby -ne '(print; exit) if $. == 2452'
5993
$ seq 3542 4623452 | ruby -ne 'print if $. == 250; (print; exit) if $. == 2452'
3791
5993

$ # here is a sample time comparison
$ time seq 3542 4623452 | ruby -ne '(print; exit) if $. == 2452' > f1
real    0m0.068s
$ time seq 3542 4623452 | ruby -ne 'print if $. == 2452' > f2
real    0m1.158s
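
The same early-termination idea applies inside a script, where returning from the loop plays the role of exit. A minimal sketch, assuming a hypothetical nth_line helper and a StringIO in place of the piped input:

```ruby
require "stringio"

# Return the n-th line (1-based) and stop reading as soon as it is found.
def nth_line(io, n)
  io.each_line.with_index(1) { |line, num| return line if num == n }
  nil
end

io = StringIO.new((3542..3600).map { |v| "#{v}\n" }.join)
print nth_line(io, 3)   # 3544
p io.eof?               # false, the remaining lines were never read
```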

Flip-Flop operator

You can use the Flip-Flop operator to select the lines between a pair of matching conditions, such as line numbers and regexps. See ruby-doc: Flip-Flop for syntax details.

$ # the range is automatically compared against $. in this context
$ seq 14 25 | ruby -ne 'print if 3..5'
16
17
18

$ # 'print if 3...5' gives same result as above,
$ # you can use include? method to exclude the end range
$ seq 14 25 | ruby -ne 'print if (3...5).include?($.)'
16
17

$ # the range is automatically compared against $_ in this context
$ # note that all the matching ranges are printed
$ ruby -ne 'print if /are/../by/' programming_quotes.txt
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it by Brian W. Kernighan
There are 2 hard problems in computer science: cache invalidation,
naming things, and off-by-1 errors by Leon Bambrick

Info: See the Records bounded by distinct markers section for an alternate, flexible solution.

You can also mix line number and regexp conditions.

$ ruby -ne 'print if 5../use/' programming_quotes.txt
Some people, when confronted with a problem, think - I know, I will
use regular expressions. Now they have two problems by Jamie Zawinski

$ # same logic as: ruby -pe 'exit if /ll/'
$ # inefficient, but this will work for multiple file inputs
$ ruby -ne 'print if !(/ll/..$<.eof)' programming_quotes.txt table.txt
Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it by Brian W. Kernighan

brown bread mat hair 42
blue cake mug shirt -7

Warning: Both conditions can match the same line too! Also, if the second condition never matches, all lines from the one matching the first condition to the last line of the input will be selected.

$ # 'worth' matches the 9th line
$ ruby -ne 'print if 9../worth/' programming_quotes.txt
is not worth knowing by Alan Perlis

$ # there's a line containing 'affect' but no line matches the second condition
$ # so, all lines till the end of input are printed
$ ruby -ne 'print if /affect/../XYZ/' programming_quotes.txt
A language that does not affect the way you think about programming,
is not worth knowing by Alan Perlis

There are 2 hard problems in computer science: cache invalidation,
naming things, and off-by-1 errors by Leon Bambrick
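
The Flip-Flop operator is also available in regular scripts, as long as the range appears in a conditional context. Here is a sketch with explicit comparisons instead of the implicit $_ matching. The method name and the shortened sample lines are assumptions.

```ruby
# Select lines from one matching start_re up to one matching end_re.
# The `..` inside the `if` condition is the flip-flop, not a Range.
def between(lines, start_re, end_re)
  picked = []
  lines.each do |line|
    picked << line if (line =~ start_re)..(line =~ end_re)
  end
  picked
end

sample = ["you are,", "by definition", "skipped", "There are 2", "off-by-1"]
p between(sample, /are/, /by/)
# both conditions matching the same line selects just that line
p between(["is not worth"], /not/, /worth/)
# no line matches the end condition, so everything till the end is selected
p between(["does not affect", "the end"], /affect/, /XYZ/)
```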

Working with fixed strings

To match strings literally, use the include? method for line filtering and a string argument instead of a regexp for substitutions.

$ echo 'int a[5]' | ruby -ne 'print if /a[5]/'
$ echo 'int a[5]' | ruby -ne 'print if $_.include?("a[5]")'
int a[5]

$ echo 'int a[5]' | ruby -pe 'sub(/a[5]/, "b")'
int a[5]
$ echo 'int a[5]' | ruby -pe 'sub("a[5]", "b")'
int b
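
The contrast between the two forms can be checked in a script as well. The helper names in this sketch are illustrative.

```ruby
# In a regexp, [5] is a character class, so /a[5]/ looks for "a5".
def regexp_hit?(line)
  line.match?(/a[5]/)
end

# include? and a String first argument to sub treat a[5] literally.
def literal_hit?(line)
  line.include?("a[5]")
end

def literal_sub(line)
  line.sub("a[5]", "b")
end

p regexp_hit?("int a[5]")   # false
p literal_hit?("int a[5]")  # true
p literal_sub("int a[5]")   # "int b"
```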

The above example uses double quotes for the string argument, which allows escape sequences like \t, \n, etc and interpolation with #{}. This isn't the case with single quoted string values. Using single quotes within the script from command line requires messing with shell metacharacters. So, use %q instead or pass the fixed string to be matched as an environment variable, which can be accessed via the ENV hash.

$ # double quotes allow escape sequences and interpolation
$ ruby -e 'a=5; puts "value of a:\t#{a}"'
value of a:     5

$ # use %q as an alternate to specify single quoted string
$ echo 'int #{a}' | ruby -ne 'print if $_.include?(%q/#{a}/)'
int #{a}
$ echo 'int #{a}' | ruby -pe 'sub(%q/#{a}/, "b")'
int b

$ # or pass the string as environment variable
$ echo 'int #{a}' | s='#{a}' ruby -ne 'print if $_.include?(ENV["s"])'
int #{a}
$ # \\ is special within single quotes, so ENV is the better choice here
$ echo 'int #{a\\}' | s='#{a\\}' ruby -pe 'sub(ENV["s"], "b")'
int b

Use the start_with? and end_with? methods to restrict the fixed string matching to the start or end of the input line. The line content in the $_ variable contains the \n line ending character as well. You can either use the chomp method explicitly or use the -l command line option, which will be discussed in detail in the Record separators chapter. For now, it is enough to know that -l will remove the line ending from $_ and add it back when print is used.

$ cat eqns.txt
a=b,a-b=c,c*d
a+b,pi=3.14,5e12
i*(t+9-g)/8,4-a+b

$ # start of line
$ s='a+b' ruby -ne 'print if $_.start_with?(ENV["s"])' eqns.txt
a+b,pi=3.14,5e12

$ # end of line
$ # -l option is needed here to remove \n from $_
$ s='a+b' ruby -lne 'print if $_.end_with?(ENV["s"])' eqns.txt
i*(t+9-g)/8,4-a+b
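
Here is a small sketch of the explicit chomp alternative mentioned above, using a hypothetical helper name.

```ruby
# Lines read from a file keep their trailing "\n", which gets in the
# way of end_with?; chomp removes it before the comparison.
def ends_with_literal?(line, suffix)
  line.chomp.end_with?(suffix)
end

line = "i*(t+9-g)/8,4-a+b\n"
p line.end_with?("a+b")                     # false
p ends_with_literal?(line, "a+b")           # true
p "a+b,pi=3.14,5e12\n".start_with?("a+b")   # true
```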

Use index method if you need more control over the location of the matching strings. You can use either the return value (which gives you the index of the matching string) or use the optional second argument to specify an offset to start searching. See ruby-doc: index for details.

$ # same as: $_.include?("a+b")
$ ruby -ne 'print if $_.index("a+b")' eqns.txt
a+b,pi=3.14,5e12
i*(t+9-g)/8,4-a+b

$ # same as: $_.start_with?("a+b")
$ ruby -ne 'print if $_.index("a+b")==0' eqns.txt
a+b,pi=3.14,5e12

$ # since 'index' returns 'nil' if there's no match,
$ # you need some more processing for < or <= numeric comparison
$ ruby -ne '$i = $_.index("="); print if $i && $i < 6' eqns.txt
a=b,a-b=c,c*d

$ # for > or >= comparison, use the optional second argument
$ s='a+b' ruby -ne 'print if $_.index(ENV["s"], 1)' eqns.txt
i*(t+9-g)/8,4-a+b
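
The nil handling can be wrapped in small helpers for script use. The following is a sketch with assumed method names, using lines from eqns.txt as test data.

```ruby
# True if `needle` occurs and starts before position `limit`.
# index returns nil when there is no match, hence the guard.
def starts_before?(line, needle, limit)
  pos = line.index(needle)
  !pos.nil? && pos < limit
end

# True if `needle` occurs at or after position `offset`.
def found_from?(line, needle, offset)
  !line.index(needle, offset).nil?
end

p starts_before?("a=b,a-b=c,c*d", "=", 6)      # true
p starts_before?("a+b,pi=3.14,5e12", "=", 6)   # false
p found_from?("a+b,pi=3.14,5e12", "a+b", 1)    # false
p found_from?("i*(t+9-g)/8,4-a+b", "a+b", 1)   # true
```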

If you need to match the entire input line or field, you can use comparison operators.

$ printf 'a.b\na+b\n' | ruby -lne 'print if /^a.b$/'
a.b
a+b
$ printf 'a.b\na+b\n' | ruby -lne 'print if $_ == %q/a.b/'
a.b

$ printf '1 a.b\n2 a+b\n' | ruby -lane 'print if $F[1] != %q/a.b/'
2 a+b

To provide a fixed string in the replacement section, an environment variable comes in handy again. But you have to replace every \ character in the environment variable with \\ before using it as the replacement string.

$ # the \ character is special in the replacement section
$ # and \ is special within double quotes too
$ echo 'x+y' | ruby -pe 'sub(%q/x+y/, "x\y\\0z")'
xyx+yz

$ # \ in value passed via environment variable is still special 
$ echo 'x+y' | r='x\y\\0z' ruby -pe 'sub(%q/x+y/, ENV["r"])'
x\y\0z
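$ # have to preprocess the value by replacing \ with \\
$ # "\\\\\\0" is the 4-character replacement \\\0 (a literal \ followed by the match)
$ # "\\\0" would instead be a \ followed by a NUL character
$ echo 'x+y' | r='x\y\\0z' ruby -pe 'sub(%q/x+y/, ENV["r"].gsub(/\\/, "\\\\\\0"))'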
x\y\\0z

$ # can't use %q strings for all cases as \\ is special
$ ruby -e 'puts %q/x\y\\0z/'
x\y\0z
$ echo 'x+y' | ruby -pe 'sub(%q/x+y/, %q/x\y\\0z/.gsub(/\\/, "\\\\\\0"))'
x\y\0z
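
In a script file, where shell quoting is not a concern, the preprocessing step might be sketched as follows. The helper name is hypothetical. The last line shows the block form of sub as a possible alternative, since the value returned by the block is inserted without any backslash processing.

```ruby
# Double every backslash so that the text survives being used as a
# replacement string. The block form of gsub inserts its result as-is.
def escape_replacement(text)
  text.gsub("\\") { "\\\\" }
end

raw = 'x\y\\\\0z'   # the 7 characters x\y\\0z
puts "x+y".sub("x+y", raw)                       # x\y\0z
puts "x+y".sub("x+y", escape_replacement(raw))   # x\y\\0z
puts "x+y".sub("x+y") { raw }                    # x\y\\0z
```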

In-place file editing

You can use the -i option to write the changes back to the input file instead of displaying the output on the terminal. When an extension is provided as an argument to -i, the original contents of the input file are preserved in a backup file named with that extension. For example, if the input file is ip.txt and -i.orig is used, ip.txt.orig will be the backup filename.

$ cat colors.txt
deep blue
light orange
blue delight

$ # no output on terminal as -i option is used
$ # space is NOT allowed between -i and the extension
$ ruby -i.bkp -pe 'sub(/blue/, "green")' colors.txt
$ # changes are written back to 'colors.txt'
$ cat colors.txt
deep green
light orange
green delight

$ # original file is preserved in 'colors.txt.bkp'
$ cat colors.txt.bkp
deep blue
light orange
blue delight

Multiple input files are treated individually and the changes are written back to respective files.

$ cat t1.txt
have a nice day
bad morning
what a pleasant evening
$ cat t2.txt
worse than ever
too bad

$ ruby -i.bkp -pe 'sub(/bad/, "good")' t1.txt t2.txt
$ ls t?.*
t1.txt  t1.txt.bkp  t2.txt  t2.txt.bkp

$ cat t1.txt
have a nice day
good morning
what a pleasant evening
$ cat t2.txt
worse than ever
too good

Sometimes backups are not desirable. Using -i option on its own will not create backups. Be careful though, as changes made cannot be undone. In such cases, test the command with sample input before using -i option on actual file. You could also use the option with backup, compare the differences with a diff program and then delete the backup.

$ cat fruits.txt
banana
papaya
mango

$ ruby -i -pe 'gsub(/an/, "AN")' fruits.txt
$ cat fruits.txt
bANANa
papaya
mANgo
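
To make the backup behavior concrete, here is a rough script-level imitation of in-place editing. It is a sketch under stated assumptions, not how ruby implements -i: the edit_in_place helper is hypothetical, and it reads the whole file into memory before writing it back.

```ruby
require "fileutils"
require "tmpdir"

# Rewrite `path` with each line passed through the block. When
# `backup_ext` is given, the original is first copied to path + backup_ext.
def edit_in_place(path, backup_ext = nil)
  FileUtils.cp(path, path + backup_ext) if backup_ext
  updated = File.readlines(path).map { |line| yield line }
  File.write(path, updated.join)
end

# work in a temporary directory so that no real file is touched
Dir.mktmpdir do |dir|
  file = File.join(dir, "colors.txt")
  File.write(file, "deep blue\nlight orange\nblue delight\n")
  edit_in_place(file, ".bkp") { |line| line.sub(/blue/, "green") }
  print File.read(file)            # edited content
  print File.read(file + ".bkp")   # original content
end
```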

Summary

This chapter showed various examples of processing only the lines of interest instead of the entire input file. Filtering can be specified using a regexp, fixed string, line number or a combination of them. You also saw how to combine multiple statements using () for compact CLI usage. next and exit are often needed to control the flow of code. The -i option is handy for in-place editing.

Exercises

a) Remove only the third line of given input.

$ seq 34 37 | ##### add your solution here
34
35
37

b) Display only fourth, fifth, sixth and seventh lines for the given input.

$ seq 65 78 | ##### add your solution here
68
69
70
71

c) For the input file ip.txt, replace all occurrences of are with are not and is with is not only from line number 4 till end of file. Also, only the lines that were changed should be displayed in the output.

$ cat ip.txt
Hello World
How are you
This game is good
Today is sunny
12345
You are funny

##### add your solution here
Today is not sunny
You are not funny

d) For the given stdin, display only the first three lines. Avoid processing lines that are not relevant.

$ seq 14 25 | ##### add your solution here
14
15
16

e) For the input file ip.txt, display all lines from start of the file till the first occurrence of game.

##### add your solution here
Hello World
How are you
This game is good

f) For the input file ip.txt, display all lines that contain is but not good.

##### add your solution here
Today is sunny

g) For the input file ip.txt, extract the word before the whole word is as well as the word after it. If such a match is found, display the two words around is in reversed order. For example, hi;1 is--234 bye should be converted to 234:1. Assume that whole word is will not be present more than once in a single line.

##### add your solution here
good:game
sunny:Today

h) For the given input string, replace 0xA0 with 0x7F and 0xC0 with 0x1F.

$ s='start address: 0xA0, func1 address: 0xC0'

$ echo "$s" | ##### add your solution here
start address: 0x7F, func1 address: 0x1F

i) For the input file text.txt, replace all occurrences of in with an and write back the changes to text.txt itself. The original contents should get saved to text.txt.orig

$ cat text.txt
can ran want plant
tin fin fit mine line
##### add your solution here

$ cat text.txt
can ran want plant
tan fan fit mane lane
$ cat text.txt.orig
can ran want plant
tin fin fit mine line

j) For the input file text.txt, replace all occurrences of an with in and write back the changes to text.txt itself. Do not create backups for this exercise. Note that you should have solved the previous exercise before starting this one.

$ cat text.txt
can ran want plant
tan fan fit mane lane
##### add your solution here

$ cat text.txt
cin rin wint plint
tin fin fit mine line
$ diff text.txt text.txt.orig
1c1
< cin rin wint plint
---
> can ran want plant

k) Find the starting index of first occurrence of is or the or was or to for each input line of the file idx.txt. Assume all input lines will match at least one of these terms.

$ cat idx.txt
match after the last newline character
and then you want to test
this is good bye then
you were there to see?

##### add your solution here
12
4
2
9

l) Display all lines containing [4]* for the given stdin data.

$ printf '2.3/[4]*6\n2[4]5\n5.3-[4]*9\n' | ##### add your solution here
2.3/[4]*6
5.3-[4]*9

m) For the given input string, replace all lowercase letters with x only for words starting with m.

$ s='ma2T3a a2p kite e2e3m meet'

$ echo "$s" | ##### add your solution here
xx2T3x a2p kite e2e3m xxxx

n) For the input file ip.txt, delete all characters other than lowercase vowels and newline character. Perform this transformation only between a line containing you up to line number 4 (inclusive).

##### add your solution here
Hello World
oaeou
iaeioo
oaiu
12345
You are funny