Processing multiple records
Often, you need to consider multiple lines at a time to make a decision, as in the paragraph mode examples seen earlier. Sometimes, you need to match a particular record and then get the records surrounding it. Solutions to these types of problems often use state machines. See softwareengineering: FSM examples if you are not familiar with state machines.
The example_files directory has all the files used in the examples.
Processing consecutive records
You might need to define a condition that checks one thing for a record and something else for the very next record. There are many ways to tackle this problem. One possible solution is to save the previous record in a variable and then build the required conditional expression using that variable and $_, which holds the current record.
# match and print two consecutive records
# first record should contain 'he' and second record should contain 'you'
# note that this will not work if you use a normal variable instead of $p
$ ruby -ne 'puts $p, $_ if /you/ && $p=~/he/; $p = $_' para.txt
Hi there
How are you
# same filtering as above, but print only the first record
$ ruby -ne 'puts $p if /you/ && $p=~/he/; $p = $_' para.txt
Hi there
# same filtering as above, but print only the second record
$ ruby -ne 'print if /you/ && $p=~/he/; $p = $_' para.txt
How are you
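The same pairing logic can also be written as a standalone script. The following is a minimal sketch rather than a replacement for the one-liners above: it assumes the records are already in an array and walks adjacent pairs with Enumerable#each_cons. The helper name consecutive_matches and the inline sample data are made up for illustration.

```ruby
# Hypothetical helper (not from the text above): returns every pair of
# adjacent records where the first matches first_re and the second
# matches second_re.
def consecutive_matches(lines, first_re, second_re)
  lines.each_cons(2).select do |prev, curr|
    prev.match?(first_re) && curr.match?(second_re)
  end
end

# Inline sample data, invented for this sketch
lines = ["Hello World\n", "\n", "Hi there\n", "How are you\n", "\n", "Just do-it\n"]

# Print both records of every matching pair
consecutive_matches(lines, /he/, /you/).each { |prev, curr| puts prev, curr }
```

Since each result is a two-element array, printing only the first or only the second record is a matter of using just prev or just curr.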
Context matching
Sometimes you want not just the matching records, but the records relative to the matches as well. For example, it could be to see the comments at the start of a function block that was matched while searching a program file. Or, it could be to see extended information from a log file while searching for a particular error message.
Consider this sample input file:
$ cat context.txt
blue
toy
flower
sand stone
light blue
flower
sky
water
language
english
hindi
spanish
tamil
programming language
python
kotlin
ruby
Case 1: Here's an example that emulates grep --no-group-separator -A<n> functionality. The $n && $n>=0 && $n-=1 trick used in the example below works like this:
- If $n hasn't been initialized yet, the expression becomes false and thus prevents nil being compared to a number
- If $n=1
  - 1>=0 && 0 --> evaluates to true and $n becomes 0
  - 0>=0 && -1 --> evaluates to true and $n becomes -1
  - -1>=0 && --> evaluates to false and $n no longer changes
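The steps above can be traced with a short script. This is only a sketch: the helper name trace_counter is hypothetical, a local variable stands in for $n, and the decrement is parenthesized for readability.

```ruby
# Hypothetical helper: evaluates the condition four times in a row and
# records [value of the condition, value of n afterwards] for each step.
def trace_counter(start)
  n = start
  4.times.map do
    result = n && n >= 0 && (n -= 1)
    [result, n]
  end
end

p trace_counter(1)    # counter set to 1 by a match
p trace_counter(nil)  # counter never initialized
```

With a starting value of 1, the condition yields 0 and then -1, both of which are truthy in Ruby, and false from then on. With nil, the condition short-circuits to nil every time and the counter is never touched.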
# same as: grep --no-group-separator -A1 'blue'
# print the matching line as well as the one that follows it
$ ruby -ne '$n=1 if /blue/; print if $n && $n>=0 && $n-=1' context.txt
blue
toy
light blue
flower
# for overlapping cases, $n gets re-assigned before $n becomes negative
$ ruby -ne '$n=1 if /toy|flower/; print if $n && $n>=0 && $n-=1' context.txt
toy
flower
sand stone
flower
sky
Once you've understood the above examples, the rest of the examples in this section should be easier to comprehend. They are all variations of the logic used above and re-arranged to solve the use case being discussed.
Case 2: Print n records after the matching record. This is similar to the previous case, except that the matching record isn't printed.
# print 2 lines after the matching line
# note that comparison here is $n>0 and not $n>=0
$ ruby -ne 'print if $n && $n>0 && $n-=1; $n=2 if /language/' context.txt
english
hindi
python
kotlin
Case 3: Here's how to print the nth record after the matching record.
# print only the 3rd line found after the matching line
# overlapping cases won't work as $n gets re-assigned before going to 0
$ ruby -ne 'print if $n.to_i>0 && ($n-=1)==0; $n=3 if /language/' context.txt
spanish
ruby
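As the comment above notes, the single counter misses results when matches overlap. One possible workaround, sketched here under the assumption that the records are available as an array, is to remember the index at which each result is due instead of keeping one counter. The helper name nth_after and the sample data are hypothetical.

```ruby
require 'set'

# Hypothetical helper: returns the records found exactly n positions
# after any record matching re, even if the matches overlap.
def nth_after(lines, re, n)
  due = Set.new    # indices of records that should be returned
  result = []
  lines.each_with_index do |line, i|
    result << line if due.include?(i)
    due << (i + n) if line.match?(re)
  end
  result
end

# Invented data: the second match occurs before the first result is due
lines = %w[language english language hindi spanish tamil]
puts nth_after(lines, /language/, 3)
```

For this input the sketch returns both hindi and tamil, whereas the counter-based logic would only reach tamil because the counter is reset by the second match.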
Case 4: Printing the matched record and n records before it.
# this won't work if there are fewer than n records before a match
# (i-n goes negative, and negative indices count from the end of the array)
$ ruby -e 'ip=readlines; n=2; ip.each_with_index { |s, i|
puts ip[i-n..i] if s.match?(/stone/) }' context.txt
toy
flower
sand stone
# this will work even if there are fewer than n records before a match
$ ruby -e 'ip=readlines; n=2; ip.each_with_index { |s, i|
c=i-n; c=0 if c<0;
puts ip[c..i] if s.match?(/toy/) }' context.txt
blue
toy
To prevent confusion with overlapping cases, you can add a separation line between the results.
$ ruby -e 'ip=readlines; n=2; ip.each_with_index { |s, i|
c=i-n; c=0 if c<0;
(print $s; puts ip[c..i]; $s="---\n") if s.match?(/toy|flower/) }
' context.txt
blue
toy
---
blue
toy
flower
---
sand stone
light blue
flower
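The commands in this case load the whole input with readlines. For large inputs, one alternative is to keep only the last n records in a bounded buffer. Here is a minimal sketch of that idea, with a hypothetical helper name and inline sample data:

```ruby
# Hypothetical helper: for every record matching re, returns that record
# preceded by up to n earlier records.
def with_context_before(lines, re, n)
  recent = []    # at most n records seen before the current one
  groups = []
  lines.each do |line|
    groups << (recent + [line]) if line.match?(re)
    recent << line
    recent.shift if recent.size > n
  end
  groups
end

# Invented sample data resembling the start of context.txt
lines = ["blue\n", "toy\n", "flower\n", "sand stone\n"]
print with_context_before(lines, /toy|flower/, 2).map(&:join).join("---\n")
```

Since the buffer simply holds fewer items near the start of the input, no special handling is needed when there are fewer than n records before a match.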
Case 5: Print the nth record before the matching record.
$ ruby -e 'ip=readlines; n=2; ip.each_with_index { |s, i| c=i-n;
puts ip[c] if c>=0 && s.match?(/language/) }' context.txt
sky
spanish
# if the count is small enough, you can save them in variables
# this one prints the 2nd line before the matching line
# $.>2 is needed as first 2 records shouldn't be considered for a match
$ ruby -ne 'print $p2 if $.>2 && /toy|flower/; $p2=$p1; $p1=$_' context.txt
blue
sand stone
You can also use the logic from Case 3 by applying tac twice. This avoids the need to use an array variable.
$ tac context.txt | ruby -ne 'print if $n.to_i>0 && ($n-=1)==0;
$n=2 if /language/' | tac
sky
spanish
Records bounded by distinct markers
This section will cover cases where the input file will always contain the same number of starting and ending patterns, arranged in an alternating fashion. For example, there cannot be two starting patterns appearing without an ending pattern between them and vice versa. Lines of text inside and between such groups are optional.
The sample file shown below will be used to illustrate examples in this section. For simplicity, assume that the starting pattern is marked by start and the ending pattern by end. They have also been given group numbers to make it easier to analyze the output.
$ cat uniform.txt
mango
icecream
--start 1--
1234
6789
**end 1**
how are you
have a nice day
--start 2--
a
b
c
**end 2**
par,far,mar,tar
Case 1: Processing all the groups of records based on the distinct markers, including the records matched by markers themselves. For simplicity, the below command will just print all such records.
$ ruby -ne '$f=true if /start/; print if $f; $f=false if /end/' uniform.txt
--start 1--
1234
6789
**end 1**
--start 2--
a
b
c
**end 2**
You can also use ruby -ne 'print if /start/../end/' here. However, compared to the flip-flop operator, the state machine format is easier to adapt to the cases that follow.
Case 2: Processing all the groups of records but excluding the records matched by markers themselves.
$ ruby -ne '$f=false if /end/; print "* ", $_ if $f;
$f=true if /start/' uniform.txt
* 1234
* 6789
* a
* b
* c
Case 3-4: Processing all the groups of records but excluding one of the markers.
$ ruby -ne '$f=true if /start/; $f=false if /end/; print if $f' uniform.txt
--start 1--
1234
6789
--start 2--
a
b
c
$ ruby -ne 'print if $f; $f=true if /start/; $f=false if /end/' uniform.txt
1234
6789
**end 1**
a
b
c
**end 2**
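The four variations shown so far differ only in whether the marker records themselves are kept. As a rough sketch, they can be folded into one method with two switches. The method name select_blocks and its keyword arguments are hypothetical, and the code relies on the same assumption as this section: start and end markers strictly alternate.

```ruby
# Hypothetical helper: returns the records inside marker-bound blocks.
# keep_start and keep_stop decide whether the marker records are included.
def select_blocks(lines, start_re, stop_re, keep_start: true, keep_stop: true)
  inside = false
  lines.select do |line|
    if !inside && line.match?(start_re)
      inside = true
      keep_start
    elsif inside && line.match?(stop_re)
      inside = false
      keep_stop
    else
      inside
    end
  end
end

# Invented data modeled on uniform.txt
lines = ["mango", "--start 1--", "1234", "**end 1**", "how are you"]
puts select_blocks(lines, /start/, /end/, keep_stop: false)
```

Keeping the flag handling in one place makes the order-of-statements trick used by the one-liners explicit: each marker record is accepted or rejected at the moment the state changes.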
The next four cases are obtained by just using if !$f instead of if $f in the cases shown above.
Case 5: Processing all input records except the groups of records bound by the markers.
# same as: ruby -ne 'print if !(/start/../end/)'
$ ruby -ne '$f=true if /start/; print if !$f; $f=false if /end/' uniform.txt
mango
icecream
how are you
have a nice day
par,far,mar,tar
Case 6: Processing all input records except the groups of records between the markers.
$ ruby -ne '$f=false if /end/; print if !$f; $f=true if /start/' uniform.txt
mango
icecream
--start 1--
**end 1**
how are you
have a nice day
--start 2--
**end 2**
par,far,mar,tar
Case 7-8: Similar to case 6, but include only one of the markers.
$ ruby -ne 'print if !$f; $f=true if /start/; $f=false if /end/' uniform.txt
mango
icecream
--start 1--
how are you
have a nice day
--start 2--
par,far,mar,tar
$ ruby -ne '$f=true if /start/; $f=false if /end/; print if !$f' uniform.txt
mango
icecream
**end 1**
how are you
have a nice day
**end 2**
par,far,mar,tar
Specific blocks
Instead of working with all the groups (or blocks) bound by the markers, this section will discuss how to choose blocks based on additional criteria.
Here's how you can process only the first matching block.
$ ruby -ne '$f=true if /start/; print if $f; exit if /end/' uniform.txt
--start 1--
1234
6789
**end 1**
# use other tricks discussed in the previous section as needed
$ ruby -ne 'exit if /end/; print if $f; $f=true if /start/' uniform.txt
1234
6789
Getting the last block alone involves a lot more work, unless you happen to know how many blocks are present in the input file.
# reverse input linewise, change the order of comparison, reverse again
# difficult to adjust if the record separator is something other than newline
$ tac uniform.txt | ruby -ne '$f=true if /end/; print if $f; exit if /start/' | tac
--start 2--
a
b
c
**end 2**
# or, save the blocks in a buffer and print the last one alone
# << operator concatenates the given string to the variable in-place
$ ruby -ne '($f=true; buf=$_; next) if /start/;
buf << $_ if $f;
$f=false if /end/;
END{print buf}' uniform.txt
--start 2--
a
b
c
**end 2**
Only the nth block.
$ seq 30 | ruby -ne 'BEGIN{n=2; c=0}; c+=1 if /4/; (print; exit if /6/) if c==n'
14
15
16
All blocks after the nth block.
$ seq 30 | ruby -ne 'BEGIN{n=1; c=0}; ($f=true; c+=1) if /4/;
print if $f && c>n; $f=false if /6/'
14
15
16
24
25
26
Excluding the nth block.
$ seq 30 | ruby -ne 'BEGIN{n=2; c=0}; ($f=true; c+=1) if /4/;
print if $f && c!=n; $f=false if /6/'
4
5
6
24
25
26
All blocks, only if the records between the markers match an additional condition.
# additional condition here is a record with entire content as '15'
$ seq 30 | ruby -ne '($f=true; buf=$_; next) if /4/; buf << $_ if $f;
($f=false; print buf if buf.match?(/^15$/)) if /6/;'
14
15
16
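If memory use is not a concern, the selections in this section can also be expressed by first collecting every block into an array and then picking from it. The following is a sketch under that assumption, not the approach used by the one-liners above. The helper name collect_blocks is hypothetical.

```ruby
# Hypothetical helper: groups records into blocks (markers included).
# A block that never reaches its ending marker is discarded.
def collect_blocks(lines, start_re, stop_re)
  blocks = []
  current = nil    # nil means "not inside a block"
  lines.each do |line|
    current = [] if current.nil? && line.match?(start_re)
    next if current.nil?
    current << line
    if line.match?(stop_re)
      blocks << current
      current = nil
    end
  end
  blocks
end

# Same input as: seq 30
blocks = collect_blocks((1..30).map(&:to_s), /4/, /6/)
puts blocks.last                              # last block
puts blocks[1]                                # 2nd block
puts blocks.select { |b| b.include?("15") }   # blocks having a record that is exactly 15
```

Once the blocks are in an array, ordinary Array methods such as last, drop and reject cover the remaining variations.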
Broken blocks
Sometimes, you can have markers in random order and mixed in different ways. In such cases, to work with blocks without any other marker present in between them, the buffer approach comes in handy again.
$ cat broken.txt
qqqqqqqqqqqqqqqq
error 1
hi
error 2
1234
6789
state 1
bye
state 2
error 3
xyz
error 4
abcd
state 3
zzzzzzzzzzzzzzzz
$ ruby -ne '($f=true; buf=$_; next) if /error/;
buf << $_ if $f;
(print buf if $f; $f=false) if /state/' broken.txt
error 2
1234
6789
state 1
error 4
abcd
state 3
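For reference, here is roughly the same buffer logic as a script-style method that returns the blocks instead of printing them. It is only a sketch, and the name clean_blocks is hypothetical. The key point is that the buffer restarts whenever a starting marker is seen, which is what discards a block interrupted by another starting marker.

```ruby
# Hypothetical helper: returns only those blocks that run from a starting
# marker to an ending marker with no other marker in between.
def clean_blocks(lines, start_re, stop_re)
  blocks = []
  buf = nil    # nil means "no block in progress"
  lines.each do |line|
    if line.match?(start_re)
      buf = [line]          # restart, dropping any unfinished block
    elsif buf
      buf << line
      if line.match?(stop_re)
        blocks << buf
        buf = nil           # stray ending markers after this are ignored
      end
    end
  end
  blocks
end

# Invented data: the first error marker is interrupted by a second one
lines = ["error 1", "hi", "error 2", "1234", "state 1", "bye", "state 2"]
puts clean_blocks(lines, /error/, /state/)
```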
Summary
This chapter covered various examples of working with multiple records. State machines play an important role in deriving solutions for such cases. Knowing the various corner cases is also crucial; otherwise, a solution that works for one input may fail for others. The next chapter will discuss use cases where you need to process a file based on the contents of another file.
Exercises
The exercises directory has all the files used in this section.
1) For the input file sample.txt, print lines containing do only if the previous line is empty and the line before that contains you.
##### add your solution here
Just do-it
Much ado about nothing
2) For the input file sample.txt, match lines containing do or not case insensitively. Each of these terms occurs multiple times in the file. The goal is to print only the second occurrence of each term (independent of each other).
# for reference, here are all the matches
$ grep -i 'do' sample.txt
Just do-it
No doubt you like it too
Much ado about nothing
$ grep -i 'not' sample.txt
Not a bit funny
Much ado about nothing
##### add your solution here
No doubt you like it too
Much ado about nothing
3) For the input file sample.txt, print the matching lines containing are or bit as well as n lines around the matching lines. The value for n is passed to the Ruby command as an environment variable.
$ n=1 ##### add your solution here
Good day
How are you
Today is sunny
Not a bit funny
No doubt you like it too
# note that the first and last line are empty for this case
$ n=2 ##### add your solution here
Good day
How are you
Just do-it
Today is sunny
Not a bit funny
No doubt you like it too
4) For the input file broken.txt, print all lines between the markers top and bottom. The first Ruby command shown below doesn't work because it keeps matching till the end of the file when the second marker isn't found. Assume that the input file cannot have two top markers without a bottom marker appearing in between and vice-versa.
$ cat broken.txt
top
3.14
bottom
---
top
1234567890
bottom
top
Hi there
Have a nice day
Good bye
# wrong output
$ ruby -ne '$f=false if /bottom/; print if $f; $f=true if /top/' broken.txt
3.14
1234567890
Hi there
Have a nice day
Good bye
# expected output
##### add your solution here
3.14
1234567890
5) For the input file concat.txt, extract contents from a line starting with %%% until but not including the next such line. The block to be extracted is indicated by the variable n, passed as an environment variable.
$ cat concat.txt
%%% addr.txt
How are you
This game is good
Today %%% is sunny
%%% broken.txt
top %%%
1234567890
bottom
%%% sample.txt
Just %%% do-it
Believe it
%%% mixed_fs.txt
pink blue white yellow
car,mat,ball,basket
$ n=2 ##### add your solution here
%%% broken.txt
top %%%
1234567890
bottom
$ n=4 ##### add your solution here
%%% mixed_fs.txt
pink blue white yellow
car,mat,ball,basket
6) For the input file ruby.md, replace all occurrences of ruby (irrespective of case) with Ruby. But, do not replace any matches between ```ruby and ``` lines (ruby in these markers shouldn't be replaced either). Save the output in out.md.
##### add your solution here, redirect the output to 'out.md'
$ diff -sq out.md expected.md
Files out.md and expected.md are identical
7) Print the last two lines for each of the input files ip.txt, sample.txt and table.txt. Also, add a separator between the results as shown below (note that the separator isn't present at the end of the output). Assume that the input files will have at least two lines.
##### add your solution here
12345
You are funny
---
Much ado about nothing
He he he
---
blue cake mug shirt -7
yellow banana window shoes 3.14
8) For the input file lines.txt, delete the line that comes after a whole line containing ---. Assume that such lines won't occur consecutively.
$ cat lines.txt
Go There
come on
go there
---
2 apples and 5 mangoes
come on!
---
2 Apples
COME ON
##### add your solution here
Go There
come on
go there
---
come on!
---
COME ON
9) For the input file result.csv, use --- to separate entries with the same name in the first column. Assume that the lines with the same first column value will always be next to each other.
$ cat result.csv
Amy,maths,89
Amy,physics,75
Joe,maths,79
John,chemistry,77
John,physics,91
Moe,maths,81
Ravi,physics,84
Ravi,chemistry,70
Yui,maths,92
##### add your solution here
Amy,maths,89
Amy,physics,75
---
Joe,maths,79
---
John,chemistry,77
John,physics,91
---
Moe,maths,81
---
Ravi,physics,84
Ravi,chemistry,70
---
Yui,maths,92
10) The input file multisort.csv has two fields separated by the comma character. Sort this file based on the number of : characters in the second field. Use alphabetic order as the tie-breaker if there are multiple lines with the same number of : characters in the second field.
$ cat multisort.csv
papaya,2
apple,4:5:2
mango,100
dark:chocolate,12:32
cherry,1:2:1:4:2:1
almond,3:14:6:28
banana,23:8
##### add your solution here
mango,100
papaya,2
banana,23:8
dark:chocolate,12:32
apple,4:5:2
almond,3:14:6:28
cherry,1:2:1:4:2:1