Line processing

Now that you are familiar with basic Perl CLI usage, this chapter will dive deeper into line processing examples. You'll learn various ways to match lines based on regular expressions, fixed strings, line numbers, etc. You'll also see how to group multiple statements and learn about the control flow keywords next and exit.

Regexp based filtering

As mentioned before:

  • /REGEXP/FLAGS is a shortcut for $_ =~ m/REGEXP/FLAGS
  • !/REGEXP/FLAGS is a shortcut for $_ !~ m/REGEXP/FLAGS

If required, you can also use different delimiters. Quoting from perldoc: match:

If / is the delimiter then the initial m is optional. With the m you can use any pair of non-whitespace (ASCII) characters as delimiters. This is particularly useful for matching path names that contain /, to avoid LTS (leaning toothpick syndrome). If ? is the delimiter, then a match-only-once rule applies, described in m?PATTERN? below. If ' (single quote) is the delimiter, no variable interpolation is performed on the PATTERN. When using a delimiter character valid in an identifier, whitespace is required after the m. PATTERN may contain variables, which will be interpolated every time the pattern search is evaluated, except for when the delimiter is a single quote.

$ cat paths.txt
/foo/a/report.log
/foo/y/power.log
/foo/abc/errors.log

$ perl -ne 'print if /\/foo\/a\//' paths.txt
/foo/a/report.log

$ perl -ne 'print if m{/foo/a/}' paths.txt
/foo/a/report.log

$ perl -ne 'print if !m#/foo/a/#' paths.txt
/foo/y/power.log
/foo/abc/errors.log

Extracting matched portions

You can use regexp related special variables to extract only the matching portions instead of printing the entire matching line. Consider this input file.

$ cat programming_quotes.txt
Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it by Brian W. Kernighan

Some people, when confronted with a problem, think - I know, I will
use regular expressions. Now they have two problems by Jamie Zawinski

A language that does not affect the way you think about programming,
is not worth knowing by Alan Perlis

There are 2 hard problems in computer science: cache invalidation,
naming things, and off-by-1 errors by Leon Bambrick

Here are some examples of extracting only the matched portion(s).

$ # note that this will print only the first match for each input line
$ perl -nE 'say $& if /\bt\w*[et]\b/' programming_quotes.txt
twice
the
that

$ perl -nE 'say join "::", @{^CAPTURE} if /not (.+)y(.+)/i' programming_quotes.txt
smart enough to debug it b:: Brian W. Kernighan
affect the way ::ou think about programming,
worth knowing b:: Alan Perlis

$ # sometimes capture groups are enough, you don't need special variables
$ # @{^CAPTURE} isn't needed here, as it is assumed that every line has a match
$ perl -nE 'say /^(\w+ ).*?(\d+)$/' table.txt
brown 42
blue 7
yellow 14
$ # or add a custom separator
$ perl -nE 'say join ":", /^(\w+).*?(\d+)$/' table.txt
brown:42
blue:7
yellow:14

Transliteration

The transliteration operator tr (or y) allows you to specify per-character transformation rules. See perldoc: tr for documentation.

$ # rot13
$ echo 'Uryyb Jbeyq' | perl -pe 'tr/a-zA-Z/n-za-mN-ZA-M/'
Hello World

$ # use 'c' option to complement specified characters
$ echo 'foo:123:baz' | perl -pe 'tr/0-9\n/-/c'
----123----

$ # use 'd' option to delete specified characters
$ echo 'foo:123:baz' | perl -pe 'tr/0-9\n//cd'
123

$ # use 's' option to squeeze repeated characters
$ echo 'APPLE gobbledygook' | perl -pe 'tr|A-Za-z||s'
APLE gobledygok
$ # transliterate as well as squeeze
$ echo 'APPLE gobbledygook' | perl -pe 'tr|A-Z|a-z|s'
aple gobbledygook

Similar to the s operator, tr returns the number of changes made. Use the r option to prevent in-place modification and return the transliterated string instead.

$ # match lines containing 'b' 2 times
$ perl -ne 'print if tr/b// == 2' table.txt
brown bread mat hair 42

$ s='orange apple appleseed'
$ echo "$s" | perl -pe 's#\bapple\b(*SKIP)(*F)|\w+#$&=~tr/a-z/A-Z/r#ge'
ORANGE apple APPLESEED

Conditional substitution

These examples combine line filtering and substitution in different ways. As noted before, the s operator modifies the input string and its return value tells you how many substitutions were made. Use the r flag to prevent in-place modification and get the string after substitution (unchanged if nothing matched).

$ # change commas to hyphens if the input line does NOT contain '2'
$ # prints all input lines even if substitution fails
$ printf '1,2,3,4\na,b,c,d\n' | perl -pe 's/,/-/g if !/2/'
1,2,3,4
a-b-c-d

$ # prints filtered input lines, even if substitution fails
$ perl -ne 'print s/by/**/rg if /not/' programming_quotes.txt
** definition, not smart enough to debug it ** Brian W. Kernighan
A language that does not affect the way you think about programming,
is not worth knowing ** Alan Perlis

$ # print only if substitution succeeded
$ perl -ne 'print if s/1/one/g' programming_quotes.txt
naming things, and off-by-one errors by Leon Bambrick

Multiple conditions

It is good to remember that Perl is a programming language. You have control structures and you can combine multiple conditions using logical operators. You don't have to create a single complex regexp.

$ perl -ne 'print if /not/ && !/it/' programming_quotes.txt
A language that does not affect the way you think about programming,
is not worth knowing by Alan Perlis

$ perl -ane 'print if /twice/ || $#F > 11' programming_quotes.txt
Debugging is twice as hard as writing the code in the first place.
Some people, when confronted with a problem, think - I know, I will

$ perl -ne 'print if /s/ xor /m/' table.txt
brown bread mat hair 42
yellow banana window shoes 3.14

next

When next is executed, the rest of the code is skipped and the next input line is fetched for processing. It doesn't affect BEGIN or END blocks as they are outside the file content loop.

$ perl -nE 'if(/\bpar/){print "%% $_"; next}
            say /s/ ? "X" : "Y"' word_anchors.txt
%% sub par
X
Y
X
%% cart part tart mart

Note that {} is used in the above example to group multiple statements to be executed for a single if condition. You'll see many more examples with next in coming chapters.

exit

The exit function is useful to avoid processing unnecessary input content when a termination condition is reached. See perldoc: exit for documentation.

$ # quits after an input line containing 'you' is found
$ perl -ne 'print; exit if /you/' programming_quotes.txt
Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
$ # matching line won't be printed in this case
$ perl -pe 'exit if /you/' programming_quotes.txt
Debugging is twice as hard as writing the code in the first place.

Use tac to get all lines from the last occurrence of the search string to the end of the file: reverse the input, print until the first match and then reverse the result again.

$ tac programming_quotes.txt | perl -ne 'print; exit if /not/' | tac
is not worth knowing by Alan Perlis

There are 2 hard problems in computer science: cache invalidation,
naming things, and off-by-1 errors by Leon Bambrick

You can optionally provide a status code as an argument to the exit function.

$ printf 'sea\neat\ndrop\n' | perl -ne 'print; exit(2) if /at/'
sea
eat
$ echo $?
2

Any code in the END block will still be executed before exiting. This doesn't apply if exit is called from a BEGIN block.

$ perl -pE 'exit if /cake/' table.txt
brown bread mat hair 42

$ perl -pE 'exit if /cake/; END{say "bye"}' table.txt
brown bread mat hair 42
bye

$ perl -pE 'BEGIN{say "hi"; exit; say "hello"} END{say "bye"}' table.txt
hi

Warning: Be careful if you want to use exit with multiple input files, as perl will stop even if there are other files remaining to be processed.

Line number based processing

Line numbers can also be used as a matching criterion via the $. special variable.

$ # print only the 3rd line
$ perl -ne 'print if $. == 3' programming_quotes.txt
by definition, not smart enough to debug it by Brian W. Kernighan

$ # print 2nd and 5th line
$ perl -ne 'print if $. == 2 || $. == 5' programming_quotes.txt
Therefore, if you write the code as cleverly as possible, you are,
Some people, when confronted with a problem, think - I know, I will

$ # transliterate only 2nd line
$ printf 'gates\nnot\nused\n' | perl -pe 'tr/a-z/*/ if $. == 2'
gates
***
used

$ # print from particular line number to the end of input
$ seq 14 25 | perl -ne 'print if $. >= 10'
23
24
25

Use the eof function to check for the end of file condition. See perldoc: eof for documentation.

$ # same as: tail -n1 programming_quotes.txt
$ perl -ne 'print if eof' programming_quotes.txt
naming things, and off-by-1 errors by Leon Bambrick

$ perl -ne 'print "$.:$_" if eof' programming_quotes.txt
12:naming things, and off-by-1 errors by Leon Bambrick

$ # multiple file example
$ # same as: tail -q -n1 programming_quotes.txt table.txt
$ perl -ne 'print if eof' programming_quotes.txt table.txt
naming things, and off-by-1 errors by Leon Bambrick
yellow banana window shoes 3.14

For large input files, use exit to avoid processing unnecessary input lines.

$ seq 3542 4623452 | perl -ne 'if($. == 2452){print; exit}'
5993
$ seq 3542 4623452 | perl -ne 'print if $. == 250; if($. == 2452){print; exit}'
3791
5993

$ # here is a sample time comparison
$ time seq 3542 4623452 | perl -ne 'if($. == 2452){print; exit}' > f1
real    0m0.004s
$ time seq 3542 4623452 | perl -ne 'print if $. == 2452' > f2
real    0m0.740s

Range operator

You can use the range operator to select lines between a pair of matching conditions like line numbers and regexps. See perldoc: range for documentation.

$ # the range is automatically compared against $. in this context
$ # same as: perl -ne 'print if 3 <= $. <= 5' (chained comparison needs Perl 5.32+)
$ seq 14 25 | perl -ne 'print if 3..5'
16
17
18

$ # the range is automatically compared against $_ in this context
$ # note that all the matching ranges are printed
$ perl -ne 'print if /are/ .. /by/' programming_quotes.txt
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it by Brian W. Kernighan
There are 2 hard problems in computer science: cache invalidation,
naming things, and off-by-1 errors by Leon Bambrick

Info: See the Records bounded by distinct markers section for an alternate, flexible solution.

You can also mix line number and regexp conditions.

$ perl -ne 'print if 5 .. /use/' programming_quotes.txt
Some people, when confronted with a problem, think - I know, I will
use regular expressions. Now they have two problems by Jamie Zawinski

$ # same logic as: perl -pe 'exit if /ll/'
$ # inefficient, but this will work for multiple file inputs
$ perl -ne 'print if !(/ll/ .. eof)' programming_quotes.txt table.txt
Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it by Brian W. Kernighan

brown bread mat hair 42
blue cake mug shirt -7

Warning: Both conditions can match the same line too! Also, if the second condition doesn't match, lines starting from first condition to the last line of the input will be matched.

$ # 'worth' matches the 9th line
$ perl -ne 'print if 9 .. /worth/' programming_quotes.txt
is not worth knowing by Alan Perlis

$ # there's a line containing 'affect' but no line matches the second condition
$ # so, all lines till the end of input are printed
$ perl -ne 'print if /affect/ .. /XYZ/' programming_quotes.txt
A language that does not affect the way you think about programming,
is not worth knowing by Alan Perlis

There are 2 hard problems in computer science: cache invalidation,
naming things, and off-by-1 errors by Leon Bambrick

Working with fixed strings

You can surround a regexp pattern with \Q and \E to match it as a fixed string, similar to the grep -F option. \E can be left out if there's no further pattern to be specified. Variables are still interpolated, so if your fixed string contains $ or @ forming possible variables, you'll run into issues. For such cases, you can pass the string as an environment variable and then apply \Q to that variable. See perldoc: quotemeta for documentation.

$ # no match, since [] are character class metacharacters
$ echo 'int a[5]' | perl -ne 'print if /a[5]/'

$ perl -E 'say "\Qa[5]"'
a\[5\]
$ echo 'int a[5]' | perl -ne 'print if /\Qa[5]/'
int a[5]
$ echo 'int a[5]' | perl -pe 's/\Qa[5]/b[12]/'
int b[12]

$ # $y and $z will be treated as variables here (default value empty string)
$ echo '$x = $y + $z' | perl -pe 's/\Q$y + $z/100/'
$x = $y100$z
$ echo '$x = $y + $z' | fs='$y + $z' perl -pe 's/\Q$ENV{fs}/100/'
$x = 100
$ # ENV is preferred since \\ is special in single quoted strings
$ perl -E '$x = q(x\y\\0z); say $x'
x\y\0z
$ x='x\y\\0z' perl -E 'say $ENV{x}'
x\y\\0z

If you just want to filter a line based on a fixed string, you can also use the index function. This returns the position of the first match (counting from 0) and -1 if the given string wasn't found. See perldoc: index for documentation.

$ echo 'int a[5]' | perl -ne 'print if index($_, "a[5]") != -1'
int a[5]

The above index example uses double quotes for the string argument, which allows escape sequences like \t, \n, etc and interpolation. This isn't the case with single quoted string values. Using single quotes within the script from the command line requires messing with shell metacharacters. So, use the q operator instead or pass the fixed string to be matched as an environment variable.

$ # double quotes allow escape sequences and interpolation
$ perl -E '$x=5; say "value of x:\t$x"'
value of x:     5

$ # use 'q' operator as an alternate to specify single quoted string
$ s='$a = 2 * ($b + $c)'
$ echo "$s" | perl -ne 'print if index($_, q/($b + $c)/) != -1'
$a = 2 * ($b + $c)

$ # or pass the string as environment variable
$ echo "$s" | fs='($b + $c)' perl -ne 'print if index($_, $ENV{fs}) != -1'
$a = 2 * ($b + $c)

You can use the return value of the index function to restrict the matching to the start or end of the input line. The line content in the $_ variable contains the \n line ending character as well. You can either use the chomp function explicitly or use the -l command line option, which will be discussed in detail in the Record separators chapter. For now, it is enough to know that -l will remove the line ending from $_ and add it back when print is used.

$ cat eqns.txt
a=b,a-b=c,c*d
a+b,pi=3.14,5e12
i*(t+9-g)/8,4-a+b

$ # start of line
$ s='a+b' perl -ne 'print if index($_, $ENV{s})==0' eqns.txt
a+b,pi=3.14,5e12

$ # end of line
$ # same as: s='a+b' perl -ne 'print if /\Q$ENV{s}\E$/' eqns.txt
$ # length function returns number of characters, by default acts on $_
$ # -l option is needed here to remove \n from $_
$ s='a+b' perl -lne '$pos = length() - length($ENV{s});
                     print if index($_, $ENV{s}) == $pos' eqns.txt
i*(t+9-g)/8,4-a+b

Here are some more examples using the return value of the index function.

$ # since 'index' returns '-1' if there's no match,
$ # you need to add >=0 check as well for < or <= comparison
$ perl -ne '$i = index($_, "="); print if 0 <= $i <= 5' eqns.txt
a=b,a-b=c,c*d

$ # > or >= comparison is easy to specify
$ # if you use 3rd argument to 'index', you'll still have to check != -1
$ s='a+b' perl -ne 'print if index($_, $ENV{s})>=1' eqns.txt
i*(t+9-g)/8,4-a+b

If you need to match the entire input line or field, you can use string comparison operators.

$ printf 'a.b\na+b\n' | perl -lne 'print if /^a.b$/'
a.b
a+b
$ printf 'a.b\na+b\n' | perl -lne 'print if $_ eq q/a.b/'
a.b
$ printf '1 a.b\n2 a+b\n' | perl -lane 'print if $F[1] ne q/a.b/'
2 a+b

To provide a fixed string in the replacement section, an environment variable comes in handy again. Or use the q operator to provide the value directly, but you may have to work around the delimiters being used and the presence of \\ characters.

$ # characters like $ and @ are special in replacement section
$ echo 'x+y' | perl -pe 's/\Qx+y/$x+@y/'
+

$ # provide replacement string as environment variable
$ echo 'x+y' | r='$x+@y' perl -pe 's/\Qx+y/$ENV{r}/'
$x+@y

$ # or, use 'e' flag to provide single quoted value as Perl code
$ echo 'x+y' | perl -pe 's/\Qx+y/q($x+@y)/e'
$x+@y

$ # need to work around delimiters and \\ with 'q' operator based solution
$ echo 'x+y' | perl -pe 's/\Qx+y/q($x\/@y)/e'
$x/@y
$ echo 'x+y' | perl -pe 's|\Qx+y|q($x/@y)|e'
$x/@y
$ echo 'x+y' | perl -pe 's|\Qx+y|q($x/@y\\\z)|e'
$x/@y\\z

Summary

This chapter showed various examples of processing only the lines of interest instead of the entire input file. Filtering can be specified using a regexp, fixed string, line number or a combination of them. next and exit are useful to change the flow of code.

Exercises

a) Remove only the third line of given input.

$ seq 34 37 | ##### add your solution here
34
35
37

b) Display only fourth, fifth, sixth and seventh lines for the given input.

$ seq 65 78 | ##### add your solution here
68
69
70
71

c) For the input file ip.txt, replace all occurrences of are with are not and is with is not only from line number 4 till end of file. Also, only the lines that were changed should be displayed in the output.

$ cat ip.txt
Hello World
How are you
This game is good
Today is sunny
12345
You are funny

##### add your solution here
Today is not sunny
You are not funny

d) For the given stdin, display only the first three lines. Avoid processing lines that are not relevant.

$ seq 14 25 | ##### add your solution here
14
15
16

e) For the input file ip.txt, display all lines from start of the file till the first occurrence of game.

##### add your solution here
Hello World
How are you
This game is good

f) For the input file ip.txt, display all lines that contain is but not good.

##### add your solution here
Today is sunny

g) For the input file ip.txt, extract the word before the whole word is as well as the word after it. If such a match is found, display the two words around is in reversed order. For example, hi;1 is--234 bye should be converted to 234:1. Assume that whole word is will not be present more than once in a single line.

##### add your solution here
good:game
sunny:Today

h) For the given input string, replace 0xA0 with 0x7F and 0xC0 with 0x1F.

$ s='start address: 0xA0, func1 address: 0xC0'

$ echo "$s" | ##### add your solution here
start address: 0x7F, func1 address: 0x1F

i) Find the starting index of first occurrence of is or the or was or to for each input line of the file idx.txt. Assume all input lines will match at least one of these terms.

$ cat idx.txt
match after the last newline character
and then you want to test
this is good bye then
you were there to see?

##### add your solution here
12
4
2
9

j) Display all lines containing [4]* for the given stdin data.

$ printf '2.3/[4]*6\n2[4]5\n5.3-[4]*9\n' | ##### add your solution here
2.3/[4]*6
5.3-[4]*9

k) For the given input string, replace all lowercase letters with x only for words starting with m.

$ s='ma2T3a a2p kite e2e3m meet'
$ echo "$s" | ##### add your solution here
xx2T3x a2p kite e2e3m xxxx

l) For the input file ip.txt, delete all characters other than lowercase vowels and newline character. Perform this transformation only between a line containing you up to line number 4 (inclusive).

##### add your solution here
Hello World
oaeou
iaeioo
oaiu
12345
You are funny