Line processing
Now that you are familiar with basic Perl CLI usage, this chapter will dive deeper into line processing examples. You'll learn various ways for matching lines based on regular expressions, fixed string matching, line numbers, etc. You'll also see how to group multiple statements and learn about the control flow keywords next and exit.
The example_files directory has all the files used in the examples.
Regexp based filtering
As mentioned before:
- /REGEXP/FLAGS is a shortcut for $_ =~ m/REGEXP/FLAGS
- !/REGEXP/FLAGS is a shortcut for $_ !~ m/REGEXP/FLAGS
Here are some examples:
$ cat table.txt
brown bread mat hair 42
blue cake mug shirt -7
yellow banana window shoes 3.14
$ perl -ne 'print if /ow\b/' table.txt
yellow banana window shoes 3.14
$ perl -ne 'print if !/[ksy]/' table.txt
brown bread mat hair 42
If required, you can also use different delimiters. Quoting from perldoc: match:
If / is the delimiter then the initial m is optional. With the m you can use any pair of non-whitespace (ASCII) characters as delimiters. This is particularly useful for matching path names that contain /, to avoid LTS (leaning toothpick syndrome). If ? is the delimiter, then a match-only-once rule applies, described in m?PATTERN? below. If ' (single quote) is the delimiter, no variable interpolation is performed on the PATTERN. When using a delimiter character valid in an identifier, whitespace is required after the m. PATTERN may contain variables, which will be interpolated every time the pattern search is evaluated, except for when the delimiter is a single quote.
$ cat paths.txt
/home/joe/report.log
/home/ram/power.log
/home/rambo/errors.log
# leaning toothpick syndrome
$ perl -ne 'print if /\/home\/ram\//' paths.txt
/home/ram/power.log
# using a different delimiter makes it more readable here
$ perl -ne 'print if m{/home/ram/}' paths.txt
/home/ram/power.log
$ perl -ne 'print if !m#/home/ram/#' paths.txt
/home/joe/report.log
/home/rambo/errors.log
Extracting matched portions
You can use regexp related special variables to extract only the matching portions. Consider this input file:
$ cat ip.txt
it is a warm and cozy day
listen to what I say
go play in the park
come back before the sky turns dark

There are so many delights to cherish
Apple, Banana and Cherry
Bread, Butter and Jelly
Try them all before you perish
Here are some examples of extracting only the matched portions.
# note that this will print only the first match for each input line
$ perl -nE 'say $& if /\b[a-z]\w*[ty]\b/' ip.txt
it
what
play
sky
many
$ perl -nE 'say join "::", @{^CAPTURE} if /(\b[bdp]\w+).*((?1))/i' ip.txt
play::park
back::dark
Bread::Butter
before::perish
Special variables to work with capture groups aren't always needed. For example, when every line has a match.
$ perl -nE 'say /^(\w+ ).*?(\d+)$/' table.txt
brown 42
blue 7
yellow 14
# with a custom separator
$ perl -nE 'say join ":", /^(\w+).*?(\d+)$/' table.txt
brown:42
blue:7
yellow:14
Transliteration
The transliteration operator tr (or y) helps you perform transformations character-wise. See perldoc: tr for documentation.
# rot13
$ echo 'Uryyb Jbeyq' | perl -pe 'tr/a-zA-Z/n-za-mN-ZA-M/'
Hello World
# 'c' option complements the specified characters
$ echo 'apple:123:banana' | perl -pe 'tr/0-9\n/-/c'
------123-------
# 'd' option deletes the characters
$ echo 'apple:123:banana' | perl -pe 'tr/0-9\n//cd'
123
# 's' option squeezes repeated characters
$ echo 'APPLE gobbledygook' | perl -pe 'tr|A-Za-z||s'
APLE gobledygok
# transliteration as well as squeeze
$ echo 'APPLE gobbledygook' | perl -pe 'tr|A-Z|a-z|s'
aple gobbledygook
Similar to the s operator, tr returns the number of changes made. Use the r option to prevent in-place modification and return the transliterated string instead.
# match lines containing 'b' 2 times
$ perl -ne 'print if tr/b// == 2' table.txt
brown bread mat hair 42
$ s='orange apple appleseed'
$ echo "$s" | perl -pe 's#\bapple\b(*SKIP)(*F)|\w+#$&=~tr/a-z/A-Z/r#ge'
ORANGE apple APPLESEED
See also:
- stackoverflow: reverse complement DNA sequence for a specific field
- unix.stackexchange: count the number of characters except specific characters
- unix.stackexchange: scoring DNA data
Conditional substitution
These examples combine line filtering and substitution in different ways. As noted before, the s operator modifies the input string and its return value tells you how many substitutions were made. Use the r flag to prevent in-place modification and get the string output after substitution.
# change commas to hyphens if the input line does NOT contain '2'
# prints all input lines even if the substitution fails
$ printf '1,2,3,4\na,b,c,d\n' | perl -pe 's/,/-/g if !/2/'
1,2,3,4
a-b-c-d
# perform substitution only for the filtered lines
# prints filtered input lines, even if the substitution fails
$ perl -ne 'print s/ark/[$&]/rg if /the/' ip.txt
go play in the p[ark]
come back before the sky turns d[ark]
Try them all before you perish
# print only if the substitution succeeds
$ perl -ne 'print if s/\bw\w*t\b/{$&}/g' ip.txt
listen to {what} I say
Multiple conditions
It is good to remember that Perl is a programming language. You can make use of control structures and combine multiple conditions using logical operators. You don't have to create a single complex regexp.
$ perl -ne 'print if /ark/ && !/sky/' ip.txt
go play in the park
$ perl -ane 'print if /\bthe\b/ || $#F == 5' ip.txt
go play in the park
come back before the sky turns dark
Try them all before you perish
$ perl -ne 'print if /s/ xor /m/' table.txt
brown bread mat hair 42
yellow banana window shoes 3.14
next
When the next statement is executed, the rest of the code is skipped and the next input line is fetched for processing. It doesn't affect the BEGIN and END blocks as they are outside the file content loop.
$ perl -nE 'if(/\bpar/){print "%% $_"; next} say /s/ ? "X" : "Y"' anchors.txt
%% sub par
X
Y
X
%% cart part tart mart
Note that {} is used in the above example to group multiple statements to be executed for a single if condition. You'll see many more examples with next in the coming chapters.
exit
The exit function is useful to avoid processing unnecessary input content when a termination condition is reached. See perldoc: exit for documentation.
# quits after an input line containing 'say' is found
$ perl -ne 'print; exit if /say/' ip.txt
it is a warm and cozy day
listen to what I say
# the matching line won't be printed in this case
$ perl -pe 'exit if /say/' ip.txt
it is a warm and cozy day
Use tac to get all lines starting from the last occurrence of the search string in the entire file.
$ tac ip.txt | perl -ne 'print; exit if /an/' | tac
Bread, Butter and Jelly
Try them all before you perish
You can optionally provide a status code as an argument to the exit function.
$ printf 'sea\neat\ndrop\n' | perl -ne 'print; exit(2) if /at/'
sea
eat
$ echo $?
2
Any code in the END block will still be executed before exiting. This doesn't apply if exit was called from the BEGIN block.
$ perl -pE 'exit if /cake/' table.txt
brown bread mat hair 42
$ perl -pE 'exit if /cake/; END{say "bye"}' table.txt
brown bread mat hair 42
bye
$ perl -pE 'BEGIN{say "hi"; exit; say "hello"} END{say "bye"}' table.txt
hi
Be careful if you want to use exit with multiple input files, as Perl will stop even if there are other files remaining to be processed.
Line number based processing
Line numbers can also be specified as a matching criterion by using the $. special variable.
# print only the third line
$ perl -ne 'print if $. == 3' ip.txt
go play in the park
# print the second and sixth lines
$ perl -ne 'print if $. == 2 || $. == 6' ip.txt
listen to what I say
There are so many delights to cherish
# transliterate only the second line
$ printf 'gates\nnot\nused\n' | perl -pe 'tr/a-z/*/ if $. == 2'
gates
***
used
# print from a particular line number to the end of the input
$ seq 14 25 | perl -ne 'print if $. >= 10'
23
24
25
Use the eof function to check for the end of the file condition. See perldoc: eof for documentation.
# same as: tail -n1 ip.txt
$ perl -ne 'print if eof' ip.txt
Try them all before you perish
$ perl -ne 'print "$.:$_" if eof' ip.txt
9:Try them all before you perish
# multiple file example
# same as: tail -q -n1 ip.txt table.txt
$ perl -ne 'print if eof' ip.txt table.txt
Try them all before you perish
yellow banana window shoes 3.14
For large input files, you can use exit to avoid processing unnecessary input lines.
$ seq 3542 4623452 | perl -ne 'if($. == 2452){print; exit}'
5993
$ seq 3542 4623452 | perl -ne 'print if $. == 250; if($. == 2452){print; exit}'
3791
5993
# here is a sample time comparison
$ time seq 3542 4623452 | perl -ne 'if($. == 2452){print; exit}' > f1
real 0m0.005s
$ time seq 3542 4623452 | perl -ne 'print if $. == 2452' > f2
real 0m0.496s
$ rm f1 f2
Range operator
You can use the range operator to select between a pair of matching conditions like line numbers and regexp. See perldoc: range for documentation.
# the range is automatically compared against $. in this context
# same as: perl -ne 'print if 3 <= $. <= 5' (chained comparison needs Perl 5.32+)
$ seq 14 25 | perl -ne 'print if 3..5'
16
17
18
# the range is automatically compared against $_ in this context
# note that all the matching ranges are printed
$ perl -ne 'print if /to/ .. /pl/' ip.txt
listen to what I say
go play in the park
There are so many delights to cherish
Apple, Banana and Cherry
See the Records bounded by distinct markers section for an alternate solution.
Line numbers and regexp filtering can be mixed.
$ perl -ne 'print if 6 .. /utter/' ip.txt
There are so many delights to cherish
Apple, Banana and Cherry
Bread, Butter and Jelly
# same logic as: perl -pe 'exit if /\bba/'
# inefficient, but this will work for multiple file inputs
$ perl -ne 'print if !(/\bba/ .. eof)' ip.txt table.txt
it is a warm and cozy day
listen to what I say
go play in the park
brown bread mat hair 42
blue cake mug shirt -7
Both conditions can match the same line too! Use ... if you don't want the second condition to be matched against the starting line. Also, if the second condition doesn't match, lines starting from the first condition to the last line of the input will be matched.
# 'and' matches the 7th line
$ perl -ne 'print if 7 .. /and/' ip.txt
Apple, Banana and Cherry
# 'and' will be tested against 8th line onwards
$ perl -ne 'print if 7 ... /and/' ip.txt
Apple, Banana and Cherry
Bread, Butter and Jelly
# there's a line containing 'Banana' but the matching pair isn't found
# so, all lines till the end of the input are printed
$ perl -ne 'print if /Banana/ .. /XYZ/' ip.txt
Apple, Banana and Cherry
Bread, Butter and Jelly
Try them all before you perish
Working with fixed strings
You can surround a regexp pattern with \Q and \E to match it as a fixed string, similar to the grep -F option. \E can be left out if there's no further pattern to be specified. Variables are still interpolated, so if your fixed string contains $ or @ forming possible variables, you'll run into issues. For such cases, one workaround is to pass the search string as an environment variable and then apply \Q to that variable. See perldoc: quotemeta for documentation.
# no match, since [] are character class metacharacters
$ printf 'int a[5]\nfig\n1+4=5\n' | perl -ne 'print if /a[5]/'
$ perl -E 'say "\Qa[5]"'
a\[5\]
$ printf 'int a[5]\nfig\n1+4=5\n' | perl -ne 'print if /\Qa[5]/'
int a[5]
$ printf 'int a[5]\nfig\n1+4=5\n' | perl -pe 's/\Qa[5]/b[12]/'
int b[12]
fig
1+4=5
# $y and $z will be treated as uninitialized variables here
$ echo '$x = $y + $z' | perl -pe 's/\Q$y + $z/100/'
$x = $y100$z
$ echo '$x = $y + $z' | fs='$y + $z' perl -pe 's/\Q$ENV{fs}/100/'
$x = 100
# ENV is preferred since \\ is special in single quoted strings
$ perl -E '$x = q(x\y\\0z); say $x'
x\y\0z
$ x='x\y\\0z' perl -E 'say $ENV{x}'
x\y\\0z
If you just want to filter a line based on fixed strings, you can also use the index function. This returns the matching position (which starts with 0) and -1 if the given string wasn't found. See perldoc: index for documentation.
$ printf 'int a[5]\nfig\n1+4=5\n' | perl -ne 'print if index($_, "a[5]") != -1'
int a[5]
The above index example uses double quotes for the string argument, which allows escape sequences like \t, \n, etc., and interpolation. This isn't the case with single quoted string values. Using single quotes within the script from the command line requires messing with shell metacharacters. So, use the q operator instead or pass the fixed string to be matched as an environment variable.
# double quotes allow escape sequences and interpolation
$ perl -E '$x=5; say "value of x:\t$x"'
value of x: 5
# use the 'q' operator as an alternate for single quoted strings
$ s='$a = 2 * ($b + $c)'
$ echo "$s" | perl -ne 'print if index($_, q/($b + $c)/) != -1'
$a = 2 * ($b + $c)
# or pass the string as an environment variable
$ echo "$s" | fs='($b + $c)' perl -ne 'print if index($_, $ENV{fs}) != -1'
$a = 2 * ($b + $c)
You can use the return value of the index function to restrict the matching to the start or end of the input line. The line content in the $_ variable contains the \n line ending character as well. You can remove the line separator using the chomp function or the -l command line option (which will be discussed in detail in the Record separators chapter). For now, it is enough to know that -l will remove the line separator and add it back when print is used.
$ cat eqns.txt
a=b,a-b=c,c*d
a+b,pi=3.14,5e12
i*(t+9-g)/8,4-a+b
# start of the line
$ s='a+b' perl -ne 'print if index($_, $ENV{s})==0' eqns.txt
a+b,pi=3.14,5e12
# end of the line
# same as: s='a+b' perl -ne 'print if /\Q$ENV{s}\E$/' eqns.txt
# length function returns the number of characters, by default acts on $_
# -l option is needed here to remove \n from $_
$ s='a+b' perl -lne '$pos = length() - length($ENV{s});
print if index($_, $ENV{s}) == $pos' eqns.txt
i*(t+9-g)/8,4-a+b
Here are some more examples using the return value of the index function.
# since 'index' returns '-1' if there's no match,
# you need to add >=0 check as well for < or <= comparison
$ perl -ne '$i = index($_, "="); print if 0 <= $i <= 5' eqns.txt
a=b,a-b=c,c*d
# > or >= comparison is easy to specify
# if you pass the third argument to 'index', you'll still have to check != -1
$ s='a+b' perl -ne 'print if index($_, $ENV{s})>=1' eqns.txt
i*(t+9-g)/8,4-a+b
If you need to match the entire input line or a particular field, you can use the string comparison operators.
$ printf 'a.b\na+b\n' | perl -lne 'print if /^a.b$/'
a.b
a+b
$ printf 'a.b\na+b\n' | perl -lne 'print if $_ eq q/a.b/'
a.b
$ printf '1 a.b\n2 a+b\n' | perl -lane 'print if $F[1] ne q/a.b/'
2 a+b
To provide a fixed string in the replacement section, environment variables come in handy again. Or, use the q operator for directly providing the value, but you may have to work around the delimiters being used and the presence of \\ characters.
# characters like $ and @ are special in the replacement section
$ echo 'x+y' | perl -pe 's/\Qx+y/$x+@y/'
+
# provide replacement string as an environment variable
$ echo 'x+y' | r='$x+@y' perl -pe 's/\Qx+y/$ENV{r}/'
$x+@y
# or, use the 'e' flag to provide a single quoted value as Perl code
$ echo 'x+y' | perl -pe 's/\Qx+y/q($x+@y)/e'
$x+@y
# need to workaround delimiters and \\ for the 'q' operator based solution
$ echo 'x+y' | perl -pe 's/\Qx+y/q($x\/@y)/e'
$x/@y
$ echo 'x+y' | perl -pe 's|\Qx+y|q($x/@y)|e'
$x/@y
$ echo 'x+y' | perl -pe 's|\Qx+y|q($x/@y\\\z)|e'
$x/@y\\z
Summary
This chapter showed various examples of processing only the lines of interest instead of the entire input file. Filtering can be specified using a regexp, fixed string, line number or a combination of them. The next and exit statements are useful to change the flow of code.
Exercises
The exercises directory has all the files used in this section.
1) For the given input, display all the lines except the third.
$ seq 34 37 | ##### add your solution here
34
35
37
2) Display only the fourth, fifth, sixth and seventh lines for the given input.
$ seq 65 78 | ##### add your solution here
68
69
70
71
3) For the input file ip.txt, replace all occurrences of are with are not and is with is not only from line number 4 till the end of file. Also, only the lines that were changed should be displayed in the output.
$ cat ip.txt
Hello World
How are you
This game is good
Today is sunny
12345
You are funny
##### add your solution here
Today is not sunny
You are not funny
4) For the given stdin, display only the first three lines. Avoid processing lines that are not relevant.
$ seq 14 25 | ##### add your solution here
14
15
16
5) For the input file ip.txt, display all lines from the start of the file till the first occurrence of game.
##### add your solution here
Hello World
How are you
This game is good
6) For the input file ip.txt, display all lines that contain is but not good.
##### add your solution here
Today is sunny
7) For the input file ip.txt, extract the word before the whole word is as well as the word after it. If such a match is found, display the two words around is in reversed order. For example, hi;1 is--234 bye should be converted to 234:1. Assume that the whole word is will not be present more than once in a single line.
##### add your solution here
good:game
sunny:Today
8) For the input file hex.txt, replace all occurrences of 0xA0 with 0x50 and 0xFF with 0x7F.
$ cat hex.txt
start: 0xA0, func1: 0xA0
end: 0xFF, func2: 0xB0
restart: 0xA010, func3: 0x7F
##### add your solution here
start: 0x50, func1: 0x50
end: 0x7F, func2: 0xB0
restart: 0x5010, func3: 0x7F
9) Find the starting index of the first occurrence of is or the or was or to for each input line of the file idx.txt. Assume that every input line will match at least one of these terms.
$ cat idx.txt
match after the last newline character
and then you want to test
this is good bye then
you were there to see?
##### add your solution here
12
4
2
9
10) Display all lines containing [4]* for the given stdin data.
$ printf '2.3/[4]*6\n2[4]5\n5.3-[4]*9\n' | ##### add your solution here
2.3/[4]*6
5.3-[4]*9
11) For the given input string, replace all lowercase alphabets with x only for words starting with m.
$ s='ma2T3a a2p kite e2e3m meet'
$ echo "$s" | ##### add your solution here
xx2T3x a2p kite e2e3m xxxx
12) For the input file ip.txt, delete all characters other than lowercase vowels and the newline character. Perform this transformation only between a line containing you up to line number 4 (inclusive).
##### add your solution here
Hello World
oaeou
iaeioo
oaiu
12345
You are funny
13) For the input file sample.txt, display from the start of the file till the first occurrence of are, excluding the matching line.
$ cat sample.txt
Hello World
Good day
How are you
Just do-it
Believe it
Today is sunny
Not a bit funny
No doubt you like it too
Much ado about nothing
He he he
##### add your solution here
Hello World
Good day
14) For the input file sample.txt, display from the last occurrence of do till the end of the file.
##### add your solution here
Much ado about nothing
He he he
15) For the input file sample.txt, display from the 9th line till a line containing you.
##### add your solution here
Today is sunny
Not a bit funny
No doubt you like it too
16) Display only the odd numbered lines from ip.txt.
##### add your solution here
Hello World
This game is good
12345
17) For the table.txt file, print only the line number for lines containing air or win.
$ cat table.txt
brown bread mat hair 42
blue cake mug shirt -7
yellow banana window shoes 3.14
##### add your solution here
1
3
18) For the input file table.txt, calculate the sum of numbers in the last column, excluding the second line.
##### add your solution here
45.14
19) Print the second and fourth line for every block of five lines.
$ seq 15 | ##### add your solution here
2
4
7
9
12
14
20) For the input file ip.txt, display all lines containing e or u but not both.
##### add your solution here
Hello World
This game is good
Today is sunny