Field separators

This chapter will dive deep into field processing. You'll learn how to set input and output field separators, how to use regexps for defining fields and how to work with fixed length fields.

Default field separation

By default, the -a option splits based on one or more sequence of whitespace characters. In addition, whitespaces at the start or end of input gets trimmed and won't be part of field contents. Using -a is equivalent to @F = split. From perldoc: split:

split emulates the default behavior of the command line tool awk when the PATTERN is either omitted or a string composed of a single space character (such as ' ' or "\x20", but not e.g. / /). In this case, any leading whitespace in EXPR is removed before splitting occurs, and the PATTERN is instead treated as if it were /\s+/; in particular, this means that any contiguous whitespace (not just a single space character) is used as a separator. However, this special treatment can be avoided by specifying the pattern / / instead of the string " ", thereby allowing only a single space character to be a separator.

$ # $#F gives index of last element, i.e. size of array - 1
$ echo '   a   b   c   ' | perl -anE 'say $#F'
2
$ # note that leading whitespaces isn't part of field content
$ echo '   a   b   c   ' | perl -anE 'say $F[0]'
a
$ # note that trailing whitespaces isn't part of field content
$ echo '   a   b   c   ' | perl -anE 'say "$F[-1]."'
c.

$ # here's another example with more whitespace characters thrown in
$ # in scalar context, @F will return size of the array
$ printf '     one \t\f\v two\t\r\tthree \t\r ' | perl -anE 'say scalar @F'
3
$ printf '     one \t\f\v two\t\r\tthree \t\r ' | perl -anE 'say "$F[1]."'
two.

Input field separator

You can use the -F command line option to specify a custom regexp field separator. Note that -a option implicitly sets -n and -F option implicitly sets -n and -a on newer versions of Perl. However, this book will always explicitly use these options.

$ # use ':' as input field separator
$ echo 'goal:amazing:whistle:kwality' | perl -F: -anE 'say "$F[0]\n$F[2]"'
goal
whistle

$ # use quotes to avoid clashes with shell special characters
$ echo 'one;two;three;four' | perl -F';' -anE 'say $F[2]'
three

$ echo 'load;err_msg--\ant,r2..not' | perl -F'\W+' -anE 'say $F[2]'
ant

$ echo 'hi.bye.hello' | perl -F'\.' -anE 'say $F[1]'
bye

You can also specify the regexp to -F option inside // delimiters as well as add LIMIT argument if needed.

$ # count number of vowels for each input line
$ # can also use: -F'(?i)[aeiou]'
$ printf 'COOL\nnice car\n' | perl -F'/[aeiou]/i' -anE 'say $#F'
2
3

$ # note that newline character is present as part of the last field content
$ echo 'goal:amazing:whistle:kwality' | perl -F'/:/,$_,2' -ane 'print $F[1]'
amazing:whistle:kwality

To get individual characters, you can use empty argument for the -F option.

$ echo 'apple' | perl -F -anE 'say $F[0]'
a
$ # -CS option will turn on UTF-8 for stdin/stdout/stderr streams
$ echo 'fox:αλεπού' | perl -CS -F -anE 'say @F[4..6]'
αλε

For more information about using perl with different encodings, see:

warning If the custom field separator with -F option doesn't affect the newline character, then the last element can contain the newline character.

$ # last element will not have newline character with default -a
$ # as leading/trailing whitespaces are trimmed with default split
$ echo 'cat dog' | perl -anE 'say "[$F[-1]]"'
[dog]

$ # last element will have newline character since field separator is ':'
$ echo 'cat:dog' | perl -F: -anE 'say "[$F[-1]]"'
[dog
]
$ # unless the input itself doesn't have newline character
$ printf 'cat:dog' | perl -F: -anE 'say "[$F[-1]]"'
[dog]

The newline character can also show up as the entire content of the last field.

$ # both leading and trailing whitespaces are trimmed
$ echo '  a b   c   ' | perl -anE 'say $#F'
2
$ # leading empty element won't be removed here
$ # and last element will have only newline character as its value
$ echo ':a:b:c:' | perl -F: -anE 'say $#F; say "[$F[-1]]"'
4
[
]

As mentioned before, the -l option is helpful if you wish to remove the newline character (more details will be discussed in Record separators chapter). A side effect of removing the newline character before applying split is that a trailing empty field will also get removed (you can explicitly call split function with -1 as limit to prevent this).

$ # -l will remove the newline character
$ # -l will also cause 'print' to append the newline character
$ echo 'cat:dog' | perl -F: -lane 'print "[$F[-1]]"'
[dog]

$ # since newline character is chomped, last element is empty
$ # which is then removed due to default 'split' behavior
$ echo ':a:b:c:' | perl -F: -lane 'print scalar @F'
4
$ # explicit call to split with -1 as limit will preserve the empty element
$ echo ':a:b:c:' | perl -lne 'print scalar split/:/,$_,-1'
5

warning As per perldoc: -F option, "You can't use literal whitespace or NUL characters in the pattern." Here's some examples.

$ # only one element, field separator didn't match at all!!
$ echo 'pick eat rest laugh' | perl -F'/t /' -lane 'print $F[0]'
pick eat rest laugh
$ # number of splits is correct
$ # but the space character shouldn't be part of field here
$ echo 'pick eat rest laugh' | perl -F't ' -lane 'print $F[1]'
 res
$ # this gives the expected behavior
$ echo 'pick eat rest laugh' | perl -F't\x20' -lane 'print $F[1]'
res

$ # Error!!
$ echo 'pick eat rest laugh' | perl -F't[ ]' -lane 'print $F[1]'
Unmatched [ in regex; marked by <-- HERE in m/t[ <-- HERE /.
$ # no issues if 'split' is used explicitly
$ echo 'pick eat rest laugh' | perl -lne 'print((split /t[ ]/)[1])'
res

$ # example with NUL specified literally and as an escape sequence
$ printf 'a\0b\0c' | perl -F$'\0' -anE 'say join ",", @F' | cat -v
a,^@,b,^@,c
$ printf 'a\0b\0c' | perl -F'\0' -anE 'say join ",", @F' | cat -v
a,b,c

Output field separator

There are a few ways to affect the separator to be used while displaying multiple values.

Method 1: The value of $, special variable is used as the separator when multiple arguments (or list/array) are passed to print and say functions. $, could be remembered easily by noting that , is used to separate multiple arguments. Note that -l option is used in the examples below as a good practice even when not needed.

info See perldoc: perlvar for alternate names of special variables if you use metacpan: English module. For example, $OFS or $OUTPUT_FIELD_SEPARATOR instead of $,

$ perl -lane 'BEGIN{$,=" "} print $F[0], $F[2]' table.txt
brown mat
blue mug
yellow window

$ s='Sample123string42with777numbers'
$ echo "$s" | perl -F'\d+' -lane 'BEGIN{$,=","} print @F'
Sample,string,with,numbers

$ # default value of $, is undef
$ echo 'table' | perl -F -lane 'print @F[0..2]'
tab

Method 2: By using the join function.

$ s='Sample123string42with777numbers'
$ echo "$s" | perl -F'\d+' -lane 'print join ",", @F'
Sample,string,with,numbers

$ s='goal:amazing:whistle:kwality'
$ echo "$s" | perl -F: -lane 'print join "-", @F[-1, 1, 0]'
kwality-amazing-goal
$ echo "$s" | perl -F: -lane 'print join "::", @F, 42'
goal::amazing::whistle::kwality::42

Method 3: You can also manually build the output string within double quotes. Or use $" to specify the field separator for an array value within double quotes. $" could be remembered easily by noting that interpolation happens within double quotes.

$ s='goal:amazing:whistle:kwality'

$ echo "$s" | perl -F: -lane 'print "$F[0] $F[2]"'
goal whistle

$ # default value of $" is space
$ echo "$s" | perl -F: -lane 'print "@F[0, 2]"'
goal whistle

$ echo "$s" | perl -F: -lane 'BEGIN{$"="-"} print "msg: @F[-1, 1, 0]"'
msg: kwality-amazing-goal

Changing number of fields

Manipulating $#F will change the number of fields for @F array.

$ s='goal:amazing:whistle:kwality'

$ # reducing fields
$ echo "$s" | perl -F: -lane '$#F=1; print join ",", @F'
goal,amazing

$ # increasing fields
$ echo "$s" | perl -F: -lane '$F[$#F+1]="sea"; print join ":", @F'
goal:amazing:whistle:kwality:sea

$ # empty fields will be created as needed
$ echo "$s" | perl -F: -lane '$F[7]="go"; print join ":", @F'
goal:amazing:whistle:kwality::::go

Assigning $#F to -1 or lower will delete all the fields.

$ echo "1:2:3" | perl -F: -lane '$#F=-1; print "[@F]"'
[]

Here's an example of adding a new field based on existing fields.

$ cat marks.txt
Dept    Name    Marks
ECE     Raj     53
ECE     Joel    72
EEE     Moi     68
CSE     Surya   81
EEE     Tia     59
ECE     Om      92
CSE     Amy     67

$ # adds a new grade column based on marks in 3rd column
$ perl -anE 'BEGIN{$,="\t"; @g = qw(D C B A S)}
             say @F, $.==1 ? "Grade" : $g[$F[-1]/10 - 5]' marks.txt
Dept    Name    Marks   Grade
ECE     Raj     53      D
ECE     Joel    72      B
EEE     Moi     68      C
CSE     Surya   81      A
EEE     Tia     59      D
ECE     Om      92      S
CSE     Amy     67      C

Defining field contents instead of using split

The -F option uses the split function to get field values from input content. In contrast, using /regexp/g allows you to define what should the fields be made up of. Quoting from perldoc: Global matching

In list context, /g returns a list of matched groupings, or if there are no groupings, a list of matches to the whole regexp.

$ s='Sample123string42with777numbers'

$ # define fields to be one or more consecutive digits
$ # can also use: perl -nE 'say((/\d+/g)[1])'
$ echo "$s" | perl -nE '@f=/\d+/g; say $f[1]'
42

$ # define fields to be one or more consecutive alphabets
$ echo "$s" | perl -lne 'print join ",", /[a-z]+/ig'
Sample,string,with,numbers

Here's some examples to display results only if there's a match. Without the if conditions, you'll get empty lines for non-matching lines. Quoting from perldoc: The empty pattern //

If the PATTERN evaluates to the empty string, the last successfully matched regular expression is used instead. In this case, only the g and c flags on the empty pattern are honored; the other flags are taken from the original pattern. If no match has previously succeeded, this will (silently) act instead as a genuine empty pattern (which will always match).

$ perl -nE 'say join "\n", //g if /\bm\w*\b/' table.txt
mat
mug

$ # /\bb\w*\b/ will come into play only if a word starting with 'h' isn't found
$ # so, first line matches 'hair' but not 'brown' or 'bread'
$ # other lines don't have words starting with 'h'
$ perl -nE 'say join "\n", //g if /\bh\w*\b/ || /\bb\w*\b/' table.txt
hair
blue
banana

As an alternate, you can use while loop with g flag. Quoting from perldoc: Global matching

In scalar context, successive invocations against a string will have /g jump from match to match, keeping track of position in the string as it goes along.

$ perl -nE 'say $& while /\bm\w*\b/g' table.txt
mat
mug

# note that this form isn't suited for priority extraction
$ perl -nE 'say $& while /\b[bh]\w*\b/g' table.txt
brown
bread
hair
blue
banana

A simple split fails for csv input where fields can contain embedded delimiter characters. For example, a field content "fox,42" when , is the delimiter.

$ s='eagle,"fox,42",bee,frog'
$ # simply using , as separator isn't sufficient
$ echo "$s" | perl -F, -lane 'print $F[1]'
"fox

While metacpan: Text::CSV module should be preferred for robust csv parsing, regexp is enough for simple formats.

$ echo "$s" | perl -lne 'print((/"[^"]+"|[^,]+/g)[1])'
"fox,42"

Fixed width processing

The unpack function is more than just a different way of string slicing. It supports various formats and pre-processing, see perldoc: unpack, perldoc: pack and perldoc: perlpacktut for details.

In the example below, a indicates arbitrary binary string. The optional number that follows indicates length of the field.

$ cat items.txt
apple   fig banana
50      10  200

$ # here field widths have been assigned such that
$ # extra spaces are placed at the end of each field
$ # $_ is the default input string for 'unpack' function
$ perl -lne 'print join ",", unpack "a8a4a6"' items.txt
apple   ,fig ,banana
50      ,10  ,200
$ perl -lne 'print((unpack "a8a4a6")[1])' items.txt
fig 
10  

You can specify characters to be ignored with x followed by optional length.

$ # first field is 5 characters
$ # then 3 characters are ignored and 3 characters for second field
$ # then 1 character is ignored and 6 characters for third field
$ perl -lne 'print join ",", unpack "a5x3a3xa6"' items.txt
apple,fig,banana
50   ,10 ,200

Using * will cause remaining characters of that particular format to be consumed. Here Z is used to process ASCII NUL separated string.

$ printf 'banana\x0050\x00' | perl -nE 'say join ":", unpack "Z*Z*"'
banana:50

$ # first field is 5 characters, then 3 characters are ignored
$ # all the remaining characters are assigned to second field
$ perl -lne 'print join ",", unpack "a5x3a*"' items.txt
apple,fig banana
50   ,10  200

Unpacking isn't always needed, string slicing using substr may suffice. See perldoc: substr for documentation.

$ # same as: perl -F -anE 'say @F[2..4]'
$ echo 'b 123 good' | perl -nE 'say substr $_,2,3'
123
$ echo 'b 123 good' | perl -ne 'print substr $_,6'
good

$ # replacing arbitrary slice
$ echo 'b 123 good' | perl -pe 'substr $_,2,3,"gleam"'
b gleam good

See also perldoc: Functions for fixed-length data or records.

Assorted field processing functions

Having seen command line options and features commonly used for field processing, this section will highlight some of the built-in functions. There's just too many to meaningfully cover them in all in detail, so consider this to be just a brief overview of features. See also perldoc: Perl Functions by Category.

First up, the grep function that allows you to select fields based on a condition. In scalar context, it returns number of fields that matched the given condition. See perldoc: grep for documentation. See also unix.stackexchange: create lists of words according to binary numbers.

$ s='goal:amazing:42:whistle:kwality:3.14'

$ # fields containing 'in' or 'it' or 'is'
$ echo "$s" | perl -F: -lane 'print join ":", grep {/i[nts]/} @F'
amazing:whistle:kwality

$ # number of fields NOT containing a digit character
$ echo "$s" | perl -F: -lane 'print scalar grep {!/\d/} @F'
4

$ s='hour hand band mat heated apple'
$ echo "$s" | perl -lane 'print join "\n", grep {!/^h/ && length()<4} @F'
mat

$ echo '20 711 -983 5 21' | perl -lane 'print join ":", grep {$_ > 20} @F'
711:21

$ # maximum of one field containing 'r'
$ perl -lane 'print if 1 >= grep {/r/} @F' table.txt
blue cake mug shirt -7
yellow banana window shoes 3.14

The map function transforms each element according to the logic passed to it. See perldoc: map for documentation.

$ s='goal:amazing:42:whistle:kwality:3.14'
$ echo "$s" | perl -F: -lane 'print join ":", map {uc} @F'
GOAL:AMAZING:42:WHISTLE:KWALITY:3.14
$ echo "$s" | perl -F: -lane 'print join ":", map {/^[gw]/ ? uc : $_} @F'
GOAL:amazing:42:WHISTLE:kwality:3.14

$ echo '23 756 -983 5' | perl -lane 'print join ":", map {$_ ** 2} @F'
529:571536:966289:25

$ echo 'AaBbCc' | perl -F -lane 'print join " ", map {ord} @F'
65 97 66 98 67 99
$ # for in-place modification of the input array
$ echo 'AaBbCc' | perl -F -lane 'map {$_ = ord} @F; print "@F"'
65 97 66 98 67 99

$ echo 'a b c' | perl -lane 'print join ",", map {qq/"$_"/} @F'
"a","b","c"

Here's an example with grep and map combined.

$ s='hour hand band mat heated pineapple'
$ echo "$s" | perl -lane 'print join "\n", map {y/ae/X/r} grep {/^h/} @F'
hour
hXnd
hXXtXd
$ # with 'grep' alone, provided the transformation doesn't affect the condition
$ # also, @F will be changed here, above map+grep code will not affect @F
$ echo "$s" | perl -lane 'print join "\n", grep {y/ae/X/; /^h/} @F'
hour
hXnd
hXXtXd

Here's some examples with sort and reverse functions for arrays and strings. See perldoc: sort and perldoc: reverse for documentation.

$ # sorting numbers
$ echo '23 756 -983 5' | perl -lane 'print join " ", sort {$a <=> $b} @F'
-983 5 23 756

$ s='floor bat to dubious four'
$ # default alphabetic sorting in ascending order
$ echo "$s" | perl -lane 'print join ":", sort @F'
bat:dubious:floor:four:to

$ # sort by length of the fields in ascending order
$ echo "$s" | perl -lane 'print join ":", sort {length($a) <=> length($b)} @F'
to:bat:four:floor:dubious
$ # descending order
$ echo "$s" | perl -lane 'print join ":", sort {length($b) <=> length($a)} @F'
dubious:floor:four:bat:to

$ # same as: perl -F -lane 'print sort {$b cmp $a} @F'
$ echo 'foobar' | perl -F -lane 'print reverse sort @F'
roofba

Here's an example with multiple sorting conditions. If the transformation applied for each field is expensive, using Schwartzian transform can provide a faster result. See also stackoverflow: multiple sorting conditions.

$ s='try a bad to good i teal by nice how'

$ # longer words first, ascending alphabetic order as tie-breaker
$ echo "$s" | perl -anE 'say join ":",
                    sort {length($b) <=> length($a) or $a cmp $b} @F'
good:nice:teal:bad:how:try:by:to:a:i

$ # using Schwartzian transform
$ echo "$s" | perl -anE 'say join ":", map {$_->[0]}
                         sort {$b->[1] <=> $a->[1] or $a->[0] cmp $b->[0]}
                         map {[$_, length($_)]} @F'
good:nice:teal:bad:how:try:by:to:a:i

Here's an example for sorting in descending order based on header column names.

$ cat marks.txt
Dept    Name    Marks
ECE     Raj     53
ECE     Joel    72
EEE     Moi     68
CSE     Surya   81
EEE     Tia     59
ECE     Om      92
CSE     Amy     67

$ perl -lane '@i = sort {$F[$b] cmp $F[$a]} 0..$#F if $.==1;
              print join "\t", @F[@i]' marks.txt
Name    Marks   Dept
Raj     53      ECE
Joel    72      ECE
Moi     68      EEE
Surya   81      CSE
Tia     59      EEE
Om      92      ECE
Amy     67      CSE

info See Using modules chapter for more field processing functions.

Summary

This chapter discussed various ways in which you can split (or define) the input into fields and manipulate them. Many more examples will be discussed in later chapters.

Exercises

a) Extract only the contents between () or )( from each input line. Assume that () characters will be present only once every line.

$ cat brackets.txt
foo blah blah(ice) 123 xyz$ 
(almond-pista) choco
yo )yoyo( yo

##### add your solution here
ice
almond-pista
yoyo

b) For the input file scores.csv, extract Name and Physics fields in the format shown below.

$ cat scores.csv
Name,Maths,Physics,Chemistry
Blue,67,46,99
Lin,78,83,80
Er,56,79,92
Cy,97,98,95
Ort,68,72,66
Ith,100,100,100

##### add your solution here
Name:Physics
Blue:46
Lin:83
Er:79
Cy:98
Ort:72
Ith:100

c) For the input file scores.csv, display names of those who've scored above 80 in Maths.

##### add your solution here
Cy
Ith

d) Display the number of word characters for the given inputs. Word definition here is same as used in regular expressions. Can you construct two different solutions as indicated below?

$ # solve using 's' operator
$ echo 'hi there' | ##### add your solution here
7

$ # solve without using substitution or transliteration operator
$ echo 'u-no;co%."(do_12:as' | ##### add your solution here
12

e) Construct a solution that works for both the given sample inputs and the corresponding output shown.

$ s1='1 "grape" and "mango" and "guava"'
$ s2='("a 1""d""c-2""b")'

$ echo "$s1" | ##### add your solution here
"grape","guava","mango"
$ echo "$s2" | ##### add your solution here
"a 1","b","c-2","d"

f) Display only the third and fifth characters from each input line.

$ printf 'restore\ncat one\ncricket' | ##### add your solution here
so
to
ik

g) Transform the given input file fw.txt to get the output as shown below. If second field is empty (i.e. contains only space characters), replace it with NA.

$ cat fw.txt
1.3  rs   90  0.134563
3.8           6
5.2  ye       8.2387
4.2  kt   32  45.1

##### add your solution here
1.3,rs,0.134563
3.8,NA,6
5.2,ye,8.2387
4.2,kt,45.1

h) For the input file scores.csv, display the header as well as any row which contains b or t (irrespective of case) in the first field.

##### add your solution here
Name,Maths,Physics,Chemistry
Blue,67,46,99
Ort,68,72,66
Ith,100,100,100

i) Extract all whole words that contains 42 but not at the edge of a word. Assume a word cannot contain 42 more than once.

$ s='hi42bye nice1423 bad42 cool_42a 42fake'
$ echo "$s" | ##### add your solution here
hi42bye
nice1423
cool_42a

j) For the input file scores.csv, add another column named GP which is calculated out of 100 by giving 50% weightage to Maths and 25% each for Physics and Chemistry.

##### add your solution here
Name,Maths,Physics,Chemistry,GP
Blue,67,46,99,69.75
Lin,78,83,80,79.75
Er,56,79,92,70.75
Cy,97,98,95,96.75
Ort,68,72,66,68.5
Ith,100,100,100,100.0

k) For the input file mixed_fs.txt, retain only first two fields from each input line. The input and output field separators should be space for first two lines and , for the rest of the lines.

$ cat mixed_fs.txt
rose lily jasmine tulip
pink blue white yellow
car,mat,ball,basket
light green,brown,black,purple

##### add your solution here
rose lily
pink blue
car,mat
light green,brown

l) For the given space separated numbers, filter only numbers in the range 20 to 1000 (inclusive).

$ s='20 -983 5 756 634223'

##### add your solution here
20 756

m) For the given input file words.txt, filter all lines containing characters in ascending and descending order.

$ cat words.txt
bot
art
are
boat
toe
flee
reed

$ # ascending order
##### add your solution here
bot
art

$ # descending order
##### add your solution here
toe
reed

n) For the given space separated words, extract the three longest words.

$ s='I bought two bananas and three mangoes'

$ echo "$s" | ##### add your solution here
bananas
mangoes
bought

o) Convert the contents of split.txt as shown below.

$ cat split.txt
apple,1:2:5,mango
wry,4,look
pencil,3:8,paper

##### add your solution here
apple,1,mango
apple,2,mango
apple,5,mango
wry,4,look
pencil,3,paper
pencil,8,paper

p) Generate string combinations as shown below for the given input string passed as an environment variable.

$ s='{x,y,z}{1,2,3}' ##### add your solution here
x1 x2 x3 y1 y2 y3 z1 z2 z3