Field separators

This chapter will dive deep into field processing. You'll learn how to set input and output field separators, how to use regexps for defining fields and how to work with fixed length fields.

info The example_files directory has all the files used in the examples.

Default field separation

Using the -a option is equivalent to @F = split. So, the input will be split based on one or more sequence of whitespace characters. Also, leading and trailing whitespaces will be removed (you can use the LIMIT argument to preserve trailing empty fields). From perldoc: split:

split emulates the default behavior of the command line tool awk when the PATTERN is either omitted or a string composed of a single space character (such as ' ' or "\x20", but not e.g. / /). In this case, any leading whitespace in EXPR is removed before splitting occurs, and the PATTERN is instead treated as if it were /\s+/; in particular, this means that any contiguous whitespace (not just a single space character) is used as a separator. However, this special treatment can be avoided by specifying the pattern / / instead of the string " ", thereby allowing only a single space character to be a separator.

# $#F gives the index of the last element, i.e. size of array - 1
$ echo '   a   b   c   ' | perl -anE 'say $#F'
2

# note that the leading whitespaces aren't part of the field content
$ echo '   a   b   c   ' | perl -anE 'say "($F[0])"'
(a)
# trailing whitespaces are removed as well
$ echo '   a   b   c   ' | perl -anE 'say "($F[-1])"'
(c)

# here's another example with more whitespace characters thrown in
# in scalar context, @F will return the size of the array
$ printf '     one \t\f\v two\t\r\tthree \t\r ' | perl -anE 'say scalar @F'
3
$ printf '     one \t\f\v two\t\r\tthree \t\r ' | perl -anE 'say "$F[1]."'
two.

Input field separator

You can use the -F command line option to specify a custom regexp field separator. Note that the -a option implicitly sets -n and the -F option implicitly sets -n and -a on newer versions of Perl. However, this book will always explicitly use these options.

# use ':' as the input field separator
$ echo 'goal:amazing:whistle:kwality' | perl -F: -anE 'say "$F[0]\n$F[2]"'
goal
whistle

# use quotes to avoid clashes with shell special characters
$ echo 'one;two;three;four' | perl -F';' -anE 'say $F[2]'
three

$ echo 'load;err_msg--\ant,r2..not' | perl -F'\W+' -anE 'say $F[2]'
ant

$ echo 'hi.bye.hello' | perl -F'\.' -anE 'say $F[1]'
bye

You can also specify the regexp to the -F option within // delimiters as well. This is useful to add flags and the LIMIT argument if needed.

# count the number of vowels for each input line
# can also use: -F'(?i)[aeiou]'
$ printf 'COOL\nnice car\n' | perl -F'/[aeiou]/i' -anE 'say $#F'
2
3

# LIMIT=2
# note that the newline character is present as part of the last field content
$ echo 'goal:amazing:whistle:kwality' | perl -F'/:/,$_,2' -ane 'print $F[1]'
amazing:whistle:kwality

Character-wise separation

To get individual characters, you can provide an empty argument to the -F option.

$ echo 'apple' | perl -F -anE 'say $F[0]'
a

# -CS turns on UTF-8 for stdin/stdout/stderr streams
$ echo 'fox:αλεπού' | perl -CS -F -anE 'say @F[4..6]'
αλε

For more information about using Perl with different encodings, see:

Newline character in the last field

If the custom field separator doesn't affect the newline character, then the last element can contain the newline character.

# last element will not have the newline character with the -a option
# as leading/trailing whitespaces are trimmed with default split
$ echo 'cat dog' | perl -anE 'say "[$F[-1]]"'
[dog]

# last element will have the newline character since field separator is ':'
$ echo 'cat:dog' | perl -F: -anE 'say "[$F[-1]]"'
[dog
]
# unless the input itself doesn't have the newline character
$ printf 'cat:dog' | perl -F: -anE 'say "[$F[-1]]"'
[dog]

The newline character can also show up as the entire content of the last field.

# both leading and trailing whitespaces are trimmed
$ echo '  a b   c   ' | perl -anE 'say $#F'
2

# leading empty element won't be removed here
# and the last element will have only the newline character as the value
$ echo ':a:b:c:' | perl -F: -anE 'say $#F; say "[$F[-1]]"'
4
[
]

Using the -l option for field splitting

As mentioned before, the -l option is helpful if you wish to remove the newline character (more details will be discussed in the Record separators chapter). A side effect of removing the newline character before applying split is that the trailing empty fields will also get removed (you can set LIMIT as -1 to prevent this).

# -l will remove the newline character
# -l will also cause 'print' to append the newline character
$ echo 'cat:dog' | perl -F: -lane 'print "[$F[-1]]"'
[dog]

# since the newline character is chomped, the last element is empty
# which is then removed due to the default 'split' behavior
$ echo ':a:b:c:' | perl -F: -lane 'print scalar @F'
4

# set LIMIT as -1 to preserve trailing empty fields
# can also use: perl -F'/:/,$_,-1' -lane 'print scalar @F'
$ echo ':a:b:c:' | perl -lne 'print scalar split/:/,$_,-1'
5

Whitespace and NUL characters in field separation

As per perldoc: -F option, "You can't use literal whitespace or NUL characters in the pattern." Here are some examples with whitespaces being used as part of the field separator.

$ s='pick eat rest laugh'

# only one element, field separator didn't match at all!!
$ echo "$s" | perl -F'/t /' -lane 'print $F[0]'
pick eat rest laugh
# number of splits is correct
# but the space character shouldn't be part of field here
$ echo "$s" | perl -F't ' -lane 'print $F[1]'
 res
# this gives the expected behavior
$ echo "$s" | perl -F't\x20' -lane 'print $F[1]'
res

# Error!!
$ echo "$s" | perl -F't[ ]' -lane 'print $F[1]'
Unmatched [ in regex; marked by <-- HERE in m/t[ <-- HERE /.
# no issues if the split function is used explicitly
$ echo "$s" | perl -lne 'print((split /t[ ]/)[1])'
res

And here's an example with the ASCII NUL character being used as the field separator:

# doesn't work as expected when NUL is passed as a literal character
$ printf 'aa\0b\0c' | perl -F$'\0' -anE 'say join ",", @F' | cat -v
a,a,^@,b,^@,c

# no issues when passed as an escape sequence
$ printf 'aa\0b\0c' | perl -F'\0' -anE 'say join ",", @F' | cat -v
aa,b,c

Output field separator

There are a few ways to affect the separator to be used while displaying multiple values.

Method 1: The value of the $, special variable is used as the separator when multiple arguments (or list/array) are passed to the print and say functions. $, could be remembered easily by noting that , is used to separate multiple arguments. Note that the -l option is used in the examples below as a good practice even when not needed.

$ perl -lane 'BEGIN{$,=" "} print $F[0], $F[2]' table.txt
brown mat
blue mug
yellow window

$ s='Sample123string42with777numbers'
$ echo "$s" | perl -F'\d+' -lane 'BEGIN{$,=","} print @F'
Sample,string,with,numbers

# default value of $, is undef
$ echo 'table' | perl -F -lane 'print @F[0..2]'
tab

info See perldoc: perlvar for alternate names of special variables if you use the metacpan: English module. For example, $OFS or $OUTPUT_FIELD_SEPARATOR instead of $,

Method 2: By using the join function.

$ s='Sample123string42with777numbers'
$ echo "$s" | perl -F'\d+' -lane 'print join ",", @F'
Sample,string,with,numbers

$ s='goal:amazing:whistle:kwality'
$ echo "$s" | perl -F: -lane 'print join "-", @F[-1, 1, 0]'
kwality-amazing-goal
$ echo "$s" | perl -F: -lane 'print join "::", @F, 42'
goal::amazing::whistle::kwality::42

Method 3: You can also manually build the output string within double quotes. Or use $" to specify the field separator for an array value within double quotes. $" could be remembered easily by noting that interpolation happens within double quotes.

$ s='goal:amazing:whistle:kwality'

$ echo "$s" | perl -F: -lane 'print "$F[0] $F[2]"'
goal whistle

# default value of $" is a space character
$ echo "$s" | perl -F: -lane 'print "@F[0, 2]"'
goal whistle

$ echo "$s" | perl -F: -lane 'BEGIN{$"="-"} print "msg: @F[-1, 1, 0]"'
msg: kwality-amazing-goal

Manipulating $#F

Changing the value of $#F will affect the @F array. Here are some examples:

$ s='goal:amazing:whistle:kwality'

# reducing fields
$ echo "$s" | perl -F: -lane '$#F=1; print join ",", @F'
goal,amazing

# increasing fields
$ echo "$s" | perl -F: -lane '$F[$#F+1]="sea"; print join ":", @F'
goal:amazing:whistle:kwality:sea

# empty fields will be created as needed
$ echo "$s" | perl -F: -lane '$F[7]="go"; print join ":", @F'
goal:amazing:whistle:kwality::::go

Assigning $#F to -1 or lower will delete all the fields.

$ echo "1:2:3" | perl -F: -lane '$#F=-1; print "[@F]"'
[]

Manipulating $#F isn't always needed. Here's an example of simply printing the additional field instead of modifying the array.

$ cat marks.txt
Dept    Name    Marks
ECE     Raj     53
ECE     Joel    72
EEE     Moi     68
CSE     Surya   81
EEE     Tia     59
ECE     Om      92
CSE     Amy     67

# adds a new grade column based on marks in the third column
$ perl -anE 'BEGIN{$,="\t"; @g = qw(D C B A S)}
             say @F, $.==1 ? "Grade" : $g[$F[-1]/10 - 5]' marks.txt
Dept    Name    Marks   Grade
ECE     Raj     53      D
ECE     Joel    72      B
EEE     Moi     68      C
CSE     Surya   81      A
EEE     Tia     59      D
ECE     Om      92      S
CSE     Amy     67      C

Defining field contents instead of splitting

The -F option uses the split function to generate the fields. In contrast, you can use /regexp/g to define what should the fields be made up of. Quoting from perldoc: Global matching:

In list context, /g returns a list of matched groupings, or if there are no groupings, a list of matches to the whole regexp.

Here are some examples:

$ s='Sample123string42with777numbers'
# define fields to be one or more consecutive digits
# can also use: perl -nE 'say((/\d+/g)[1])'
$ echo "$s" | perl -nE '@f=/\d+/g; say $f[1]'
42

$ s='coat Bin food tar12 best Apple fig_42'
# whole words made up of lowercase alphabets and digits only
$ echo "$s" | perl -nE 'say join ",", /\b[a-z0-9]+\b/g'
coat,food,tar12,best

$ s='items: "apple" and "mango"'
# get the first double quoted item
$ echo "$s" | perl -nE '@f=/"[^"]+"/g; say $f[0]'
"apple"

Here are some examples for displaying results only if there's a match. Without the if conditions, you'll get empty lines for non-matching lines. Quoting from perldoc: The empty pattern

If the PATTERN evaluates to the empty string, the last successfully matched regular expression is used instead. In this case, only the g and c flags on the empty pattern are honored; the other flags are taken from the original pattern. If no match has previously succeeded, this will (silently) act instead as a genuine empty pattern (which will always match).

$ perl -nE 'say join "\n", //g if /\bm\w*\b/' table.txt
mat
mug

# /\bb\w*\b/ will come into play only if a word starting with 'h' isn't found
# so, first line matches 'hair' but not 'brown' or 'bread'
# other lines don't have words starting with 'h'
$ perl -nE 'say join "\n", //g if /\bh\w*\b/ || /\bb\w*\b/' table.txt
hair
blue
banana

As an alternate, you can use a while loop with the g flag. Quoting from perldoc: Global matching:

In scalar context, successive invocations against a string will have /g jump from match to match, keeping track of position in the string as it goes along.

$ perl -nE 'say $& while /\bm\w*\b/g' table.txt
mat
mug

# note that this form isn't suited for priority-based extraction
$ perl -nE 'say $& while /\b[bh]\w*\b/g' table.txt
brown
bread
hair
blue
banana

A simple split fails for CSV input where fields can contain embedded delimiter characters. For example, a field content "fox,42" when , is the delimiter.

$ s='eagle,"fox,42",bee,frog'
# simply using , as separator isn't sufficient
$ echo "$s" | perl -F, -lane 'print $F[1]'
"fox

While metacpan: Text::CSV module should be preferred for robust CSV parsing, regexp is enough for simple formats.

$ echo "$s" | perl -lne 'print((/"[^"]+"|[^,]+/g)[1])'
"fox,42"

Fixed width processing

The unpack function is more than just a different way of string slicing. It supports various formats and pre-processing, see perldoc: unpack, perldoc: pack and perldoc: perlpacktut for details.

In the example below, a indicates arbitrary binary string. The optional number that follows indicates length of the field.

$ cat items.txt
apple   fig banana
50      10  200

# here field widths have been assigned such that
# extra spaces are placed at the end of each field
# $_ is the default input string for the 'unpack' function
$ perl -lne 'print join ",", unpack "a8a4a6"' items.txt
apple   ,fig ,banana
50      ,10  ,200

$ perl -lne 'print((unpack "a8a4a6")[1])' items.txt
fig 
10  

You can specify characters to be ignored with x followed by an optional length.

# first field is 5 characters
# then 3 characters are ignored and 3 characters for the second field
# then 1 character is ignored and 6 characters for the third field
$ perl -lne 'print join ",", unpack "a5x3a3xa6"' items.txt
apple,fig,banana
50   ,10 ,200

Using * will cause remaining characters of that particular format to be consumed. Here Z is used to process strings that are separated by the ASCII NUL character.

$ printf 'banana\x0050\x00' | perl -nE 'say join ":", unpack "Z*Z*"'
banana:50

# first field is 5 characters, then 3 characters are ignored
# all the remaining characters are assigned to the second field
$ perl -lne 'print join ",", unpack "a5x3a*"' items.txt
apple,fig banana
50   ,10  200

Unpacking isn't always needed, string slicing using substr may suffice. See perldoc: substr for documentation.

# same as: perl -F -anE 'say @F[2..4]'
$ echo 'b 123 good' | perl -nE 'say substr $_,2,3'
123
$ echo 'b 123 good' | perl -ne 'print substr $_,6'
good

# replace arbitrary slice
$ echo 'b 123 good' | perl -pe 'substr $_,2,3,"gleam"'
b gleam good

See also perldoc: Functions for fixed-length data or records.

Assorted field processing functions

Having seen command line options and features commonly used for field processing, this section will highlight some of the built-in functions. There are just too many to meaningfully cover them in all in detail, so consider this to be just a brief overview of features. See also perldoc: Perl Functions by Category.

First up, the grep function that allows you to select fields based on a condition. In scalar context, it returns the number of fields that matched the given condition. See perldoc: grep for documentation. See also unix.stackexchange: create lists of words according to binary numbers.

$ s='goal:amazing:42:whistle:kwality:3.14'

# fields containing 'in' or 'it' or 'is'
$ echo "$s" | perl -F: -lane 'print join ":", grep {/i[nts]/} @F'
amazing:whistle:kwality

# number of fields NOT containing a digit character
$ echo "$s" | perl -F: -lane 'print scalar grep {!/\d/} @F'
4

$ s='hour hand band mat heated apple hit'
$ echo "$s" | perl -lane 'print join "\n", grep {!/^h/ && length()<4} @F'
mat

$ echo '20 711 -983 5 21' | perl -lane 'print join ":", grep {$_ > 20} @F'
711:21

# no more than one field can contain 'r'
$ perl -lane 'print if 1 >= grep {/r/} @F' table.txt
blue cake mug shirt -7
yellow banana window shoes 3.14

The map function transforms each element according to the logic passed to it. See perldoc: map for documentation.

$ s='goal:amazing:42:whistle:kwality:3.14'
$ echo "$s" | perl -F: -lane 'print join ":", map {uc} @F'
GOAL:AMAZING:42:WHISTLE:KWALITY:3.14
$ echo "$s" | perl -F: -lane 'print join ":", map {/^[gw]/ ? uc : $_} @F'
GOAL:amazing:42:WHISTLE:kwality:3.14

$ echo '23 756 -983 5' | perl -lane 'print join ":", map {$_ ** 2} @F'
529:571536:966289:25

$ echo 'AaBbCc' | perl -F -lane 'print join " ", map {ord} @F'
65 97 66 98 67 99
# for in-place modification of the input array
$ echo 'AaBbCc' | perl -F -lane 'map {$_ = ord} @F; print "@F"'
65 97 66 98 67 99

$ echo 'a b c' | perl -lane 'print join ",", map {qq/"$_"/} @F'
"a","b","c"

Here's an example with grep and map combined.

$ s='hour hand band mat heated pineapple'
$ echo "$s" | perl -lane 'print join "\n", map {y/ae/X/r} grep {/^h/} @F'
hour
hXnd
hXXtXd
# with 'grep' alone, provided the transformation doesn't affect the condition
# also, @F will be changed here, above map+grep code will not affect @F
$ echo "$s" | perl -lane 'print join "\n", grep {y/ae/X/; /^h/} @F'
hour
hXnd
hXXtXd

Here are some examples with sort and reverse functions for arrays and strings. See perldoc: sort and perldoc: reverse for documentation.

# sorting numbers
$ echo '23 756 -983 5' | perl -lane 'print join " ", sort {$a <=> $b} @F'
-983 5 23 756

$ s='floor bat to dubious four'
# default alphabetic sorting in ascending order
$ echo "$s" | perl -lane 'print join ":", sort @F'
bat:dubious:floor:four:to

# sort by length of the fields in ascending order
$ echo "$s" | perl -lane 'print join ":", sort {length($a) <=> length($b)} @F'
to:bat:four:floor:dubious
# descending order
$ echo "$s" | perl -lane 'print join ":", sort {length($b) <=> length($a)} @F'
dubious:floor:four:bat:to

# same as: perl -F -lane 'print sort {$b cmp $a} @F'
$ echo 'dragon' | perl -F -lane 'print reverse sort @F'
rongda

Here's an example with multiple sorting conditions. If the transformation applied for each field is expensive, using Schwartzian transform can provide a faster result. See also stackoverflow: multiple sorting conditions.

$ s='try a bad to good i teal by nice how'

# longer words first, ascending alphabetic order as tie-breaker
$ echo "$s" | perl -anE 'say join ":",
                    sort {length($b) <=> length($a) or $a cmp $b} @F'
good:nice:teal:bad:how:try:by:to:a:i

# using Schwartzian transform
$ echo "$s" | perl -anE 'say join ":", map {$_->[0]}
                         sort {$b->[1] <=> $a->[1] or $a->[0] cmp $b->[0]}
                         map {[$_, length($_)]} @F'
good:nice:teal:bad:how:try:by:to:a:i

Here's an example for sorting in descending order based on header column names.

$ cat marks.txt
Dept    Name    Marks
ECE     Raj     53
ECE     Joel    72
EEE     Moi     68
CSE     Surya   81
EEE     Tia     59
ECE     Om      92
CSE     Amy     67

$ perl -lane '@i = sort {$F[$b] cmp $F[$a]} 0..$#F if $.==1;
              print join "\t", @F[@i]' marks.txt
Name    Marks   Dept
Raj     53      ECE
Joel    72      ECE
Moi     68      EEE
Surya   81      CSE
Tia     59      EEE
Om      92      ECE
Amy     67      CSE

info See the Using modules chapter for more field processing functions.

Summary

This chapter discussed various ways in which you can split (or define) the input into fields and manipulate them. Many more examples will be discussed in later chapters.

Exercises

info The exercises directory has all the files used in this section.

1) For the input file brackets.txt, extract only the contents between () or )( from each input line. Assume that () characters will be present only once every line.

$ cat brackets.txt
foo blah blah(ice) 123 xyz$ 
(almond-pista) choco
yo )yoyo( yo

##### add your solution here
ice
almond-pista
yoyo

2) For the input file scores.csv, extract Name and Physics fields in the format shown below.

$ cat scores.csv
Name,Maths,Physics,Chemistry
Blue,67,46,99
Lin,78,83,80
Er,56,79,92
Cy,97,98,95
Ort,68,72,66
Ith,100,100,100

##### add your solution here
Name:Physics
Blue:46
Lin:83
Er:79
Cy:98
Ort:72
Ith:100

3) For the input file scores.csv, display names of those who've scored above 80 in Maths.

##### add your solution here
Cy
Ith

4) Display the number of word characters for the given inputs. Word definition here is same as used in regular expressions. Can you construct two different solutions as indicated below?

# solve using the 's' operator
$ echo 'hi there' | ##### add your solution here
7

# solve without using the substitution or transliteration operators
$ echo 'u-no;co%."(do_12:as' | ##### add your solution here
12

5) For the input file quoted.txt, extract the sequence of characters surrounded by double quotes and display them in the format shown below.

$ cat quoted.txt
1 "grape" and "mango" and "guava"
("c 1""d""a-2""b")

##### add your solution here
"grape","guava","mango"
"a-2","b","c 1","d"

6) Display only the third and fifth characters from each input line as shown below.

$ printf 'restore\ncat one\ncricket' | ##### add your solution here
so
to
ik

7) Transform the given input file fw.txt to get the output as shown below. If a field is empty (i.e. contains only space characters), replace it with NA.

$ cat fw.txt
1.3  rs   90  0.134563
3.8           6
5.2  ye       8.2387
4.2  kt   32  45.1

##### add your solution here
1.3,rs,0.134563
3.8,NA,6
5.2,ye,8.2387
4.2,kt,45.1

8) For the input file scores.csv, display the header as well as any row which contains b or t (irrespective of case) in the first field.

##### add your solution here
Name,Maths,Physics,Chemistry
Blue,67,46,99
Ort,68,72,66
Ith,100,100,100

9) Extract all whole words containing 42 but not at the edge of a word. Assume a word cannot contain 42 more than once.

$ s='hi42bye nice1423 bad42 cool_42a 42fake'
$ echo "$s" | ##### add your solution here
hi42bye
nice1423
cool_42a

10) For the input file scores.csv, add another column named GP which is calculated out of 100 by giving 50% weightage to Maths and 25% each for Physics and Chemistry.

##### add your solution here
Name,Maths,Physics,Chemistry,GP
Blue,67,46,99,69.75
Lin,78,83,80,79.75
Er,56,79,92,70.75
Cy,97,98,95,96.75
Ort,68,72,66,68.5
Ith,100,100,100,100

11) For the input file mixed_fs.txt, retain only the first two fields from each input line. The input and output field separators should be space for first two lines and , for the rest of the lines.

$ cat mixed_fs.txt
rose lily jasmine tulip
pink blue white yellow
car,mat,ball,basket
light green,brown,black,purple
apple,banana,cherry

##### add your solution here
rose lily
pink blue
car,mat
light green,brown
apple,banana

12) For the given space separated numbers, filter only numbers in the range 20 to 1000 (inclusive).

$ s='20 -983 5 756 634223 1000'

$ echo "$s" | ##### add your solution here
20 756 1000

13) For the given input file words.txt, filter all lines containing characters in ascending and descending order.

$ cat words.txt
bot
art
are
boat
toe
flee
reed

# ascending order
##### add your solution here
bot
art

# descending order
##### add your solution here
toe
reed

14) For the given space separated words, extract the three longest words.

$ s='I bought two bananas and three mangoes'

$ echo "$s" | ##### add your solution here
bananas
mangoes
bought

15) Convert the contents of split.txt as shown below.

$ cat split.txt
apple,1:2:5,mango
wry,4,look
pencil,3:8,paper

##### add your solution here
apple,1,mango
apple,2,mango
apple,5,mango
wry,4,look
pencil,3,paper
pencil,8,paper

16) Generate string combinations as shown below for the given input string passed as an environment variable.

$ s='{x,y,z}{1,2,3}' ##### add your solution here
x1 x2 x3 y1 y2 y3 z1 z2 z3

17) For the input file varying_fields.txt, construct a solution to get the output shown below.

$ cat varying_fields.txt
hi,bye,there,was,here,to
1,2,3,4,5

##### add your solution here
hi:bye:to
1:2:5

18) The fields.txt file has fields separated by the : character. Delete : and the last field if there is a digit character anywhere before the last field. Solution shouldn't use the s operator.

$ cat fields.txt
42:cat
twelve:a2b
we:be:he:0:a:b:bother
apple:banana-42:cherry:
dragon:unicorn:centaur

##### add your solution here
42
twelve:a2b
we:be:he:0:a:b
apple:banana-42:cherry
dragon:unicorn:centaur

19) The sample string shown below uses cat as the field separator (irrespective of case). Use space as the output field separator and add 42 as the last field.

$ s='applecatfigCaT12345cAtbanana'
$ echo "$s" | ##### add your solution here
apple fig 12345 banana 42

20) For the input file sample.txt, filter lines containing 5 or more lowercase vowels.

##### add your solution here
How are you
Believe it
No doubt you like it too
Much ado about nothing