Field separators

This chapter will dive deep into field processing. You'll learn how to set input and output field separators, how to use regexps for defining fields and how to work with fixed length fields.

Default field separation

By default, the -a option splits the input on sequences of one or more whitespace characters. In addition, whitespace at the start or end of the input gets trimmed and won't be part of the field contents. Using -a is equivalent to $F = $_.split. From ruby-doc: split:

If pattern is a single space, str is split on whitespace, with leading and trailing whitespace and runs of contiguous whitespace characters ignored...If pattern is nil, the value of $; is used. If $; is nil (which is the default), str is split on whitespace as if ' ' were specified.

$ echo '   a   b   c   ' | ruby -ane 'puts $F.size'
3
$ # note that leading whitespace isn't part of field content
$ echo '   a   b   c   ' | ruby -ane 'puts $F[0]'
a
$ # note that trailing whitespace isn't part of field content
$ echo '   a   b   c   ' | ruby -ane 'puts $F[-1] + "."'
c.

$ # here's another example with more whitespace characters thrown in
$ printf '     one \t\f\v two\t\r\tthree  ' | ruby -ane 'puts $F.size'
3
$ printf '     one \t\f\v two\t\r\tthree  ' | ruby -ane 'puts $F[1] + "."'
two.
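
Outside of one-liners, the same splitting can be tried in a plain script. Below is a minimal sketch, assuming default_fields is a hypothetical helper name (not part of Ruby) that stands in for what -a does when no separator is set.

```ruby
# default_fields is a made-up name for illustration only.
# With no arguments, String#split breaks on runs of whitespace and
# ignores leading/trailing whitespace, which is what -a relies on.
def default_fields(line)
  line.split
end

p default_fields("   a   b   c   \n")
p default_fields("     one \t\f\v two\t\r\tthree  ")
```

Both calls should report three fields, with no whitespace left inside any of them.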

Input field separator

You can use the -F command line option to specify a custom field separator. The value passed to the option will be treated as a regexp. Note that -a option is also necessary for -F option to work. Instead of -F option, you can also set $; to a string or regexp value in the code, but $; is deprecated.

$ # use ':' as input field separator
$ echo 'goal:amazing:whistle:kwality' | ruby -F: -ane 'puts $F[0], $F[-1]'
goal
kwality

$ # use quotes to avoid clashes with shell special characters
$ echo 'one;two;three;four' | ruby -F';' -ane 'puts $F[2]'
three

$ echo 'load;err_msg--\ant,r2..not' | ruby -F'\W+' -ane 'puts $F[2]'
ant

$ echo 'hi.bye.hello' | ruby -F'\.' -ane 'puts $F[1]'
bye

$ # count number of vowels for each input line
$ printf 'COOL\nnice car\n' | ruby -F'(?i)[aeiou]' -ane 'puts $F.size - 1'
2
3
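
To experiment with separators in a script, the -F behavior can be approximated by building a regexp from a string and passing it to split. This is only a rough sketch: fields_for is a hypothetical helper, and it is an assumption that this mirrors -F closely enough for simple cases like the ones in this section.

```ruby
# fields_for is a made-up helper for illustration.
# The separator string is compiled to a regexp, then used with split.
def fields_for(line, separator)
  line.split(Regexp.new(separator))
end

p fields_for("goal:amazing:whistle:kwality\n", ":")
p fields_for("hi.bye.hello\n", '\.')
# number of vowels = number of fields - 1
p fields_for("COOL\n", "(?i)[aeiou]").size - 1
```

Note that the newline character stays attached to the last field in the first two calls, since the separator doesn't match it.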

No need to use field separation to access individual characters. See ruby-doc: Encoding for details on handling different string encodings.

$ echo 'apple' | ruby -ne 'puts $_[0]'
a

$ ruby -e 'puts Encoding.default_external'
UTF-8
$ LC_ALL=C ruby -e 'puts Encoding.default_external'
US-ASCII

$ echo 'fox:αλεπού' | ruby -ne 'puts $_[4..5]'
αλ
$ # use -E option to explicitly specify external/internal encodings
$ echo 'fox:αλεπού' | ruby -E UTF-8:UTF-8 -ne 'puts $_[4..5]'
αλ

Warning: If the custom field separator specified with the -F option doesn't match the newline character, then the last element can contain the newline character.

$ # last element will not have newline character with default -a
$ # as leading/trailing whitespace is trimmed with default split
$ echo 'cat dog' | ruby -ane 'puts "[#{$F[-1]}]"'
[dog]

$ # last element will have newline character since field separator is ':'
$ echo 'cat:dog' | ruby -F: -ane 'puts "[#{$F[-1]}]"'
[dog
]
$ # unless the input itself doesn't have newline character
$ printf 'cat:dog' | ruby -F: -ane 'puts "[#{$F[-1]}]"'
[dog]

The newline character can also end up as the entire content of the last field.

$ # both leading and trailing whitespace is trimmed
$ echo '  a b   c   ' | ruby -ane 'puts $F.size'
3

$ # leading empty element won't be removed here
$ # and last element will have newline character
$ echo ':a:b:c:' | ruby -F: -ane 'puts $F.size'
5

As mentioned before, the -l option is helpful if you wish to remove the newline character (more details will be discussed in the Record separators chapter). A side effect of removing the newline character before applying split is that a trailing empty field will also get removed (you can explicitly call the split method with -1 as the limit to prevent this).

$ # -l will remove the newline character
$ echo 'cat:dog' | ruby -F: -lane 'puts "[#{$F[-1]}]"'
[dog]
$ # -l will also cause 'print' method to append the newline character
$ echo 'cat:dog' | ruby -F: -lane 'print "[#{$F[-1]}]"'
[dog]

$ # since newline character is chomped, last element is empty
$ # which is then removed due to default 'split' behavior
$ echo ':a:b:c:' | ruby -F: -lane 'puts $F.size'
4
$ # explicit call to split with -1 as limit will preserve the empty element
$ echo ':a:b:c:' | ruby -lane 'puts $_.split(/:/, -1).size'
5
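
The interaction between the newline character, chomping and the split limit can be summarized in a short script. This is a sketch only: field_count and its keyword arguments are hypothetical names introduced for this illustration.

```ruby
# field_count is a made-up helper for illustration.
# chomp: true         -> remove the trailing newline first (like -l)
# keep_trailing: true -> pass -1 as limit so empty trailing fields survive
def field_count(line, chomp: false, keep_trailing: false)
  line = line.chomp if chomp
  limit = keep_trailing ? -1 : 0   # 0 behaves like no limit at all
  line.split(":", limit).size
end

line = ":a:b:c:\n"
p field_count(line)                                   # newline is the last field
p field_count(line, chomp: true)                      # trailing empty field dropped
p field_count(line, chomp: true, keep_trailing: true) # trailing empty field kept
```

The three calls should print 5, 4 and 5, matching the one-liners in this section.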

Output field separator

There are a few ways to change the separator used while displaying multiple values. The value of the $, global variable is used as the separator when multiple arguments are passed to the print method. This is usually used in combination with the -l option so that a newline character is appended automatically as well. The join method also uses $, as the default separator. However, $, is now deprecated.

$ ruby -lane 'BEGIN{$, = " "}; print $F[0], $F[2]' table.txt
-e:1: warning: `$,` is deprecated
brown mat
blue mug
yellow window

$ ruby -W:no-deprecated -lane 'BEGIN{$, = " "}; print $F[0], $F[2]' table.txt
brown mat
blue mug
yellow window

The other options include manually building the output string within double quotes. Or, use the join method. Note that -l option is used in the examples below as a good practice even when not needed.

$ ruby -lane 'puts "#{$F[0]} #{$F[2]}"' table.txt
brown mat
blue mug
yellow window

$ echo 'Sample123string42with777numbers' | ruby -F'\d+' -lane 'puts $F.join(",")'
Sample,string,with,numbers

$ s='goal:amazing:whistle:kwality'
$ echo "$s" | ruby -F: -lane 'puts $F.values_at(-1, 1, 0).join("-")'
kwality-amazing-goal

$ # you can also use the '*' operator
$ echo "$s" | ruby -F: -lane '$F.append(42); puts $F * "::"'
goal::amazing::whistle::kwality::42
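
Since $, is deprecated, passing the output separator explicitly is the safer habit in scripts as well. Here's a minimal sketch, where pick_and_join is a hypothetical helper name and the argument order is an arbitrary choice for this example.

```ruby
# pick_and_join is a made-up helper for illustration.
# It splits on in_sep, selects fields by index and joins them
# with out_sep, without relying on the deprecated $, variable.
def pick_and_join(line, in_sep, out_sep, *indexes)
  line.chomp.split(in_sep).values_at(*indexes).join(out_sep)
end

puts pick_and_join("goal:amazing:whistle:kwality\n", ":", "-", -1, 1, 0)
puts pick_and_join("one;two;three;four\n", ";", " ", 0, 2)
```

The first call should give kwality-amazing-goal and the second one three.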

scan method

The -F option uses the split method to get field values from the input content. In contrast, the scan method allows you to define what the fields should be made up of. The scan method has no concept of removing empty trailing fields, nor does it accept arguments like limit.

$ s='Sample123string42with777numbers'

$ # define fields to be one or more consecutive digits
$ echo "$s" | ruby -lne 'puts $_.scan(/\d+/)[1]'
42

$ # define fields to be one or more consecutive letters
$ echo "$s" | ruby -lne 'puts $_.scan(/[a-z]+/i) * ","'
Sample,string,with,numbers

A simple split fails for csv input where fields can contain embedded delimiter characters. For example, a field content "fox,42" when , is the delimiter.

$ s='eagle,"fox,42",bee,frog'

$ # simply using , as separator isn't sufficient
$ echo "$s" | ruby -F, -lane 'puts $F[1]'
"fox

While ruby-doc: CSV library should be preferred for robust csv parsing, scan can be used for simple workarounds.

$ echo "$s" | ruby -lne 'puts $_.scan(/"[^"]*"|[^,]+/)[1]'
"fox,42"

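To see where the scan workaround and the CSV library differ, the two can be compared side by side. This sketch assumes the csv library is available to require in your Ruby installation; scan_fields is a hypothetical helper wrapping the regexp used in this section.

```ruby
require "csv"

# scan_fields is a made-up helper wrapping the scan-based workaround
def scan_fields(line)
  line.scan(/"[^"]*"|[^,]+/)
end

line = 'eagle,"fox,42",bee,frog'
p scan_fields(line)       # the quotes remain part of the field
p CSV.parse_line(line)    # the parser removes the quotes

# empty fields: scan skips them, CSV reports them as nil
p scan_fields("a,,b")
p CSV.parse_line("a,,b")
```

The scan version returns only non-empty matches, so field positions shift when the input has empty fields. That is one reason to prefer the CSV library for anything beyond simple input.
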
Fixed width processing

The unpack method is more than just a different way of using string slicing. It supports various formats and pre-processing, see ruby-doc: unpack for details.

In the example below, a indicates arbitrary binary string. The optional number that follows indicates length of the field.

$ cat items.txt
apple   fig banana
50      10  200

$ # here field widths have been assigned such that
$ # extra spaces are placed at the end of each field
$ ruby -ne 'puts $_.unpack("a8a4a6") * ","' items.txt
apple   ,fig ,banana
50      ,10  ,200
$ ruby -ne 'puts $_.unpack("a8a4a6")[1]' items.txt
fig 
10  

You can specify characters to be ignored with x followed by optional length.

$ # first field is 5 characters
$ # then 3 characters are ignored and 3 characters for second field
$ # then 1 character is ignored and 6 characters for third field
$ ruby -ne 'puts $_.unpack("a5x3a3xa6") * ","' items.txt
apple,fig,banana
50   ,10 ,200

Using * will cause remaining characters of that particular format to be consumed. Here Z is used to process ASCII NUL separated string.

$ printf 'banana\x0050\x00' | ruby -ne 'puts $_.unpack("Z*Z*") * ":"'
banana:50

$ # first field is 5 characters, then 3 characters are ignored
$ # all the remaining characters are assigned to second field
$ ruby -ne 'puts $_.unpack("a5x3a*") * ","' items.txt
apple,fig banana
50   ,10  200
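
If the padding spaces aren't wanted, the A directive is worth knowing about: unlike a, it removes trailing spaces (and NUL characters) from each field. The sketch below uses fixed_fields, a hypothetical helper, with the same two lines as items.txt.

```ruby
# fixed_fields is a made-up helper for illustration.
# The line is chomped first so that the newline doesn't end up in the last field.
def fixed_fields(line, format)
  line.chomp.unpack(format)
end

p fixed_fields("apple   fig banana\n", "a8a4a6")  # padding is kept
p fixed_fields("apple   fig banana\n", "A8A4A6")  # trailing spaces removed
p fixed_fields("50      10  200\n", "A8A4A6")
```

With A, the field widths can include the padding and no separate strip step is needed.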

Unpacking isn't always needed, simple string slicing might suffice.

$ echo 'b 123 good' | ruby -ne 'puts $_[2,3]'
123
$ echo 'b 123 good' | ruby -ne 'puts $_[6,4]'
good

$ # replacing arbitrary slice
$ echo 'b 123 good' | ruby -lpe '$_[2,3] = "gleam"'
b gleam good

Assorted field processing methods

Having seen the command line options and features commonly used for field processing, this section will highlight some of the built-in Array and Enumerable methods. There are just too many to meaningfully cover them all in detail, so consider this to be just a brief overview of features.

First up, regexp based field selection. grep(cond) and grep_v(cond) are specialized filter methods that perform the cond === object test. See stackoverflow: What does the === operator do in Ruby? for more details.

$ s='goal:amazing:42:whistle:kwality:3.14'

$ # fields containing 'in' or 'it' or 'is'
$ echo "$s" | ruby -F: -lane 'puts $F.grep(/i[nts]/) * ":"'
amazing:whistle:kwality

$ # fields NOT containing a digit character
$ echo "$s" | ruby -F: -lane 'puts $F.grep_v(/\d/) * ":"'
goal:amazing:whistle:kwality
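
Because grep relies on ===, the condition doesn't have to be a regexp. The following sketch shows the equivalence with an explicit === test and a couple of non-regexp conditions; same_as_grep? is a hypothetical helper used only for this check.

```ruby
# same_as_grep? is a made-up helper: it checks that grep(cond) selects
# exactly the elements for which cond === element is true
def same_as_grep?(list, cond)
  list.grep(cond) == list.select { |e| cond === e }
end

fields = "goal:amazing:42:whistle:kwality:3.14".split(":")
p fields.grep(/\d/)
p same_as_grep?(fields, /i[nts]/)

# === is also defined by other classes, for example Range and Class
p [5, 20, 756, -983].grep(20..1000)
p [1, "two", 3.0].grep(Integer)
```

The Range and Class cases work because those classes define === as a membership test and an instance-of test respectively.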

The map method helps to transform each element according to the logic passed to it.

$ s='goal:amazing:42:whistle:kwality:3.14'
$ echo "$s" | ruby -F: -lane 'puts $F.map(&:upcase) * ":"'
GOAL:AMAZING:42:WHISTLE:KWALITY:3.14

$ # you can also use numbered parameters: {_1.to_i ** 2}
$ echo '23 756 -983 5' | ruby -ane 'puts $F.map {|n| n.to_i ** 2} * " "'
529 571536 966289 25

$ echo 'AaBbCc' | ruby -lne 'puts $_.chars.map(&:ord) * " "'
65 97 66 98 67 99

$ echo '3.14,17,6' | ruby -F, -ane 'puts $F.map(&:to_f).sum'
26.14

The filter method (which has other aliases and opposites too) is handy to construct all kinds of selection conditions. You can combine with map by using the filter_map method.

$ s='hour hand band mat heated pineapple'

$ echo "$s" | ruby -ane 'puts $F.filter {|w| w[0]!="h" && w.size<6}'
band
mat

$ echo "$s" | ruby -ane 'puts $F.filter_map {|w|
                w.gsub(/[ae]/, "X") if w[0]=="h"}'
hour
hXnd
hXXtXd

The reduce method can be used to perform an action against all the elements of an array and get a singular value as the result.

$ # sum of input numbers with initial value of 100
$ echo '3.14,17,6' | ruby -F, -lane 'puts $F.map(&:to_f).reduce(100, :+)'
126.14

$ # product of input numbers
$ echo '3.14,17,6' | ruby -F, -lane 'puts $F.map(&:to_f).reduce(:*)'
320.28000000000003
$ echo '3.14,17,6' | ruby -F, -lane 'puts $F.reduce(1) {|op,n| op*n.to_f}'
320.28000000000003
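
If the block form of reduce looks opaque, it may help to see it next to an explicit loop. This is a sketch with integer inputs (chosen here to avoid floating-point noise in the output); product_with_loop is a hypothetical helper.

```ruby
# product_with_loop is a made-up helper: an explicit loop doing
# what reduce(1) { |acc, n| acc * n } does
def product_with_loop(nums)
  acc = 1                      # the initial value passed to reduce
  nums.each { |n| acc *= n }   # the block body, applied per element
  acc                          # the final accumulator is the result
end

nums = [3, 17, 6]
p product_with_loop(nums)
p nums.reduce(1) { |acc, n| acc * n }
p nums.reduce(:*)
p nums.reduce(100, :+)
```

The first three lines should all print 306 and the last one 126.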

Here are some examples with the sort, sort_by and uniq methods for arrays and strings.

$ s='floor bat to dubious four'
$ echo "$s" | ruby -ane 'puts $F.sort * ":"'
bat:dubious:floor:four:to
$ echo "$s" | ruby -ane 'puts $F.sort_by(&:size) * ":"'
to:bat:four:floor:dubious

$ # numeric sort example
$ echo '23 756 -983 5' | ruby -lane 'puts $F.sort_by(&:to_i) * ":"'
-983:5:23:756

$ echo 'foobar' | ruby -lne 'puts $_.chars.sort.reverse * ""'
roofba

$ s='try a bad to good i teal by nice how'
$ # longer words first, ascending alphabetic order as tie-breaker
$ echo "$s" | ruby -ane 'puts $F.sort { |a, b|
                [b.size, a] <=> [a.size, b] } * ":"'
good:nice:teal:bad:how:try:by:to:a:i

$ s='3,b,a,3,c,d,1,d,c,2,2,2,3,1,b'
$ # note that the input order of elements is preserved
$ echo "$s" | ruby -F, -lane 'puts $F.uniq * ","'
3,b,a,c,d,1,2
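
The custom sort block used for "longer words first, alphabetic order as tie-breaker" can also be expressed with sort_by by negating the size. This is one possible alternative, not the only way; longest_first is a hypothetical helper name.

```ruby
# longest_first is a made-up helper for illustration.
# Arrays are compared element by element, so [-size, word] sorts by
# descending size first and ascending alphabetic order second.
def longest_first(words)
  words.sort_by { |w| [-w.size, w] }
end

words = %w[try a bad to good i teal by nice how]
puts longest_first(words) * ":"
# same result as the sort block version
p longest_first(words) == words.sort { |a, b| [b.size, a] <=> [a.size, b] }
```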

Here's an example for sorting in descending order based on header column names.

$ cat marks.txt
Dept    Name    Marks
ECE     Raj     53
ECE     Joel    72
EEE     Moi     68
CSE     Surya   81
EEE     Tia     59
ECE     Om      92
CSE     Amy     67

$ ruby -ane 'idx = $F.each_index.sort {|i,j| $F[j] <=> $F[i]} if $.==1;
             puts $F.values_at(*idx) * "\t"' marks.txt
Name    Marks   Dept
Raj     53      ECE
Joel    72      ECE
Moi     68      EEE
Surya   81      CSE
Tia     59      EEE
Om      92      ECE
Amy     67      CSE
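
In the one-liner above, idx is computed only for the first line (when $. is 1) and then reused for every line, since the variable persists across iterations. A script version makes the two steps explicit. This is a sketch using a few rows in the style of marks.txt; column_order is a hypothetical helper.

```ruby
# column_order is a made-up helper for illustration.
# It returns the field indexes arranged so that the header
# names come out in descending order.
def column_order(header)
  header.each_index.sort { |i, j| header[j] <=> header[i] }
end

rows = [
  %w[Dept Name Marks],
  %w[ECE Raj 53],
  %w[CSE Surya 81],
]

idx = column_order(rows.first)   # computed once, from the header
p idx
rows.each { |row| puts row.values_at(*idx) * "\t" }
```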

The shuffle method randomizes the order of elements.

$ s='floor bat to dubious four'
$ echo "$s" | ruby -ane 'puts $F.shuffle * ":"'
bat:floor:dubious:to:four

$ echo 'foobar' | ruby -lne 'print $_.chars.shuffle * ""'
bofrao

Use the sample method to get one or more randomly chosen elements of an array.

$ s='hour hand band mat heated pineapple'

$ echo "$s" | ruby -ane 'puts $F.sample'
band
$ echo "$s" | ruby -ane 'puts $F.sample(2)'
pineapple
hand
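
When repeatable results are needed (for example, in tests), both shuffle and sample accept a random: keyword argument. The sketch below uses seeded generators; the helper names and the seed values are arbitrary choices for this illustration.

```ruby
# Both helpers are made-up names for illustration.
# A Random object created with the same seed produces the same sequence,
# so the results are repeatable across calls.
def seeded_shuffle(items, seed)
  items.shuffle(random: Random.new(seed))
end

def seeded_sample(items, count, seed)
  items.sample(count, random: Random.new(seed))
end

words = %w[floor bat to dubious four]
p seeded_shuffle(words, 42) == seeded_shuffle(words, 42)  # same seed, same order
p seeded_shuffle(words, 42).sort == words.sort            # same elements
p seeded_sample(words, 2, 7).size
```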

Summary

This chapter discussed various ways in which you can split (or define) the input into fields and manipulate them. There are many more examples to be discussed related to fields in upcoming chapters.

Exercises

a) Extract only the contents between () or )( from each input line. Assume that the () characters will be present only once per line.

$ cat brackets.txt
foo blah blah(ice) 123 xyz$ 
(almond-pista) choco
yo )yoyo( yo

##### add your solution here
ice
almond-pista
yoyo

b) For the input file scores.csv, extract Name and Physics fields in the format shown below.

$ cat scores.csv
Name,Maths,Physics,Chemistry
Blue,67,46,99
Lin,78,83,80
Er,56,79,92
Cy,97,98,95
Ort,68,72,66
Ith,100,100,100

##### add your solution here
Name:Physics
Blue:46
Lin:83
Er:79
Cy:98
Ort:72
Ith:100

c) For the input file scores.csv, display names of those who've scored above 70 in Maths.

##### add your solution here
Lin
Cy
Ith

d) Display the number of word characters for the given inputs. The definition of a word character here is the same as the one used in regular expressions. Can you construct one solution with gsub and another without substitution functions?

$ # solve using gsub
$ echo 'hi there' | ##### add your solution here
7

$ # solve without using substitution functions
$ echo 'u-no;co%."(do_12:as' | ##### add your solution here
12

e) Construct a solution that works for both the given sample inputs and the corresponding output shown.

$ s1='1 "grape" and "mango" and "guava"'
$ s2='("a 1""d""c-2""b")'

$ echo "$s1" | ##### add your solution here
"grape","guava","mango"
$ echo "$s2" | ##### add your solution here
"a 1","b","c-2","d"

f) Display only the third and fifth characters from each input line.

$ printf 'restore\ncat one\ncricket' | ##### add your solution here
so
to
ik

g) Transform the given input file fw.txt to get the output as shown below. If second field is empty (i.e. contains only space characters), replace it with NA.

$ cat fw.txt
1.3  rs   90  0.134563
3.8           6
5.2  ye       8.2387
4.2  kt   32  45.1

##### add your solution here
1.3,rs,0.134563
3.8,NA,6
5.2,ye,8.2387
4.2,kt,45.1

h) For the input file scores.csv, display the header as well as any row which contains b or t (irrespective of case) in the first field.

##### add your solution here
Name,Maths,Physics,Chemistry
Blue,67,46,99
Ort,68,72,66
Ith,100,100,100

i) Extract all whole words that contain 42, but not at the edge of a word. Assume a word cannot contain 42 more than once.

$ s='hi42bye nice1423 bad42 cool_42a 42fake'
$ echo "$s" | ##### add your solution here
hi42bye
nice1423
cool_42a

j) For the input file scores.csv, add another column named GP which is calculated out of 100 by giving 50% weightage to Maths and 25% each for Physics and Chemistry.

##### add your solution here
Name,Maths,Physics,Chemistry,GP
Blue,67,46,99,69.75
Lin,78,83,80,79.75
Er,56,79,92,70.75
Cy,97,98,95,96.75
Ort,68,72,66,68.5
Ith,100,100,100,100.0

k) For the input file mixed_fs.txt, retain only first two fields from each input line. The input and output field separators should be space for first two lines and , for the rest of the lines.

$ cat mixed_fs.txt
rose lily jasmine tulip
pink blue white yellow
car,mat,ball,basket
light green,brown,black,purple

##### add your solution here
rose lily
pink blue
car,mat
light green,brown

l) For the given space separated numbers, filter only numbers in the range 20 to 1000 (inclusive).

$ echo '20 -983 5 756 634223' | ##### add your solution here
20 756

m) For the given space separated words, randomize the order of characters for each word.

$ s='this is a sample sentence'

$ # sample randomized output shown here, could be different for you
$ echo "$s" | ##### add your solution here
shti si a salemp sneentce

n) For the given input file words.txt, filter all lines containing characters in ascending and descending order.

$ cat words.txt
bot
art
are
boat
toe
flee
reed

$ # ascending order
##### add your solution here
bot
art

$ # descending order
##### add your solution here
toe
reed

o) For the given space separated words, extract the three longest words.

$ s='I bought two bananas and three mangoes'

$ echo "$s" | ##### add your solution here
mangoes
bananas
bought

p) Convert the contents of split.txt as shown below.

$ cat split.txt
apple,1:2:5,mango
wry,4,look
pencil,3:8,paper

##### add your solution here
apple,1,mango
apple,2,mango
apple,5,mango
wry,4,look
pencil,3,paper
pencil,8,paper