Field separators

This chapter will dive deep into field processing. You'll learn how to set input and output field separators, how to use regexps for defining fields and how to work with fixed length fields.

info The example_files directory has all the files used in the examples.

Default field separation

The -a option splits the input based on one or more sequence of whitespace characters. In addition, whitespaces at the start or end of input gets trimmed and won't be part of the field contents. Using -a is equivalent to $F = $_.split. From ruby-doc: split:

If $; is nil (its default value), the split occurs just as if field_sep were given as a space character. When field_sep is ' ' and limit is nil, the split occurs at each sequence of whitespace.

$ echo '   a   b   c   ' | ruby -ane 'puts $F.size'
3
# note that the leading whitespaces aren't part of the field content
$ echo '   a   b   c   ' | ruby -ane 'puts "(#{$F[0]})"'
(a)
# trailing whitespaces are removed as well
$ echo '   a   b   c   ' | ruby -ane 'puts "(#{$F[-1]})"'
(c)

# here's another example with more whitespace characters thrown in
$ printf '     one \t\f\v two\t\r\tthree \t\r ' | ruby -ane 'puts $F.size'
3
$ printf '     one \t\f\v two\t\r\tthree \t\r ' | ruby -ane 'puts $F[1] + "."'
two.

Input field separator

You can use the -F command line option to specify a custom field separator. The value passed to this option will be treated as a regexp. Note that the -a option is also necessary for -F to work.

# use ':' as the input field separator
$ echo 'goal:amazing:whistle:kwality' | ruby -F: -ane 'puts $F[0], $F[-1], $F[1]'
goal
kwality
amazing

# use quotes to avoid clashes with shell special characters
$ echo 'one;two;three;four' | ruby -F';' -ane 'puts $F[2]'
three

$ echo 'load;err_msg--\ant,r2..not' | ruby -F'\W+' -ane 'puts $F[2]'
ant

$ echo 'hi.bye.hello' | ruby -F'\.' -ane 'puts $F[1]'
bye

# count the number of vowels for each input line
$ printf 'COOL\nnice car\n' | ruby -F'(?i)[aeiou]' -ane 'puts $F.size - 1'
2
3

Character-wise separation

No need to use field separation to access individual characters. See ruby-doc: Encoding for details on handling different string encodings.

$ echo 'apple' | ruby -ne 'puts $_[0]'
a

$ ruby -e 'puts Encoding.default_external'
UTF-8
$ LC_ALL=C ruby -e 'puts Encoding.default_external'
US-ASCII

$ echo 'fox:αλεπού' | ruby -ne 'puts $_[4..5]'
αλ
# use the -E option to explicitly specify external/internal encodings
$ echo 'fox:αλεπού' | ruby -E UTF-8:UTF-8 -ne 'puts $_[4..5]'
αλ

Newline character in the last field

If the custom field separator doesn't affect the newline character, then the last element can contain the newline character.

# last element will not have the newline character with the -a option
# as leading/trailing whitespaces are trimmed with default split
$ echo 'cat dog' | ruby -ane 'puts "[#{$F[-1]}]"'
[dog]

# last element will have the newline character since the field separator is ':'
$ echo 'cat:dog' | ruby -F: -ane 'puts "[#{$F[-1]}]"'
[dog
]
# unless the input itself doesn't have newline characters
$ printf 'cat:dog' | ruby -F: -ane 'puts "[#{$F[-1]}]"'
[dog]

The newline character can also show up as the entire content of the last field.

# both the leading and trailing whitespaces are trimmed
$ echo '  a b   c   ' | ruby -ane 'puts $F.size'
3

# leading empty element won't be removed here
# and the last element will have only the newline character as the value
$ echo ':a:b:c:' | ruby -F: -ane 'puts $F.size; puts "[#{$F[-1]}]"'
5
[
]

Using the -l option for field splitting

As mentioned before, the -l option is helpful if you wish to remove the newline character (more details will be discussed in the Record separators chapter). A side effect of removing the newline character before applying split is that the trailing empty fields will also get removed (you can explicitly call the split method with -1 as limit to prevent this).

# -l will remove the newline character
$ echo 'cat:dog' | ruby -F: -lane 'puts "[#{$F[-1]}]"'
[dog]
# -l will also cause 'print' to append the newline character
$ echo 'cat:dog' | ruby -F: -lane 'print "[#{$F[-1]}]"'
[dog]

# since the newline character is chomped, last element is empty
# which is then removed due to the default 'split' behavior
$ echo ':a:b:c:' | ruby -F: -lane 'puts $F.size'
4
# explicit call to split with -1 as the limit will preserve the empty element
$ echo ':a:b:c:' | ruby -lane 'puts $_.split(/:/, -1).size'
5

Output field separator

There are a few ways to affect the separator to be used while displaying multiple values. The value of the $, global variable is used as the separator when multiple arguments are passed to the print method. This is usually used in combination with the -l option so that a newline character is appended automatically as well. The join method also uses $, as the default value.

$ ruby -lane 'BEGIN{$, = " "}; print $F[0], $F[2]' table.txt
brown mat
blue mug
yellow window

The other options include manually building the output string within double quotes. Or, use the join method. Note that the -l option is used in the examples below as a good practice even when not needed.

$ ruby -lane 'puts "#{$F[0]} #{$F[2]}"' table.txt
brown mat
blue mug
yellow window

$ echo 'Sample123string42with777numbers' | ruby -F'\d+' -lane 'puts $F.join(",")'
Sample,string,with,numbers

$ s='goal:amazing:whistle:kwality'
$ echo "$s" | ruby -F: -lane 'puts $F.values_at(-1, 1, 0).join("-")'
kwality-amazing-goal
# you can also use the '*' operator
$ echo "$s" | ruby -F: -lane '$F.append(42); puts $F * "::"'
goal::amazing::whistle::kwality::42

scan method

The -F option uses the split method to generate the fields. In contrast, the scan method allows you to define what should the fields be made up of. The scan method does not have the concept of removing empty trailing fields nor does it have the limit argument.

$ s='Sample123string42with777numbers'
# define fields to be one or more consecutive digits
$ echo "$s" | ruby -ne 'puts $_.scan(/\d+/)[1]'
42

$ s='coat Bin food tar12 best Apple fig_42'
# whole words made up of lowercase alphabets and digits only
$ echo "$s" | ruby -ne 'puts $_.scan(/\b[a-z0-9]+\b/) * ","'
coat,food,tar12,best

$ s='items: "apple" and "mango"'
# get the second double quoted item
$ echo "$s" | ruby -ne 'puts $_.scan(/"[^"]+"/)[1]'
"mango"
# no need to use 'scan' to extract the first matching portion
$ echo "$s" | ruby -ne 'puts $_[/"[^"]+"/]'
"apple"

A simple split fails for CSV input where fields can contain embedded delimiter characters. For example, a field content "fox,42" when , is the delimiter.

$ s='eagle,"fox,42",bee,frog'

# simply using , as the separator isn't sufficient
$ echo "$s" | ruby -F, -lane 'puts $F[1]'
"fox

While the ruby-doc: CSV library should be preferred for robust CSV parsing, regexp is enough for simple formats.

$ echo "$s" | ruby -lne 'puts $_.scan(/"[^"]*"|[^,]+/)[1]'
"fox,42"

Fixed width processing

The unpack method is more than just a different way of using string slicing. It supports various formats and pre-processing, see ruby-doc: Packed Data for details.

In the example below, a indicates arbitrary binary string. The optional number that follows indicates length of the field.

$ cat items.txt
apple   fig banana
50      10  200

# here field widths have been assigned such that
# extra spaces are placed at the end of each field
$ ruby -ne 'puts $_.unpack("a8a4a6") * ","' items.txt
apple   ,fig ,banana
50      ,10  ,200

$ ruby -ne 'puts $_.unpack("a8a4a6")[1]' items.txt
fig 
10  

You can specify characters to be ignored with x followed by an optional length.

# first field is 5 characters
# then 3 characters are ignored and 3 characters for the second field
# then 1 character is ignored and 6 characters for the third field
$ ruby -ne 'puts $_.unpack("a5x3a3xa6") * ","' items.txt
apple,fig,banana
50   ,10 ,200

Using * will cause remaining characters of that particular format to be consumed. Here Z is used to process strings that are separated by the ASCII NUL character.

$ printf 'banana\x0050\x00' | ruby -ne 'puts $_.unpack("Z*Z*") * ":"'
banana:50

# first field is 5 characters, then 3 characters are ignored
# all the remaining characters are assigned to the second field
$ ruby -ne 'puts $_.unpack("a5x3a*") * ","' items.txt
apple,fig banana
50   ,10  200

Unpacking isn't always needed, simple string slicing might suffice.

$ echo 'b 123 good' | ruby -ne 'puts $_[2,3]'
123
$ echo 'b 123 good' | ruby -ne 'puts $_[6,4]'
good

# replacing arbitrary slice
$ echo 'b 123 good' | ruby -lpe '$_[2,3] = "gleam"'
b gleam good

Assorted field processing methods

Having seen command line options and features commonly used for field processing, this section will highlight some of the built-in array and Enumerable methods. There are just too many to meaningfully cover them all in detail, so consider this to be just a brief overview of features.

First up, regexp based field selection. grep(cond) and grep_v(cond) are specialized filter methods that perform cond === object test check. See stackoverflow: What does the === operator do in Ruby? for more details.

$ s='goal:amazing:42:whistle:kwality:3.14'

# fields containing 'in' or 'it' or 'is'
$ echo "$s" | ruby -F: -lane 'puts $F.grep(/i[nts]/) * ":"'
amazing:whistle:kwality

# fields NOT containing a digit character
$ echo "$s" | ruby -F: -lane 'puts $F.grep_v(/\d/) * ":"'
goal:amazing:whistle:kwality

# no more than one field can contain 'r'
$ ruby -lane 'print if $F.grep(/r/).size <= 1' table.txt
blue cake mug shirt -7
yellow banana window shoes 3.14

The map method transforms each element according to the logic passed to it.

$ s='goal:amazing:42:whistle:kwality:3.14'
$ echo "$s" | ruby -F: -lane 'puts $F.map(&:upcase) * ":"'
GOAL:AMAZING:42:WHISTLE:KWALITY:3.14

$ echo '23 756 -983 5' | ruby -ane 'puts $F.map {_1.to_i ** 2} * " "'
529 571536 966289 25

$ echo 'AaBbCc' | ruby -lne 'puts $_.chars.map(&:ord) * " "'
65 97 66 98 67 99

$ echo '3.14,17,6' | ruby -F, -ane 'puts $F.map(&:to_f).sum'
26.14

The filter method (which has other aliases and opposites too) is handy to construct all kinds of selection conditions. You can combine with map by using the filter_map method.

$ s='hour hand band mat heated pineapple'

$ echo "$s" | ruby -ane 'puts $F.filter {_1[0]!="h" && _1.size<6}'
band
mat

$ echo "$s" | ruby -ane 'puts $F.filter_map {|w|
                         w.gsub(/[ae]/, "X") if w[0]=="h"}'
hour
hXnd
hXXtXd

The reduce method can be used to perform an action against all the elements of an array and get a singular value as the result.

# sum of input numbers, with initial value of 100
$ echo '3.14,17,6' | ruby -F, -lane 'puts $F.map(&:to_f).reduce(100, :+)'
126.14

# product of input numbers
$ echo '3.14,17,6' | ruby -F, -lane 'puts $F.map(&:to_f).reduce(:*)'
320.28000000000003
# with initial value of 2
$ echo '3.14,17,6' | ruby -F, -lane 'puts $F.reduce(2) {|op,n| op*n.to_f}'
640.5600000000001

Here are some examples with the sort, sort_by and uniq methods for arrays and strings.

$ s='floor bat to dubious four'
$ echo "$s" | ruby -ane 'puts $F.sort * ":"'
bat:dubious:floor:four:to
$ echo "$s" | ruby -ane 'puts $F.sort_by(&:size) * ":"'
to:bat:four:floor:dubious

# numeric sort example
$ echo '23 756 -983 5' | ruby -lane 'puts $F.sort_by(&:to_i) * ":"'
-983:5:23:756

$ echo 'dragon' | ruby -lne 'puts $_.chars.sort.reverse * ""'
rongda

$ s='try a bad to good i teal by nice how'
# longer words first, ascending alphabetic order as tie-breaker
$ echo "$s" | ruby -ane 'puts $F.sort_by {|w| [-w.size, w]} * ":"'
good:nice:teal:bad:how:try:by:to:a:i

$ s='3,b,a,3,c,d,1,d,c,2,2,2,3,1,b'
# note that the input order of elements is preserved
$ echo "$s" | ruby -F, -lane 'puts $F.uniq * ","'
3,b,a,c,d,1,2

Here's an example for sorting in descending order based on header column names.

$ cat marks.txt
Dept    Name    Marks
ECE     Raj     53
ECE     Joel    72
EEE     Moi     68
CSE     Surya   81
EEE     Tia     59
ECE     Om      92
CSE     Amy     67

$ ruby -ane 'idx = $F.each_index.sort {$F[_2] <=> $F[_1]} if $.==1;
             puts $F.values_at(*idx) * "\t"' marks.txt
Name    Marks   Dept
Raj     53      ECE
Joel    72      ECE
Moi     68      EEE
Surya   81      CSE
Tia     59      EEE
Om      92      ECE
Amy     67      CSE

The shuffle method randomizes the order of elements.

$ s='floor bat to dubious four'
$ echo "$s" | ruby -ane 'puts $F.shuffle * ":"'
bat:floor:dubious:to:four

$ echo 'foobar' | ruby -lne 'print $_.chars.shuffle * ""'
bofrao

Use the sample method to get one or more elements of an array in random order.

$ s='hour hand band mat heated pineapple'

$ echo "$s" | ruby -ane 'puts $F.sample'
band
$ echo "$s" | ruby -ane 'puts $F.sample(2)'
pineapple
hand

Summary

This chapter discussed various ways in which you can split (or define) the input into fields and manipulate them. Many more examples will be discussed in later chapters.

Exercises

info The exercises directory has all the files used in this section.

1) For the input file brackets.txt, extract only the contents between () or )( from each input line. Assume that () characters will be present only once every line.

$ cat brackets.txt
foo blah blah(ice) 123 xyz$ 
(almond-pista) choco
yo )yoyo( yo

##### add your solution here
ice
almond-pista
yoyo

2) For the input file scores.csv, extract Name and Physics fields in the format shown below.

$ cat scores.csv
Name,Maths,Physics,Chemistry
Blue,67,46,99
Lin,78,83,80
Er,56,79,92
Cy,97,98,95
Ort,68,72,66
Ith,100,100,100

##### add your solution here
Name:Physics
Blue:46
Lin:83
Er:79
Cy:98
Ort:72
Ith:100

3) For the input file scores.csv, display names of those who've scored above 70 in Maths.

##### add your solution here
Lin
Cy
Ith

4) Display the number of word characters for the given inputs. Word definition here is same as used in regular expressions. Can you construct a solution with gsub and one without the substitution functions?

# solve using gsub
$ echo 'hi there' | ##### add your solution here
7

# solve without using the substitution functions
$ echo 'u-no;co%."(do_12:as' | ##### add your solution here
12

5) For the input file quoted.txt, extract the sequence of characters surrounded by double quotes and display them in the format shown below.

$ cat quoted.txt
1 "grape" and "mango" and "guava"
("c 1""d""a-2""b")

##### add your solution here
"grape","guava","mango"
"a-2","b","c 1","d"

6) Display only the third and fifth characters from each input line.

$ printf 'restore\ncat one\ncricket' | ##### add your solution here
so
to
ik

7) Transform the given input file fw.txt to get the output as shown below. If the second field is empty (i.e. contains only space characters), replace it with NA.

$ cat fw.txt
1.3  rs   90  0.134563
3.8           6
5.2  ye       8.2387
4.2  kt   32  45.1

##### add your solution here
1.3,rs,0.134563
3.8,NA,6
5.2,ye,8.2387
4.2,kt,45.1

8) For the input file scores.csv, display the header as well as any row which contains b or t (irrespective of case) in the first field.

##### add your solution here
Name,Maths,Physics,Chemistry
Blue,67,46,99
Ort,68,72,66
Ith,100,100,100

9) Extract all whole words containing 42 but not at the edge of a word. Assume a word cannot contain 42 more than once.

$ s='hi42bye nice1423 bad42 cool_42a 42fake'
$ echo "$s" | ##### add your solution here
hi42bye
nice1423
cool_42a

10) For the input file scores.csv, add another column named GP which is calculated out of 100 by giving 50% weightage to Maths and 25% each for Physics and Chemistry.

##### add your solution here
Name,Maths,Physics,Chemistry,GP
Blue,67,46,99,69.75
Lin,78,83,80,79.75
Er,56,79,92,70.75
Cy,97,98,95,96.75
Ort,68,72,66,68.5
Ith,100,100,100,100.0

11) For the input file mixed_fs.txt, retain only the first two fields from each input line. The input and output field separators should be space for first two lines and , for the rest of the lines.

$ cat mixed_fs.txt
rose lily jasmine tulip
pink blue white yellow
car,mat,ball,basket
light green,brown,black,purple
apple,banana,cherry

##### add your solution here
rose lily
pink blue
car,mat
light green,brown
apple,banana

12) For the given space separated numbers, filter only numbers in the range 20 to 1000 (inclusive).

$ s='20 -983 5 756 634223 1000'

$ echo "$s" | ##### add your solution here
20 756 1000

13) For the given space separated words, randomize the order of characters for each word.

$ s='this is a sample sentence'

# sample randomized output shown here, could be different for you
$ echo "$s" | ##### add your solution here
shti si a salemp sneentce

14) For the given input file words.txt, filter all lines containing characters in ascending and descending order.

$ cat words.txt
bot
art
are
boat
toe
flee
reed

# ascending order
##### add your solution here
bot
art

# descending order
##### add your solution here
toe
reed

15) For the given space separated words, extract the three longest words.

$ s='I bought two bananas and three mangoes'

$ echo "$s" | ##### add your solution here
mangoes
bananas
bought

16) Convert the contents of split.txt as shown below.

$ cat split.txt
apple,1:2:5,mango
wry,4,look
pencil,3:8,paper

##### add your solution here
apple,1,mango
apple,2,mango
apple,5,mango
wry,4,look
pencil,3,paper
pencil,8,paper

17) For the input file varying_fields.txt, construct a solution to get the output shown below.

$ cat varying_fields.txt
hi,bye,there,was,here,to
1,2,3,4,5

##### add your solution here
hi:bye:to
1:2:5

18) The fields.txt file has fields separated by the : character. Delete : and the last field if there is a digit character anywhere before the last field. Solution shouldn't use the substitution functions.

$ cat fields.txt
42:cat
twelve:a2b
we:be:he:0:a:b:bother
apple:banana-42:cherry:
dragon:unicorn:centaur

##### add your solution here
42
twelve:a2b
we:be:he:0:a:b
apple:banana-42:cherry
dragon:unicorn:centaur

19) The sample string shown below uses cat as the field separator (irrespective of case). Use space as the output field separator and add 42 as the last field.

$ s='applecatfigCaT12345cAtbanana'

$ echo "$s" | ##### add your solution here
apple fig 12345 banana 42

20) For the input file sample.txt, filter lines containing 5 or more lowercase vowels.

##### add your solution here
How are you
Believe it
No doubt you like it too
Much ado about nothing