Field separators
This chapter will dive deep into field processing. You'll learn how to set input and output field separators, how to use regexps for defining fields and how to work with fixed length fields.
The example_files directory has all the files used in the examples.
Default field separation
The -a option splits the input based on one or more whitespace characters. In addition, whitespace at the start or end of the input is trimmed and won't be part of the field contents. Using -a is equivalent to $F = $_.split. From ruby-doc: split:
If $; is nil (its default value), the split occurs just as if field_sep were given as a space character. When field_sep is ' ' and limit is nil, the split occurs at each sequence of whitespace.
$ echo ' a b c ' | ruby -ane 'puts $F.size'
3
# note that the leading whitespaces aren't part of the field content
$ echo ' a b c ' | ruby -ane 'puts "(#{$F[0]})"'
(a)
# trailing whitespaces are removed as well
$ echo ' a b c ' | ruby -ane 'puts "(#{$F[-1]})"'
(c)
# here's another example with more whitespace characters thrown in
$ printf ' one \t\f\v two\t\r\tthree \t\r ' | ruby -ane 'puts $F.size'
3
$ printf ' one \t\f\v two\t\r\tthree \t\r ' | ruby -ane 'puts $F[1] + "."'
two.
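To see why the leading and trailing whitespace disappears, it can help to run the equivalent split call outside of a one-liner. The following is a minimal sketch in plain Ruby, assuming $; is left at its default value of nil. The variable names are only for illustration and aren't part of the original examples.

```ruby
# Roughly what -a does for each input line: $F = $_.split
line = " a  b   c \n"

# With no arguments, split works on runs of whitespace and ignores
# leading/trailing whitespace, including the newline character.
fields = line.split

# An explicit regexp behaves differently: the leading empty field is kept.
regexp_fields = line.split(/\s+/)

p fields          # ["a", "b", "c"]
p regexp_fields   # ["", "a", "b", "c"]
```

The second result is one reason to prefer the default split over passing \s+ as the separator when the input may start with whitespace.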
Input field separator
You can use the -F command line option to specify a custom field separator. The value passed to this option will be treated as a regexp. Note that the -a option is also necessary for -F to work.
# use ':' as the input field separator
$ echo 'goal:amazing:whistle:kwality' | ruby -F: -ane 'puts $F[0], $F[-1], $F[1]'
goal
kwality
amazing
# use quotes to avoid clashes with shell special characters
$ echo 'one;two;three;four' | ruby -F';' -ane 'puts $F[2]'
three
$ echo 'load;err_msg--\ant,r2..not' | ruby -F'\W+' -ane 'puts $F[2]'
ant
$ echo 'hi.bye.hello' | ruby -F'\.' -ane 'puts $F[1]'
bye
# count the number of vowels for each input line
$ printf 'COOL\nnice car\n' | ruby -F'(?i)[aeiou]' -ane 'puts $F.size - 1'
2
3
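The vowel counting one-liner above relies on the trailing newline character to keep the last field from being empty. Here's a hedged sketch in plain Ruby comparing it with a scan based count, which doesn't depend on the newline. The method names vowels_by_split and vowels_by_scan are made up for this illustration.

```ruby
VOWEL = /[aeiou]/i   # same idea as the (?i)[aeiou] pattern passed to -F

# Count vowels as "number of fields minus one"
def vowels_by_split(line)
  line.split(VOWEL).size - 1
end

# Count vowels by collecting every match
def vowels_by_scan(line)
  line.scan(VOWEL).size
end

p vowels_by_split("nice car\n")   # 3
p vowels_by_split("nice")         # 1, the trailing empty field was dropped
p vowels_by_scan("nice")          # 2
```

If the last line of the input may lack a newline (or if -l is used), the scan version is the safer choice.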
Character-wise separation
No need to use field separation to access individual characters. See ruby-doc: Encoding for details on handling different string encodings.
$ echo 'apple' | ruby -ne 'puts $_[0]'
a
$ ruby -e 'puts Encoding.default_external'
UTF-8
$ LC_ALL=C ruby -e 'puts Encoding.default_external'
US-ASCII
$ echo 'fox:αλεπού' | ruby -ne 'puts $_[4..5]'
αλ
# use the -E option to explicitly specify external/internal encodings
$ echo 'fox:αλεπού' | ruby -E UTF-8:UTF-8 -ne 'puts $_[4..5]'
αλ
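As a rough illustration of why the encoding matters, the sketch below compares character indexing with byte indexing on the same string. It is plain Ruby and assumes the script is saved as UTF-8 (Ruby's default source encoding). The variable names are illustrative only.

```ruby
s = "fox:αλεπού"

# Indexing a UTF-8 string counts characters, not bytes
chars = s[4..5]

# String#b returns a copy tagged as binary (ASCII-8BIT),
# so the same range now selects two raw bytes
bytes = s.b[4..5]

# Those two bytes are the UTF-8 encoding of the single character α
first = bytes.dup.force_encoding("UTF-8")

puts chars   # αλ
puts first   # α
```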
Newline character in the last field
If the custom field separator doesn't affect the newline character, then the last element can contain the newline character.
# last element will not have the newline character with the -a option
# as leading/trailing whitespaces are trimmed with default split
$ echo 'cat dog' | ruby -ane 'puts "[#{$F[-1]}]"'
[dog]
# last element will have the newline character since the field separator is ':'
$ echo 'cat:dog' | ruby -F: -ane 'puts "[#{$F[-1]}]"'
[dog
]
# unless the input itself doesn't have newline characters
$ printf 'cat:dog' | ruby -F: -ane 'puts "[#{$F[-1]}]"'
[dog]
The newline character can also show up as the entire content of the last field.
# both the leading and trailing whitespaces are trimmed
$ echo ' a b c ' | ruby -ane 'puts $F.size'
3
# leading empty element won't be removed here
# and the last element will have only the newline character as the value
$ echo ':a:b:c:' | ruby -F: -ane 'puts $F.size; puts "[#{$F[-1]}]"'
5
[
]
Using the -l option for field splitting
As mentioned before, the -l option is helpful if you wish to remove the newline character (more details will be discussed in the Record separators chapter). A side effect of removing the newline character before applying split is that the trailing empty fields will also get removed (you can explicitly call the split method with -1 as the limit to prevent this).
# -l will remove the newline character
$ echo 'cat:dog' | ruby -F: -lane 'puts "[#{$F[-1]}]"'
[dog]
# -l will also cause 'print' to append the newline character
$ echo 'cat:dog' | ruby -F: -lane 'print "[#{$F[-1]}]"'
[dog]
# since the newline character is chomped, last element is empty
# which is then removed due to the default 'split' behavior
$ echo ':a:b:c:' | ruby -F: -lane 'puts $F.size'
4
# explicit call to split with -1 as the limit will preserve the empty element
$ echo ':a:b:c:' | ruby -lane 'puts $_.split(/:/, -1).size'
5
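The three cases above can be reproduced with direct calls to split. This is a minimal sketch in plain Ruby, with chomp standing in for the newline removal done by -l. The variable names are only for illustration.

```ruby
line = ":a:b:c:\n"

# Newline still present: it becomes the content of the last field
with_newline = line.split(":")

# Newline removed first: the trailing empty field is dropped by split
chomped = line.chomp.split(":")

# A negative limit tells split to keep trailing empty fields
keep_all = line.chomp.split(":", -1)

p with_newline   # ["", "a", "b", "c", "\n"]
p chomped        # ["", "a", "b", "c"]
p keep_all       # ["", "a", "b", "c", ""]
```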
Output field separator
There are a few ways to affect the separator to be used while displaying multiple values. The value of the $, global variable is used as the separator when multiple arguments are passed to the print method. This is usually used in combination with the -l option so that a newline character is appended automatically as well. The join method also uses $, as the default value.
$ ruby -lane 'BEGIN{$, = " "}; print $F[0], $F[2]' table.txt
brown mat
blue mug
yellow window
The other options include manually building the output string within double quotes. Or, use the join method. Note that the -l option is used in the examples below as a good practice even when not needed.
$ ruby -lane 'puts "#{$F[0]} #{$F[2]}"' table.txt
brown mat
blue mug
yellow window
$ echo 'Sample123string42with777numbers' | ruby -F'\d+' -lane 'puts $F.join(",")'
Sample,string,with,numbers
$ s='goal:amazing:whistle:kwality'
$ echo "$s" | ruby -F: -lane 'puts $F.values_at(-1, 1, 0).join("-")'
kwality-amazing-goal
# you can also use the '*' operator
$ echo "$s" | ruby -F: -lane '$F.append(42); puts $F * "::"'
goal::amazing::whistle::kwality::42
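For reference, here's a small sketch in plain Ruby showing that Array#* with a string argument gives the same result as join, and what join returns when no separator is passed. It assumes $, is left at its default value of nil. The variable names are illustrative.

```ruby
fields = "goal:amazing:whistle:kwality".split(":")
picked = fields.values_at(-1, 1, 0)

with_join = picked.join("-")
with_star = picked * "-"

# With no argument and $, left as nil, join simply concatenates
no_sep = picked.join

puts with_join   # kwality-amazing-goal
puts with_star   # kwality-amazing-goal
puts no_sep      # kwalityamazinggoal
```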
scan method
The -F option uses the split method to generate the fields. In contrast, the scan method allows you to define what the fields should be made up of. The scan method does not have the concept of removing empty trailing fields, nor does it have the limit argument.
$ s='Sample123string42with777numbers'
# define fields to be one or more consecutive digits
$ echo "$s" | ruby -ne 'puts $_.scan(/\d+/)[1]'
42
$ s='coat Bin food tar12 best Apple fig_42'
# whole words made up of lowercase letters and digits only
$ echo "$s" | ruby -ne 'puts $_.scan(/\b[a-z0-9]+\b/) * ","'
coat,food,tar12,best
$ s='items: "apple" and "mango"'
# get the second double quoted item
$ echo "$s" | ruby -ne 'puts $_.scan(/"[^"]+"/)[1]'
"mango"
# no need to use 'scan' to extract the first matching portion
$ echo "$s" | ruby -ne 'puts $_[/"[^"]+"/]'
"apple"
A simple split fails for CSV input where fields can contain embedded delimiter characters. For example, a field with the content "fox,42" when ',' is the delimiter.
$ s='eagle,"fox,42",bee,frog'
# simply using , as the separator isn't sufficient
$ echo "$s" | ruby -F, -lane 'puts $F[1]'
"fox
While the ruby-doc: CSV library should be preferred for robust CSV parsing, regexp is enough for simple formats.
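As a point of comparison, here's a small sketch that uses CSV.parse_line from the csv library on a sample line and contrasts it with a plain split. It assumes the csv library is available (it ships with Ruby, though newer versions package it as a bundled gem). The variable names are illustrative only.

```ruby
require 'csv'

line = 'eagle,"fox,42",bee,frog'

# Plain split breaks the quoted field into two pieces
by_split = line.split(",")

# The CSV library interprets the quotes and removes them
by_csv = CSV.parse_line(line)

p by_split   # ["eagle", "\"fox", "42\"", "bee", "frog"]
p by_csv     # ["eagle", "fox,42", "bee", "frog"]

# Empty fields are reported as nil by the CSV library
p CSV.parse_line('a,,b')   # ["a", nil, "b"]
```

For simple inputs like this sample, a scan based regexp does the job without loading a library: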
$ echo "$s" | ruby -lne 'puts $_.scan(/"[^"]*"|[^,]+/)[1]'
"fox,42"
Fixed width processing
The unpack method is more than just a different way of using string slicing. It supports various formats and pre-processing, see ruby-doc: Packed Data for details.
In the example below, the a directive indicates an arbitrary binary string. The optional number that follows indicates the length of the field.
$ cat items.txt
apple   fig banana
50      10  200
# here field widths have been assigned such that
# extra spaces are placed at the end of each field
$ ruby -ne 'puts $_.unpack("a8a4a6") * ","' items.txt
apple   ,fig ,banana
50      ,10  ,200
$ ruby -ne 'puts $_.unpack("a8a4a6")[1]' items.txt
fig
10
You can specify characters to be ignored with x followed by an optional length.
# first field is 5 characters
# then 3 characters are ignored and 3 characters for the second field
# then 1 character is ignored and 6 characters for the third field
$ ruby -ne 'puts $_.unpack("a5x3a3xa6") * ","' items.txt
apple,fig,banana
50   ,10 ,200
Using * will cause remaining characters of that particular format to be consumed. Here Z is used to process strings that are separated by the ASCII NUL character.
$ printf 'banana\x0050\x00' | ruby -ne 'puts $_.unpack("Z*Z*") * ":"'
banana:50
# first field is 5 characters, then 3 characters are ignored
# all the remaining characters are assigned to the second field
$ ruby -ne 'puts $_.unpack("a5x3a*") * ","' items.txt
apple,fig banana
50   ,10  200
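If the padding isn't wanted in the output, the A directive is one option. This sketch goes beyond the examples above: per the Packed Data documentation, A behaves like a but also removes trailing spaces and NUL characters. The sample strings assume columns that are 8, 4 and 6 characters wide, and the newline is left out so that it doesn't end up in the last field.

```ruby
lines = ["apple   fig banana", "50      10  200"]

# 'a' keeps the padding, 'A' strips trailing spaces from each field
padded  = lines.map { _1.unpack("a8a4a6") }
trimmed = lines.map { _1.unpack("A8A4A6") }

p padded[1]    # ["50      ", "10  ", "200"]
p trimmed[1]   # ["50", "10", "200"]

# Another option: keep 'a' and strip each field afterwards
p lines[0].unpack("a8a4a6").map(&:rstrip)   # ["apple", "fig", "banana"]
```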
Unpacking isn't always needed, simple string slicing might suffice.
$ echo 'b 123 good' | ruby -ne 'puts $_[2,3]'
123
$ echo 'b 123 good' | ruby -ne 'puts $_[6,4]'
good
# replacing arbitrary slice
$ echo 'b 123 good' | ruby -lpe '$_[2,3] = "gleam"'
b gleam good
Assorted field processing methods
Having seen command line options and features commonly used for field processing, this section will highlight some of the built-in array and Enumerable methods. There are just too many to meaningfully cover them all in detail, so consider this to be just a brief overview of features.
First up, regexp based field selection. grep(cond) and grep_v(cond) are specialized filter methods that perform the cond === object test. See stackoverflow: What does the === operator do in Ruby? for more details.
$ s='goal:amazing:42:whistle:kwality:3.14'
# fields containing 'in' or 'it' or 'is'
$ echo "$s" | ruby -F: -lane 'puts $F.grep(/i[nts]/) * ":"'
amazing:whistle:kwality
# fields NOT containing a digit character
$ echo "$s" | ruby -F: -lane 'puts $F.grep_v(/\d/) * ":"'
goal:amazing:whistle:kwality
# no more than one field can contain 'r'
$ ruby -lane 'print if $F.grep(/r/).size <= 1' table.txt
blue cake mug shirt -7
yellow banana window shoes 3.14
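Since grep is built on ===, the condition doesn't have to be a regexp. The sketch below is plain Ruby with illustrative variable names. It shows the equivalent filter call, the block form of grep, and a class used as the condition.

```ruby
fields = %w[goal amazing 42 whistle kwality 3.14]
pattern = /i[nts]/

# grep(cond) selects elements for which cond === element is true
by_grep   = fields.grep(pattern)
by_filter = fields.filter { pattern === _1 }

# When a block is given, each matching element is passed through it
upcased = fields.grep(pattern, &:upcase)

# Class#=== tests whether the element is an instance of the class
mixed   = ["goal", 42, 3.14, :mat]
strings = mixed.grep(String)
numbers = mixed.grep(Numeric)

p by_grep   # ["amazing", "whistle", "kwality"]
p upcased   # ["AMAZING", "WHISTLE", "KWALITY"]
p strings   # ["goal"]
p numbers   # [42, 3.14]
```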
The map method transforms each element according to the logic passed to it.
$ s='goal:amazing:42:whistle:kwality:3.14'
$ echo "$s" | ruby -F: -lane 'puts $F.map(&:upcase) * ":"'
GOAL:AMAZING:42:WHISTLE:KWALITY:3.14
$ echo '23 756 -983 5' | ruby -ane 'puts $F.map {_1.to_i ** 2} * " "'
529 571536 966289 25
$ echo 'AaBbCc' | ruby -lne 'puts $_.chars.map(&:ord) * " "'
65 97 66 98 67 99
$ echo '3.14,17,6' | ruby -F, -ane 'puts $F.map(&:to_f).sum'
26.14
The filter method (which has other aliases and opposites too) is handy to construct all kinds of selection conditions. You can combine it with map by using the filter_map method.
$ s='hour hand band mat heated pineapple'
$ echo "$s" | ruby -ane 'puts $F.filter {_1[0]!="h" && _1.size<6}'
band
mat
$ echo "$s" | ruby -ane 'puts $F.filter_map {|w|
w.gsub(/[ae]/, "X") if w[0]=="h"}'
hour
hXnd
hXXtXd
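As a sketch of what filter_map is doing, the plain Ruby snippet below writes the same selection as a filter followed by a map. filter_map keeps only the truthy results of the block, so the if modifier returning nil is what discards the other words. The variable names are illustrative.

```ruby
words = %w[hour hand band mat heated pineapple]

# Single pass: nil results (words not starting with 'h') are dropped
one_pass = words.filter_map { |w| w.gsub(/[ae]/, "X") if w[0] == "h" }

# Two passes: select first, then transform
two_pass = words.filter { |w| w[0] == "h" }
                .map { |w| w.gsub(/[ae]/, "X") }

p one_pass   # ["hour", "hXnd", "hXXtXd"]
p two_pass   # ["hour", "hXnd", "hXXtXd"]
```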
The reduce method can be used to perform an action against all the elements of an array and get a single value as the result.
# sum of input numbers, with initial value of 100
$ echo '3.14,17,6' | ruby -F, -lane 'puts $F.map(&:to_f).reduce(100, :+)'
126.14
# product of input numbers
$ echo '3.14,17,6' | ruby -F, -lane 'puts $F.map(&:to_f).reduce(:*)'
320.28000000000003
# with initial value of 2
$ echo '3.14,17,6' | ruby -F, -lane 'puts $F.reduce(2) {|op,n| op*n.to_f}'
640.5600000000001
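The different calling styles of reduce can be compared side by side. This is a minimal sketch in plain Ruby with illustrative variable names. The explicit loop is only there to show what the accumulator does at each step.

```ruby
nums = "3.14,17,6".split(",").map(&:to_f)

# Symbol form and block form perform the same left-to-right fold
product_sym = nums.reduce(:*)
product_blk = nums.reduce { |acc, n| acc * n }

# reduce(100, :+) written out as a loop with an explicit accumulator
total = 100
nums.each { |n| total += n }

p product_sym == product_blk          # true
p total == nums.reduce(100, :+)       # true
```

When no initial value is given, the first element is used as the starting accumulator.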
Here are some examples with the sort, sort_by and uniq methods for arrays and strings.
$ s='floor bat to dubious four'
$ echo "$s" | ruby -ane 'puts $F.sort * ":"'
bat:dubious:floor:four:to
$ echo "$s" | ruby -ane 'puts $F.sort_by(&:size) * ":"'
to:bat:four:floor:dubious
# numeric sort example
$ echo '23 756 -983 5' | ruby -lane 'puts $F.sort_by(&:to_i) * ":"'
-983:5:23:756
$ echo 'dragon' | ruby -lne 'puts $_.chars.sort.reverse * ""'
rongda
$ s='try a bad to good i teal by nice how'
# longer words first, ascending alphabetic order as tie-breaker
$ echo "$s" | ruby -ane 'puts $F.sort_by {|w| [-w.size, w]} * ":"'
good:nice:teal:bad:how:try:by:to:a:i
$ s='3,b,a,3,c,d,1,d,c,2,2,2,3,1,b'
# note that the input order of elements is preserved
$ echo "$s" | ruby -F, -lane 'puts $F.uniq * ","'
3,b,a,c,d,1,2
Here's an example of sorting the columns in descending order based on the header column names.
$ cat marks.txt
Dept Name Marks
ECE Raj 53
ECE Joel 72
EEE Moi 68
CSE Surya 81
EEE Tia 59
ECE Om 92
CSE Amy 67
$ ruby -ane 'idx = $F.each_index.sort {$F[_2] <=> $F[_1]} if $.==1;
puts $F.values_at(*idx) * "\t"' marks.txt
Name Marks Dept
Raj 53 ECE
Joel 72 ECE
Moi 68 EEE
Surya 81 CSE
Tia 59 EEE
Om 92 ECE
Amy 67 CSE
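The one-liner above computes the column order once from the header and reuses it for every row. Here's a step-by-step sketch in plain Ruby, with named block parameters instead of _1 and _2. The variable names are illustrative.

```ruby
header = %w[Dept Name Marks]

# Sort the index positions by comparing the header names in reverse,
# which gives the positions in descending order of the names
idx = header.each_index.sort { |a, b| header[b] <=> header[a] }
p idx   # [1, 2, 0]

# values_at picks elements in the given index order
p header.values_at(*idx)   # ["Name", "Marks", "Dept"]

row = %w[ECE Raj 53]
p row.values_at(*idx)      # ["Raj", "53", "ECE"]
```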
The shuffle method randomizes the order of elements.
$ s='floor bat to dubious four'
$ echo "$s" | ruby -ane 'puts $F.shuffle * ":"'
bat:floor:dubious:to:four
$ echo 'foobar' | ruby -lne 'print $_.chars.shuffle * ""'
bofrao
Use the sample method to pick one or more elements of an array at random.
$ s='hour hand band mat heated pineapple'
$ echo "$s" | ruby -ane 'puts $F.sample'
band
$ echo "$s" | ruby -ane 'puts $F.sample(2)'
pineapple
hand
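Both shuffle and sample accept a random: keyword argument, which can be handy when repeatable results are needed (for testing, say). This is a sketch that goes beyond the examples above. The seed values are arbitrary, and no particular output is claimed since it depends on the generator.

```ruby
words = %w[floor bat to dubious four]

# Two generators created with the same seed give the same shuffle
first  = words.shuffle(random: Random.new(42))
second = words.shuffle(random: Random.new(42))
p first == second   # true

# The result is still a rearrangement of the same elements
p first.sort == words.sort   # true

# sample takes the same keyword argument
pick = words.sample(2, random: Random.new(7))
p pick.size   # 2
```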
Summary
This chapter discussed various ways in which you can split (or define) the input into fields and manipulate them. Many more examples will be discussed in later chapters.
Exercises
The exercises directory has all the files used in this section.
1) For the input file brackets.txt, extract only the contents between () or )( from each input line. Assume that the () characters will be present only once per line.
$ cat brackets.txt
foo blah blah(ice) 123 xyz$
(almond-pista) choco
yo )yoyo( yo
##### add your solution here
ice
almond-pista
yoyo
2) For the input file scores.csv, extract the Name and Physics fields in the format shown below.
$ cat scores.csv
Name,Maths,Physics,Chemistry
Blue,67,46,99
Lin,78,83,80
Er,56,79,92
Cy,97,98,95
Ort,68,72,66
Ith,100,100,100
##### add your solution here
Name:Physics
Blue:46
Lin:83
Er:79
Cy:98
Ort:72
Ith:100
3) For the input file scores.csv, display names of those who've scored above 70 in Maths.
##### add your solution here
Lin
Cy
Ith
4) Display the number of word characters for the given inputs. The word definition here is the same as the one used in regular expressions. Can you construct a solution with gsub and one without the substitution functions?
# solve using gsub
$ echo 'hi there' | ##### add your solution here
7
# solve without using the substitution functions
$ echo 'u-no;co%."(do_12:as' | ##### add your solution here
12
5) For the input file quoted.txt, extract the sequence of characters surrounded by double quotes and display them in the format shown below.
$ cat quoted.txt
1 "grape" and "mango" and "guava"
("c 1""d""a-2""b")
##### add your solution here
"grape","guava","mango"
"a-2","b","c 1","d"
6) Display only the third and fifth characters from each input line.
$ printf 'restore\ncat one\ncricket' | ##### add your solution here
so
to
ik
7) Transform the given input file fw.txt to get the output as shown below. If the second field is empty (i.e. contains only space characters), replace it with NA.
$ cat fw.txt
1.3 rs 90 0.134563
3.8 6
5.2 ye 8.2387
4.2 kt 32 45.1
##### add your solution here
1.3,rs,0.134563
3.8,NA,6
5.2,ye,8.2387
4.2,kt,45.1
8) For the input file scores.csv, display the header as well as any row which contains b or t (irrespective of case) in the first field.
##### add your solution here
Name,Maths,Physics,Chemistry
Blue,67,46,99
Ort,68,72,66
Ith,100,100,100
9) Extract all whole words containing 42 but not at the edge of a word. Assume a word cannot contain 42 more than once.
$ s='hi42bye nice1423 bad42 cool_42a 42fake'
$ echo "$s" | ##### add your solution here
hi42bye
nice1423
cool_42a
10) For the input file scores.csv, add another column named GP which is calculated out of 100 by giving 50% weightage to Maths and 25% each for Physics and Chemistry.
##### add your solution here
Name,Maths,Physics,Chemistry,GP
Blue,67,46,99,69.75
Lin,78,83,80,79.75
Er,56,79,92,70.75
Cy,97,98,95,96.75
Ort,68,72,66,68.5
Ith,100,100,100,100.0
11) For the input file mixed_fs.txt, retain only the first two fields from each input line. The input and output field separators should be a space for the first two lines and ',' for the rest of the lines.
$ cat mixed_fs.txt
rose lily jasmine tulip
pink blue white yellow
car,mat,ball,basket
light green,brown,black,purple
apple,banana,cherry
##### add your solution here
rose lily
pink blue
car,mat
light green,brown
apple,banana
12) For the given space separated numbers, filter only numbers in the range 20 to 1000 (inclusive).
$ s='20 -983 5 756 634223 1000'
$ echo "$s" | ##### add your solution here
20 756 1000
13) For the given space separated words, randomize the order of characters for each word.
$ s='this is a sample sentence'
# sample randomized output shown here, could be different for you
$ echo "$s" | ##### add your solution here
shti si a salemp sneentce
14) For the given input file words.txt, filter lines whose characters are in ascending order and, separately, lines whose characters are in descending order.
$ cat words.txt
bot
art
are
boat
toe
flee
reed
# ascending order
##### add your solution here
bot
art
# descending order
##### add your solution here
toe
reed
15) For the given space separated words, extract the three longest words.
$ s='I bought two bananas and three mangoes'
$ echo "$s" | ##### add your solution here
mangoes
bananas
bought
16) Convert the contents of split.txt as shown below.
$ cat split.txt
apple,1:2:5,mango
wry,4,look
pencil,3:8,paper
##### add your solution here
apple,1,mango
apple,2,mango
apple,5,mango
wry,4,look
pencil,3,paper
pencil,8,paper
17) For the input file varying_fields.txt, construct a solution to get the output shown below.
$ cat varying_fields.txt
hi,bye,there,was,here,to
1,2,3,4,5
##### add your solution here
hi:bye:to
1:2:5
18) The fields.txt file has fields separated by the ':' character. Delete ':' and the last field if there is a digit character anywhere before the last field. The solution shouldn't use the substitution functions.
$ cat fields.txt
42:cat
twelve:a2b
we:be:he:0:a:b:bother
apple:banana-42:cherry:
dragon:unicorn:centaur
##### add your solution here
42
twelve:a2b
we:be:he:0:a:b
apple:banana-42:cherry
dragon:unicorn:centaur
19) The sample string shown below uses cat as the field separator (irrespective of case). Use space as the output field separator and add 42 as the last field.
$ s='applecatfigCaT12345cAtbanana'
$ echo "$s" | ##### add your solution here
apple fig 12345 banana 42
20) For the input file sample.txt, filter lines containing 5 or more lowercase vowels.
##### add your solution here
How are you
Believe it
No doubt you like it too
Much ado about nothing