Processing structured data
Examples in the previous chapters dealt with simple text formats separated by a newline or some other sequence of characters. Such data could be read and processed linearly. However, formats like JSON, CSV, XML and HTML have a specific structure that requires a custom parser before the data can be made available for further processing. This chapter gives a brief overview of libraries to parse such structured inputs.
The example_files directory has all the files used in the examples.
JSON
The JSON built-in module helps you process JSON data by converting it to a Ruby object such as a hash. See ruby-doc: JSON for documentation.
Here's an example of converting possibly minified JSON input to pretty printed output.
$ s='{"greeting":"hi","marks":[78,62,93]}'
# ARGF.read is an alternative to -0777 for slurping the entire input
$ echo "$s" | ruby -rjson -e 'ip = JSON.parse(ARGF.read);
puts JSON.pretty_generate(ip)'
{
  "greeting": "hi",
  "marks": [
    78,
    62,
    93
  ]
}
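If you'd rather keep this in a script, the same round trip can be sketched as shown below. The `prettify_json` helper is a hypothetical name used only for illustration. It relies on nothing more than `JSON.parse` and `JSON.pretty_generate` from the one-liner above.

```ruby
require "json"

# Hypothetical helper (not part of the JSON module): reformat a JSON
# string with indentation, mirroring the one-liner above.
def prettify_json(text)
  data = JSON.parse(text)      # Hash or Array, depending on the input
  JSON.pretty_generate(data)   # multi-line String, 2-space indentation
end

s = '{"greeting":"hi","marks":[78,62,93]}'
puts prettify_json(s)
```

Wrapping the two calls in a method makes it easy to add processing between parsing and generating, which is what the shortcut that follows does for one-liners.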
You can create a shortcut to make it easier for one-liners.
# check if the shortcut is available
$ type rq
bash: type: rq: not found
# add this to your ~/.bashrc (or the file you use for aliases/functions)
$ rq() { ruby -rjson -e 'ip = JSON.parse(ARGF.read);'"$@" ; }
$ s='{"greeting":"hi","marks":[78,62,93]}'
# get value of the given key
$ echo "$s" | rq 'puts ip["greeting"]'
hi
# you can even pass options after the code snippet
$ echo "$s" | rq 'print ip["marks"]' -l
[78, 62, 93]
# use JSON.pretty_generate(ip) if you need pretty output
$ echo "$s" | rq 'ip["marks"][1] = 100; puts JSON.generate(ip)'
{"greeting":"hi","marks":[78,100,93]}
Here's another example.
$ cat sample.json
{
"fruit": "apple",
"blue": ["toy", "flower", "sand stone"],
"light blue": ["flower", "sky", "water"],
"language": {
"natural": ["english", "hindi", "spanish"],
"programming": ["python", "kotlin", "ruby"]
},
"physics": 84
}
# process top-level keys not containing 'e'
$ rq 'ip.each {|k,v| puts "#{k}:#{v}" if !k.match?(/e/)}' sample.json
fruit:apple
physics:84
# process keys within the 'language' key that contain 't'
$ rq 'ip["language"].each {|k,v| puts "#{k}:#{v}" if k.match?(/t/)}' sample.json
natural:["english", "hindi", "spanish"]
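Since the parsed input is an ordinary Ruby hash, the usual Hash methods are available too. Here's a minimal sketch that uses a trimmed copy of sample.json as assumed input. The `select_keys` helper is a hypothetical name, while `Hash#select` and `Hash#dig` are standard Ruby methods.

```ruby
require "json"

# Assumed sample data: a trimmed copy of sample.json
text = <<~EOS
  {
    "fruit": "apple",
    "language": {
      "natural": ["english", "hindi", "spanish"],
      "programming": ["python", "kotlin", "ruby"]
    },
    "physics": 84
  }
EOS

# Hypothetical helper: keep only the pairs whose key matches the regexp
def select_keys(hash, pattern)
  hash.select { |k, _v| k.match?(pattern) }
end

data = JSON.parse(text)
p select_keys(data["language"], /t/)       # only the "natural" entry remains
p data.dig("language", "programming", 2)   # walks nested keys and indexes
```

`dig` returns `nil` instead of raising an error when an intermediate key is missing, which is handy when the input structure isn't guaranteed.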
CSV
The CSV
built-in class comes in handy for processing csv
files. See ruby-doc: CSV for documentation.
Here's a simple example that parses the entire input string in one shot.
$ s='eagle,"fox,42",bee,frog\n1,2,3,4'
$ printf '%b' "$s" | ruby -rcsv -le 'ip=CSV.new(ARGF.read); print ip.read'
[["eagle", "fox,42", "bee", "frog"], ["1", "2", "3", "4"]]
$ printf '%b' "$s" | ruby -rcsv -e 'ip=CSV.new(ARGF.read); puts ip.read[0][1]'
fox,42
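For string input, `CSV.parse` is a shorter route to the same array of arrays. The sketch below wraps it in a hypothetical `csv_cell` helper so that you can pick a single field by row and column index.

```ruby
require "csv"

# Hypothetical helper: parse CSV text and return one field by index
def csv_cell(text, row, col)
  CSV.parse(text)[row][col]
end

s = "eagle,\"fox,42\",bee,frog\n1,2,3,4"
puts csv_cell(s, 0, 1)   # the quoted field keeps its embedded comma
puts csv_cell(s, 1, 3)
```

Note that the fields are strings at this point, so the second call gives "4" and not the integer 4.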
Here's an example with embedded newline characters. This example uses the input filename directly within the code instead of passing it as a command line argument to the Ruby script.
$ cat newline.csv
apple,"1
2
3",good
guava,"32
54",nice
# this will parse the entire input in one shot
$ ruby -rcsv -le 'ip=CSV.read("newline.csv"); print ip'
[["apple", "1\n2\n3", "good"], ["guava", "32\n54", "nice"]]
# this is better suited for large inputs
$ ruby -rcsv -e 'CSV.foreach("newline.csv"){ |row|
puts row[2] if row[0]=~/pp/ }'
good
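The row-by-row approach can be sketched as a small script as well. The `third_field_where` helper is a hypothetical name, and the temporary file stands in for newline.csv so that the snippet runs on its own.

```ruby
require "csv"
require "tempfile"

# Hypothetical helper: stream a CSV file one row at a time and collect
# the third field of every row whose first field matches the regexp
def third_field_where(path, pattern)
  result = []
  CSV.foreach(path) { |row| result << row[2] if row[0] =~ pattern }
  result
end

# Assumed stand-in for newline.csv, written to a temporary file
Tempfile.create("newline") do |f|
  f.write("apple,\"1\n2\n3\",good\nguava,\"32\n54\",nice\n")
  f.flush
  p third_field_where(f.path, /pp/)
end
```

Only the matching fields are accumulated, so memory usage depends on the number of matches rather than the size of the input file.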
You can change the field separator using the col_sep option.
$ ruby -rcsv -e 'CSV.foreach("marks.txt", :col_sep => "\t"){ |r|
puts r * "," if r[0]=="ECE" }'
ECE,Raj,53
ECE,Joel,72
ECE,Om,92
The headers option treats the first row as the header, which is useful for named field processing.
$ ruby -rcsv -e 'CSV.foreach("marks.txt", :headers => true, :col_sep => "\t"){
|r| puts r["Name"] }'
Raj
Joel
Moi
Surya
Tia
Om
Amy
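The headers and col_sep options can be combined with a filter on a named field. Here's a minimal sketch, assuming a few tab-separated rows in the style of marks.txt. The `names_in` helper is a hypothetical name.

```ruby
require "csv"

# Assumed sample data: a few tab-separated rows in the style of marks.txt
text = "Dept\tName\tMarks\nECE\tRaj\t53\nEEE\tMoi\t68\nECE\tOm\t92\n"

# Hypothetical helper: names of everyone in the given department
def names_in(text, dept)
  names = []
  CSV.parse(text, headers: true, col_sep: "\t").each do |row|
    # with headers enabled, each row can be indexed by column name
    names << row["Name"] if row["Dept"] == dept
  end
  names
end

p names_in(text, "ECE")
```

Accessing fields by name keeps the code working even if the column order changes in the input.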
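You can try to automatically convert field values to the given data types using the converters option. Fields that cannot be converted, such as the header row in the example below, are left as strings. See ruby-doc: CSV Field Converters for details.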
$ ruby -rcsv -le 'CSV.foreach("marks.txt", :converters => :integer,
:col_sep => "\t"){ print _1 }'
["Dept", "Name", "Marks"]
["ECE", "Raj", 53]
["ECE", "Joel", 72]
["EEE", "Moi", 68]
["CSE", "Surya", 81]
["EEE", "Tia", 59]
["ECE", "Om", 92]
["CSE", "Amy", 67]
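Once the fields are converted, you can do arithmetic on them without calling to_i yourself. Here's a minimal sketch, again assuming a few rows in the style of marks.txt. The `total_marks` helper is a hypothetical name.

```ruby
require "csv"

# Assumed sample data: a few tab-separated rows in the style of marks.txt
text = "Dept\tName\tMarks\nECE\tRaj\t53\nEEE\tMoi\t68\nECE\tOm\t92\n"

# Hypothetical helper: add up the Marks column
def total_marks(text)
  rows = CSV.parse(text, headers: true, col_sep: "\t", converters: :integer)
  # "53", "68" and "92" are Integer objects by now, so sum works directly
  rows.sum { |row| row["Marks"] }
end

puts total_marks(text)   # 53 + 68 + 92
```

Non-numeric fields like the names are untouched by the :integer converter, so mixing text and number columns is fine.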
XML and HTML
nokogiri is a popular third-party library to parse XML and HTML formats. It also supports working with malformed data.
To parse an input, you can pass either a string or a filehandle. You can also pass a URI filehandle for working with URLs directly.
Both XPath and CSS based selections are available. See also XPath introduction and difference between XPath and CSS Selectors. Here are some examples with xpath methods.
$ cat sample.xml
<doc>
<greeting type="ask">Hi there. How are you?</greeting>
<greeting type="reply">I am good.</greeting>
<color>
<blue>flower</blue>
<blue>sand stone</blue>
<light-blue>sky</light-blue>
<light-blue>water</light-blue>
</color>
</doc>
# all results for the 'blue' tag
# note that the ARGF filehandle is passed directly here
$ ruby -rnokogiri -e 'ip=Nokogiri.XML(ARGF);
puts ip.xpath("//blue")' sample.xml
<blue>flower</blue>
<blue>sand stone</blue>
# filtering based on the content of 'blue' tags
$ ruby -rnokogiri -e 'ip=Nokogiri.XML(ARGF);
puts ip.xpath("//blue").grep(/stone/)' sample.xml
<blue>sand stone</blue>
# use 'at_xpath' instead of 'xpath' to get only the first result
# or use 'ip.xpath("//blue")[0]' to get specific elements
$ ruby -rnokogiri -e 'ip=Nokogiri.XML(ARGF);
puts ip.at_xpath("//blue")' sample.xml
<blue>flower</blue>
Here's an example with css methods. Use the text method to get the value of the selected tags.
$ ruby -rnokogiri -e 'ip=Nokogiri.XML(ARGF);
puts ip.css("light-blue").map(&:text)' sample.xml
sky
water
$ ruby -rnokogiri -e 'ip=Nokogiri.XML(ARGF);
puts ip.at_css("blue").text' sample.xml
flower
Here's an example of matching attributes.
# you can use //@type if you want all attributes that are named as 'type'
$ ruby -rnokogiri -e 'ip=Nokogiri.XML(ARGF);
puts ip.xpath("//greeting/@type")' sample.xml
ask
reply
# match a specific attribute value
# same as: ip.css("greeting[type=\"ask\"]")
$ ruby -rnokogiri -e 'ip=Nokogiri.XML(ARGF);
puts ip.xpath("//greeting[@type=\"ask\"]").text' sample.xml
Hi there. How are you?
Here's an example with HTML input.
$ s='https://learnbyexample.github.io/substitution-with-ripgrep/'
$ url="$s" ruby -rnokogiri -ropen-uri -e 'ip=Nokogiri.HTML(URI.open(ENV["url"]));
puts ip.css("@href")[5..7]'
https://learnbyexample.github.io/books
https://learnbyexample.github.io/mini
https://learnbyexample.github.io/tips
Summary
This chapter showed basic examples of processing structured data like JSON, CSV, XML and HTML using built-in and third-party libraries. As mentioned in the introduction chapter, using Ruby one-liners for such tasks helps you avoid learning the syntax and idioms of a custom command line tool. Another advantage is that you have the entire ecosystem of a programming language at your disposal once the structured input has been parsed. If performance becomes a concern, then custom CLI tools like jq, xsv and xmlstarlet will come in handy.
No exercises for this final chapter (author is lazy and doesn't have much experience with these formats). So, I'll leave you with links for further reading and as a source of exercises.