Processing structured data

The examples in previous chapters dealt with simple text formats, where records are separated by a newline or some other sequence of characters. Such data can be read and processed linearly. However, formats like json, csv, xml and html have a specific structure that requires a dedicated parser before the data can be processed further. This chapter gives a brief overview of libraries to parse such structured inputs.

JSON

The built-in JSON module helps you process json data by converting it to Ruby objects (a hash for a json object, an array for a json array). See ruby-doc: JSON for documentation.

Here's an example of converting possibly minified json input to a pretty printed output.

$ s='{"greeting":"hi","marks":[78,62,93]}'

# ARGF.read is an alternative to the -0777 option for slurping the entire input
$ echo "$s" | ruby -rjson -e 'ip = JSON.parse(ARGF.read);
                   puts JSON.pretty_generate(ip)'
{
  "greeting": "hi",
  "marks": [
    78,
    62,
    93
  ]
}
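
Here's the same idea as a small script, in case you want to see which Ruby objects the parser returns. This is a minimal sketch. The prettify method is a hypothetical helper added for illustration and isn't part of the JSON module.

```ruby
require 'json'

# Hypothetical helper: parse json text and return a pretty printed string
def prettify(json_text)
  JSON.pretty_generate(JSON.parse(json_text))
end

s = '{"greeting":"hi","marks":[78,62,93]}'
ip = JSON.parse(s)

# json objects become Hash, json arrays become Array
p ip.class           # Hash
p ip["marks"].sum    # 233

# a top-level json array gives a Ruby Array, not a Hash
p JSON.parse('[1, 2, 3]').class   # Array

# symbolize_names: true returns symbol keys instead of string keys
p JSON.parse(s, symbolize_names: true)[:greeting]   # "hi"

puts prettify(s)
```

The last line prints the same pretty printed output as the one-liner above.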

You can create a shortcut to make it easier for one-liners.

$ # check if shortcut is available
$ type rq
bash: type: rq: not found

$ # add this to your ~/.bashrc (or the file you use for aliases/functions)
$ rq() { ruby -rjson -e 'ip = JSON.parse(ARGF.read);'"$@" ; }

$ s='{"greeting":"hi","marks":[78,62,93]}'

$ # get value of given key
$ echo "$s" | rq 'puts ip["greeting"]'
hi
$ # you can even pass options after the code snippet
$ echo "$s" | rq 'print ip["marks"]' -l
[78, 62, 93]

$ # use JSON.pretty_generate(ip) if you need pretty output
$ echo "$s" | rq 'ip["marks"][1] = 100; puts JSON.generate(ip)'
{"greeting":"hi","marks":[78,100,93]}

Here's another example.

$ cat sample.json
{
    "fruit": "apple",
    "blue": ["toy", "flower", "sand stone"],
    "light blue": ["flower", "sky", "water"],
    "language": {
        "natural": ["english", "hindi", "spanish"],
        "programming": ["python", "kotlin", "ruby"]
    },
    "physics": 84
}

$ # process top-level keys not containing 'e'
$ rq 'ip.each {|k,v| puts "#{k}:#{v}" if !k.match?(/e/)}' sample.json
fruit:apple
physics:84

$ # process keys within 'language' key that contain 't'
$ rq 'ip["language"].each {|k,v| puts "#{k}:#{v}" if k.match?(/t/)}' sample.json
natural:["english", "hindi", "spanish"]
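
The key-based filtering above can also be written as a script. The sketch below embeds the content of sample.json to stay self-contained. The keys_without method is a hypothetical helper, and the "spoken" key is made up to show what happens when a level is missing.

```ruby
require 'json'

# same content as sample.json, embedded so that the sketch is self-contained
SAMPLE = <<~'EOS'
  {
      "fruit": "apple",
      "blue": ["toy", "flower", "sand stone"],
      "light blue": ["flower", "sky", "water"],
      "language": {
          "natural": ["english", "hindi", "spanish"],
          "programming": ["python", "kotlin", "ruby"]
      },
      "physics": 84
  }
EOS

# Hypothetical helper: keys of the given hash that do not match the regexp
def keys_without(hash, pattern)
  hash.keys.reject { |k| k.match?(pattern) }
end

ip = JSON.parse(SAMPLE)

p keys_without(ip, /e/)                  # ["fruit", "physics"]

# keys within 'language' that contain 't'
p ip["language"].keys.grep(/t/)          # ["natural"]

# dig walks nested hashes and arrays, returns nil if a level is missing
p ip.dig("language", "programming", 2)   # "ruby"
p ip.dig("language", "spoken")           # nil
```

Using dig avoids an exception when an intermediate key is absent, which can be handy if the input structure varies.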

CSV

The CSV built-in class comes in handy for processing csv files. See ruby-doc: CSV for documentation.

Here's a simple example that parses the entire input string in one shot.

$ s='eagle,"fox,42",bee,frog\n1,2,3,4'

$ printf '%b' "$s" | ruby -rcsv -le 'ip=CSV.new(ARGF.read); print ip.read'
[["eagle", "fox,42", "bee", "frog"], ["1", "2", "3", "4"]]

$ printf '%b' "$s" | ruby -rcsv -e 'ip=CSV.new(ARGF.read); puts ip.read[0][1]'
fox,42
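
If the input is already available as a string, CSV.parse can be used directly instead of creating an object with CSV.new. Here's a minimal sketch using the same sample data. CSV.parse_line and CSV.generate_line are shown as well, since they help with single-row conversions.

```ruby
require 'csv'

s = %Q(eagle,"fox,42",bee,frog\n1,2,3,4)

# parse the whole string into an array of rows
rows = CSV.parse(s)
p rows         # [["eagle", "fox,42", "bee", "frog"], ["1", "2", "3", "4"]]
p rows[0][1]   # "fox,42"

# parse only the first row
p CSV.parse_line(s)   # ["eagle", "fox,42", "bee", "frog"]

# convert a row back to csv text, quotes are added only where needed
print CSV.generate_line(rows[0])   # eagle,"fox,42",bee,frog
```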

Here's an example with newline characters inside quoted fields. This example passes the input filename directly to the CSV methods instead of passing it as a command line argument to the ruby script.

$ cat newline.csv
apple,"1
2
3",good
guava,"32
54",nice

$ # this will parse entire input in one shot
$ ruby -rcsv -le 'ip=CSV.read("newline.csv"); print ip'
[["apple", "1\n2\n3", "good"], ["guava", "32\n54", "nice"]]

$ # this is better suited for large inputs
$ ruby -rcsv -e 'CSV.foreach("newline.csv"){ |row|
                 puts row[2] if row[0]=~/pp/ }'
good
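
Here's a script version of the row-by-row approach. This is a sketch under a few assumptions: the content of newline.csv is written to a temporary file so that the code is self-contained, and field_sums is a hypothetical helper that adds up the numbers found in the multiline second field.

```ruby
require 'csv'
require 'tempfile'

# Hypothetical helper: stream rows from the given file and sum the numbers
# present in the second field, which may span multiple lines
def field_sums(path)
  sums = {}
  CSV.foreach(path) do |row|
    sums[row[0]] = row[1].split("\n").sum(&:to_i)
  end
  sums
end

# same content as newline.csv
content = %Q(apple,"1\n2\n3",good\nguava,"32\n54",nice\n)

Tempfile.create(['newline', '.csv']) do |f|
  f.write(content)
  f.flush
  field_sums(f.path).each { |name, sum| puts "#{name}: #{sum}" }
end
```

This prints 6 for apple and 86 for guava. Since CSV.foreach yields one row at a time, only the current row needs to be held in memory while reading.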

You can change the field separator using the col_sep option. The marks.txt file used in the examples below is tab separated.


$ ruby -rcsv -e 'CSV.foreach("marks.txt", :col_sep => "\t"){ |r|
                 puts r * "," if r[0]=="ECE" }'
ECE,Raj,53
ECE,Joel,72
ECE,Om,92

The headers option treats the first row as the header, which is useful for processing fields by name.

$ ruby -rcsv -e 'CSV.foreach("marks.txt", :headers => true, :col_sep => "\t"){
                 |r| puts r["Name"] }'
Raj
Joel
Moi
Surya
Tia
Om
Amy
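
When headers are enabled, CSV.parse returns a table object that supports both column and row access. Here's a minimal sketch. The content of marks.txt is an assumption, reconstructed from the outputs shown in this chapter and embedded as a string.

```ruby
require 'csv'

# assumed contents of marks.txt (tab separated)
MARKS = <<~EOS
  Dept\tName\tMarks
  ECE\tRaj\t53
  ECE\tJoel\t72
  EEE\tMoi\t68
  CSE\tSurya\t81
  EEE\tTia\t59
  ECE\tOm\t92
  CSE\tAmy\t67
EOS

table = CSV.parse(MARKS, headers: true, col_sep: "\t")

p table.headers   # ["Dept", "Name", "Marks"]

# a header name gives all the values of that column
p table["Name"]

# iterate over rows and filter based on a named field
p table.select { |r| r["Dept"] == "ECE" }.map { |r| r["Name"] }
# ["Raj", "Joel", "Om"]

# an integer index gives a row, which can be converted to a hash
p table[0].to_h
```

Note that all the field values are strings here, including the Marks column.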

You can try to automatically convert field values to the given data type(s) using the converters option. See ruby-doc: CSV Typed data reading for details.

$ ruby -rcsv -le 'CSV.foreach("marks.txt", :converters => :integer,
                  :col_sep => "\t"){ print _1 }'
["Dept", "Name", "Marks"]
["ECE", "Raj", 53]
["ECE", "Joel", 72]
["EEE", "Moi", 68]
["CSE", "Surya", 81]
["EEE", "Tia", 59]
["ECE", "Om", 92]
["CSE", "Amy", 67]
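
Converted values can be used directly in numeric operations. Here's a sketch that combines the headers and converters options to total the marks per department. The content of marks.txt is embedded as a string (assumed from the output above) and dept_totals is a hypothetical helper.

```ruby
require 'csv'

# assumed contents of marks.txt (tab separated), based on the output above
MARKS_DATA = <<~EOS
  Dept\tName\tMarks
  ECE\tRaj\t53
  ECE\tJoel\t72
  EEE\tMoi\t68
  CSE\tSurya\t81
  EEE\tTia\t59
  ECE\tOm\t92
  CSE\tAmy\t67
EOS

# Hypothetical helper: total marks per department
def dept_totals(tsv)
  totals = Hash.new(0)
  rows = CSV.parse(tsv, headers: true, col_sep: "\t", converters: :integer)
  rows.each { |r| totals[r["Dept"]] += r["Marks"] }
  totals
end

dept_totals(MARKS_DATA).each { |dept, sum| puts "#{dept}: #{sum}" }
# ECE: 217
# EEE: 127
# CSE: 148

# header_converters: :symbol turns "Name" into :name, "Marks" into :marks
rows = CSV.parse(MARKS_DATA, headers: true, col_sep: "\t",
                 converters: :integer, header_converters: :symbol)
puts rows.max_by { |r| r[:marks] }[:name]   # Om
```

Without the converters option, r["Marks"] would be a string and the addition inside dept_totals would raise an error.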

XML and HTML

nokogiri is a popular third-party library for parsing xml and html formats. It can also work with malformed data.

To parse an input, you can pass either a string or a filehandle. You can also pass a URI filehandle to work with URLs directly.

Both XPath and CSS based selections are available. See also XPath introduction and difference between XPath and CSS Selectors. Here are some examples with xpath methods.

$ cat sample.xml
<doc>
    <greeting type="ask">Hi there. How are you?</greeting>
    <greeting type="reply">I am good.</greeting>
    <color>
        <blue>flower</blue>
        <blue>sand stone</blue>
        <light-blue>sky</light-blue>
        <light-blue>water</light-blue>
    </color>
</doc>

$ # all results for 'blue' tag
$ # note that ARGF filehandle is passed directly here
$ ruby -rnokogiri -e 'ip=Nokogiri.XML(ARGF);
        puts ip.xpath("//blue")' sample.xml
<blue>flower</blue>
<blue>sand stone</blue>

$ # selecting based on content of 'blue' tags
$ ruby -rnokogiri -e 'ip=Nokogiri.XML(ARGF);
        puts ip.xpath("//blue").grep(/stone/)' sample.xml
<blue>sand stone</blue>

$ # use 'at_xpath' instead of 'xpath' to get only the first result
$ # or use 'ip.xpath("//blue")[0]' to get specific elements
$ ruby -rnokogiri -e 'ip=Nokogiri.XML(ARGF);
        puts ip.at_xpath("//blue")' sample.xml
<blue>flower</blue>

Here's an example with css methods. Use the text method to get the value of the selected tags.

$ ruby -rnokogiri -e 'ip=Nokogiri.XML(ARGF);
        puts ip.css("light-blue").map(&:text)' sample.xml
sky
water

$ ruby -rnokogiri -e 'ip=Nokogiri.XML(ARGF);
        puts ip.at_css("blue").text' sample.xml
flower

Here's an example of matching attributes.

$ # you can use //@type if you want all attributes named 'type'
$ ruby -rnokogiri -e 'ip=Nokogiri.XML(ARGF);
        puts ip.xpath("//greeting/@type")' sample.xml
ask
reply

$ # match a specific attribute value
$ # same as: ip.css("greeting[type=\"ask\"]")
$ ruby -rnokogiri -e 'ip=Nokogiri.XML(ARGF);
        puts ip.xpath("//greeting[@type=\"ask\"]").text' sample.xml
Hi there. How are you?

Here's an example with html input.

$ s='https://learnbyexample.github.io/substitution-with-ripgrep/'
$ url="$s" ruby -rnokogiri -ropen-uri -e 'ip=Nokogiri.XML(URI.open(ENV["url"]));
                 puts ip.css("@href")[3..5]'
https://learnbyexample.github.io
https://learnbyexample.github.io/books
https://learnbyexample.github.io/tags

Summary

This chapter showed basic examples of processing structured data like json, csv, xml and html using built-in and third-party libraries. As mentioned in the introduction chapter, using ruby one-liners for such tasks helps you avoid learning the syntax and idioms of a custom command line tool. Another advantage is that you have the entire ecosystem of a programming language at your disposal once the structured input has been parsed. If performance becomes a concern, then custom cli tools like jq, xsv and xmlstarlet will come in handy.

No exercises for this final chapter (author is lazy and doesn't have much experience with these formats). So, I'll leave you with links for further reading and as a source of exercises.