Processing structured data

Examples in the previous chapters dealt with simple text formats separated by a newline or some other sequence of characters. Such data could be linearly read and processed. However, formats like json, csv, xml and html have a specific structure that requires a custom parser before they can be made available for further processing. This chapter will give a brief overview of libraries to parse such structured inputs.

Note: The example_files directory has all the files used in the examples.

JSON

The built-in JSON module helps you process json data by converting it to Ruby objects (a hash for a json object, an array for a json array, and so on). See ruby-doc: JSON for documentation.

Here's an example of converting possibly minified json input to a pretty printed output.

$ s='{"greeting":"hi","marks":[78,62,93]}'

# ARGF.read is an alternative to -0777 for slurping the entire input
$ echo "$s" | ruby -rjson -e 'ip = JSON.parse(ARGF.read);
                              puts JSON.pretty_generate(ip)'
{
  "greeting": "hi",
  "marks": [
    78,
    62,
    93
  ]
}
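
If you prefer a script over a one-liner, the same round trip can be sketched as shown below. This is only an illustration: the input comes from a string instead of stdin, and the prettify method and SAMPLE constant are hypothetical names that aren't part of the JSON module.

```ruby
require 'json'

# same sample string as in the one-liner above
SAMPLE = '{"greeting":"hi","marks":[78,62,93]}'

# hypothetical helper: parse a json string and return the pretty printed form
def prettify(json_text)
  JSON.pretty_generate(JSON.parse(json_text))
end

ip = JSON.parse(SAMPLE)
puts ip.class            # Hash, since the top-level json value is an object
puts ip["marks"].class   # Array
puts prettify(SAMPLE)
```

The last puts gives the same pretty printed output as the one-liner.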

You can create a shortcut to make it easier for one-liners.

# check if the shortcut is available
$ type rq
bash: type: rq: not found

# add this to your ~/.bashrc (or the file you use for aliases/functions)
$ rq() { ruby -rjson -e 'ip = JSON.parse(ARGF.read);'"$@" ; }

$ s='{"greeting":"hi","marks":[78,62,93]}'

# get value of the given key
$ echo "$s" | rq 'puts ip["greeting"]'
hi
# you can even pass options after the code snippet
$ echo "$s" | rq 'print ip["marks"]' -l
[78, 62, 93]

# use JSON.pretty_generate(ip) if you need pretty output
$ echo "$s" | rq 'ip["marks"][1] = 100; puts JSON.generate(ip)'
{"greeting":"hi","marks":[78,100,93]}
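
Since the parsed data is an ordinary Ruby hash, any hash or array method can be used before converting it back to json. Here's a minimal sketch in script form. The update_marks method and the "total" key are made up for this illustration.

```ruby
require 'json'

# hypothetical helper: replace one of the marks, add a "total" key
# (made up for illustration) and return compact json
def update_marks(json_text, index, value)
  ip = JSON.parse(json_text)
  ip["marks"][index] = value
  ip["total"] = ip["marks"].sum
  JSON.generate(ip)
end

puts update_marks('{"greeting":"hi","marks":[78,62,93]}', 1, 100)
```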

Here's another example.

$ cat sample.json
{
    "fruit": "apple",
    "blue": ["toy", "flower", "sand stone"],
    "light blue": ["flower", "sky", "water"],
    "language": {
        "natural": ["english", "hindi", "spanish"],
        "programming": ["python", "kotlin", "ruby"]
    },
    "physics": 84
}

# process top-level keys not containing 'e'
$ rq 'ip.each {|k,v| puts "#{k}:#{v}" if !k.match?(/e/)}' sample.json
fruit:apple
physics:84

# process keys within the 'language' key that contain 't'
$ rq 'ip["language"].each {|k,v| puts "#{k}:#{v}" if k.match?(/t/)}' sample.json
natural:["english", "hindi", "spanish"]
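
The filtering shown above can also be written with Hash methods like select and reject, which return a new hash instead of printing as they go. The sketch below assumes an inline copy of sample.json so that it is self-contained. The filter_keys method is a hypothetical helper, and dig is used here as one way to reach a nested value.

```ruby
require 'json'

# inline copy of sample.json so that the sketch is self-contained
SAMPLE_JSON = <<~EOS
  {
      "fruit": "apple",
      "blue": ["toy", "flower", "sand stone"],
      "light blue": ["flower", "sky", "water"],
      "language": {
          "natural": ["english", "hindi", "spanish"],
          "programming": ["python", "kotlin", "ruby"]
      },
      "physics": 84
  }
EOS

# hypothetical helper: keep only the entries whose key matches the regexp
def filter_keys(hash, regexp)
  hash.select { |k, _v| k.match?(regexp) }
end

ip = JSON.parse(SAMPLE_JSON)

# top-level keys not containing 'e'
p ip.reject { |k, _v| k.match?(/e/) }.keys

# keys within the 'language' key that contain 't'
p filter_keys(ip["language"], /t/)

# dig walks nested hashes and arrays, returns nil if any step is missing
puts ip.dig("language", "programming", 2)
```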

CSV

The CSV built-in class comes in handy for processing csv files. See ruby-doc: CSV for documentation.

Here's a simple example that parses the entire input string in one shot.

$ s='eagle,"fox,42",bee,frog\n1,2,3,4'

$ printf '%b' "$s" | ruby -rcsv -le 'ip=CSV.new(ARGF.read); print ip.read'
[["eagle", "fox,42", "bee", "frog"], ["1", "2", "3", "4"]]

$ printf '%b' "$s" | ruby -rcsv -e 'ip=CSV.new(ARGF.read); puts ip.read[0][1]'
fox,42
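
When the input is already available as a string, CSV.parse and CSV.parse_line are possible alternatives to creating a CSV object first. Here's a small sketch, with the CSV_TEXT constant being a name chosen just for this illustration.

```ruby
require 'csv'

# \n within %Q() is a real newline, similar to what printf '%b' produces
CSV_TEXT = %Q(eagle,"fox,42",bee,frog\n1,2,3,4)

# array of rows, each row is an array of string fields
rows = CSV.parse(CSV_TEXT)
p rows
puts rows[0][1]

# parse_line returns only the first row
p CSV.parse_line(CSV_TEXT)
```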

Here's an example with embedded newline characters. These examples give the input filename directly to the CSV methods instead of passing it as a command line argument to the Ruby script.

$ cat newline.csv
apple,"1
2
3",good
guava,"32
54",nice

# this will parse the entire input in one shot
$ ruby -rcsv -le 'ip=CSV.read("newline.csv"); print ip'
[["apple", "1\n2\n3", "good"], ["guava", "32\n54", "nice"]]

# this is better suited for large inputs
$ ruby -rcsv -e 'CSV.foreach("newline.csv"){ |row|
                 puts row[2] if row[0]=~/pp/ }'
good
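
Here's a script version of the CSV.foreach example, as a rough sketch. It writes a copy of newline.csv to a temporary file so that the code can be run anywhere. The third_fields method and the NEWLINE_CSV constant are hypothetical names used only for this illustration.

```ruby
require 'csv'
require 'tempfile'

# same content as newline.csv
NEWLINE_CSV = %Q(apple,"1\n2\n3",good\nguava,"32\n54",nice\n)

# hypothetical helper: third field of every record whose first field matches
# records are read one at a time instead of loading the whole file
def third_fields(path, regexp)
  result = []
  CSV.foreach(path) { |row| result << row[2] if row[0] =~ regexp }
  result
end

Tempfile.create(['newline', '.csv']) do |f|
  f.write(NEWLINE_CSV)
  f.flush
  p third_fields(f.path, /pp/)
  # the file has five lines, but only two records
  puts CSV.foreach(f.path).count
end
```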

You can change the field separator using the col_sep option.

$ ruby -rcsv -e 'CSV.foreach("marks.txt", :col_sep => "\t"){ |r|
                 puts r * "," if r[0]=="ECE" }'
ECE,Raj,53
ECE,Joel,72
ECE,Om,92
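
The col_sep option works with string input as well. The sketch below assumes that marks.txt has the tab separated content seen in the outputs of this section. The dept_rows method is a hypothetical helper. Note that r * "," (same as r.join(",")) doesn't quote fields containing a comma, whereas to_csv does.

```ruby
require 'csv'

# assumed content of marks.txt, \t is a tab character
MARKS_TSV = <<~EOS
  Dept\tName\tMarks
  ECE\tRaj\t53
  ECE\tJoel\t72
  EEE\tMoi\t68
  CSE\tSurya\t81
  EEE\tTia\t59
  ECE\tOm\t92
  CSE\tAmy\t67
EOS

# hypothetical helper: rows whose first field is the given department
def dept_rows(text, dept)
  CSV.parse(text, col_sep: "\t").select { |r| r[0] == dept }
end

dept_rows(MARKS_TSV, "ECE").each { |r| puts r.join(",") }

# to_csv adds quotes when a field has the separator
print ["a,b", "c"].to_csv
```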

The headers option will treat the first row as the header, useful for named field processing.

$ ruby -rcsv -e 'CSV.foreach("marks.txt", :headers => true, :col_sep => "\t"){
                 |r| puts r["Name"] }'
Raj
Joel
Moi
Surya
Tia
Om
Amy
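
With the headers option, each row behaves like a hash keyed by the header names. Here's a minimal sketch, again assuming the marks.txt content seen in the outputs of this section. The names method is a hypothetical helper.

```ruby
require 'csv'

# assumed content of marks.txt, \t is a tab character
MARKS_TSV = <<~EOS
  Dept\tName\tMarks
  ECE\tRaj\t53
  ECE\tJoel\t72
  EEE\tMoi\t68
  CSE\tSurya\t81
  EEE\tTia\t59
  ECE\tOm\t92
  CSE\tAmy\t67
EOS

# hypothetical helper: values of the 'Name' column
def names(text)
  CSV.parse(text, headers: true, col_sep: "\t").map { |row| row["Name"] }
end

p names(MARKS_TSV)

# each row can be converted to a plain hash
table = CSV.parse(MARKS_TSV, headers: true, col_sep: "\t")
p table.first.to_h
```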

You can try to automatically convert field values to the given data types using the converters option. See ruby-doc: CSV Field Converters for details.

$ ruby -rcsv -le 'CSV.foreach("marks.txt", :converters => :integer,
                  :col_sep => "\t"){ print _1 }'
["Dept", "Name", "Marks"]
["ECE", "Raj", 53]
["ECE", "Joel", 72]
["EEE", "Moi", 68]
["CSE", "Surya", 81]
["EEE", "Tia", 59]
["ECE", "Om", 92]
["CSE", "Amy", 67]
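
Converted values are handy when you need arithmetic. The sketch below combines the headers and converters options to add up the marks for each department, assuming the marks.txt content seen in the above output. The totals_by_dept method is a hypothetical helper.

```ruby
require 'csv'

# assumed content of marks.txt, \t is a tab character
MARKS_TSV = <<~EOS
  Dept\tName\tMarks
  ECE\tRaj\t53
  ECE\tJoel\t72
  EEE\tMoi\t68
  CSE\tSurya\t81
  EEE\tTia\t59
  ECE\tOm\t92
  CSE\tAmy\t67
EOS

# hypothetical helper: sum of the 'Marks' field for each department
def totals_by_dept(text)
  table = CSV.parse(text, headers: true, converters: :integer, col_sep: "\t")
  # row["Marks"] is an Integer here because of the converter
  table.each_with_object(Hash.new(0)) { |row, h| h[row["Dept"]] += row["Marks"] }
end

p totals_by_dept(MARKS_TSV)
```

Without the converters option, row["Marks"] would be a string and += would raise a TypeError.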

XML and HTML

nokogiri is a popular third-party library to parse xml and html formats. It also supports working with malformed data.

To parse an input, you can either pass a string or pass a filehandle. You can also pass a URI filehandle for working with URLs directly.

Both XPath and CSS based selections are available. See also XPath introduction and difference between XPath and CSS Selectors. Here are some examples with xpath methods.

$ cat sample.xml
<doc>
    <greeting type="ask">Hi there. How are you?</greeting>
    <greeting type="reply">I am good.</greeting>
    <color>
        <blue>flower</blue>
        <blue>sand stone</blue>
        <light-blue>sky</light-blue>
        <light-blue>water</light-blue>
    </color>
</doc>

# all results for the 'blue' tag
# note that the ARGF filehandle is passed directly here
$ ruby -rnokogiri -e 'ip=Nokogiri.XML(ARGF);
                      puts ip.xpath("//blue")' sample.xml
<blue>flower</blue>
<blue>sand stone</blue>

# filtering based on the content of 'blue' tags
$ ruby -rnokogiri -e 'ip=Nokogiri.XML(ARGF);
                      puts ip.xpath("//blue").grep(/stone/)' sample.xml
<blue>sand stone</blue>

# use 'at_xpath' instead of 'xpath' to get only the first result
# or use 'ip.xpath("//blue")[0]' to get specific elements
$ ruby -rnokogiri -e 'ip=Nokogiri.XML(ARGF);
                      puts ip.at_xpath("//blue")' sample.xml
<blue>flower</blue>

Here's an example with css methods. Use the text method to get the value of the selected tags.

$ ruby -rnokogiri -e 'ip=Nokogiri.XML(ARGF);
                      puts ip.css("light-blue").map(&:text)' sample.xml
sky
water

$ ruby -rnokogiri -e 'ip=Nokogiri.XML(ARGF);
                      puts ip.at_css("blue").text' sample.xml
flower

Here's an example of matching attributes.

# you can use //@type if you want all the attributes named 'type'
$ ruby -rnokogiri -e 'ip=Nokogiri.XML(ARGF);
                      puts ip.xpath("//greeting/@type")' sample.xml
ask
reply

# match a specific attribute value
# same as: ip.css("greeting[type=\"ask\"]")
$ ruby -rnokogiri -e 'ip=Nokogiri.XML(ARGF);
                      puts ip.xpath("//greeting[@type=\"ask\"]").text' sample.xml
Hi there. How are you?

Here's an example with html input.

$ s='https://learnbyexample.github.io/substitution-with-ripgrep/'
$ url="$s" ruby -rnokogiri -ropen-uri -e 'ip=Nokogiri.HTML(URI.open(ENV["url"]));
                                          puts ip.css("@href")[5..7]'
https://learnbyexample.github.io/books
https://learnbyexample.github.io/mini
https://learnbyexample.github.io/tips

Summary

This chapter showed basic examples of processing structured data like json, csv, xml and html using built-in and third-party libraries. As mentioned in the introduction chapter, using Ruby one-liners for such tasks helps you avoid learning the syntax and idioms of a custom command line tool. Another advantage is that you have the entire ecosystem of a programming language at your disposal once the structured input has been parsed. If performance becomes a concern, then custom CLI tools like jq, xsv and xmlstarlet will come in handy.

No exercises for this final chapter (author is lazy and doesn't have much experience with these formats). So, I'll leave you with links for further reading and as a source of exercises.