Warning: This is a work-in-progress draft version.



htmlq

From the htmlq README:

Like jq, but for HTML. Uses CSS selectors to extract bits of content from HTML files.

Installation

See the installation section of the htmlq README for details.

Documentation

From htmlq --help:

$ htmlq --help
htmlq 0.4.0
Michael Maclean <michael@mgdm.net>
Runs CSS selectors on HTML

USAGE:
    htmlq [FLAGS] [OPTIONS] [--] [selector]...

FLAGS:
    -B, --detect-base          Try to detect the base URL from the <base> tag in the document. If not found, default to
                               the value of --base, if supplied
    -h, --help                 Prints help information
    -w, --ignore-whitespace    When printing text nodes, ignore those that consist entirely of whitespace
    -p, --pretty               Pretty-print the serialised output
    -t, --text                 Output only the contents of text nodes inside selected elements
    -V, --version              Prints version information

OPTIONS:
    -a, --attribute <attribute>         Only return this attribute (if present) from selected elements
    -b, --base <base>                   Use this URL as the base for links
    -f, --filename <FILE>               The input file. Defaults to stdin
    -o, --output <FILE>                 The output file. Defaults to stdout
    -r, --remove-nodes <SELECTOR>...    Remove nodes matching this expression before output. May be specified multiple
                                        times

ARGS:
    <selector>...    The CSS expression to select [default: html]

Examples

Consider the following sample.html input file:

<html>
    <head> <title>Sample HTML</title> </head>
    <body>
        <p>Fantasy books:</p>
        <li>Harry Potter</li>
        <li>Cradle</li>
        <li>The Stormlight Archive</li>
        <a href="https://learnbyexample.github.io/">blog link</a>
    </body>
</html>

The --text option prints only the text content of the selected elements. With no selector given, the whole document is selected (the default selector is html). The -f option names the input file; without it, htmlq reads from stdin. The -w option skips text nodes that consist entirely of whitespace.

$ htmlq -w --text -f sample.html
Sample HTML
Fantasy books:
Harry Potter
Cradle
The Stormlight Archive
blog link
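
Because -f defaults to stdin, the same kind of extraction also works with a shell redirect or at the end of a pipeline. The following is a minimal sketch rather than captured output: it recreates sample.html in a temporary directory (the workdir and title variables are names chosen for this illustration) and skips the htmlq call when the tool is not installed.

```shell
# Recreate the sample input in a scratch directory (the location is arbitrary).
workdir=$(mktemp -d)
cat > "$workdir/sample.html" <<'EOF'
<html>
    <head> <title>Sample HTML</title> </head>
    <body>
        <p>Fantasy books:</p>
        <li>Harry Potter</li>
        <li>Cradle</li>
        <li>The Stormlight Archive</li>
        <a href="https://learnbyexample.github.io/">blog link</a>
    </body>
</html>
EOF

# No -f here: htmlq reads the document from stdin.
if command -v htmlq >/dev/null 2>&1; then
    title=$(htmlq --text 'title' < "$workdir/sample.html")
    printf '%s\n' "$title"
else
    echo 'htmlq not found; skipping' >&2
fi
```

The redirect can be replaced by any command that writes HTML to stdout, for example the output of a download tool piped into htmlq.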

To select specific elements, pass a CSS selector as an argument:

$ htmlq 'li' -f sample.html
<li>Harry Potter</li>
<li>Cradle</li>
<li>The Stormlight Archive</li>

$ htmlq --text 'li' -f sample.html
Harry Potter
Cradle
The Stormlight Archive

The -a option prints the value of the given attribute for each selected element that has it:

$ htmlq -a href 'a' -f sample.html
https://learnbyexample.github.io/
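
The -r and -o options listed in the help text can be combined with the options shown so far. As a hedged sketch (not captured output), the commands below select the body, remove the link element before output, and write the remaining text to a file. The scratch directory and the books.txt name are assumptions made for this illustration. The selector is placed before -r as a precaution, since -r accepts more than one value.

```shell
# Recreate the sample input in a scratch directory (the location is arbitrary).
workdir=$(mktemp -d)
cat > "$workdir/sample.html" <<'EOF'
<html>
    <head> <title>Sample HTML</title> </head>
    <body>
        <p>Fantasy books:</p>
        <li>Harry Potter</li>
        <li>Cradle</li>
        <li>The Stormlight Archive</li>
        <a href="https://learnbyexample.github.io/">blog link</a>
    </body>
</html>
EOF

if command -v htmlq >/dev/null 2>&1; then
    # Select <body>, drop the <a> element, and save the remaining text.
    htmlq -w --text 'body' -f "$workdir/sample.html" -o "$workdir/books.txt" -r 'a'
    cat "$workdir/books.txt"
else
    echo 'htmlq not found; skipping' >&2
fi
```

If htmlq behaves as its help text describes, books.txt should contain the paragraph text and the three book titles, but not the link text.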

The project's README also includes example use cases.




More to come