This is a work-in-progress draft version.
htmlq
From htmlq: Like jq, but for HTML. Uses CSS selectors to extract bits of content from HTML files.
Installation
See htmlq: installation for details.
Documentation
From htmlq --help:
$ htmlq --help
htmlq 0.4.0
Michael Maclean <michael@mgdm.net>
Runs CSS selectors on HTML
USAGE:
htmlq [FLAGS] [OPTIONS] [--] [selector]...
FLAGS:
-B, --detect-base Try to detect the base URL from the <base> tag in the document. If not found, default to
the value of --base, if supplied
-h, --help Prints help information
-w, --ignore-whitespace When printing text nodes, ignore those that consist entirely of whitespace
-p, --pretty Pretty-print the serialised output
-t, --text Output only the contents of text nodes inside selected elements
-V, --version Prints version information
OPTIONS:
-a, --attribute <attribute> Only return this attribute (if present) from selected elements
-b, --base <base> Use this URL as the base for links
-f, --filename <FILE> The input file. Defaults to stdin
-o, --output <FILE> The output file. Defaults to stdout
-r, --remove-nodes <SELECTOR>... Remove nodes matching this expression before output. May be specified multiple
times
ARGS:
<selector>... The CSS expression to select [default: html]
Examples
Consider the following sample.html input file:
<html>
<head> <title>Sample HTML</title> </head>
<body>
<p>Fantasy books:</p>
<li>Harry Potter</li>
<li>Cradle</li>
<li>The Stormlight Archive</li>
<a href="https://learnbyexample.github.io/">blog link</a>
</body>
</html>
The --text option shows only the text content of the selected elements, which is the entire document by default. The -f option specifies the input file (stdin is used by default). The -w option ignores text nodes that contain only whitespace.
$ htmlq -w --text -f sample.html
Sample HTML
Fantasy books:
Harry Potter
Cradle
The Stormlight Archive
blog link
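Since the help text says the input defaults to stdin, HTML can also be piped in instead of using -f. The following is a minimal sketch, assuming htmlq is installed and on the PATH; the inline HTML in the html variable is a made-up stand-in for real input.

```shell
# Hypothetical inline input; any HTML arriving on stdin is handled the same way.
html='<body><p>Fantasy books:</p><li>Harry Potter</li><li>Cradle</li></body>'

if command -v htmlq >/dev/null 2>&1; then
    # No -f option given, so htmlq reads the document from stdin.
    printf '%s\n' "$html" | htmlq --text 'li'
else
    echo "htmlq is not installed" >&2
fi
```

This form is handy when the HTML comes from another command, such as a downloader, rather than from a file on disk.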
To select specific elements, pass a CSS selector as an argument:
$ htmlq 'li' -f sample.html
<li>Harry Potter</li>
<li>Cradle</li>
<li>The Stormlight Archive</li>
$ htmlq --text 'li' -f sample.html
Harry Potter
Cradle
The Stormlight Archive
The -a option returns the value of the given attribute from the selected elements:
$ htmlq -a href 'a' -f sample.html
https://learnbyexample.github.io/
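The help text also lists -r (--remove-nodes), which removes matching nodes before output. The following is a rough sketch, assuming htmlq 0.4.0 is on the PATH; the inline HTML is a made-up input. The selector is placed before -r as a precaution, because the help shows that -r accepts multiple values and a trailing selector might otherwise be read as another node to remove.

```shell
# Hypothetical inline input used only for illustration.
html='<body><p>Fantasy books:</p><li>Harry Potter</li><li>Cradle</li></body>'

if command -v htmlq >/dev/null 2>&1; then
    # Select <body>, but drop every <li> inside it before printing.
    printf '%s\n' "$html" | htmlq 'body' -r 'li'

    # The same removal combined with --text to print only the remaining text.
    printf '%s\n' "$html" | htmlq --text 'body' -r 'li'
else
    echo "htmlq is not installed" >&2
fi
```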
The project's README also includes example use cases.
More to come