I’ve been putting a little more time into CurlyQ this week, as I’m able.
First thing to note is a breaking change: CurlyQ will always return an array now, even if there’s only one result. I had waffled on this a little, but for predictability in scripting the output really needs to be in a consistent format. So even a single-string result, e.g. a command that targets a single element with --search and then uses .source in the query (which previously would have returned just the source string for the matched tag), will now return an array containing a single string.
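For scripts that used to expect a bare string, here is a minimal sketch of unwrapping the array with jq. The helper names first_result and each_result are made up for illustration, and the html subcommand and h1 selector in the usage comment are assumptions rather than something taken from this post.

```shell
#!/bin/sh
# Hypothetical helpers for consuming CurlyQ's always-an-array output.
# Both read a JSON array on stdin and require jq to be installed.

# Print the first element as raw text; print nothing for an empty array.
first_result() {
  jq -r '.[0] // empty'
}

# Print every element as raw text, one per line.
each_result() {
  jq -r '.[]'
}

# Usage (assumed subcommand and selector, adjust to your own command):
#   curlyq html --search 'h1' --query '.source' URL | first_result
```

Because the result is now always an array, the same helper works whether the command matched one tag or twenty.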
Secondly, I’ve put a considerable amount of effort into the --query feature. You can now use jq-like syntax to query multiple items in an array, use dot-syntax for attribute comparisons, and use comparisons (like ^=) on hashes, returning true if any value in the hash matches the query. Still, if you want the full power of something like jq or yq, you can just pipe the output to either and work with more familiar tools.
But on to a cool thing. I mentioned CurlyQ’s screenshot capability in the intro post, but it’s received some improvements, and I thought it deserved a little more detail.
I incorporated Selenium to allow scraping of dynamic web pages. One of the features Selenium provides is screenshots saved from the browser of choice. Thus CurlyQ has a screenshot feature:
curlyq screenshot -b 'firefox' -t 'full' -o 'screenshot_name' URL
The --browser flag (-b) determines whether it uses Chrome or Firefox, and the selected browser must be installed on your system. The full-page capture (-t full) is only available with Firefox. Chrome can only output visible (the visible part of the page on first load) and print, a print version of the page with @media print styling applied. Firefox can output all types.
The --type flag (-t) accepts full, visible, and print. With -t full and -b firefox, you get a full-length version of the rendered page, including offscreen elements. All of these can be abbreviated to their first letter, e.g. -t f or -b c.
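If you’re scripting screenshots, it can help to check the browser/type combination before launching anything. The following is a rough sketch of the rules described above (full requires Firefox, first-letter abbreviations allowed). The function names are hypothetical and this is not how CurlyQ validates flags internally.

```shell
#!/bin/sh
# Expand a browser name or its first-letter abbreviation.
expand_browser() {
  case "$1" in
    f|firefox) echo firefox ;;
    c|chrome)  echo chrome ;;
    *) return 1 ;;
  esac
}

# Expand a screenshot type or its first-letter abbreviation.
expand_type() {
  case "$1" in
    f|full)    echo full ;;
    v|visible) echo visible ;;
    p|print)   echo print ;;
    *) return 1 ;;
  esac
}

# Succeed only when the given browser can produce the given type.
combo_supported() {
  browser=$(expand_browser "$1") || return 1
  shot_type=$(expand_type "$2") || return 1
  # Full-page capture is the one type Chrome cannot produce.
  if [ "$shot_type" = full ] && [ "$browser" != firefox ]; then
    return 1
  fi
  return 0
}

# Hypothetical wrapper: browser, type, output name, URL.
safe_screenshot() {
  if combo_supported "$1" "$2"; then
    curlyq screenshot -b "$1" -t "$2" -o "$3" "$4"
  else
    echo "unsupported combination: -b $1 -t $2" >&2
    return 1
  fi
}
```

Something like `safe_screenshot c f home URL` would then fail fast with a message instead of starting a browser session.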
The --output flag (-o) is required and determines the path/name of the output file. Providing just a name will save the file to the current directory. Extensions can be provided but will be changed depending on output type: .png for full and visible, .pdf for print. So you can just provide a name without extension and CurlyQ will apply the appropriate extension.
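When a script needs to know where the file will end up, the extension rule can be mirrored in a few lines of shell. This is only a sketch of the behavior as described here. The function name expected_output is hypothetical, and treating everything after the last dot in the file name as an extension is my assumption, so CurlyQ’s handling of edge cases may differ.

```shell
#!/bin/sh
# Predict the file name produced for a given -o value and -t type.
expected_output() {
  out=$1
  case "$2" in
    f|full|v|visible) ext=png ;;
    p|print)          ext=pdf ;;
    *) echo "unknown type: $2" >&2; return 1 ;;
  esac
  base=${out##*/}      # last path component
  dir=${out%"$base"}   # leading directories, with trailing slash (may be empty)
  # Drop an existing extension, leaving dotfile-style names untouched.
  case "$base" in
    ?*.*) base=${base%.*} ;;
  esac
  printf '%s%s.%s\n' "$dir" "$base" "$ext"
}

# Usage:
#   expected_output screenshot_name full
#   expected_output shots/home.jpg print
```

The first call prints screenshot_name.png and the second prints shots/home.pdf.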
As a side note, saving a screenshot with -t print will output a PDF with actual text that can be searched by Spotlight (and other tools). So you could conceivably use CurlyQ to crawl an entire site (by parsing the links subcommand output and spidering) and save every page to a searchable PDF. I don’t know offhand why you’d do that, but it’s possible.
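For the curious, a one-level version of that idea might look like the sketch below. It reads URLs from stdin and saves each as a PDF using the screenshot command shown earlier. The function names and the naming scheme are hypothetical, and it does no recursion, rate limiting, or same-domain filtering. Dots are replaced in the generated names so that nothing in them gets mistaken for an extension.

```shell
#!/bin/sh
# Turn a URL into a flat, filesystem-friendly name (hypothetical scheme).
url_to_name() {
  printf '%s\n' "$1" \
    | sed -e 's|^[a-zA-Z]*://||' -e 's|[^A-Za-z0-9_-]|_|g' -e 's|_*$||'
}

# Read URLs from stdin, one per line, and save each as a searchable PDF
# in the directory given as the first argument (default: current dir).
save_pdfs() {
  outdir=${1:-.}
  mkdir -p "$outdir" || return 1
  # sort -u drops duplicate links so each page is only captured once.
  sort -u | while IFS= read -r url; do
    [ -n "$url" ] || continue
    # </dev/null keeps the command from consuming the URL list on stdin.
    curlyq screenshot -b firefox -t print \
      -o "$outdir/$(url_to_name "$url")" "$url" </dev/null \
      || echo "failed: $url" >&2
  done
}
```

Assuming the links subcommand emits a JSON array of objects with an href key (check the actual output before relying on this), the list could be fed in with something like `curlyq links URL | jq -r '.[].href' | save_pdfs pdfs`.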
CurlyQ is still being refined and your input is welcome. Join me on the Forum, or just message me on Mastodon with suggestions and bug reports.
See the project page for full details.