Today I’m releasing an initial version of my latest tool, CurlyQ. It’s a work in progress, though it should be immediately useful to those who need it. I need your input on where it goes next, what’s missing, and what you’d like to do with it that it can’t handle yet. Join me in the forum to discuss![1]
CurlyQ is a helper for the curl command, with some extra functionality. Sure, it can grab the contents of a web page, but it can also provide a breakdown of all of the metadata, page images, and page links, and it can work with dynamic pages (where the page is loaded by a JavaScript call and the raw source is empty except for script tags). It will even take screenshots. It’s designed to alleviate some of the chores of scraping web pages or getting JSON responses.
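For a quick taste, a one-liner like the following is the kind of thing it’s built for. This is just a sketch — I’m assuming the images subcommand name matches the current release, and the URL is a placeholder:

    # List every image reference found on the page as structured data
    curlyq images 'https://example.com'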
A Scripter’s Tool
CurlyQ is designed to be part of a scripting pipeline, making it as simple as possible to do something like get a page’s title, find the largest image on the page, or examine and validate all the links on a page. You can query the results based on any attribute of the returned tag, showing, for example, only links with a rel=me attribute or a paragraph with a certain class. The tags subcommand can output a hierarchy of all tags on the page, with each parent tag containing a tags key with its immediate children, on down the line. This can be queried and filtered using command line flags.
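As a rough sketch, filtering that hierarchy looks something like this (the --search flag name is an assumption on my part; check the subcommand’s help output for the exact options, and swap in your own URL):

    # Dump the tag hierarchy for the page, narrowed to header tags
    # (--search flag name assumed)
    curlyq tags --search 'header' 'https://example.com'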
Failure Prevention
This tool has multiple User Agent strings configured and can accept custom headers. Some sites block requests with certain (or missing) User Agents, so if a request fails, CurlyQ has a built-in retry that cycles through various User Agent strings. It can also handle pages that respond with gzipped data using --compressed on the command line. If you don’t use --compressed and it detects gzipped data, it will fail gracefully and notify you that you need to add the flag. I may make this an automatic fallback in the future. You can also specify a browser as a fallback (Chrome or Firefox), so if regular curling fails or is blocked, it can actually load the page in a web browser and retrieve/process the source from the window.
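For a stubborn page, that ends up looking something like the sketch below. The --compressed flag is described above; the --fallback flag name is an assumption, so check the help output for the exact spelling:

    # Handle gzipped responses, and fall back to loading the page
    # in Firefox if the plain curl request is blocked
    # (--fallback flag name assumed)
    curlyq html --compressed --fallback firefox 'https://example.com'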
Retrieving Page Elements
CurlyQ also incorporates Nokogiri, allowing it to perform element selection using CSS selectors or XPaths. For example, the html command accepts --search 'article header h3' to return an array of all h3s contained in a header tag inside an article tag on the page. It can output as JSON or YAML, and for queries that target a specific element or key in the response, you can output raw strings.
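Put together, that example looks like this (the selector and URL are just stand-ins for whatever you’re after):

    # Return an array of every h3 inside a header inside an article
    curlyq html --search 'article header h3' 'https://example.com'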
There are also tools for extracting content between two strings, returning an array of all matches on the page. The idea is that in cases where you need to extract content that might not be easily located with tags, you can provide before/after strings to extract the necessary information.
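A sketch of that, assuming --before and --after are the flag names for the delimiting strings (the markers here are placeholders):

    # Pull every chunk of source that falls between the two markers
    # (--before/--after flag names assumed)
    curlyq extract --before '<!-- start -->' --after '<!-- end -->' 'https://example.com'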
Ready, Set, Shoot
CurlyQ can take several types of screenshots: full page (one long PNG), visible page (just the part of the page initially visible in the browser), or the print output, applying @media print stylesheets. It doesn’t currently offer any type of image manipulation, but it might someday at least be able to create miniature versions (thumbnails) automatically.
The screenshot capability works best with Firefox. You can shoot the visible part of pages using Chrome, but full-page screenshots require Firefox. CurlyQ uses Selenium to load an instance of the selected browser and grab the rendered source or take a screenshot. The source is fed through the same processor as a regular curl call, so most aspects remain the same, though the result is missing response header info. CurlyQ does not aim to be a full web automation tool; for that, you’ll want to get accustomed to using a headless browser in your scripting language of choice.
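A full-page shot with Firefox would look roughly like this — the --browser, --type, and --out flag names (and the "full" value) are assumptions, so treat this as a sketch and check the screenshot subcommand’s help for the real options:

    # One long PNG of the entire page, rendered in Firefox
    # (flag names and type value assumed)
    curlyq screenshot --browser firefox --type full --out page.png 'https://example.com'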
JSON Handling
There’s very limited support for handling JSON responses. It currently only handles GET requests and allows you to specify request headers, returning the response headers as well as the (optionally pretty-printed) results of parsing the JSON, all as one JSON or YAML blob. It’s assumed that you’ll do any handling of the results using something like jq or yq. It can cycle through User Agent strings to find one that works, and it gracefully returns a response code and headers on errors.
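In practice that means something like the following, with jq doing the real filtering afterward. The --header flag name and the shape of the output blob are assumptions on my part, so adjust to what your version actually emits:

    # Fetch a JSON endpoint with a custom header, then pretty-print with jq
    # (--header flag name and output structure assumed)
    curlyq json --header 'Accept: application/json' 'https://example.com/api/items' | jq '.'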
What’s Next
One major area that’s missing right now is the ability to make requests other than GET. I would like to add POST capabilities, accepting data from command line flags or just passing it a JSON blob on STDIN or from a file. That’s for the next version, though.
I would greatly appreciate feedback on this tool. If you have a use for something like it, but it doesn’t do quite what you need, please list your use cases and expectations in the Issues on GitHub. I’d love to flesh this out into an all-purpose web scraping tool.
See the CurlyQ project page for more details on installation and usage. I look forward to your feedback (in the forum or on GitHub), positive or negative!
[1] Leaving a comment on this page will automatically create a new forum topic if there isn’t one, or add to an existing topic.