CurlyQ is a utility that provides a simple interface for curl, with additional features for things like extracting images and links, finding elements by CSS selector or XPath, getting detailed header info, and more. It’s designed to be part of a scripting pipeline, outputting everything as structured data (JSON or YAML). It also has rudimentary support for making calls to JSON endpoints easier, but it’s expected that you’ll use something like jq to parse the output.
Installation
Assuming you have Ruby and RubyGems installed, you can just run gem install curlyq. If you run into errors, try gem install --user-install curlyq, or use sudo gem install curlyq.
If you’re using Homebrew, you have the option to install via brew-gem:
brew install brew-gem
brew gem install curlyq
If you don’t have Ruby/RubyGems, you can install them pretty easily with Homebrew, rvm, or asdf.
Usage
Run curlyq help for a list of subcommands. Run curlyq help SUBCOMMAND for details on a particular subcommand and its options.
NAME
curlyq - A scriptable interface to curl
SYNOPSIS
curlyq [global options] command [command options] [arguments...]
VERSION
0.0.16
GLOBAL OPTIONS
--help - Show this message
--[no-]pretty - Output "pretty" JSON (default: enabled)
--version - Display the program version
-y, --[no-]yaml - Output YAML instead of json
COMMANDS
execute - Execute JavaScript on a URL
extract - Extract contents between two regular expressions
headlinks - Return all <head> links on URL's page
help - Shows a list of commands or help for one command
html, curl - Curl URL and output its elements, multiple URLs allowed
images - Extract all images from a URL
json - Get a JSON response from a URL, multiple URLs allowed
links - Return all links on a URL's page
scrape - Scrape a page using a web browser, for dynamic (JS) pages. Be sure to have the selected --browser installed.
screenshot - Save a screenshot of a URL
tags - Extract all instances of a tag
Query and Search syntax
You can shape the results using --search (-s) and --query (-q) on some commands.
A search uses either CSS or XPath syntax to locate elements. For example, if you wanted to locate all of the <article> elements with a class of post inside of the div with an id of main, you would run --search '#main article.post'. Searches can target tags, ids, and classes, and can accept > to target direct descendants. You can also use XPaths, but I hate those so I’m not going to document them.
I’ve tried to make the query function useful, but if you want to do any kind of advanced shaping, you’re better off piping the JSON output to jq or yq.
Queries are specifically for shaping CurlyQ output. If you’re using the html command, it returns a key called images, so you can target just the images in the response with -q 'images'. The queries accept array syntax, so to get the first image, you would use -q 'images[0]'. Ranges are accepted as well, so -q 'images[1..4]' will return the 2nd through 5th images found on the page. You can also do comparisons, e.g. -q 'images[rel=me]' to target only images with a rel attribute of me.
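To make the index and range behavior concrete, here is a minimal Python sketch of the semantics described above. It is not CurlyQ's implementation: the select helper and the sample data are hypothetical, and returning a one-element list for a single index is an assumption.

```python
# Hypothetical sample shaped like the "images" key of the html command's output.
response = {"images": [{"src": f"img{i}.png"} for i in range(6)]}


def select(items, spec):
    """Emulate an index ("0") or an inclusive range ("1..4") selector."""
    if ".." in spec:
        start, end = (int(part) for part in spec.split(".."))
        return items[start:end + 1]  # +1 because the range end is inclusive
    return [items[int(spec)]]


first = select(response["images"], "0")
second_to_fifth = select(response["images"], "1..4")
print([image["src"] for image in second_to_fifth])
```

The point to notice is that 1..4 includes both ends, so it yields four items (the 2nd through 5th), where a Python slice written 1:4 would yield three.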
The comparisons for the query flag are:
< less than
> greater than
<= less than or equal to
>= greater than or equal to
= or == is equal to
*= contains text
^= starts with text
$= ends with text
Comparisons can be numeric or string comparisons. A numeric comparison like curlyq images -q '[width>500]' URL would return all of the images on the page with a width attribute greater than 500.
You can also use dot syntax inside of comparisons, e.g. [links.rel*=me] to target the links object (html command), and return only the links with a rel=me attribute. If the comparison is to an array object (like class or rel), it will match if any of the elements of the array match your comparison.
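The comparison rules above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not CurlyQ's code: the matches helper and the sample images are hypothetical, string matching is assumed to be case-sensitive, and a value that cannot be read as a number is assumed to fail a numeric comparison.

```python
import operator

# Numeric comparisons from the list above.
NUMERIC_OPS = {"<": operator.lt, ">": operator.gt, "<=": operator.le, ">=": operator.ge}

# Equality and text comparisons from the list above.
STRING_OPS = {
    "=": lambda value, target: value == target,
    "==": lambda value, target: value == target,
    "*=": lambda value, target: target in value,
    "^=": lambda value, target: value.startswith(target),
    "$=": lambda value, target: value.endswith(target),
}


def matches(value, op, target):
    """Return True if value satisfies the comparison; arrays match if any element does."""
    if isinstance(value, list):
        return any(matches(item, op, target) for item in value)
    if op in NUMERIC_OPS:
        try:
            return NUMERIC_OPS[op](float(value), float(target))
        except (TypeError, ValueError):
            return False  # assumption: missing or non-numeric values never match
    return STRING_OPS[op](str(value), str(target))


# Hypothetical image objects; width is a string, as in CurlyQ's JSON output.
images = [
    {"src": "a.jpg", "width": "800", "class": ["aligncenter"]},
    {"src": "b.jpg", "width": "226", "class": ["icon", "small"]},
]
wide = [image for image in images if matches(image.get("width"), ">", "500")]
print([image["src"] for image in wide])
```

The list branch mirrors the rule that a comparison against an array attribute such as class or rel succeeds when any element matches.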
If you end the query with a specific key, only that key will be output. If there’s only one match, it will be output as a raw string. If there are multiple matches, output will be an array.
curlyq makes use of subcommands, e.g. curlyq html [options] URL or curlyq extract [options] URL. Each subcommand takes its own options, but I’ve made an effort to standardize the choices between each command as much as possible.
extract
Example:
$ curlyq extract -i -b 'Adding' -a 'accessing the source.' 'https://stackoverflow.com/questions/52428409/get-fully-rendered-html-using-selenium-webdriver-and-python'
["Adding <code>time.sleep(10)</code> in various places in case the page had not fully loaded when I was accessing the source."]
This specifies a before and after string and includes them (-i) in the result.
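As a rough model of what the before/after extraction does, the sketch below pulls the text between two markers and optionally keeps the markers, similar in spirit to -b, -a, -i, and -r. The extract_between function and the sample text are hypothetical, and the non-greedy match is an assumption about how overlapping candidates are resolved.

```python
import re


def extract_between(text, before, after, include=False, regex=False):
    """Return every span found between a 'before' and an 'after' marker."""
    if not regex:
        # Treat the markers as literal text unless regex mode is requested.
        before, after = re.escape(before), re.escape(after)
    pattern = re.compile(
        f"(?P<cq_before>{before})(?P<cq_body>.*?)(?P<cq_after>{after})",
        re.DOTALL,
    )
    return [
        match.group(0) if include else match.group("cq_body")
        for match in pattern.finditer(text)
    ]


# Hypothetical page text, loosely modeled on the example output above.
sample = (
    "Intro. Adding <code>time.sleep(10)</code> in various places "
    "when I was accessing the source. Outro."
)
with_markers = extract_between(sample, "Adding", "accessing the source.", include=True)
without_markers = extract_between(sample, "Adding", "accessing the source.")
print(with_markers)
```

With include set, the result starts at "Adding" and ends at "accessing the source."; without it, only the text between the two markers is returned.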
NAME
extract - Extract contents between two regular expressions
SYNOPSIS
curlyq [global options] extract [command options] URL...
COMMAND OPTIONS
-a, --after=arg - Text after extraction (default: none)
-b, --before=arg - Text before extraction (default: none)
-c, --[no-]compressed - Expect compressed results
--[no-]clean - Remove extra whitespace from results
-h, --header=arg - Define a header to send as key=value (may be used more than once, default: none)
-i, --[no-]include - Include the before/after matches in the result
-r, --[no-]regex - Process before/after strings as regular expressions
--[no-]strip - Strip HTML tags from results
execute
You can execute JavaScript on a given web page using the execute subcommand.
You can specify an element id to wait for using --id, and define a pause to wait after executing a script with --wait (defaults to 2 seconds). Scripts can be read from the command line arguments with --script "SCRIPT", from STDIN with --script -, or from a file using --script PATH.
If you expect a return value, be sure to include a return statement in your executed script. Results will be output to STDOUT.
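If you call execute from another script, a small wrapper can assemble the arguments. The following Python sketch uses only the flags documented for this subcommand; the function names, the example.com URL, and the sample script are hypothetical, and run_execute assumes curlyq is on your PATH.

```python
import subprocess


def build_execute_command(url, script, element_id=None, wait=None, browser=None):
    """Assemble an argument list for `curlyq execute` from the documented flags."""
    command = ["curlyq", "execute"]
    if browser:
        command += ["--browser", browser]
    if element_id:
        command += ["--id", element_id]
    if wait is not None:
        command += ["--wait", str(wait)]
    # The script needs its own return statement if you expect a value back.
    command += ["--script", script, url]
    return command


def run_execute(url, script, **options):
    """Run the command and return whatever curlyq wrote to STDOUT."""
    result = subprocess.run(
        build_execute_command(url, script, **options),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


command = build_execute_command(
    "https://example.com", "return document.title;", element_id="main", wait=5
)
print(command)
```

Passing the arguments as a list avoids shell quoting problems when the script contains quotes or semicolons.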
NAME
execute - Execute JavaScript on a URL
SYNOPSIS
curlyq [global options] execute [command options] URL...
COMMAND OPTIONS
-b, --browser=arg - Browser to use (firefox, chrome) (default: chrome)
-h, --header=arg - Define a header to send as key=value (may be used more than once, default: none)
-i, --id=arg - Element ID to wait for before executing (default: none)
-s, --script=arg - Script to execute, use - to read from STDIN (may be used more than once, default: none)
-w, --wait=arg - Seconds to wait after executing JS (default: 2)
headlinks
Returns all <link> elements from the <head> of the page. Add a query (-q) to narrow the results, e.g. to show only links with rel="stylesheet".
NAME
headlinks - Return all <head> links on URL's page
SYNOPSIS
curlyq [global options] headlinks [command options] URL...
COMMAND OPTIONS
-q, --query, --filter=arg - Filter output using dot-syntax path (default: none)
html
The html command (aliased as curl) gets the entire text of the web page and provides a JSON response with a breakdown of:
URL, after any redirects
Response code
Response headers as a keyed hash
Meta elements for the page as a keyed hash
All meta links in the head as an array of objects containing (as available):
    rel
    href
    type
    title
source of <head>
source of <body>
the page title (determined first by og:title, then by a title tag)
description (using og:description first)
All links on the page as an array of objects with:
    href
    title
    rel
    text content
    classes as array
All images on the page as an array of objects containing:
    class
    all attributes as key/value pairs
    width and height (if specified)
    src
    alt and title
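The title rule in the breakdown above (og:title first, then the title tag) can be illustrated with Python's standard HTML parser. This is only a sketch of the fallback order, not how CurlyQ parses pages; the TitleFinder class, the page_title helper, and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class TitleFinder(HTMLParser):
    """Collect the og:title meta value and the text of the <title> tag."""

    def __init__(self):
        super().__init__()
        self.og_title = None
        self.title_tag = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and attributes.get("property") == "og:title":
            self.og_title = attributes.get("content")
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_tag = (self.title_tag or "") + data


def page_title(html):
    """Prefer og:title, then fall back to the <title> tag."""
    finder = TitleFinder()
    finder.feed(html)
    if finder.og_title:
        return finder.og_title
    return finder.title_tag.strip() if finder.title_tag else None


both = '<head><meta property="og:title" content="OG Title"><title>Tag Title</title></head>'
only_tag = "<head><title> Tag Title </title></head>"
print(page_title(both), "|", page_title(only_tag))
```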
You can add a query (-q) to only get the information needed, e.g. -q 'images[width>600]'.
Example:
$ curlyq html -s '#main article .aligncenter' -q 'images[1]' 'https://brettterpstra.com'
[
  {
    "class": "aligncenter",
    "original": "https://cdn3.brettterpstra.com/uploads/2023/09/giveaway-keyboardmaestro2024-rb_tw.jpg",
    "at2x": "https://cdn3.brettterpstra.com/uploads/2023/09/giveaway-keyboardmaestro2024-rb@2x.jpg",
    "width": "800",
    "height": "226",
    "src": "https://cdn3.brettterpstra.com/uploads/2023/09/giveaway-keyboardmaestro2024-rb.jpg",
    "alt": "Giveaway Robot with Keyboard Maestro icon",
    "title": "Giveaway Robot with Keyboard Maestro icon"
  }
]
The above example queries the full html of the page, but narrows the elements using --search and then takes the 2nd image from the results.
$ curlyq html -q 'meta.title' https://brettterpstra.com/2024/01/10/introducing-curlyq-a-pipeline-oriented-curl-helper/
Introducing CurlyQ, a pipeline-oriented curl helper - BrettTerpstra.com
The above example curls the page and returns the title attribute found in the meta (-q 'meta.title').
NAME
html - Curl URL and output its elements, multiple URLs allowed
SYNOPSIS
curlyq [global options] html [command options] URL...
COMMAND OPTIONS
-I, --info - Only retrieve headers/info
-b, --browser=arg - Use a browser to retrieve a dynamic web page (firefox, chrome) (default: none)
-c, --compressed - Expect compressed results
--[no-]clean - Remove extra whitespace from results
-f, --fallback=arg - If curl doesn't work, use a fallback browser (firefox, chrome) (default: none)
-h, --header=arg - Define a header to send as "key=value" (may be used more than once, default: none)
--[no-]ignore_fragments - Ignore fragment hrefs when gathering content links
--[no-]ignore_relative - Ignore relative hrefs when gathering content links
-l, --local_links_only - Only gather internal (same-site) links
-q, --query, --filter=arg - Filter output using dot-syntax path (default: none)
-r, --raw=arg - Output a raw value for a key (default: none)
-s, --search=arg - Return an array of matches to a CSS or XPath query (default: none)
-x, --external_links_only - Only gather external links
images
The images command returns only the images on the page as an array of objects. It can be queried to match certain requirements (see Query and Search syntax above).
The base command will return all images on the page, including OpenGraph images from the head, <img> tags from the body, and <srcset> tags along with their child images.
OpenGraph images will be returned with the structure:
A query such as -q '[width>750]' will only return images that have a width greater than 750 pixels. This query depends on the images having proper width attributes set on them in the source.
NAME
images - Extract all images from a URL
SYNOPSIS
curlyq [global options] images [command options] URL...
COMMAND OPTIONS
-c, --[no-]compressed - Expect compressed results
--[no-]clean - Remove extra whitespace from results
-h, --header=arg - Define a header to send as key=value (may be used more than once, default: none)
-q, --query, --filter=arg - Filter output using dot-syntax path (default: none)
-t, --type=arg - Type of images to return (img, srcset, opengraph, all) (may be used more than once, default: ["all"])
json
The json command just returns an object with header/response info, and the contents of the JSON response after it’s been read by the Ruby JSON library and output. If there are fetching or parsing errors it will fail gracefully with an error code.
NAME
json - Get a JSON response from a URL, multiple URLs allowed
SYNOPSIS
curlyq [global options] json [command options] URL...
COMMAND OPTIONS
-c, --[no-]compressed - Expect compressed results
-h, --header=arg - Define a header to send as key=value (may be used more than once, default: none)
-q, --query, --filter=arg - Filter output using dot-syntax path (default: none)
links
Returns all the links on the page, which can be queried on any attribute.
For example, -q '[content*=twitter]' gets all links from the page but only returns the ones with link content containing ‘twitter’.
NAME
links - Return all links on a URL's page
SYNOPSIS
curlyq [global options] links [command options] URL...
COMMAND OPTIONS
-d, --[no-]dedup - Filter out duplicate links, preserving only first one
--[no-]ignore_fragments - Ignore fragment hrefs when gathering content links
--[no-]ignore_relative - Ignore relative hrefs when gathering content links
-l, --local_links_only - Only gather internal (same-site) links
-q, --query, --filter=arg - Filter output using dot-syntax path (default: none)
-x, --external_links_only - Only gather external links
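If you prefer to post-process the link array yourself, the effect of --dedup and --local_links_only can be approximated as below. This is a sketch under assumptions, not CurlyQ's logic: the helper names and sample links are hypothetical, duplicates are compared by href only, and a link counts as local when it resolves to the same host as the page.

```python
from urllib.parse import urljoin, urlparse


def dedup_links(links):
    """Drop links whose href was already seen, keeping the first occurrence."""
    seen = set()
    unique = []
    for link in links:
        if link["href"] not in seen:
            seen.add(link["href"])
            unique.append(link)
    return unique


def is_local(href, page_url):
    """Treat a link as local when it resolves to the same host as the page."""
    return urlparse(urljoin(page_url, href)).netloc == urlparse(page_url).netloc


# Hypothetical link objects with only the href key filled in.
links = [
    {"href": "https://example.com/a"},
    {"href": "/b"},
    {"href": "https://example.com/a"},
    {"href": "https://other.example.org/c"},
]
unique_links = dedup_links(links)
local_links = [link for link in unique_links if is_local(link["href"], "https://example.com/")]
print([link["href"] for link in local_links])
```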
scrape
Loads the page in a web browser, allowing scraping of dynamically loaded pages that return nothing but scripts when curled. The -b (--browser) option is required and should be ‘chrome’ or ‘firefox’ (or just ‘c’ or ‘f’). The selected browser must be installed on your system.
For example, you can scrape a page using Firefox and add a query to find the first link with a rel of ‘me’ and text containing ‘mastodon’.
NAME
scrape - Scrape a page using a web browser, for dynamic (JS) pages. Be sure to have the selected --browser installed.
SYNOPSIS
curlyq [global options] scrape [command options] URL...
COMMAND OPTIONS
-b, --browser=arg - Browser to use (firefox, chrome) (required, default: none)
--[no-]clean - Remove extra whitespace from results
-h, --header=arg - Define a header to send as "key=value" (may be used more than once, default: none)
-q, --query, --filter=arg - Filter output using dot-syntax path (default: none)
-r, --raw=arg - Output a raw value for a key (default: none)
--search=arg - Return an array of matches to a CSS or XPath query (default: none)
screenshot
Full-page screenshots require Firefox, installed and specified with --browser firefox.
Type defaults to full, but will only work if -b is Firefox. If you want to use Chrome, you must specify a --type as ‘visible’ or ‘print’.
The -o (--output) flag is required. It should be a path to a target PNG file (or PDF for -t print output). Extension will be modified automatically, all you need is the base name.
Example:
$ curlyq screenshot -b f -o ~/Desktop/test https://brettterpstra.com/2024/01/10/introducing-curlyq-a-pipeline-oriented-curl-helper/
Screenshot saved to /Users/ttscoff/Desktop/test.png
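The type, browser, and output rules above can be summarized in a short Python sketch. It models the documented constraints only and is not CurlyQ's code: plan_screenshot is a hypothetical helper, its defaults are taken from the help output for this subcommand, and it does not handle the single-letter browser abbreviations.

```python
from pathlib import Path


def plan_screenshot(base, shot_type="visible", browser="chrome"):
    """Validate the type/browser pairing and return the expected output path."""
    if shot_type not in ("full", "print", "visible"):
        raise ValueError(f"unknown screenshot type: {shot_type}")
    if shot_type == "full" and browser != "firefox":
        raise ValueError("full-page screenshots require firefox")
    # Print output is a PDF; every other type is a PNG.
    suffix = ".pdf" if shot_type == "print" else ".png"
    return Path(base).with_suffix(suffix)


print(plan_screenshot("/tmp/test", "full", "firefox"))
print(plan_screenshot("/tmp/test.png", "print"))
```

Note that with_suffix replaces an existing extension, so a base name containing a dot (for example shot.v2) would lose its last segment in this sketch.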
You can wait for an element ID to be visible using --id. This can be any #ID on the page. If the ID doesn’t exist on the page, though, the screenshot will hang for a timeout of 10 seconds.
You can execute a script before taking the screenshot with the --script flag. If this is set to -, it will read the script from STDIN. If it’s set to an existing file path, that file will be read for script input. Specify an interval (in seconds) to wait after executing the script with --wait.
NAME
screenshot - Save a screenshot of a URL
SYNOPSIS
curlyq [global options] screenshot [command options] URL...
COMMAND OPTIONS
-b, --browser=arg - Browser to use (firefox, chrome) (default: chrome)
-h, --header=arg - Define a header to send as key=value (may be used more than once, default: none)
-i, --id=arg - Element ID to wait for before taking screenshot (default: none)
-o, --out, --file=arg - File destination (required, default: none)
-s, --script=arg - Script to execute before taking screenshot (may be used more than once, default: none)
-t, --type=arg - Type of screenshot to save (full (requires firefox), print, visible) (default: visible)
-w, --wait=arg - Time to wait before taking screenshot (default: 0)
tags
Return a hierarchy of all tags in a page. Use -t to limit to a specific tag.
For example, you can filter the tags with a CSS query (--search), then narrow them further with a query such as -q '[attrs.id*=what]' to get just the tags with an id containing ‘what’.
NAME
tags - Extract all instances of a tag
SYNOPSIS
curlyq [global options] tags [command options] URL...
COMMAND OPTIONS
-c, --[no-]compressed - Expect compressed results
--[no-]clean - Remove extra whitespace from results
-h, --header=KEY=VAL - Define a header to send as key=value (may be used more than once, default: none)
-q, --query, --filter=DOT_SYNTAX - Dot syntax query to filter results (default: none)
--search=CSS/XPATH - Return an array of matches to a CSS or XPath query (default: none)
--[no-]source, --[no-]html - Output the HTML source of the results
-t, --tag=TAG - Specify a tag to collect (may be used more than once, default: none)
Changelog
0.0.16
2024-11-07 06:45
FIXED
Encoding error
0.0.15
2024-10-25 10:31
IMPROVED
Better error when no results, return nothing to STDOUT
0.0.14
2024-10-25 10:26
FIXED
Fix permissions
0.0.13
2024-10-25 10:23
FIXED
Fix tests, handle empty results better
0.0.12
2024-04-04 13:06
NEW
Add --script option to screenshot command
Add execute command for executing JavaScript on a page
0.0.11
2024-01-21 15:29
IMPROVED
Add option for --local_links_only to html and links command, only returning links with the same origin site
0.0.10
2024-01-17 13:50
IMPROVED
Update YARD documentation
Breaking change, ensure all return types are Arrays, even with single objects, to aid in scriptability
Screenshot test suite
0.0.9
2024-01-16 12:38
IMPROVED
You can now use dot syntax inside of a square bracket comparison in --query ([attrs.id*=what])
*=, ^=, $=, and == work with array values
[] comparisons with no comparison, e.g. [attrs.id], will return every match that has that element populated
0.0.8
2024-01-15 16:45
IMPROVED
Dot syntax query can now operate on a full array using empty set []
Dot syntax query should output a specific key, e.g. attrs[id*=news].content (work in progress)
Dot query syntax handling touch-ups. Piping to jq is still more flexible, but the basics are there.
0.0.7
2024-01-12 17:03
FIXED
Revert back to offering single response (no array) in cases where there are single results (for some commands)
0.0.6
2024-01-12 14:44
CHANGED
Attributes array is now a hash directly keyed to the attribute key
NEW
Tags command has option to output only raw html of matched tags
FIXED
--query works with --search on scrape and tags command
Json command dot query works now
0.0.5
2024-01-11 18:06
IMPROVED
Add --query capabilities to images command
Add --query to links command
Allow hyphens in query syntax
Allow any character other than comma, ampersand, or right square bracket in query value
FIXED
Html --search returns a full Curl::Html object
--query works better with --search and is consistent with other query functions
Scrape command outputting malformed data
Hash output when --query is used with scrape
Nil match on tags command
0.0.4
2024-01-10 13:54
FIXED
Queries combined with + or & not requiring all matches to be true
0.0.3
2024-01-10 13:38
IMPROVED
Refactor Curl and Json libs to allow setting of options after creation of object
Allow setting of headers on most subcommands
--clean now affects source, head, and body keys of output