As promised, I’ve updated Read2Text as a standalone binary using the Swift versions of Readability and HTML2Text. In the process I made it a whole new tool called Gather. You may recall that name from a little utility I made a while back. This does the exact same thing, just without the GUI, and I liked that name enough to revive it.
I designed it to be flexible and easy to drop into any kind of workflow, so you can use it in Services and Shortcuts, PopClip extensions, LaunchBar Actions, Alfred, Raycast, whatever floats your boat. It can take URLs, raw HTML, even rich text from web page copies. It can accept arguments, piped input via STDIN, and even pull URLs and HTML data out of environment variables. Whatever you want to do with it, it should be flexible enough to handle.
And it does a better, more consistent job than Marky (my API-based markdownifier) ever did.
What Is This?
If you’ve never seen this tool before, it allows you to turn any URL into Markdown text with comments, ads, etc. stripped out, and most Markdown-compatible elements properly converted. To use it, just run gather https://yoururl.com on the command line. It will output the result to STDOUT, so you can add | pbcopy to the end of it to clip it directly to the clipboard. The tool has a bunch more options that I’ve detailed in the README.
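If you do this a lot, that one-liner can be wrapped in a small shell function. This is only a sketch: the name clipmd is made up for illustration and isn't part of Gather, and the only things it relies on are the gather URL invocation and pbcopy mentioned above.

```shell
# clipmd: hypothetical helper name, not part of Gather.
# Converts the page at the given URL to Markdown and puts it on the clipboard.
clipmd() {
  if [ -z "$1" ]; then
    echo "usage: clipmd URL" >&2
    return 1
  fi
  # gather writes Markdown to STDOUT; pbcopy reads STDIN into the clipboard
  gather "$1" | pbcopy
}
```

Drop it in your shell profile and call it as clipmd https://yoururl.com.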
And yes, I know PopClip and some other tools have some HTML-to-Markdown conversion built in already. What sets Gather apart is its ability to locate the important content of a web page, its handling of some advanced things like tables and definition lists, advanced handling of highlighted code blocks, and other products of my weekend obsessions.
Gimme
This version is Mac-only (but a Universal Binary). Sorry, everybody else. For people on other systems that happen to have Python available, you can still use the older, Python-based read2text.
To install, download the zip at the end of the post, unzip, and move the gather executable into your path. If you’re on an Intel Mac, your best bet is /usr/local/bin. If you’re on an M1 Mac, you’ll want to use /opt/homebrew/bin or similar. Anything in your PATH is good. I might eventually see if I can get this set up as a Homebrew formula, but it needs testing first.
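Here's a rough sketch of that install step as a script. The function names are hypothetical, and it assumes Homebrew's default bin directories (/opt/homebrew/bin on Apple Silicon, /usr/local/bin on Intel). Any other directory in your PATH can be passed as the second argument instead.

```shell
# gather_bin_dir: hypothetical helper that prints a likely install directory.
# Assumes Homebrew's default prefixes; adjust if your PATH differs.
gather_bin_dir() {
  case "$(uname -m)" in
    arm64) echo "/opt/homebrew/bin" ;;  # Apple Silicon
    *)     echo "/usr/local/bin" ;;     # Intel
  esac
}

# install_gather: hypothetical helper that moves an unzipped gather binary
# ($1) into a directory ($2, defaulting to the guess above).
install_gather() {
  dest="${2:-$(gather_bin_dir)}"
  mkdir -p "$dest" && mv "$1" "$dest/gather" && chmod +x "$dest/gather"
}
```

Run it as install_gather ~/Downloads/gather, adjusting the path to wherever you unzipped the download. Depending on permissions for the target directory, you may need to do the move with sudo instead.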
Now What?
Obviously you can use this in Terminal, and it will fit perfectly into scripted solutions or just as a standalone tool.
This binary can also be called from Shortcuts and Services. It takes a single argument or accepts raw HTML or a URL on STDIN if you add --stdin as an argument. To process raw HTML, be sure to include the --html argument. It can also pull content from your clipboard or an environment variable. See the docs for more info.
curl mywebsite.com/article | gather --stdin --html --file article.md
You can also include --no-readability to skip using Readability. If the default parsing cuts out too much of the article for you, just run again with --no-readability to get the entire thing (including ads, menus, etc., unfortunately).
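If you want to automate that "run it again" step, something like the following might work. It's a sketch under assumptions: the function name gather_full and the 200-word default threshold are invented for illustration, and a word count is only a crude stand-in for "cut out too much."

```shell
# gather_full: hypothetical helper, not part of Gather.
# Runs gather normally, then falls back to --no-readability when the result
# has fewer words than a threshold (default 200, an arbitrary guess).
gather_full() {
  url="$1"
  min_words="${2:-200}"
  out="$(gather "$url")"
  # wc pads its output with spaces on macOS, so strip them before comparing
  words="$(printf '%s\n' "$out" | wc -w | tr -d ' ')"
  if [ "$words" -lt "$min_words" ]; then
    out="$(gather --no-readability "$url")"
  fi
  printf '%s\n' "$out"
}
```

Call it as gather_full https://yoururl.com, or pass a second argument to change the threshold.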
There are also options for switching to inline links, determining how links are sorted, and handling Unicode. Check it out.
I’ll be making a new LaunchBar action soon, if I have time. I’d love it if anyone interested started using this in other tools and letting me know how it goes. Should be good as part of a Shortcuts or Automator workflow, and I intend to use it in some of my Mac apps as an NSTask that I can just shell out to. I did, however, design it as a Swift Package that you can download and play with for your own nefarious purposes. I’m literally just now learning Swift, so be ready for some funky code that I’ll clean up as I learn more.
One More Thing…
Back when I had Marky the Markdownifier working, it had some special handling for various sites like StackOverflow, Twitter, and I don’t even remember what else. I intend to add all of those back into Gather, eventually. Parsing web pages is a fragile thing, as a single change to a class name can break it, so I don’t go overboard with it.
That said, I really missed the StackExchange special handling, so now when saving a Question page from any StackExchange site (like StackOverflow or AskDifferent), the formatting will automatically be cleaned up, accepted answers will be moved to the top, and comments will be excluded by default.
There are special options for --accepted-only, --min-upvotes X, and --include-comments. Those seem pretty self-explanatory.
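For the note-taking use case, those options could be wrapped up roughly like this. The function names and the default of 5 upvotes are hypothetical, and I'm assuming the options can be placed before the URL. Check the README if gather complains.

```shell
# save_accepted: hypothetical helper, not part of Gather.
# Saves only the accepted answer of a StackExchange question ($1) to a file ($2).
save_accepted() {
  gather --accepted-only "$1" > "$2"
}

# save_top_answers: hypothetical helper, not part of Gather.
# Keeps answers with at least $3 upvotes (default 5, an arbitrary choice)
# and includes comments, writing the result to a file ($2).
save_top_answers() {
  gather --min-upvotes "${3:-5}" --include-comments "$1" > "$2"
}
```

Usage would be save_accepted followed by the question URL and an output file such as answer.md.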
I put a lot of StackOverflow answers into my notes so I have a highly searchable index of curated knowledge. This just makes it that much easier. The next step will be setting up a LaunchBar Action or Service that uses nvUltra’s URL handler to automatically add StackExchange answers to my programming notebook. I made one that copies the selected answer to the clipboard, but I want it to go to my notebook without interaction. Soon.
I Couldn’t Have Done It Without…
Special thanks to Shahaf Levi, who’s made everything I’ve done this week possible. A while back I sent him a custom version of html2text.py and some Readability ports I’d been using for Marky. He converted all of it to Swift packages, which I sorely needed now that Python is no longer included with macOS[1]. Since getting my hands on them I’ve made some pretty significant changes that haven’t been merged upstream yet, but I just made pull requests. You can find my forks here:
Oh, and of course Gather is open source on GitHub, which I’m a bit nervous about because I’m still so new at Swift, but hey.
Let’s Do It
So as much as I’ve talked it up, there are always bugs. Markdownifying the web is a very difficult thing to do well. I think I’ve gotten pretty far, but there are SO many edge cases to deal with. I hope you find it useful, though.
If you have questions, comments, or requests, please use the GitHub Issues page for Gather. I’ll respond to tweets and comments, of course, but it’s the least stressful for me to manage bug reports and requests via GitHub.
Gather CLI v2.1.6
A Frankensteinian combination of html2text and Arc90 Readability. This command line tool makes clipping web pages into Markdown text without ads and comments simple.
Published 01/04/12.
Updated 09/18/23. Changelog
[1] Yes, it’s pretty simple to install, but when I want to distribute a script to do one cool thing, a user shouldn’t have to install a processor just to run it. Sheesh.