The news has, in general, been pretty terrible lately.
Like many folks, I’ve worked to narrow the focus of the things that I’m willing to care deeply about, because caring about many things is just too difficult when, y’know, nazis
are trying to destroy them all.
I’ve got friends who’ve stopped consuming news media entirely. I’ve not felt the need to go so far, and I think the reason is that I already have a moderately-disciplined
relationship with news. It’s relatively easy for me to regulate how much I’m exposed to all the crap news in the world and stay focussed and forward-looking.
The secret is that I get virtually all of my news… through my feed reader (some of it pre-filtered, e.g. my de-crappified BBC News feeds).
I use FreshRSS and I love it. But really: any feed reader can improve your relationship with
the Web.
Without a feed reader, I can see how I might feel the need to “check the news” several times a day. Pick up my phone to check the time… glance at the news while I’m there… you know how
to play that game, right?
But with a feed reader, I can treat my different groups of feeds like… periodicals. The news media I subscribe to get collated in my feed reader and I can read them once, maybe twice
per day, just like a daily newspaper. If an article remains unread for several days then, unless I say otherwise, it’s configured to be quietly archived.
My current events are less like a firehose (or sewage pipe), and more like a bottle of (filtered) water.
Categorising my feeds means that I can see what my friends are doing almost-immediately, but I don’t have to be disturbed by anything else unless I want to be. Try getting that
from a siloed social network!
Maybe sometimes I see a new breaking news story… perhaps 12 hours after you do. Is that such a big deal? In exchange, I get to apply filters of any kind I like to the news I read, and I
get to read it as a “bundle”, missing (or not missing) as much or as little as I like.
On a scale from “healthy media consumption” to “endless doomscrolling”, proper use of a feed reader is way towards the healthy end.
If you stopped using feeds when Google tried to kill them, maybe it’s time to think again. The ecosystem’s alive and well, and having a one-stop place where you can
enjoy the parts of the Web that are most-important to you, personally, in an ad-free, tracker-free, algorithmic-filtering-free space that you can make your very own… brings a
special kind of peace that I can highly recommend.
Setting up and debugging your FreshRSS XPath Scraper
Okay, so here’s Adam’s blog. I’ve checked, and there’s no RSS feed1, so it’s time to start planning my XPath Scraper. The first thing I want to do is to find some way of identifying the “posts” on the page. Sometimes people use
solid, logical id="..." and class="..." attributes, but I’m going to need to use my browser’s “Inspect Element” tool to check:
If you’re really lucky, the site you’re scraping uses an established microformat like h-feed. No such luck here, though…
The next thing that’s worth checking is that the content you’re inspecting is delivered with the page, and not loaded later using JavaScript. FreshRSS’s XPath Scraper works with the raw
HTML/XML that’s delivered to it; it doesn’t execute any JavaScript2,
so I use “View Source” and quickly search to see that the content I’m looking for is there, too.
New developers are sometimes surprised to see how different View Source and Inspect Element’s output can be3.
This looks pretty promising, though.
Now it’s time to try and write some XPath queries. Luckily, your browser is here to help! If you pop up your debug console, you’ll discover that you’ve probably got a predefined function, $x(...), to which you can pass a string containing an XPath query and get back a list of the matching elements.
First, I’ll try getting all of the links inside the #posts section by running $x( '//*[@id="posts"]//a' ) –
Once you’ve run a query, you can expand the resulting array and hover over any element in it to see it highlighted on the page. This can be used to help check that you’ve found what
you’re looking for (and nothing else).
In my first attempt, I discovered that I got not only all the posts… but also the “tags” at the top. That’s no good. Inspecting the URLs of each, I noticed that the post URLs all
contained /posts/, so I filtered my query down to $x( '//*[@id="posts"]//a[contains(@href, "/posts/")]' ) which gave me the
expected number of results. That gives me //*[@id="posts"]//a[contains(@href, "/posts/")]
as the XPath query for “news items”:
I like to add the rules I’ve learned to my FreshRSS configuration as I go along, to remind me what I still need to find.
Obviously, this link points to the full post, so that tells me I can put ./@href as the “item link” attribute in FreshRSS.
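Because FreshRSS’s XPath Scraper evaluates queries server-side against the raw delivered HTML, it can be worth re-checking a promising query outside the browser, too. Here’s a rough PHP sketch of that check, with a placeholder URL standing in for the blog being scraped:

    <?php
    // Re-run the "news items" query the way FreshRSS will see the page:
    // raw delivered HTML, no JavaScript, no browser DOM fix-ups.
    // https://example.com/blog is a placeholder for the site being scraped.
    $html = file_get_contents('https://example.com/blog');
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate real-world sloppy markup
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);
    $items = $xpath->query('//*[@id="posts"]//a[contains(@href, "/posts/")]');
    echo $items->length, " items found\n";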
Next, it’s time to see what other metadata I can extract from each post to help FreshRSS along:
Inspecting the post titles shows that they’re <h3>s. Running $x( '//*[@id="posts"]//a[contains(@href, "/posts/")]//h3' ) gets them.
Within FreshRSS, everything “within” a post is referenced relative to the post, so I convert this to descendant::h3 for my “XPath (relative to item) for Item
Title:” attribute.
I was pleased to see that Adam’s using a good accessible heading cascade. This also makes my XPathing easier!
Inspecting within the post summary content, it’s… not great for scraping. The elements’ class names don’t correspond to what the content is4: it looks like Adam’s using a utility class library5.
Everything within the <a> that we’ve found is wrapped in a <div class="flex-grow">. But within that, I can see that the date is directly inside a <p>, whereas the summary content is inside a <p> within a <div class="mb-2">. I don’t want my code to
be too fragile, and I think it’s more-likely that Adam will change the class names than the structure, so I’ll tie my queries to the structure. That gives me
descendant::div/p for the date and descendant::div/div/p for the “content”. All that remains is to tell FreshRSS that Adam’s using F j, Y as his
date format (long month name, space, short day number, comma, space, long year number) so it knows how to parse those dates, and the feed’s good.
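Since FreshRSS is written in PHP and F j, Y is PHP-style date format notation, you can sanity-check a format string locally before saving it. A quick sketch, using a made-up date in Adam’s format:

    <?php
    // "F j, Y" = long month name, day without leading zero, comma, 4-digit year.
    // 'May 15, 2025' is a hypothetical sample in the format Adam uses.
    $parsed = DateTime::createFromFormat('F j, Y', 'May 15, 2025');
    echo $parsed ? $parsed->format(DATE_ATOM) : "parse failed\n";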
If it’s wrong and I need to change anything in FreshRSS, the “Reload Articles” button can be used to force it to re-load the most-recent X posts. Useful if you need to tweak things. In
my case, I’ve also set the “Article CSS selector on original website” field to article so that the full post text can be pulled into my reader rather than having to visit
the actual site. Then I’m done!
Yet another blog I can read entirely from my feed reader, despite the fact that it doesn’t offer a “feed”.
Takeaways
Use Inspect Element to find the elements you want to scrape for.
Use $x( ... ) to test your XPath expressions.
Remember that most of FreshRSS’s fields ask for expressions relative to the news item and adapt accordingly.
If you make a mistake, use “Reload Articles” to pull them again.
2 If you need a scraper that executes JavaScript, you need something more-sophisticated. I
used to use my very own RSSey for this purpose but nowadays XPath Scraping is sufficient so I don’t bother any more, but RSSey might be a
good starting point for you if you really need that kind of power!
3 If you’ve not had the chance to think about it before: View Source shows you the actual
HTML code that was delivered from the web server to your browser. This then gets interpreted by the browser to generate the DOM, which might result in changes to it: for example,
invalid elements might be removed, ambiguous markup will have an interpretation applied, and so on. The DOM might further change as a result of JavaScript code, browser plugins, and
whatever else. When you Inspect Element, you’re looking at the DOM (represented “as if” it were HTML), not the actual underlying HTML.
4 The date isn’t in a <time> element nor does it have a class like
.post--date or similar.
5 I’ll spare you my thoughts on utility class libraries for now, but they’re… not
positive. I can see why people use them, and I’ve even used them myself before… but I don’t think they’re a good thing.
tl;dr: I’m tidying up and consolidating my personal hosting; I’ve made a little progress, but I’ve got a way to go – fortunately I’ve got a sabbatical coming up at
work!
At the weekend, I kicked-off what will doubtless be a multi-week process of gradually tidying and consolidating some of the disparate digital things I run around the Internet.
I’ve a long-standing habit of having an idea (e.g. gamebook-making tool Twinebook, lockpicking puzzle game Break Into Us, my Cheating Hangman game, and even FreeDeedPoll.org.uk!),
deploying it to one of several servers I run, and then finding it a huge headache when I inevitably need to upgrade or move said server because there’s such an insane diversity of
different things that need testing!
DNDle, my Wordle-clone where you have to guess the Dungeons & Dragons 5e monster’s stat block, is now hosted by GitHub Pages. Also, I
fixed an issue reported a month ago that meant that I was reporting Giant Scorpions as having a WIS of 19 instead of 9.
Abnib, which mostly reminds people of upcoming birthdays and serves as a dumping ground for any Abnib-related shit I produce, is now hosted by
GitHub Pages.
RockMonkey.org.uk, which doesn’t really do much any more, is now hosted by GitHub Pages.
Sour Grapes, the single-page promo for a (remote) murder mystery party I hosted during a COVID lockdown, is now hosted by GitHub
Pages.
A convenience-page for giving lost people directions to my house is now hosted by GitHub Pages.
Dan Q’s Things is now automatically built on a schedule and hosted by GitHub Pages.
Robin’s Improbable Blog, which spun out from 52 Reflect, wasn’t getting enough traffic to justify
“proper” hosting so now it sits in a Docker container on my NAS.
My μlogger server, which records my location based on pings from my phone, has also moved to my NAS. This has broken
Find Dan Q, but I’m not sure if I’ll continue with that in its current form anyway.
All of my various domain/subdomain redirects have been consolidated on, or are in the process of moving to, a tiny Linode/Akamai
instance. It’s a super simple plain Nginx server that does virtually nothing except redirect people – this is where I’ll park the domains I register but haven’t found a use for yet, in
future.
I was pretty proud of EGXchange.org, but I’ll be first to admit that it’s among the stupider of my throwaway domains.
It turns out GitHub Pages is a fine place to host simple, static websites that were open-source already. I’ve been working on improving my understanding of GitHub Actions
anyway as part of what I’ve been doing while wearing my work, volunteering, and personal hats, so switching some static build processes like DNDle’s to GitHub
Actions was a useful exercise.
Stuff I’m still to tidy…
There’s still a few things I need to tidy up to bring my personal hosting situation under control:
DanQ.me
You’re looking at it. But later this year, you might be looking at it… elsewhere?
This is the big one, because it’s not just a WordPress blog: it’s also a Gemini, Spartan, and Gopher server (thanks CapsulePress!), a Finger server, a general-purpose host to a stack of complex stuff only some of which is powered by Bloq (my WordPress/PHP integrations): e.g.
code to generate the maps that appear on my geopositioned posts, code to integrate with the Fediverse, a whole stack of configuration to make my caching work the way I want, etc.
FreeDeedPoll.org.uk
Right now this is a Ruby/Sinatra application, but I’ve got a (long-running) development branch that will make it run completely in the browser, which will further improve privacy, allow
it to run entirely-offline (with a service worker), and provide a basis for new features I’d like to provide down the line. I’m hoping to get to finishing this during my Automattic
sabbatical this winter.
The website’s basically unchanged for most of a decade and a half, and… umm… it looks it!
A secondary benefit of it becoming browser-based, of course, is that it can be hosted as a static site, which will allow me to move it to GitHub Pages too.
When I took over running the world’s geohashing hub from xkcd‘s Randall Munroe (and davean), I flung the site together on whatever hosting I had sitting
around at the time, but that’s given me some headaches. The outbound email transfer agent is a pain, for example, and it’s a hard host on which to apply upgrades. So I want to get that
moved somewhere better this winter too. It’s actually the last site left running on its current host, so it’ll save me a little money to get it moved, too!
Geohashing’s one of the strangest communities I’m honoured to be a part of. So it’d be nice to treat their primary website to a little more respect and attention.
Right now I run this on my NAS, but that turns out to be a pain sometimes because it means that if my home Internet goes down (e.g. thanks to a power cut, which we have from time to time), I lose access to the first and last place I
go on the Internet! So I’d quite like to move that to somewhere on the open Internet. Haven’t worked out where yet.
Next steps
It’s felt good so far to consolidate and tidy-up my personal web hosting (and to rediscover some old projects I’d forgotten about). There’s work still to do, but I’m expecting to spend
a few months not-doing-my-day-job very soon, so I’m hoping to find the opportunity to finish it then!
Vmail is cool. It’s vole.wtf’s (of ARCC etc. fame) community newsletter, and it’s as batshit crazy as you’d expect if you got the kinds of people who enjoy that site and asked them all to chip in on a newsletter.
Totes bonkers.
But email’s not how I like to consume this kind of media. So obviously, I scraped it.
I’m not a monster: I want Vmail’s stats to be accurate. So I signed up with an unmonitored OpenTrashMail account as well. I just don’t read it (except for the confirmation link
email). It actually took me a few attempts because there seems to be some kind of arbitrary maximum length validation on the signup form. But I got there in the end.
Recipe
Want to subscribe to Vmail using your own copy of FreshRSS? Here’s the settings you’re looking for –
Type of feed source: HTML + XPath (Web scraping)
XPath for finding news items: //table/tbody/tr
It’s just a table with each row being a newsletter; simple!
XPath for item title: descendant::a
XPath for item content: .
XPath for item link (URL): descendant::a/@href
XPath for item date: descendant::td[1]
Custom date/time format: d M *y
The dates are in a format that’s like 01 May ’24 – two-digit days with leading zeros, three-letter months, and a two-digit year preceded by a curly quote, separated by spaces. That curly quote screws up PHP’s date parser, so we have to give it a hint (see the sketch at the end of this recipe).
XPath for unique item ID: descendant::th
Optional, but each issue’s got its own unique ID already anyway; we might as well use it!
Article CSS selector on original website: #vmail
Optional, but recommended: this option lets you read the entire content of each newsletter without leaving FreshRSS.
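A closing aside on that date format: FreshRSS is written in PHP, and the format string looks like one for PHP’s DateTime::createFromFormat(), in which * matches arbitrary bytes up to the next digit or separator. Assuming that’s what happens under the hood (an assumption on my part), you can check the curly-quote handling yourself:

    <?php
    // Assumption: FreshRSS hands the custom format to createFromFormat().
    // "*" consumes the curly-quote bytes, stopping at the digit that
    // starts the two-digit year.
    $date = DateTime::createFromFormat('d M *y', '01 May ’24');
    echo $date ? $date->format('Y-m-d') : "parse failed\n"; // 2024-05-01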
So yeah, FreshRSS continues to be amazing. And lately it’s helped me keep on top of the amazing/crazy of vole.wtf too.
I’ve got a (now four-year-old) Unraid NAS called Fox and I’m a huge fan. I particularly love the fact that Unraid can work not only as a NAS, but also as a fully-fledged Docker appliance, enabling me to easily install and maintain all manner of applications.
There isn’t really a generator attached to Fox, just a UPS battery backup. The sign was liberated from our shonky home electrical system.
I was chatting this week to a colleague who was considering getting a similar setup, and he seemed to be taking notes of things he might like to install, once he’s got one. So I figured
I’d round up five of my favourite things to install on an Unraid NAS that:
Don’t require any third-party accounts (low dependencies),
Don’t need any kind of high-powered hardware (low specs), and
Provide value with very little set up (low learning curve).
It’d have been cooler if I’d secretly written this blog post while sitting alongside said colleague (shh!). But sadly it had to wait until I was home.
Syncthing’s just an awesome piece of set-and-forget software that facilitates file synchronisation between all of your devices and can also form part of a backup strategy.
Here’s the skinny: you install Syncthing on several devices, then give each the identification key of another to pair them. Now you can add folders on each and “share” them with the
others, and the two are kept in-sync. There’s lots of options for power users, but just as a starting point you can use this to:
Manage the photos on your phone and push copies to your desktop whenever you’re home (like your favourite cloud photo sync service, but selfhosted).
Keep your Obsidian notes in-sync between all your devices (normally costs $4/month).1
Get a copy of the documents from all your devices onto your NAS, for backup purposes (note that sync’ing alone, even with
versioning enabled, is not a good backup: the idea is that you run an actual backup from your NAS!).
You know IFTTT? Zapier? Services that help you to “automate” things based on inputs and outputs. Huginn’s like that, but selfhosted.
Also: more-powerful.
When we first started looking for a dog to adopt (y’know, before we got this derper), I set up Huginn watchers to monitor the websites of several rescue centres, filter them by some of our criteria, and push
the results to us in real-time on Slack, giving us an edge over other prospective puppy-parents.
The learning curve is steeper than anything else on this list, and I almost didn’t include it for that reason alone. But once you’ve learned your way around its idiosyncrasies and
dipped your toe into the more-advanced JavaScript-powered magic it can do, you really begin to unlock its potential.
It couples well with Home Assistant, if that’s your jam. But even without it, you can find yourself automating things you never expected to.
Many of these suggested apps benefit greatly from you exposing them to the open Web rather than just running them on your LAN,
and an RSS reader is probably the best example (you want to read your news feeds when you’re out and about, right?). What you
need for that is a reverse proxy, and there are lots of guides to doing it super-easily, even if you’re not on a static IP
address.2
Alternatively you can just VPN in to your home: your router might be able to arrange this, or else Unraid can do it for you!
You know how sometimes you need to give somebody your email address but you don’t actually want to? Like: sure, I’d like you to email me a verification code for this download, but I don’t trust you not to spam me later! What you need is a disposable email address.3
How do you feel about having infinite email addresses that you can make up on-demand (without even having access to a computer), subscribe to by RSS, and never have to see unless you specifically want to?
You just need to install Open Trashmail, point the MX records of a few domain names or subdomains (you’ve got some spare domain names
lying around, right? If not, they’re pretty cheap…) at it, and it will now accept email to any address on those domains. You can make up addresses off the top of your head,
even away from an Internet connection when using a paper-based form, and they work. You can check them later if you want to… or ignore them forever.
Couple it with an RSS reader, or Huginn, or Slack, and you can get a notification or take some action when an email arrives!
Need to give that escape room your email address to get a copy of your “team photo”? Give them a throwaway, pick up the picture when you get home, and then forget you ever gave it
to them.
Company give you a freebie on your birthday if you sign up their mailing list? Sign up 366 times with them and write a Huginn workflow that puts “today’s” promo code into your
Obsidian notetaking app (Sync’d over Syncthing) but filters out everything else.
Suspect some organisation is selling your email address on to third parties? Give them a unique email address that you only give to them and catch them in a honeypot.
It isn’t pretty, but… it doesn’t need to be! Nobody actually sees the admin interface except you anyway.
Plus, it’s just kinda cool to be able to brand your shortlinks with your own name, right? If you follow only one link from this post, let it be to watch this video
that helps explain why this is important: danq.link/url-shortener-highlights.
I run many, many other Docker containers and virtual machines on my NAS. These five aren’t even the “top five” that I
use… they’re just five that are great starters because they’re easy and pack a lot of joy into their learning curve.
And if your NAS can’t do all the above… consider Unraid for your next NAS!
Footnotes
1 I wrote the beginnings of this post on my phone while in the Channel Tunnel and then
carried on using my desktop computer once I was home. Sync is magic.
2 I can’t share or recommend one reverse proxy guide in particular because I set my own up
because I can configure Nginx in my sleep, but I did a quick search and found several that all look good so I imagine you can do the same. You don’t have to do it on day one, though!
It turns out that by default, WordPress replaces emoji in its feeds (and when sending email) with images of those emoji, using the Twemoji set, and with the alt-text set to the original emoji. These images are hosted at https://s.w.org/images/core/emoji/…-based
URLs.
I can see why this functionality was added: what if the feed reader didn’t support Unicode or didn’t have a font capable of showing the appropriate emoji?
But I can also see reasons why it might not be desirable to everybody. For example:
Downloading an image will always be slower than rendering an emoji.
The code to include an image is always more-verbose than simply including an emoji.
As seen above: a feed reader which imposes a minimum size on embedded images might well render one “wrong”.
It’s marginally more-verbose for screen reader users to say “Image: heart emoji” than just “heart emoji”, I imagine.
Serving a third-party image when a feed item is viewed has potential privacy implications that I try hard to avoid.
Replacing emoji with images is probably unnecessary for modern feed readers anyway.
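Happily, WordPress attaches this behaviour using ordinary hooks, so a few remove_filter() calls in a small plugin (or your theme’s functions.php) are enough to switch it off. A minimal sketch, assuming the default hook priorities; wp_staticize_emoji and wp_staticize_emoji_for_email are the relevant core functions:

    <?php
    // Stop WordPress swapping emoji for Twemoji images in feed content,
    // comment feeds, and outgoing email.
    remove_filter( 'the_content_feed', 'wp_staticize_emoji' );
    remove_filter( 'comment_text_rss', 'wp_staticize_emoji' );
    remove_filter( 'wp_mail', 'wp_staticize_emoji_for_email' );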
That’s all there is to it. Now, my feed reader shows my system’s emoji instead of a huge image:
I’m always grateful to discover that a piece of WordPress functionality, whether core or in an extension, makes proper use of hooks so that its functionality can be changed, extended,
or disabled. One of the single best things about the WordPress open-source ecosystem is that you almost never have to edit somebody else’s code (and remember to re-edit it
every time you install an update).
The week before last, Katie shared with me that article from last month, Who killed Google Reader? I’d read it before so I
didn’t bother clicking through again, but we did end up chatting about RSS a bit1.
I ditched Google Reader several years before its untimely demise, but I can confirm “461 unread items” was a believable message.
Katie “abandoned feeds a few years ago” because they were “regularly ending up with 200+ unread items that felt overwhelming”.
Conversely: I think that dropping your feed reader because there’s too much to read is… solving the wrong problem.
About half way through editing this image I completely forgot what message I was trying to convey, but I figured I’d keep it anyway and let you come up with your own
interpretation.
I think that he, like Katie, might be looking at his reader in a different way than I do mine.
At time of writing, I’ve got 567 unread items. And that’s fine.
RSS is not email!
I’ve been in the position that Katie and David describe: of feeling overwhelmed by the sheer volume of unread items. And I know others have, too. So let me share something I wish I’d learned sooner:
There’s nothing special about reaching Inbox Zero in your feed reader.
It’s not noble nor enlightened to get to the bottom of your “unread” list.
Your 👏 feed 👏 reader 👏 is 👏 not 👏 an 👏 email 👏 client. 👏
The idea of Inbox Zero as applied to your email inbox is about productivity. Any message in your email might be something that requires urgent action, and you won’t know until you filter through and categorise them.
But your RSS reader isn’t (shouldn’t be?) there to add to your to-do list. Your RSS reader is a list of things you might like to read. In an ideal world, reaching “RSS Zero” would mean that you’ve seen everything on the Internet that you might
enjoy. That’s not enlightened; that’s sad!
Google Reader understood this, although the word “congratulations” was misplaced.
Use RSS for joy
My RSS reader is a place of joy, never of stress. I’ve tried to boil down the principles that make it so, and here they are:
Zero is not the target.
The numbers are there to inspire you about how much there is “out there” for you, not to enumerate how much work you have to do.
Group your feeds by importance.
Your feed reader probably lets you group (folder, tag…) your feeds, so you can easily check-in on what you care about and leave other feeds for a rainy day.2 This is good.
Don’t read every article.
Your feed reader gives you the convenience of keeping content in one place, but you’re not obligated to read every single one. If something doesn’t interest you, mark it
as read and move on. No judgement.
Keep things for later.
Something you want to read, but not now? Find a way to “save for later” to get it out of your main feed so you don’t have to scroll past it every day! Star it or tag
it3 or push it to your link-saving or note-taking app. I use a
link shortener which then feeds back into my feed reader into a “for later” group!
Let topical content expire.
Have topical/time-dependent feeds (general news media, some social media etc.)? Have your reader “purge” unread articles after a time. I have my subscription to BBC News headlines expire after 5 days: if I’ve taken that long to
read a headline, it might as well disappear.4
Use your feed reader deliberately.
You don’t need popup notifications (a new article’s probably already up to an hour stale by the time it hits your reader). We’re all already slaves to
notifications! Visit your reader when it suits you. I start and end every day in mine; most days I hit it again a couple of other times. I don’t need a notification: there’s always new
content. The reader keeps track of what I’ve not looked at.
It’s not just about text.
Don’t limit your feed reader to just text. Podcasts are nothing more than RSS feeds with attached
audio files; you can keep track in your reader if you like. Most video platforms let you subscribe to a feed of new videos on a channel or playlist basis, so you can e.g. get notified about YouTube channel updates without having to fight with The
Algorithm. Features like XPath Scraping in FreshRSS let you subscribe to services that
don’t even have feeds: to watch the listings of dogs on local shelter websites when you’re looking to adopt, for example.
Do your reading in your reader.
Your reader respects your preferences: colour scheme, font size, article ordering, etc. It doesn’t nag you with newsletter signup popups, cookie notices, or ads. Make the
most of that. Some RSS feeds try to disincentivise this by providing only summary content, but a good feed reader can work
around this for you, fetching actual content in the background.5
Use offline time to catch up on your reading.
Some of the best readers support offline mode. I find this fantastic when I’m on an aeroplane, because I can catch up on all of the interesting articles I’d not yet had time for while grounded, and my reading will get synchronised when I touch down and disable flight mode.
Make your reader work for you.
A feed reader is a tool that works for you. If it’s causing you pain, switch to a different tool6,
or reconfigure the one you’ve got. And if the way you find joy from RSS is different from me, that’s fine: this is
a personal tool, and we don’t have to have the same answer.
2 If your feed reader doesn’t support any kind of grouping, get a better reader.
3 If your feed reader doesn’t support any kind of marking/favouriting/tagging of articles,
get a better reader.
4 If your feed reader doesn’t support customisable expiry times… well that’s not too
unusual, but you might want to consider getting a better reader.
5 FreshRSS calls the feature that fetches actual post content from the resulting page
“Article CSS selector on original website”, which is a bit of a mouthful, but you can see what it’s doing. If your feed reader doesn’t support fetching full content… well, it’s
probably not that big a deal, but it’s a good nice-to-have if you’re shopping around for a reader, in my opinion.
6 There’s so much choice in feed readers, and migrating between them is (usually)
very easy, so everybody can find the best choice for them. Feedly, Inoreader, and The Old Reader are popular, free, and easy-to-use if you’re looking to get started. I prefer a selfhosted tool so I use the amazing FreshRSS (having migrated from Tiny Tiny RSS). Here’s some more tips on getting started. You might prefer a desktop or
mobile tool, or even something exotic: part of the beauty of RSS feeds is they’re open and interoperable, so if for example
you love using Slack, you can use Slack to push feed updates to you and get almost all the features you need to do everything in my list, including grouping (using
channels) and saving for later (using Slackbot/”remind me about this”). Slack’s a perfectly acceptable feed reader for some people!
Wait, there’s new Far Side content? Yup: it turns out Gary Larson’s dusted off his pen
and started drawing again. That’s awesome! But the last thing I want is to have to go to the website once every few… what: days? weeks? months? He’s not syndicated any more so
he’s not got a deadline to work to! If only there were some way to have my feed reader, y’know, do it for me and let me know whenever he draws something new.
It turns out, there is.
Here’s my setup for getting Larson’s new funnies right where I want them:
Feed URL: https://www.thefarside.com/new-stuff/1
This isn’t a valid address for any of the new stuff, but always seems to redirect to somewhere that is, so that’s nice.
XPath for finding news items: //div[@class="swiper-slide"]
Turns out all the “recent” new stuff gets loaded in the HTML and then JavaScript turns it into a slider etc.; some of the
CSS classes change when the JavaScript runs so I needed to View Source rather than use my browser’s inspector to find
everything.
Item title: concat("Far Side #", descendant::button[@aria-label="Share"]/@data-shareable-item)
Ugh. The easiest place I could find a “clean” comic ID number was in a data- attribute of the “share” button, where it’s presumably used for engagement tracking. Still, whatever works, right?
Item content: descendant::figcaption
When Larson captions a comic, the caption is important.
Item link (URL) and item unique ID: concat("https://www.thefarside.com", ./@data-path)
The URLs work as direct links to the content, and because they’re unique, they make a reasonable unique ID too (so long as
their numbering scheme is internally-consistent, this should stop a re-run of new content popping up in your feed reader if the same comic comes around again).
Item thumbnail: concat("https://fox.q-t-a.uk/referer-faker.php?pw=YOUR-SECRET-PASSWORD-GOES-HERE&referer=https://www.thefarside.com/&url=", descendant::img[@data-src]/@data-src)
The Far Side uses Referer: headers as an anti-hotlinking measure, which prevents us easily loading the images directly in an RSS reader. I use this tiny PHP script as a proxy to mitigate that (there’s a minimal sketch of that kind of proxy at the end of this recipe). If you don’t have such a proxy set up, you could simply omit the “Item thumbnail” and “Item content” fields and click the link to go to the original page.
Item date: normalize-space(descendant::div[@class="tfs-comic-new__meta"]/*[1])
The date is spread through two separate text nodes, so we get the content of their wrapper and use normalize-space to tidy the whitespace up. The date format then looks
like “Wednesday, March 29, 2023”, which we can parse using a custom date/time format string:
Custom date/time format: l, F j, Y
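For the curious, a proxy like that doesn’t need to be complicated. Here’s a minimal sketch of the idea, using the same pw/referer/url parameters as the recipe above; treat it as an illustration rather than my actual script:

    <?php
    // referer-faker.php (sketch): fetch the image at ?url= while claiming
    // the Referer: given in ?referer=, then pass the response through.
    // ?pw= is a shared secret so that strangers can't use your proxy.
    if (($_GET['pw'] ?? '') !== 'YOUR-SECRET-PASSWORD-GOES-HERE') {
        http_response_code(403);
        exit('Forbidden');
    }
    $context = stream_context_create(['http' => [
        'header' => 'Referer: ' . ($_GET['referer'] ?? ''),
    ]]);
    $image = file_get_contents($_GET['url'] ?? '', false, $context);
    if ($image === false) {
        http_response_code(502);
        exit('Upstream fetch failed');
    }
    foreach ($http_response_header as $header) {
        if (stripos($header, 'Content-Type:') === 0) {
            header($header); // pass the upstream content type through
        }
    }
    echo $image;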
I promise I’ll stop writing about how awesome FreshRSS + XPath is someday. Today isn’t that day.
Meanwhile: if you used to use a feed reader but gave up when the Web started to become hostile to them and big social media systems started to wall you in, you should really consider
picking one up again. The stuff I write about is complex edge-cases that most folks don’t need to think about in order to benefit from RSS… but it’s super convenient to have the things you care about online (news, blogs, social media, videos, newsletters, comics, search trends…)
collated and sorted for you… without interference from algorithms that want to push “sticky” content, without invasive tracking or advertisements (or cookie banners or privacy popups),
without something “disappearing” simply because you put off reading it for a few days.
The goal: date-ordered, numbered, titled episodes of Forward in my feed reader.
Here’s the settings I came up with –
Feed URL: http://forwardcomic.com/list.php
Type of feed source: HTML + XPath (Web scraping)
XPath for finding news items: //a[starts-with(@href,'archive.php')]
Item title: .
Item link (URL): ./@href
Item date: ./following-sibling::text()[1]
Custom date/time format: - Y.m.d
The comic pages themselves do a great thing for accessibility by including a complete transcript of each. But the listing page, which is basically a series of <a>s
separated by <br>s rather than a <ul> and <li>s, for example, leaves something to be desired (and makes it harder to scrape,
too!).
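If you’re curious how that ./following-sibling::text()[1] rule copes with such a structure, here’s a PHP sketch against a mocked-up fragment of the listing page (the markup is illustrative, not copied from the real site):

    <?php
    // Mock of the archive listing: <a>s separated by <br>s, with each date
    // sitting in a bare text node straight after its link.
    $dom = new DOMDocument();
    $dom->loadHTML('<p><a href="archive.php?page=1">Episode 1</a> - 2020.01.06<br>'
        . '<a href="archive.php?page=2">Episode 2</a> - 2020.01.13</p>');
    $xpath = new DOMXPath($dom);
    foreach ($xpath->query("//a[starts-with(@href,'archive.php')]") as $link) {
        $date = $xpath->query('./following-sibling::text()[1]', $link)->item(0);
        echo trim($link->textContent), ' => ', trim($date->textContent), "\n";
    }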
I continue to love this “killer feature” of FreshRSS, but I’m beginning to see how it could go further – I wish I had the free time to contribute to its development!
I’d love to see a mechanism for exporting/importing feed configurations like this so that I could share them more-easily, for example. I’d also be delighted if I could expand on my
XPath rules to load pages referenced by the results and get data from them, too, e.g. so I could use an image found by XPath on the “item link” page as the thumbnail
image! These are things RSSey could do for me, but FreshRSS can’t… yet!
There’s been a bit of a resurgence lately of sites whose only subscription option is email, or – worse yet – who provide certain “exclusive” content only to email subscribers.
I don’t want to go giving an actual email address to every damn service, because:
It’s not great for privacy, even when (as usual) I use a unique alias for each sender.
It’s usually harder to unsubscribe than I’d like, and rarely consistent: you need to find a recent message, click a link, sometimes that’s enough or sometimes you need to uncheck a
box or click a button, or sometimes you’ll get another email with something to click in it…
I rarely want to be notified the very second a new issue is published; email is necessarily more “pushy” than I like a subscription to be.
I don’t want to use my email Inbox to keep track of which articles I’ve read/am still going to read: that’s what a feed reader is for! (It also provides tagging, bookmarking,
filtering, standardised and bulk unsubscribing tools, etc.)
So what do I do? Well…
I already operate an OpenTrashMail instance for one-shot throwaway email addresses (which I highly recommend). And
OpenTrashMail provides a rich RSS feed. Sooo…
How I subscribe to newsletters (in my feed reader)
If I want to subscribe to your newsletter, here’s what I do:
Put an email address (I usually just bash the keyboard to make a random one, then put @-a-domain-I-control on the end, where that domain is handled by OpenTrashMail) in to
subscribe.
Put https://my-opentrashmail-server/rss/the-email-address-I-gave-you/rss.xml into my feed reader.
That’s all. There is no step 3.
Now I get your newsletter alongside all my other subscriptions. If I want to unsubscribe I just tell my feed reader to stop polling the RSS feed (You don’t even get to find out that I’ve unsubscribed; you’re now just dropping emails into an unmonitored box, but of course I can
resubscribe and pick up from where I left off if I ever want to).
Obviously this approach isn’t suitable for personalised content or sites for which your email address is used for authentication, because anybody who can guess the random email address
can get the feed! But it’s ideal for those companies who’ll occasionally provide vouchers in exchange for being able to send you other stuff to your Inbox, because you can
simply pipe their content to your feed reader, then add a filter to drop anything that doesn’t contain the magic keyword: regular vouchers, none of the spam. Or for blogs that provide
bonus content to email subscribers, you can get the bonus content in the same way as the regular content, right there in a folder of your reader. It’s pretty awesome.
If you don’t already have and wouldn’t benefit from running OpenTrashMail (or another trashmail system with feed support) it’s probably not worth setting one up just for this
purpose. But otherwise, I can certainly recommend it.
A few years ago, I wanted to subscribe to The Far Side‘s “Daily Dose” via my RSS reader. The Far Side doesn’t have an RSS feed, so I implemented a proxy/middleware to bridge the two.
If you’re looking for a more-general instruction on using XPath scraping in FreshRSS, this isn’t it.
The release of version 1.20.0 of my favourite RSS reader FreshRSS provided a new mechanism for subscribing to content from sites that didn’t provide feeds: XPath scraping. I demonstrated the use of this to subscribe to my friend Beverley‘s blog, but this week I
figured it was time to have a go at retiring my middleware and subscribing directly to The Far Side from FreshRSS.
It turns out that FreshRSS’s XPath Scraping is almost enough to achieve exactly what I want. The big problem is that the image server on The Far Side website tries to
prevent hotlinking by checking the Referer: header on requests, so we need a proxy to spoof that. I threw together a quick PHP program to act as a
proxy (if you don’t have this, you’ll have to click-through to read each comic), then configured my FreshRSS feed as follows:
Feed URL: https://www.thefarside.com/
The “Daily Dose” gets published to The Far Side‘s homepage each day.
XPath for finding new items: //div[@class="card tfs-comic js-comic"]
Finds each comic on the page. This is probably a little over-specific and brittle; I should probably switch to using the contains function at some point. I subsequently have to use parent:: and
ancestor:: selectors which is usually a sign that your screen-scraping is suboptimal, but in this case it’s necessary because it’s only at this deep level that we start
seeing really specific classes.
Item title: concat("Far Side #", parent::div/@data-id)
The comics don’t have titles (“The one with the cow”?), but these seem to have unique IDs in the data-id attribute of the parent <div>, so I’m using
those as a reference.
Item content: descendant::div[@class="card-body"]
Within each item, the <div class="card-body"> contains the comic and its text. The comic itself can’t be loaded this way for two reasons: (1) the <img
src="..."> just points to a placeholder (the site uses JavaScript-powered lazy-loading, ugh – the actual source is in the data-src attribute), and (2) as
mentioned above, there’s anti-hotlink protection we need to work around.
Item link: descendant::input[@data-copy-item]/@value
Each comic does have a unique link which you can access by clicking the “share” button under it. This makes a hidden text <input> appear, which we can
identify by the presence of the data-copy-item attribute. The contents of this textbox is the sharing URL for
the comic.
Item thumbnail: concat("https://example.com/referer-faker.php?pw=YOUR-SECRET-PASSWORD-GOES-HERE&referer=https://www.thefarside.com/&url=", descendant::div[@class="tfs-comic__image"]/img/@data-src)
Here’s where I hook into my special proxy server, which spoofs the Referer: header to work around the anti-hotlinking code. If you wanted you might be able to come up
with an alternative solution using a custom JavaScript loaded into your FreshRSS instance (there’s a plugin for that!), perhaps to load an iframe of the sharing URL? Or you can host a copy of my proxy server yourself (you can’t use mine, it’s got a password and that password isn’t YOUR-SECRET-PASSWORD-GOES-HERE!)
Item date: ancestor::div[@class="tfs-page__full tfs-page__full--md"]/descendant::h3
There’s nothing associating each comic with the date it appeared in the Daily Dose, so we have to ascend up to the top level of the page to find the date from the heading.
Item unique ID: parent::div/@data-id
Giving FreshRSS a unique ID can help it stop showing duplicates. We use the unique ID we discovered earlier; this way, if the Daily Dose does a re-run of something it already did
since I subscribed, I won’t be shown it again. Omit this if you want to see reruns.
Hurrah; once again I can laugh at repeats of Gary Larson’s best work alongside my other morning feeds.
There’s a moral to this story: when you make your website deliberately hard to consume, fewer people will access it in the way you want! The Far Side‘s website
is actively hostile to users (JavaScript lazy-loading, anti-right click scripts, hotlink protection, incorrect MIME types, no feeds etc.), and an inevitable consequence of that is that people like me will find and share workarounds to that
hostility.
If you’re ad-supported or collect webstats and want to keep traffic “on your site” on this side of 2004, you should make it as easy as possible for people to subscribe to content.
Consider The Oatmeal or Oglaf, for example, which offer RSS feeds that include only a partial thumbnail of each comic and a link through to the full thing. I don’t feel the need to screen-scrape those sites
because they’ve given me a subscription option that works, and I routinely click-through to both of them to enjoy their latest content!
Conversely, the Far Side‘s aggressive anti-subscription technology ultimately means that there are fewer actual visitors to their website… because folks like me work
to circumvent them.
And now you know how I did so.
Update: want the new content that’s being published to The Far Side in FreshRSS, too? I’ve got a recipe for that!
My day usually starts in my feed reader, accessed via the FeedMe app from my mobile (although FreshRSS provides a reasonably good
responsive interface out-of-the-box!)
But with FreshRSS 1.20.0, I no longer have to maintain my own tool to get this brilliant functionality, and I’m overjoyed. Let’s look at how it works by re-subscribing to Beverley’s
blog but without a middleware tool.
This post is about to get pretty technical. If you don’t want to learn some XPath but just want to make a feed out of a web page, use a
graphical tool like FetchRSS.
In the latest version of FreshRSS, when you add a new feed to your reader, a new section “Type of feed source” is available. Unfold it, and you can change from the default
(“RSS / Atom”) to the new option “HTML + XPath (Web scraping)”.
Put a human-readable page address rather than a feed address into the “Feed URL” field and fill these fields to tell FreshRSS
how to parse the page to get the content you want. Note that it doesn’t matter if the web page isn’t valid XML (e.g. missing
closing tags) because it’s going to get run through PHP’s
DOMDocument anyway which will “correct” for some really sloppy code if needed.
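You can see that “correction” for yourself with a few lines of PHP:

    <?php
    // DOMDocument will happily build a tree from invalid HTML, inferring
    // the closing tags that the author never wrote.
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // suppress warnings about the sloppy markup
    $dom->loadHTML('<ul><li>first<li>second'); // no closing tags at all
    $xpath = new DOMXPath($dom);
    echo $xpath->query('//li')->length, "\n"; // 2 - both items were recovered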
You can use your browser’s debugger to help check your XPath rules: here I’ve run document.evaluate('//li[@class="blog__post-preview"]', document).iterateNext() and
got back the first blog post on the page, so I know I’m on the right track.
You’ll need to use XPath to express how to find a “feed item” on the page. Here’s the rules I used for https://webdevbev.co.uk/blog.html (many of these fields were optional – I didn’t have to do this much work):
Feed title: //h1
I override this anyway in FreshRSS, so I could just have used a string, but I wanted the XPath practice. There’s only one <h1> on the page, and it can be
considered the “title” of the feed.
Finding items: //li[@class="blog__post-preview"]
Each “post” on the page is an <li class="blog__post-preview">.
Item titles: descendant::h2
Each post has a <h2> which is the post title. The descendant:: selector scopes the search to each post as found above.
Item content: descendant::p[3]
Beverley’s static site generator template puts the post summary in the third paragraph of the <li>, which we can select like this.
Item link: descendant::h2/a/@href
This expects a URL, so we need the /@href to make sure we get the value of the <h2><a
href="...">, rather than its contents.
Item thumbnail: descendant::img[@class="blog__image--preview"]/@src
Again, this expects a URL, which we get from the <img src="...">.
Item author:"Beverley Newing"
Beverley’s blog doesn’t host any guest posts, so I just use a string literal here.
Item date: substring-after(descendant::p[@class="blog__date-posted"], "Date posted: ")
This is the only complicated one: the published dates on Beverley’s blog aren’t explicitly marked-up, but part of a string that begins with the words “Date posted: ”, so I use XPath’s substring-after function to strip this. The result gets passed to PHP’s
strtotime(), which is pretty tolerant of different date formats (although not of the words “Date posted:” it turns out!).
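You can reproduce that gotcha directly in PHP; the date string below is a made-up example in Beverley’s format:

    <?php
    // strtotime() copes with many human date formats, but not with the
    // "Date posted: " prefix - hence the substring-after in the XPath.
    var_dump(strtotime('Date posted: 17 August 2022')); // bool(false)
    var_dump(strtotime('17 August 2022'));              // int(<timestamp>)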
I’d love one day for FreshRSS to provide some kind of “preview” feature here so you can see what you’ll expect to get back, as you work. That, and support for different input types
(JSON, perhaps?), perhaps other selectors (I find CSS-style
selectors much simpler than XPath), and maybe even an option to execute JavaScript on the page before scraping (I use this in my own toolchain, but that’s just because I want to have
my cake and eat it too). But this is still all pretty awesome.
I hope that this is just the beginning for this new killer feature in FreshRSS: there’s so much more it can be and do. But for now, I’m still mighty impressed that I can begin to
phase-out my use of my relatively resource-intensive feed-building middleware and use my feed reader to do more and more of the heavy lifting for which I love it so much.
I also love that this functionally adds h-feed support in by the back door. I’d still prefer there to be a “h-feed” option in the “Type of feed source” drop-down, but at least
I can add such support manually, now!
The finished result: Bev’s blog posts appear directly in my feed reader, even though they don’t have a feed, and now without going through the middleware I’d set up for that
purpose.
Footnotes
1 When I say RSS, I mean feed. Most of the feeds I subscribe to are RSS feeds, but some
are Atom feeds, h-feed, etc. But I can’t get over the old-fashioned name, and I don’t care to try.
My @FreshRSS installation is the first, last, and sometimes only place I go on the Internet. When a site doesn’t have a feed but I wish it
did, I add one using middleware (e.g. danq.me/far-side-rss).