The Far Side in FreshRSS

A few yeras ago, I wanted to subscribe to The Far Side‘s “Daily Dose” via my RSS reader. The Far Side doesn’t have an RSS feed, so I implemented a proxy/middleware to bridge the two.

Browser debugger running document.evaluate('//li[@class="blog__post-preview"]', document).iterateNext() on Beverley's weblog and getting the first blog entry.
If you’re looking for a more-general instruction on using XPath scraping in FreshRSS, this isn’t it.
The release of version 1.20.0 of my favourite RSS reader FreshRSS provided a new mechanism for subscribing to content from sites that didn’t provide feeds: XPath scraping. I demonstrated the use of this to subscribe to my friend Beverley‘s blog, but this week I figured it was time to have a go at retiring my middleware and subscribing directly to The Far Side from FreshRSS.

It turns out that FreshRSS’s XPath Scraping is almost enough to achieve exactly what I want. The big problem is that the image server on The Far Side website tries to prevent hotlinking by checking the Referer: header on requests, so we need a proxy to spoof that. I threw together a quick PHP program to act as a proxy (if you don’t have this, you’ll have to click-through to read each comic), then configured my FreshRSS feed as follows:

FreshRSS "HTML + XPath" configuration page, configured as described below.

  • Feed URL: https://www.thefarside.com/
    The “Daily Dose” gets published to The Far Side‘s homepage each day.
  • XPath for finding new items: //div[@class="card tfs-comic js-comic"]
    Finds each comic on the page. This is probably a little over-specific and brittle; I should probably switch to using the contains function at some point. I subsequently have to use parent:: and ancestor:: selectors which is usually a sign that your screen-scraping is suboptimal, but in this case it’s necessary because it’s only at this deep level that we start seeing really specific classes.
  • Item title: concat("Far Side #", parent::div/@data-id)
    The comics don’t have titles (“The one with the cow”?), but these seem to have unique IDs in the data-id attribute of the parent <div>, so I’m using those as a reference.
  • Item content: descendant::div[@class="card-body"]
    Within each item, the <div class="card-body"> contains the comic and its text. The comic itself can’t be loaded this way for two reasons: (1) the <img src="..."> just points to a placeholder (the site uses JavaScript-powered lazy-loading, ugh – the actual source is in the data-src attribute), and (2) as mentioned above, there’s anti-hotlink protection we need to work around.
  • Item link: descendant::input[@data-copy-item]/@value
    Each comic does have a unique link which you can access by clicking the “share” button under it. This makes a hidden text <input> appear, which we can identify by the presence of the data-copy-item attribute. The contents of this textbox is the sharing URL for the comic.
  • Item thumbnail: concat("https://example.com/referer-faker.php?pw=YOUR-SECRET-PASSWORD-GOES-HERE&referer=https://www.thefarside.com/&url=", descendant::div[@class="tfs-comic__image"]/img/@data-src)
    Here’s where I hook into my special proxy server, which spoofs the Referer: header to work around the anti-hotlinking code. If you wanted you might be able to come up with an alternative solution using a custom JavaScript loaded into your FreshRSS instance (there’s a plugin for that!), perhaps to load an iframe of the sharing URL? Or you can host a copy of my proxy server yourself (you can’t use mine, it’s got a password and that password isn’t YOUR-SECRET-PASSWORD-GOES-HERE!)
  • Item date: ancestor::div[@class="tfs-page__full tfs-page__full--md"]/descendant::h3
    There’s nothing associating each comic with the date it appeared in the Daily Dose, so we have to ascend up to the top level of the page to find the date from the heading.
  • Item unique ID: parent::div/@data-id
    Giving FreshRSS a unique ID can help it stop showing duplicates. We use the unique ID we discovered earlier; this way, if the Daily Dose does a re-run of something it already did since I subscribed, I won’t be shown it again. Omit this if you want to see reruns.
Far Side comic #12326, from 23 November 2022, shown in FreshRSS. The comic shows two bulls dressed in trenchcoats and hats browsing a china shop; one staff member says to the other "I got a bad feeling about this, Harriet."
Hurrah; once again I can laugh at repeats of Gary Larson’s best work alongside my other morning feeds.

There’s a moral to this story: when you make your website deliberately hard to consume, fewer people will access it in the way you want! The Far Side‘s website is actively hostile to users (JavaScript lazy-loading, anti-right click scripts, hotlink protection, incorrect MIME types, no feeds etc.), and an inevitable consequence of that is that people like me will find and share workarounds to that hostility.

If you’re ad-supported or collect webstats and want to keep traffic “on your site” on this side of 2004, you should make it as easy as possible for people to subscribe to content. Consider The Oatmeal or Oglaf, for example, which offer RSS feeds that include only a partial thumbnail of each comic and a link through to the full thing. I don’t feel the need to screen-scrape those sites because they’ve given me a subscription option that works, and I routinely click-through to both of them to enjoy their latest content!

Conversely, the Far Side‘s aggressive anti-subscription technology ultimately means that there are fewer actual visitors to their website… because folks like me work to circumvent them.

And now you know how I did so.

XPath Scraping with FreshRSS

I’ve been spending a while running on reduced brain capacity lately so, to ease myself back into thinking like a programmer, I upgraded my preferred feed reader FreshRSS to version 1.20.0 – which was released a couple of weeks ago – and tried out what I believe is its killer new feature: HTML + XPath scraping.

Screenshot showing Beverley Newing's weblog; two articles are visible - Paperback copy of 'Disability Visibility', edited by Alice Wong, next to a cup of tea Setting up an Accessibility Book Club, published on 1 March 2022, and Reflecting on 2021, published on 1 January 2022.
I like to keep up-to-date with my friend Bev’s blog, but they don’t have an RSS feed.

I’ve been using RSS1 for about 20 years and I love it. It feels great to be able to curate my updates based on “what I care about”, and not on “what some social network thinks I should care about”, to keep things to read later, to prioritise effectively based on my own categorisation, to consume content offline and have my to-read list synchronise later, etc.

RSS never went away, of course (what do you think a podcast is?), but it got steamrollered out of the public eye by big companies who make their money out of keeping your eyes on their platforms and off the open Web. But it feels like it’s slowly coming back: even Substack – whose entire thing is that an email client is more-convenient than a feed reader for most people – launched an RSS reader this week!

A smartphone on a wooden surface. The screen shows the FeedMe app, showing the most-recent blog post from Beverley's blog.
My day usually starts in my feed reader, accessed via the FeedMe app from my mobile (although FreshRSS provides a reasonably good responsive interface out-of-the-box!)

I love RSS so much that I routinely retrofit other people’s websites with feeds just so I can subscribe to them: I even published the tool I use to do so! Whether filtering sports headlines out of BBC News, turning retro webcomics into “reading lists” so I can track my progress, or just working around sites that really should have feeds but refuse to, I just love sidestepping these “missing feeds”. My friend Beverley has a blog without any kind of feed, so I added one so I could subscribe to it. Magic.

But with FreshRSS 1.20.0, I no longer have to maintain my own tool to get this brilliant functionality, and I’m overjoyed. Let’s look at how it works by re-subscribing to Beverley’s blog but without a middleware tool.

Screenshot showing FetchRSS being used to graphically create a feed from Beverley's blog.
This post is about to get pretty technical. If you don’t want to learn some XPath but just want to make a feed out of a web page, use a graphical tool like FetchRSS.

In the latest version of FreshRSS, when you add a new feed to your reader, a new section “Type of feed source” is available. Unfold it, and you can change from the default (“RSS / Atom”) to the new option “HTML + XPath (Web scraping)”. Put a human-readable page address rather than a feed address into the “Feed URL” field and fill these fields to tell FreshRSS how to parse the page to get the content you want. Note that it doesn’t matter if the web page isn’t valid XML (e.g. missing closing tags) because it’s going to get run through PHP’s DOMDocument anyway which will “correct” for some really sloppy code if needed.

Browser debugger running document.evaluate('//li[@class="blog__post-preview"]', document).iterateNext() on Beverley's weblog and getting the first blog entry.
You can use your browser’s debugger to help check your XPath rules: here I’ve run  document.evaluate('//li[@class="blog__post-preview"]', document).iterateNext() and got back the first blog post on the page, so I know I’m on the right track.
You’ll need to use XPath to express how to find a “feed item” on the page. Here’s the rules I used for https://webdevbev.co.uk/blog.html (many of these fields were optional – I didn’t have to do this much work):

  • Feed title: //h1
    I override this anyway in FreshRSS, so I could just have used the a string, but I wanted the XPath practice. There’s only one <h1> on the page, and it can be considered the “title” of the feed.
  • Finding items: //li[@class="blog__post-preview"]
    Each “post” on the page is an <li class="blog__post-preview">.
  • Item titles: descendant::h2
    Each post has a <h2> which is the post title. The descendant:: selector scopes the search to each post as found above.
  • Item content: descendant::p[3]
    Beverley’s static site generator template puts the post summary in the third paragraph of the <li>, which we can select like this.
  • Item link: descendant::h2/a/@href
    This expects a URL, so we need the /@href to make sure we get the value of the <h2><a href="...">, rather than its contents.
  • Item thumbnail: descendant::img[@class="blog__image--preview"]/@src
    Again, this expects a URL, which we get from the <img src="...">.
  • Item author: "Beverley Newing"
    Beverley’s blog doesn’t host any guest posts, so I just use a string literal here.
  • Item date: substring-after(descendant::p[@class="blog__date-posted"], "Date posted: ")
    This is the only complicated one: the published dates on Beverley’s blog aren’t explicitly marked-up, but part of a string that begins with the words “Date posted: “, so I use XPath’s substring-after function to strtip this. The result gets passed to PHP’s strtotime(), which is pretty tolerant of different date formats (although not of the words “Date posted:” it turns out!).
Screenshot: Adding a "HTML + XPath (Web scraping)" feed via FreshRSS.
I’d love one day for FreshRSS to provide some kind of “preview” feature here so you can see what you’ll expect to get back, as you work. That, and support for different input types (JSON, perhaps?), perhaps other selectors (I find CSS-style selectors much simpler than XPath), and maybe even an option to execute Javascript on the page before scraping (I use this in my own toolchain, but that’s just because I want to have my cake and eat it too). But this is still all pretty awesome.

I hope that this is just the beginning for this new killer feature in FreshRSS: there’s so much more it can be and do. But for now, I’m still mighty impressed that I can begin to phase-out my use of my relatively resource-intensive feed-building middleware and use my feed reader to do more and more of the heavy lifting for which I love it so much.

I also love that this functionally adds h-feed support in by the back door. I’d still prefer there to be a “h-feed” option in the “Type of feed source” drop-down, but at least I can add such support manually, now!

Beverley's blog post "Setting up an Accessibility Book Club" in FreshRSS.
The finished result: Bev’s blog posts appear directly in my feed reader, even though they don’t have a feed, and now without going through the middleware I’d set up for that purpose.

Footnotes

1 When I say RSS, I mean feed. Most of the feeds I subscribe to are RSS feeds, but some are Atom feeds, h-feed, etc. But I can’t get over the old-fashioned name, and I don’t care to try.