A few yeras ago, I wanted to subscribe to The Far Side‘s “Daily Dose” via my RSS reader. The Far Side doesn’t have an RSS feed, so I implemented a proxy/middleware to bridge the two.
The release of version 1.20.0 of my favourite RSS reader FreshRSS provided a new mechanism for subscribing to content from sites that didn’t provide feeds: XPath scraping. I demonstrated the use of this to subscribe to my friend Beverley‘s blog, but this week I figured it was time to have a go at retiring my middleware and subscribing directly to The Far Side from FreshRSS.
It turns out that FreshRSS’s XPath Scraping is almost enough to achieve exactly what I want. The big problem is that the image server on The Far Side website tries to prevent hotlinking by checking the
Referer: header on requests, so we need a proxy to spoof that. I threw together a quick PHP program to act as a proxy (if you don’t have this, you’ll have to click-through to read each comic), then configured my FreshRSS feed as follows:
- Feed URL:
The “Daily Dose” gets published to The Far Side‘s homepage each day.
- XPath for finding new items:
//div[@class="card tfs-comic js-comic"]
Finds each comic on the page. This is probably a little over-specific and brittle; I should probably switch to using the
containsfunction at some point. I subsequently have to use
ancestor::selectors which is usually a sign that your screen-scraping is suboptimal, but in this case it’s necessary because it’s only at this deep level that we start seeing really specific classes.
- Item title:
concat("Far Side #", parent::div/@data-id)
The comics don’t have titles (“The one with the cow”?), but these seem to have unique IDs in the
data-idattribute of the parent
<div>, so I’m using those as a reference.
- Item content:
Within each item, the
<div class="card-body">contains the comic and its text. The comic itself can’t be loaded this way for two reasons: (1) the
data-srcattribute), and (2) as mentioned above, there’s anti-hotlink protection we need to work around.
- Item link:
Each comic does have a unique link which you can access by clicking the “share” button under it. This makes a hidden text
<input>appear, which we can identify by the presence of the
data-copy-itemattribute. The contents of this textbox is the sharing URL for the comic.
- Item thumbnail:
Here’s where I hook into my special proxy server, which spoofs the
- Item date:
There’s nothing associating each comic with the date it appeared in the Daily Dose, so we have to ascend up to the top level of the page to find the date from the heading.
- Item unique ID:
Giving FreshRSS a unique ID can help it stop showing duplicates. We use the unique ID we discovered earlier; this way, if the Daily Dose does a re-run of something it already did since I subscribed, I won’t be shown it again. Omit this if you want to see reruns.
If you’re ad-supported or collect webstats and want to keep traffic “on your site” on this side of 2004, you should make it as easy as possible for people to subscribe to content. Consider The Oatmeal or Oglaf, for example, which offer RSS feeds that include only a partial thumbnail of each comic and a link through to the full thing. I don’t feel the need to screen-scrape those sites because they’ve given me a subscription option that works, and I routinely click-through to both of them to enjoy their latest content!
Conversely, the Far Side‘s aggressive anti-subscription technology ultimately means that there are fewer actual visitors to their website… because folks like me work to circumvent them.
And now you know how I did so.