The Far Side in FreshRSS

A few years ago, I wanted to subscribe to The Far Side’s “Daily Dose” via my RSS reader. The Far Side doesn’t have an RSS feed, so I implemented a proxy/middleware to bridge the two.

Browser debugger running document.evaluate('//li[@class="blog__post-preview"]', document).iterateNext() on Beverley's weblog and getting the first blog entry. — If you’re looking for a more-general instruction on using XPath scraping in FreshRSS, this isn’t it.

The release of version 1.20.0 of my favourite RSS reader, FreshRSS, provided a new mechanism for subscribing to content from sites that don’t provide feeds: XPath scraping. I demonstrated the use of this to subscribe to my friend Beverley’s blog, but this week I figured it was time to have a go at retiring my middleware and subscribing directly to The Far Side from FreshRSS.

It turns out that FreshRSS’s XPath scraping is almost enough to achieve exactly what I want. The big problem is that the image server on The Far Side website tries to prevent hotlinking by checking the Referer: header on requests, so we need a proxy to spoof that. I threw together a quick PHP program to act as a proxy (if you don’t have this, you’ll have to click through to read each comic), then configured my FreshRSS feed as follows:

Feed URL: https://www.thefarside.com/
The “Daily Dose” gets published to The Far Side’s homepage each day.
XPath for finding new items: //div[@class="card tfs-comic js-comic"]
Finds each comic on the page. This is probably a little over-specific and brittle; I should probably switch to using the contains() function at some point. I subsequently have to use the parent:: and ancestor:: axes, which is usually a sign that your screen-scraping is suboptimal, but in this case it’s necessary because it’s only at this deep level that we start seeing really specific classes.
Item title: concat("Far Side #", parent::div/@data-id)
The comics don’t have titles (“The one with the cow”?), but these seem to have unique IDs in the data-id attribute of the parent <div>, so I’m using those as a reference.
Item content: descendant::div[@class="card-body"]
Within each item, the <div class="card-body"> contains the comic and its text. The comic itself can’t be loaded this way for two reasons: (1) the <img src="..."> just points to a placeholder (the site uses JavaScript-powered lazy-loading, ugh – the actual source is in the data-src attribute), and (2) as mentioned above, there’s anti-hotlink protection we need to work around.
Item link: descendant::input[@data-copy-item]/@value
Each comic does have a unique link, which you can access by clicking the “share” button under it. This makes a hidden text <input> appear, which we can identify by the presence of the data-copy-item attribute. The content of this textbox is the sharing URL for the comic.
Item thumbnail: concat("https://example.com/referer-faker.php?pw=YOUR-SECRET-PASSWORD-GOES-HERE&referer=https://www.thefarside.com/&url=", descendant::div[@class="tfs-comic__image"]/img/@data-src)
Here’s where I hook into my special proxy server, which spoofs the Referer: header to work around the anti-hotlinking code. If you wanted, you might be able to come up with an alternative solution using custom JavaScript loaded into your FreshRSS instance (there’s a plugin for that!), perhaps to load an iframe of the sharing URL? Or you can host a copy of my proxy server yourself (you can’t use mine: it’s got a password, and that password isn’t YOUR-SECRET-PASSWORD-GOES-HERE!).
Item date: ancestor::div[@class="tfs-page__full tfs-page__full--md"]/descendant::h3
There’s nothing associating each comic with the date it appeared in the Daily Dose, so we have to ascend to the top level of the page to find the date from the heading.
Item unique ID: parent::div/@data-id
Giving FreshRSS a unique ID can help it stop showing duplicates. We use the unique ID we discovered earlier; this way, if the Daily Dose does a re-run of something it already did since I subscribed, I won’t be shown it again. Omit this if you want to see reruns.
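
To make the relationship between those settings a bit more concrete, here’s a rough Python sketch that emulates them against a made-up fragment of markup. It’s an illustration only: FreshRSS evaluates real XPath itself, whereas Python’s standard-library ElementTree only supports a subset (no parent:: or ancestor:: axes, no concat), so the sketch walks upwards using a hand-built parent map. The sample markup, the example.com URLs, the date text and the scrape() function are all assumptions for demonstration; only the class names and attributes come from the settings above.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# HYPOTHETICAL, simplified markup: the class names and attributes mirror the
# settings above, but the nesting, date text and URLs are invented.
SAMPLE = """
<div class="tfs-page__full tfs-page__full--md">
  <h3>Wednesday, November 23, 2022</h3>
  <div data-id="12326">
    <div class="card tfs-comic js-comic">
      <div class="card-body">
        <div class="tfs-comic__image">
          <img src="placeholder.gif" data-src="https://example.com/comic.jpg"/>
        </div>
        <figure class="figure tfs-comic__caption">Caption text</figure>
      </div>
      <input type="text" data-copy-item="" value="https://example.com/share/12326"/>
    </div>
  </div>
</div>
"""

PROXY = "https://example.com/referer-faker.php"  # placeholder proxy address
PAGE_CLASS = "tfs-page__full tfs-page__full--md"


def scrape(markup, password):
    """Return one dict per comic, mirroring the feed settings above."""
    root = ET.fromstring(markup)
    # ElementTree elements don't know their parents, so build a lookup.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    items = []
    # XPath for finding new items
    for comic in root.findall(".//div[@class='card tfs-comic js-comic']"):
        wrapper = parent_of[comic]            # parent::div
        comic_id = wrapper.get("data-id")     # parent::div/@data-id
        page = wrapper                        # ancestor::div[@class=...]
        while page.get("class") != PAGE_CLASS:
            page = parent_of[page]
        body = comic.find(".//div[@class='card-body']")
        img = comic.find(".//div[@class='tfs-comic__image']/img")
        share = comic.find(".//input[@data-copy-item]")
        # Unlike the plain concat() above, urlencode() percent-encodes the
        # parameters; that is a choice made for this sketch.
        query = urlencode({
            "pw": password,
            "referer": "https://www.thefarside.com/",
            "url": img.get("data-src"),       # real image, not the placeholder
        })
        items.append({
            "title": "Far Side #" + comic_id,
            "content": ET.tostring(body, encoding="unicode"),
            "link": share.get("value"),
            "thumbnail": PROXY + "?" + query,
            "date": page.find(".//h3").text,
            "unique_id": comic_id,
        })
    return items
```

Calling scrape(SAMPLE, "pw") on the sample returns a single item titled “Far Side #12326”, with its thumbnail pointing at the proxy rather than at the image server directly.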

Far Side comic #12326, from 23 November 2022, shown in FreshRSS. The comic shows two bulls dressed in trenchcoats and hats browsing a china shop; one staff member says to the other "I got a bad feeling about this, Harriet." — Hurrah; once again I can laugh at repeats of Gary Larson’s best work alongside my other morning feeds.
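
The Referer-spoofing proxy mentioned in the thumbnail setting isn’t reproduced here, but the core idea is small enough to sketch. What follows is an illustrative Python version, not the PHP program described above: the function names are hypothetical, and it assumes the proxy only needs to check a shared password and then re-request the image with a chosen Referer: header.

```python
import hmac
import urllib.request

SECRET = "YOUR-SECRET-PASSWORD-GOES-HERE"  # placeholder; use your own secret


def is_authorised(supplied_pw, secret=SECRET):
    """Constant-time comparison of the pw parameter against the secret."""
    return hmac.compare_digest(supplied_pw.encode(), secret.encode())


def build_request(url, referer):
    """Prepare a request for url that claims to come from referer."""
    return urllib.request.Request(url, headers={"Referer": referer})


def fetch_image(url, referer, timeout=10.0):
    """Fetch the image and return (content_type, body) for relaying."""
    request = build_request(url, referer)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get_content_type(), response.read()
```

A web handler wrapped around this would read the pw, referer and url query parameters, refuse the request unless is_authorised() passes, and otherwise relay whatever fetch_image() returns. Keeping the password check in place matters: without it, anyone could use your server to fetch arbitrary URLs.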

There’s a moral to this story: when you make your website deliberately hard to consume, fewer people will access it in the way you want! The Far Side’s website is actively hostile to users (JavaScript lazy-loading, anti-right-click scripts, hotlink protection, incorrect MIME types, no feeds, etc.), and an inevitable consequence of that is that people like me will find and share workarounds to that hostility.

If you’re ad-supported or collect webstats and want to keep traffic “on your site” on this side of 2004, you should make it as easy as possible for people to subscribe to content. Consider The Oatmeal or Oglaf, for example, which offer RSS feeds that include only a partial thumbnail of each comic and a link through to the full thing. I don’t feel the need to screen-scrape those sites because they’ve given me a subscription option that works, and I routinely click through to both of them to enjoy their latest content!

Conversely, The Far Side’s aggressive anti-subscription technology ultimately means that there are fewer actual visitors to their website… because folks like me work to circumvent it.

And now you know how I did so.

Update: want the new content that’s being published to The Far Side in FreshRSS, too? I’ve got a recipe for that!

11 comments

No Body

Do you know of a way to handle infinite scroll sites?

17 December, 2022, 02:06

Dan Q

@No Body:

Shouldn’t be difficult, although probably not necessary for most people either (the most-recent content is usually at the top, so as long as your RSS polling interval is frequent enough you won’t need to see what’s past-the-scroll). But –

The ideal way would be to use your browser’s debugger to decipher what happens when you scroll. Presumably there’s an ajax request for more data which comes probably either as HTML or JSON, which then gets rendered to the page. And then implement an (RSSey?) script to poll that (constructed) address a few times. I basically did this for my old Far Side RSS generator, for which multiple pages needed polling IIRC.

An alternative way would be to use RSSey to script the puppeteer-powered browser to scroll to the bottom, wait a bit, repeat as needed, then parse the contents.

You probably can’t handle infinite scroll using FreshRSS alone, though. But again: most folks probably wouldn’t need to!

17 December, 2022, 11:39

Lennon

Hey, this is great! Thanks for this and the Forward comic one, I set them up in my FreshRSS install and they’re working great!

Any chance you’ve figured something out for the new stuff on The Far Side website? That one doesn’t even have a static new page, so it’s pretty thoroughly defeated me at every turn.

28 March, 2023, 18:10

Dan Q

Hadn’t even noticed there was new stuff! I’ll take a look some day!

28 March, 2023, 18:11

Lennon

Yeah, it’s sporadic and there’s not a ton of it, but I’m not gonna complain about new Far Sides!

28 March, 2023, 20:26

Dan Q

Yup, it’s absolutely achievable. I just knocked something together in about 20 minutes. Thanks for letting me know they existed!

I’ll publish an article to my blog within the next 24 hours explaining how it’s done. Why not subscribe to my RSS feed to hear about it as soon as I do? My RSS can be a little high-traffic, so if you just want the articles (longer-form stuff published here first and usually with a little research, unlike my checkins, notes etc.) you can get an RSS feed of just my articles… just sayin’!

28 March, 2023, 21:20

Lennon

Awesome, I’ll look out for it! As a diehard devotee of RSS feeds, I definitely appreciate that you have multiple RSS options for your site!

29 March, 2023, 05:00

Kevin

First off, huge thanks for posting this, I am currently testing out FreshRSS and my old way of getting the Farside feed didn’t work in FreshRSS so when googling for alternatives I stumbled across your posts. I am hosting your php script and seem to have it mostly working now. However, I seem to be getting a large empty space followed by the caption and then followed by the thumbnail. I am assuming the empty space is the empty placeholder for the comic itself as they look to be about the same size. Trying to suss out a way to hide the empty space. I tried changing the // Item content: descendant::div[@class="card-body"] // to something referencing more specifically the caption class which did appear to hide the blank space but also hid the caption as well which isn’t ideal as the caption is needed on some of the comics.

23 April, 2023, 15:12

Kevin Ould

So I kind of fixed this, I changed the // Item content: descendant::div[@class="card-body"] // to // Item content: descendant::figure[@class="figure tfs-comic__caption"] // but I noticed some comics not loading in but I also think this may have been an issue with the original settings as well. I know when I first set it up, nothing loaded in until the next day and then some showed up.

25 April, 2023, 02:11

David mcintyre

That’s all too technical for me. How can I just get it sent to my tablet a bit easier?

29 November, 2023, 17:00

Dan Q

@David: Don’t know! When I have a technical problem I come up with a technical solution. If I think it might help others, I share it.

If I’ve found the same problem as somebody else but not the right solution for them… then I hope that when they find the right solution for them that they share it, too!

29 November, 2023, 23:09
