Subscribing to Forward using FreshRSS’s XPath Scraping

As I’ve mentioned before, I’m a fan of Tailsteak’s Forward comic. I’m not a fan of the author’s weird aversion to RSS, so I hacked a way around it: first using an exploit in webcomic reader app Comic Chameleon (accidentally getting access to comics weeks in advance of their publication as a side-effect), and later using my own tool, RSSey.

But now that I’m able to use my favourite feed reader, FreshRSS, to scrape websites directly – like I’ve done for The Far Side – I should switch to using this approach to subscribe to Forward, too:

Screenshot showing RSS feed items: recent Forward episodes including their numbers, titles, and publication dates. — The goal: date-ordered, numbered, titled episodes of *Forward* in my feed reader.

Here are the settings I came up with:

Feed URL: http://forwardcomic.com/list.php
Type of feed source: HTML + XPath (Web scraping)
XPath for finding news items: //a[starts-with(@href,'archive.php')]
Item title: .
Item link (URL): ./@href
Item date: ./following-sibling::text()[1]
Custom date/time format: - Y.m.d

Annotated screenshot showing how each XPath directive maps to each part of the page. The item selector finds each hyperlink that begins with "archive.php" (notably missing the most-recent comic at any given time, which is found at index.php), and the date is found in the text node that immediately follows it, in a slightly-unusual variation on ISO8601. — The comic pages themselves do a great thing for accessibility by including a complete transcript of each. But the listing page, which is basically a series of `<a>`s separated by `<br>`s rather than a `<ul>` and `<li>`s, for example, leaves something to be desired (and makes it harder to scrape, too!).
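
To make the mapping concrete, here’s a rough Python sketch of what those rules pick out. It isn’t how FreshRSS does it: it uses Python’s built-in `html.parser` rather than real XPath, and both the sample markup and the `archive.php?num=…` links are made up to resemble the structure described above. The names `ListingParser` and `scrape` are mine, purely for illustration.

```python
from datetime import datetime
from html.parser import HTMLParser

# Hypothetical markup modelled on the listing page: <a>s separated by <br>s,
# with each archive link immediately followed by a " - Y.m.d" text node.
SAMPLE_HTML = """
<a href="index.php">Latest comic</a><br>
<a href="archive.php?num=2">Forward 2: Example title</a> - 2023.03.20<br>
<a href="archive.php?num=1">Forward 1: Another title</a> - 2023.03.13<br>
"""


class ListingParser(HTMLParser):
    """Collects title, link and date for each link starting with archive.php."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._open = None     # matching <a> we are currently inside
        self._pending = None  # closed <a> still waiting for its date text

    def handle_starttag(self, tag, attrs):
        # Roughly: //a[starts-with(@href,'archive.php')]
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if href.startswith("archive.php"):
                self._open = {"title": "", "link": href, "date": None}

    def handle_endtag(self, tag):
        if tag == "a" and self._open is not None:
            self._pending, self._open = self._open, None

    def handle_data(self, data):
        if self._open is not None:
            # Item title: "." (the text content of the link itself)
            self._open["title"] += data
        elif self._pending is not None and data.strip():
            # Item date: the first text after the link, in "- Y.m.d" format
            parsed = datetime.strptime(data.strip(), "- %Y.%m.%d")
            self._pending["date"] = parsed.date()
            self.items.append(self._pending)
            self._pending = None


def scrape(html):
    """Return a list of {title, link, date} dicts found in the given HTML."""
    parser = ListingParser()
    parser.feed(html)
    parser.close()
    return parser.items


if __name__ == "__main__":
    for item in scrape(SAMPLE_HTML):
        print(item["date"].isoformat(), item["link"], item["title"])
```

Note that the `index.php` link in the sample is skipped, just as the real selector misses the most-recent comic.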

I continue to love this “killer feature” of FreshRSS, but I’m beginning to see how it could go further – I wish I had the free time to contribute to its development!

I’d love to see a mechanism for exporting/importing feed configurations like this so that I could share them more-easily, for example. I’d also be delighted if I could expand on my XPath rules to load pages referenced by the results and get data from them, too, e.g. so I could use an image found by XPath on the “item link” page as the thumbnail image! These are things RSSey could do for me, but FreshRSS can’t… yet!

3 comments

Alkarex

Nice article, again!
If you grab this fix https://github.com/FreshRSS/FreshRSS/pull/5238 (soon in edge), you can export the XPath settings of a feed as OPML by just adding `&a=opml` to the URL (no UI yet), like https://freshrss.example.net/i/?get=f_123&a=opml
Feedback welcome!

28 March, 2023, 20:53

Alkarex

P.S. If you use the “item thumbnail” XPath field, you should be able to get your thumbnail. If not, please open a ticket with a bit more info.
And you can load content from pages referenced by the results by taking advantage of our built-in function to add more content to truncated RSS feeds (option “Article CSS selector on original website”). 🤓

28 March, 2023, 22:05

Lennon

Hey, this is working great, except the updates always point to the previous comic. Is that happening for you too, or did I somehow mess something up? It still serves the same function at the end of the day, it just means one extra click to get to the new stuff.

1 May, 2023, 17:48


