XPath Scraping AdamKoszary.co.uk

Adam Koszary – whom I worked alongside at the Bodleian – the social media specialist who brought the “absolute unit” meme to the masses, started blogging earlier (again?) this year. Yay!

But he’s completely neglected to put an RSS feed on hew new blog. Boo!

Dan, wearing a VR headset, sits in an office environment, watched by Adam.
People who saw Adam and I work together might have questioned the degree to which it counted as “work”, but that’s another story.

I’ve talked at length about how I use FreshRSS‘s “XPath Scraping” feature (for Bev’s blog, Far Side, Forward, new Far Side, and Vmail, among others), but earlier this week somebody left a comment to ask me more about how I test and debug my XPath scrapers. Given that I now need to add one for Adam’s blog, I’m in a wonderful position to walk you through it!

Setting up and debugging your FreshRSS XPath Scraper

Okay, so here’s Adam’s blog. I’ve checked, and there’s no RSS feed1, so it’s time to start planning my XPath Scraper. The first thing I want to do is to find some way of identifying the “posts” on the page. Sometimes people use solid, logical id="..." and class="..." attributes, but I’m going to need to use my browser’s “Inspect Element” tool to check:

Screenshot showing Inspect Element in use on Adam's blog.
If you’re really lucky, the site you’re scraping uses an established microformat like h-feed. No such luck here, though…

The next thing that’s worth checking is that the content you’re inspecting is delivered with the page, and not loaded later using JavaScript. FreshRSS’s XPath Scraper works with the raw HTML/XML that’s delivered to it; it doesn’t execute any JavaScript2, so I use “View Source” and quickly search to see that the content I’m looking for is there, too.

HTML source code showing id="posts" highlighted.
New developers are sometimes surprised to see how different View Source and Inspect Element’s output can be3. This looks pretty promising, though.
Now it’s time to try and write some XPath queries. Luckily, your browser is here to help! If you pop up your debug console, you’ll discover that you’re probably got a predefined function, $x(...), to which you can path a string containing an XPath query and get back a NodeList of the element.

First, I’ll try getting all of the links inside the #posts section by running $x( '//*[@id="posts"]//a' )  –

A browser's debug console executes $x('//*[@id="posts"]//a') , and gets 14 results.
Once you’ve run a query, you can expand the resulting array and hover over any element in it to see it highlighted on the page. This can be used to help check that you’ve found what you’re looking for (and nothing else).
In my first attempt, I discovered that I got not only all the posts… but also the “tags” at the top. That’s no good. Inspecting the URLs of each, I noticed that the post URLs all contained /posts/, so I filtered my query down to $x( '//*[@id="posts"]//a[contains(@href, "/posts/")]' ) which gave me the expected number of results. That gives me //*[@id="posts"]//a[contains(@href, "/posts/")] as the XPath query for “news items”:
FreshRSS XPath feed configuration page showing my new query in the appropriate field.
I like to add the rules I’ve learned to my FreshRSS configuration as I go along, to remind me what I still need to find.

Obviously, this link points to the full post, so that tells me I can put ./@href as the “item link” attribute in FreshRSS.

Next, it’s time to see what other metadata I can extract from each post to help FreshRSS along:

Inspecting the post titles shows that they’re <h3>s. Running $x( '//*[@id="posts"]//a[contains(@href, "/posts/")]//h3' ) gets them. Within FreshRSS, everything “within” a post is referenced relative to the post, so I convert this to descendant::h3 for my “XPath (relative to item) for Item Title:” attribute.

An XPath query identifying the titles of the posts.
I was pleased to see that Adam’s using a good accessible heading cascade. This also makes my XPathing easier!

Inspecting within the post summary content, it’s… not great for scraping. The elements class names don’t correspond to what the content is4: it looks like Adam’s using a utility class library5.

Everything within the <a> that we’ve found is wrapped in a <div class="flex-grow">. But within that, I can see that the date is directly inside a <p>, whereas the summary content is inside a <p> within a <div class="mb-2">. I don’t want my code to be too fragile, and I think it’s more-likely that Adam will change the class names than the structure, so I’ll tie my queries to the structure. That gives me descendant::div/p for the date and descendant::div/div/p for the “content”. All that remains is to tell FreshRSS that Adam’s using F j, Y as his date format (long month name, space, short day number, comma, space, long year number) so it knows how to parse those dates, and the feed’s good.

If it’s wrong and I need to change anything in FreshRSS, the “Reload Articles” button can be used to force it to re-load the most-recent X posts. Useful if you need to tweak things. In my case, I’ve also set the “Article CSS selector on original website” field to article so that the full post text can be pulled into my reader rather than having to visit the actual site. Then I’m done!

Adam's blog post "Content of the Week #7: 200 Creators" viewed in FreshRSS.
Yet another blog I can read entirely from my feed reader, despite the fact that it doesn’t offer a “feed”.

Takeaways

  • Use Inspect Element to find the elements you want to scrape for.
  • Use $x( ... ) to test your XPath expressions.
  • Remember that most of FreshRSS’s fields ask for expressions relative to the news item and adapt accordingly.
  • If you make a mistake, use “Reload Articles” to pull them again.

Footnotes

1 Boo again!

2 If you need a scraper than executes JavaScript, you need something more-sophisticated. I used to use my very own RSSey for this purpose but nowadays XPath Scraping is sufficient so I don’t bother any more, but RSSey might be a good starting point for you if you really need that kind of power!

3 If you’ve not had the chance to think about it before: View Source shows you the actual HTML code that was delivered from the web server to your browser. This then gets interpreted by the browser to generate the DOM, which might result in changes to it: for example, invalid elements might be removed, ambiguous markup will have an interpretation applied, and so on. The DOM might further change as a result of JavaScript code, browser plugins, and whatever else. When you Inspect Element, you’re looking at the DOM (represented “as if” it were HTML), not the actual underlying HTML

4 The date isn’t in a <time> element nor does it have a class like .post--date or similar.

5 I’ll spare you my thoughts on utility class libraries for now, but they’re… not positive. I can see why people use them, and I’ve even used them myself before… but I don’t think they’re a good thing.

× × × × × × ×

Vmail via FreshRSS

It’s time for… Dan Shares Yet Another FreshRSS XPath Scraping Recipe!

Vmail

I’m a huge fan of the XPath scraping feature of FreshRSS, my favourite feed reader (and one of the most important applications in my digital ecosystem). I’ve previously demonstrated how to use the feature to subscribe to Forward, reruns of The Far Side, and new The Far Side content, despite none of those sites having “official” feeds.

Signup form for VMail from Vole.WTF
Sure, I could have used my selfhosted OpenTrashMail server to convert email into RSS, but I figured XPath scraping would be more-elegant…

Vmail is cool. It’s vole.wtf’s (of ARCC etc. fame) community newsletter, and it’s as batshit crazy as you’d expect if you were to get the kinds of people who enjoy that site and asked them all to chip in on a newsletter.

Totes bonkers.

But email’s not how I like to consume this kind of media. So obviously, I scraped it.

Screenshot showing VMail subscription in FreshRSS
I’m not a monster: I want Vmail’s stats to be accurate. So I signed up with an unmonitored OpenTrashMail account as well. I just don’t read it (except for the confirmation link email). It actually took me a few attempts because there seems to be some kind of arbitrary maximum length validation on the signup form. But I got there in the end.

Recipe

Want to subscribe to Vmail using your own copy of FreshRSS? Here’s the settings you’re looking for –

  • Type of feed source: HTML + XPath (Web scraping)
  • XPath for finding news items: //table/tbody/tr
    It’s just a table with each row being a newsletter; simple!
  • XPath for item title: descendant::a
  • XPath for item content: .
  • XPath for item link (URL): descendant::a/@href
  • XPath for item date: descendant::td[1]
  • Custom date/time format: d M *y
    The dates are in a format that’s like 01 May ’24 – two-digit days with leading zeros, three-letter months, and a two-digit year preceded by a curly quote, separated by spaces. That curl quote screws up PHP’s date parser, so we have to give it a hint.
  • XPath for unique item ID: descendant::th
    Optional, but each issue’s got its own unique ID already anyway; we might as well use it!
  • Article CSS selector on original website: #vmail
    Optional, but recommended: this option lets you read the entire content of each newsletter without leaving FreshRSS.

So yeah, FreshRSS continues to be amazing. And lately it’s helped me keep on top of the amazing/crazy of vole.wtf too.

× ×

Somewhat-Effective Spam Filters

I’ve tried a variety of unusual strategies to combat email spam over the years.

Here are some of them (each rated in terms the geekiness of its implementation and its efficacy), in case you’d like to try any yourself. They’re all still in use in some form or another:

Spam filters

Geekiness: 1/10
Efficacy: 5/10

A colander filters spam email out of a stream of emails.

Your email provider or your email software probably provides some spam filters, and they’re probably pretty good. I use Proton‘s and, when I’m at my desk, Thunderbird‘s. Double-bagging your spam filter only slightly reduces the amount of spam that gets through, but increases your false-positive rate and some non-spam gets mis-filed.

A particular problem is people who email me for help after changing their name on FreeDeedPoll.org.uk, probably because they’re not only “new” unsolicited contacts to me but because by definition many of them have strange and unusual names (which is why they’re emailing me for help in the first place).

Frankly, spam filters are probably enough for many people. Spam filtering is in general much better today than it was a decade or two ago. But skim the other suggestions in case they’re of interest to you.

Unique email addresses

Geekiness: 3/10
Efficacy: 8/10

If you give a different email address to every service you deal with, then if one of them misuses it (starts spamming you, sells your data, gets hacked, whatever), you can just block that one address. All the addresses come to the same inbox, for your convenience. Using a catch-all means that you can come up with addresses on-the-fly: you can even fill a paper form with a unique email address associated with the company whose form it is.

On many email providers, including the ever-popular GMail, you can do this using plus-sign notation. But if you want to take your unique addresses to the next level and you have your own domain name (which you should), then you can simply redirect all email addresses on that domain to the same inbox. If Bob’s Building Supplies wants your email address, give them bobs@yourname.com, which works even if Bob’s website erroneously doesn’t accept email addresses with plus signs in them.

This method actually works for catching people misusing your details. On one occasion, I helped a band identify that their mailing list had been hacked. On another, I caught a dodgy entrepreneur who used the email address I gave to one of his businesses without my consent to send marketing information of a different one of his businesses. As a bonus, you can set up your filtering/tagging/whatever based on the incoming address, rather than the sender, for the most accurate finding, prioritisation, and blocking.

Emails to multiple email addresses reach the same inbox. Spam emails are blocked based on the addresses they're sent to.

Also, it makes it easy to have multiple accounts with any of those services that try to use the uniqueness of email addresses to prevent you from doing so. That’s great if, like me, you want to be in each of three different Facebook groups but don’t want to give Facebook any information (not even that you exist at the intersection of those groups).

Signed unique email addresses

Geekiness: 10/10
Efficacy: 2/10

Unique email addresses introduce two new issues: (1) if an attacker discovers that your Dreamwidth account has the email address dreamwidth@yourname.com, they can probably guess your LinkedIn email, and (2) attackers will shotgun “likely” addresses at your domain anyway, e.g. admin@yourname.com, management@yourname.com, etc., which can mean that when something gets through you get a dozen copies of it before your spam filter sits up and takes notice.

What if you could assign unique email addresses to companies but append a signature to each that verified that it was legitimate? I came up with a way to do this and implemented it as a spam filter, and made a mobile-friendly webapp to help generate the necessary signatures. Here’s what it looked like:

  1. The domain directs all emails at that domain to the same inbox.
  2. If the email address is on a pre-established list of valid addresses, that’s fine.
  3. Otherwise, the email address must match the form of:
    • A string (the company name), followed by
    • A hyphen, followed by
    • A hash generated using the mechanism described below, then
    • The @-sign and domain name as usual

The hashing algorithm is as follows: concatenate a secret password that only you know with a colon then the “company name” string, run it through SHA1, and truncate to the first eight characters. So if my password were swordfish1 and I were generating a password for Facebook, I’d go:

  1. SHA1 ( swordfish1 : facebook) [ 0 ... 8 ] = 977046ce
  2. Therefore, the email address is facebook-977046ce@myname.com
  3. If any character of that email address is modified, it becomes invalid, preventing an attacker from deriving your other email addresses from a single point (and making it hard to derive them given multiple points)

I implemented the code, but it soon became apparent that this was overkill and I was targeting the wrong behaviours. It was a fun exercise, but ultimately pointless. This is the one method on this page that I don’t still use.

Honeypots

Geekiness: 8/10
Efficacy: ?/10

Emails to multiple email addresses reach an inbox, but senders who reach a "honeypot" inbox are blocked from reaching the real inbox.

A honeypot is a “trap” email address. Anybody who emails it get aggressively marked as a spammer to help ensure that any other messages they send – even to valid email addresses – also get marked as spam.

I litter honeypots all over the place (you might find hidden email addresses on my web pages, along with text telling humans not to use them), but my biggest source of honeypots is formerly-valid unique addresses, or “guessed” catch-all addresses, which already attract spam or are otherwise compromised!

I couldn’t tell you how effective it is without looking at my spam filter’s logs, and since the most-effective of my filters is now outsourced to Proton, I don’t have easy access to that. But it certainly feels very satisfying on the occasions that I get to add a new address to the honeypot list.

Instant throwaways

Geekiness: 5/10
Efficacy: 6/10

OpenTrashmail is an excellent throwaway email server that you can deploy in seconds with Docker, point some MX records at, and be all set! A throwaway email server gives you an infinite number of unique email addresses, like other solutions described above, but with the benefit that you never have to see what gets sent to them.

Emails are delivered to an inbox and to a trash can, depending on the address they're sent to. The inbox subscribes to the trash can using RSS.

If you offer me a coupon in exchange for my email address, it’s a throwaway email address I’ll give you. I’ll make one up on the spot with one of my (several) trashmail domains at the end of it, like justgivemethedamncoupon@danstrashmailserver.com. I can just type that email address into OpenTrashmail to see what you sent me, but then I’ll never check it again so you can spam it to your heart’s content.

As a bonus, OpenTrashmail provides RSS feeds of inboxes, so I can subscribe to any email-based service using my feed reader, and then unsubscribe just as easily (without even having to tell the owner).

Summary

With the exception of whatever filters your provider or software comes with, most of these options aren’t suitable for regular folks. But you’re only a domain name (assuming you don’t have one already) away from being able to give unique email addresses to everybody you deal with, and that’s genuinely a game-changer all by itself and well worth considering, in my opinion.

× × × ×

link rel=”blogroll”

Dave Winer kindly let me know about a proposed standard for linking to OPML blogrolls. Given that I added a page containing my blogroll last year, it was easy enough for me to add a tiny bit of code to the header to add support for automatic detection of my blogroll.

<link rel="blogroll" type="text/xml" href="/blogroll.xml" title="Dan Q's blogroll">

Now all we need is some tools that can do such detection!

(You’ll note I’ve added a title attribute: as I discovered the other day, some browsers including ELinks will show all <link>s of unknown rel="..." at the top of the page and I wanted this one to make sense!)

5 Cool Apps for your Unraid NAS

I’ve got a (now four-year-old) Unraid NAS called Fox and I’m a huge fan. I particularly love the fact that Unraid can work not only as a NAS, but also as a fully-fledged Docker appliance, enabling me to easily install and maintain all manner of applications.

A cube-shaped black computer sits next to a battery pack on a laminated floor. A sign has been left atop it, reading "Caution: Generator connected to this installation."
There isn’t really a generator attached to Fox, just a UPS battery backup. The sign was liberated from our shonky home electrical system.

I was chatting this week to a colleague who was considering getting a similar setup, and he seemed to be taking notes of things he might like to install, once he’s got one. So I figured I’d round up five of my favourite things to install on an Unraid NAS that:

  1. Don’t require any third-party accounts (low dependencies),
  2. Don’t need any kind of high-powered hardware (low specs), and
  3. Provide value with very little set up (low learning curve).
Dan, his finger to his lips and his laptop on his knees, makes a "shush" action. A coworker can be seen working behind him.
It’d have been cooler if I’d have secretly written this blog post while sitting alongside said colleague (shh!). But sadly it had to wait until I was home.

Here we go:

Syncthing

I’ve been raving about Syncthing for years. If I had an “everyday carry” list of applications, it’d be high on that list.

Syncthing screenshot for computer Rebel, sharing with Fox, Idiophone, Lemmy and Maxine.
Syncthing’s just an awesome piece of set-and-forget software that facilitates file synchronisation between all of your devices and can also form part of a backup strategy.

Here’s the skinny: you install Syncthing on several devices, then give each the identification key of another to pair them. Now you can add folders on each and “share” them with the others, and the two are kept in-sync. There’s lots of options for power users, but just as a starting point you can use this to:

  • Manage the photos on your phone and push copies to your desktop whenever you’re home (like your favourite cloud photo sync service, but selfhosted).
  • Keep your Obsidian notes in-sync between all your devices (normally costs $4/month).1
  • Get a copy of the documents from all your devices onto your NAS, for backup purposes (note that sync’ing alone, even with versioning enabled, is not a good backup: the idea is that you run an actual backup from your NAS!).

Huginn

You know IFTTT? Zapier? Services that help you to “automate” things based on inputs and outputs. Huginn’s like that, but selfhosted. Also: more-powerful.

Screenshot showing Huginn workflows.
When we first started looking for a dog to adopt (y’know, before we got this derper), I set up Huginn watchers to monitor the websites of several rescue centres, filter them by some of our criteria, and push the results to us in real-time on Slack, giving us an edge over other prospective puppy-parents.

The learning curve is steeper than anything else on this list, and I almost didn’t include it for that reason alone. But once you’ve learned your way around its idiosyncrasies and dipped your toe into the more-advanced Javascript-powered magic it can do, you really begin to unlock its potential.

It couples well with Home Assistant, if that’s your jam. But even without it, you can find yourself automating things you never expected to.

FreshRSS

I’ve written a lot about how and why FreshRSS continues to be my favourite RSS reader. But you know what’s even better than an awesome RSS reader? An awesome selfhosted RSS reader!

FreshRSS screenshot.
Yes, I know I have a lot of “unread” items. That’s fine, and I can tell you why.

Many of these suggested apps benefit well from you exposing them to the open Web rather than just running them on your LAN, and an RSS reader is probably the best example (you want to read your news feeds when you’re out and about, right?). What you need for that is a reverse proxy, and there are lots of guides to doing it super-easily, even if you’re not on a static IP address.2. Alternatively you can just VPN in to your home: your router might be able to arrange this, or else Unraid can do it for you!

Open Trashmail

You know how sometimes you need to give somebody your email address but you don’t actually want to. Like: sure, I’d like you to email me a verification code for this download, but I don’t trust you not to spam me later! What you need is a disposable email address.3

Open Trashmail screenshot showing a subscription to Thanks for subscribing to Dan Q's Spam-Of-The-Hour List!
How do you feel about having infinite email addresses that you can make up on-demand (without even having access to a computer), subscribe to by RSS, and never have to see unless you specifically want to.

You just need to install Open Trashmail, point the MX records of a few domain names or subdomains (you’ve got some spare domain names lying around, right? if not; they’re pretty cheap…) at it, and it will now accept email to any address on those domains. You can make up addresses off the top of your head, even away from an Internet connection when using a paper-based form, and they work. You can check them later if you want to… or ignore them forever.

Couple it with an RSS reader, or Huginn, or Slack, and you can get a notification or take some action when an email arrives!

  • Need to give that escape room your email address to get a copy of your “team photo”? Give them a throwaway, pick up the picture when you get home, and then forget you ever gave it to them.
  • Company give you a freebie on your birthday if you sign up their mailing list? Sign up 366 times with them and write a Huginn workflow that puts “today’s” promo code into your Obsidian notetaking app (Sync’d over Syncthing) but filters out everything else.
  • Suspect some organisation is selling your email address on to third parties? Give them a unique email address that you only give to them and catch them in a honeypot.

YOURLS

Finally: a URL shortener. The Internet’s got lots of them, but they’re all at the mercy of somebody else (potentially somebody in a country that might not be very-friendly with yours…).

YOURLS screenshot (Your Own URL Shortener).
It isn’t pretty, but… it doesn’t need to be! Nobody actually sees the admin interface except you anyway.

Plus, it’s just kinda cool to be able to brand your shortlinks with your own name, right? If you follow only one link from this post, let it be to watch this video that helps explain why this is important: danq.link/url-shortener-highlights.

I run many, many other Docker containers and virtual machines on my NAS. These five aren’t even the “top five” that I use… they’re just five that are great starters because they’re easy and pack a lot of joy into their learning curve.

And if your NAS can’t do all the above… consider Unraid for your next NAS!

Footnotes

1 I wrote the beginnings of this post on my phone while in the Channel Tunnel and then carried on using my desktop computer once I was home. Sync is magic.

2 I can’t share or recommend one reverse proxy guide in particular because I set my own up because I can configure Nginx in my sleep, but I did a quick search and found several that all look good so I imagine you can do the same. You don’t have to do it on day one, though!

3 Obviously there are lots of approachable to on-demand disposable email addresses, including the venerable “plus sign in a GMail address” trick, but Open Trashmail is just… better for many cases.

× × × × × × ×

BBC News… without the crap

Did I mention recently that I love RSS? That it brings me great joy? That I start and finish almost every day in my feed reader? Probably.

I used to have a single minor niggle with the BBC News RSS feed: that it included sports news, which I didn’t care about. So I wrote a script that downloaded it, stripped sports news, and re-exported the feed for me to subscribe to. Magic.

RSS reader showing duplicate copies of the news story "Barbie 2? 'We'd love to,' says Warner Bros boss", and an entry from BBC Sounds.
Lately my BBC News feed has caused me some annoyance and frustration.

But lately – presumably as a result of technical changes at the Beeb’s side – this feed has found two fresh ways to annoy me:

  1. The feed now re-publishes a story if it gets re-promoted to the front pagebut with a different <guid> (it appears to get a #0 after it when first published, a #1 the second time, and so on). In a typical day the feed reader might scoop up new stories about once an hour, any by the time I get to reading them the same exact story might appear in my reader multiple times. Ugh.
  2. They’ve started adding iPlayer and BBC Sounds content to the BBC News feed. I don’t follow BBC News in my feed reader because I want to watch or listen to things. If you do, that’s fine, but I don’t, and I’d rather filter this content out.

Luckily, I already have a recipe for improving this feed, thanks to my prior work. Let’s look at my newly-revised script (also available on GitHub):

#!/usr/bin/env ruby
require 'bundler/inline'

# # Sample crontab:
# # At 41 minutes past each hour, run the script and log the results
# */20 * * * * ~/bbc-news-rss-filter-sport-out.rb > ~/bbc-news-rss-filter-sport-out.log 2>>&1

# Dependencies:
# * open-uri - load remote URL content easily
# * nokogiri - parse/filter XML
gemfile do
  source 'https://rubygems.org'
  gem 'nokogiri'
end
require 'open-uri'

# Regular expression describing the GUIDs to reject from the resulting RSS feed
# We want to drop everything from the "sport" section of the website, also any iPlayer/Sounds links
REJECT_GUIDS_MATCHING = /^https:\/\/www\.bbc\.co\.uk\/(sport|iplayer|sounds)\//

# Load and filter the original RSS
rss = Nokogiri::XML(open('https://feeds.bbci.co.uk/news/rss.xml?edition=uk'))
rss.css('item').select{|item| item.css('guid').text =~ REJECT_GUIDS_MATCHING }.each(&:unlink)

# Strip the anchors off the <guid>s: BBC News "republishes" stories by using guids with #0, #1, #2 etc, which results in duplicates in feed readers
rss.css('guid').each{|g|g.content=g.content.gsub(/#.*$/,'')}

File.open( '/www/bbc-news-no-sport.xml', 'w' ){ |f| f.puts(rss.to_s) }
It’s amazing what you can do with Nokogiri and a half dozen lines of Ruby.

That revised script removes from the feed anything whose <guid> suggests it’s sports news or from BBC Sounds or iPlayer, and also strips any “anchor” part of the <guid> before re-exporting the feed. Much better. (Strictly speaking, this can result in a technically-invalid feed by introducing duplicates, but your feed reader oughta be smart enough to compensate for and ignore that: mine certainly is!)

You’re free to take and adapt the script to your own needs, or – if you don’t mind being tied to my opinions about what should be in BBC News’ RSS feed – just subscribe to my copy at: https://fox.q-t-a.uk/bbc-news-no-sport.xml

×

RSS > ActivityPub

RSS is better than ActivityPub1.

Photograph of a boxing match, but with the heads of the competitors replaced with the ActivityPub and RSS logos (and "AP" or "RSS" written on their clothes, respectively). RSS is delivering a powerful uppercut to ActivityPub.
A devastating blow by RSS against a competitor 19 years his junior! For updates on this bout as it develops, don’t forget to subscribe… using either protocol.

When I subscribe to content, I want:

  • Resilient failsafes. ActivityPub has many points-of-failure. A notification might fail to complete transmission as a result of downtime, faults, or network conditions, and the receiving server might never know. A feed reader, conversely, can tell you that an address 404’d or the server was down.
  • Retroactive access. Once you fix the problem above… you still don’t get the message you missed: it’s probably gone forever – there’s no retroactive access. The same is true when your ActivityPub server connects with a peer for the first time: you only ever get new content after that point. RSS, on the other hand, provides some number of “recent” items the moment you first subscribe.
  • Simple subscriptions. RSS can be served from a statically-hosted single file, which makes it suitable to deploy anywhere as well as consume using anything. It can be read, after a fashion, in anything from Lynx upwards.

RSS ticks all these boxes. If I can choose between RSS and ActivityPub to subscribe to your content, and I don’t need a real-time update, I’m probably going to choose RSS.

About a month later, Matthias Pfefferle wrote a great post that makes a good “next stop” if you’re on a deep dive…

Footnotes

1 I feel like this statement needs a few clarifications and caveats, but my hot take looks spicier if I bury them in a footnote!

  • By RSS, I mean whichever pull-based basic HTTP you like, be that Atom, JSON Feed, h-entry, or even just properly-marked-up HTML5: did you know that the <article> element is intended to be suitable for syndication use?
  • Obviously I appreciate that RSS and ActivityPub are different tools for different jobs, and there are doubtless use-cases for which ActivityPub is clearly the superior solution.
  • I certainly don’t object to services providing both RSS and ActivityPub as syndication options, like Mastodon does, where both might be good choices.
×

They See Me (Blog)Rolling

Tracy Durnell’s post about blogrolls really spoke to me. Like her, I used to think of a blogroll as a list of people you know personally (who happen to blog)1, but the number of bloggers among my immediate in-person circle of friends has shrunk from several dozen to just a handful, and I dropped my blogroll in around 2008.

A white man wearing a spacesuit sits on a pebble beach using a laptop.
On the Internet, a blogger is only as alone as they choose to be.

But my connection to a wider circle has grown, and like Tracy I enjoy the “hardly strangers” connection I feel with the people I follow online. She writes:

While social media emphasizes the show-off stuff — the vacation in Puerto Vallarta, the full kitchen remodel, the night out on the town — on blogs it still seems that people are sharing more than signalling. These small pleasures seem to be offered in a spirit of generosity — this is too beautiful not to share.

Although I may never interact with all the folks whose blogs I follow, reading the same blogger for a long time does build a (one-sided) connection. I may not know you, author, but I am rooting for you. It’s a different modality of relationship than we may be used to in person, but it’s real: a parasocial relationship simmering with the potential for deeper connection, but also satisfying as it exists.

My first bloggy pan pal, Colin Walker, who I started exchanging emails with earlier this month, followed-up on this with an observation that really gets to the heart of the issue (speaking as somebody who’s long said that my blog’s intended audience is, first and foremost, me):

At its core, blogging is a solitary activity with many (if not most) authors claiming that their blog is for them – myself included. Yet, the implication of audience cannot be ignored. Indeed, the more an author embeds themself in the loose community of blogs, by reading and linking to others, the more that implication becomes reality even if not actively pursued via comments or email.

To that end: I’ve started publishing my blogroll again! Follow that link and you’ll see an only-lightly-curated list of all the people (plus some non-personal blogs, vlogs, and webcomics) I follow (that have updated their feeds within the last year2). Naturally, there’s an OPML version too, and I’ve open-sourced the code I used to generate it (although I can’t imagine anybody’s situation is enough like mine for it to be useful).

The page is a little flaky and there’s things I’d like to do to improve it, but I’d rather publish a basic version now and then come back to it with my gardening gloves on another time to improve it.

Maybe my blogroll has some folks on that you might recognise? Or else: maybe you’re only a single random-click away from somebody new you never heard of before!

Footnotes

1 Possibly marked up with XFN to indicate how you’re connected to one another, but I’ve always had a soft spot for XFN.

2 I often retain subscriptions to dormant feeds and it sometimes pays-off, e.g. when I recently celebrated Octopuns’ return after a 9½-year hiatus!

×

The Underground Blog

This is a repost promoting content originally published elsewhere. See more things Dan's reposted.

theunderground.blog is an experimental blog that is only available to read through a feed reader.

If you would like to read the latest posts, you can subscribe to the feed at https://theunderground.blog/feed.xml, using the feed reader of your choice.

Chris first suggested this idea in the footnote of a post that talks about something I’ve been witnessing recently: that blogging seems to be having a renaissance1. I’ve for a few years been telling people that now is the second-best time to start a blog. The best time was, of course, ~20 years ago, but if you missed out first time around (or let your blog die as big social media silos took over): now’s the time to join the growing resurgence!

Anyway, he only went and actually did it! The newest member of RSS Club is likely to be… an entire blog that’s only accessible via a feed reader2.

There’s two posts published so far, and if you want to read them you’ll need to subscribe to theunderground.blog using your feed reader. There’s tips on that page on getting an easy-to-use one if you haven’t already.

Footnotes

1 He also had interesting things to say about OPML, which is a topic close to my heart. I wonder if I ought to start sharing a partial OPML file of my subscriptions?

2 Or by reading the source code, I suppose: on the open Web, that’s always an option. The Web is, indeed, magical.

Stopping WordPress Emoji ‘Images’ in Feeds

After sharing that Octopuns has started posting again after a 9½-year hiatus earlier today, I noticed something odd: where I’d written “I ❤️ FreshRSS“, the heart emoji was huge when viewed in my favourite feed reader.

Screenshot from a web-based RSS reader application, showing recent repost "Groundhog Day". The final line contains a link with the text "I ❤️ FreshRSS", but the red heart emoji seems to be enormous compared to the next adjacent to it.
Why yes, I do subscribe to my own RSS feed. What of it?

It turns out that by default, WordPress replaces emoji in its feeds (and when sending email) with images of those emoji, using the Tweemoji set, and with the alt-text set to the original emoji. These images are hosted at https://s.w.org/images/core/emoji/…-based URLs.

For example, this heart was served with the following HTML code (the number 2764 refers to the codepoint of the emoji):

<img src="https://s.w.org/images/core/emoji/14.0.0/72x72/2764.png"
     alt="❤"
   class="wp-smiley"
   style="height: 1em; max-height: 1em;"
/>

I can see why this functionality was added: what if the feed reader didn’t support Unicode or didn’t have a font capable of showing the appropriate emoji?

But I can also see reasons why it might not be desirable to everybody. For example:

  1. Downloading an image will always be slower than rendering an emoji.
  2. The code to include an image is always more-verbose than simply including an emoji.
  3. As seen above: a feed reader which imposes a minimum size on embedded images might well render one “wrong”.
  4. It’s marginally more-verbose for screen reader users to say “Image: heart emoji” than just “heart emoji”, I imagine.
  5. Serving an third-party image when a feed item is viewed has potential privacy implications that I try hard to avoid.
  6. Replacing emoji with images is probably unnecessary for modern feed readers anyway.

I opted to remove this functionality. I briefly considered overriding the emoji_url filter (which could be used to selfhost the emoji set) but I discovered that I could just un-hook the filters that were being added in the first place.

Here’s what I added to my theme’s functions.php:

remove_filter( 'the_content_feed', 'wp_staticize_emoji' );
remove_filter( 'comment_text_rss', 'wp_staticize_emoji' );

That’s all there is to it. Now, my feed reader shows my system’s emoji instead of a huge image:

Screenshot from a web-based RSS reader application, showing recent repost "Groundhog Day". The final line contains a link with the text "I ❤️ FreshRSS" shown correctly, with a red heart emoji at the appropriate font size.

I’m always grateful to discover that a piece of WordPress functionality, whether core or in an extension, makes proper use of hooks so that its functionality can be changed, extended, or disabled. One of the single best things about the WordPress open-source ecosystem is that you almost never have to edit somebody else’s code (and remember to re-edit it every time you install an update).

Want to hear about other ways I’ve improved WordPress’s feeds?

× ×

Groundhog Day

This is a repost promoting content originally published elsewhere. See more things Dan's reposted.

Partial comic frame showing a groundhog popping its head out of a hole and shouting "RUN!".

After a break of nine and a half years, webcomic Octopuns is back. I have two thoughts:

  1. That’s awesome. I love Octopuns and I’m glad it’s back. If you want a quick taster – a quick slice, if you will – of its kind of humour, I suggest starting with Pizza.
  2. How did I know that Octopuns was back? My RSS reader told me. RSS remains a magical way to keep an eye on what’s happening on the Internet: it’s like a subscription service that delivers you exactly what you want, as soon as it’s available.

I’ve been pleasantly surprised before when my feed reader has identified a creator that’s come back from the dead. I ❤️ FreshRSS.

×

.well-known/feeds

<link rel="alternate" type="application/rss+xml"> is fine, but I feel like there should be a standard for a site, not a page, to share a “list of feeds associated with a site”.

Last night, I dreamed about a way to achieve that: ./well-known/feeds as an OPML document. Here’s mine, and here’s a draft spec.

Interested to hear what Dave Winer thinks…

RSS Zero isn’t the path to RSS Joy

Feed overload is real

The week before last, Katie shared with me that article from last month, Who killed Google Reader? I’d read it before so I didn’t bother clicking through again, but we did end up chatting about RSS a bit1.

Screenshot: Google Reader Notifier popup advises of "461 unread items".
I ditched Google Reader several years before its untimely demise, but I can confirm “461 unread items” was a believable message.

Katie “abandoned feeds a few years ago” because they were “regularly ending up with 200+ unread items that felt overwhelming”.

Conversely: I think that dropping your feed reader because there’s too much to read is… solving the wrong problem.

A white man with dark hair, wearing jeans and a t-shirt, moves to push over a stack of carboard boxes, each smaller than the one beneath it. From bottom to top, the boxes are labelled: stress, email client, mobile pings, doomscrolling, social media silos... and the very top, very smallest box, which glows with sunbeams emitted from it, reads "rss reader".
About half way through editing this image I completely forgot what message I was trying to convey, but I figured I’d keep it anyway and let you come up with your own interpretation.

Dave Rupert last week wrote about his feed reader’s “unread” count having grown to a mammoth 2,000+ items, and his plan to reduce that.

I think that he, like Katie, might be looking at his reader in a different way than I do mine.

FreshRSS sidebar, showing 567 unread items (of which 1 are comics, 2 are friends, 186 are communities, 1 are distractions, 278 are geeky, 1 is "me", 57 are youtube, 13 are strangers, 1 is software, 7 are rss club, 29 are podcasts, and 3 are polyamory. A further 107 are marked as favourites. The "friends" and "rss club" categories are showing warning triangles.
At time of writing, I’ve got 567 unread items. And that’s fine.

RSS is not email!

I’ve been in the position that Katie and David describe: of feeling overwhelmed by the sheer volume of unread items. And I know others have, too. So let me share something I’ve learned sooner:

There’s nothing special about reaching Inbox Zero in your feed reader.

It’s not noble nor enlightened to get to the bottom of your “unread” list.

Your 👏  feed 👏 reader 👏 is 👏 not 👏 an 👏 email 👏 client. 👏

The idea of Inbox Zero as applied to your email inbox is about productivity. Any message in your email might be something that requires urgent action, and you won’t know until you filter through and categorise .

But your RSS reader doesn’t (shouldn’t?) be there to add to your to-do list. Your RSS reader is a list of things you might like to read. In an ideal world, reaching “RSS Zero” would mean that you’ve seen everything on the Internet that you might enjoy. That’s not enlightened; that’s sad!

Google Reader's "Congratulations, you've reached the End of the Internet." Easter Egg screen, shown when all your feeds are empty.
Google Reader understood this, although the word “congratulations” was misplaced.

Use RSS for joy

My RSS reader is a place of joy, never of stress. I’ve tried to boil down the principles that makes it so, and here they are:

  1. Zero is not the target.
    The numbers are to inspire about how much there is “out there” for you, not to enumerate how much work need have to do.
  2. Group your feeds by importance.
    Your feed reader probably lets you group (folder, tag…) your feeds, so you can easily check-in on what you care about and leave other feeds for a rainy day.2 This is good.
  3. Don’t read every article.
    Your feed reader gives you the convenience of keeping content in one place, but you’re not obligated to read every single one. If something doesn’t interest you, mark it as read and move on. No judgement.
  4. Keep things for later.
    Something you want to read, but not now? Find a way to “save for later” to get it out of your main feed so you. Don’t have to scroll past it every day! Star it or tag it3 or push it to your link-saving or note-taking app. I use a link shortener which then feeds back into my feed reader into a “for later” group!
  5. Let topical content expire.
    Have topical/time-dependent feeds (general news media, some social media etc.)? Have reader “purge” unread articles after a time. I have my subscription to BBC News headlines expire after 5 days: if I’ve taken that long to read a headline, it might as well disappear.4
  6. Use your feed reader deliberately.
    You don’t need popup notifications (a new article’s probably already up to an hour stale by the time it hits your reader). We’re all already slaves to notifications! Visit your reader when it suits you. I start and end every day in mine; most days I hit it again a couple of other times. I don’t need a notification: there’s always new content. The reader keeps track of what I’ve not looked at.
  7. It’s not just about text.
    Don’t limit your feed reader to just text. Podcasts are nothing more than RSS feeds with attached audio files; you can keep track in your reader if you like. Most video platforms let you subscribe to a feed of new videos on a channel or playlist basis, so you can e.g. get notified about YouTube channel updates without having to fight with The Algorithm. Features like XPath Scraping in FreshRSS let you subscribe to services that don’t even have feeds: to watch the listings of dogs on local shelter websites when you’re looking to adopt, for example.
  8. Do your reading in your reader.
    Your reader respects your preferences: colour scheme, font size, article ordering, etc. It doesn’t nag you with newsletter signup popups, cookie notices, or ads. Make the most of that. Some RSS feeds try to disincentivise this by providing only summary content, but a good feed reader can work around this for you, fetching actual content in the background.5
  9. Use offline time to catch up on your reading.
    Some of the best readers support offline mode. I find this fantastic when I’m on an aeroplane, because I can catch up on all of the interesting articles I’d not had time to yet while grounded, and my reading will get synchronised when I touch down and disable flight mode.
  10. Make your reader work for you.
    A feed reader is a tool that works for you. If it’s causing you pain, switch to a different tool6, or reconfigure the one you’ve got. And if the way you find joy from RSS is different from me, that’s fine: this is a personal tool, and we don’t have to have the same answer.

And if you’d like to put those tips in your RSS reader to digest later or at your own pace, you can:  here’s an RSS feed containing (only) these RSS tips!

Footnotes

1 You’d  be forgiven for thinking that RSS was my favourite topic, given that so-far-this-year I’ve written about improving WordPress’s feeds, about mathematical quirks in FreshRSS, on using XPath scraping as an RSS alternative (twice), and the joy of getting notified when a vlog channel is ressurected (thanks to RSS). I swear I have other interests.

2 If your feed reader doesn’t support any kind of grouping, get a better reader.

3 If your feed reader doesn’t support any kind of marking/favouriting/tagging of articles, get a better reader.

4 If your feed reader doesn’t support customisable expiry times… well that’s not too unusual, but you might want to consider getting a better reader.

5 FreshRSS calls the feature that fetches actual post content from the resulting page “Article CSS selector on original website”, which is a bit of a mouthful, but you can see what it’s doing. If your feed reader doesn’t support fetching full content… well, it’s probably not that big a deal, but it’s a good nice-to-have if you’re shopping around for a reader, in my opinion.

6 There’s so much choice in feed readers, and migrating between them is (usually) very easy, so everybody can find the best choice for them. Feedly, Inoreader, and The Old Reader are popular, free, and easy-to-use if you’re looking to get started. I prefer a selfhosted tool so I use the amazing FreshRSS (having migrated from Tiny Tiny RSS). Here’s some more tips on getting started. You might prefer a desktop or mobile tool, or even something exotic: part of the beauty of RSS feeds is they’re open and interoperable, so if for example you love using Slack, you can use Slack to push feed updates to you and get almost all the features you need to do everything in my list, including grouping (using channels) and saving for later (using Slackbot/”remind me about this”). Slack’s a perfectly acceptable feed reader for some people!

× × × ×

Better WordPress RSS Feeds

I’ve made a handful of tweaks to my RSS feed which I feel improves upon WordPress’s default implementation, at least in my use-case.1 In case any of these improvements help you, too, here’s a list of them:

Post Kinds in Titles

Since 2020, I’ve decorated post titles by prefixing them with the “kind” of post they are (courtesy of the Post Kinds plugin). I’ve already written about how I do it, if you’re interested.

Screenshot showing a Weekly Digest email from DanQ.me, with two Notes and a Repost clearly-identified.
Identifying post kinds is particularly useful for people who subscribe by email (the emails are generated off the RSS feed either daily or weekly: subscriber’s choice), who might want to see articles and videos but not care about for example checkins and reposts.

RSS Only posts

A minority of my posts are – initially, at least – publicised only via my RSS feed (and places that are directly fed by it, like email subscribers). I use a tag to identify posts to be hidden in this way. I’ve written about my implementation before, but I’ve since made a couple of additional improvements:

  1. Suppressing the tag from tag clouds, to make it harder to accidentally discover these posts by tag-surfing,
  2. Tweaking the title of such posts when they appear in feeds (using the same technique as above), so that readers know when they’re seeing “exclusive” content, and
  3. Setting a X-Robots-Tag: noindex, nofollow HTTP header when viewing such tag or a post, to discourage search engines (code for this not shown below because it’s so very specific to my theme that it’s probably no use to anybody else!).
// 1. Suppress the "rss club" tag from tag clouds/the full tag list
function rss_club_suppress_tags_from_display( string $tag_list, string $before, string $sep, string $after, int $post_id ): string {
  foreach(['rss-club'] as $tag_to_suppress){
    $regex = sprintf( '/<li>[^<]*?<a [^>]*?href="[^"]*?\/%s\/"[^>]*?>.*?<\/a>[^<]*?<\/li>/', $tag_to_suppress );
    $tag_list = preg_replace( $regex, '', $tag_list );
  }
  return $tag_list;
}
add_filter( 'the_tags', 'rss_club_suppress_tags_from_display', 10, 5 );

// 2. In feeds, tweak title if it's an RSS exclusive
function rss_club_add_rss_only_to_rss_post_title( $title ){
  $post_tag_slugs = array_map(function($tag){ return $tag->slug; }, wp_get_post_tags( get_the_ID() ));
  if ( ! in_array( 'rss-club', $post_tag_slugs ) ) return $title; // if we don't have an rss-club tag, drop out here
  return trim( "{$title} [RSS Exclusive!]" );
  return $title;
}
add_filter( 'the_title_rss', 'rss_club_add_rss_only_to_rss_post_title', 6 );

Adding a stylesheet

Adding a stylesheet to your feeds can make them much friendlier to beginner users (which helps drive adoption) without making them much less-convenient for people who know how to use feeds already. Darek Kay and Terence Eden both wrote great articles about this just earlier this year, but I think my implementation goes a step further.

Screenshot of DanQ.me's RSS feed as viewed in Firefox, showing a "Q" logo and three recent posts.
I started with Matt Webb‘s pretty-feed-v3.xsl by Matt Webb (as popularised by AboutFeeds.com) and built from there.

In addition to adding some “Q” branding, I made tweaks to make it work seamlessly with both my RSS and Atom feeds by using two <xsl:for-each> blocks and exploiting the fact that the two standards don’t overlap in their root namespaces. Here’s my full XSLT; you need to override your feed template as Terence describes to use it, but mine can be applied to both RSS and Atom.2

I’ve still got more I’d like to do with this, for example to take advantage of the thumbnail images I attach to posts. On which note…

Thumbnail images

When I first started offering email subscription options I used Mailchimp’s RSS-to-email service, which was… okay, but not great, and I didn’t like the privacy implications that came along with it. Mailchimp support adding thumbnails to your email template from your feed, but WordPress themes don’t by-default provide the appropriate metadata to allow them to do that. So I installed Jordy Meow‘s RSS Featured Image plugin which did it for me.

<item>
        <title>[Checkin] Geohashing expedition 2023-07-27 51 -1</title>
        <link>https://danq.me/2023/07/27/geohashing-expedition-2023-07-27-51-1/</link>

        ...

        <media:content url="/_q23u/2023/07/20230727_141710-1024x576.jpg" medium="image" />
        <media:description>Dan, wearing a grey Three Rings hoodie, carrying French Bulldog Demmy, standing on a path with trees in the background.</media:description>
</item>
Media attachments for RSS feeds are perhaps most-popular for podcasts, but they’re also great for post thumbnail images.

During my little redesign earlier this year I decided to go two steps further: (1) ditching the plugin and implementing the functionality directly into my theme (it’s really not very much code!), and (2) adding not only a <media:content medium="image" url="..." /> element but also a <media:description> providing the default alt-text for that image. I don’t know if any feed readers (correctly) handle this accessibility-improving feature, but my stylesheet above will, some day!

Here’s how that’s done:

function rss_insert_namespace_for_featured_image() {
  echo "xmlns:media=\"http://search.yahoo.com/mrss/\"\n";
}

function rss_insert_featured_image( $comments ) {
  global $post;
  $image_id = get_post_thumbnail_id( $post->ID );
  if( ! $image_id ) return;
  $image = get_the_post_thumbnail_url( $post->ID, 'large' );
  $image_url = esc_url( $image );
  $image_alt = esc_html( get_post_meta( $image_id, '_wp_attachment_image_alt', true ) );
  $image_title = esc_html( get_the_title( $image_id ) );
  $image_description = empty( $image_alt ) ? $image_title : $image_alt;
  if ( !empty( $image ) ) {
    echo <<<EOF
      <media:content url="{$image_url}" medium="image" />
      <media:description>{$image_description}</media:description>
    EOF;
  }
}

add_action( 'rss2_ns', 'rss_insert_namespace_for_featured_image' );
add_action( 'rss2_item', 'rss_insert_featured_image' );

So there we have it: a little digital gardening, and four improvements to WordPress’s default feeds.

RSS may not be as hip as it once was, but little improvements can help new users find their way into this (enlightened?) way to consume the Web.

If you’re using RSS to follow my blog, great! If it’s not for you, perhaps pick your favourite alternative way to get updates, from options including email, Telegram, the Fediverse (e.g. Mastodon), and more…

Update 4 September 2023: More-recently, I’ve improved WordPress RSS feeds by preventing them from automatically converting emoji into images.

Footnotes

1 The changes apply to the Atom feed too, for anybody of such an inclination. Just assume that if I say RSS I’m including Atom, okay?

2 The experience of writing this transformation/stylesheet also gave me yet another opportunity to remember how much I hate working with XSLTs. This time around, in addition to the normal namespace issues and headscratching syntax, I had to deal with the fact that I initially tried to use a feature from XSLT version 2.0 (a 22-year-old version) only to discover that all major web browsers still only support version 1.0 (specified last millenium)!

× ×