XPath Scraping with FreshRSS

I’ve been spending a while running on reduced brain capacity lately so, to ease myself back into thinking like a programmer, I upgraded my preferred feed reader FreshRSS to version 1.20.0 – which was released a couple of weeks ago – and tried out what I believe is its killer new feature: HTML + XPath scraping.

Screenshot showing Beverley Newing's weblog; two articles are visible - Paperback copy of 'Disability Visibility', edited by Alice Wong, next to a cup of tea Setting up an Accessibility Book Club, published on 1 March 2022, and Reflecting on 2021, published on 1 January 2022.
I like to keep up-to-date with my friend Bev’s blog, but they don’t have an RSS feed.

I’ve been using RSS1 for about 20 years and I love it. It feels great to be able to curate my updates based on “what I care about”, and not on “what some social network thinks I should care about”, to keep things to read later, to prioritise effectively based on my own categorisation, to consume content offline and have my to-read list synchronise later, etc.

RSS never went away, of course (what do you think a podcast is?), but it got steamrollered out of the public eye by big companies who make their money out of keeping your eyes on their platforms and off the open Web. But it feels like it’s slowly coming back: even Substack – whose entire thing is that an email client is more-convenient than a feed reader for most people – launched an RSS reader this week!

A smartphone on a wooden surface. The screen shows the FeedMe app, showing the most-recent blog post from Beverley's blog.
My day usually starts in my feed reader, accessed via the FeedMe app from my mobile (although FreshRSS provides a reasonably good responsive interface out-of-the-box!)

I love RSS so much that I routinely retrofit other people’s websites with feeds just so I can subscribe to them: I even published the tool I use to do so! Whether filtering sports headlines out of BBC News, turning retro webcomics into “reading lists” so I can track my progress, or just working around sites that really should have feeds but refuse to, I just love sidestepping these “missing feeds”. My friend Beverley has a blog without any kind of feed, so I added one so I could subscribe to it. Magic.

But with FreshRSS 1.20.0, I no longer have to maintain my own tool to get this brilliant functionality, and I’m overjoyed. Let’s look at how it works by re-subscribing to Beverley’s blog but without a middleware tool.

Screenshot showing FetchRSS being used to graphically create a feed from Beverley's blog.
This post is about to get pretty technical. If you don’t want to learn some XPath but just want to make a feed out of a web page, use a graphical tool like FetchRSS.

In the latest version of FreshRSS, when you add a new feed to your reader, a new section “Type of feed source” is available. Unfold it, and you can change from the default (“RSS / Atom”) to the new option “HTML + XPath (Web scraping)”. Put a human-readable page address rather than a feed address into the “Feed URL” field, then fill in the remaining fields to tell FreshRSS how to parse the page and extract the content you want. Note that it doesn’t matter if the web page isn’t valid XML (e.g. has missing closing tags), because it’s run through PHP’s DOMDocument anyway, which will “correct” for some really sloppy code if needed.
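To get a feel for what that tolerance means in practice, here’s a minimal sketch – my own illustration, not FreshRSS’s actual code – of how PHP’s DOMDocument and DOMXPath can scrape “items” out of imperfect HTML; the URL and XPath expressions are the ones I’ll use for Beverley’s blog below:

<?php
// Minimal sketch only (not FreshRSS's actual implementation): how PHP's
// DOMDocument + DOMXPath can pull "feed items" out of even-invalid HTML.
$html = file_get_contents('https://webdevbev.co.uk/blog.html');
$doc  = new DOMDocument();
libxml_use_internal_errors(true); // don't spam warnings about sloppy markup
$doc->loadHTML($html);            // parses and silently "corrects" the HTML
$xpath = new DOMXPath($doc);

// One query finds each "feed item"; queries scoped to each item then pull
// out its parts, mirroring the "Finding items"/"Item titles" fields below:
foreach ($xpath->query('//li[@class="blog__post-preview"]') as $item) {
    echo $xpath->evaluate('string(descendant::h2)', $item), "\n";
}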

Browser debugger running document.evaluate('//li[@class="blog__post-preview"]', document).iterateNext() on Beverley's weblog and getting the first blog entry.
You can use your browser’s debugger to help check your XPath rules: here I’ve run document.evaluate('//li[@class="blog__post-preview"]', document).iterateNext() and got back the first blog post on the page, so I know I’m on the right track.
You’ll need to use XPath to express how to find a “feed item” on the page. Here are the rules I used for https://webdevbev.co.uk/blog.html (many of these fields were optional – I didn’t have to do this much work):

  • Feed title: //h1
    I override this anyway in FreshRSS, so I could just have used a string, but I wanted the XPath practice. There’s only one <h1> on the page, and it can be considered the “title” of the feed.
  • Finding items: //li[@class="blog__post-preview"]
    Each “post” on the page is an <li class="blog__post-preview">.
  • Item titles: descendant::h2
    Each post has a <h2> which is the post title. The descendant:: axis scopes the search to each post found by the rule above.
  • Item content: descendant::p[3]
    Beverley’s static site generator template puts the post summary in the third paragraph of the <li>, which we can select like this.
  • Item link: descendant::h2/a/@href
    This expects a URL, so we need the /@href to make sure we get the value of the <h2><a href="...">, rather than its contents.
  • Item thumbnail: descendant::img[@class="blog__image--preview"]/@src
    Again, this expects a URL, which we get from the <img src="...">.
  • Item author: "Beverley Newing"
    Beverley’s blog doesn’t host any guest posts, so I just use a string literal here.
  • Item date: substring-after(descendant::p[@class="blog__date-posted"], "Date posted: ")
    This is the only complicated one: the published dates on Beverley’s blog aren’t explicitly marked-up, but form part of a string that begins with the words “Date posted: ”, so I use XPath’s substring-after function to strip this prefix. The result gets passed to PHP’s strtotime(), which is pretty tolerant of different date formats (although not of the words “Date posted:”, it turns out – see the sketch below!).
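If you want to convince yourself that this date-handling chain works, here’s a quick hypothetical test of the same logic in PHP (FreshRSS does the substring-after step in XPath, of course; substr() just stands in for it here):

<?php
// Hypothetical re-creation of the date chain described above.
$scraped  = 'Date posted: 1 March 2022';               // as it appears on the page
$datePart = substr($scraped, strlen('Date posted: ')); // "1 March 2022"
var_dump(strtotime($datePart)); // int(...): a valid Unix timestamp
var_dump(strtotime($scraped));  // bool(false): the prefix trips strtotime up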
Screenshot: Adding a "HTML + XPath (Web scraping)" feed via FreshRSS.
I’d love one day for FreshRSS to provide some kind of “preview” feature here so you can see what you’ll expect to get back, as you work. That, and support for different input types (JSON, perhaps?), other selector syntaxes (I find CSS-style selectors much simpler than XPath), and maybe even an option to execute Javascript on the page before scraping (I use this in my own toolchain, but that’s just because I want to have my cake and eat it too). But this is still all pretty awesome.

I hope that this is just the beginning for this new killer feature in FreshRSS: there’s so much more it can be and do. But for now, I’m still mighty impressed that I can begin to phase-out my use of my relatively resource-intensive feed-building middleware and use my feed reader to do more and more of the heavy lifting for which I love it so much.

I also love that this effectively adds h-feed support by the back door. I’d still prefer there to be a “h-feed” option in the “Type of feed source” drop-down, but at least I can add such support manually, now!

Beverley's blog post "Setting up an Accessibility Book Club" in FreshRSS.
The finished result: Bev’s blog posts appear directly in my feed reader, even though they don’t have a feed, and now without going through the middleware I’d set up for that purpose.

Footnotes

1 When I say RSS, I mean feed. Most of the feeds I subscribe to are RSS feeds, but some are Atom feeds, h-feed, etc. But I can’t get over the old-fashioned name, and I don’t care to try.

Making an RSS feed of YOURLS shortlinks

As you might know if you were paying close attention in Summer 2019, I run a “URL shortener” for my personal use. You may be familiar with public URL shorteners like TinyURL and Bit.ly: my personal URL shortener is basically the same thing, except that only I am able to make short-links with it. Compared to public ones, this means I’ve got a larger corpus of especially-short (e.g. 2/3 letter) codes available for my personal use. It also means that I’m not dependent on the goodwill of a free siloed service and I can add exactly the features I want to it.

Diagram showing the relationships of the DanQ.me ecosystem. Highlighted is the injection of links into the "S.2" link shortener and the export of these shortened links by RSS into FreshRSS.
Little wonder then that my link shortener sat so close to me on my ecosystem diagram the other year.

For the last nine years my link shortener has been S.2, a tool I threw together in Ruby. It stores URLs in a sequentially-numbered database table and then uses the Base62-encoding of the primary key as the “code” part of the short URL. Aside from the fact that when I create a short link it shows me a QR code so I can easily “push” a page to my phone, it doesn’t really have any “special” features. It replaced S.1, from which it primarily differed by putting the code at the end of the URL rather than as part of the domain name, e.g. s.danq.me/a0 rather than a0.s.danq.me: I made the switch because S.1 made HTTPS a real pain as well as only supporting Base36 (owing to the case-insensitivity of domain names).
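If you’ve not met Base62 before: it just treats the sequential row ID as a number in base 62, using digits plus both letter cases as the “digits”. Here’s a rough sketch of the idea in PHP (S.2 itself is Ruby, and I haven’t verified the exact alphabet ordering either tool uses, so treat this as illustrative only):

<?php
// Illustrative only: the alphabet ordering here is an assumption.
function base62_encode(int $n): string {
    $alphabet = '0123456789abcdefghijklmnopqrstuvwxyz'
              . 'ABCDEFGHIJKLMNOPQRSTUVWXYZ';
    $code = '';
    do {
        $code = $alphabet[$n % 62] . $code; // least-significant "digit" first
        $n = intdiv($n, 62);
    } while ($n > 0);
    return $code;
}
echo base62_encode(1939); // "vh" – two characters, even for link number 1,939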

But S.2’s gotten a little long in the tooth and as I’ve gotten busier/lazier, I’ve leant into using or adapting open source tools more-often than writing my own from scratch. So this week I switched my URL shortener from S.2 to YOURLS.

Screenshot of YOURLS interface showing Dan Q's list of shortened links. Six are shown of 1,939 total.
YOURLS isn’t the prettiest tool in the world, but then it doesn’t have to be: only I ever see the interface pictured above!

One of the things that attracted me to YOURLS was that it had a ready-to-go Docker image. I’m not the biggest fan of Docker in general, but I do love the convenience of being able to deploy applications super-quickly to my household NAS. This makes installing and maintaining my personal URL shortener much easier than it used to be (and it was pretty easy before!).

Another thing I liked about YOURLS is that it, like S.2, uses Base62 encoding. This meant that migrating my links from S.2 into YOURLS could be done with a simple cross-database INSERT... SELECT statement:

INSERT INTO yourls.yourls_url(keyword, url, title, `timestamp`, clicks)
  SELECT shortcode, url, title, created_at, 0 FROM danq_short.links

But do you know what’s a bigger deal for my lifestack than my URL shortener? My RSS reader! I’ve written about it a lot, but I use RSS for just about everything and my feed reader is my first, last, and sometimes only point of contact with the Web! I’m so hooked-in to my RSS ecosystem that I’ll use my own middleware to add feeds to sites that don’t have them, or for which I’m not happy with the feed they provide, e.g. stripping sports out of BBC News, subscribing to webcomics that don’t provide such an option (sometimes accidentally hacking into sites on the way), and generating “complete” archives of series’ of posts so I can use my reader to track my progress.

One of S.1/S.2’s features was that it exposed an RSS feed at a secret URL for my reader to ingest. This was great, because it meant I could “push” something to my RSS reader to read or repost to my blog later. YOURLS doesn’t have such a feature, and I couldn’t find anything in the (extensive) list of plugins that would do it for me. I needed to write my own.

Partial list of Dan's RSS feed subscriptions, including Jeremy Keith, Jim Nielson, Natalie Lawhead, Bruce Schneier, Scott O'Hara, "Yahtzee", BBC News, and several podcasts, as well as (highlighted) "Dan's Short Links", which has 5 unread items.
In some ways, subscribing “to yourself” is a strange thing to do. In other ways… shut up, I’ll do what I like.

I could have written a YOURLS plugin. Or I could have written a stack of code in Ruby, PHP, Javascript or some other language to bridge these systems. But as I switched over my shortlink subdomain s.danq.me to its new home at danq.link, another idea came to me. I have direct database access to YOURLS (and the table schema is super simple) and the command-line MariaDB client can output XML… could I simply write an XML Transformation to convert database output directly into a valid RSS feed? Let’s give it a go!

I wrote a script like this and put it in my crontab:

mysql --xml yourls -e                                                                                                                     \
      "SELECT keyword, url, title, DATE_FORMAT(timestamp, '%a, %d %b %Y %T') AS pubdate FROM yourls_url ORDER BY timestamp DESC LIMIT 30" \
    | xsltproc template.xslt -                                                                                                            \
    | xmllint --format -                                                                                                                  \
    > output.rss.xml

The first part of that command connects to the yourls database, sets the output format to XML, and executes an SQL statement to extract the most-recent 30 shortlinks. The DATE_FORMAT function is used to mould the datetime into something approximating the RFC-822 standard for datetimes as required by RSS. The output produced looks something like this:

<?xml version="1.0"?>
<resultset statement="SELECT keyword, url, title, DATE_FORMAT(timestamp, '%a, %d %b %Y %T') AS pubdate FROM yourls_url ORDER BY timestamp DESC LIMIT 30" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <row>
        <field name="keyword">VV</field>
        <field name="url">https://webdevbev.co.uk/blog/06-2021/perfect-is-the-enemy-of-good.html</field>
        <field name="title"> Perfect is the enemy of good || Web Dev Bev</field>
        <field name="pubdate">Sun, 26 Sep 2021 17:38:32</field>
  </row>
  <row>
        <field name="keyword">VU</field>
        <field name="url">https://webdevlaw.uk/2021/01/30/why-generation-x-will-save-the-web/</field>
        <field name="title">Why Generation X will save the web – Hi, I’m Heather Burns</field>
        <field name="pubdate">Sun, 26 Sep 2021 17:38:26</field>
  </row>

  <!-- ... etc. ... -->
  
</resultset>

We don’t see this, though. It’s piped directly into the second part of the command, which uses xsltproc to apply an XSLT to it. I was concerned that my XSLT experience would be super rusty as I haven’t actually written any since working for my former employer SmartData back in around 2005! Back then, my coworker Alex and I spent many hours doing XML backflips to implement a system that converted complex data outputs into PDF files via an XSL-FO intermediary.

I needn’t have worried, though. Firstly: it turns out I remember a lot more than I thought from that project a decade and a half ago! But secondly, this conversion from MySQL/MariaDB XML output to RSS turned out to be pretty painless. Here’s the template.xslt I ended up making:

<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:template match="resultset">
    <rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
      <channel>
        <title>Dan's Short Links</title>
        <description>Links shortened by Dan using danq.link</description>
        <link> [ MY RSS FEED URL ] </link>
        <atom:link href=" [ MY RSS FEED URL ] " rel="self" type="application/rss+xml" />
        <lastBuildDate><xsl:value-of select="row/field[@name='pubdate']" /> UTC</lastBuildDate>
        <pubDate><xsl:value-of select="row/field[@name='pubdate']" /> UTC</pubDate>
        <ttl>1800</ttl>
        <xsl:for-each select="row">
          <item>
            <title><xsl:value-of select="field[@name='title']" /></title>
            <link><xsl:value-of select="field[@name='url']" /></link>
            <guid>https://danq.link/<xsl:value-of select="field[@name='keyword']" /></guid>
            <pubDate><xsl:value-of select="field[@name='pubdate']" /> UTC</pubDate>
          </item>
        </xsl:for-each>
      </channel>
    </rss>
  </xsl:template>
</xsl:stylesheet>

That uses the first (i.e. most-recent) shortlink’s timestamp as the feed’s pubDate, which makes sense: unless you’re going back and modifying links, there are no changes more recent than the creation date of the most-recent shortlink. Then it loops through the returned rows and creates an <item> for each; simple!

The final step in my command runs the output through xmllint to prettify it. That’s not strictly necessary, but it was useful while debugging and as the whole command takes milliseconds to run once every quarter hour or so I’m not concerned about the overhead. Using these native binaries (plus a little configuration), chained together with pipes, had already resulted in way faster performance (with less code) than if I’d implemented something using a scripting language, and the result is a reasonably elegant “scratch your own itch”-type solution to the only outstanding barrier that was keeping me on S.2.

All that remained for me to do was set up a symlink so that the resulting output.rss.xml was accessible, over the web, to my RSS reader. I hope that next time I’m tempted to write a script to solve a problem like this I’ll remember that sometimes a chain of piped *nix utilities can provide me a slicker, cleaner, and faster solution.

Update: Right as I finished writing this blog post I discovered that somebody had already solved this problem using PHP code added to YOURLS; it’s just not packaged as a plugin so I didn’t see it earlier! Whether I use this alternate approach or stick with what I’ve got, the process of implementing this YOURLS database ➡ XML ➡ XSLT ➡ RSS chain was fun and informative.


Stumbling

This article is a repost promoting content originally published elsewhere. See more things Dan's reposted.

I’ve been changing my relationship to being online.

Some of it is keeping in touch with friends who are fascinated by the same sorts of hybrid creations I am. Friends who build things. Friends in different professional communities. Paying attention when they mention some new discovery or avenue of interest.

Some of it is using an RSS reader to change the cadence and depth of my consumption—pulling away from the quick-hit likes of social media in favor of a space where I can run my thoughts to their logical conclusion (and then sit on them long enough to consider whether or not they’re true).

I wish I could get more people to see the value in the “slow Web”. The participatory Web. The creative Web. The personalised Web.

When you use an app to browse a “stream” in most social media, you’re seeing a list of posts curated to keep you watching, keep you seeing adverts, keep you on the app so that as much personal data as possible can be leeched from your behaviour. If it feels satisfying and especially if it feels addictive, the social network has done its job, but don’t be fooled: its job is not to improve social connections – its job is to keep you from doing anything else.

You don’t have to use the Web this way. You can subscribe to the content creators and topics that actually interest you. You can get that content on basically any device or medium you like, or across a mixture: want notifications by email? Slack? IRC? Discord? In a browser? In an app? As-it-happens or digests? You can filter for what interests you most at any given moment, save content for later, and resharing is supported thanks to an old-school invention called a “URL”. And you’ll see fewer ads and experience less misuse of your behavioural data.

Sure, there’s a learning curve. But it’s worth it. I wish I could get more people to see that.

I’m happy to see that Lucy Bellwood does.


Watched the pilot of Webbed Briefs by @heydonworks (of Every Layout fame). It’s a sarcastic independent vlog about web technologies, so I immediately fell in love and subscribed to the feed…

Just kidding. It doesn’t have a feed! (Yet?)

Webbed Briefs logo


Identifying Post Kinds in WordPress RSS Feeds

I use the Post Kinds plugin to streamline the management of the different types of posts I make on my blog, based on the IndieWeb post types list: articles, like this one, are “conventional” blog posts, but I also publish notes (which are analogous to “tweets”), reposts (“shares” of things I’ve found online, sometimes with commentary), checkins (mostly chronicling my geocaching/geohashing), and others: I’ve extended Post Kinds to facilitate comics and reviews, for example.

But for people who subscribe (either directly or indirectly) to everything I post, I imagine it must be a little frustrating to sometimes be unable to identify the type of a post before clicking-through. So I’ve added the following code, which I’m sharing here and on GitHub in case it’s of any use to anybody else, to my theme’s functions.php:

// Make titles in RSS feed be prefixed by the Kind of the post.
function add_kind_to_rss_post_title(){
	$kinds = wp_get_post_terms( get_the_ID(), 'kind' );
	if( empty( $kinds ) || is_wp_error( $kinds ) ) return get_the_title(); // sanity-check: no kinds found, or the lookup failed.
	$kind = $kinds[0]->name;
	$title = get_the_title();
	return trim( "[{$kind}] {$title}" );
}
add_filter( 'the_title_rss', 'add_kind_to_rss_post_title', 4 ); // priority 4 to ensure it happens BEFORE default escaping filters.

This decorates the titles of my posts, but only in my feeds, so it’s easier for people to tell at-a-glance what’s going on:

Rendered RSS feed showing Post Kinds prefixes

Down the line I might expand this so that it doesn’t show if the subscriber is, for example, asking only for articles (e.g. via this feed); I’m coming up with a huge list of things I’d like to do at IndieWebCamp London! But for now, this feels like a nice simple improvement to a plugin I love that helps it to fit my specific needs.
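As a hypothetical sketch of what that expansion might look like (untested; it leans on WordPress’s standard is_tax() check to detect taxonomy-specific feeds like /kind/article/feed/), you might swap the filter registration for something like this:

<?php
// Hypothetical variant: skip the "[kind]" prefix when the subscriber has
// already asked for a single kind of post, where the prefix would be
// redundant. This would replace the add_filter() call shown earlier.
function add_kind_to_rss_post_title_unless_filtered(){
	if ( is_tax( 'kind' ) ) return get_the_title(); // kind-specific feed: no prefix needed
	return add_kind_to_rss_post_title();            // otherwise, prefix as before
}
add_filter( 'the_title_rss', 'add_kind_to_rss_post_title_unless_filtered', 4 );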


Subscribing to The Far Side (by RSS!)

Prior to his retirement in 1995 I managed to amass a collection of almost all of Gary Larson’s The Far Side books as well as a couple of calendars and other thingamabobs. After 24 years of silence I didn’t expect to hear anything more from him, and so I was as surprised as most of the Internet was when he re-emerged last year with his first ever website. Woah.

The Far Side by Gary Larson

Larson’s hinted that there might be new and original content there someday, but for the time being I’m just loving that I can read The Far Side comics (legitimately) via the Web for the first time! The site’s currently publishing a “Daily Dose” of classic strips, which is awesome. But… I don’t want to have to go to a website to get comics every day. Nor do I want to have to remember which days I’ve already caught up with. That’s a job for computers, right? And it’s a solved problem: RSS (which has been around for almost as long as Larson hasn’t) and similar technologies allow a website to publicise that it’s got updates available in a way that people can “subscribe” to, so I should just use that, right?

Except… the new The Far Side website doesn’t have an RSS feed. Boo! Luckily, I’m not above automating the creation of feeds for websites that I wish had them, even (or perhaps especially) where that involves a little reverse-engineering of online comics. So with a little thanks to my RSS middleware RSSey… I can now read daily The Far Side comics in the way that’s most-convenient to me: right alongside my other subscriptions in my feed reader.

The Far Side comics in Dan's RSS reader.
How screen scrapers are made.

I’m afraid I’m not going to publicly*-share a ready-to-go feed URL for this one, unlike my BBC News Without The Sport feed, because a necessary side-effect of the way it works is that the ads are removed. And if I were to republish a feed containing The Far Side website cartoons but with the ads stripped I’d be guilty of, like, all the ethical and legal faults that Larson was trying to mitigate by putting his new website up in the first place! I love The Far Side and I certainly don’t want to violate its copyright!

But – at least until Larson’s web developer puts up a proper feed (with or without ads) – for those of us who like our comics delivered fresh to us every morning, here’s the source code (as an RSSey feed definition) you could use to run your own personal-use-only “give me The Far Side Daily Dose as an RSS feed” middleware.

Thanks for deciding to join us on the Internet, Gary. I hear it’s going to be a big thing, someday!

* Friends are welcome to contact me off-blog for an address if they like, if they promise to be nice and ethical about it.


Nice simple design, @typehut, but with no RSS, Atom, nor h-feed you’ve made a blog platform to which readers can only subscribe by email. 😢


Subscribe by Email

For the last few months, I’ve been running an alpha test of an email-based subscription to DanQ.me with a handful of handpicked testers. Now, I’d like to open it up to a slightly larger beta test group. If you’d like to get the latest from this site directly in your inbox, just provide your email address below:

Subscribe by email!

Who’s this for?

Some people prefer to use their email inbox to subscribe to things. If that’s you: great!

What will I receive?

You’ll get a “daily digest”, no more than once per day, summarising everything I’ve published within the last 24 hours. It usually works; occasionally (but not often) it misses things. You can unsubscribe with one click at any time.

How else can I subscribe?

You can still subscribe in a variety of other ways. Personally, I recommend using a feed reader which lets you choose exactly which kinds of content you’re interested in, but there are plenty of options including Facebook and Twitter (for those of such an inclination).

Didn’t you do this before?

Yes, I ran a “subscribe by email” system back in 2007 but didn’t maintain it. Things might be better this time around. Maybe.


LABS Comic RSS Archive

Yesterday I recommended that you go read Aaron Uglum‘s webcomic LABS which had just completed its final strip. I’m a big fan of “completed” webcomics – they feel binge-able in the same way a complete Netflix series does! – but Spencer quickly pointed out that it’s annoying for us enlightened modern RSS users who hook RSS up to everything to have to binge completed comics in a different way to reading ongoing ones: what he wanted was an RSS feed covering the entire history of LABS.

LABS comic adapted to show The Robot literally "feeding" RSS
With apologies to Aaron Uglum who I hope won’t mind me adapting his comic in this way.

So naturally (after the intense heatwave woke me early this morning anyway) I made one: complete RSS feed of LABS. And, of course, I open-sourced the code I used to generate it so that others can jumpstart their projects to make static RSS feeds from completed webcomics, too.

Even if you’re not going to read it via this medium, you should go read LABS.


DanQ.me Ecosystem

Diagram illustrating the relationships between DanQ.me and the satellite services with which it interacts.

With IndieWebCamp Oxford 2019 scheduled to take place during the Summer of Hacks, I drew a diagram (click to embiggen) of the current ecosystem that powers and propagates the content on DanQ.me. It’s mostly for my own benefit – to be able to get a big-picture view of the ways my website talks to the world and plan for what improvements I might be able to make in the future… but it also works as a vehicle to explain what my personal corner of the IndieWeb does and how it does it. Here’s a summary:

DanQ.me

Since fifteen years ago today, DanQ.me has been powered by a self-hosted WordPress installation. I know that WordPress isn’t “hip” on the IndieWeb this week and that if you’re not on the JAMstack you’re yesterday’s news, but at 15 years and counting my love affair with WordPress has lasted longer than any romantic relationship I’ve ever had with another human being, so I’m sticking with it. What’s cool in Web technologies comes and goes, but what’s important is solid, dependable tools that do what you need them to, and between WordPress, half a dozen off-the-shelf plugins and about a dozen homemade ones I’ve got everything I need right here.

Castle of the Four Winds, launched in 1998, with a then-fashionable black background.
I’d been “blogging” – not that we called it that, yet – since late 1998, but my original collection of content-mangling Perl scripts wasn’t all that. More history…

I write articles (long posts like this) and notes (short, “tweet-like” updates) directly into the site, and just occasionally other kinds of content. But for the most part, different kinds of content come from different parts of the ecosystem, as described below.

RSS reader

DanQ.me sits at the centre of the diagram, but it’s worth remembering that the diagram is deliberately incomplete: it only contains information flows directly relevant to my blog (and it doesn’t even contain all of those!). The last time I tried to draw a diagram like this that described my online life in general, then my RSS reader found its way to the centre. Which figures: my RSS reader is usually the first and often the last place I visit on the Internet, and I’ve worked hard to funnel everything through it.

FreshRSS with 129 unread items
129 unread items is a reasonable-sized queue: I try to process to “RSS zero”, but there are invariably things I want to return to on a second-pass and I’ve not yet reimplemented the “snooze button” I added to my previous RSS reader.

Right now I’m using FreshRSS – plus a handful of plugins, including some homemade ones – as my RSS reader: I switched from Tiny Tiny RSS about a year ago to take advantage of FreshRSS’s excellent responsive themes, among other features. Because some websites don’t have RSS feeds, even where they ought to, I use my own tool RSSey to retroactively “fix” people’s websites for them, dynamically adding feeds for my consumption. It’s also a nice reminder that open source and remixability were cornerstones of the original Web. My RSS reader collates information from a variety of sources and additionally gives me a one-click mechanism to push content I enjoy to my blog as a repost.

QTube

QTube is my video hosting platform; it’s a PeerTube node. If you haven’t seen it, that’s fine: most content on it is consumed indirectly either through my YouTube channel or directly on my blog as posts of the “video” kind. Also, I don’t actually vlog very often. When I do publish videos onto QTube, their republication onto YouTube or DanQ.me is optional: sometimes I plan to use a video inside an article post, for example, and so don’t need to republish it by itself.

QTube homepage showing Dan's videos
I recently changed the blue of my “brand colours” to improve accessibility, but this hasn’t carried over to QTube yet.

I’m gradually exporting or re-uploading my backlog of YouTube videos from my current and previous channels to QTube in an effort to recentralise and regain control over their hosting, but I’m in no real hurry. PeerTube certainly makes it easy, though!

Link Shortener

I operate a private link shortener which I mostly use for the expected purpose: to make links shorter and so easier to read out and memorise or else to make them take up less space in a chat window. But soon after I set it up, many years ago, I realised that it could also act as a mechanism to push content to my RSS reader to “read later”. And by the time I’m using it for that, I figured, I might as well also be using it to repost content to my blog from sources that aren’t things my RSS reader subscribes to. This leads to a process that’s perhaps unnecessarily complex: if I want to share a link with you as a repost, I’ll push it into my link shortener and mark it as going “to me”, then I’ll tell my RSS reader to push it to my blog and there it’ll be published to the world! But it works and it’s fast enough: I’m not in the habit of reposting things that are time-critical anyway.

Checkins

Dan geohashing
You know your sport is fringe when you need to reference another fringe sport to describe it. “Geohashing? It’s… a little like geocaching, but…”

I’ve been involved in brainstorming ways in which the act of finding (or failing to find, etc.) a geocache or reaching (or failing to reach) a geohashpoint could best be represented as a “checkin”, and last year I open-sourced my plugin for pulling logs (with as much automation as is permitted by the terms of service of some of the silos involved) from geocaching websites and posting them to WordPress blogs: effectively PESOS-for-geocaching. I’d prefer to be publishing on my own blog in the first instance, but syndicating my adventures from various silos into my blog is “good enough”.

Syndication

New notes get pushed out to my Twitter account, for the benefit of my Twitter-using friends. Articles get advertised on Facebook, Twitter and LiveJournal (yes, really) in teaser form, for the benefit of friends who prefer to get notifications via those platforms. Facebook have been fucking around with their APIs and terms of service lately and this is now less-automatic than it used to be, which is a bit of an annoyance. My RSS feeds carry copies of content out to people who prefer to subscribe via that medium, and I’ve also been using this to power an experimental MailChimp “daily digest” mailing list of “what Dan’s been up to” to a small number of friends, right in their email inboxes: I’ve not made it available to everybody yet, but if you’re happy to help test it then give me a shout and I’ll hook you up.

DanQ.me email newsletter
Most days don’t see an email sent or see an email with only one item, but some days – like this one – are busier. I still need to update the brand colours here, too!

Finally, a couple of IFTTT recipes push my articles and my reposts to Reddit communities: I don’t really use Reddit myself, any more, but I’ve got friends in a few places there who prefer to keep up-to-date with what I’m up to via that medium. For historical reasons, my reposts to Reddit don’t go directly via my blog’s RSS feeds but “shortcut” directly from my RSS reader: this is suboptimal because I don’t get to tweak post titles for Reddit but it’s not a big deal.

IFTTT recipe pushing articles to Reddit
What IFTTT does isn’t magic, but it’s often indistinguishable from it.

I used to syndicate content to Google+ (before it joined the long list of Things Google Have Killed) and to Ello (but it never got much traction there). I’ve probably historically syndicated to other places too: I’ve certainly manually-republished content to other blogs, from time to time, too.

Backfeed

I use Ryan Barrett‘s excellent Brid.gy to convert Twitter replies and likes back into Webmentions for publication as comments on my blog. This used to work for Facebook, too, but again: Facebook fucked it over. I’ve occasionally manually backfed significant Facebook comments, but it’s not ideal: I might like to look at using similar technologies to RSSey to subvert Facebook’s limitations.

Brid.gy's management of my Twitter backfeed
I’ve never had a need for Brid.gy’s “publishing” (i.e. POSSE) features, but its backfeed features “just work”, and it’s awesome.

Reintegration

I’ve routinely retroactively reintegrated content that I’ve produced elsewhere on the Web. This includes my previous blogs (which is why you can browse my archives, right here on this site, all the way back to some of the cringeworthy angsty-teenager posts I made in the 1990s) but also some Reddit posts, some replies originally posted directly to other people’s blogs, all my old del.icio.us bookmarks, long-form forum posts, posts I made to mailing lists and newsgroups, and more. As a result, there’s a lot of backdated content on this site, nowadays: almost a million words, and significantly more than the 600,000 or so I counted a few years ago, before my biggest push for reintegration!

Cumulative wordcount per day, by content type.
Cumulative wordcount per day, by content type. The lion’s share has always been articles, but reposts are creeping up as I’ve been writing more about the things I reshare, lately. It’d be interesting to graph the differentiation of this chart to see the periods of my life that I was writing the most: I have a hypothesis, and centralising my own content under my control makes it easier

Why do I do this? Because I really, really like owning my identity online! I’ve tried the “big” silo alternatives like Facebook, Twitter, Medium, Instagram etc., and they’ve eventually always led to disappointment, either because they get shut down or otherwise made-unusable, because of inappropriately-applied “real names” policies, because they give too much power to untrustworthy companies, because they impose arbitrary limitations on my content, because they manipulate output promotion (and exacerbate filter bubbles), or because they make the walls of their walled gardens taller and stop you integrating with them how you used to.

A handful of silos have shown themselves to be more-trustworthy than the average – in particular, eschewing techniques that promote “lock-in” – and I’d love to tell you more about them and what I think you should look for in a silo, another time. But for now: suffice to say that just like I don’t use YouTube like most people do, I elect not to use Facebook or Twitter in the conventional ways either. And it’s awesome, thanks.

There are plenty of reasons that people choose to take control of their own Web presence – and everybody who puts content online ought to consider it – but I imagine that few individuals have such a complicated publishing ecosystem as I do! Now that you’ve got a picture of how my digital content production workflow works, perhaps you’ll start owning your online identity, too.


I Don’t Watch YouTube (Like You Watch YouTube)

I was watching a recent YouTube video by Derek Muller (Veritasium), My Video Went Viral. Here’s Why, and I came to a realisation: I don’t watch YouTube like most people – probably including you! – watch YouTube. And as a result, my perspective on what YouTube is and does is fundamentally biased from the way that others probably think about it.

The Veritasium video My Video Went Viral. Here’s Why is really good and you should definitely watch at least 7 minutes of it in order to influence the algorithm.

The magic moment came for me when his video explained that the “subscribe” button doesn’t do what I’d assumed it does. I’m probably not alone in my assumptions: I’ll bet that people who use the “subscribe” button as YouTube intend don’t all realise that it works the way that it does.

Like many, I’d assumed the “subscribe” button says “I want to know about everything this creator publishes”. But that’s not what actually happens. YouTube wrangles your subscription list and (especially) your recommendations based on their own metrics using an opaque algorithm. I knew, of course, that they used such a thing to manage the list of recommended next-watches… but I didn’t realise how big an influence it was having on the way that most YouTube users choose what they’ll watch!

Veritasium explains how the YouTube subscriber model has changed over time
“YouTube started doing some experiments… where they would change what was recommended to your subscribers. No longer was a subscription like ‘I want to see every video by this person’; it was more of a suggestion…”

YouTube’s metrics for “what to show to you” are, of course, biased by your subscriptions. But they’re also biased by what’s “trending” (which in turn is based on watch time and click-through-rate), what people-who-watch-the-things-you-watch watch, subscription commonalities, regional trends, what your contacts are interested in, and… who knows what else! AAA YouTubers try to “game” it, but the goalposts are moving. And the struggle to stay on-top, especially after a fluke viral hit, leads to the application of increasingly desperate and clickbaity measures.

This is a battle to which I’ve been mostly oblivious, until now, because I don’t watch YouTube like you watch YouTube.

Veritasium explains the YouTube "frontpage" algorithm.
“You could be a little bit disappointed in the way the game is working right now… I challenge you to think of a better way.”
Hold my beer.

Tom Scott produced an underappreciated sci-fi short last year describing a theoretical AI which, in 2028, caused problems as a result of its single-minded focus. What we’re seeing in YouTube right now is a simpler example, but illustrates the problem well: optimising YouTube’s algorithm for any factor or combination of factors other than a user’s stated preference (subscriptions) will necessarily result in the promotion of videos to a user other than, and at the expense of, the ones by creators that they’ve subscribed to. And there are so many things that YouTube could use as influencing factors. Off the top of my head, there’s:

  • Number of views
  • Number of likes
  • Ratio of likes to dislikes
  • Number of tracked shares
  • Number of saves
  • Length of view
  • Click-through rate on advertisements
  • Recency
  • Subscriber count
  • Subscriber engagement
  • Popularity amongst your friends
  • Popularity amongst your demographic
  • Click-through-ratio
  • Etc., etc., etc.
Veritasium videos in my RSS reader
A Veritasium video I haven’t watched yet? Thanks, RSS reader.

But this is all alien to me. Why? Well: here’s how I use YouTube:

  1. Subscription: I subscribe to creators via RSS. My RSS reader doesn’t implement YouTube’s algorithm, of course, so it just gives me exactly what I subscribe to – no more, no less. It’s not perfect (for example, it pisses me off every time it tells me about an upcoming “premiere”, a YouTube feature I don’t care about even a little), but apart from that it’s great! If I’m on-the-move and can’t watch something as long and involved as TheraminTrees‘ latest deep-thinker, my RSS reader remembers so I can watch it later at my convenience. I can have National Geographic‘s videos “expire” if I don’t watch them within a week but Dr. Doe‘s wait for me forever. And I can implement my own filters if a feed isn’t showing exactly what I’m looking for (like I did to strip the sport section from BBC News’ RSS feed). I’m in control.
  2. Discovery: I don’t rely on YouTube’s algorithm to find me new content. I don’t mind being a day or two behind on what’s trending: I’m not sure I care at all? I’m far more-interested in recommendations curated by a human. If I discover and subscribe to a channel on YouTube, it was usually (a) mentioned by another YouTuber or (b) linked from a blog or community blog. I’m receiving recommendations from people I already respect, and they have a way higher hit-rate than YouTube’s recommendations. (I also sometimes discover content because it’s exactly what I searched for, e.g. I’m looking for that tutorial on how to install a fiddly damn kiddy seat into the car, but this is unlikely to result in a subscription.)
Robot with a computer.
I for one welcome our content-recommending robot overlords. (So long as their biases can be configured by their users, not the networks that create them…)

This isn’t yet-another-argument that you should use RSS because it’s awesome. (Okay, so it is. RSS isn’t dead, and its killer feature is that its users get to choose how it works. But there’s more I want to say.)

What I wanted to share was this reminder, for me, that the way you use a technology can totally blind you to the way others use it. I had no idea that many YouTube creators and some YouTube subscribers felt increasingly like they were fighting YouTube’s algorithms, whose goals are different from their own, to get what they want. Now I can see it everywhere! Why do schmoyoho always encourage me to press the notification bell and not just the subscribe button? Because for a typical YouTube user, that’s the only way that they can be sure that their latest content will be seen!

Veritasium encourages us to "ring that bell".
“There is one way… to short-circuit this effect… ring that bell.”
If I may channel Yoda for a moment: No… there is another!

Of course, the business needs of YouTube mean that we’re not likely to see any change from them. So until either we have mainstream content-curating AIs that answer to their human owners rather than to commercial networks (robot butler, anybody?) or else the video fediverse catches on – and I don’t know which of those two are least-likely! – I guess I’ll stick to my algorithm-lite subscription model for YouTube.

But at least now I’ll have a better understanding of why some of the channels I follow are changing the way they produce and market their content…


BBC News… without the sport

I love RSS, but it’s a minor niggle for me that if I subscribe to any of the BBC News RSS feeds I invariably get all the sports news, too. Which’d be fine if I gave even the slightest care about the world of sports, but I don’t.

Sports on the BBC News site
Down with Things Like This!

It only takes a couple of seconds to skim past the sports stories that clog up my feed reader, but because I like to scratch my own itches, I came up with a solution. It’s more-heavyweight perhaps than it needs to be, but it does the job. If you’re just looking for a BBC News (UK) feed but with sports filtered out you’re welcome to share mine: https://fox.q-t-a.uk/bbc-news-no-sport.xml (previously at https://f001.backblazeb2.com/file/Dan–Q–Public/bbc-news-nosport.rss).

If you’d like to see how I did it so you can host it yourself or adapt it for some similar purpose, the code’s below or on GitHub:

#!/usr/bin/env ruby

# # Sample crontab:
# # At 41 minutes past each hour, run the script and log the results
# 41 * * * * ~/bbc-news-rss-filter-sport-out.rb > ~/bbc-news-rss-filter-sport-out.log 2>&1

# Dependencies:
# * open-uri - load remote URL content easily
# * nokogiri - parse/filter XML
# * b2       - command line tools, described below
require 'bundler/inline'
gemfile do
  source 'https://rubygems.org'
  gem 'nokogiri'
end
require 'open-uri'
require 'tempfile' # for Tempfile.new, used below

# Regular expression describing the GUIDs to reject from the resulting RSS feed
# We want to drop everything from the "sport" section of the website
REJECT_GUIDS_MATCHING = /^https:\/\/www\.bbc\.co\.uk\/sport\//

# Assumption: you're set up with a Backblaze B2 account with a bucket to which
# you'd like to upload the resulting RSS file, and you've configured the 'b2'
# command-line tool (https://www.backblaze.com/b2/docs/b2_authorize_account.html)
B2_BUCKET = 'YOUR-BUCKET-NAME-GOES-HERE'
B2_FILENAME = 'bbc-news-nosport.rss'

# Load and filter the original RSS
rss = Nokogiri::XML(URI.open('https://feeds.bbci.co.uk/news/rss.xml?edition=uk'))
rss.css('item').select{|item| item.css('guid').text =~ REJECT_GUIDS_MATCHING }.each(&:unlink)

begin
  # Output resulting filtered RSS into a temporary file
  temp_file = Tempfile.new
  temp_file.write(rss.to_s)
  temp_file.close

  # Upload filtered RSS to a Backblaze B2 bucket
  result = `b2 upload_file --noProgress --contentType application/rss+xml #{B2_BUCKET} #{temp_file.path} #{B2_FILENAME}`
  puts Time.now
  puts result.split("\n").select{|line| line =~ /^URL by file name:/}.join("\n")
ensure
  # Tidy up after ourselves by ensuring we delete the temporary file
  temp_file.close
  temp_file.unlink
end

bbc-news-rss-filter-sport-out.rb

When executed, this Ruby code:

  1. Fetches the original BBC news (UK) RSS feed and parses it as XML using Nokogiri
  2. Filters it to remove all entries whose GUID matches a particular regular expression (removing all of those from the “sport” section of the site)
  3. Outputs the resulting feed into a temporary file
  4. Uploads the temporary file to a bucket in Backblaze‘s “B2” repository (think: a better-value competitor to S3); the bucket I’m using is publicly-accessible so anybody’s RSS reader can subscribe to the feed

I like the versatility of the approach I’ve used here and its ability to perform arbitrary mutations on the feed. And I’m a big fan of Nokogiri. In some ways, this could be considered a lower-impact, less real-time version of my tool RSSey. Aside from the fact that it won’t (easily) handle websites that require Javascript, this approach could probably be used in exactly the same ways as RSSey, and with significantly less set-up: I might look into whether its functionality can be made more-generic so I can start using it in more places.


@davejthorp the RSS feeds at dave-thorp.me.uk (e.g. for posts and comments, presumably among others) are broken


Your RSS is grass: Mozilla euthanizes feed reader, Atom code in Firefox browser, claims it’s old and unloved

This article is a repost promoting content originally published elsewhere. See more things Dan's reposted.

When Firefox 64 arrives in December, support for RSS, the once celebrated content syndication scheme, and its sibling, Atom, will be missing.

“After considering the maintenance, performance and security costs of the feed preview and subscription features in Firefox, we’ve concluded that it is no longer sustainable to keep feed support in the core of the product,” said Gijs Kruitbosch, a software engineer who works on Firefox at Mozilla, in a blog post on Thursday.

Not a great sign, but understandable. Live Bookmarks was never strong enough to be a full-featured RSS reader, and I don’t know about you, but I haven’t really made use of bookmarks for a good few years, let alone “live” bookmarks. Still, the media are likely to see this (as El Reg does, in the article) as another nail in the coffin of one of the best syndication mechanisms the Web ever came up with.


The Rise and Demise of RSS

This article is a repost promoting content originally published elsewhere. See more things Dan's reposted.

There are two stories here. The first is a story about a vision of the web’s future that never quite came to fruition. The second is a story about how a collaborative effort to improve a popular standard devolved into one of the most contentious forks in the history of open-source software development.

In the late 1990s, in the go-go years between Netscape’s IPO and the Dot-com crash, everyone could see that the web was going to be an even bigger deal than it already was, even if they didn’t know exactly how it was going to get there. One theory was that the web was about to be revolutionized by syndication. The web, originally built to enable a simple transaction between two parties—a client fetching a document from a single host server—would be broken open by new standards that could be used to repackage and redistribute entire websites through a variety of channels. Kevin Werbach, writing for Release 1.0, a newsletter influential among investors in the 1990s, predicted that syndication “would evolve into the core model for the Internet economy, allowing businesses and individuals to retain control over their online personae while enjoying the benefits of massive scale and scope.”1 He invited his readers to imagine a future in which fencing aficionados, rather than going directly to an “online sporting goods site” or “fencing equipment retailer,” could buy a new épée directly through e-commerce widgets embedded into their favorite website about fencing.2 Just like in the television world, where big networks syndicate their shows to smaller local stations, syndication on the web would allow businesses and publications to reach consumers through a multitude of intermediary sites. This would mean, as a corollary, that consumers would gain significant control over where and how they interacted with any given business or publication on the web.

RSS was one of the standards that promised to deliver this syndicated future. To Werbach, RSS was “the leading example of a lightweight syndication protocol.”3 Another contemporaneous article called RSS the first protocol to realize the potential of XML.4 It was going to be a way for both users and content aggregators to create their own customized channels out of everything the web had to offer. And yet, two decades later, RSS appears to be a dying technology, now used chiefly by podcasters and programmers with tech blogs. Moreover, among that latter group, RSS is perhaps used as much for its political symbolism as its actual utility. Though of course some people really do have RSS readers, stubbornly adding an RSS feed to your blog, even in 2018, is a reactionary statement. That little tangerine bubble has become a wistful symbol of defiance against a centralized web increasingly controlled by a handful of corporations, a web that hardly resembles the syndicated web of Werbach’s imagining.

The future once looked so bright for RSS. What happened? Was its downfall inevitable, or was it precipitated by the bitter infighting that thwarted the development of a single RSS standard?

I’ve always been a huge fan of RSS, and I use it for just about everything (I’ll even hack it into services that don’t supply it natively, just to make them fit around my workflow). But even I’ve got to admit that – outside of podcasts – it’s not done well at retaining mainstream appeal, especially after the death of Google Reader. Right now, most people seem content to get their updates from their social media circles, and take a manual approach (ugh) to reading content in the few other places that matter to them. That’s problematic for all kinds of reasons, and I’m perfectly happy to be one of those old fuddy-duddies who likes his web standards open and independent!
