How You Read My Content (The Answers)

This is a repost promoting content originally published elsewhere. See more things Dan's reposted.

Reading type pie chart

What this tells me?

Well, quite a lot, actually. It tells me that there’s loads of you fine people reading the content on this site, which is very heart-warming. It also tells me that RSS is by far the main way people consume my content. Which is also fantastic, as I think RSS is very important and should always be a first class citizen when it comes to delivering content to people.

I didn’t get a chance to participate in Kev’s survey because, well, I don’t target “RSS Zero” and I don’t always catch up on new articles – even by authors I follow closely – until up to a few weeks after they’re published1. But needless to say, I’d have been in the majority: I follow Kev via my feed reader2.

But I was really interested by this approach to understanding your readership: like Kev, I don’t run any kind of analytics on my personal sites. But he’s onto something! If you want to learn about people, why not just ask them?

Okay, there’s going to be a bias: maybe readers who subscribe by RSS are simply more-likely to respond to a survey? Or are more-likely to visit new articles quickly, which was definitely a factor in this short-lived survey? It’s hard to be certain whether these or other factors might have thrown-off Kev’s results.

But then… what isn’t biased? Were Kev running, say, Google Analytics (or Fathom, or Strike, or Hector, or whatever)… then I wouldn’t show up in his results because I block those trackers3 – another, different, kind of bias.

We can’t dodge such bias: not using popular analytics platforms, and not by surveying users. But one of these two options is, at least, respectful of your users’ privacy and bandwidth.

I’m tempted to run a similar survey myself. I might wait until after my long-overdue redesign – teased here – launches, though. Although perhaps that’s just a procrastination stemming from my insecurity that I’ll hear, like, an embarrassingly-low number of responses like three or four and internalise it as failing some kind of popularity contest4! Needs more thought.

Footnotes

1 I’m happy with this approach: I enjoy being able to treat my RSS reader as sort-of a “magazine”, using my categorisations of feeds – which are partially expressed on my Blogroll page – as a theme. Like: “I’m going to spend 20 minutes reading… tech blogs… or personal blogs by people I know personally… or indieweb-centric content… or news (without the sports, of course)…” This approach makes consuming content online feel especially deliberate and intentional: very much like being in control of what I read and when.

2 In fact, it’s by doing so – with a little help from Matthias Pfefferle – that I was inspired to put a “thank you” message in my RSS feed, among other “secret” features!

3 In fact, I block all third-party JavaScript (and some first-party JavaScript!) except where explicitly permitted, but even for sites that I do allow to load all such JavaScript I still have to manually enable analytics trackers if I want them, which I don’t. Also… I sandbox almost all cookies, and I treat virtually all persistent cookies as session cookies and I delete virtually all session cookies 15 seconds after I navigate away from a its sandbox domain or close its tab… so I’m moderately well-anonymised even where I do somehow receive a tracking cookie.

4 Perhaps something to consider after things have gotten easier and I’ve caught up with my backlog a bit.

BBC Sports News (without the crap)

For the last few years I’ve been running a proxy of the BBC News RSS feeds (https://bbc-feeds.danq.dev) that strips out duplicate content, non-news content, and (optionally) sports news.

This weekend, for the first time, somebody asked if it could produce an edition that included only the sports content. Which turned out to be slightly more difficult, because it’s the kind of scope-creep that my “uninterested-in-sports-news” brain couldn’t conceive that anybody would want! But I got there in the end.

If anybody’s looking for their fill of “BBC News Feeds… But Better!”, give it a look.

Pushing RSS to WhatsApp (for free?)

I’m always keen to experiment with new ways to subscribe to my blogObviously RSS is the best and everybody who can use it should. But some people, for their own reasons1, prefer to learn about updates to their favourite sites via the Fediverse, or Facebook, or Telegram, or… I don’t know… LiveJournal or something (yes, those are all places you can follow this blog, if you really wanted to).

But I’m pretty sure there are some people who’d rather receive updates to my blog via WhatsApp. And now, they can. Here’s how I set up an RSS-to-WhatsApp gateway, in case you want to run one of your own2.

Diagram showing how GitHub Actions request and receive an RSS feed, push an update to Whapi, which pushes the update to WhatsApp.

Prerequisites

You will need:

  1. A GitHub account – a free one is fine
  2. A Whapi account connected to your WhatsApp account3 – when you set up an account you’ll get a free trial; when it ends you need to find the link to say that you want to carry on with the free tier (or upgrade to the paid tier if you expect to send more messages than the free tier’s limit)
  3. A WhatsApp channel to which you want to push your RSS feed: I’d recommend that you make a newsletter (from the Updates tab in WhatsApp, press the kekab menu then Create Channel) rather than a traditional group: groups are designed for multiple people to talk and discuss and everybody can see one another’s identity, but a newsletter keeps everybody’s identity private and only allows the administrator(s) permission to post updates.
A WhatsApp 'Subscription Channel'/'Newsletter' group for DanQ.me, showing some recent posts, on a mobile screen.
You probably want to use the kind of channel that’s for one-to-many ‘push’ communication, not a discussion group.

Instructions

  1. Fork the Dan-Q/rss-to-whapi.cloud repository into your own GitHub account.
  2. In Settings > Secrets and Variables > Actions, add two new Repository Secrets:
    • WHATSAPP_API_TOKEN: set to the token on your Whapi dashboard
    • WHATSAPP_CHANNEL: set to your newsletter ID (will look like 123456789012345678@newsletter) or group ID (will look like 123456789012345678@g.us): you can get this from the Newsletters or Groups section of Whapi by executing a test GET /newsletters or GET /groups request4.
  3. Make a feeds.json file (a feeds.json.example is provided as a guide) containing the URLs of the RSS feeds you’d like to subscribe to.
  4. Do a test run: from the Actions tab select the “Process feeds” action and click “Run workflow”. If it finishes successfully (and you get the WhatsApp message), you’re done! If it fails, click on the failed action and drill-in to the failed task to see the error message and correct accordingly.

By default, the processor will run on-demand and every 30 minutes, but you can modify that in .github/workflows/process-feeds.yml. It’s configured to send the single oldest un-sent item in any of the RSS feeds it’s subscribed to, on each run (it tracks which ones it’s sent already by their guids, in a "seen": [...] array in feeds.json): sending a single link per run ensures that WhatsApp’s link previews work as expected. At that rate, you could theoretically run it once every 10 minutes and never hit the 150-messages-per-day limit of Whapi’s free tier5), but you’ll want to work out your own optimal rate based on the anticipated update frequency of your feeds and the number of RSS-to-WhatsApp channels you’re running.

You can, of course, run it on your own infrastructure in a similar way. Just check out the repository to your local system with Ruby 3.2+ running, run bundle to install the dependencies, then set up a cron job or some other automation to run ./process_feeds.rb. Doing this could be used to hook it up to your RSS feed updating pipeline, for example, to check for new feed items right after a new post is published.

Footnotes

1 Their own incomprehensible, illogical, weird reasons.

2 I hope that the title gives it away, but you can do this completely for free. So long as you keep your fork of the GitHub repository open-source then you can run GitHub Actions for free, and so long as you’re pushing out no more than 150 updates per day to no more than 5 different channels in a month then you can do it within Whapi’s free tier: that’s probably fine for a personal blogger, and there’s a reasonable pricing structure (plus some value-added extras) for companies that want to use this same workflow as part of a grander WhatsApp offering.

3 Setting this up requires giving Whapi access to your WhatsApp account. If you don’t like the security implications of that, you could get a cheap eSIM, set that up with WhatsApp, and use that account: if you do this, just remember to “warm up” your new WhatsApp account with some conversations with yourself so it doesn’t look so much like a spammer! Also note that the way Whapi works “uses up” one of the ~4 devices on which you can simultaneously use WhatsApp Web/WhatsApp Desktop etc.

4 Prefer the command-line? So long as you’ve got curl and jq then you can get a list of your newsletters (or groups) and their IDs with curl -H 'Authorization: Bearer YOUR_API_TOKEN' -H 'accept: application/json' https://gate.whapi.cloud/newsletters?count=100 | jq '.newsletters[] | { id: .id, name: .name }' or curl -H 'Authorization: Bearer YOUR_API_TOKEN' -H 'accept: application/json' https://gate.whapi.cloud/groups?count=100 | jq '.groups[] | { id: .id, name: .name }', respectively.

5 Going beyond the free tier would require sending one message, on average, every 9 minutes and 36 seconds.

Ten Pointless Facts About Me

This has been doing the rounds; I last saw it on Kev’s blog. I like that the social blogosphere’s doing this kind of fun activity again, these days1.

1. Do you floss your teeth?

Umm… sometimes? Not as often as I should. Don’t tell my dentist!

Usually at least once a month, never more than once a week. I really took to heart some advice that if you’re using a fluoridated mouthwash then you shouldn’t do it close to when you brush your teeth (or you counteract the benefits), so my routine is that… when I remember and can be bothered to floss… I’ll floss and mouthwash, but like in the middle of the day.

And since I moved my bedroom (and bathroom) one floor further up our house, it’s harder to find the motivation to do so! So I’m probably flossing less. The unanticipated knock-on effect of extending your house!

2. Tea, coffee, or water?

I love a coffee to start a workday, but I have to be careful how much I consume because caffeine hits me pretty hard, even after a concentrated effort over the last 10 years or so to gradually increase my tolerance. I can manage a couple of mugs in the morning and be fine, now, but three coffees… or any in the mid-afternoon onwards… and I’m at risk of throwing off my ability to sleep later2.

I keep a bottle of water wherever I work to try to encourage myself to hydrate, because I’ve got medical evidence to show that I don’t drink enough water! It sometimes works.

3. Footwear preference?

Basic trainers for everyday use; comfortable boots for hiking; slippers for when I’m working. Nothing special.

I wear holes in footwear (and everything else I wear) faster than anybody I know, so nowadays I go for good-value comfort over any other considerations when buying shoes.

A French Bulldog looks-on guiltily at a hand holding the remains of a pair of slippers that have been thoroughly shredded.
One time it was the dog’s fault that my footwear fell apart, but usually they do so by themselves.

4. Favourite dessert?

Varies, but if we’re eating out, I’m probably going to be ordering the most-chocolatey dessert on the menu.

5. The first thing you do when you wake up?

The very first thing I do when I wake up is check how long it is before I need to get up, and make a decision about when I’m going to do so. I almost never need my alarm to wake me: I routinely wake up half an hour or so before my alarm would go off, most mornings. But exactly how early I wake directly impacts what I do next. If I’m well-rested and it’s early enough, I’ll plan on getting up and doing something productive: an early start to work, or some voluntary work for Three Rings, or some correspondence. If it’s close to the time I need to get up I’ll more-often just stay in bed and spend longer doing the actual answer I should give…

…because the “real” answer is probably: pick up my phone, and open up FreshRSSalmost always the first and last thing I do online in a day! I’ll skim the news and blogosphere and “set aside” for later anything I’d like to re-read or look at later on.

6. Age you’d like to stick at?

Honestly, I’m good where I am, thanks.

Sure, I was fitter and healthier in my 20s, and I had more free time in my early 30s… and there are certainly things I miss and get nostalgic about in any era of my life. But conversely: it took me a long, long time to “get my shit together” to the level I have now, and I wouldn’t want to have to go through all of the various bits of self-growth, therapy, etc. all over again!

So… sure, I’d be happy to transplant my intellect into 20-year-old me and take advantage of my higher energy level of the time for an extra decade or so3. But I wouldn’t go back even a decade if it meant that I had to go relearn and go through everything from that decade another time, no thanks!

7. How many hats do you own?

Four. Ish.

Composite of four images of Dan, a white man with long hair and a beard. He's wearing a hoodie with a picture of Fluttershy (from My Little Pony: Friendship is Magic) wearing the iconic armour from the Elder Scrolls: Skyrim video game. In each of the four pictures he's wearing a different hat: a rainbow-striped bandana, a blackcap with the word 'GEEK' on the front in white lettering, a warm furry hat, and a purple woolen hat with a "Woo" logo.

They are:

  1. A bandana. Actually, I own maybe half a dozen bandanas, mostly in Pride rainbow colours. Bandanas are amazingly versatile: they fold small which suits my love of travelling light these last few years, they can function as headgear, dust mask, neckerchief, flannel, etc.4, and they do a pretty good job of keeping my head cool and protecting my growing bald spot from the fierce rays of the summer sun.
  2. A “geek” hat. Okay, I’ve actually got three of these, too, in slightly different designs. When they first started appearing at Oxford Geek Nights, I just kept winning them! I’m not a huge fan of caps, so mostly the kids wear them… although I do put one on when I’m collecting takeaway food so I can get away with just putting e.g. “geek hat” in the “name” field, rather than my name5.
  3. A warm hat that comes out only when the weather is incredibly cold, or when I’m skiing. As I was reminded while skiing on my recent trip to Finland, I should probably switch to wearing a helmet when I ski, but I’ve been skiing for three to four decades without one and I find the habit hard to break.6
  4. A wooly hat that I was given by a previous employer at a meetup in Mexico last year. I wore it a couple of times last winter but it’s otherwise not seen much use.

8. Describe the last photo you took?

The last photo I took was of myself wearing a “geek” hat. You’ve seen it, it’s above!

But the one before that was this picture of an extremely large bottle of champagne, with a banana for scale, that was delivered to my house earlier today:

A six-litre bottle of champagne, wrapped in bubble wrap and surrounded by packing peanuts, in a wooden transport case, with a banana resting atop it.
A 6-litre champagne bottle is properly-termed a Methuselah, after Noah’s grandad I guess.

Ruth and JTA celebrate their anniversary every few years with the “next size up” of champagne bottle, and this is the one they’re up to. This year, merely asking me to help them drink it probably won’t be sufficient (that’d still be two litres each!) so we’re probably going to have to get some friends over.

I took the photo to send to Ruth to reassure her that the bottle had arrived safely, after the previous attempt went… less well. I added the banana “for scale” before sharing the photo with some other friends, too.

A wooden case containing a completely smashed 6-litre champagne bottle.
The previous delivery… didn’t go so well. 😱

9. Worst TV show?

PAW Patrol. No doubt.

You know all those 1980s kids TV shows that basically existed for no other purpose than as a marketing vehicle for a range of toys? I’m talking He-Man (and She-Ra), TransformersG.I. JoeCare BearsM.A.S.K.Rainbow Brite, and My Little Pony. Well, those shows look good compared to PAW Patrol.

3D render of a boy and six dogs (each dressed as a representative of a different service) - the PAW Patrol. Ugh.
Six pups, each endowed with exactly one personality trait7 but a plethora of accessories and vehicles which expands every season so that no matter how many toys you’ve got, y0u’re always behind the curve.

I was delighted when our kids graduated from PAW Patrol to My Little Pony: Friendship is Magic because it’s an enormously better show (the songs kick ass, too) and we could finally shake off the hollow, pointless, internally-inconsistent advertisement that is PAW Patrol.

10. As a child, what was your aspiration for adulthood?

This is the single most-boring thing about me, and I’ve doubtless talked about it before. At some point between the age of about six and eight years old, I decided that I wanted to grow up and become… a computer programmer.

And then I designed the entirety of the rest of my education around that goal. I learned a variety of languages and paradigms under my own steam while setting myself up for a GCSE in IT, and then A-Levels in Maths and Computing, and then a Degree in Computer Science, and by the time I’d done all of that I was already working in the industry: self-actualised by 21.

Like I said: boring!

Your turn!

You should give this pointless quiz a go too. Ping/Webmention me if you do (or comment below, I suppose); I’d love to read what you write.

Footnotes

1 They’re internet memes, in the traditional sense, but sadly people usually use “meme” nowadays exclusively to describe image memes, and not other kinds of memetic Internet content. Just another example of our changing Internet language, which I’ve written about before. Sometimes they were silly quizzes (wanna know what Meat Loaf song I am?); sometimes they were about you and your friends. But images, they weren’t: that came later.

2 Or else I’ll get a proper jittery heart-flutter going!

3 I wouldn’t necessarily even miss the always-on, in-your-pocket, high-speed Internet of today: the Internet was pretty great back then, too!

4 Obviously an intergalactic hitch-hiker should include a bandana, perhaps as well as an equally-versatile towel, in their toolkit.

5 It’s not about privacy, although that’s a fringe benefit I suppose: mostly it’s about getting my food quicker! If I walk into Dominos wearing a geek hat and they’ve got pizza on the counter with a label on it that says it’s for “geek hat”, they’ll just hand it over, no questions, and I’m in-and-out in seconds.

6 JTA observed that similar excuses were used by people who resisted the rollout of mandatory seatbelt usage in cars, so possibly I’m the “bad guy” here.

7 From left to right, the single personality traits for each of the pups are (a) doesn’t like water, (b) is female, (c) likes naps, (d) is allergic to cats, (e) is clumsy, and (f) is completely fucking pointless.

× × × × ×

Unread 4.5.2

Last month, my friend Gareth observed that the numbered lists in my blog posts “looked wrong” in his feed reader. I checked, and I decided I was following the standards correctly and it must have been his app that was misbehaving.

App Store page for Unread: An RSS Reader by Golden Hill Software. What's New for version 4.5.2 includes a bullet point, highlighted, which reads 'This update improves display of articles from danq.me."

So he contacted the authors of Unread, his feed reader, and they fixed it. Pretty fast, I’ve got to say. And I was amused to see that I’m clearly now a test case because my name’s in their release notes!

×

Feed Readers Beat Doomscrolling

The news has, in general, been pretty terrible lately.

Like many folks, I’ve worked to narrow the focus of the things that I’m willing to care deeply about, because caring about many things is just too difficult when, y’know, nazis are trying to destroy them all.

I’ve got friends who’ve stopped consuming news media entirely. I’ve not felt the need to go so far, and I think the reason is that I already have a moderately-disciplined relationship with news. It’s relatively easy for me to regulate how much I’m exposed to all the crap news in the world and stay focussed and forward-looking.

The secret is that I get virtually all of my news… through my feed reader (some of it pre-filtered, e.g. my de-crappified BBC News feeds).

FreshRSS screenshot showing a variety of feeds categorised as Communities, Distractions, Geeky, YouTube, News, Strangers, etc. Posts from yesterday and today are visible.
I use FreshRSS and I love it. But really: any feed reader can improve your relationship with the Web.

Without a feed reader, I can see how I might feel the need to “check the news” several times a day. Pick up my phone to check the time… glance at the news while I’m there… you know how to play that game, right?

But with a feed reader, I can treat my different groups of feeds like… periodicals. The news media I subscribe to get collated in my feed reader and I can read them once, maybe twice per day, just like a daily newspaper. If an article remains unread for several days then, unless I say otherwise, it’s configured to be quietly archived.

My current events are less like a firehose (or sewage pipe), and more like a bottle of (filtered) water.

Categorising my feeds means that I can see what my friends are doing almost-immediately, but I don’t have to be disturbed by anything else unless I want to be. Try getting that from a siloed social network!

Maybe sometimes I see a new breaking news story… perhaps 12 hours after you do. Is that such a big deal? In exchange, I get to apply filters of any kind I like to the news I read, and I get to read it as a “bundle”, missing (or not missing) as much or as little as I like.

On a scale from “healthy media consumption” to “endless doomscrolling”, proper use of a feed reader is way towards the healthy end.

If you stopped using feeds when Google tried to kill them, maybe it’s time to think again. The ecosystem’s alive and well, and having a one-stop place where you can enjoy the parts of the Web that are most-important to you, personally, in an ad-free, tracker-free, algorithmic-filtering-free space that you can make your very own… brings a special kind of peace that I can highly recommend.

×

Note #25736

After “fixing” BBC News’ RSS feeds I noticed that I was seeing less news (and, somehow, stressing less over everything happening in the USA). Turns out that in switching myself to my new system I’d subscribed to the UK edition, whereas previously I’d been on the Full edition. I’ve corrected it now in my RSS reader, but it was an interesting couple of days.

tl;dr: I accidentally stopped reading international news and I was less stressed

Anyway: if you’re not already using my improved BBC News RSS feeds, they’re at: https://bbc-feeds.danq.dev

BBC News RSS… your way!

It turns out my series of efforts to improve the BBC News RSS feeds are more-popular than I thought. People keep asking for variants of them, and it’s probably time I stopped hosting the resulting feeds on my NAS (which does a good job, but it’s in a highly-kickable place right under my desk).

Screenshot of BBC News RSS Feeds (that don't suck!).
The new site isn’t pretty. But it works.

So I’ve launched BBC-Feeds.DanQ.dev. On a 20-minute schedule, it generates both UK and World editions of the BBC News feeds, filtered to remove iPlayer, Sounds, app “nudges”, duplicates, and other junk, and optionally with the sports news filtered out too.

The entire thing is open source under an ultra-permissive license, so you can run your own copy if you don’t want to use mine.

Enjoy!

BBC News RSS… with the sport?

There’s now a much, much better version of this. Go use that instead: bbc-feeds.danq.dev.

Earlier today, somebody called Allan commented on the latest in my series of several blog posts about how I mutilate manipulate the RSS feeds of BBC News to work around their (many, and increasingly so) various shortcomings, specifically:

  1. Their inclusion of non-news content such as plugs for iPlayer and their apps,
  2. Their repeating of identical news stories with marginally-different GUIDs, and
  3. All of the sports news, which I don’t care about one jot.

Well, it turns out that some people want #3: the sport. But still don’t want the other two.

FreshRSS screenshot with many unread items, but focussing on a feed called "BBC News (with sport)" and showing a story titled: 'How England Golf's yellow cards are tackling blight of slow play'
Some people actually want to read this crap, apparently.

I shan’t be subscribing to this RSS feed, and I can’t promise I’ll fix it if it gets broken. But if “without the crap, but with the sports” is the way you like your BBC News RSS feed, I’ve got you covered:

So there you go, Allan, and anybody in a similar position. I hope that fulfils your need for sports news… without the crap.

×

Guinness in the Bath

It’s been a long day of driving around Ireland, scrambling through forests, navigating to a hashpoint, exploring a medieval castle, dodging the rain, finding a series of geocaches, getting lost up a hill in the dark, and generally having a kickass time with one of my very favourite people on this earth: my mum.

And now it’s time for a long soak in a hot bath with a pint of the black stuff and my RSS reader for company. A perfect finish.

A pint of Guinness alongside a can, on a tiled bathroom shelf.

×

XPath Scraping AdamKoszary.co.uk

Adam Koszary – whom I worked alongside at the Bodleian – the social media specialist who brought the “absolute unit” meme to the masses, started blogging earlier (again?) this year. Yay!

But he’s completely neglected to put an RSS feed on hew new blog. Boo!

Dan, wearing a VR headset, sits in an office environment, watched by Adam.
People who saw Adam and I work together might have questioned the degree to which it counted as “work”, but that’s another story.

I’ve talked at length about how I use FreshRSS‘s “XPath Scraping” feature (for Bev’s blog, Far Side, Forward, new Far Side, and Vmail, among others), but earlier this week somebody left a comment to ask me more about how I test and debug my XPath scrapers. Given that I now need to add one for Adam’s blog, I’m in a wonderful position to walk you through it!

Setting up and debugging your FreshRSS XPath Scraper

Okay, so here’s Adam’s blog. I’ve checked, and there’s no RSS feed1, so it’s time to start planning my XPath Scraper. The first thing I want to do is to find some way of identifying the “posts” on the page. Sometimes people use solid, logical id="..." and class="..." attributes, but I’m going to need to use my browser’s “Inspect Element” tool to check:

Screenshot showing Inspect Element in use on Adam's blog.
If you’re really lucky, the site you’re scraping uses an established microformat like h-feed. No such luck here, though…

The next thing that’s worth checking is that the content you’re inspecting is delivered with the page, and not loaded later using JavaScript. FreshRSS’s XPath Scraper works with the raw HTML/XML that’s delivered to it; it doesn’t execute any JavaScript2, so I use “View Source” and quickly search to see that the content I’m looking for is there, too.

HTML source code showing id="posts" highlighted.
New developers are sometimes surprised to see how different View Source and Inspect Element’s output can be3. This looks pretty promising, though.
Now it’s time to try and write some XPath queries. Luckily, your browser is here to help! If you pop up your debug console, you’ll discover that you’re probably got a predefined function, $x(...), to which you can path a string containing an XPath query and get back a NodeList of the element.

First, I’ll try getting all of the links inside the #posts section by running $x( '//*[@id="posts"]//a' )  –

A browser's debug console executes $x('//*[@id="posts"]//a') , and gets 14 results.
Once you’ve run a query, you can expand the resulting array and hover over any element in it to see it highlighted on the page. This can be used to help check that you’ve found what you’re looking for (and nothing else).
In my first attempt, I discovered that I got not only all the posts… but also the “tags” at the top. That’s no good. Inspecting the URLs of each, I noticed that the post URLs all contained /posts/, so I filtered my query down to $x( '//*[@id="posts"]//a[contains(@href, "/posts/")]' ) which gave me the expected number of results. That gives me //*[@id="posts"]//a[contains(@href, "/posts/")] as the XPath query for “news items”:

FreshRSS XPath feed configuration page showing my new query in the appropriate field.
I like to add the rules I’ve learned to my FreshRSS configuration as I go along, to remind me what I still need to find.

Obviously, this link points to the full post, so that tells me I can put ./@href as the “item link” attribute in FreshRSS.

Next, it’s time to see what other metadata I can extract from each post to help FreshRSS along:

Inspecting the post titles shows that they’re <h3>s. Running $x( '//*[@id="posts"]//a[contains(@href, "/posts/")]//h3' ) gets them. Within FreshRSS, everything “within” a post is referenced relative to the post, so I convert this to descendant::h3 for my “XPath (relative to item) for Item Title:” attribute.

An XPath query identifying the titles of the posts.
I was pleased to see that Adam’s using a good accessible heading cascade. This also makes my XPathing easier!

Inspecting within the post summary content, it’s… not great for scraping. The elements class names don’t correspond to what the content is4: it looks like Adam’s using a utility class library5.

Everything within the <a> that we’ve found is wrapped in a <div class="flex-grow">. But within that, I can see that the date is directly inside a <p>, whereas the summary content is inside a <p> within a <div class="mb-2">. I don’t want my code to be too fragile, and I think it’s more-likely that Adam will change the class names than the structure, so I’ll tie my queries to the structure. That gives me descendant::div/p for the date and descendant::div/div/p for the “content”. All that remains is to tell FreshRSS that Adam’s using F j, Y as his date format (long month name, space, short day number, comma, space, long year number) so it knows how to parse those dates, and the feed’s good.

If it’s wrong and I need to change anything in FreshRSS, the “Reload Articles” button can be used to force it to re-load the most-recent X posts. Useful if you need to tweak things. In my case, I’ve also set the “Article CSS selector on original website” field to article so that the full post text can be pulled into my reader rather than having to visit the actual site. Then I’m done!

Adam's blog post "Content of the Week #7: 200 Creators" viewed in FreshRSS.
Yet another blog I can read entirely from my feed reader, despite the fact that it doesn’t offer a “feed”.

Takeaways

  • Use Inspect Element to find the elements you want to scrape for.
  • Use $x( ... ) to test your XPath expressions.
  • Remember that most of FreshRSS’s fields ask for expressions relative to the news item and adapt accordingly.
  • If you make a mistake, use “Reload Articles” to pull them again.

Footnotes

1 Boo again!

2 If you need a scraper than executes JavaScript, you need something more-sophisticated. I used to use my very own RSSey for this purpose but nowadays XPath Scraping is sufficient so I don’t bother any more, but RSSey might be a good starting point for you if you really need that kind of power!

3 If you’ve not had the chance to think about it before: View Source shows you the actual HTML code that was delivered from the web server to your browser. This then gets interpreted by the browser to generate the DOM, which might result in changes to it: for example, invalid elements might be removed, ambiguous markup will have an interpretation applied, and so on. The DOM might further change as a result of JavaScript code, browser plugins, and whatever else. When you Inspect Element, you’re looking at the DOM (represented “as if” it were HTML), not the actual underlying HTML

4 The date isn’t in a <time> element nor does it have a class like .post--date or similar.

5 I’ll spare you my thoughts on utility class libraries for now, but they’re… not positive. I can see why people use them, and I’ve even used them myself before… but I don’t think they’re a good thing.

× × × × × × ×

Vmail via FreshRSS

It’s time for… Dan Shares Yet Another FreshRSS XPath Scraping Recipe!

Vmail

I’m a huge fan of the XPath scraping feature of FreshRSS, my favourite feed reader (and one of the most important applications in my digital ecosystem). I’ve previously demonstrated how to use the feature to subscribe to Forward, reruns of The Far Side, and new The Far Side content, despite none of those sites having “official” feeds.

Signup form for VMail from Vole.WTF
Sure, I could have used my selfhosted OpenTrashMail server to convert email into RSS, but I figured XPath scraping would be more-elegant…

Vmail is cool. It’s vole.wtf’s (of ARCC etc. fame) community newsletter, and it’s as batshit crazy as you’d expect if you were to get the kinds of people who enjoy that site and asked them all to chip in on a newsletter.

Totes bonkers.

But email’s not how I like to consume this kind of media. So obviously, I scraped it.

Screenshot showing VMail subscription in FreshRSS
I’m not a monster: I want Vmail’s stats to be accurate. So I signed up with an unmonitored OpenTrashMail account as well. I just don’t read it (except for the confirmation link email). It actually took me a few attempts because there seems to be some kind of arbitrary maximum length validation on the signup form. But I got there in the end.

Recipe

Want to subscribe to Vmail using your own copy of FreshRSS? Here’s the settings you’re looking for –

  • Type of feed source: HTML + XPath (Web scraping)
  • XPath for finding news items: //table/tbody/tr
    It’s just a table with each row being a newsletter; simple!
  • XPath for item title: descendant::a
  • XPath for item content: .
  • XPath for item link (URL): descendant::a/@href
  • XPath for item date: descendant::td[1]
  • Custom date/time format: d M *y
    The dates are in a format that’s like 01 May ’24 – two-digit days with leading zeros, three-letter months, and a two-digit year preceded by a curly quote, separated by spaces. That curl quote screws up PHP’s date parser, so we have to give it a hint.
  • XPath for unique item ID: descendant::th
    Optional, but each issue’s got its own unique ID already anyway; we might as well use it!
  • Article CSS selector on original website: #vmail
    Optional, but recommended: this option lets you read the entire content of each newsletter without leaving FreshRSS.

So yeah, FreshRSS continues to be amazing. And lately it’s helped me keep on top of the amazing/crazy of vole.wtf too.

× ×

Somewhat-Effective Spam Filters

I’ve tried a variety of unusual strategies to combat email spam over the years.

Here are some of them (each rated in terms the geekiness of its implementation and its efficacy), in case you’d like to try any yourself. They’re all still in use in some form or another:

Spam filters

Geekiness: 1/10
Efficacy: 5/10

A colander filters spam email out of a stream of emails.

Your email provider or your email software probably provides some spam filters, and they’re probably pretty good. I use Proton‘s and, when I’m at my desk, Thunderbird‘s. Double-bagging your spam filter only slightly reduces the amount of spam that gets through, but increases your false-positive rate and some non-spam gets mis-filed.

A particular problem is people who email me for help after changing their name on FreeDeedPoll.org.uk, probably because they’re not only “new” unsolicited contacts to me but because by definition many of them have strange and unusual names (which is why they’re emailing me for help in the first place).

Frankly, spam filters are probably enough for many people. Spam filtering is in general much better today than it was a decade or two ago. But skim the other suggestions in case they’re of interest to you.

Unique email addresses

Geekiness: 3/10
Efficacy: 8/10

If you give a different email address to every service you deal with, then if one of them misuses it (starts spamming you, sells your data, gets hacked, whatever), you can just block that one address. All the addresses come to the same inbox, for your convenience. Using a catch-all means that you can come up with addresses on-the-fly: you can even fill a paper form with a unique email address associated with the company whose form it is.

On many email providers, including the ever-popular GMail, you can do this using plus-sign notation. But if you want to take your unique addresses to the next level and you have your own domain name (which you should), then you can simply redirect all email addresses on that domain to the same inbox. If Bob’s Building Supplies wants your email address, give them bobs@yourname.com, which works even if Bob’s website erroneously doesn’t accept email addresses with plus signs in them.

This method actually works for catching people misusing your details. On one occasion, I helped a band identify that their mailing list had been hacked. On another, I caught a dodgy entrepreneur who used the email address I gave to one of his businesses without my consent to send marketing information of a different one of his businesses. As a bonus, you can set up your filtering/tagging/whatever based on the incoming address, rather than the sender, for the most accurate finding, prioritisation, and blocking.

Emails to multiple email addresses reach the same inbox. Spam emails are blocked based on the addresses they're sent to.

Also, it makes it easy to have multiple accounts with any of those services that try to use the uniqueness of email addresses to prevent you from doing so. That’s great if, like me, you want to be in each of three different Facebook groups but don’t want to give Facebook any information (not even that you exist at the intersection of those groups).

Signed unique email addresses

Geekiness: 10/10
Efficacy: 2/10

Unique email addresses introduce two new issues: (1) if an attacker discovers that your Dreamwidth account has the email address dreamwidth@yourname.com, they can probably guess your LinkedIn email, and (2) attackers will shotgun “likely” addresses at your domain anyway, e.g. admin@yourname.com, management@yourname.com, etc., which can mean that when something gets through you get a dozen copies of it before your spam filter sits up and takes notice.

What if you could assign unique email addresses to companies but append a signature to each that verified that it was legitimate? I came up with a way to do this and implemented it as a spam filter, and made a mobile-friendly webapp to help generate the necessary signatures. Here’s what it looked like:

  1. The domain directs all emails at that domain to the same inbox.
  2. If the email address is on a pre-established list of valid addresses, that’s fine.
  3. Otherwise, the email address must match the form of:
    • A string (the company name), followed by
    • A hyphen, followed by
    • A hash generated using the mechanism described below, then
    • The @-sign and domain name as usual

The hashing algorithm is as follows: concatenate a secret password that only you know with a colon then the “company name” string, run it through SHA1, and truncate to the first eight characters. So if my password were swordfish1 and I were generating a password for Facebook, I’d go:

  1. SHA1 ( swordfish1 : facebook) [ 0 ... 8 ] = 977046ce
  2. Therefore, the email address is facebook-977046ce@myname.com
  3. If any character of that email address is modified, it becomes invalid, preventing an attacker from deriving your other email addresses from a single point (and making it hard to derive them given multiple points)

I implemented the code, but it soon became apparent that this was overkill and I was targeting the wrong behaviours. It was a fun exercise, but ultimately pointless. This is the one method on this page that I don’t still use.

Honeypots

Geekiness: 8/10
Efficacy: ?/10

Emails to multiple email addresses reach an inbox, but senders who reach a "honeypot" inbox are blocked from reaching the real inbox.

A honeypot is a “trap” email address. Anybody who emails it get aggressively marked as a spammer to help ensure that any other messages they send – even to valid email addresses – also get marked as spam.

I litter honeypots all over the place (you might find hidden email addresses on my web pages, along with text telling humans not to use them), but my biggest source of honeypots is formerly-valid unique addresses, or “guessed” catch-all addresses, which already attract spam or are otherwise compromised!

I couldn’t tell you how effective it is without looking at my spam filter’s logs, and since the most-effective of my filters is now outsourced to Proton, I don’t have easy access to that. But it certainly feels very satisfying on the occasions that I get to add a new address to the honeypot list.

Instant throwaways

Geekiness: 5/10
Efficacy: 6/10

OpenTrashmail is an excellent throwaway email server that you can deploy in seconds with Docker, point some MX records at, and be all set! A throwaway email server gives you an infinite number of unique email addresses, like other solutions described above, but with the benefit that you never have to see what gets sent to them.

Emails are delivered to an inbox and to a trash can, depending on the address they're sent to. The inbox subscribes to the trash can using RSS.

If you offer me a coupon in exchange for my email address, it’s a throwaway email address I’ll give you. I’ll make one up on the spot with one of my (several) trashmail domains at the end of it, like justgivemethedamncoupon@danstrashmailserver.com. I can just type that email address into OpenTrashmail to see what you sent me, but then I’ll never check it again so you can spam it to your heart’s content.

As a bonus, OpenTrashmail provides RSS feeds of inboxes, so I can subscribe to any email-based service using my feed reader, and then unsubscribe just as easily (without even having to tell the owner).

Summary

With the exception of whatever filters your provider or software comes with, most of these options aren’t suitable for regular folks. But you’re only a domain name (assuming you don’t have one already) away from being able to give unique email addresses to everybody you deal with, and that’s genuinely a game-changer all by itself and well worth considering, in my opinion.

× × × ×