XPath Scraping AdamKoszary.co.uk

Adam Koszary – whom I worked alongside at the Bodleian – the social media specialist who brought the “absolute unit” meme to the masses, started blogging earlier (again?) this year. Yay!

But he’s completely neglected to put an RSS feed on hew new blog. Boo!

Dan, wearing a VR headset, sits in an office environment, watched by Adam.
People who saw Adam and I work together might have questioned the degree to which it counted as “work”, but that’s another story.

I’ve talked at length about how I use FreshRSS‘s “XPath Scraping” feature (for Bev’s blog, Far Side, Forward, new Far Side, and Vmail, among others), but earlier this week somebody left a comment to ask me more about how I test and debug my XPath scrapers. Given that I now need to add one for Adam’s blog, I’m in a wonderful position to walk you through it!

Setting up and debugging your FreshRSS XPath Scraper

Okay, so here’s Adam’s blog. I’ve checked, and there’s no RSS feed1, so it’s time to start planning my XPath Scraper. The first thing I want to do is to find some way of identifying the “posts” on the page. Sometimes people use solid, logical id="..." and class="..." attributes, but I’m going to need to use my browser’s “Inspect Element” tool to check:

Screenshot showing Inspect Element in use on Adam's blog.
If you’re really lucky, the site you’re scraping uses an established microformat like h-feed. No such luck here, though…

The next thing that’s worth checking is that the content you’re inspecting is delivered with the page, and not loaded later using JavaScript. FreshRSS’s XPath Scraper works with the raw HTML/XML that’s delivered to it; it doesn’t execute any JavaScript2, so I use “View Source” and quickly search to see that the content I’m looking for is there, too.

HTML source code showing id="posts" highlighted.
New developers are sometimes surprised to see how different View Source and Inspect Element’s output can be3. This looks pretty promising, though.
Now it’s time to try and write some XPath queries. Luckily, your browser is here to help! If you pop up your debug console, you’ll discover that you’re probably got a predefined function, $x(...), to which you can path a string containing an XPath query and get back a NodeList of the element.

First, I’ll try getting all of the links inside the #posts section by running $x( '//*[@id="posts"]//a' )  –

A browser's debug console executes $x('//*[@id="posts"]//a') , and gets 14 results.
Once you’ve run a query, you can expand the resulting array and hover over any element in it to see it highlighted on the page. This can be used to help check that you’ve found what you’re looking for (and nothing else).
In my first attempt, I discovered that I got not only all the posts… but also the “tags” at the top. That’s no good. Inspecting the URLs of each, I noticed that the post URLs all contained /posts/, so I filtered my query down to $x( '//*[@id="posts"]//a[contains(@href, "/posts/")]' ) which gave me the expected number of results. That gives me //*[@id="posts"]//a[contains(@href, "/posts/")] as the XPath query for “news items”:
FreshRSS XPath feed configuration page showing my new query in the appropriate field.
I like to add the rules I’ve learned to my FreshRSS configuration as I go along, to remind me what I still need to find.

Obviously, this link points to the full post, so that tells me I can put ./@href as the “item link” attribute in FreshRSS.

Next, it’s time to see what other metadata I can extract from each post to help FreshRSS along:

Inspecting the post titles shows that they’re <h3>s. Running $x( '//*[@id="posts"]//a[contains(@href, "/posts/")]//h3' ) gets them. Within FreshRSS, everything “within” a post is referenced relative to the post, so I convert this to descendant::h3 for my “XPath (relative to item) for Item Title:” attribute.

An XPath query identifying the titles of the posts.
I was pleased to see that Adam’s using a good accessible heading cascade. This also makes my XPathing easier!

Inspecting within the post summary content, it’s… not great for scraping. The elements class names don’t correspond to what the content is4: it looks like Adam’s using a utility class library5.

Everything within the <a> that we’ve found is wrapped in a <div class="flex-grow">. But within that, I can see that the date is directly inside a <p>, whereas the summary content is inside a <p> within a <div class="mb-2">. I don’t want my code to be too fragile, and I think it’s more-likely that Adam will change the class names than the structure, so I’ll tie my queries to the structure. That gives me descendant::div/p for the date and descendant::div/div/p for the “content”. All that remains is to tell FreshRSS that Adam’s using F j, Y as his date format (long month name, space, short day number, comma, space, long year number) so it knows how to parse those dates, and the feed’s good.

If it’s wrong and I need to change anything in FreshRSS, the “Reload Articles” button can be used to force it to re-load the most-recent X posts. Useful if you need to tweak things. In my case, I’ve also set the “Article CSS selector on original website” field to article so that the full post text can be pulled into my reader rather than having to visit the actual site. Then I’m done!

Adam's blog post "Content of the Week #7: 200 Creators" viewed in FreshRSS.
Yet another blog I can read entirely from my feed reader, despite the fact that it doesn’t offer a “feed”.

Takeaways

  • Use Inspect Element to find the elements you want to scrape for.
  • Use $x( ... ) to test your XPath expressions.
  • Remember that most of FreshRSS’s fields ask for expressions relative to the news item and adapt accordingly.
  • If you make a mistake, use “Reload Articles” to pull them again.

Footnotes

1 Boo again!

2 If you need a scraper than executes JavaScript, you need something more-sophisticated. I used to use my very own RSSey for this purpose but nowadays XPath Scraping is sufficient so I don’t bother any more, but RSSey might be a good starting point for you if you really need that kind of power!

3 If you’ve not had the chance to think about it before: View Source shows you the actual HTML code that was delivered from the web server to your browser. This then gets interpreted by the browser to generate the DOM, which might result in changes to it: for example, invalid elements might be removed, ambiguous markup will have an interpretation applied, and so on. The DOM might further change as a result of JavaScript code, browser plugins, and whatever else. When you Inspect Element, you’re looking at the DOM (represented “as if” it were HTML), not the actual underlying HTML

4 The date isn’t in a <time> element nor does it have a class like .post--date or similar.

5 I’ll spare you my thoughts on utility class libraries for now, but they’re… not positive. I can see why people use them, and I’ve even used them myself before… but I don’t think they’re a good thing.

× × × × × × ×

Quesapizza Lunch

After a morning of optimising a nonprofit’s reverse proxy configuration, I feel like I’ve earned my lunch! Four cheese, mushroom and jalapeño quesapizzas, mmm…

Gas stovetop.a frying pan contains a tortilla wrap topped with tomato sauce, cheese, mushrooms, and jalapeños. Beside its a plate containing a completed quesapizza: two crispy tortilla wraps sandwiching their contents.

×

Build Colors from Colors with CSS Relative Color Syntax

This is a repost promoting content originally published elsewhere. See more things Dan's reposted.

The feature here is that you can take a color you already have and manipulate its components. Which things you can change vary by the color space you choose, so for an RGB color you can change the red, green, blue, and alpha channels, for an HSL color you can change hue, saturation, lightness, and alpha, and for my beloved OKLCH you can change lightness, chroma, hue, and yes, opacity.

The syntax if you wanted to use this and not change anything about the color is:

oklch(from var(--color) l c h / 1)

But of course you can change each component, either swapping them entirely as with this which sets the lightness to 20%:

oklch(from var(--color) 20% c h / 1)

This is really something. I was aware that new colour functions were coming to CSS but kinda dropped the ball and didn’t notice that oklch(...) is, for the most part, usable in any modern browser. That’s a huge deal!

The OKLCH colour model makes more sense than RGB, covers a wider spectrum than HSL, and – on screens that support it – describes a (much) larger spectrum, providing access to a wider array of colours (with sensible fallbacks where they’re not supported). But more than that, the oklch(...) function provides good colour adaptation.

If you’ve ever used e.g. Sass’s darken(...) function and been disappointed when it seems to have a bigger impact on some colours than others… that’s because simple mathematical colour models don’t accurately reflect the complexities of human vision: some colours just look brighter, to us, thanks quirks of biochemistry, psychology, and evolution!

This colour vision curve feels to me a little like how pianos aren’t always tuned to equal-temper – i.e. how the maths of harmonics says that should be – but are instead tuned so that the lowest notes are tuned slightly flat and the highest notes slightly sharp to compensate for inharmonicity resulting from the varying stiffness of the strings. This means that their taut length alone doesn’t dictate what note humans think they hear: my understanding is that at these extremes, the difference in the way the wave propagates within the string results in an inharmonic overtone that makes these notes sound out-of-tune with the rest of the instrument unless compensated for with careful off-tuning! Humans experience something other than what the simple maths predicts, and so we compensate for it! (The quirk isn’t unique to the piano, but it’s most-obvious in plucked or struck strings, rather than in bowed strings, and for instruments with a wide range, of which a piano is of course both!)

OKLCH is like that. And with it as a model (and a quick calc(...) function), you can tell your CSS “make this colour 20% lighter” and get something that, for most humans, will actually look “20% lighter”, regardless of the initial hue. That’s cool.

I spent way too long playing with this colour picker while I understood this concept. And now I want to use it everywhere!

My Car is Sad

All! The! Warning! Lights! By the time a car needs to pahinate the error messages it wants to display, something has definitely gone wrong. 😬

Car dashboard with seven different warning lights on, plus a message advising that ABS and ESC are not working.

×

Go back to bed!

Things my children have gotten out of bed to say to me tonight:

  • I don’t want to go to school tomorrow
  • I can’t find [name of toy]
  • I want [name of toy I lent to my sibling] back
  • if I’m ill, I don’t have to go to school tomorrow, right?
  • I can’t sleep
  • I might be ill: I don’t think I should go to school tomorrow
  • I want a hot water bottle
  • I’m too hot
  • I’ve lost my hot water bottle
  • I spilt my water1
  • I went to the toilet because I thought I was going to throw up but I didn’t but I think I’m too ill to go to school tomorrow
  • my book is wet
  • I forgot to brush my teeth
  • I don’t like these pyjamas
  • I still can’t sleep

Footnotes

1 it later turned out to have been spilled on an electrical extension socket! 😱

Note #24906

As the kids grow older… someday our final soft play session – something we used to do all the time, and now do only rarely – will be in the past.

A mug of coffee held in front of a view of a multicoloured soft playground.

But for now, at least, it remains a chaotic way to tire them out on a morning!

×

Sabbatical Magic

A couple of weeks ago, I kicked off my first sabbatical since starting at Automattic a little over five years ago1.

Dan sits in front of two laptops (one of which shows a photo of an echidna for some reason), in a meeting room full of casually-dressed volunteers.
The first weekend of my sabbatical might have set the tone for a lot of the charity hacking that will follow, being dominated by a Three Rings volunteering weekend.

The first fortnight of my sabbatical has consisted of:

  1. Three Rings CIC’s AGM weekend and lots of planning for the future of the organisation and how we make it a better place to volunteer, and better value for our charity users,
  2. building a first draft of Three Rings’ new server architecture, which turns out to mostly work but still needs some energy thrown at it,
  3. a geohashing expedition with the dog, and
  4. a family holiday to Catalonia, Spain.
Dan, Ruth, and JTA with their children and a tour guide called Julie, enjoying churros in a Barcelona cafe.
You’d be amazed how many churros these children can put away.

The trip to Spain followed a model for European family breaks that we first tried in Paris last year2, but was extended to give us a feel for more of the region than a simple city break would. Ultimately, we ended up in three separate locations:

  1. Barcelona, where we stayed in a wonderful skyscraper hotel with fantastic breakfasts and, after I was able to get enough sleep, explored the obvious touristy bits of the city (e.g. la Sagrada Família3 and other Gaudían architecture, the chocolate museum, the fort at Montjuic, and because it’s me, of course, a widely varied handful of geocaches).
  2. The PortAventura World theme park, whose accommodation was certainly a gear shift after the 5-star hotel we’d come from4 but whose rides kept us and the kids delighted for a couple of days (Shambhala was a particular hit with the eldest kid and me).
  3. A villa in el Vilosell – a village of only 190 people – at which the kids mostly played in the outdoor pool (despite the sometimes pouring rain) but we did get the chance to explore the local area a little. Also, of course, some geocaching: some local caches are 1-2 years old and yet had so few finds that I was able to be only the tenth or even just the third person to sign the logbooks!
Dan and the kids atop the remains of a castle tower.
All that remains of the Castell del Vilosell is part of a single tower, but it affords excellent views over the rest of the village as well as being home to a wonderfully-placed geocache.

I’d known – planned – that my sabbatical would involve a little travel. But it wasn’t until we began to approach the end of this holiday that I noticed a difference that a holiday on sabbatical introduces, compared to any other holiday I’ve taken during my adult life…

Perhaps because of the roles I’ve been appointed to – or maybe as a result of my personality – I’ve typically found that my enjoyment of the last day or two of a week-long trip are marred somewhat by intrusive thoughts of the work week to follow.

Dan sits at a laptop in a hotel bar, a view of Barcelona out of the window behind him, a beer bottle alongside him.
I’m not saying that I didn’t write code while on holiday. I totally did, and I open-sourced it too. But programming feels different when your paycheque doesn’t depend on it.

If I’m back to my normal day job on Monday, then by Saturday I’m already thinking about what I’ll need to be working on (in my case, it’s usually whatever I left unfinished right before I left), contemplating logging-in to work to check my email or Slack, and so on5.

But this weekend, that wasn’t even an option. I’ve consciously and deliberately cut myself off from my usual channels of work communication, and I’ve been very disciplined about not turning any of them back on. And even if I did… my team aren’t expecting me to sign into work for about another 11 weeks anyway!

Dan, standing in an airport departure lounge, mimes "mind blown" to the camera.
🤯🤯🤯

Monday and Tuesday are going to mostly be split between looking after the children, and voluntary work for Three Rings (gotta fix that new server architecture!). Probably. Wednesday? Who knows.

That’s my first taste of the magic of a sabbatical, I think. The observation that it’s possible to unplug from my work life and, y’know, not start thinking about it right away again.

Maybe I can use this as a vehicle to a more healthy work/life balance next year.

Footnotes

1 A sabbatical is a perk offered to Automatticians giving them three months off (with full pay and benefits) after each five years of work. Mine coincidentally came hot on the tail of my last meetup and soon after a whole lot of drama and a major shake-up, so it was a very welcome time to take a break… although of course it’s been impossible to completely detach from bits of the drama that have spilled out onto the open Web!

2 I didn’t get around to writing about Paris, but I did write about how the hotel we stayed at introduced our eldest, and by proxy re-introduced me, to Wonder Boy, ultimately leading to me building an arcade cabinet on which I finally, beat the game, 35 years after first playing it.

3 Whose construction has come on a lot since the last time I toured inside it.

4 Although alcohol helped with that.

5 I’m fully aware that this is a symptom of poor work/life balance, but I’ve got two decades of ingrained bad habits working against me now; don’t expect me to change overnight!

× × × × ×

3D Workers Island

This is a repost promoting content originally published elsewhere. See more things Dan's reposted.

Fake screenshot of Internet Explorer 6 showing 3dwiscr.com/what.html, a web page about a freeware screensaver.

If you’ve come across Tony Domenico’s work before it’s probably as a result of web horror video series Petscop.

3D Workers Island… isn’t quite like that (though quick content warning: it does vaugely allude to child domestic abuse). It’s got that kind of alternative history/”found footage webpage” feel to it that I enjoyed so much about the Basilisk collection. It’s beautifully and carefully made in a way that brings its world to life, and while I found the overall story slightly… incomplete?… I enjoyed the application of its medium enough to make up for it.

Best on desktop, but tolerable on a large mobile in landscape mode. Click each “screenshot” to advance.

Bluesky and enshittification

This is a repost promoting content originally published elsewhere. See more things Dan's reposted.

Any system where users can leave without pain is a system whose owners have high switching costs and whose users have none. An owner who makes a bad call – like removing the block function say, or opting every user into AI training – will lose a lot of users. Not just those users who price these downgrades highly enough that they outweigh the costs of leaving the service. If leaving the service is free, then tormenting your users in this way will visit in swift and devastating pain upon you.

There’s a name for this dynamic, from the world of behavioral economics. It’s called a “Ulysses Pact.” It’s named for the ancient hacker Ulysses, who ignored the normal protocol for sailing through the sirens’ sea. While normie sailors resisted the sirens’ song by filling their ears with wax, Ulysses instead had himself lashed to the mast, so that he could hear the sirens’ song, but could not be tempted into leaping into the sea, to be drowned by the sirens.

Whenever you take a measure during a moment of strength that guards against your own future self’s weakness, you enter into a Ulysses Pact – think throwing away the Oreos when you start your diet.

Wise words from Cory about why he isn’t on Bluesky, which somewhat echo my own experience. If you’ve had the experience in recent memory of abandoning an enshittified Twitter (and if you didn’t yet… why the fuck not?), TikTok, or let’s face it Reddit… and you’ve looked instead to services like Bluesky or arguably Threads… then you haven’t learned your lesson at all.

Freedom to exit is fundamental, and I’m a big fan of systems with a built-in Ulysses Pact. In non-social or unidirectionally-social software it’s sufficient for the tools to be open source: this allows me to host a copy myself if a hosted version falls to enshittification. But for bidirectional social networks, it’s also necessary for them to be federated, so that I’m not disadvantaged by choosing to drop any particular provider in favour of another or my own.

Bluesky keeps promising a proper federation model, but it’s not there yet. And I’m steering clear until it is.

I suppose I also enjoyed this post of Cory’s because it helped remind me of where I myself am failing to apply the Ulysses Pact. Right now, Three Rings is highly-centralised, and while I and everybody else involved with it know our exit strategy should the project have to fold (open source it, help charities migrate to their own instances, etc.) right now that plan is less “tie ourselves to the mast” than it is “trust one another to grab us if we go chasing sirens”. We probably ought to fix that.

Dan Q found GCAAD6C El pont sobre el riu #14 Set a La Pobla Cérvoles

This checkin to GCAAD6C El pont sobre el riu #14 Set a La Pobla Cérvoles reflects a geocaching.com log entry. See more of Dan's cache logs.

My family and I are staying in El Vilosell, so this morning I borrowed a bike to take the trail over the hill from there to here. The ride was beautiful, and the easy downhill rides between cliffs and terraces more than made up for the tiring and bumpy sections of uphill pedalling. At the highest point, I met a fox, on his way to bed I guess as the sun crested the hilltops.

Despite the recent heavy rain the Set riverbed was almost dry: just a trickle of a stream. From the description, I was initially worried that the cache might be underneath, which looked challenging, but a peep at the hint reassured me that there were more-likely hiding spots.

A little finger-work and the cache was in hand. Nice spot! That’s when I discovered that there was a hole in my pocket and my pen had escaped! Oh no! I hope a photolog will be sufficient to show that I found this cache.

Grey shorts with a hole in the pocket.

TFTC/GPC. Greetings from Oxfordshire, UK.

Dan, holding a small geocache logbook. A bike and a road bridge can be seen in the background.

× ×

Note #24882

Okay, is every company using AI to fuck up their mailing lists, now?

For the second time this week I’ve received an email from a company whom I’d explicitly demanded cease processing my PII. This time, in February they outright told me that they’d deleted my data and sent screenshots to “prove” it (which was already after they’d failed to unsubscribe me from their mailing lists in the first place).

Dan Q found GCAA274 Garrigues #23 – El Vilosell

This checkin to GCAA274 Garrigues #23 - El Vilosell reflects a geocaching.com log entry. See more of Dan's cache logs.

The geokids and I are staying nearby and came out for a walk this morning to discover this under-appreciated cache. What an amazing location and such a great view! We searched many “obvious” locations without luck, then translated some logs to get a clue. We should have checked the attributes! A little danger later and the cache was in hand. SL, TFTC/GPC! FP awarded – thanks so much for bringing us here. Greetings from Oxfordshire, UK!

Dan and two kids look excited atop a castle in rural Catalonia.

×

The Time Travel Movie That Doesn’t Move

This is a repost promoting content originally published elsewhere. See more things Dan's reposted.

When I saw the title of this piece by The Nerdwriter pop up in my RSS reader, the first words that grabbed my attention were “time travel movie”. I’ve a bit of a thing for time travel stories in any medium, and I love a good time travel movie1. Could I be about to be introduced to one I’m not familiar with, I wondered?

Before the thumbnail loaded2, I processed the rest of the title: the movie doesn’t move. At first my brain had assumed that this was a reference to the story spanning time but not space, but now suddenly it clicked:

We’re talking about La Jetée, aren’t we?

Like many people (outside of film students), I imagine, I first came across La Jetée after seeing it mentioned in the credits of Twelve Monkeys, which adapts its storyline in several ways. And like most people who then went on to see it, I imagine, I was moved by that unforgettable experience – there’s nothing quite like it in the history of film (if we’re to call it a film, that is: its creator famously doesn’t).

Anyway: Nerdwriter1’s take on it doesn’t say anything that hasn’t already been said, but it’s a beautiful introduction to interpreting this fantastic short film and it’s highly-accessible whether or not you’ve seen La Jetée3. Give his video essay a watch.

Footnotes

1 Okay, let’s be honest, my feelings go deeper than that. Time travel movies are, for me, like pizza: I love a good time travel movie, but I’ll also happily enjoy a pretty trashy time travel movie too.

2 Right now I’m in a rural farm building surrounded by olive groves in an out-of-the-way bit of Spain, and my Internet access isn’t always the fastest. D’ya remember how sometimes Web pages used to load the text and then you’d wait while the images loaded? They still do that, here.

3 There’s spoilers, but by the time a film is 60 years old, I think that’s fair game, right?

Note #24871

Just received an unsolicited marketing email from a company that I asked to remove me from their systems eleven and a half years ago (!).

I haven’t heard from them since then.

The marketing email is sent using a platform that “uses AI to help you seamlessly connect with your customers”. By which I assume the platform means “we’ll crawl your email history to find anybody we can possibly spam”.

Just gonna go grab my GDPR/DPA2018 beating-stick. This is gonna be a fun one.