XPath Scraping AdamKoszary.co.uk

Adam Koszary – whom I worked alongside at the Bodleian – the social media specialist who brought the “absolute unit” meme to the masses, started blogging earlier (again?) this year. Yay!

But he’s completely neglected to put an RSS feed on hew new blog. Boo!

Dan, wearing a VR headset, sits in an office environment, watched by Adam.
People who saw Adam and I work together might have questioned the degree to which it counted as “work”, but that’s another story.

I’ve talked at length about how I use FreshRSS‘s “XPath Scraping” feature (for Bev’s blog, Far Side, Forward, new Far Side, and Vmail, among others), but earlier this week somebody left a comment to ask me more about how I test and debug my XPath scrapers. Given that I now need to add one for Adam’s blog, I’m in a wonderful position to walk you through it!

Setting up and debugging your FreshRSS XPath Scraper

Okay, so here’s Adam’s blog. I’ve checked, and there’s no RSS feed1, so it’s time to start planning my XPath Scraper. The first thing I want to do is to find some way of identifying the “posts” on the page. Sometimes people use solid, logical id="..." and class="..." attributes, but I’m going to need to use my browser’s “Inspect Element” tool to check:

Screenshot showing Inspect Element in use on Adam's blog.
If you’re really lucky, the site you’re scraping uses an established microformat like h-feed. No such luck here, though…

The next thing that’s worth checking is that the content you’re inspecting is delivered with the page, and not loaded later using JavaScript. FreshRSS’s XPath Scraper works with the raw HTML/XML that’s delivered to it; it doesn’t execute any JavaScript2, so I use “View Source” and quickly search to see that the content I’m looking for is there, too.

HTML source code showing id="posts" highlighted.
New developers are sometimes surprised to see how different View Source and Inspect Element’s output can be3. This looks pretty promising, though.
Now it’s time to try and write some XPath queries. Luckily, your browser is here to help! If you pop up your debug console, you’ll discover that you’re probably got a predefined function, $x(...), to which you can path a string containing an XPath query and get back a NodeList of the element.

First, I’ll try getting all of the links inside the #posts section by running $x( '//*[@id="posts"]//a' )  –

A browser's debug console executes $x('//*[@id="posts"]//a') , and gets 14 results.
Once you’ve run a query, you can expand the resulting array and hover over any element in it to see it highlighted on the page. This can be used to help check that you’ve found what you’re looking for (and nothing else).
In my first attempt, I discovered that I got not only all the posts… but also the “tags” at the top. That’s no good. Inspecting the URLs of each, I noticed that the post URLs all contained /posts/, so I filtered my query down to $x( '//*[@id="posts"]//a[contains(@href, "/posts/")]' ) which gave me the expected number of results. That gives me //*[@id="posts"]//a[contains(@href, "/posts/")] as the XPath query for “news items”:
FreshRSS XPath feed configuration page showing my new query in the appropriate field.
I like to add the rules I’ve learned to my FreshRSS configuration as I go along, to remind me what I still need to find.

Obviously, this link points to the full post, so that tells me I can put ./@href as the “item link” attribute in FreshRSS.

Next, it’s time to see what other metadata I can extract from each post to help FreshRSS along:

Inspecting the post titles shows that they’re <h3>s. Running $x( '//*[@id="posts"]//a[contains(@href, "/posts/")]//h3' ) gets them. Within FreshRSS, everything “within” a post is referenced relative to the post, so I convert this to descendant::h3 for my “XPath (relative to item) for Item Title:” attribute.

An XPath query identifying the titles of the posts.
I was pleased to see that Adam’s using a good accessible heading cascade. This also makes my XPathing easier!

Inspecting within the post summary content, it’s… not great for scraping. The elements class names don’t correspond to what the content is4: it looks like Adam’s using a utility class library5.

Everything within the <a> that we’ve found is wrapped in a <div class="flex-grow">. But within that, I can see that the date is directly inside a <p>, whereas the summary content is inside a <p> within a <div class="mb-2">. I don’t want my code to be too fragile, and I think it’s more-likely that Adam will change the class names than the structure, so I’ll tie my queries to the structure. That gives me descendant::div/p for the date and descendant::div/div/p for the “content”. All that remains is to tell FreshRSS that Adam’s using F j, Y as his date format (long month name, space, short day number, comma, space, long year number) so it knows how to parse those dates, and the feed’s good.

If it’s wrong and I need to change anything in FreshRSS, the “Reload Articles” button can be used to force it to re-load the most-recent X posts. Useful if you need to tweak things. In my case, I’ve also set the “Article CSS selector on original website” field to article so that the full post text can be pulled into my reader rather than having to visit the actual site. Then I’m done!

Adam's blog post "Content of the Week #7: 200 Creators" viewed in FreshRSS.
Yet another blog I can read entirely from my feed reader, despite the fact that it doesn’t offer a “feed”.

Takeaways

  • Use Inspect Element to find the elements you want to scrape for.
  • Use $x( ... ) to test your XPath expressions.
  • Remember that most of FreshRSS’s fields ask for expressions relative to the news item and adapt accordingly.
  • If you make a mistake, use “Reload Articles” to pull them again.

Footnotes

1 Boo again!

2 If you need a scraper than executes JavaScript, you need something more-sophisticated. I used to use my very own RSSey for this purpose but nowadays XPath Scraping is sufficient so I don’t bother any more, but RSSey might be a good starting point for you if you really need that kind of power!

3 If you’ve not had the chance to think about it before: View Source shows you the actual HTML code that was delivered from the web server to your browser. This then gets interpreted by the browser to generate the DOM, which might result in changes to it: for example, invalid elements might be removed, ambiguous markup will have an interpretation applied, and so on. The DOM might further change as a result of JavaScript code, browser plugins, and whatever else. When you Inspect Element, you’re looking at the DOM (represented “as if” it were HTML), not the actual underlying HTML

4 The date isn’t in a <time> element nor does it have a class like .post--date or similar.

5 I’ll spare you my thoughts on utility class libraries for now, but they’re… not positive. I can see why people use them, and I’ve even used them myself before… but I don’t think they’re a good thing.

× × × × × × ×

Reactions

No time to comment? Send an emoji with just one click!

0 comments

    Reply here

    Your email address will not be published. Required fields are marked *

    Reply on your own site

    Reply elsewhere

    You can reply to this post on Mastodon (@blog@danq.me).

    Reply by email

    I'd love to hear what you think. Send an email to b24930@danq.me; be sure to let me know if you're happy for your comment to appear on the Web!