Did I mention recently that I love RSS? That it brings me great joy? That I start and finish almost every day in my feed reader? Probably.
I used to have a single minor niggle with the BBC News RSS feed: that it included sports news, which I didn’t care about. So I wrote a script that downloaded it, stripped sports news, and re-exported the feed for me to subscribe to. Magic.
[Image: RSS reader showing duplicate copies of the news story "Barbie 2? 'We'd love to,' says Warner Bros boss", and an entry from BBC Sounds.]
But lately – presumably as a result of technical changes at the Beeb’s side – this feed has found two fresh ways to annoy me:
- The feed now re-publishes a story if it gets re-promoted to the front page… but with a different <guid> (it appears to get a #0 after it when first published, a #1 the second time, and so on). In a typical day the feed reader might scoop up new stories about once an hour, and by the time I get to reading them the same exact story might appear in my reader multiple times. Ugh.
- They’ve started adding iPlayer and BBC Sounds content to the BBC News feed. I don’t follow BBC News in my feed reader because I want to watch or listen to things. If you do, that’s fine, but I don’t, and I’d rather filter this content out.
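To illustrate the first problem, here's a minimal sketch of why the suffixed GUIDs show up as duplicates, and how dropping the suffix collapses them. The article URLs are made up for the example, and I'm assuming a feed reader that compares GUIDs verbatim:

```ruby
# Hypothetical GUIDs as they might accumulate across successive fetches of the feed.
# The article URLs are invented for illustration; only the #0/#1 pattern is from the feed.
seen_guids = [
  'https://www.bbc.co.uk/news/example-story-1#0',
  'https://www.bbc.co.uk/news/example-story-1#1',
  'https://www.bbc.co.uk/news/example-story-2#0',
]

# A reader that compares GUIDs verbatim treats all three as distinct items.
distinct_as_published = seen_guids.uniq

# Dropping everything from the first '#' onwards leaves one GUID per story.
normalised = seen_guids.map { |guid| guid.sub(/#.*\z/, '') }.uniq

puts "As published: #{distinct_as_published.length} items"
puts "Normalised:   #{normalised.length} items"
```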
Luckily, I already have a recipe for improving this feed, thanks to my prior work. Let’s look at my newly-revised script (also available on GitHub):
#!/usr/bin/env ruby
require 'bundler/inline'

# Sample crontab:
# # Every 20 minutes, run the script and log the results
# */20 * * * * ~/bbc-news-rss-filter-sport-out.rb > ~/bbc-news-rss-filter-sport-out.log 2>&1

# Dependencies:
# * open-uri - load remote URL content easily
# * nokogiri - parse/filter XML
gemfile do
  source 'https://rubygems.org'
  gem 'nokogiri'
end
require 'open-uri'

# Regular expression describing the GUIDs to reject from the resulting RSS feed
# We want to drop everything from the "sport" section of the website, also any iPlayer/Sounds links
REJECT_GUIDS_MATCHING = /^https:\/\/www\.bbc\.co\.uk\/(sport|iplayer|sounds)\//

# Load and filter the original RSS
# (URI.open rather than open: Kernel#open no longer fetches URLs on Ruby 3.0+)
rss = Nokogiri::XML(URI.open('https://feeds.bbci.co.uk/news/rss.xml?edition=uk'))
rss.css('item').select{|item| item.css('guid').text =~ REJECT_GUIDS_MATCHING }.each(&:unlink)

# Strip the anchors off the <guid>s: BBC News "republishes" stories by using guids with #0, #1, #2 etc, which results in duplicates in feed readers
rss.css('guid').each{|g| g.content = g.content.gsub(/#.*$/, '') }

File.open( '/www/bbc-news-no-sport.xml', 'w' ){ |f| f.puts(rss.to_s) }
That revised script removes from the feed anything whose <guid> suggests it’s sports news or from BBC Sounds or iPlayer, and also strips any “anchor” part of the <guid> before re-exporting the feed. Much better. (Strictly speaking, this can result in a technically-invalid feed by introducing duplicates, but your feed reader oughta be smart enough to compensate for and ignore that: mine certainly is!)
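If you'd like to see both steps at work without hitting the live feed, here's a rough sketch that runs the same two rules over a tiny inline sample. It's not the script itself: it swaps Nokogiri for REXML (which comes with standard Ruby installs, as a bundled gem since Ruby 3.0) so there's nothing to install, and the sample feed, its URLs, and the filter_feed helper are all invented for the example.

```ruby
require 'rexml/document'

# Same rejection rule as the script: sport, iPlayer and Sounds GUIDs.
REJECT_GUIDS_MATCHING = %r{^https://www\.bbc\.co\.uk/(sport|iplayer|sounds)/}

# Made-up sample feed; the real one has many more elements per item.
SAMPLE_RSS = <<~XML
  <rss version="2.0">
    <channel>
      <title>Sample feed</title>
      <item><title>News story</title><guid>https://www.bbc.co.uk/news/example-1#0</guid></item>
      <item><title>News story</title><guid>https://www.bbc.co.uk/news/example-1#1</guid></item>
      <item><title>Match report</title><guid>https://www.bbc.co.uk/sport/example-2#0</guid></item>
      <item><title>Podcast</title><guid>https://www.bbc.co.uk/sounds/play/example-3#0</guid></item>
    </channel>
  </rss>
XML

# Hypothetical helper: drops rejected items, then strips the "#n" suffix
# from the GUIDs of the items that remain.
def filter_feed(xml)
  doc = REXML::Document.new(xml)
  # XPath.match returns a plain array, so removing items while looping is safe.
  REXML::XPath.match(doc, '//item').each do |item|
    guid = item.elements['guid']
    next if guid.nil?

    if guid.text.to_s =~ REJECT_GUIDS_MATCHING
      item.remove
    else
      guid.text = guid.text.to_s.sub(/#.*\z/, '')
    end
  end
  doc
end

filtered = filter_feed(SAMPLE_RSS)
remaining_guids = REXML::XPath.match(filtered, '//item/guid').map(&:text)
puts remaining_guids
```

The two surviving items end up with identical GUIDs, which is exactly the "technically-invalid but harmless" duplication mentioned above.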
You’re free to take and adapt the script to your own needs, or – if you don’t mind being tied to my opinions about what should be in BBC News’ RSS feed – just subscribe to my copy at: https://fox.q-t-a.uk/bbc-news-no-sport.xml