I love RSS, but it’s a minor niggle for me that if I subscribe to any of the BBC News RSS feeds I invariably get all the sports news, too. Which’d be fine if I gave even the slightest care about the world of sports, but I don’t.
[Image: Sports on the BBC News site]
It only takes a couple of seconds to skim past the sports stories that clog up my feed reader, but because I like to scratch my own itches, I came up with a solution. It’s more-heavyweight perhaps than it needs to be, but it does the job. If you’re just looking for a BBC News (UK) feed but with sports filtered out you’re welcome to share mine: https://f001.backblazeb2.com/file/Dan–Q–Public/bbc-news-nosport.rss https://fox.q-t-a.uk/bbc-news-no-sport.xml.
If you’d like to see how I did it so you can host it yourself or adapt it for some similar purpose, the code’s below or on GitHub:
    #!/usr/bin/env ruby
    #
    # Sample crontab:
    # # At 41 minutes past each hour, run the script and log the results
    # 41 * * * * ~/bbc-news-rss-filter-sport-out.rb > ~/bbc-news-rss-filter-sport-out.log 2>&1

    # Dependencies:
    # * open-uri - load remote URL content easily
    # * nokogiri - parse/filter XML
    # * b2       - command line tools, described below
    require 'bundler/inline'
    gemfile do
      source 'https://rubygems.org'
      gem 'nokogiri'
    end
    require 'open-uri'
    require 'tempfile'

    # Regular expression describing the GUIDs to reject from the resulting RSS feed
    # We want to drop everything from the "sport" section of the website
    REJECT_GUIDS_MATCHING = /^https:\/\/www\.bbc\.co\.uk\/sport\//

    # Assumption: you're set up with a Backblaze B2 account with a bucket to which
    # you'd like to upload the resulting RSS file, and you've configured the 'b2'
    # command-line tool (https://www.backblaze.com/b2/docs/b2_authorize_account.html)
    B2_BUCKET = 'YOUR-BUCKET-NAME-GOES-HERE'
    B2_FILENAME = 'bbc-news-nosport.rss'

    # Load and filter the original RSS
    # (URI.open rather than a bare open: Ruby 3.0+ no longer opens URLs via Kernel#open)
    rss = Nokogiri::XML(URI.open('https://feeds.bbci.co.uk/news/rss.xml?edition=uk'))
    rss.css('item').select{|item| item.css('guid').text =~ REJECT_GUIDS_MATCHING }.each(&:unlink)

    begin
      # Output resulting filtered RSS into a temporary file
      temp_file = Tempfile.new
      temp_file.write(rss.to_s)
      temp_file.close

      # Upload filtered RSS to a Backblaze B2 bucket
      result = `b2 upload_file --noProgress --contentType application/rss+xml #{B2_BUCKET} #{temp_file.path} #{B2_FILENAME}`
      puts Time.now
      puts result.split("\n").select{|line| line =~ /^URL by file name:/}.join("\n")
    ensure
      # Tidy up after ourselves by ensuring we delete the temporary file
      temp_file.close
      temp_file.unlink
    end
bbc-news-rss-filter-sport-out.rb
When executed, this Ruby code:
- Fetches the original BBC News (UK) RSS feed and parses it as XML using Nokogiri
- Filters it to remove all entries whose GUID matches a particular regular expression (removing all of those from the “sport” section of the site)
- Outputs the resulting feed into a temporary file
- Uploads the temporary file to a bucket in Backblaze’s “B2” repository (think: a better-value competitor to S3); the bucket I’m using is publicly-accessible so anybody’s RSS reader can subscribe to the feed
I like the versatility of the approach I’ve used here and its ability to perform arbitrary mutations on the feed. And I’m a big fan of Nokogiri. In some ways, this could be considered a lower-impact, less real-time version of my tool RSSey. Aside from the fact that it won’t (easily) handle websites that require JavaScript, this approach could probably be used in exactly the same ways as RSSey, and with significantly less set-up: I might look into whether its functionality can be made more-generic so I can start using it in more places.