EEBO-TCP Hackathon

Last month I got the opportunity to attend the EEBO-TCP Hackfest, hosted in the (then still very-much under construction) Weston Library at my workplace. I’ve done a couple of hackathons and similar get-togethers before, but this one was somewhat different in that it was unmistakably geared towards a different kind of geek than the technology-minded folks that I usually see at these things. People like me, with a computer science background, were remarkably in the minority.

Dan Q in Blackwell Hall, at the Weston Library.
Me in the Weston Library (still under construction, as evidenced by the scaffolding in the background).

Instead, this particular hack event attracted a great number of folks from the humanities end of the spectrum. Which is understandable, given its theme: the Early English Books Online Text Creation Partnership (EEBO-TCP) is an effort to digitise and make available in marked-up, machine-readable text formats a huge corpus of English-language books printed between 1475 and 1700. So: a little over three centuries of work including both household names (like Shakespeare, Galileo, Chaucer, Newton, Locke, and Hobbes) and an enormous number of others that you’ll never have heard of.

Dan Q talks to academic Stephen Gregg
After an introduction to the concept and the material, attendees engaged in a speed-networking event to share their thoughts prior to pitching their ideas.

The hackday event was scheduled to coincide with and celebrate the release of the first 25,000 texts into the public domain, and attendees were challenged to come up with ways to use the newly-available data in any way they liked. As is common with any kind of hackathon, many of the attendees had come with their own ideas half-baked already, but as for me: I had no idea what I’d end up doing! I’m not particularly familiar with the books of the 15th through 17th centuries and I’d never looked at the way in which the digitised texts had been encoded. In short: I knew nothing.

Dan Q and Liz McCarthy listen as other attendees pitch their ideas.
The ideas pitch session quickly showed some overlap between different project ideas, and teams were split and reformed a few times as people found the best places for themselves.

Instead, I’d thought: there’ll be people here who need a geek. A major part of a lot of the freelance work I end up doing (and a lesser part of my work at the Bodleian, from time to time) involves manipulating and mining data from disparate sources, and it seemed to me that these kinds of skills would be useful for a variety of different conceivable projects.

Dan Q explains what the spreadsheet he's produced 'means'.
XML may have been our interchange format, but everything fell into Excel in the end for speedy management even by less-technical team members.

I paired up with a chap called Stephen Gregg, a lecturer in 18th century literature from Bath Spa University. His idea was to use this newly-open data to explore the frequency (and the change in frequency over the centuries) of particular structural features in early printed fiction: features like chapters, illustrations, dedications, notes to the reader, encomia, and so on). This proved to be a perfect task for us to pair-up on, because he had the domain knowledge to ask meaningful questions, and I had the the technical knowledge to write software that could extract the answers from the data. We shared our table with another pair, who had technically-similar goals – looking at the change in the use of features like lists and tables (spoiler: lists were going out of fashion, tables were coming in, during the 17th century) in alchemical textbooks – and ultimately I was able to pass on the software tools I’d written to them to adapt for their purposes, too.

Dan Q with two academic-minded humanities folks, talking about their hackathon projects.
A quick meeting on the relative importance of ‘chapters’ as a concept in 16th century literature. Half of the words that the academics are saying go over my head, but I’m formulating XPath queries in my head while I wait.

And here’s where I made a discovery: the folks I was working with (and presumably academics of the humanities in general) have no idea quite how powerful data mining tools could be in giving them new opportunities for research and analysis. Within two hours we were getting real results from our queries and were making amendments and refinements in our questions and trying again. Within a further two hours we’d exhausted our original questions and, while the others were writing-up their findings in an attractive way, I was beginning to look at how the structural differences between fiction and non-fiction might be usable as a training data set for an artificial intelligence that could learn to differentiate between the two, providing yet more value from the dataset. And all the while, my teammates – who’d been used to looking at a single book at a time – were amazed by the possibilities we’d uncovered for training computers to do simple tasks while reading thousands at once.

Laptop showing a map of the area around Old St. Paul's Cathedral.
The area around Old St. Paul’s Cathedral was the place to be if you were a 16th century hipster looking for a new book.

Elsewhere at the hackathon, one group was trying to simulate the view of the shelves of booksellers around the old St. Paul’s Cathedral, another looked at the change in the popularity of colour and fashion-related words over the period (especially challenging towards the beginning of the timeline, where spelling of colours was less-standardised than towards the end), and a third came up with ways to make old playscripts accessible to modern performers.

A graph showing the frequency of colour-related words in English-language books printed over three centuries.
Aside from an increase in the relative frequency of the use of colour words to describe yellow things, there’s not much to say about this graph.

At the end of the session we presented our findings – by which I mean, Stephen explained what they meant – and talked about the technology and its potential future impact – by which I mean, I said what we’d like to allow others to do with it, if they’re so-inclined. And I explained how I’d come to learn over the course of the day what the word encomium meant.

Dan Q presents findings in Excel.
Presenting our findings in amazing technicolour Excel.

My personal favourite contribution from the event was by Sarah Cole, who adapted the text of a story about a witch trial into a piece of interactive fiction, powered by Twine/Twee, and then allowed us as an audience to collectively “play” her game. I love the idea of making old artefacts more-accessible to modern audiences through new media, and this was a fun and innovative way to achieve this. You can even play her game online!

(by the way: for those of you who enjoy my IF recommendations: have a look at Detritus; it’s a delightful little experimental/experiential game)

Output from the interactive fiction version of a story about a witch trial.
Things are about to go very badly for Joan Buts.

But while that was clearly my favourite, the judges were far more impressed by the work of my teammate and I, as well as the team who’d adapted my software and used it to investigate different features of the corpus, and decided to divide the cash price between the four of us. Which was especially awesome, because I hadn’t even realised that there was a prize to be had, and I made the most of it at the Drinking About Museums event I attended later in the day.

Members of the other team, who adapted my software, were particularly excited to receive their award.
Cold hard cash! This’ll be useful at the bar, later!

If there’s a moral to take from all of this, it’s that you shouldn’t let your background limit your involvement in “hackathon”-like events. This event was geared towards literature, history, linguistics, and the study of the book… but clearly there was value in me – a computer geek, first and foremost – being there. Similarly, a hack event I attended last year, while clearly tech-focussed, wouldn’t have been as good as it was were it not for the diversity of the attendees, who included a good number of artists and entrepreneurs as well as the obligatory hackers.

Stephen and Dan give a congratulatory nod to one another.
“Nice work, Stephen.” “Nice work, Dan.”

But for me, I think the greatest lesson is that humanities researchers can benefit from thinking a little bit like computer scientists, once in a while. The code I wrote (which uses Ruby and Nokogiri) is freely available for use and adaptation, and while I’ve no idea whether or not it’ll ever be useful to anybody again, what it represents is the research benefits of inter-disciplinary collaboration. It pleases me to see things like the “Library Carpentry” (software for research, with a library slant) seeming to take off.

And yeah, I love a good hackathon.

Update 2015-04-22 11:59: with thanks to Sarah for pointing me in the right direction, you can play the witch trial game in your browser.

× × × × × × × × × × ×

QR Codes of the Bodleian

The Treasures of the Bodleian exhibition opened today, showcasing some of the Bodleian Libraries‘ most awe-inspiring artefacts: fragments of original lyrics by Sappho, charred papyrus from Herculaneum prior to the eruption of Mt. Vesuvius in 79 CE, and Conversation with Smaug, a watercolour by J. R. R. Tolkien to illustrate The Hobbit are three of my favourites. Over the last few weeks, I’ve been helping out with the launch of this exhibition and its website.

Photograph showing a laptop running Ubuntu, in front of a partially-constructed exhibition hall in which museum artefacts are being laid-out in glass cases.
From an elevated position in the exhibition room, I run a few tests of the technical infrastructure whilst other staff set up, below.

In particular, something I’ve been working on are the QR codes. This experiment – very progressive for a sometimes old-fashioned establishment like the Bodleian – involves small two-dimensional barcodes being placed with the exhibits. The barcodes are embedded with web addresses for each exhibit’s page on the exhibition website. Visitors who scan them – using a tablet computer, smartphone, or whatever – are directed to a web page where they can learn more about the item in front of them and can there discuss it with other visitors or can “vote” on it: another exciting new feature in this exhibition is that we’re trying quite hard to engage academics and the public in debate about the nature of “treasures”: what is a treasure?

Close-up photograph showing a small plaque with a QR code alongside interprative text, in an exhibition case.
A QR code in place at the Treasures of the Bodleian exhibition.

In order to improve the perceived “connection” between the QR code and the objects, to try to encourage visitors to scan the codes despite perhaps having little or no instruction, we opted to embed images in the QR codes relating to the objects they related to. By cranking up the error-correction level of a QR code, it’s possible to “damage” them quite significantly and still have them scan perfectly well.

One of my “damaged” QR codes. This one corresponds to The Laxton Map, a 17th Century map of common farming land near Newark on Trent.

We hope that the visual association between each artefact and its QR code will help to make it clear that the code is related to the item (and isn’t, for example, some kind of asset tag for the display case or something). We’re going to be monitoring usage of the codes, so hopefully we’ll get some meaningful results that could be valuable for future exhibitions: or for other libraries and museums.

Rolling Your Own

If you’re interested in making your own QR codes with artistic embellishment (and I’m sure a graphic designer could do a far better job than I did!), here’s my approach:

  1. I used Google Infographics (part of Chart Tools) to produce my QR codes. It’s fast, free, simple, and – crucially – allows control over the level of error correction used in the resulting code. Here’s a sample URL to generate the QR code above:

https://chart.googleapis.com/chart?chs=500×500&cht=qr&chld=H|0&chl=HTTP://TREASURES.BODLEIAN.OX.AC.UK/T7

  1. 500×500 is the size of the QR code. I was ultimately producing 5cm codes because our experiments showed that this was about the right size for our exhibition cabinets, the distance from which people would be scanning them, etc. For laziness, then, I produced codes 500 pixels square at a resolution of 100 pixels per centimetre.
  2. H specifies that we want to have an error-correction level of 30%, the maximum possible. In theory, at least, this allows us to do the maximum amount of “damage” to our QR code, by manipulating it, and still have it work; you could try lower levels if you wanted, and possibly get less-complex-looking codes.
  3. 0 is the width of the border around the QR code. I didn’t want a border (as I was going to manipulate the code in Photoshop anyway), so I use a width of 0.
  4. The URL – HTTP://TREASURES.BODLEIAN.OX.AC.UK/T7  – is presented entirely in capitals. This is because capital letters use fewer bits when encoded as QR codes. “http” and domain names are case-insensitive anyway, and we selected our QR code path names to be in capitals. We also shortened the URL as far as possible: owing to some complicated technical and political limitations, we weren’t able to lean on URL-shortening services like bit.ly, so we had to roll our own. In hindsight, it’d have been nice to have set up the subdomain “t.bodleian.ox.ac.uk”, but this wasn’t possible within the time available. Remember: the shorter the web address, the simpler the code, and simpler codes are easier and faster to read.
  5. Our short URLs redirect to the actual web pages of each exhibit, along with an identifying token that gets picked up by Google Analytics to track how widely the QR codes are being used (and which ones are most-popular amongst visitors).
By now, you’ll have a QR code that looks a little like this.
  1. Load that code up in Photoshop, along with the image you’d like to superimpose into it. Many of the images I’ve had to work with are disturbingly “square”, so I’ve simply taken them, given them a white or black border (depending on whether they’re dark or light-coloured). With others, though, I’ve been able to cut around some of the more-attractive parts of the image in order to produce something with a nicer shape to it. In any case, put your image in as a layer on top of your QR code.
  2. Move the image around until you have something that’s aesthetically-appealing. With most of my square images, I’ve just plonked them in the middle and resized them to cover a whole number of “squares” of the QR code. With the unusually-shaped ones, I’ve positioned them such that they fit in with the pattern of the QR code, somewhat, then I’ve inserted another layer in-between the two and used it to “white out” the QR codes squares that intersect with my image, giving a jagged, “cut out” feel.
  3. Test! Scan the QR code from your screen, and again later from paper, to make sure that it’s intact and functional. If it’s not, adjust your overlay so that it covers less of the QR code. Test in a variety of devices. In theory, it should be possible to calculate how much damage you can cause to a QR code before it stops working (and where it’s safe to cause the damage), but in practice it’s faster to use trial-and-error. After a while, you get a knack for it, and you almost feel as though you can see where you need to put the images so that they just-barely don’t break the codes. Good luck!
Another of my “damaged” QR codes. I’m reasonably pleased with this one.

Give it a go! Make some QR codes that represent your content (web addresses, text, vCards, or whatever) and embed your own images into them to make them stand out with a style of their own.

× × × ×

How To Fix A Flat Car Battery For Beginners

Unfortunatley my plans for a nice relaxed evening over a pint were delayed somewhat by having to help to fix Claire’s car, first. In a fantastic display of sense she’d left the headlights on on Sunday night, and all through Monday, and so by Tuesday the battery was very, very dead.

So she, Bryn, Kit and I stood in a cold and rainy car park, trying to remove Bryn’s car battery to get Claire’s car going, then switch back to her battery while it’s running so we could charge it with a nice long drive. But no such luck: the considerate engineers at Vauxhall decided that to remove the battery you must either (a) own a spanner with a neck width about the size of a human hair or (b) remove the engine first.

Thankfully I was able to persuade a taxi driver at the nearby rank to drive around with some jump leads and get her going. Suddenly this made things a lot easier.

In brighter news, Bryn got offered a year in industry placement with the National Library of Wales, which means that he, too, will be living in Aberystwyth for the summer.