Note #7435

Off to #musetech17 (@ukmcg) at the @I_W_M, who also happen to be @3RingsCIC users… #wearingmanyhats

“Jump Man” from Footprints Tours Oxford making people jump around outside the Bodleian Library

If you’re a tourist on one of “Jump Man” of Footprints Tours’ tours, I’m sure that the obligatory “jump for a photograph” moment at the end is a fun novelty. However, the novelty quickly wears off when you work in one of the library offices right next to their usual spot, and the call of “3… 2… 1… JUMP!” is the loudest thing you hear all day, each day, throughout the summer season.

Also available on YouTube.

EEBO-TCP Hackathon

Last month I got the opportunity to attend the EEBO-TCP Hackfest, hosted in the (then still very-much under construction) Weston Library at my workplace. I’ve done a couple of hackathons and similar get-togethers before, but this one was somewhat different in that it was unmistakably geared towards a different kind of geek than the technology-minded folks that I usually see at these things. People like me, with a computer science background, were remarkably in the minority.

Dan Q in Blackwell Hall, at the Weston Library.
Me in the Weston Library (still under construction, as evidenced by the scaffolding in the background).

Instead, this particular hack event attracted a great number of folks from the humanities end of the spectrum. Which is understandable, given its theme: the Early English Books Online Text Creation Partnership (EEBO-TCP) is an effort to digitise and make available in marked-up, machine-readable text formats a huge corpus of English-language books printed between 1475 and 1700. So: a little over three centuries of work including both household names (like Shakespeare, Galileo, Chaucer, Newton, Locke, and Hobbes) and an enormous number of others that you’ll never have heard of.

Dan Q talks to academic Stephen Gregg
After an introduction to the concept and the material, attendees engaged in a speed-networking event to share their thoughts prior to pitching their ideas.

The hackday event was scheduled to coincide with and celebrate the release of the first 25,000 texts into the public domain, and attendees were challenged to come up with ways to use the newly-available data in any way they liked. As is common with any kind of hackathon, many of the attendees had come with their own ideas half-baked already, but as for me: I had no idea what I’d end up doing! I’m not particularly familiar with the books of the 15th through 17th centuries and I’d never looked at the way in which the digitised texts had been encoded. In short: I knew nothing.

Dan Q and Liz McCarthy listen as other attendees pitch their ideas.
The ideas pitch session quickly showed some overlap between different project ideas, and teams were split and reformed a few times as people found the best places for themselves.

Instead, I’d thought: there’ll be people here who need a geek. A major part of a lot of the freelance work I end up doing (and a lesser part of my work at the Bodleian, from time to time) involves manipulating and mining data from disparate sources, and it seemed to me that these kinds of skills would be useful for a variety of different conceivable projects.

Dan Q explains what the spreadsheet he's produced 'means'.
XML may have been our interchange format, but everything fell into Excel in the end for speedy management even by less-technical team members.

I paired up with a chap called Stephen Gregg, a lecturer in 18th century literature from Bath Spa University. His idea was to use this newly-open data to explore the frequency (and the change in frequency over the centuries) of particular structural features in early printed fiction: features like chapters, illustrations, dedications, notes to the reader, encomia, and so on). This proved to be a perfect task for us to pair-up on, because he had the domain knowledge to ask meaningful questions, and I had the the technical knowledge to write software that could extract the answers from the data. We shared our table with another pair, who had technically-similar goals – looking at the change in the use of features like lists and tables (spoiler: lists were going out of fashion, tables were coming in, during the 17th century) in alchemical textbooks – and ultimately I was able to pass on the software tools I’d written to them to adapt for their purposes, too.

Dan Q with two academic-minded humanities folks, talking about their hackathon projects.
A quick meeting on the relative importance of ‘chapters’ as a concept in 16th century literature. Half of the words that the academics are saying go over my head, but I’m formulating XPath queries in my head while I wait.

And here’s where I made a discovery: the folks I was working with (and presumably academics of the humanities in general) have no idea quite how powerful data mining tools could be in giving them new opportunities for research and analysis. Within two hours we were getting real results from our queries and were making amendments and refinements in our questions and trying again. Within a further two hours we’d exhausted our original questions and, while the others were writing-up their findings in an attractive way, I was beginning to look at how the structural differences between fiction and non-fiction might be usable as a training data set for an artificial intelligence that could learn to differentiate between the two, providing yet more value from the dataset. And all the while, my teammates – who’d been used to looking at a single book at a time – were amazed by the possibilities we’d uncovered for training computers to do simple tasks while reading thousands at once.

Laptop showing a map of the area around Old St. Paul's Cathedral.
The area around Old St. Paul’s Cathedral was the place to be if you were a 16th century hipster looking for a new book.

Elsewhere at the hackathon, one group was trying to simulate the view of the shelves of booksellers around the old St. Paul’s Cathedral, another looked at the change in the popularity of colour and fashion-related words over the period (especially challenging towards the beginning of the timeline, where spelling of colours was less-standardised than towards the end), and a third came up with ways to make old playscripts accessible to modern performers.

A graph showing the frequency of colour-related words in English-language books printed over three centuries.
Aside from an increase in the relative frequency of the use of colour words to describe yellow things, there’s not much to say about this graph.

At the end of the session we presented our findings – by which I mean, Stephen explained what they meant – and talked about the technology and its potential future impact – by which I mean, I said what we’d like to allow others to do with it, if they’re so-inclined. And I explained how I’d come to learn over the course of the day what the word encomium meant.

Dan Q presents findings in Excel.
Presenting our findings in amazing technicolour Excel.

My personal favourite contribution from the event was by Sarah Cole, who adapted the text of a story about a witch trial into a piece of interactive fiction, powered by Twine/Twee, and then allowed us as an audience to collectively “play” her game. I love the idea of making old artefacts more-accessible to modern audiences through new media, and this was a fun and innovative way to achieve this. You can even play her game online!

(by the way: for those of you who enjoy my IF recommendations: have a look at Detritus; it’s a delightful little experimental/experiential game)

Output from the interactive fiction version of a story about a witch trial.
Things are about to go very badly for Joan Buts.

But while that was clearly my favourite, the judges were far more impressed by the work of my teammate and I, as well as the team who’d adapted my software and used it to investigate different features of the corpus, and decided to divide the cash price between the four of us. Which was especially awesome, because I hadn’t even realised that there was a prize to be had, and I made the most of it at the Drinking About Museums event I attended later in the day.

Members of the other team, who adapted my software, were particularly excited to receive their award.
Cold hard cash! This’ll be useful at the bar, later!

If there’s a moral to take from all of this, it’s that you shouldn’t let your background limit your involvement in “hackathon”-like events. This event was geared towards literature, history, linguistics, and the study of the book… but clearly there was value in me – a computer geek, first and foremost – being there. Similarly, a hack event I attended last year, while clearly tech-focussed, wouldn’t have been as good as it was were it not for the diversity of the attendees, who included a good number of artists and entrepreneurs as well as the obligatory hackers.

Stephen and Dan give a congratulatory nod to one another.
“Nice work, Stephen.” “Nice work, Dan.”

But for me, I think the greatest lesson is that humanities researchers can benefit from thinking a little bit like computer scientists, once in a while. The code I wrote (which uses Ruby and Nokogiri) is freely available for use and adaptation, and while I’ve no idea whether or not it’ll ever be useful to anybody again, what it represents is the research benefits of inter-disciplinary collaboration. It pleases me to see things like the “Library Carpentry” (software for research, with a library slant) seeming to take off.

And yeah, I love a good hackathon.

Update 2015-04-22 11:59: with thanks to Sarah for pointing me in the right direction, you can play the witch trial game in your browser.

× × × × × × × × × × ×

Squiz CMS Easter Eggs (or: why do I keep seeing Greg’s name in my CAPTCHA?)

Anybody who has, like me, come into contact with the Squiz Matrix CMS for any length of time will have come across the reasonably easy-to-read but remarkably long CAPTCHA that it shows. These are especially-noticeable in its administrative interface, where it uses them as an exaggerated and somewhat painful “are you sure?” – restarting the CMS’s internal crontab manager, for example, requires that the administrator types a massive 25-letter CAPTCHA.

Four long CAPTCHA from the Squiz Matrix CMS.
Four long CAPTCHA from the Squiz Matrix CMS.

But there’s another interesting phenomenon that one begins to notice after seeing enough of the back-end CAPTCHA that appear. Strange patterns of letters that appear in sequence more-often than would be expected by chance. If you’re a fan of wordsearches, take a look at the composite screenshot above: can you find a person’s name in each of the four lines?

Four long CAPTCHA from the Squiz Matrix CMS, with the names Greg, Dom, Blair and Marc highlighted.
Four long CAPTCHA from the Squiz Matrix CMS, with the names Greg, Dom, Blair and Marc highlighted.

There are four names – GregDomBlair and Marc – which routinely appear in these CAPTCHA. Blair, being the longest name, was the first that I noticed, and at first I thought that it might represent a fault in the pseudorandom number generation being used that was resulting in a higher-than-normal frequency of this combination of letters. Another idea I toyed with was that the CAPTCHA text might be being entirely generated from a set of pronounceable syllables (which is a reasonable way to generate one-time passwords that resist entry errors resulting from reading difficulties: in fact, we do this at Three Rings), in which these four names also appear, but by now I’d have thought that I’d have noticed this in other patterns, and I hadn’t.

Instead, then, I had to conclude that these names were some variety of Easter Egg.

In software (and other media), "Easter Eggs" are undocumented hidden features, often in the form of inside jokes.
Smiley decorated eggs. Picture courtesy Kate Ter Haar.

I was curious about where they were coming from, so I searched the source code, but while I found plenty of references to Greg Sherwood, Marc McIntyre, and Blair Robertson. I couldn’t find Dom, but I’ve since come to discover that he must be Dominic Wong – these four were, according to Greg’s blog – developers with Squiz in the early 2000s, and seemingly saw themselves as a dynamic foursome responsible for the majority of the CMS’s code (which, if the comment headers are to be believed, remains true).

Greg, Marc, Blair and Dom, as depicted in Greg's 2007 blog post.
Greg, Marc, Blair and Dom, as depicted in Greg’s 2007 blog post.

That still didn’t answer for me why searching for their names in the source didn’t find the responsible code. I started digging through the CMS’s source code, where I eventually found fudge/general/general.inc (a lot of Squiz CMS code is buried in a folder called “fudge”, and web addresses used internally sometimes contain this word, too: I’d like to believe that it’s being used as a noun and that the developers were just fans of the buttery sweet, but I have a horrible feeling that it was used in its popular verb form). In that file, I found this function definition:

/**
 * Generates a string to be used for a security key
 *
 * @param int            $key_len                the length of the random string to display in the image
 * @param boolean        $include_uppercase      include uppercase characters in the generated password
 * @param boolean        $include_numbers        include numbers in the generated password
 *
 * @return string
 * @access public
 */
function generate_security_key($key_len, $include_uppercase = FALSE, $include_numbers = FALSE) {
  $k = random_password($key_len, $include_uppercase, $include_numbers);
  if ($key_len > 10) {
    $gl = Array('YmxhaXI=', 'Z3JlZw==', 'bWFyYw==', 'ZG9t');
    $g = base64_decode($gl[rand(0, (count($gl) - 1)) ]);
    $pos = rand(1, ($key_len - strlen($g)));
    $k = substr($k, 0, $pos) . $g . substr($k, ($pos + strlen($g)));
  }
  return $k;
} //end generate_security_key()

For the benefit of those of you who don’t speak PHP, especially PHP that’s been made deliberately hard to decipher, here’s what’s happening when “generate_security_key” is being called:

  • A random password is being generated.
  • If that password is longer than 10 characters, a random part of it is being replaced with either “blair”, “greg”, “marc”, or “dom”. The reason that you can’t see these words in the code is that they’re trivially-encoded using a scheme called Base64 – YmxhaXI=Z3JlZw==, bWFyYw==, and ZG9t are Base64 representations of the four names.

This seems like a strange choice of Easter Egg: immortalising the names of your developers in CAPTCHA. It seems like a strange choice especially because this somewhat weakens the (already-weak) CAPTCHA, because an attacking robot can quickly be configured to know that a 11+-letter codeword will always consist of letters and exactly one instance of one of these four names: in fact, knowing that a CAPTCHA will always contain one of these four and that I can refresh until I get one that I like, I can quickly turn an 11-letter CAPTCHA into a 6-letter one by simply refreshing until I get one with the longest name – Blair – in it!

A lot has been written about how Easter Eggs undermine software security (in exchange for a small boost to developer morale) – that’s a major part of why Microsoft has banned them from its operating systems (and, for the most part, Apple has too). Given that these particular CAPTCHA in Squiz CMS are often nothing more than awkward-looking “are you sure?” dialogs, I’m not concerned about the direct security implications, but it does make me worry a little about the developer culture that produced them.

I know that this Easter Egg might be harmless, but there’s no way for me to know (short of auditing the entire system) what other Easter Eggs might be hiding under the surface and what they do, especially if the developers have, as in this case, worked to cover their tracks! It’s certainly the kind of thing I’d worry about if I were, I don’t know, a major government who use Squiz software, especially their cloud-hosted variants which are harder to effectively audit. Just a thought.

× × ×

Club Lounge at Gatwick airport

This self-post was originally posted to /r/MegaMasonsLounge. See more things from Dan's Reddit account.

On the way out to the French Alps for a week of skiing, and we had enough air miles to upgrade to business class on the way out, so I’m sat in the lounge enjoying complimentary gin & tonic and croissants. 10 in the morning, and I’m already buzzed: after a long and hectic few months, I’m really glad to be off on holiday!

Aaaand…. right before I left I put in an application for my boss’s job, which she vacated a few months ago. Should hear by the time I get back whether I’m being invited to interview, so that’s exciting too!

Anyway: just wanted to share my excitement with my favourite MegaMasons. If I’m not online much this week, you’ll know why! Have a great week, folks: love you all!

Rave Reviews for Your Password Sucks

Last month, I volunteered myself to run a breakout session at the 2012 UAS Conference, an annual gathering of up to a thousand Oxford University staff. I’d run a 2-minute micropresentation at the July 2011 OxLibTeachMeet called “Your Password Sucks!”, and I thought I’d probably be able to expand that into a larger 25-minute breakout session.

Your password: How bad guys will steal your identity
My expanded presentation was called “Your password: How bad guys will steal your identity”, because I wasn’t sure that I’d get away with the title “Your Password Sucks” at a larger, more-formal event.

The essence of my presentation boiled down to demonstrating four points. The first was you are a target – dispelling the myth that the everyday person can consider themselves safe from the actions of malicious hackers. I described the growth of targeted phishing attacks, and relayed the sad story of Mat Honan’s victimisation by hackers.

The second point was that your password is weak: I described the characteristics of good passwords (e.g. sufficiently long, complex, random, and unique) and pointed out that even among folks who’d gotten a handle on most of these factors, uniqueness was still the one that tripped people over. A quarter of people use only a single password for most or all of their accounts, and over 50% use 5 or fewer passwords across dozens of accounts.

You are a target. Your password is weak. Attacks are on the rise. You can protect yourself.
The four points I wanted to make through my presentation. Starting by scaring everybody ensured that I had their attention right through ’til I told them what they could do about it, at the end.

Next up: attacks are on the rise. By a combination of statistics, anecdotes, audience participation and a theoretical demonstration of how a hacker might exploit shared-password vulnerabilities to gradually take over somebody’s identity (and then use it as a platform to attack others), I aimed to show that this is not just a hypothetical scenario. These attacks really happen, and people lose their money, reputation, or job over them.

Finally, the happy ending to the story: you can protect yourself. Having focussed on just one aspect of password security (uniqueness), and filling a 25-minute slot with it, I wanted to give people some real practical suggestions for the issue of password uniqueness. These came in the form of free suggestions that they could implement today. I suggested “cloud” options (like LastPass or 1Password), hashing options (like SuperGenPass), and “offline” technical options (like KeePass or a spreadsheet bundles into a TrueCrypt volume).

I even suggested a non-technical option involving a “master” password that is accompanied by one of several unique prefixes. The prefixes live on a Post-It Note in your wallet. Want a backup? Take a picture of them with your mobile: they’re worthless without the master password, which lives in your head. It’s not as good as a hash-based solution, because a crafty hacker who breaks into several systems might be able to determine your master password, but it’s “good enough” for most people and a huge improvement on using just 5 passwords everywhere! (another great “offline” mechanism is Steve Gibson’s Off The Grid system)

"Delivery" ratings for the UAS Conference "breakout" sessions
My presentation – marked on the above chart – left people “Very Satisfied” significantly more than any other of the 50 breakout sessions.

And it got fantastic reviews! That pleased me a lot. The room was packed, and eventually more chairs had to be brought in for the 70+ folks who decided that my session was “the place to be”. The resulting feedback forms made me happy, too: on both Delivery and Content, I got more “Very Satisfied” responses than any other of the 50 breakout sessions, as well as specific comments. My favourite was:

Best session I have attended in all UAS conferences. Dan Q gave a 5 star performance.

So yeah; hopefully they’ll have me back next year.

×

Quiet On Set

Before I started working for the Bodleian, I’d never worked somewhere where there was a significant risk of a film crew coming between me and my office. But since then, it seems to happen with a startling regularity.

This morning, I was almost late for work as I fought my way past a film crew shooting The Quiet Ones, some variety of supernatural thriller B-movie.

This guy. That bridge. Listen.
This guy. That bridge. Listen.

So, when you end up watching it: wait until you get to the scene where this guy walks under the Hertford Bridge, and listen carefully for the sound of somebody walking across gravel just off camera. That’s me, putting my bike away having finally squeezed my way past all of the cameras and equipment on the way to my office.

×