Basilisk collection

This article is a repost promoting content originally published elsewhere. See more things Dan's reposted.

Basilisk collection

The basilisk collection (also known as the basilisk file or basilisk.txt) is a collection of over 125 million partial hash inversions of the SHA-256 cryptographic hash function. Assuming state-of-the art methods were used to compute the inversions, the entries in the collection collectively represent a proof-of-work far exceeding the computational capacity of the human race.[1][2] The collection was released in parts through BitTorrent beginning in June 2018, although it was not widely reported or discussed until early 2019.[3] On August 4th, 2019 the complete collection of 125,552,089 known hash inversions was compiled and published by CryTor, the cybersecurity lab of the University of Toronto.[4]

The existence of the basilisk collection has had wide reaching consequences in the field of cryptography, and has been blamed for catalyzing the January 2019 Bitcoin crash.[2][5][6]

Electronic Frontier Foundation cryptographer Brian Landlaw has said that “whoever made the basilisk is 30 years ahead of the NSA, and the NSA are 30 years ahead of us, so who is there left to trust?”[35]

This is fucking amazing, on a par with e.g. First on the Moon.

Presented in the style of an alternate-reality Wikipedia article, this piece of what the author calls “unfiction” describes the narratively believable-but-spooky (if theoretically unlikely from a technical standpoint) 2018 disclosure of evidence for a new presumed mathematical weakness in the SHA-2 hash function set. (And if that doesn’t sound like a good premise for a story to you, I don’t know what’s wrong with you! 😂)

Cryptographic weaknesses that make feasible attacks on hashing algorithms are a demonstrably real thing. But even with the benefit of the known vulnerabilities in SHA-2 (meet-in-the-middle attacks that involve up-to-halving the search space by solving from “both ends”, plus deterministic weaknesses that make it easier to find two inputs that produce the same hash so long as you choose the inputs carefully) the “article” correctly states that to produce a long list of hash inversions of the kinds described, that follow a predictable sequence, might be expected to require more computer processing power than humans have ever applied to any problem, ever.

As a piece of alternate history science fiction, this piece not only provides a technically-accurate explanation of its premises… it also does a good job of speculating what the impact on the world would have been of such an event. But my single favourite part of the piece is that it includes what superficially look like genuine examples of what a hypothetical basilisk.txt would contain. To do this, the author wrote a brute force hash finder and ran it for over a year. That’s some serious dedication. For those that were fooled by this seemingly-convincing evidence of the realism of the piece, here’s the actual results of the hash alongside the claimed ones (let this be a reminder to you that it’s not sufficient to skim-read your hash comparisons, people!):

basilisk:0000000000:ds26ovbJzDwkVWia1tINLJZ2WXEHBvItMZRxHmYhlQd0spuvPXb6cYFJorDKkqlA

claimed: 00000000000161b9f84a187cc21b1752bf678bdd4d643c17b3b786684d8e9f17
 actual: 0000000000000000000000161b9f84a187cc21b172bf68b3cb3b78684d8e9f17

basilisk:0000000001:dMHUhnoEkmLv8TSE1lnJ7nVIYM8FLYBRtzTiJCM8ziijpTj95MPptu6psZZyLBVA

claimed: 0000000000000000000000cee5fe5df2d3034fff435bb40e8651a18d69e81460
 actual: 0000000000cee5fe5df2d3034fff435bb4232f21c2efce0e8651a18d69e81460

basilisk:0000000002:aSCZwTSmH9ZtqB5gQ27mcGuKIXrghtYIoMp6aKCLvxhlf1FC5D1sZSi2SjwU9EqK

claimed: 000000000000000000000012aabd8d935757db173d5b3e7ae0f25ea4eb775402
 actual: 000000000012aabd8d935757db173d5b3ec6d38330926f7ae0f25ea4eb775402

basilisk:0000000003:oeocInD9uFwIO2x5u9myS4MKQbFW8Vl1IyqmUXHV3jVen6XCoVtuMbuB1bSDyOvE

claimed: 000000000000000000000039d50bb560770d051a3f5a2fe340c99f81e18129d1
 actual: 000000000039d50bb560770d051a3f5a2ffa2281ac3287e340c99f81e18129d1

basilisk:0000000004:m0EyKprlUmDaW9xvPgYMz2pziEUJEzuy6vsSTlMZO7lVVOYlJgJTcEvh5QVJUVnh

claimed: 00000000000000000000002ca8fc4b6396dd5b5bcf5fa80ea49967da55a8668b
 actual: 00000000002ca8fc4b6396dd5b5bcf5fa82a867d17ebc40ea49967da55a8668b

Anyway: the whole thing is amazing and you should go read it.

Ted Chiang Explains the Disaster Novel We All Suddenly Live In

This article is a repost promoting content originally published elsewhere. See more things Dan's reposted.

While there has been plenty of fiction written about pandemics, I think the biggest difference between those scenarios and our reality is how poorly our government has handled it. If your goal is to dramatize the threat posed by an unknown virus, there’s no advantage in depicting the officials responding as incompetent, because that minimizes the threat; it leads the reader to conclude that the virus wouldn’t be dangerous if competent people were on the job. A pandemic story like that would be similar to what’s known as an “idiot plot,” a plot that would be resolved very quickly if your protagonist weren’t an idiot. What we’re living through is only partly a disaster novel; it’s also—and perhaps mostly—a grotesque political satire.

What will “normal” look like after the coronavirus crisis has passed? Will it be the same normal as we’re used to? Or could we actually learn some lessons from this and progress towards something better?

I love Ted Chiang’s writing; enough to reshare this interview even though I’m only lukewarm about it!

21 Books You Don’t Have to Read (and 21 you should read instead)

This article is a repost promoting content originally published elsewhere. See more things Dan's reposted.

The 21 Most Overrated Books Ever (and 21 Books to Read Instead) (GQ)

GQ asked its favorite new authors to dunk on the classics.

We’ve been told all our lives that we can only call ourselves well-read once we’ve read the Great Books. We tried. We got halfway through Infinite Jest and halfway through the SparkNotes on Finnegans Wake. But a few pages into Bleak House, we realized that not all the Great Books have aged well. Some are racist and some are sexist, but most are just really, really boring. So we—and a group of un-boring writers—give you permission to strike these books from the canon. Here’s what you should read instead.

Personally, I quite enjoyed at least two of the books on the “books you don’t have to read” list… but this list has inspired me to look into some of the 21 “you should read instead”.

EEBO-TCP Hackathon

Last month I got the opportunity to attend the EEBO-TCP Hackfest, hosted in the (then still very-much under construction) Weston Library at my workplace. I’ve done a couple of hackathons and similar get-togethers before, but this one was somewhat different in that it was unmistakably geared towards a different kind of geek than the technology-minded folks that I usually see at these things. People like me, with a computer science background, were remarkably in the minority.

Dan Q in Blackwell Hall, at the Weston Library.
Me in the Weston Library (still under construction, as evidenced by the scaffolding in the background).

Instead, this particular hack event attracted a great number of folks from the humanities end of the spectrum. Which is understandable, given its theme: the Early English Books Online Text Creation Partnership (EEBO-TCP) is an effort to digitise and make available in marked-up, machine-readable text formats a huge corpus of English-language books printed between 1475 and 1700. So: a little over three centuries of work including both household names (like Shakespeare, Galileo, Chaucer, Newton, Locke, and Hobbes) and an enormous number of others that you’ll never have heard of.

Dan Q talks to academic Stephen Gregg
After an introduction to the concept and the material, attendees engaged in a speed-networking event to share their thoughts prior to pitching their ideas.

The hackday event was scheduled to coincide with and celebrate the release of the first 25,000 texts into the public domain, and attendees were challenged to come up with ways to use the newly-available data in any way they liked. As is common with any kind of hackathon, many of the attendees had come with their own ideas half-baked already, but as for me: I had no idea what I’d end up doing! I’m not particularly familiar with the books of the 15th through 17th centuries and I’d never looked at the way in which the digitised texts had been encoded. In short: I knew nothing.

Dan Q and Liz McCarthy listen as other attendees pitch their ideas.
The ideas pitch session quickly showed some overlap between different project ideas, and teams were split and reformed a few times as people found the best places for themselves.

Instead, I’d thought: there’ll be people here who need a geek. A major part of a lot of the freelance work I end up doing (and a lesser part of my work at the Bodleian, from time to time) involves manipulating and mining data from disparate sources, and it seemed to me that these kinds of skills would be useful for a variety of different conceivable projects.

Dan Q explains what the spreadsheet he's produced 'means'.
XML may have been our interchange format, but everything fell into Excel in the end for speedy management even by less-technical team members.

I paired up with a chap called Stephen Gregg, a lecturer in 18th century literature from Bath Spa University. His idea was to use this newly-open data to explore the frequency (and the change in frequency over the centuries) of particular structural features in early printed fiction: features like chapters, illustrations, dedications, notes to the reader, encomia, and so on). This proved to be a perfect task for us to pair-up on, because he had the domain knowledge to ask meaningful questions, and I had the the technical knowledge to write software that could extract the answers from the data. We shared our table with another pair, who had technically-similar goals – looking at the change in the use of features like lists and tables (spoiler: lists were going out of fashion, tables were coming in, during the 17th century) in alchemical textbooks – and ultimately I was able to pass on the software tools I’d written to them to adapt for their purposes, too.

Dan Q with two academic-minded humanities folks, talking about their hackathon projects.
A quick meeting on the relative importance of ‘chapters’ as a concept in 16th century literature. Half of the words that the academics are saying go over my head, but I’m formulating XPath queries in my head while I wait.

And here’s where I made a discovery: the folks I was working with (and presumably academics of the humanities in general) have no idea quite how powerful data mining tools could be in giving them new opportunities for research and analysis. Within two hours we were getting real results from our queries and were making amendments and refinements in our questions and trying again. Within a further two hours we’d exhausted our original questions and, while the others were writing-up their findings in an attractive way, I was beginning to look at how the structural differences between fiction and non-fiction might be usable as a training data set for an artificial intelligence that could learn to differentiate between the two, providing yet more value from the dataset. And all the while, my teammates – who’d been used to looking at a single book at a time – were amazed by the possibilities we’d uncovered for training computers to do simple tasks while reading thousands at once.

Laptop showing a map of the area around Old St. Paul's Cathedral.
The area around Old St. Paul’s Cathedral was the place to be if you were a 16th century hipster looking for a new book.

Elsewhere at the hackathon, one group was trying to simulate the view of the shelves of booksellers around the old St. Paul’s Cathedral, another looked at the change in the popularity of colour and fashion-related words over the period (especially challenging towards the beginning of the timeline, where spelling of colours was less-standardised than towards the end), and a third came up with ways to make old playscripts accessible to modern performers.

A graph showing the frequency of colour-related words in English-language books printed over three centuries.
Aside from an increase in the relative frequency of the use of colour words to describe yellow things, there’s not much to say about this graph.

At the end of the session we presented our findings – by which I mean, Stephen explained what they meant – and talked about the technology and its potential future impact – by which I mean, I said what we’d like to allow others to do with it, if they’re so-inclined. And I explained how I’d come to learn over the course of the day what the word encomium meant.

Dan Q presents findings in Excel.
Presenting our findings in amazing technicolour Excel.

My personal favourite contribution from the event was by Sarah Cole, who adapted the text of a story about a witch trial into a piece of interactive fiction, powered by Twine/Twee, and then allowed us as an audience to collectively “play” her game. I love the idea of making old artefacts more-accessible to modern audiences through new media, and this was a fun and innovative way to achieve this. You can even play her game online!

(by the way: for those of you who enjoy my IF recommendations: have a look at Detritus; it’s a delightful little experimental/experiential game)

Output from the interactive fiction version of a story about a witch trial.
Things are about to go very badly for Joan Buts.

But while that was clearly my favourite, the judges were far more impressed by the work of my teammate and I, as well as the team who’d adapted my software and used it to investigate different features of the corpus, and decided to divide the cash price between the four of us. Which was especially awesome, because I hadn’t even realised that there was a prize to be had, and I made the most of it at the Drinking About Museums event I attended later in the day.

Members of the other team, who adapted my software, were particularly excited to receive their award.
Cold hard cash! This’ll be useful at the bar, later!

If there’s a moral to take from all of this, it’s that you shouldn’t let your background limit your involvement in “hackathon”-like events. This event was geared towards literature, history, linguistics, and the study of the book… but clearly there was value in me – a computer geek, first and foremost – being there. Similarly, a hack event I attended last year, while clearly tech-focussed, wouldn’t have been as good as it was were it not for the diversity of the attendees, who included a good number of artists and entrepreneurs as well as the obligatory hackers.

Stephen and Dan give a congratulatory nod to one another.
“Nice work, Stephen.” “Nice work, Dan.”

But for me, I think the greatest lesson is that humanities researchers can benefit from thinking a little bit like computer scientists, once in a while. The code I wrote (which uses Ruby and Nokogiri) is freely available for use and adaptation, and while I’ve no idea whether or not it’ll ever be useful to anybody again, what it represents is the research benefits of inter-disciplinary collaboration. It pleases me to see things like the “Library Carpentry” (software for research, with a library slant) seeming to take off.

And yeah, I love a good hackathon.

Update 2015-04-22 11:59: with thanks to Sarah for pointing me in the right direction, you can play the witch trial game in your browser.