Zip It – Finding File Similarity Using Compression Utilities

This is a repost promoting content originally published elsewhere. See more things Dan's reposted.

This was an enjoyable video. Nothing cutting-edge, but a description of an imaginative use of an everyday algorithm – DEFLATE, which is what powers most of the things you consider “ZIP files” – to do pattern-matching and comparison between two files. The tl;dr is pretty simple:

  • Lossless compression works by looking for repetition, and replacing the longest/most-repeated content with references to a lookup table.
  • Therefore, the reduction-in-size from compressing a file is an indicator of the amount of repetition within it.
  • Therefore, the difference in reduction-in-size of compressing a single file to the reduction-in-size of compressing a pair of files is indicative of their similarity, because the greatest compression gains come from repetition of data that is shared across both files.
  • This can be used, for example, to compare the same document written in two languages as an indication of the similarity of the languages to one another, or to compare the genomes of two organisms as an indication of their genetic similarity (and therefore how closely-related they are).

I love it when somebody finds a clever and novel use for an everyday tool.

Zopfli Optimization: Literally Free Bandwidth

This is a repost promoting content originally published elsewhere. See more things Dan's reposted.

In 2007 I wrote about using PNGout to produce amazingly small PNG images. I still refer to this topic frequently, as seven years later, the average PNG I encounter on the Internet is very unlikely to be optimized.

For example, consider this recent Perry Bible Fellowship cartoon