Zip It – Finding File Similarity Using Compression Utilities

This is a repost promoting content originally published elsewhere. See more things Dan's reposted.

This was an enjoyable video. Nothing cutting-edge, but a description of an imaginative use of an everyday algorithm – DEFLATE, which is what powers most of the things you consider “ZIP files” – to do pattern-matching and comparison between two files. The tl;dr is pretty simple:

  • Lossless compression works by looking for repetition, and replacing the longest/most-repeated content with references to a lookup table.
  • Therefore, the reduction-in-size from compressing a file is an indicator of the amount of repetition within it.
  • Therefore, the difference in reduction-in-size of compressing a single file to the reduction-in-size of compressing a pair of files is indicative of their similarity, because the greatest compression gains come from repetition of data that is shared across both files.
  • This can be used, for example, to compare the same document written in two languages as an indication of the similarity of the languages to one another, or to compare the genomes of two organisms as an indication of their genetic similarity (and therefore how closely-related they are).

I love it when somebody finds a clever and novel use for an everyday tool.

0 comments

    Reply here

    Your email address will not be published. Required fields are marked *

    Reply on your own site

    Reply elsewhere

    You can reply to this post on Mastodon (@blog@danq.me).

    Reply by email

    I'd love to hear what you think. Send an email to b26897@danq.me; be sure to let me know if you're happy for your comment to appear on the Web!