Monday, March 24, 2008

Quicky Binary File Visual Analysis

I've been reading Greg Conti's book, Security Data Visualization. If I'm honest, I was looking for new ideas for framing and presenting data to folks outside of security. But that's not really what this book is about. It's a good introduction to visualization as an analysis tool, but there's very little polish and presentation to the graphs in Greg's book.

It's what I wasn't looking for in this book, however, that wound up catching my eye. In Chapter 2, "The Beauty of Binary File Visualization," there's a comparison (on p31) that struck me. It's a set of images that are graphical representations of Word document files protected via various password methods. It was clear by looking at the graphs which methods were thorough and effective and which ones weren't. And it struck me that this is an accessible means of evaluating crypto for someone like me who sucks at math. And, hey, I just happen to have some crypto to evaluate. More on that some other time, but here's what I did. It's exceedingly simple.

I wanted to take a binary file of no known format and count how many times each byte value occurs within that file, regardless of where in the file it occurs. I wrote a Perl script to do this:

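use strict;
use warnings;

my $buffer = "";
my $file = $ARGV[0] or die("Usage: $0 [filename]\n");
open(FILE, "<", $file) or die("Could not open file: $file\n");
binmode(FILE);  # read raw bytes; no newline translation on Windows
my $filesz = -s $file;
read(FILE, $buffer, $filesz, 0);
close(FILE);

# one counter per possible byte value, all starting at zero
my @bytes = (0) x 256;
foreach (split(//, $buffer)) {
  $bytes[ord($_)]++;
}

# print "value count" pairs, one per line, for gnuplot
for my $i (0 .. 255) {
  print "$i $bytes[$i]\n";
}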

This script outputs two columns of data: the first is the byte value (0-255), and the second is the number of times that byte value occurred in the file. The idea is to redirect this output to a text file and then use it to generate some graphs in gnuplot.

For my example, I analyzed a copy of nc.exe and also a symmetric-key-encrypted copy of that same file. I generated two files using the above Perl script, one called "bin.dat" and another called "crypt.dat". Then I fired up Cygwin/X and gnuplot and created some graphs using the following settings:

# the X axis range is 0-255 because those are all of
# the possible byte values

set xrange [0:255] noreverse nowriteback

set xlabel "byte value"

# the max Y axis value 1784 comes from the bin.dat file
# `cut -d' ' -f2 bin.dat | sort -n | tail -1`

set yrange [0:1784] noreverse nowriteback

set ylabel "count"

I then ran:

plot "bin.dat" using 1:2 with impulses

...which generated this:

I then repeated the process with the other file:

plot "crypt.dat" using 1:2 with impulses

...which generated this:

As you can see, there's a clear difference between the encrypted file and the unencrypted file when it comes to byte count and uniqueness. Using the same xrange/yrange directives for both graphs helps emphasize this visually, since both are drawn to the same scale. The expectation would be that weak or "snake-oil" crypto schemes would look more like the unencrypted binary and less like the PGP-encrypted file.
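
If you'd rather have a few numbers to go with the graphs, here is a minimal sketch that summarizes a .dat file produced by the script above. The helper name summarize_counts and the fields it reports are my own choices for illustration, not anything from Greg's book, and it assumes the input is the same two-column "value count" format.

```perl
use strict;
use warnings;

# Hypothetical helper: summarize "value count" lines such as those in bin.dat.
# Returns how many byte values actually occur, the total byte count,
# and the largest and smallest per-value counts.
sub summarize_counts {
    my (@lines) = @_;
    my ($distinct, $total, $max, $min) = (0, 0, 0, undef);
    for my $line (@lines) {
        my ($value, $count) = split ' ', $line;
        next unless defined $count;          # skip blank or malformed lines
        $total += $count;
        $distinct++ if $count > 0;
        $max = $count if $count > $max;
        $min = $count if !defined $min || $count < $min;
    }
    return { distinct => $distinct, total => $total, max => $max, min => $min };
}

# Only runs when a file name is given, e.g.: perl summarize.pl bin.dat
if (@ARGV) {
    my @lines = <>;
    my $s = summarize_counts(@lines);
    print "distinct byte values: $s->{distinct}\n";
    print "total bytes: $s->{total}\n";
    print "max count: $s->{max}\n";
    print "min count: $s->{min}\n";
}
```

Running it against both bin.dat and crypt.dat gives a rough side-by-side: a small gap between the max and min counts means a flatter graph.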


Erik Heidt said...

Paul -

First, thanks for sharing this code. I happen to need something that does just this!

A few comments on using frequency analysis as a "Snake Oil" vs good crypto test:

Please take a look at the Linux Penguin photos in this Wikipedia entry: Block Cipher Modes.

Note how you can still see many of the features of the image when it is encrypted with ECB mode.

But, in another mode, like CBC, any block cipher will pass the frequency analysis test, even 8-bit XOR. (I am thinking about coding up a little example of that.)

Also, be aware that how the keys are managed and protected is often much more critical than the cipher, mode, or key-length choices.

Thanks for the great post!

Erik Heidt
Art of Information Security

PaulM said...

Thanks for the comment, Erik.

You make a very good point that this type of test is not in any way thorough, and that things like XOR-ing a file will produce a similar visual result.
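
For anyone who wants to try the experiment Erik describes, here is a rough sketch of 8-bit XOR chained CBC-style, in Perl to match the script above. Everything in it is an assumption for illustration: the "block cipher" is a one-byte block XORed with a one-byte key, the key 0x5A and the zero IV are arbitrary, and the function names are made up. It is a toy for generating test data, not real encryption. How flat the resulting graph looks depends on the input; a file of one repeated byte, for example, comes out as just two alternating values.

```perl
use strict;
use warnings;

# Toy CBC with a one-byte block: c[i] = (p[i] ^ c[i-1]) ^ key, starting from the IV.
sub xor_cbc_encrypt {
    my ($plain, $key, $iv) = @_;
    my $prev = $iv;
    my @out;
    for my $p (unpack 'C*', $plain) {
        $prev = ($p ^ $prev) ^ $key;
        push @out, $prev;
    }
    return pack 'C*', @out;
}

# Inverse: p[i] = (c[i] ^ key) ^ c[i-1].
sub xor_cbc_decrypt {
    my ($cipher, $key, $iv) = @_;
    my $prev = $iv;
    my @out;
    for my $c (unpack 'C*', $cipher) {
        push @out, ($c ^ $key) ^ $prev;
        $prev = $c;
    }
    return pack 'C*', @out;
}

# Number of different byte values present in a string.
sub distinct_bytes {
    my ($data) = @_;
    my %seen;
    $seen{$_}++ for unpack 'C*', $data;
    return scalar keys %seen;
}

# Only runs when a file name is given, e.g.: perl xorcbc.pl nc.exe > xor.dat
# Prints the same "value count" columns as the counting script.
if (@ARGV) {
    open(my $fh, '<', $ARGV[0]) or die "Could not open file: $ARGV[0]\n";
    binmode($fh);
    my $data = do { local $/; <$fh> } // '';
    close($fh);
    my @counts = (0) x 256;
    $counts[$_]++ for unpack 'C*', xor_cbc_encrypt($data, 0x5A, 0x00);
    print "$_ $counts[$_]\n" for 0 .. 255;
}
```

The output can be plotted with the same gnuplot settings as bin.dat and crypt.dat to compare all three.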