UCLA ACM's First Programming Competition: File Anti-Compression

Problem

Write a file encoder whose output, when sent through gzip, is compressed by as little as possible.
(The full UCLA ACM description of this problem, problem #2, is no longer on the web.)

Results

The results, as announced by email:

Date: Tue, 16 Nov 2004 01:02:02 -0800
From: Adam Harmetz <aharmetz@gmail.com>
Subject: MPC Results!

Thank you all for participating in ACM's Monthly Programming Contest. 
I have graded the results for the two problems this month.

Problem 1 was open only to first and second years.  There were two
winners who correctly identified the longest palindrome in the text of
Shakespeare's Hamlet: Shant Hovsepian and the two-person team of
Brian Chang and Grace Shih.  All three have won a prize package.  In
addition, Brian and Grace's solution was incredibly efficient and
effectively used object-oriented programming to create easy-to-read
code.  They deserve special accolades and won the style points for
problem 1.

There was great diversity within the answer set for problem 2.
Perl, C++, Java, and shell scripting were all employed to try to beat
gzip.  Two themes emerged as the "most successful" solutions.  These
solutions actually expanded the file to be larger than the original
when "compressed."

The first solution used a random number generator (or an array of
random numbers hardcoded into the source) to XOR the bits of the
original file.  Since XOR is symmetric and random number generators
will produce the same values if given the same seed, this solution had
the interesting property that the encoder and decoder were the same
program.
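
For illustration, here is a minimal sketch of that idea in Python (the
seed value and command-line handling are made up, not any contestant's
actual code):

    import random
    import sys

    def xor_transform(data, seed=42):
        # A fixed seed yields the same byte stream on every run, and XOR
        # is its own inverse, so this one function both encodes and
        # decodes.
        rng = random.Random(seed)
        return bytes(b ^ rng.randrange(256) for b in data)

    if __name__ == "__main__":
        with open(sys.argv[1], "rb") as f:
            sys.stdout.buffer.write(xor_transform(f.read()))

Running it twice over a file recovers the original, which is exactly
the encoder-equals-decoder property described above.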

The second solution was clever in its own way.  It employed gzip
itself to compress the original file.  To ensure that the encoded file
was the same length as the original, two different strategies were
used.  One used random data to pad the end of the file to the correct
length.  Another solution was to repeat copies of the gzipped data
until the correct length was reached.  Participants either stored the
length somewhere in the file, added a special character at the end of
the gzipped data, or took advantage of the fact that gzip will ignore
trailing garbage.

Either solution yielded the same expansion of the original file when gzipped.
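
A minimal Python sketch of the second theme (gzip-then-pad, relying on
gzip ignoring trailing garbage; the details are illustrative, not any
specific entry):

    import gzip
    import os
    import zlib

    def encode(data):
        packed = gzip.compress(data)  # gzipped data resists further compression
        # Pad with random bytes to the original length (assumes the
        # input compresses well enough for the gzipped copy to fit).
        return packed + os.urandom(len(data) - len(packed))

    def decode(blob):
        # zlib in gzip mode (wbits=31) stops at the end of the first
        # gzip member, ignoring the random padding, just as gzip does.
        return zlib.decompressobj(wbits=31).decompress(blob)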

Chris Frost, Steve VanDeBogart, Wing-kai Chan, Adam Vartanian, Nathan
Plumley, and Sean Soria were all able to implement one of the above
optimal strategies.  Other participants either employed strategies
that resulted in compression (a non-random choice of values to XOR
with), or didn't fully exploit the random number generators of Java
and Python (drawing from too small a range of random numbers yielded
suboptimal results).

Chris Frost, in addition to providing a wonderful narrative on gzip
and an analysis of what an optimal solution looks like, also employed
a clever trick to beat the rest of the participants.  Since gzip
stores the name of the file you are compressing inside its output, the
longer the encoded file's name, the larger the compressed file.  Chris
chose the maximum filename length for his output encoded file.  This
showed a healthy knowledge of the inner workings of gzip and earned
him the most significant prize package (wireless keyboard and mouse,
Office XP Professional 2003, Visual Studio .NET, and Halo) as well as
special accolades on this contest.

Out of the remaining five high achievers on the contest, Nathan and
Sean were randomly selected to receive a wireless keyboard from
Microsoft.  For the rest, I have copies of Halo for you.

Thanks to everyone who participated.  I'm really happy to see the
enthusiastic response to this ACM program and a healthy desire to
experiment with programming.  Google and Real Networks will both be
involved in designing future challenges and the Engineering Alumni
Association came through with significant funding for future contests.

Expect another contest after Thanksgiving to carry us through the
holiday season.

Thanks, of course, to Dr. Eggert for helping design problem 2 and to
Microsoft for their generous support of the program.

Please send me your resume if you would like to be included in the
special packet being sent to Microsoft (it includes only those who
participated in the contest).

To pick up your prize, you are welcome to drop by my apartment if you
are itching to get it soon (I live on Kelton).  I will also set up an
"office hour" next week where you can pick it up from me on campus.

As a final note, I'd like to let everyone know about the next major
upcoming competition: Imagine Cup.  Sponsored by Microsoft, it offers
a number of cool challenges and puzzles.  Anyone who participated in
our contest should find competing in Imagine Cup right up their alley.
Email me if you want more info, or visit imaginecup.com.

You can email me if you would like more info on how your submission
was graded or other questions.

Take care,
Adam

Solution

And if you're interested, my solution's description/analysis:

ACM Monthly Programming Contest
November 3rd, Second Problem
Chris Frost <frost@cs.ucla.edu>


** Building and Requirements
GNU make will build frontend.

This program requires: some bits and pieces of POSIX (e.g. system()),
gzip, gunzip, cp, cat, echo, and touch.


** Usage
Running frontend will output a help summary.
To encode a file:
frontend -e <input_filename>

To decode a file:
frontend -d <encoded_filename> <output_filename>

Note: the encoded filename should not be changed. frontend will attempt
to detect any important modifications, but it does not detect all changes.


** Algorithm
The goal of this program is to encode a given file so that when this
encoded file is given to gzip, it is larger than any other person's entry :).
Two primary questions come to mind:
- How can I do this?
- Is there a maximum size?

My two obvious answers to how are to produce what looks like random
information (you can't compress randomness well) or to give gzip gzip's
own output.  While producing random noise would make decoding difficult,
cryptography has long studied how to make information have random
properties.

Were the compressor in this contest unknown, I think I would take the
route of encryption; it would probably hold up better in general.
However, knowing gzip is the compressor, I believe we can get a larger
file using gzip itself.

As to whether there is a maximum, at which point we can stop trying:
yes.  zlib, used by gzip, detects when "compressed" information would
be larger than its original size.  In this case, the block (32KB) is
stored without any compression along with five bytes of header
information.  There is an additional fixed overhead for the file
header, plus the variable-sized filename in the file header.  Thus,
the maximum we can make gzip expand a file by is 5B / 32KB + ~50B for
the fixed header + the encoded filename.  This algorithm gets as close
to the 5B / 32KB limit as my measuring has allowed me to inspect.  (I
am sure you could eke out a little more, however.)  This algorithm
also takes advantage of the filename encoding as far as a filesystem
will let it, enlarging the gzipped file by roughly another 2^8 B.
(This is OS dependent.)
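
As a back-of-the-envelope check, a sketch of that bound for a 165KB
input (the ~50B and 255-character figures are the estimates above):

    size = 165 * 1024                 # input size in bytes
    blocks = -(-size // (32 * 1024))  # ceiling: number of 32KB stored blocks
    bound = blocks * 5 + 50 + 255     # 5B/block + fixed header + filename
    print(bound)                      # -> 335 bytes of expansion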

My primary reference for gzip has been http://www.gzip.org/zlib/zlib_tech.html,
as well as its RFCs.

*** Achieving 5B / 32KB Inflation
Our goal is 5B / 32KB; this is how we try to get there.
After a bit of experimentation, I found that running gzip on gzip output
reaches a local maximum fairly quickly, typically in fewer than four
rounds.  What we want is for the judge's gzip run to inflate the file by
as much as possible, so we should compress the file one time fewer than
this.
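
This is easy to observe; for example, with a quick Python loop
(filea.txt is the placeholder name from the problem statement):

    import gzip

    data = open("filea.txt", "rb").read()
    for i in range(5):
        data = gzip.compress(data)
        # Size shrinks on round 1, then creeps up to a local maximum.
        print("round", i + 1, "size", len(data))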

However, things aren't quite this simple because we have to have the same
file length as the input file. Gzip has one sweet property we take advantage
of: it detects non-gzip data at the end of the file and ignores it.

Our approach is thus this: gzip the original file a few times to get us
one step away from the local inflation maxima, write a little non-gzip
data, then write hard-to-compress data until the file is long enough.

Determining the number of gzip runs on a per-file basis would make the
decoding stage more involved, so we simply gzip the input file three
times (what a little testing showed to work well for a couple of files).
We then write a few-byte string as the non-gzip "magic" data.
We then concatenate a copy of the compressed input file onto itself
until the compressed input + magic + this footer has a size at least
that of the input.  The footer is then compressed twice (this turns out
to help a fair bit), and then all three parts are merged into one file.
This single file is then truncated to the length of the original file.
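
A rough Python sketch of the pipeline so far (the magic string is a
made-up stand-in, the three/two round counts follow the description
above, and the real frontend shells out to gzip instead):

    import gzip

    MAGIC = b"not-gzip"  # hypothetical non-gzip marker value

    def gzip_times(data, times):
        # mtime=0 and no stored filename keep our own headers minimal
        # (see the side note below on avoiding header bytes).
        for _ in range(times):
            data = gzip.compress(data, mtime=0)
        return data

    def encode(data):
        body = gzip_times(data, 3)       # three rounds worked well in testing
        footer = body
        while len(body) + len(MAGIC) + len(footer) < len(data):
            footer += body               # repeat copies of the gzipped input
        footer = gzip_times(footer, 2)   # gzipping gzip output inflates it
        # Truncate to the original length; assumes the input is larger
        # than ~32KB (see the snags below).
        return (body + MAGIC + footer)[:len(data)]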

Finally, we output the encoded file.  However, gzip stores the name of
the file it compresses inside its output.  We can take advantage of
this and give the encoded file a long filename, which will increase
the size of the judge's gzip output by that much more.  How long a
filename can be depends on the OS; with Linux and ext2, filenames are
limited to 255 characters.  (There are C constants giving maximum
filename lengths, but they are often much longer than what the OS's
filesystem actually supports.  We thus determine the limit
experimentally at runtime.)
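
A sketch of such a runtime probe in Python (a hypothetical helper, not
the actual frontend code):

    import os

    def max_filename_length(directory=".", cap=4096):
        # Constants like NAME_MAX often overstate what the filesystem
        # accepts, so find the limit by actually creating files.
        longest = 0
        for n in range(1, cap + 1):
            path = os.path.join(directory, "x" * n)
            try:
                open(path, "w").close()
                os.remove(path)
                longest = n
            except OSError:  # e.g. ENAMETOOLONG
                break
        return longest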

A couple of snags:
- Exactly as we are doing to gzip, we could be given a file that gzip
would inflate.  We detect this in all cases and fall back to "encoding"
the input by copying it.  We communicate this fact to the decoder
through the filename.
- We assume that the data we will be dealing with is larger than around
32KB.  If this assumption fails, our copies will fit within gzip's
blocksize and the judge's gzip run will compress our encoding.  I'm
assuming/hoping this won't be the case, as I think people will be more
likely to take the mindset that large files are difficult, and the
problem statement says "a rather long text file" will be used.

A side note: one interesting thing I found was that we actually want
to *avoid* storing as much header information in the encode phase as
possible!  Headers are not compressed, so every byte in a header is a
byte not helping push the number of 32KB blocks higher.  Thus the
encode phase tells gzip not to store original filenames and timestamps.
On a 165KB file, this makes the judge's gzipped file perhaps 2B
larger. :)
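
With the gzip command line this is the -n (--no-name) flag; the Python
equivalent is a sketch like:

    import gzip

    # Compressing from memory stores no original filename, and mtime=0
    # zeroes the timestamp field, keeping the header as small as it gets.
    packed = gzip.compress(b"example data", mtime=0)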

encode() contains more detailed documentation on the encoding.

*** Decoding
decode() explains this pretty straightforwardly: we just reverse the
operations of encode().  This turns out to be simpler, since gzip
throws the footer and magic away for us and we never deal with them.
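
In the same sketch form as the encoder above, the decoder amounts to
(wbits=31 tells zlib to expect gzip framing and to stop at the end of
the first gzip member, discarding the magic and footer):

    import zlib

    def decode(encoded):
        data = encoded
        for _ in range(3):  # undo the three gzip rounds from encoding
            data = zlib.decompressobj(wbits=31).decompress(data)
        return data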

*** Room for Improvement
- Be able to analytically calculate how many times to gzip data and allow
  this number to vary per file.

- With the above, we could run gzip on what is now the encoded file.
  From experiments, running gzip on a concatenated file will push the
  judge's filesize a little higher.

- Support file encoding even when the input can be compressed and
  concatenated a few times, but not enough for the full process.

- Perhaps use zlib rather than gzip, which I believe stores less header data.


** Problem

Summary: Write an encoder and decoder such that, given an arbitrary
text file, the encoder's output (which must be the same size as the
original) is compressed the least by the standard gzip utility. You
must be able to use the decoder on the encoded file to get back the
exact original.

Eligibility: All UCLA students

Due date: November 3rd

Further description: To explain this in more detail, we will explain
how we will be testing your encoder/decoder. First, we will choose a
rather long text file, call it filea.txt. Next, we will run your
encoder on this text file. The result of this process should be a
separate file that is the exact same length as the original, call it
fileaencoded.txt. You may produce fileaencoded.txt in any manner that
you see fit. However, we must be able to run your decoder on
fileaencoded.txt to produce filea.txt again exactly (We will use diff
to check). To determine the winner of the contest, we will run gzip on
fileaencoded.txt to produce fileaencoded.gz. Whichever competitor's
fileaencoded.gz is the largest will be the winner of the
competition. Note that we may run this process on several different
files and take the weighted average file size of the trials.

What to turn in: You must turn in the source of your encoder and
decoder, written in the language of your choice. Both files must be
able to compile (if necessary) and run on a SEASnet machine. Please
also turn in a README file if appropriate. Email your files to
ucla.acm at gmail.com

What determines the winner: Whoever has the largest average file size
across all our test cases will win. If there are multiple entries that
fit this criterion, all winners will be entered in a random drawing
for available prizes. In addition, "style points" will be awarded for
clever, efficient, or otherwise unique answers. These contestants will
receive additional prizes.