Assignment #5: Huffman Codes
(Based on an assignment given by Owen Astrachan at Duke University.)
You are urged to work in groups of two or three. Each group should
submit ONE program per group.
Introduction
This assignment involves implementing a compression/uncompression program
based on the greedy Huffman algorithm discussed in the text and in class.
The resulting program will be a complete and useful compression program
although not, perhaps, as powerful as the standard Windows utilities such
as winzip which use different algorithms permitting a higher degree
of compression than Huffman coding. You should take care to develop the
program in an incremental fashion. If you try to write the whole program
at once, you probably will not get a completely working program! Design,
develop, and test so that you have a working program at each step.
Because the compression scheme involves reading and writing in a bits-at-a-time
manner as opposed to a char-at-a-time manner, the program can be hard to
debug. In order to facilitate the design/code/debug cycle, an iterative
enhancement development of the program is suggested below. You are
by no means constrained to adopt this approach, but it might prove useful
if your final program doesn't work to show useful and working programs
along the way.
You are to write two programs:
-
huff.cpp This
program compresses a file
-
unhuff.cpp
This program uncompresses a (previously huffed/compressed) file
Both programs may eventually read from a file specified by the user on
the command line, but initially the user can be prompted for the file name
(or you could hardwire the filename into the program, but this is strongly
discouraged). The program huff.cpp should write its (compressed)
output to a file also specified in some way (by user, on command line,
or to cout). The line below might compress the file foo.txt storing
the compressed version in foo.txt.H.
huff foo.txt
Later you might provide the user the option of naming the output file explicitly:
huff foo.txt -o foo.txt-huffed
As described below in the development section,
you MUST use a class-based approach to this problem. This means, for example,
that you should implement at least one class. Some suggestions are provided
in the development section.
Assignment Details
The files for this assignment include the following:
You should copy these files to get started. They contain wrapper classes
that allow you to perform bit-at-a-time I/O. Wrapper.h is used to wrap
a pointer variable with meaningful comparison semantics.
The program huff.cpp
You must write a program huff.cpp that will compress a file so that
it can be uncompressed by unhuff.cpp. In writing huff.cpp
you'll implement several classes, usually a class will take care of one
part of the Huffman algorithm. These classes will also provide support
for the decompression program unhuff.cpp.
Before getting into particulars of the program, we'll outline the steps
in the Huffman algorithm.
Steps in the Huffman Algorithm
Your implementation of Huffman coding has four principle steps:
-
Count how many times every character occurs in a file. These counts are
used to build weighted nodes that will be leaves in the Huffman tree. Here
we use character to mean a conceptual character, not a C++ char
value, although it might be.
-
From the counts, build the Huffman tree. To do this, first create one node
per character, weighted with the number of times the character occurs,
insert each node into a priority queue. Then choose two minimal nodes,
join these nodes together as children of a newly created node, and insert
the newly created node into the priority queue. The new node is weighted
with the sum of the two minimal nodes taken from the priority queue. Continue
this process until only one node is left, this is the root of the Huffman
tree.
-
By traversing the Huffman tree, create a code for each leaf/character.
The code is formed from the path from the root of the Huffman tree to each
leaf, going left adds a zero to the path, going right adds a one. At each
leaf, the character at the leaf is mapped to the sequence/path of zeros
and ones used to reach the leaf. All characters/encoding bit pairs are
stored in some kind of table or map. Depending on how you store the path
of bits you may need the number of bits for each character as well. The
number of bits may be needed since the bit patterns 00101 and 0101 are
different, but are both represented as numbers by 5. However, if you store
the bit patterns as strings, e.g., "00101" and "0101" then the string's
length can be used to determine the number of encoding bits.
-
Read the input file a second time. For each character read, instead of
writing the character, write the sequence of zero and one bits used to
encode the character. This sequence of bits was determined in the previous
step.
Steps in Implementing huff.cpp
Some steps that may be useful in developing the program are described below.
It's important to develop your program a few steps at a time. At each step,
you should have a functioning program, although it may not do everything
the first time it's run. By developing in stages, you'll find it easier
to isolate bugs and you'll be more likely to get a program working faster.
Do NOT write hundreds of lines of code before compiling and testing.
It's probably a good idea to create several classes. A class can be
responsible for one part of the Huffman compression. For example, a class
could be responsible for creating the initial counts of how many times
each character occurs. You could design, implement, and test the class
in isolation from the rest of the huff.cpp program. Doing this for
each of several classes will help both in developing huff.cpp and
in re-using code in writing the uncompress program unhuff.cpp.
For example, one possibility for a character counting class is shown
below:
CharCounter
cc;
cc.process( "poe.txt" );
for( int k = 0; k < 256; k++ )
{
int occs = cc.count(
k );
if( occs > 0 )
{
cout << char( k ) << " " << occs << endl;
}
}
Here the member function CharCounter::process reads and counts
all the characters in a file. The function CharCounter::count
returns the number of occurrences of a given character. You may decide
that process is not a good identifier. For example, some people
think that readSource is a better name, indicating that the function
reads a source file as opposed to a compressed file. You might use
CharCounter to read compressed-file headers too in unhuff.cpp,
for example. Note that there are 256 different 8-bit values, so you'll
need 256 different counters. (There are 257 possible characters if you
include the pseudo-EOF chararacter described below.)
Do NOT use any variables of type char! You should use int variables
when you think you might need a char everywhere in your program. The only
time you might want to use a char variable is to print for debugging purposes
-- you can cast an int to a printable char as shown in the code fragment
above.
Some other ideas for classes are given below. You may decide to combine
some of these into one class, or you may decide to implement more classes.
These are a start and some suggestions, not requirements.
-
A class to create the Huffman tree used in compression and uncompression.
The tree is created from character counts, so a Huffman-tree creating class
might use a CharCounter object in creating the Huffman tree.
-
A class that represents the table of character and encoding bit pattern
pairs. You could use a map for example to map char values to string values
that represent the sequence of zeros and ones that encode the character,
e.g., that map 'A' to "0101011".
Implementing and Debugging
You will probably find it useful to implement debugging functions in your
classes. These are functions you use in the debugging stages, but not in
the final version of the program. For example, you could implement CharCounter::print
to print all characters and frequencies read from a source file --- this
could use the code fragment shown above with the CharCounter description.
Although you could write such a printing/testing function outside the class,
encapsulating the testing/debugging functions within the class makes the
class more coherent.
Similarly, you may want to implement a function to print the Huffman
tree to help you determine if the tree you've built is correct.
Designing debugging functions as part of the original program will make
the program development go more quickly since you'll be able to verify
that pieces of the program, or certain classes, work properly.
Building in the debugging scaffolding from the start will make it
much easier to test and develop your program. When testing, use small
examples of test files maybe even as simple as "go go gophers" that help
you verify that your program and classes are functioning as intended.
You might want to write encoding bits out first as strings or printable
int values rather than as raw bits of zeros and ones which won't be readable
except to other computer programs. A Compress class, for example,
could support printAscii functions and printBits to print
in human readable or machine readable formats.
Final Stages of Implementation
When you think your huff.cpp program is compressing properly (you'll
probably need to write unhuff.cpp to really verify this) you'll
need to add a few finishing touches to the program to make it more usable
and more robust.
If compressing a file results in a file larger than the file being
compressed (this is always possible) then no compressed file should be
created and a message should be printed indicating that this is the case.
The user should have the option of invoking huff with an argument
-f which forces compression even when the compressed file is bigger.
For example:
teer4> huff myprog.c
no compression yielded, no output file written
teer4> huff -f myprog.c
Here, the second time huff is executed the compression is forced
because of the -f argument. Only the first argument to the program can
be the -f argument.
As shown in the example above, your program will need to process command
line arguments like -f or the name of the file being compressed.
To determine if compression results in a smaller file, you'll need to determine
the number of characters in the original file (your program will compute
this by determining character counts). The size of the compressed file
can be calculated from the same character counts using the size of each
character's encoded number of bits. You must also remember to calculate
the file-header information stored in the compressed program. To be more
precise, if there are 52 A's, and each A requires 4 bits to encode, then
the A's contribute 52*4 = 108 bits to the compressed file. You'll need
to make calculations like this for all characters.
Don't spend time worrying how to process command-line arguments until
you're sure the program works, this isn't the most important part of the
program. You can add command-line parameters after everything works.
The program unhuff.cpp
The program should be robust enough not to crash if it is given a program
to uncompress that wasn't compressed using the corresponding Huffman compression
program. The robustness of the program will be an important criterion in
grading the program. There are a variety of methods that you can use to
ensure this works, but it will probably always be possible to ``fool''
the program (although very unlikely).
One easy way to ensure that compression/decompression work in tandem
is to write a "magic number" at the beginning of a compressed file. This
could be any number of bits that are specifically written as the first
N bits of a huffman-compressed file (for some N). The corresponding
uncompression program first reads N bits, if the bits don't represent
the "magic number" then the compressed file is not properly formed. You
can read/write bits using the classes declared in bitstream.h.
In writing unhuff.cpp you should try to use some of the same
classes that use used in huff.cpp. For example, to uncompress you
must have the same Huffman tree that was used to compress. This tree might
be stored directly in the compressed file, or it might be created from
character counts stored in the compressed file. In either case, a class
to create a Tree could create the tree from either a CharCounter
object or directly from a compressed file. Once the tree object can be
used once it's created.
Coding and Algorithmic Details
The number of characters counted is at most 257. Although only 256 values
can be represented by 8 bits, one character is used as the pseudo-EOF character.
Every time a file is compressed the count of the the number of times
that pseudo-EOF occurs should be one --- this should be done explicitly
in the code that determines frequency counts. In other words, a pseudo-char
EOF with number of occurrences (count) of 1 must be explicitly created.
When you open files, you should assume they are binary files, not text
files. Modes that will work are
static const int READ_MODE = ios::in
| ios::binary;
static const int WRITE_MODE = ios::out | ios::binary;
Pseudo-EOF character
The operating system will buffer output, that is output to disk actually
occurs when some internal buffer is full. In particular, it is not possible
to write just one single bit to a file, all output is actually done in
"chunks", e.g., it might be done in eight-bit chunks. In any case, when
you write 3 bits, then 2 bits, then 10 bits, all the bits are eventually
written, but you can't be sure precisely when they're written during the
execution of your program. Also, because of buffering, if all output is
done in eight-bit chunks and your program writes exactly 61 bits explicitly,
then 3 extra bits will be written so that the number of bits written is
a multiple of eight. Because of the potential for the existence of these
"extra" bits when reading one bit at a time, you cannot simply read bits
until there are no more left since your program might then read the extra
bits written due to buffering. This means that when reading a compressed
file, you'll use a pseudo-EOF character and write a loop that stops when
the pseudo-EOF character is read in (in compressed form). The
code below is pseudo-code for reading a compressed file.
int bit;
while( ( bit = in.readBit( ) ) != EOF ) ) // Should never actually hit EOF this way
{
// use the zero/one value of the bit read
// to traverse Huffman coding tree
// if a leaf is reached, decode the character and print UNLESS
// if the character is pseudo-EOF, then decompression done
// loop to read more bits
}
When a compressed file is written the last bits written should be the bits
that correspond to the pseudo-EOF char. You'll have to write these bits
explicity. These bits will be recognized by the program unhuff.cpp
and used in the decompression process. In particular, when using unhuff
a well-formed compressed file will be terminated with the encoded form
of the pseudo-EOF char (see code above). This means that your decompression
program will never actually run out of bits if it's processing a properly
compressed file (think about this). In other words, when decompressing
(unhuffing) you'll read bits, traverse a tree, and eventually find a leaf-node
representing some character. When the pseudo-EOF leaf is found, the program
can terminate because all decompression is done. If reading a bit fails
because there are no more bits (the bitreading function returns false)
the compressed file is not well-formed.
Compressed Header
For decompression to work with Huffman coding, information must be stored
in the compressed file that allows the Huffman tree to be re-created so
that decompression can take place. There are many options here. You can
store all codes and lengths as normal (32 bit) C/C++ ints or you can try
to be inventive and save space. For example, it's possible to store just
character counts and recreate the codes from the counts. It's also possible
to store code-lengths and codes using bit-at-a-time operations. Any solution
to storing information in the compressed file is acceptable. If you use
a successful space-saving technique you can earn several points of extra
credit (up to 5). Space-saving techniques are defined as those using less
space than simply storing 257 counts as 32 bit ints. One useful technique
is to write the tree to the file.
You can use a 0 or 1 bit to differentiate between internal
nodes and leaves, for example.
Reading Files Twice
Finally, because Huffman coding requires two passes over the input file
you will need to rewind the underlying stream.
The effect of rewinding the
input stream is to allow it to be ``read again''.
You can do so as follows:
in.clear( );
in.seekg( 0, ios::beg ); // Rewind the stream
Priority Queues
You should use the STL priority_queue class.Assuming that you
store a tree node in an object of type HuffNode, then a respectable
declaration would be
priority_queue<Pointer<HuffNode>, vector<Pointer<HuffNode> >, less<Pointer<HuffNode> > > pq;
Note two things:
-
The STL priority queue gives the max, not the min.
-
You need a comparison function; pointers don't compare well, but if you
wrap the HuffNode in a Pointer object, then all you have
to do is define operator< for HuffNode. Note that
because of point #1 above, you will want to reverse the meaning of operator<.
Creating a Table from a Huffman-tree
To create a table or map of coded bit values for each character you'll
need to traverse the Huffman tree (e.g., inorder, preorder, etc.) making
an entry in the table each time you reach a leaf. For example, if you reach
a leaf that stores the character 'C', following a path left-left-right-right-left,
then an entry in the 'C'-th location of the table should be set to 00110.
You'll need to make a decision about how to store the bit patterns the
table. One method is based on storing strings. This makes it easy to add
a character (using +) to a string during tree traversal and makes it possible
to store just one entry in the table: the string. An alternative is to
use a vector<int>; this makes it easy to interact with writeBits
(see next section)
Using bitstream.h and bitstream.cpp
In order to read and write in a bit-at-a-time manner, the file bitstream.cpp
and the corresponding header file bitstream.h must be used.
Two classes are specified for reading/writing bits-at-a-time: ibstream
and obstream, respectively. These classes wrap a corresponding
stream, and then provide readBit and writeBit, respectively.
Bit read/write subprograms
Here is an example that reads the bits in a file and outputs Z for zero
and O for one. Needless to say, it is short on error checks.
ifstream fin( "input.txt" );
ibstream in( fin );
int bit;
while( ( bit = in.readbit( ) ) != EOF )
{
cout << ( bit == 1 ) ? 'O' : 'Z';
}
Similarly, writebit can be called to write a single bit at a time.
Or you can use writeBits, sending to it a vector of 0s
and 1s.
When using writebits to write a specified number of bits, some
bits may not be written because of some buffering that takes place. To
ensure that all bits are written, the last bits must be explicitly flushed.
The function flush is called automatically if your program exits
properly when the obstream destructor is called. You should not need to
call it explicitly.
Command-line Arguments
When a C/C++ program is invoked, arguments are often given as command-line
arguments/parameters, i.e., options entered on the same line as the
program. To run a program from the command line, bring up an MS/DOS window.
You should find the executable in the DEBUG directory of your project.
int main( int argc, char *argv[] )
{
if( argc == 1 )
{
cout << "program
has NO command-line arguments" << endl;
cout << "name
of program is " << argv[ 0 ] << endl;
}
else
{
cout <<
"program " << argv[ 0 ] << " has arguments:" << endl;
for( int k
= 1; k < argc; k++ )
{
cout << "arg: " << k << " = " << argv[ k ] <<
endl;
}
}
return 0;
}
All programs have at least one command-line argument, the name of
the program being run. This is stored as the first entry in the array
argv (the first entry has index zero). The array argv is
an array of c-style strings. These strings are just pointers to characters,
with a special NUL-character '\0' to signify the last character in a C-style
string. You do NOT need to know this to manipulate these char *, C-style
strings. The easiest thing to do is to assign each element of argv
to a C++ string variable. Then you can use "standard" C++ string functions
to manipulate the values, e.g., you can call length(), you can
use substr(), you can concatenate strings with +, etc. None of
these operations work with C-style, char * strings. Assign each element
of argv to a C++ string variable for processing.
What to submit
Submit this program with all source code, sample output, complete documentation,
and a floppy.
Your documentation must explain how to use the program, what options
are available, and what error checks are performed. I also want a short
explanation of your basic design, and an explanation of who is in the group,
and who did what.
Include an explanation as to how information is stored in compressed
files so that uncompression works. Include information about how much time
you spent on different parts of the assignment, what was frustrating or
enjoyable about the assignment and a list of collaborators with whom you
worked (if any). Be sure to submit any other files that are needed by your
program including the bitstream files ONLY if you modified them in some
way.
Grading
The programs are worth 100 points.
Huffman Coding Grading Standards
description |
points |
compression of any text file |
30 points |
compression of any file (including binary files) |
10 points |
decompression |
15 points |
user-interface (how easy and intuitive program is to use) |
10 points |
robustness (does unhuff program crash on non-huffed files?) |
10 points |
program style (class design/implementation, program design) |
25 points |