
Huffman Code (C++)

This is an implementation of the Huffman code algorithm, in the form of an encoder class (HuffmanEncoder) and a decoder class (HuffmanDecoder), based on the presentation of Huffman codes in Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein, Introduction to Algorithms, 2nd ed. (Cambridge, MA: MIT Press, 2001), 385-393.

A Huffman code is an optimally efficient encoding of an alphabet into (binary) numbers based on the frequency with which each character occurs in the given text or collection of texts. Therefore, each particular input may have its own distinct Huffman code, since the frequency of characters may vary across texts. The algorithm provides that the most commonly occurring characters will have the shortest binary representations. The size in bits of the encodings for various characters varies: for instance, if e is the most common character, it will likely be represented by the number 0, which requires only one bit, whereas some other character may have the representation 5, or 101 in binary notation, which requires three bits. The Huffman algorithm guarantees that no character's binary representation is a prefix--that is, the starting bits--of any other character's representation, so that decoding can tell the character encodings apart despite their varying lengths in bits.
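As a small, hypothetical illustration: suppose a text uses only the characters e, t, and a, with e occurring most often. One possible Huffman code (the exact assignments depend on the frequencies and on how ties are broken) is e = 0, t = 10, a = 11. Because no code is the prefix of another, a bit string such as 01110 can be read in only one way--0 (e), 11 (a), 10 (t)--giving "eat", using five bits instead of the twenty-four that ASCII would need.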

In general, Huffman encodings take up far less space in memory or file storage than standard fixed-length representations of characters such as ASCII, which uses a byte (8 bits) for every character regardless of its frequency. For this reason, Huffman coding is a form of data compression.

To create a Huffman code for a given text, the algorithm goes through the following steps:

1. Count how often each character occurs in the text.
2. Create a leaf node for each character, weighted by its frequency, and place all the leaves in a min-priority queue keyed on frequency.
3. Repeatedly remove the two nodes with the lowest frequencies, make them the two children of a new internal node whose frequency is the sum of theirs, and insert that new node back into the queue. When only one node remains, it is the root of the Huffman tree.
4. Read off the code for each character: starting from the root, record a 0 for each left branch and a 1 for each right branch on the path down to that character's leaf.
5. Encode the text by replacing each character with its binary code, written out bit by bit.
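The following is a minimal sketch of the tree-building steps (2 and 3), assuming a simple pointer-based node type; the names here are illustrative and are not necessarily those used by the HuffmanEncoder class:

    #include <map>
    #include <queue>
    #include <vector>

    // Illustrative node type for this sketch; leaves carry a character,
    // internal nodes carry only a combined frequency.
    struct Node {
        char symbol;
        long frequency;
        Node* left;
        Node* right;
        Node(char s, long f, Node* l = nullptr, Node* r = nullptr)
            : symbol(s), frequency(f), left(l), right(r) {}
    };

    // Order nodes so that the priority queue pops the lowest frequency first.
    struct ByFrequency {
        bool operator()(const Node* a, const Node* b) const {
            return a->frequency > b->frequency;
        }
    };

    Node* buildTree(const std::map<char, long>& counts) {
        std::priority_queue<Node*, std::vector<Node*>, ByFrequency> queue;
        for (const auto& entry : counts)
            queue.push(new Node(entry.first, entry.second));

        // Merge the two least frequent nodes until a single root remains.
        while (queue.size() > 1) {
            Node* left = queue.top();  queue.pop();
            Node* right = queue.top(); queue.pop();
            queue.push(new Node('\0', left->frequency + right->frequency, left, right));
        }
        return queue.empty() ? nullptr : queue.top();  // node cleanup omitted for brevity
    }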

Although the last step is conceptually simple, it is somewhat challenging to accomplish on actual computing equipment, because computers are designed to deal with data in units of at least 8 bits (one byte), not with individual bits. So the implementation must further translate the variable-length codes for the characters into a form that the computer can store, read, and manipulate.
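One common approach, sketched below on the assumption that each character's code has already been expanded into a string of '0' and '1' characters (the HuffmanEncoder class need not do it exactly this way), is to accumulate bits in a one-byte buffer and write the buffer out each time it fills:

    #include <cstdint>
    #include <string>
    #include <vector>

    // Pack a stream of '0'/'1' characters into bytes, most significant bit first.
    // The caller must record the number of meaningful bits separately (for example,
    // in a file header), because the last byte is padded with zeros.
    std::vector<std::uint8_t> packBits(const std::string& bits) {
        std::vector<std::uint8_t> bytes;
        std::uint8_t buffer = 0;
        int filled = 0;
        for (char bit : bits) {
            buffer = static_cast<std::uint8_t>((buffer << 1) | (bit == '1'));
            if (++filled == 8) {          // the buffer is full: emit one byte
                bytes.push_back(buffer);
                buffer = 0;
                filled = 0;
            }
        }
        if (filled > 0)                   // pad the final partial byte with zeros
            bytes.push_back(static_cast<std::uint8_t>(buffer << (8 - filled)));
        return bytes;
    }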

In order to decode the Huffman code, the decoder algorithm must take the following steps:

1. Read the stored representation of the Huffman tree from the beginning of the encoded file and rebuild the tree.
2. Read the encoded text back out of the file, unpacking each stored byte into its individual bits.
3. Starting at the root of the tree, follow the left branch for each 0 bit and the right branch for each 1 bit; on reaching a leaf, output the character stored there and return to the root.
4. Repeat until the whole original text has been recovered.
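The bit-by-bit tree walk (step 3) is the heart of the decoder. Here is a minimal sketch, reusing the illustrative Node type from the encoder sketch above and assuming the padding bits have already been stripped:

    #include <string>

    // Follow one tree branch per bit: 0 goes left, 1 goes right.
    // Each time a leaf is reached, emit its character and restart at the root.
    // (Assumes the tree holds at least two distinct characters.)
    std::string decodeBits(const Node* root, const std::string& bits) {
        std::string text;
        const Node* current = root;
        for (char bit : bits) {
            current = (bit == '0') ? current->left : current->right;
            if (current->left == nullptr && current->right == nullptr) {
                text += current->symbol;
                current = root;
            }
        }
        return text;
    }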

Thus, to enable the decoder to do its work, the encoder must record in a file not just the binary encoding of the original text but also some representation of the Huffman tree that it built in order to develop the code. This presents another challenge for real computers. Normally, we build a "tree" representation using a data structure for each node and the memory addresses of the child nodes. When you traverse a tree, you follow the address of each child to find it and then find its children in turn. But the addresses of the nodes depend on memory allocation during a particular run of the program--we cannot assume that they will remain the same from one run to another. So if we tried to save the tree to a file by recording those addresses, they would likely be invalid on the next run of the program, and the program would fail.

One common solution to this problem is to use a library that provides serialization and deserialization. The current implementation takes a simpler approach: it stores the nodes in an array and uses the nodes' indices in the array, rather than their addresses, to refer to children within their parents. Each node has an index corresponding to where it is stored in the array, and each internal node holds the indices of its two children, left and right. The encoder saves the whole array of nodes at the beginning of the output file--no memory addresses are included. The decoder can then read the array back out of the file; the relations among the nodes, expressed by the indices, are still valid.
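The idea is sketched below with illustrative names (not necessarily the fields of the actual classes): a node refers to its children by their positions in the array, so the same numbers mean the same thing in every run of the program.

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // A node that names its children by array index rather than by memory address.
    // An index of -1 means "no child", so leaves have -1 on both sides.
    struct StoredNode {
        char symbol;      // meaningful only for leaves
        long frequency;
        int  left;        // index of the left child in the node array, or -1
        int  right;       // index of the right child in the node array, or -1
    };

    // Write the whole node array at the start of the output file.  No memory
    // addresses are involved, so the data stays valid across runs of the program.
    // (This simple binary dump assumes the same machine writes and reads the file.)
    bool saveTree(const std::vector<StoredNode>& nodes, std::FILE* out) {
        std::size_t count = nodes.size();
        return std::fwrite(&count, sizeof(count), 1, out) == 1
            && std::fwrite(nodes.data(), sizeof(StoredNode), count, out) == count;
    }

    // Read the node array back; the child indices stored in each node still hold.
    std::vector<StoredNode> loadTree(std::FILE* in) {
        std::size_t count = 0;
        if (std::fread(&count, sizeof(count), 1, in) != 1)
            return std::vector<StoredNode>();
        std::vector<StoredNode> nodes(count);
        if (std::fread(nodes.data(), sizeof(StoredNode), count, in) != count)
            nodes.clear();
        return nodes;
    }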

To test and explore this Huffman code implementation for yourself, download the compressed archive file from the link below and decompress it (tar -zxvf HuffmanCode.tar.gz). Then run make from the command line--assuming that you have a C++ compiler available.

Huffman Code implementation