Huffman coding is a compression approach based on character frequencies. To code a string, we work out the frequency of each letter in the string and then build a tree where we put the letters in the leaves and structure the tree such that the most frequent letters are closest to the root. We then encode each letter by the path we must take in that tree to get to the leaf that contain that letter. We need one bit for each edge we follow; zero, say, if we go left and one if we go right.

In Haskell: the Craft of Functional Programming, there is an example of implementing Huffman coding in Haskell. In this vignette, I will adapt that to R with pmatch patterns. I have structured the vignette such that I switch between showing the original Haskell code and then the corresponding R code. If you are not familiar with Haskell, I think you should still be able to follow along.

Data types

The data structures used in the Haskell implementation are these:

data Tree = Leaf Char Int | Node Int Tree Tree
data Bit = L | R deriving (Eq, Show)
type HCode = [ Bit ]
type Table = [ (Char, HCode) ]

The first two are data types created from constructors, and we can define the corresponding R data structures using a := declaration:

The other two, HCode and Table are lists over Bit and pairs of characters and HCode, respectively. We won’t define types for these. In R, we don’t have static types so we don’t really name types. We need a list representation, however, and to make one that matches lists in Haskell, we define a linked list:

On occasions, we have to create singleton lists, and rather than writing CONS(val,NIL) for those, I will define a function for it. I will define a type to handle pairs in pattern matching.

Computing frequencies

Before we can encode a string we need to calculate the frequency of the characters in it. In the Haskell implementation this is done using two merge sorts, one that sorts the string characters alphabetically and count the number of occurrences of each and then another sort that orders the characters according to their frequency. Later on, we will build a tree from a list of characters, and by having the least frequent characters at the front of the list, and building from that end, we will propagate the less frequent characters to the bottom of the tree and the more frequent characters to the top.

The Haskell mergeSort function is implemented like this:

This function takes another function, merge, as a parameter. It is via this function we will implement the two different sort functions we need.

We don’t have where blocks in R, so we need to compute first, second and half before the recursion, but otherwise the the R implementation follows the Haskell implementation. We need some list functions first, though, for length, take and drop. We can implement them like this:

The llrev function, for reversing a linked list, is a helper function that lets us write tail-recursive take and drop functions. It isn’t terribly important here since we do not use any tail-recursion optimisation, but once you get into the habit of writing your functions tail recurse it is hard not to. Later on in the vignette I will use simpler functions rather than tail-recursive ones to make the examples easier to follow.

Anyway, with the list functions in place, the merge sort can be implemented like this:

Sorting alphabetically and counting characters

To sort characters alphabetically and at the same time count the number of occurrences for each, we need a merge function. The Haskell function looks like this:

We do not have conditions like | (p == q) in R, so we handle those using if-else-statements, but otherwise we translate the function statement by statement into this:

To test the function I want to make it a little easier to build linked lists, so I have written this helper function:

Ok, let us see if we can merge two lists:

Yup, that went okay.

What about sorting? First I want a function for concatenating two linked lists, and then I will test it on the concatenation of xs and ys from above.

Concatenation can look like this:

(Here, I did choose a simple version over a tail-recursive one).

Onwards to the test:

Yup, that works.

Computing frequencies

To compute a list of character frequencies for a string, we need to combine the sort and merge functions. In Haskell, the frequency function looks like this:

We see that we need another list function, map. I have implemented it this way:

I want to work on lists of characters, so as a helper function I have written string_to_list that translates a string into a list containing the characters in the string:

For composing functions, we can use magrittr pipelines. If we do, we just have to remember that the functions are evaluated from left-to-right, whereas in the Haskell code, they are evaluated right-to-left. So

becomes

If we call frequency on a string, we get a list of pairs of characters and integers. We should interpret that as a list of characters and the number of occurrences each have in the string.

Building trees

From a frequency list we need to build tree. This is the Haskell code:

The uncurl Leaf stuff is a technicality needed in Haskell. To translate a list of character-count pairs into a list of trees, we simply do this:

The makeCodes function looks like this in Haskell:

and we can translate it almost mechanically:

The same goes for amalgamate:

We can see that we need functions pair and ins_tree for these functions. The pair function shouldn’t be confused with the PAIR constructor—one of the reasons I put the latter in all uppercase. Other than that, the entire translation of the Haskell code is just done statement by statement:

The makeTree function was defined as a composition:

We do the same using a pipeline:

I have written some code for plotting trees, in the function plot_tree. I don’t show it here, it is a simple translation of the data structure into a table I can plot using tidy graph and ggraph, but we can use it to display trees:

#> 
#> Attaching package: 'tidygraph'
#> The following object is masked from 'package:stats':
#> 
#>     filter
#> Loading required package: ggplot2
"foobarbaz" %>% frequency() %>% make_tree() %>% plot_tree()

Code tables

The result of an encoding will be a sequence of L and R bits. We will represent these in linked lists, so to make it simpler to display these I have written these helper functions:

I do not simply set the class of a list to "code". I could, and most of the time that wouldn’t be a problem. However, in the specification of linked lists we required that cdr should be of type llist, so if we want to concatenate two codes—and I can reveal that we do later—then we need codes to have that type. I therefore make codes specialisations of llist with c("code", class(code)).

The encoding of a string will be implemented as follows: for each character in the string, we look up the code for that character, i.e. the sequence of left and right edges we need to reach the character in a tree, and we concatenate all these. In the implementation of the encoding we want a table that maps characters to their code, and in Haskell that table is computed using this function:

Since we pattern match on the tree and not the code, we will flip the order of the arguments. We can implement it in R like this:

I have added a class to tables to make them easier to print.

Coding and decoding

Now, we have everything we need in order to encode and decode strings/codes. Both encoding and decoding requires a tree, so we define a function for building a tree from a string:

codes <- . %>% frequency() %>% make_tree()

As I have already explained, encoding combines looking up the code for individual characters and concatenating all of them. The lookup is implemented in Haskell like this:

And in R like this:

The encoding combines that with concatenation. We map over the characters and look up their code. We then concatenate the result.

Our llconcat function only works on two linked lists, however, so we need to handle concatenation slightly differently. I have handled it by writing a reduce function that I can use to concatenate lists pairwise:

The code_message is very similar to the Haskell version, but I have added a translation from strings to lists before the main code (this isn’t needed in Haskell because strings are already lists of characters) and I set the class of the result to make it easier to print coded messages.

Decoding works by following paths down the tree until we get to a character, which we then add to the list as the decoding of a part of the code, and doing this as long as we have path-bits in our encoded message.

For this function, the R code closely follows the Haskell code, except that we have to put the decode_by_tree function at the beginning of the function instead of in a where block:

That’s it. Huffman coding in R.