The Mathematics of Communication



3. Optimal Codes

Let us start with the assumption, natural enough, that the base-n code is as good as possible for representing messages in which each position holds one of n equally likely characters. Each of these characters, occurring with probability 1/n, costs log_2 n bits.

The next step is to conclude that a character occurring with probability p should cost log_2(1/p) = -log_2 p bits, no more and no less, irrespective of the probabilities of the other characters in the alphabet. This criterion defines an optimal code. In practice it may be unsatisfiable, since each code word must use an integral number of bits.
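
To see the difficulty concretely, here is a minimal Python sketch (the probabilities chosen are illustrative, not from the text) printing the ideal cost -log_2 p for a few values of p; only when p is a power of 1/2, as for p = 1/2 below, does the cost come out to a whole number of bits.

    from math import log2

    # Ideal cost in bits of a character occurring with probability p: -log2(p).
    # The probabilities below are illustrative only.
    for p in (1/2, 1/3, 0.1, 0.9):
        print(f"p = {p:.3f}   ideal cost = {-log2(p):.3f} bits")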

The average cost per character of such a code can be computed by observing that out of N characters, about pN will be occurrences of a given character of probability p, each costing -log_2 p bits; dividing the total cost by N, we find that for m characters with probabilities p_1, p_2, ..., p_m the average cost per character is

-p_1 log_2 p_1 - p_2 log_2 p_2 - ... - p_m log_2 p_m.
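
This formula translates directly into a few lines of code. The following Python sketch (the function name and the example alphabet are mine, not the article's) computes -Σ p_i log_2 p_i for a list of probabilities:

    from math import log2

    def average_cost(probs):
        # Average cost per character in bits: -sum of p * log2(p).
        return -sum(p * log2(p) for p in probs if p > 0)

    # Sanity check: four equally likely characters should cost log2(4) = 2 bits each.
    print(average_cost([0.25, 0.25, 0.25, 0.25]))   # prints 2.0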

For English text the relative frequencies of letters have been tabulated. The values here are taken from Cryptography: Theory and Practice by D. R. Stinson (CRC Press, Boca Raton, Florida, 1996).

A .082   F .022   K .008   P .019   U .028   Z .001
B .015   G .020   L .040   Q .001   V .010
C .028   H .061   M .024   R .060   W .023
D .043   I .070   N .067   S .063   X .001
E .127   J .002   O .075   T .091   Y .020

These data are not exactly right for us because they do not count _ (space) as a character. Feeding them into the average cost formula gives 4.18 bits/character. Correcting for _ with probability 1/6 (taking the average word length to be 5, so that one character in six is a space, and scaling the letter frequencies by 5/6) gives 4.13 bits/character. This estimate of the cost per character in an optimal code should be compared with the 4.69 bits/character we measured for our sample of Morse Code.
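
Plugging the tabulated frequencies into the average cost formula reproduces both figures; the sketch below (variable names are mine) repeats the average_cost function from above and then applies the space correction.

    from math import log2

    def average_cost(probs):
        return -sum(p * log2(p) for p in probs if p > 0)

    # Letter frequencies from the table above (Stinson), in order A through Z.
    english = [
        .082, .015, .028, .043, .127, .022, .020, .061, .070, .002,  # A-J
        .008, .040, .024, .067, .075, .019, .001, .060, .063, .091,  # K-T
        .028, .010, .023, .001, .020, .001,                          # U-Z
    ]

    print(average_cost(english))                              # about 4.18 bits/character
    # Space gets probability 1/6; the letters share the remaining 5/6.
    print(average_cost([p * 5/6 for p in english] + [1/6]))   # about 4.13 bits/character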

Information and Entropy. The quantity H = -p_1 log_2 p_1 - p_2 log_2 p_2 - ... - p_m log_2 p_m was introduced in this context by Claude Shannon in A Mathematical Theory of Communication (Bell System Technical Journal, July 1948). He called it the "entropy" of the set of probabilities p_1, p_2, ..., p_m, in analogy with its interpretation in statistical mechanics. But he thought of it as "a quantity which will measure, in some sense, how much information is produced by ... a process, or better, at what rate information is produced." It is now usually called the information per character. It is an extremely useful measure in determining, for example, how many messages a given communications channel can carry. But the word "information" has led to some confusion since, as Bar-Hillel puts it, "it turned out to be humanly impossible not to believe that one has got some hold of this important and doubtless rather difficult concept on the basis of such a simple procedure as, say, counting frequencies of letter occurrences in English." (Language and Information, Addison-Wesley, 1964, p. 285).

One last note: in the expression H = -Σ p_i log_2 p_i it is not necessary for all the p_i to be nonzero. Since lim_{x→0} x log_2 x = 0 (by L'Hôpital's rule), we can substitute 0 for any term involving a zero probability without any loss of continuity.
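
For the record, the L'Hôpital computation behind that last remark (a standard calculus step not spelled out in the original; the limit is taken from the right, since log_2 x is defined only for x > 0) runs:

    lim_{x→0+} x log_2 x = lim_{x→0+} (log_2 x)/(1/x)
                         = lim_{x→0+} (1/(x ln 2)) / (-1/x^2)
                         = lim_{x→0+} (-x/ln 2)
                         = 0.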