The Mathematics of Communication



3. Optimal Codes

Let us start with the assumption, natural enough, that the base-n code is as good as possible for representing messages in which each position holds one of n equally likely characters. Each of these characters, occurring with probability 1/n, costs log_2 n bits.

The next step is to conclude that a character occurring with probability p should cost log_2(1/p) = -log_2 p bits, no more and no less, irrespective of the probabilities of the other characters in the alphabet. This criterion defines an optimal code. In practice it may be unsatisfiable, since each code word must use an integral number of bits.
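
To see the difficulty concretely, here is a minimal Python sketch (the probabilities chosen are illustrative, not from the text) printing the ideal cost -log_2 p for a few values of p; only when p is a power of 1/2, as for p = 1/2 below, does the cost come out to a whole number of bits.

    from math import log2

    # Ideal cost in bits of a character occurring with probability p: -log2(p).
    # The probabilities below are illustrative only.
    for p in (1/2, 1/3, 0.1, 0.9):
        print(f"p = {p:.3f}   ideal cost = {-log2(p):.3f} bits")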

The average cost per character of such a code can be computed by observing that out of N characters, about pN will be occurrences of a given character of probability p, each costing -log_2 p bits; dividing the total cost by N, we find that for m characters with probabilities p_1, p_2, ..., p_m the average cost per character is

-p_1 log_2 p_1 - p_2 log_2 p_2 - ... - p_m log_2 p_m.
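
This formula translates directly into a few lines of code. The following Python sketch (the function name and the example alphabet are mine, not the article's) computes -Σ p_i log_2 p_i for a list of probabilities:

    from math import log2

    def average_cost(probs):
        # Average cost per character in bits: -sum of p * log2(p).
        return -sum(p * log2(p) for p in probs if p > 0)

    # Sanity check: four equally likely characters should cost log2(4) = 2 bits each.
    print(average_cost([0.25, 0.25, 0.25, 0.25]))   # prints 2.0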

For English text the relative frequencies of letters have been tabulated. The values here are taken from Cryptography: Theory and Practice by D. R. Stinson (CRC Press, Boca Raton, Florida, 1996).

A .082   F .022   K .008   P .019   U .028   Z .001
B .015   G .020   L .040   Q .001   V .010
C .028   H .061   M .024   R .060   W .023
D .043   I .070   N .067   S .063   X .001
E .127   J .002   O .075   T .091   Y .020

These data are not exactly right for us because they do not count _ (space) as a character. Feeding them into the average cost formula gives 4.18 bits/character. Correcting for _ with probability 1/6 (taking the average word length to be 5, so that one character in six is a space, and scaling the letter frequencies by 5/6) gives 4.13 bits/character. This estimate of the cost per character in an optimal code should be compared with the 4.69 bits/character we measured for our sample of Morse Code.
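
Plugging the tabulated frequencies into the average cost formula reproduces both figures; the sketch below (variable names are mine) repeats the average_cost function from above and then applies the space correction.

    from math import log2

    def average_cost(probs):
        return -sum(p * log2(p) for p in probs if p > 0)

    # Letter frequencies from the table above (Stinson), in order A through Z.
    english = [
        .082, .015, .028, .043, .127, .022, .020, .061, .070, .002,  # A-J
        .008, .040, .024, .067, .075, .019, .001, .060, .063, .091,  # K-T
        .028, .010, .023, .001, .020, .001,                          # U-Z
    ]

    print(average_cost(english))                              # about 4.18 bits/character
    # Space gets probability 1/6; the letters share the remaining 5/6.
    print(average_cost([p * 5/6 for p in english] + [1/6]))   # about 4.13 bits/character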

Information and Entropy. The quantity H = -p_1 log_2 p_1 - p_2 log_2 p_2 - ... - p_m log_2 p_m was introduced in this context by Claude Shannon in A Mathematical Theory of Communication (Bell System Technical Journal, July 1948). He called it the "entropy" of the set of probabilities p_1, p_2, ..., p_m, in analogy with its interpretation in statistical mechanics. But he thought of it as "a quantity which will measure, in some sense, how much information is produced by ... a process, or better, at what rate information is produced." It is now usually called the information per character. It is an extremely useful measure in determining, for example, how many messages a given communications channel can carry. But the word "information" has led to some confusion since, as Bar-Hillel puts it, "it turned out to be humanly impossible not to believe that one has got some hold of this important and doubtless rather difficult concept on the basis of such a simple procedure as, say, counting frequencies of letter occurrences in English." (Language and Information, Addison-Wesley, 1964, p. 285).

One last note: in the expression H = -Σ p_i log_2 p_i it is not necessary for all the p_i to be nonzero. Since lim_{x→0} x log_2 x = 0 (by L'Hôpital's rule), we can substitute 0 for any term involving a zero probability without any loss of continuity.
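
For the record, the L'Hôpital computation behind that last remark (a standard calculus step not spelled out in the original; the limit is taken from the right, since log_2 x is defined only for x > 0) runs:

    lim_{x→0+} x log_2 x = lim_{x→0+} (log_2 x)/(1/x)
                         = lim_{x→0+} (1/(x ln 2)) / (-1/x^2)
                         = lim_{x→0+} (-x/ln 2)
                         = 0.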