The Mathematics of Communication



4. Constructing a most optimal code

Here are Stinson's data modified to take account of _ (space) with probability .166.

A .068 F .018 K .007 P .015 U .023 Z .0008
B .0125 G .016 L .033 Q .0008 V .008 .166
C .023 H .051 M .020 R .05 W .019  
D .043 I .070 N .067 S .063 X .001  
E .106 J .0016 O .0625 T .075 Y .016   

There is an algorithm for constructing a binary code for a set of characters with different probabilities, with cost per character as close as possible to the theoretical minimum - p1log2p1 - p2log2p2 - ... - pmlog2pm. The minimum can only be exactly attained if all the pi are exactly powers of 1/2.

Here is the process illustrated for English, with the probabilities from the table above (optimum=4.13).

              _ 000
          _E <
        /     E 001
 _ETAO <          T 010
<        TAO-----<   A 0110
 \                AO<
  \                  O 0111
   \               I 1000
    \           IN<
     \         /   N 1001
      \ INSHR-<         S 1010
       <       SHR-----<   H 10110
        \               HR<
         \                 R 10111
          \             D 11000
           \         DL<
            \       /   L 11001
             \ DLUC<         U 11010
              <     UC------<
               \             C 11011
                \      M 11100
                 \ MWF<   W 111010
                  <    WF<
                   \      F 111011
                    \         Y 111100
                     \ YG----<
                      <       G 111101
                       \ P 111110
                        < 
                         \ B 1111110
                          <
                           \ V 11111110
                            <
                             \ K 111111110
                              <     J 11111111100
                               \ JX< 
                                <   X 11111111101
                                 \     Q 11111111110                             
                                  QZ--< 
                                       Z 11111111111

The cost for this code is 4.167 bits/character.