The Mathematics of Communication



2. Encoding English text

Considering the encoding of numbers in various bases led us to the conclusion that one digit in base n is worth log₂ n bits.

This analysis can be applied to the recording and transmission of messages other than numbers. Suppose the message is a text in English:

THE_QUICK_BROWN_FOX_JUMPED_OVER_THE_LAZY_DOGS_ _

We can treat this message like a number written in base 27 (the letters A-Z plus _, the symbol for a space). Each character would then be counted as log₂ 27 ≈ 4.75 bits.
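
As a rough illustration of this base-27 accounting, the sketch below reads the sample sentence as one base-27 integer and compares its size with the per-character estimate. The digit assignment (A=0, ..., Z=25, _=26) and the function name are assumptions made for the example; the text only says that the 27 symbols play the role of the numbers 0 to 26.

```python
import math

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ_"  # 27 symbols; "_" stands for a space

def as_base27_number(message: str) -> int:
    """Read the message as one integer in base 27 (assumed: A=0, ..., Z=25, _=26)."""
    value = 0
    for ch in message:
        # str.index raises ValueError for any symbol outside the alphabet
        value = value * 27 + ALPHABET.index(ch)
    return value

# The sample sentence, with its trailing "_ _" written as two consecutive spaces.
message = "THE_QUICK_BROWN_FOX_JUMPED_OVER_THE_LAZY_DOGS__"

bits_per_char = math.log2(27)
total_bits = len(message) * bits_per_char
number = as_base27_number(message)

print(f"{bits_per_char:.2f} bits per character")
print(f"{len(message)} characters -> {total_bits:.1f} bits")
print(f"written in binary, the same number needs {number.bit_length()} bits")
```

The binary length of the integer can never exceed the per-character estimate rounded up, since the number is smaller than 27 raised to the message length.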

But there is a difference between the English alphabet and the numbers 0 - 26 that occur in base-27 arithmetic. The assumption we made, that in each place all digits are equally likely, is unrealistic for English where, for example, the letter E occurs more often than others.
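
To see how far a real text is from the equal-likelihood assumption, here is a small sketch that tallies symbol frequencies in the Cherry sentence quoted later in this section (its "_ _" pairs are written as two consecutive spaces). One short sentence is only an illustration, not a measurement of English as a whole.

```python
from collections import Counter

text = (
    "WE_ARE_NOT_CONCERNED_WITH_THE_MEANING_OR_TRUTH_OF_MESSAGES__"
    "SEMANTICS_LIES_OUTSIDE_THE_SCOPE_OF_MATHEMATICAL_INFORMATION_THEORY__"
)

counts = Counter(text)
uniform_share = 1 / 27  # share each symbol would have if all 27 were equally likely

# Show the five most frequent symbols next to the uniform share.
for symbol, n in counts.most_common(5):
    print(f"{symbol}: {n:3d}  share {n / len(text):.3f}  (uniform: {uniform_share:.3f})")

# Letters of the alphabet that never occur in this sentence.
missing = sorted(set("ABCDEFGHIJKLMNOPQRSTUVWXYZ") - set(text))
print("letters not used:", "".join(missing))
```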

Morse Code is a way of representing letters by combinations of dots and dashes. Writing 1 for dot, 2 for dash and 0 for space, Morse Code looks like this:

A 012      H 01111    O 0222     V 01112
B 02111    I 011      P 01221    W 0122
C 02121    J 01222    Q 02212    X 02112
D 0211     K 0212     R 0121     Y 02122
E 01       L 01211    S 0111     Z 02211
F 01121    M 022      T 02       _ 0
G 0221     N 021      U 0112

Here is a message and its Morse Code (message from E. C. Cherry's The Communication of Information, as quoted in Y. Bar-Hillel, Language and Information, Addison-Wesley 1964, p. 222):

WE_ARE_NOT_
0122010 0120121010 0210222020

CONCERNED_WITH_THE_
021210222021021210101210210102110 012201102011110 0201111010

MEANING_OR_TRUTH_
0220101202101102102210 022201210 020121011202011110

OF_MESSAGES_ _SEMANTICS_
0222011210 0220101110111012022101011100 011101022012021020110212101110

LIES_OUTSIDE_THE_
012110110101110 022201120201110110211010 0201111010

SCOPE_OF_MATHEMATICAL_
011102121022201221010 0222011210 0220120201111010220120201102121012012110

INFORMATION_THEORY_ _
01102101121022201210220120201102220210 020111101022201210212200

The Morse encoding uses a total of 384 base-3 digits to represent 129 characters (including the spaces). The weight in bits is then 384 x log₂ 3 ≈ 608.6, or about 4.72 bits/character, against 4.75 for base 27. Besides being more practical than a base-27 encoding, Morse Code is cheaper in bits per character. This is because Morse Code takes advantage of the differences in relative frequency of characters in an English text: common letters such as E (01) and T (02) get the shortest codes.
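
The bookkeeping behind this comparison can be sketched in a few lines. The example assumes the standard International Morse dot/dash patterns and the convention described above (0 before each letter, a lone 0 for a space, 1 for dot, 2 for dash); the function name is made up for the illustration, and the digit count it reports depends on the exact table used.

```python
import math

# Standard International Morse patterns (assumed to be the table intended here).
MORSE = {
    "A": ".-",   "B": "-...", "C": "-.-.", "D": "-..",  "E": ".",
    "F": "..-.", "G": "--.",  "H": "....", "I": "..",   "J": ".---",
    "K": "-.-",  "L": ".-..", "M": "--",   "N": "-.",   "O": "---",
    "P": ".--.", "Q": "--.-", "R": ".-.",  "S": "...",  "T": "-",
    "U": "..-",  "V": "...-", "W": ".--",  "X": "-..-", "Y": "-.--",
    "Z": "--..",
}

def to_ternary(message: str) -> str:
    """Encode text as base-3 digits: 0 = separator or space, 1 = dot, 2 = dash."""
    pieces = []
    for ch in message:
        if ch == "_":
            pieces.append("0")  # a space is a lone 0
        else:
            pieces.append("0" + MORSE[ch].replace(".", "1").replace("-", "2"))
    return "".join(pieces)

# The quoted sentence, with each "_ _" written as two consecutive spaces.
message = (
    "WE_ARE_NOT_CONCERNED_WITH_THE_MEANING_OR_TRUTH_OF_MESSAGES__"
    "SEMANTICS_LIES_OUTSIDE_THE_SCOPE_OF_MATHEMATICAL_INFORMATION_THEORY__"
)

encoded = to_ternary(message)
morse_bits = len(encoded) * math.log2(3)    # each base-3 digit is worth log2(3) bits
base27_bits = len(message) * math.log2(27)  # each base-27 digit is worth log2(27) bits

print(f"{len(encoded)} base-3 digits for {len(message)} characters")
print(f"Morse:   {morse_bits:.1f} bits, {morse_bits / len(message):.2f} bits/character")
print(f"base 27: {base27_bits:.1f} bits, {base27_bits / len(message):.2f} bits/character")
```

Calling to_ternary on a single word, for example to_ternary("WE_"), reproduces one group of digits from the listing, which makes it easy to check a line by hand.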

Is it possible that a better encoding could reduce the cost in bits/character even further?