The Mathematics of Communication

2. Encoding English text

Considering the encoding of numbers in various bases led us to the following conclusion:

If each digit is one of b equally likely possibilities, the number of bits per digit will be log₂(b).

This analysis can be applied to the recording and transmission of messages other than numbers. Suppose the message is a text in English:

THE_QUICK_BROWN_FOX_JUMPED_OVER_THE_LAZY_DOGS_ _

We can treat this message like a number written in base 27 (the letters A-Z and _, the symbol for space). Each character then counts as log₂(27) ≈ 4.75 bits.
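The base-27 count can be checked numerically. The following is a minimal sketch in Python; the names ALPHABET, bits_per_digit and text_to_number are illustrative, and the digit assignment (A = 0, ..., Z = 25, _ = 26) is an assumption, since any fixed assignment gives the same count. It reads the sample message as a base-27 integer and compares the binary length of that integer with the estimate of log₂(27) bits per character.

```python
import math

# Assumed digit assignment: A = 0, ..., Z = 25, and _ (space) = 26.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ_"

def bits_per_digit(base: int) -> float:
    """Bits carried by one digit when all `base` values are equally likely."""
    return math.log2(base)

def text_to_number(text: str) -> int:
    """Read `text` as a base-27 numeral, most significant character first."""
    value = 0
    for ch in text:
        value = value * len(ALPHABET) + ALPHABET.index(ch)
    return value

# The sample message, with the two closing spaces written as "__".
message = "THE_QUICK_BROWN_FOX_JUMPED_OVER_THE_LAZY_DOGS__"

estimate = len(message) * bits_per_digit(len(ALPHABET))
actual = text_to_number(message).bit_length()

print(f"{bits_per_digit(27):.2f} bits per character")
print(f"estimated cost: {estimate:.1f} bits; the integer needs {actual} bits")
```

The binary length of the integer never exceeds the estimate rounded up, because a message of n characters is a number smaller than 27ⁿ.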

But there is a difference between the English alphabet and the digits 0 to 26 that occur in base-27 arithmetic. The assumption we made, that in each place all digits are equally likely, is unrealistic for English, where, for example, the letter E occurs more often than any other letter.
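The unevenness is easy to see by counting characters. The sketch below, in Python, tallies the characters of the sample sentence quoted later in this section (with _ written for each space); the variable names are illustrative. In this sentence E is the most frequent letter, while letters such as P and Y appear only once, so the "digits" of the base-27 numeral are far from equally likely.

```python
from collections import Counter

# Sample sentence from later in this section, with _ written for each space.
text = ("WE_ARE_NOT_CONCERNED_WITH_THE_MEANING_OR_TRUTH_OF_MESSAGES__"
        "SEMANTICS_LIES_OUTSIDE_THE_SCOPE_OF_MATHEMATICAL_INFORMATION_THEORY__")

counts = Counter(text)

# Keep the letters only; the space symbol is tallied separately.
letters = {ch: n for ch, n in counts.items() if ch != "_"}

# List each letter with its count and its share of all characters,
# most frequent first.
for ch, n in sorted(letters.items(), key=lambda item: -item[1]):
    print(f"{ch}: {n:3d}  ({n / len(text):.3f})")

print(f"spaces: {counts['_']}")
```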

Morse Code represents letters by combinations of dots and dashes. Writing 1 for dot, 2 for dash and 0 for the gap before each letter (so that a lone 0 stands for the space symbol _), Morse Code looks like this:

A 012	H 01111	O 0222	V 01112
B 02111	I 011	P 01221	W 0122
C 02121	J 01222	Q 02212	X 02112
D 0211	K 0212	R 0121	Y 02122
E 01	L 01211	S 0111	Z 02211
F 01121	M 022	T 02	_ 0
G 0221	N 021	U 0112

Here is a message and its Morse Code (message from E. C. Cherry's The Communication of Information, as quoted in Y. Bar-Hillel, Language and Information, Addison-Wesley 1964, p. 222):

WE_	ARE_	NOT_
0122010	0120121010	0210222020
CONCERNED_	WITH_	THE_
021210222021021210101210210102110	012201102011110	0201111010
MEANING_	OR_	TRUTH_
0220101202101102102210	022201210	020121011202011110
OF_	MESSAGES_ _	SEMANTICS_
0222011210	0220101110111012022101011100	011101022012021020110212101110
LIES_	OUTSIDE_	THE_
012110110101110	022201120201110110211010	0201111010
SCOPE_	OF_	MATHEMATICAL_
011102121022201221010	0222011210	0220120201111010220120201102121012012110
INFORMATION_	THEORY_ _
01102101121022201210220120201102220210	020111101022201210212200

The Morse encoding uses a total of 384 base-3 digits to represent 129 characters (including the spaces). The weight in bits is then 384 × log₂(3) ≈ 608.6, or about 4.72 bits/character. Besides being more practical than a base-27 encoding, Morse Code is slightly cheaper in bits per character. This is because Morse Code takes advantage of the differences in relative frequency of characters in an English text: common letters such as E and T receive the shortest codes.
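The tally can be reproduced with a short script. This is a sketch in Python, assuming the standard International Morse patterns and the convention used above (1 for dot, 2 for dash, a 0 before each letter, and a lone 0 for the space symbol _); the names MORSE and encode are illustrative.

```python
import math

# International Morse patterns, written with 1 for dot and 2 for dash.
# The space symbol _ has an empty pattern, so it encodes as a lone 0.
MORSE = {
    "A": "12",   "B": "2111", "C": "2121", "D": "211",  "E": "1",
    "F": "1121", "G": "221",  "H": "1111", "I": "11",   "J": "1222",
    "K": "212",  "L": "1211", "M": "22",   "N": "21",   "O": "222",
    "P": "1221", "Q": "2212", "R": "121",  "S": "111",  "T": "2",
    "U": "112",  "V": "1112", "W": "122",  "X": "2112", "Y": "2122",
    "Z": "2211", "_": "",
}

def encode(text: str) -> str:
    """Encode text as base-3 digits, putting the separator 0 before each character."""
    return "".join("0" + MORSE[ch] for ch in text)

# The quoted message, with _ written for each space.
text = ("WE_ARE_NOT_CONCERNED_WITH_THE_MEANING_OR_TRUTH_OF_MESSAGES__"
        "SEMANTICS_LIES_OUTSIDE_THE_SCOPE_OF_MATHEMATICAL_INFORMATION_THEORY__")

digits = encode(text)
bits = len(digits) * math.log2(3)   # each base-3 digit weighs log2(3) bits

print(f"{len(digits)} base-3 digits for {len(text)} characters")
print(f"Morse:   {bits:.1f} bits, {bits / len(text):.2f} bits/character")
print(f"base 27: {math.log2(27):.2f} bits/character")
```

Changing text lets the same comparison be run on other messages. A text with many rare letters such as J, Q or Z costs more per character than one rich in E and T, since the rare letters take five digits each and E and T only two.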

Is it possible that a better encoding could reduce the cost in bits/character even further?