previous 1 Pleased to Meet up V'ir Tbg n Frperg    This document as PDF next 3 The Caesar Cipher


2 I'm a Substitute for Another Guy

The simplest ciphers are based on the idea of replacing each letter in the plaintext with a different letter. For example, instead of a we might use R, for b write A, and so on. Ciphers of this class are called substitution ciphers (more precisely, monoalphabetic substitution ciphers); these are the types of ciphers you sometimes see on the crossword page of the newspaper. Note that in this case the key is rather long (namely, just as long as the alphabet, since we must say which character gets replaced with which), and there are many possible keys (26!, or 403,291,461,126,605,635,584,000,000). But they are still pretty easy to crack.
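To make the idea concrete, here is a minimal Python sketch of a monoalphabetic substitution. The key below is an assumption made up for illustration: only its first two entries (a becomes R, b becomes A) come from the text, and the rest is an arbitrary arrangement of the remaining letters. The function names are ours as well.

```python
import math
import string

# Hypothetical key: position i holds the ciphertext letter that replaces
# the i-th plaintext letter. Only a -> R and b -> A come from the text;
# the remaining 24 letters are simply listed in reverse order.
KEY = "RAZYXWVUTSQPONMLKJIHGFEDCB"

ENCRYPT = str.maketrans(string.ascii_lowercase, KEY)
DECRYPT = str.maketrans(KEY, string.ascii_lowercase)

def encrypt(plaintext):
    """Replace each lowercase letter by its partner in KEY; other characters pass through."""
    return plaintext.lower().translate(ENCRYPT)

def decrypt(ciphertext):
    """Undo encrypt() by applying the inverse table."""
    return ciphertext.upper().translate(DECRYPT)

print(encrypt("ab"))                    # RA
print(decrypt(encrypt("substitute")))   # round trip gives back the plaintext
print(math.factorial(26))               # number of ways to arrange the alphabet
```

Following the convention used in the text, plaintext is written in lowercase and ciphertext in uppercase.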

Let's look at an example.

RUQDY QNTRZ AQIQB NAZNC YQQBQ WBJQR UJWXS RUNVW WENCQ TRYQK QRK

At first glance, this may seem like total gibberish. But upon a little examination, we see that the letter Q occurs 11 times, much more than any other. It would be reasonable to guess that Q stands for a commonly occurring letter. In English, the letter e is the most common letter, followed by t, a, and o.4
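The count is easy to check by machine. The following is a small sketch using Python's standard collections.Counter; the variable names are our own.

```python
from collections import Counter

CIPHERTEXT = "RUQDY QNTRZ AQIQB NAZNC YQQBQ WBJQR UJWXS RUNVW WENCQ TRYQK QRK"

# Count letters only; the spaces just separate the groups of five.
counts = Counter(ch for ch in CIPHERTEXT if ch.isalpha())

# Show the four most frequent ciphertext letters with their counts.
for letter, n in counts.most_common(4):
    print(letter, n)
```

This prints Q 11, R 6, N 5, and W 4, out of 53 letters in total.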


Table: Frequency (in percent) of occurrence of letters in English text (adapted from [Lew])
Letter  Frequency    Letter  Frequency    Letter  Frequency
a        8.167       j        0.153       s        6.327
b        1.492       k        0.772       t        9.056
c        2.782       l        4.025       u        2.758
d        4.253       m        2.406       v        0.978
e       12.702       n        6.749       w        2.360
f        2.228       o        7.507       x        0.150
g        2.015       p        1.929       y        1.974
h        6.094       q        0.095       z        0.074
i        6.966       r        5.987


So, in our message, we would likely guess that Q corresponds to e, and that W, R, and N correspond to o, a, and t, although these last three might be permuted. Then we plug them in and guess which letters correspond to which until the message makes sense. Once we have a few letters, we can try to recognize patterns corresponding to common words (the, and, and so on). We can also make use of the fact that the most common pairs of letters in English are (in order) th, he, in, en, nt, re, er, an, ti, es, on, at, se, nd, or, ar, al, te, co, de, to, ra, et, ed, it, sa, em, and ro.
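Pairs of letters can be counted in the same way as single letters. The sketch below assumes we first strip the spaces, since the groups of five are not word boundaries; the variable names are ours. With a message this short the pair counts are small, so they are best treated as hints rather than evidence.

```python
from collections import Counter

CIPHERTEXT = "RUQDY QNTRZ AQIQB NAZNC YQQBQ WBJQR UJWXS RUNVW WENCQ TRYQK QRK"

# Drop the spaces, then slide a window of width two along the letters.
letters = CIPHERTEXT.replace(" ", "")
pairs = Counter(letters[i:i + 2] for i in range(len(letters) - 1))

# Print every pair that occurs more than once.
for pair, n in pairs.most_common():
    if n > 1:
        print(pair, n)
```

Here RU and YQ each occur three times, which makes them natural candidates to compare against the list of common English pairs once a few letters are known.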

Making use of such probabilistic guesses is called frequency analysis. As the length of the ciphertext increases, frequency analysis becomes increasingly reliable; for this method to work, typically you need a message of at least 50 characters. There are records from the 9th century A.D. indicating that frequency analysis was in use in the Arab world at that time.
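A first pass of frequency analysis can be sketched as a purely mechanical pairing: rank the ciphertext letters by count and match them against the English letters in frequency order. The ordering string below is read off the table above; the function name and the choice to pair only the top four letters are our own assumptions.

```python
from collections import Counter

CIPHERTEXT = "RUQDY QNTRZ AQIQB NAZNC YQQBQ WBJQR UJWXS RUNVW WENCQ TRYQK QRK"

# English letters from most to least frequent, according to the table.
ENGLISH_BY_FREQUENCY = "etaoinshrdlcumwfgypbvkjxqz"

def rank_guess(ciphertext, top=4):
    """Pair the `top` most frequent ciphertext letters with the most frequent English letters."""
    counts = Counter(ch for ch in ciphertext if ch.isalpha())
    ranked = [letter for letter, _ in counts.most_common(top)]
    return dict(zip(ranked, ENGLISH_BY_FREQUENCY))

print(rank_guess(CIPHERTEXT))
```

For our message this pairs Q with e, R with t, N with a, and W with o. That agrees with the guess in the text for Q and W but swaps the roles of R and N, which is exactly the kind of permutation among near-equal frequencies that has to be sorted out by hand.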

So far, we have what is below (assuming the guesses for characters are correct):

RUQDY QNTRZ AQIQB NAZNC YQQBQ WBJQR UJWXS RUNVW WENCQ TRYQK QRK
a e   et a   e e  t  t   ee e o  ea   o   a t o o t e  a e  ea
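The second line above can be produced mechanically from the four guesses. This is a small sketch, with a dictionary and function name of our own choosing; letters we have not guessed yet are shown as blanks.

```python
CIPHERTEXT = "RUQDY QNTRZ AQIQB NAZNC YQQBQ WBJQR UJWXS RUNVW WENCQ TRYQK QRK"

# The four guesses made so far (ciphertext letter -> plaintext letter).
GUESSES = {"Q": "e", "W": "o", "R": "a", "N": "t"}

def apply_guesses(ciphertext, guesses):
    """Replace guessed letters and blank out everything else, keeping the alignment."""
    return "".join(guesses.get(ch, " ") for ch in ciphertext)

print(CIPHERTEXT)
print(apply_guesses(CIPHERTEXT, GUESSES))
```

Adding a new entry to GUESSES and printing again is a quick way to try out a hunch without losing track of what has been filled in.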

It still takes quite a bit of playing around to get the rest of the message. Remember that the spaces were removed from the original; as a hint, I'll put them back in, together with our guesses. To avoid spoiling the fun for those who don't want a hint, it is in a footnote.5



Footnotes

... o.4
This frequency order sometimes shows up as etaoin shrdlu in pre-1980s print, because the keys on the Linotype machines used to set type were arranged in frequency order. For the same reason, qwerty or asdf show up when people bang on keyboards.
... footnote.5
A hint:
  R UQDYQN TRZ AQ IQBN AZ NCYQQ BQWBJQ RU JWXS RU NVW WE NCQT RYQ KQRK
  a  e  et  a   e  e t    t  ee  eo  e a   o   a  t o o  t e  a e  ea
Scott Sutherland
2005-10-26