2 I'm a Substitute for Another Guy

The simplest ciphers are based on the idea of replacing each letter in the plaintext with a different letter. For example, instead of a we might use R, for b write A, and so on. Ciphers of this kind are called substitution ciphers (more precisely, monoalphabetic substitution ciphers); these are the ciphers you sometimes see on the crossword page of the newspaper. Note that in this case, the key is rather long (namely, just as long as the alphabet, since we must say which character gets replaced with which), and there are many possible keys (26!, or 403,291,461,126,605,635,584,000,000). Even so, these ciphers are still pretty easy to crack.
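To make the idea concrete, here is a minimal Python sketch of a monoalphabetic substitution cipher. The function names (make_key, encrypt, decrypt), the seed, and the sample message are illustrative assumptions, not part of the original text; the convention of lowercase plaintext and uppercase ciphertext follows the examples above.

```python
import math
import random
import string

def make_key(seed=None):
    """Return a random key: a dict mapping each lowercase plaintext
    letter to a distinct uppercase ciphertext letter."""
    rng = random.Random(seed)
    shuffled = list(string.ascii_uppercase)
    rng.shuffle(shuffled)
    return dict(zip(string.ascii_lowercase, shuffled))

def encrypt(plaintext, key):
    # Substitute letters; drop spaces and punctuation.
    return "".join(key[ch] for ch in plaintext.lower() if ch in key)

def decrypt(ciphertext, key):
    # Invert the key so ciphertext letters map back to plaintext letters.
    inverse = {c: p for p, c in key.items()}
    return "".join(inverse[ch] for ch in ciphertext if ch in inverse)

key = make_key(seed=1)              # hypothetical seed, for repeatability
ciphertext = encrypt("attack at dawn", key)
print(ciphertext)
print(decrypt(ciphertext, key))     # the message again, without spaces
print(math.factorial(26))           # number of possible keys
```

The key is stored as a dictionary because that mirrors the description of a key as a table saying which character gets replaced with which.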

At first glance, a ciphertext produced this way may seem like total gibberish. But upon a little examination of our message, we see that the letter Q occurs 11 times, much more than any other. It would be reasonable to guess that Q stands for a commonly occurring letter. In English, the letter e is the most common letter, followed by t, a, and o.⁴

Letter	Frequency (%)	Letter	Frequency (%)	Letter	Frequency (%)
`a`	8.167	`j`	0.153	`s`	6.327
`b`	1.492	`k`	0.772	`t`	9.056
`c`	2.782	`l`	4.025	`u`	2.758
`d`	4.253	`m`	2.406	`v`	0.978
`e`	12.702	`n`	6.749	`w`	2.360
`f`	2.228	`o`	7.507	`x`	0.150
`g`	2.015	`p`	1.929	`y`	1.974
`h`	6.094	`q`	0.095	`z`	0.074
`i`	6.966	`r`	5.987
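The letter counts for our message can be tallied mechanically. The following is a small sketch using Python's collections.Counter; the CIPHERTEXT string is copied from the message worked on in this section, and the variable names are illustrative.

```python
from collections import Counter

# The ciphertext from this section, in its groups of five.
CIPHERTEXT = "RUQDY QNTRZ AQIQB NAZNC YQQBQ WBJQR UJWXS RUNVW WENCQ TRYQK QRK"

# Count letters only; the spaces are just grouping, not part of the message.
counts = Counter(ch for ch in CIPHERTEXT if ch.isalpha())
total = sum(counts.values())

for letter, n in counts.most_common(4):
    print(f"{letter}: {n:2d}  ({100 * n / total:.1f}%)")
```

Comparing the printed percentages with the table gives a first set of candidate letters, though with so short a message the ranking below the top letter should be treated as a guess.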

So, in our message, we would likely guess that Q corresponds to e, and that W, R, and N correspond to o, a, and t, although these last three might be permuted. Then we plug them in, and guess at which letters correspond to which until the message makes sense. Once we have a few letters, we can try to recognize patterns corresponding to common words (the, and, and so on). We can also make use of the fact that the most common pairs of letters in English are (in order) th, he, in, en, nt, re, er, an, ti, es, on, at, se, nd, or, ar, al, te, co, de, to, ra, et, ed, it, sa, em, and ro.
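Pairs of letters can be counted in the same way as single letters. The sketch below slides a two-letter window over the ciphertext (with the grouping spaces removed) and tallies each pair; the names stream and digrams are illustrative. With a message this short, the pair counts are only weak evidence and serve as hints rather than conclusions.

```python
from collections import Counter

CIPHERTEXT = "RUQDY QNTRZ AQIQB NAZNC YQQBQ WBJQR UJWXS RUNVW WENCQ TRYQK QRK"

# Remove the grouping spaces so that pairs spanning two groups are counted.
stream = CIPHERTEXT.replace(" ", "")

# Pair each letter with its successor and count the resulting digrams.
digrams = Counter(a + b for a, b in zip(stream, stream[1:]))

for pair, n in digrams.most_common(6):
    print(pair, n)
```

A ciphertext pair that occurs several times is a candidate for one of the common English pairs listed above, to be checked against the single-letter guesses.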

Making use of such probabilistic guesses is called frequency analysis. As the length of the ciphertext increases, frequency analysis becomes increasingly reliable; for the method to work, you typically need a message of at least 50 characters. There are records from the 9th century A.D. indicating that frequency analysis was in use in the Arab world at that time.

So far, we have what is below (assuming the guesses for characters are correct), with a dot marking each letter we have not yet guessed:

RUQDY QNTRZ AQIQB NAZNC YQQBQ WBJQR UJWXS RUNVW WENCQ TRYQK QRK
a.e.. et.a. .e.e. t..t. .ee.e o..ea ..o.. a.t.o o.t.e .a.e. ea.
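Trying out a set of guesses by hand gets tedious, so it can help to let a short program fill them in. The sketch below applies the four guesses from this section (Q for e, W for o, R for a, N for t) and prints a placeholder for every letter not yet guessed. The function name apply_guesses and the choice of a dot as the placeholder are assumptions made for illustration.

```python
CIPHERTEXT = "RUQDY QNTRZ AQIQB NAZNC YQQBQ WBJQR UJWXS RUNVW WENCQ TRYQK QRK"

# Current guesses: ciphertext letter -> plaintext letter.
guesses = {"Q": "e", "W": "o", "R": "a", "N": "t"}

def apply_guesses(ciphertext, guesses, unknown="."):
    """Replace guessed letters, keep spaces, and mark the rest as unknown."""
    return "".join(
        ch if ch == " " else guesses.get(ch, unknown)
        for ch in ciphertext
    )

print(CIPHERTEXT)
print(apply_guesses(CIPHERTEXT, guesses))
```

Adding a new entry to guesses and rerunning shows at once whether the new guess produces plausible fragments of English.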

It still takes quite a bit of playing around to get the rest of the message. Remember that the spaces were removed from the original; as a hint, I'll put them back in, together with our guesses. To avoid spoiling the fun for those who don't want a hint, it is in a footnote.⁵

Footnotes