When people post an encrypted text puzzle online, it is frequently a simple substitution cipher: each letter stands for a different letter, and the plaintext is revealed by discovering the correct replacement for each one. I always wondered whether these puzzles could be decrypted automatically by analyzing how often each character appears and comparing that to the frequency of each character in the English language. (Obviously I’m only decrypting English text, otherwise how would I know when it’s right? That said, the code should work equally well for any language, given that language’s letter frequencies.) Recently I decided to test the theory. The code is after the jump.
I immediately encountered a major problem with one of my assumptions: various sources disagree on the frequency of letter occurrence in the English language. I always thought those numbers were well established, since people widely regard “E” as the most common letter, and Pat Sajak has sworn by “RSTLN” for decades. A second problem is that most ciphertexts are far too small for frequencies to be very relevant. In fact, the example I originally started with didn’t even include all 26 letters.
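To see how little a short ciphertext gives frequency analysis to work with, a quick check like the one below helps. It is only a minimal sketch: `missing_letters` is a hypothetical helper name chosen for this illustration, not something from the original code.

```python
import string

def missing_letters(text):
    """Return the letters A-Z that never appear in `text` (case-insensitive)."""
    present = {ch for ch in text.upper() if ch in string.ascii_uppercase}
    return sorted(set(string.ascii_uppercase) - present)

# A short message uses only a handful of distinct letters,
# so most of the alphabet has a count of zero.
sample = "HELLO WORLD"
print(len(missing_letters(sample)))  # 19 of the 26 letters never appear
```

When most letters have a count of zero or one, ranking them by frequency is close to arbitrary.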
After a few tries, it was clear that this approach didn’t work. It may be a useless approach, because any text large enough for statistics to matter would probably not use a simple substitution cipher. However, it was a fun experiment, and a negative result is still a result worth sharing.
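For anyone who wants to try the idea, the frequency-matching approach described above can be sketched roughly as follows. This is not the code from the experiment, just a minimal illustration. The names `ENGLISH_ORDER`, `guess_key`, and `apply_key` are made up for this example, and the letter ranking is one commonly cited ordering; as noted, sources disagree on the details.

```python
from collections import Counter
import string

# Assumption: one commonly cited ranking of English letters, most to least
# frequent. Other sources order the middle and tail differently.
ENGLISH_ORDER = "ETAOINSHRDLCUMWFGYPBVKJXQZ"

def guess_key(ciphertext, reference_order=ENGLISH_ORDER):
    """Map each cipher letter to a plaintext guess by matching frequency ranks."""
    counts = Counter(
        ch for ch in ciphertext.upper() if ch in string.ascii_uppercase
    )
    # Sort by descending count; break ties alphabetically so the result
    # is deterministic.
    ranked = sorted(counts, key=lambda ch: (-counts[ch], ch))
    # The most frequent cipher letter is guessed to be the most frequent
    # English letter, and so on down the ranking.
    return dict(zip(ranked, reference_order))

def apply_key(ciphertext, key):
    """Substitute letters using `key`, preserving case and non-letters."""
    out = []
    for ch in ciphertext:
        upper = ch.upper()
        if upper in key:
            plain = key[upper]
            out.append(plain if ch.isupper() else plain.lower())
        else:
            out.append(ch)
    return "".join(out)

if __name__ == "__main__":
    puzzle = "Uif rvjdl cspxo gpy kvnqt pwfs uif mbaz eph"
    key = guess_key(puzzle)
    print(apply_key(puzzle, key))
```

On a puzzle-sized input like this one, the printed guess is unlikely to be readable, which fits the negative result: with so few letters, the observed ranking rarely lines up with the reference ranking.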