Project Euler 59: Using a brute force attack, can you decrypt the cipher using XOR encryption?

Project Euler 59: Using a brute force attack, can you decrypt the cipher using XOR encryption?

Written by on 3 September 2011

Topics: Project Euler

People who know me would also know how much I have been looking forward to this exercise. Problem 59 of Project Euler is about encryption, and modern days encryption relies heavily on mathematics. I had much fun trying to crack the code of this problem. The problems reads:

Each character on a computer is assigned a unique code and the preferred standard is ASCII (American Standard Code for Information Interchange). For example, uppercase A = 65, asterisk (*) = 42, and lowercase k = 107.

A modern encryption method is to take a text file, convert the bytes to ASCII, then XOR each byte with a given value, taken from a secret key. The advantage with the XOR function is that using the same encryption key on the cipher text, restores the plain text; for example, 65 XOR 42 = 107, then 107 XOR 42 = 65.

For unbreakable encryption, the key is the same length as the plain text message, and the key is made up of random bytes. The user would keep the encrypted message and the encryption key in different locations, and without both “halves”, it is impossible to decrypt the message.

Unfortunately, this method is impractical for most users, so the modified method is to use a password as a key. If the password is shorter than the message, which is likely, the key is repeated cyclically throughout the message. The balance for this method is using a sufficiently long password key for security, but short enough to be memorable.

Your task has been made easy, as the encryption key consists of three lower case characters. Using cipher1.txt (right click and ‘Save Link/Target As…’), a file containing the encrypted ASCII codes, and the knowledge that the plain text must contain common English words, decrypt the message and find the sum of the ASCII values in the original text.

We could brute force through all the keys since there are only 263 = 17576. However, wWe know quite a lot about the encrypted message, but lets just have a short list of what we actually know about the encrypted text.

  1. The text is in English
  2. The key has three digits
  3. All digits are lower case English letters
  4. We know the encryption method
Since we know the language of the text as well as the encryption method it means we can relatively easy use frequency analysis on the problem. If you don’t know what that is, don’t worry I will explain it in a few seconds.
I tried to dive a bit into linq for this purpose, since it involves juggling multiple arrays around a lot, and I thought it might be easier to solve parts of it with linq.

Encryption algorithm

Before explaining how we get hold of the key, I just want to cover the necessary code for encryption/decryption of the text as well as the main method, and let us start out with the encryption method.

I wrote this one in the conventional method and my attempt at linq, to compare what was easiest for me to read and what was fastest. Basically we need a method where we provide the message and the key, and then we loop through the message and xor it with the key. The conventional method looks like this in C#:

private int[] Encrypt(int[] message, int[] key) {
    int[] encryptedMessage = new int[message.Length];

    for (int i = 0; i < message.Length; i++) {
        encryptedMessage[i] = message[i] ^ key[i%key.Length];
    }
    return encryptedMessage;
}

But wait a minute! Why are you calling it encrypt? You might ask. The reason is that this function can be used to both encode and decode the message given the correct key as the problem description also states. I have written the same function in linq and got the following piece of code out of it:

private int[] EncryptLinq(int[] message, int[] key) {
    IEnumerable repeatedKey = Enumerable
                              .Range(0, message.Length)
                              .Select(x => key[x % key.Length]);
    return message.Zip(repeatedKey, (x, y) => (x ^ y)).ToArray();
}

zip is a pretty nice function that takes two enumerables and performs a binary operation on the elements. One thing I missed with linq was a good way to repeat the key so it is long enough to cover the whole message.

I am not sure what I find most readable. I think for most purposes linq is nice, but the repeated key sort of bothers me.

Main loop

The main loop is pretty simple, it looks like

string filename = Environment.GetFolderPath(Environment.SpecialFolder.DesktopDirectory) + "\\input.txt";
const int keyLength = 3;

int[] message = File.ReadAllText(filename)
                .Split(',')
                .Select(int.Parse).ToArray();
int[] key = CryptAnalysis(message, keyLength);
int[] decryptedMessage = Encrypt(message, key);
int result = decryptedMessage.Sum();

Again I have used linq for a few things here. I have used it for reading the input file and splitting it into an array. It is really a much easier thing to do than juggling around with the conventional code I have used before. I have also used it for summing up the array at the end to get the result.

The only thing we miss is the analysis to get the key.

Frequency Analysis

Every language has it’s own fingerprint in the sense that the letters of the alphabet are not used equally often. In English the most common letter is ‘e’ which in a normal text it constitutes 12.7% of the letters. This is followed by ‘t’ which is used 9% of the time. This finger print of letters is different for each language and can be used to distinguish languages from each other. A list of distribution for several languages can be found on wiki.

But what good does that do? The message is encrypted! True, and let us consider a simple example at first where the key consists of just one letter. That means all characters in the encoded letter is replaced by another character, and thus by searching through the encoded message we find the most common character in the encrypted message and know that it is likely to represent ‘e’ if the message is in English. Thus we send ‘e’ and the found character through the encryption method and out comes the key.

If we didn’t know the language of the encrypted  message we could look at several of the most frequent letters and see if the pattern matches 12.7%, 9% and so on, or it matches some other language. That way we could even decode a message where the language wasn’t known in advance.

But our key is not one letter, ours is known to be three letters. However, we can split the encoded message into three piles. One encrypted by the first character of the key. This pile will be the characters 1,4,7,10… The second pile will be every third letter starting from the second, and so on. On each of these piles we do frequency analysis as described above and then we are likely to get the key for each pile.

This requires that the key is much shorter than the message, since we need a decent amount of data to work on if the frequency analysis has to work.

If we don’t know the key length we could try to split the text into say 2 piles and see if the character distribution matches each other, if not then we could split it into 3 piles and so on, until we get something that is similar in character distribution. This is of course out side the scope of this exercise, since we know the key size.

Wikipedia also has an article on frequency analysis with a good example of how to apply it to decrypt a message.

One last thing before coding this fine piece of cryptanalysis. The most common character to find in English is not ‘e’, it is a space character. So what often happens is that spaces are removed from the message before encryption. However, this is likely not the case here, so we will make a piece of code that split the encoded text into three piles and look for the most common character. This character is most likely representing a space character.

My C# implementation looks like:

private int[] Analysis(int[] message, int keyLength) {
    int maxsize = 0;
    for (int i = 0; i < message.Length; i++) {
        if (message[i] > maxsize) maxsize = message[i];
    }

    int[,] aMessage = new int[keyLength, maxsize+1];
    int[] key = new int[keyLength];

    for (int i = 0; i < message.Length; i++) {
        int j = i % keyLength;
        aMessage[j, message[i]]++;
        if (aMessage[j, message[i]] > aMessage[j, key[j]])
            key[j] = message[i];
    }

    int spaceAscii = 32;
    for (int i = 0; i < keyLength; i++) {
        key[i] = key[i] ^ spaceAscii;
    }

    return key;
}

Results

When I ran the code I got the following key ‘god’ out of the crypt analysis, which gave a resulting decrypted text

(The Gospel of John, chapter 1) 1 In the beginning the Word already existed. He was with God, and he was God. 2 He was in the beginning with od. 3 He created everything there is. Nothing exists that he didn’t make. 4 Life itself was in him, and this life gives light to everyone. 5 The light shines through the darkness, and the darkness can never extinguish it. 6 God sent John the Baptist 7 to tell everyone about the light so that everyone might believe because of his testimony. 8 John himself was not the light; he was only a witness to the light. 9 The one who is the true light, who gives light to everyone, was going to come into the world. 10 But although the world was made through him, the world didn’t recognize him when he came. 11 Even in his own land and among his own people, he was not accepted. 12 But to all who believed him and accepted him, he gave the right to become children of God. 13 They are reborn! This is not a physical birth resulting from human passion or plan, this rebirth comes from God.14 So the Word became human and lived here on earth among us. He was full of unfailing love and faithfulness. And we have seen his glory, the glory of the only Son of the Father

which gave a result of 107359 when it is summed up.

On the timing side I solved the problem in 5ms with the traditional operation and 13ms with the linq edition of the encryption. So there is a performance difference, though for any normal application I would prefer readability to speed, since it is easier to maintain the code that way.

Wrapping up

This was it for now. I hope you learned something new from the explanation, or just needed the last couple of hints to solve the exercise. I know other people have used other appraoches for solving the exercise. A lot of people have brute forced all keys and analysed the result in different ways. Some have checked if the major part of the decrypted message was actual letters, some have checked if the distribution of letters resembles english and some have searched if ‘the’ is frequently contained in the text. What was your approach?

The source code is available as always for you to peak in if you like to get the overview. Both the linq and non-linq versions. One question I have asked my self, is whether or not I want to use linq more in the future, and I think the answer is yes. I am likely to use it in places where

  1. It enhances readability – especially if the code looks more like the description I intend to give.
  2. It does not limit performance.

Some would argue that 1 is always true, but I don’t. My guess is I will use it for things like summing up arrays where I can avoid a for loop just for that purpose. However, for critical and repeated operations it is not likely that I will use linq, I am way too fond of loops for that.

The image is I used for this blog post is taken by William Neuheisel and shared under the creative commons license. The image  shows a sculpture called key note by Michael Christian.

7 Comments For This Post I'd Love to Hear Yours!

  1. SuprDewd says:

    The method you are using to figure out the key is pretty amazing. Lot faster than other methods I’ve seen, and way smarter. But I’m not sure how well it works in real-life encryption-cracking, but that doesn’t matter here.

    I’m glad you’re starting to take a look at LINQ. And I totally agree with your ideas on when it should and shouldn’t be used.

    And about repeating the key. It’s pretty easy to do with LINQ if you know how to… I would have created a “Repeat” extension method (which I would then stick in my library for later use) which would look something like:

    public static IEnumerable<T> Repeat<T>(this IEnumerable<T> collection)
    {
        while (true)
        {
            foreach (T item in collection)
            {
                yield return item;
            }
        }
    }
    

    And then your EncryptLinq could look like:

    private int[] EncryptLinq2(int[] message, int[] key)
    {
        return message.Zip(key.Repeat(), (x, y) => (x ^ y)).ToArray();
    }
    

    Google “yield return” and “extension methods” if you don’t already know how to use those C# features.

    Great post and awesome PE problem.

  2. Kristian says:

    Hi SuprDewd

    Thanks for letting me know about extension methods they are hugely helpful and a way to implement something I would need over and over again. I played around with this problem and reduced the analysis code to a few lines with linq.

    Regarding the decryption method and it’s real life applications. It wouldn’t work on any modern day mathematical encryption methods. It only works methods like this where you could basically perform the method by hand. However for a few thousand years encryption was based on methods like the one used in this problem.

    /Kristian

  3. Guy says:

    Hi,

    Can you please explain what you tried to achieve in the next code? (Analysis func):

    for (int i = 0; i < message.Length; i++) {
        int j = i % keyLength;
        aMessage[j, message[i]]++;
        if (aMessage[j, message[i]] > aMessage[j, key[j]])
            key[j] = message[i];
    }
     
    int spaceAscii = 32;
    for (int i = 0; i < keyLength; i++) {
        key[i] = key[i] ^ spaceAscii;
    }
    

    Thanks!

  4. Kristian says:

    Hi Guy
    Thanks for that question, which means I haven’t explained it well enough. So here is a bit of elaboration on the code.

    There are two sections of this code. The first section going from line 1 to 6.

    This section loops through the message one character at the time and counts the occurrence of each character. However, it splits it into three sections since the key length is known to be 3. Which section to increase the count in is controlled by line 3. Line 4-5 keeps track of the most common character found so far for each key index.

    The splitting into multiple piles is also explained in the post as: But our key is not one letter, ours is known to be three letters. However, we can split the encoded message into three piles. One encrypted by the first character of the key. This pile will be the characters 1,4,7,10… The second pile will be every third letter starting from the second, and so on. On each of these piles we do frequency analysis as described above and then we are likely to get the key for each pile.

    In the second section I then generate the key. I then assume that the a space is the most commonly occurring character in the text, so for each of the most commonly occurring character I xor it with the value for spacebar (ascii value 32). So now I have the three character key ready to apply to the message.

    Does this make sense? At least I hope it does, otherwise just ask again.

    /Kristian

  5. anant says:

    Hi Kristian

    Thanks for the wonderful explaination. But I think there was some typo in the answer. The final answer obtained it 107359 (and not 982). Or have I misunderstood something?

    Thanks
    Anant

  6. Kristian says:

    Hi Anant

    You are right, I have no idea where I got the 982 from. It is changed now

    /Kristian

  7. Daniel says:

    Thanks for this blog post. My method was to translate the first 6 letters, and consider only those keys which decrypted a vowel in those first six letters. This produced a bunch of results, so I added a regex to check if the translation matched [a-zA-Z ]*. However, no keys decrypted a full result which matched that regular expression. I didn’t realize there would be numbers and parentheses based on the problem description, and wouldn’t have thought to do anything past a regular expression.

Leave a Comment Here's Your Chance to Be Heard!

You can use these html tags: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

You can use short tags like: [code language="language"][/code]

The code tag supports the following languages: as3, bash, coldfusion, c, cpp, csharp, css, delphi, diff,erlang, groovy, javascript, java, javafx, perl, php, plain, powershell, python, ruby, scala, sql, tex, vb, xml

You can use [latex]your formula[/latex] if you want to format something using latex

Notify me of comments via e-mail. You can also subscribe without commenting.