Post by mortlach on Oct 2, 2016 18:04:36 GMT
Introduction
This is the first in a series of posts that aim to walk through some methods and techniques for rune decryption. This first post outlines some useful basic tools. Where possible we will share data, but we advise writing your own functions to implement these techniques (and even generating your own data). There are many resources on the web that explain the following ideas in more detail; we will try to give an overview of them within the context of the runes.
Rune Transcriptions
There are numerous transcriptions of the pages of the Liber Primus. Here are some recommended ones (these should be accurate, but…it is very easy to make a mistake transcribing runes):
Word Lists
It is essential to have good lists of Runeglish words. By Runeglish we mean the language the already-decrypted pages of the Liber Primus are written in:
Here are links to lists of Runeglish words up to a length of 14 runes (the length of the longest word in the Liber Primus):
Length 1 to 7 words pastebin.com/uDJTW9Xe
Length 8 words pastebin.com/G0UpRFQQ
Length 9 and 10 pastebin.com/Eqb1mhWn
Length 11 to 14 pastebin.com/iERLcY8B
Text Analysis
Often we want to apply a decryption function to the cipher text and see whether the resulting message is written in Runeglish. Checking decrypted text by eye is time-consuming, so it is more efficient to automate the process. We do this by comparing decrypted texts to a statistical model. The two comparisons covered here both score N-grams: the Index of Coincidence scores them relative to a random (flat) distribution, and log probabilities score them relative to a Runeglish distribution.
N-grams
An n-gram is a contiguous sequence of n items from a given sequence of text. The items could all be characters or words. This is a list of character-1-grams:
A,N, I,N,S,T,R,U,C,T,IO,N, C,O,M,M,A,N,D, Y,O,U,R, O,W,N, S,E,L,F
We could also have character-2-grams, word-1-grams, word-2-grams, etc:
(A,N),(N,I),(I,N),(N,S),(S,T),(T,R),(R,U),(U,C),(C,T),(T,IO),(IO,N),(N,C),(C,O),(O,M),(M,M),
(M,A),(A,N),(N,D),(D,Y),(Y,O),(O,U),(U,R),(R,O),(O,W),(W,N),(N,S),(S,E),(E,L),(L,F)
AN, INSTRUCTION, COMMAND, YOUR, OWN, SELF
(AN,INSTRUCTION), (INSTRUCTION,COMMAND), (COMMAND,YOUR), (YOUR,OWN), (OWN,SELF)
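The lists above can be generated with a few lines of code. This is only a minimal sketch: the function name ngrams and the space-separated way of writing runes are choices made for this example, not part of any shared tool.

```python
from typing import List, Sequence, Tuple


def ngrams(items: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """Return every contiguous run of n items, in reading order."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


# Each word is written with spaces between runes, because some runes
# (IO, TH, EA, ...) take more than one Latin letter.
spaced_words = ["A N", "I N S T R U C T IO N", "C O M M A N D",
                "Y O U R", "O W N", "S E L F"]

runes = [rune for word in spaced_words for rune in word.split()]
words = ["".join(word.split()) for word in spaced_words]

char_2grams = ngrams(runes, 2)   # (A,N), (N,I), (I,N), ...
word_2grams = ngrams(words, 2)   # (AN,INSTRUCTION), (INSTRUCTION,COMMAND), ...
```

The same function handles characters and words because it only cares about the items in the sequence it is given.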
When comparing n-grams in a decryption to Runeglish, we need a sample of Runeglish text. There are many sample texts we could use; as an example, here are the 1-gram and 2-gram counts from War and Peace (translated to Runeglish):
imgur.com/Vg0xewQ imgur.com/dsEF1L0
Comparison with Flat Distribution: Index of Coincidence (IoC)
en.wikipedia.org/wiki/Index_of_coincidence
The IoC compares the N-grams in a message to a uniform, flat distribution, where each item has an equal probability of occurring. The IoC provides a measure of how likely it is to draw two matching N-grams when randomly selecting from a given text and can be defined as:
IoC=(Actual coincidences)/(Expected coincidences)
For example, in a truly random text over the 29 runes, two randomly chosen runes will match 1 time in 29. For 2-grams the match rate is 1 in 29^2, and so on. It is important to remember that the IoC can be used for more than just 1-grams; in principle it can be used for any set of “things”. The result of the IoC calculation can be normalized with respect to a flat distribution, and the normalization factor depends on the convention chosen. For example, we can define that a flat distribution should give an IoC of exactly 1; the normalization factor then changes with n so that a flat distribution gives 1 for 1-grams, 2-grams, 3-grams, etc. The Wikipedia article has a nice introduction that explains how to derive the expression for IoC. The 'Normalized' IoC values for War and Peace for different n-grams are:
1-gram IoC = 1.77, 2-gram IoC = 4.97, 3-gram IoC = 20.9, 4-gram IoC = 125
These numbers tell us that the 4-grams are very different to random, much more so than the lower n-grams. This makes complete sense: groups of 4 letters are much more expressive than groups of 1 letter. We use this fact when scoring a text based on its similarity to ‘true’ Runeglish text, explained next.
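As a rough illustration of the normalized IoC, here is a minimal sketch. It assumes overlapping n-grams, an alphabet of 29 runes, and the usual "draw two without replacement" estimate of the coincidence rate. The exact convention behind the War and Peace figures above is not stated in this post, so values from this code may differ from them.

```python
from collections import Counter
from typing import Sequence

ALPHABET_SIZE = 29  # number of distinct runes (assumed alphabet for this sketch)


def normalized_ioc(runes: Sequence[str], n: int = 1,
                   alphabet_size: int = ALPHABET_SIZE) -> float:
    """IoC of the n-grams in runes, scaled so a flat distribution gives ~1."""
    grams = [tuple(runes[i:i + n]) for i in range(len(runes) - n + 1)]
    total = len(grams)
    if total < 2:
        raise ValueError("need at least two n-grams")
    counts = Counter(grams)
    # Probability that two n-grams drawn without replacement are identical.
    actual = sum(c * (c - 1) for c in counts.values()) / (total * (total - 1))
    # For a flat distribution each of the alphabet_size**n n-grams is equally likely.
    expected = 1.0 / alphabet_size ** n
    return actual / expected
```

With this convention, a text made of one repeated rune gives the maximum value (29 for 1-grams), and a text in which no n-gram repeats gives 0.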
Comparison with Runeglish Distribution:
Continuing our War and Peace example: if we compare the n-grams in a trial decryption to the n-grams in War and Peace we can quantify how similar to War and Peace our trial is. Trials that get high scores will be more similar to War and Peace (which we assume is similar to true Runeglish) and so worth investigating further.
n-gram frequency
We compare the frequency of n-grams in a decrypted text with the frequency of n-grams in a sample and express the similarity as a score. For example, the Runeglish 2-gram (E,E) occurs much more frequently than (AE,IO). If our trial decryption contains many uncommon n-grams, it gets a low score, indicating that it is likely to be nonsense and should be ignored. We score the n-grams using our n-gram counts from a sample text. From these counts, the probability of an n-gram is its count divided by the total number of n-grams in the text. The total probability for a trial text, Mt, comprising i n-grams N1, N2, …, Ni, is found by multiplying the probabilities of the individual n-grams together:
P( Mt ) = P( N1 ) × P( N2 ) × P( N3 ) × … × P( Ni )
In practice, multiplying the small probabilities of many n-grams gives numbers that are awkwardly small to work with, so the base-10 logarithm (log_10) of the probability is often used. With log probabilities, the total is found by summing the log probabilities:
log_10 P( Mt ) = log_10 P( N1 ) + log_10 P( N2 ) + log_10 P( N3 ) + … + log_10 P( Ni )
Text that has a higher log probability is more similar to Runeglish. Here are the log probabilities of each character from War and Peace:
{E, -0.901198}, {A, -1.10037}, {O, -1.11452}, {S, -1.15951}, {N, -1.17584},
{T, -1.1941}, {R, -1.20621}, {I, -1.22264}, {D, -1.30479}, {L, -1.39308},
{H, -1.40623}, {U, -1.42257}, {C, -1.45147}, {TH, -1.50994}, {M, -1.58784},
{W, -1.5886}, {F, -1.63819}, {Y, -1.71246}, {P, -1.71942}, {B, -1.83791},
{G, -1.95718}, {(I)NG, -1.97996}, {EA, -2.2527}, {IO, -2.36287}, {X, -2.73586},
{J, -2.96712}, {EO, -3.16848}, {OE, -3.67616}, {AE, -4.15242}
To convert back to the actual probability, calculate 10^(log probability value). E has a probability of 10^(-0.901198) = 0.125546, meaning about 12.5% of all runes in War and Peace are E. The log probabilities for all 2-grams can be found here:
pastebin.com/be4sNiw9
The 4-gram log probabilities are in this file: www.filedropper.com/4gramlogprobability
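The scoring described in this section can be sketched as follows. The function names and the floor value given to n-grams that never appear in the sample are assumptions made for this example; how to treat unseen n-grams is a design choice that the post does not specify.

```python
import math
from collections import Counter
from typing import Dict, Sequence, Tuple

Gram = Tuple[str, ...]


def log_prob_table(sample: Sequence[str], n: int) -> Dict[Gram, float]:
    """log_10(count / total) for every n-gram that occurs in the sample."""
    grams = [tuple(sample[i:i + n]) for i in range(len(sample) - n + 1)]
    total = len(grams)
    return {g: math.log10(c / total) for g, c in Counter(grams).items()}


def score(runes: Sequence[str], table: Dict[Gram, float], n: int,
          floor: float = -6.0) -> float:
    """Sum of log probabilities; unseen n-grams get the (assumed) floor value."""
    return sum(table.get(tuple(runes[i:i + n]), floor)
               for i in range(len(runes) - n + 1))


# A few 1-gram values copied from the War and Peace list above.
table_1 = {("E",): -0.901198, ("T",): -1.1941,
           ("TH",): -1.50994, ("AE",): -4.15242}

common = score(["TH", "E"], table_1, 1)   # -1.50994 + -0.901198
rare = score(["AE", "AE"], table_1, 1)    # -4.15242 + -4.15242
```

Because every term is negative, longer texts always get lower totals, so raw scores are only comparable between trial decryptions of the same length (or after dividing by the number of n-grams).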
Extending this to include position in words
We can assume that the word spacing in the Liber Primus is known and exploit this extra information. N-grams can therefore be calculated either from the concatenated runes or from the word-delimited text. For example, consider the 4-grams from the start of War and Peace:
WELL, PRINCE, SO, GENOA, AND, LUCCA, ARE, NOW, JUST
Concatenated 4-grams:
{W, E, L, L}, {E, L, L, P}, {L, L, P, R}, {L, P, R, I}, {P, R, I, N}, {R, I, N, C}, {I, N, C, E}, {N, C, E, S}, {C, E, S, O}, {E, S, O, G}, {S, O, G, E}, {O, G, E, N}, {G, E, N, O}, {E, N, O, A}, {N, O, A, A}, {O, A, A, N}, {A, A, N, D}, {A, N, D, L}, {N, D, L, U}, {D, L, U, C}, {L, U, C, C}, {U, C, C, A}, {C, C, A, A}, {C, A, A, R}, {A, A, R, E}, {A, R, E, N}, {R, E, N, O}, {E, N, O, W}, {N, O, W, J}, {O, W, J, U}, {W, J, U, S}, {J, U, S, T}
Word delimited 4-grams:
{W, E, L, L},{P, R, I, N}, {R, I, N, C}, {I, N, C, E},{G, E, N, O}, {E, N, O, A}, {L, U, C, C}, {U, C, C, A},{J, U, S, T}
An advantage of using the concatenated text is that there are many more N-grams, so you get better statistics. The advantage of the word-delimited N-grams is that they retain the extra information about word spacing. This could be extended further by including the word length and the position of the starting character.
We don't share any data for these lists, as it is a little too complicated to share in a simple way.
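For anyone generating their own lists, the two variants can be sketched as below. The function names are made up for this example, and the third function is only one possible way of attaching word length and start position. The example splits each word into single Latin letters to match the War and Peace example above; a real transcription would use rune tokens instead.

```python
from typing import List, Tuple

Gram = Tuple[str, ...]


def concatenated_ngrams(words: List[List[str]], n: int) -> List[Gram]:
    """n-grams over the text with word boundaries removed."""
    runes = [r for w in words for r in w]
    return [tuple(runes[i:i + n]) for i in range(len(runes) - n + 1)]


def word_delimited_ngrams(words: List[List[str]], n: int) -> List[Gram]:
    """n-grams that never cross a word boundary."""
    return [tuple(w[i:i + n]) for w in words for i in range(len(w) - n + 1)]


def positional_ngrams(words: List[List[str]], n: int) -> List[Tuple[int, int, Gram]]:
    """(word length, start position, n-gram) for each word-delimited n-gram."""
    return [(len(w), i, tuple(w[i:i + n]))
            for w in words for i in range(len(w) - n + 1)]


text = "WELL PRINCE SO GENOA AND LUCCA ARE NOW JUST"
words = [list(w) for w in text.split()]

concat_4 = concatenated_ngrams(words, 4)     # 32 four-grams
delimited_4 = word_delimited_ngrams(words, 4)  # 9 four-grams
```

Words shorter than n simply contribute nothing to the word-delimited list, which is why SO, AND, ARE and NOW do not appear in it.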
An optimizer
Another general tool that can be useful is an optimizer. For example, we may want to optimize some values in a key, using the IoC and/or log probability score of the decrypted text to choose ‘optimum’ values for the key. There are many optimization algorithms to choose from, including hill climbers, downhill simplex methods, stochastic methods such as simulated annealing and genetic algorithms, and many more. As these are general problems not specific to the runes, we have not covered them in more detail, but can do so in future if necessary.
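To make the idea concrete, here is a minimal hill climber, one of the simplest choices. It is a sketch under assumptions: the key is taken to be a non-empty list of integers in the range 0 to 28, and score_fn is a placeholder for whatever "decrypt with this key, then score the result" function you write. The toy score below just counts matches against a hidden key so that the example runs on its own; it is not a real cipher.

```python
import random
from typing import Callable, List, Tuple


def hill_climb(key: List[int], score_fn: Callable[[List[int]], float],
               steps: int = 2000, alphabet_size: int = 29,
               seed: int = 0) -> Tuple[List[int], float]:
    """Change one key value at a time, keeping only changes that raise the score."""
    rng = random.Random(seed)
    best = list(key)                      # work on a copy of the caller's key
    best_score = score_fn(best)
    for _ in range(steps):
        candidate = list(best)
        position = rng.randrange(len(candidate))
        candidate[position] = rng.randrange(alphabet_size)
        candidate_score = score_fn(candidate)
        if candidate_score > best_score:  # accept strict improvements only
            best, best_score = candidate, candidate_score
    return best, best_score


# Toy stand-in for "decrypt and score": number of key values that are correct.
hidden = [3, 14, 15]


def toy_score(key: List[int]) -> float:
    return float(sum(a == b for a, b in zip(key, hidden)))


start = [0, 0, 0]
best_key, best_score = hill_climb(start, toy_score)
```

A plain hill climber like this can get stuck on a local maximum when the score is a real n-gram measure, which is one reason the stochastic methods mentioned above (simulated annealing, genetic algorithms) are often preferred.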
Summary
We have given an overview of some of the basic tools that can be used to decrypt text, and provided accompanying Runeglish data. These include word lists and the transcriptions of the runes. Basic tools for text analysis have also been introduced: the IoC and the n-gram log probability, with example data. Other useful methods for analyzing text exist, for example the Hamming distance between a word and the words in the word list. In the next posts we will use these tools to decrypt some of the pages of the Liber Primus and also discuss how to use them when attempting to solve the unfinished pages.
*Comments, questions, suggestions, omissions, etc.? Please try #cicadasolvers
MSGA