Post by mortlach on Jun 1, 2017 17:14:45 GMT
Introduction
Here, we will recap the principal motivation, ideas, and goals of this study. Then we will show a simple yet effective application of the n-gram database that allows us to focus on specific parts of the LP.
Motivation: When trying to decrypt the runes, there are three main areas to consider:
1. What is the encryption algorithm?
2. What is the key?
3. What are the words that have been encrypted?
Finding out information on any of these three things can help us reduce the huge parameter space we are exploring. If we can find some likely phrases that the runes decrypt to, we can then concentrate on items 1 and 2. We try to seed our search by creating a corpus of real, existing English phrases, which we use as a starting point.
Assumptions: We try to be mindful of our assumptions. The main assumptions for the work described below are:
1. There is a 1:1 correspondence between the cipher-runes and the message-runes
2. The word and sentence delimiters are a single dot and four dots
3. The text is written in English converted to runes, i.e. Runeglish.
4. The rules of the Gematria Primus are strictly followed
5. Most (if not all) of the LP word-n-grams will be common enough to exist in our corpus
For an attack on the LP, this is a relatively small set of robust assumptions, which is one reason this method may yield some worthwhile conclusions.
A Warning: It will still involve long, boring trawls through many, many words.
The n-gram Data-Set
Previously, we created a corpus of words that the community thought were most likely to be in the Liber Primus (LP). We then started trawling through the Google word-n-gram data to find n-grams that have the same rune-word-lengths as the n-grams of the LP. Many of those n-grams include Part-of-Speech (POS) tags; we additionally tagged each one according to whether all of its words were in_master_words or not_in_master_words, and recorded its Runeglish rune-count. To be clear, we will explain what we mean by the
Runeglish-rune-count: the number of runes in each word of the phrase. For example:
Patience is a virtue = ᛈᚪᛏᛁᛖᚾᚳᛖ ᛁᛋ ᚪ ᚢᛁᚱᛏᚢᛖ
is a word-4-gram with rune-counts (8, 2, 1, 6).
The word-3-gram rune-counts of this phrase are: (8,2,1), (2,1,6)
The word-2-gram rune-counts of this phrase are: (8,2), (2,1), (1,6)
The word-1-gram rune-counts of this phrase are: (8), (2), (1), (6)
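The rune-counts and their sliding n-gram windows above are mechanical to compute. Here is a minimal sketch (not the tooling we actually used) in Python, where each rune is a single character so per-word counts are just word lengths:

```python
def rune_counts(rune_phrase):
    """Number of runes in each space-separated word of a rune string."""
    return [len(word) for word in rune_phrase.split()]

def ngram_rune_counts(counts, n):
    """All length-n sliding windows over the per-word rune-counts."""
    return [tuple(counts[i:i + n]) for i in range(len(counts) - n + 1)]

counts = rune_counts("ᛈᚪᛏᛁᛖᚾᚳᛖ ᛁᛋ ᚪ ᚢᛁᚱᛏᚢᛖ")  # "patience is a virtue"
print(counts)                        # [8, 2, 1, 6]
print(ngram_rune_counts(counts, 3))  # [(8, 2, 1), (2, 1, 6)]
print(ngram_rune_counts(counts, 2))  # [(8, 2), (2, 1), (1, 6)]
```

The same `ngram_rune_counts` call with n = 4 and n = 1 reproduces the other two lines of the example.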
At this stage we have kept every match from the Google data, no matter which words are in the n-gram: if its Runeglish equivalent matched a pattern of rune-counts found in the LP, it was kept. We anticipate cutting the data further as we find ways of applying it. As well as the 5-grams shared here, we also have 1-, 2-, and 4-grams from the equivalent Google data (available on request). The 3-grams are still being worked on; they are at least an order of magnitude larger and not yet ready to share.
As an indication of the size of the data set, here are some numbers for unique entries (excluding POS tags and letter-case variants) for the master and not_master words:
n-grams | Unique entries | Total counts |
2-grams | 81,973,727 | 1,335,658,428,365 |
4-grams | 197,058,154 | 139,681,716,982 |
5-grams | 43,324,217 | 18,255,827,748 |
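To show how totals like those above can be built up, here is a sketch of the aggregation step over toy data. The word-to-rune-length table and the sample rows are hypothetical stand-ins; the real pipeline derives rune-lengths via the Gematria Primus (where, e.g., "th" becomes a single rune):

```python
from collections import Counter

# Hypothetical Runeglish rune-lengths for a few words (illustration only).
RUNE_LEN = {"patience": 8, "is": 2, "a": 1, "virtue": 6, "the": 2, "key": 3}

# Toy stand-in for Google n-gram rows: (words, occurrence count).
rows = [
    (("patience", "is", "a"), 120),
    (("the", "key", "is"), 300),
    (("is", "a", "virtue"), 80),
]

unique_entries = Counter()  # distinct n-grams per rune-count pattern
total_counts = Counter()    # summed occurrence counts per pattern

for words, count in rows:
    pattern = tuple(RUNE_LEN[w] for w in words)
    unique_entries[pattern] += 1
    total_counts[pattern] += count

print(unique_entries[(8, 2, 1)])  # 1
print(total_counts[(2, 3, 2)])    # 300
```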
A First Application of the Data
Before we look at the actual words in the data set, we can use the counts for each n-gram, grouped by Runeglish-rune-count, to guide us. Some n-gram rune-counts are more common than others, so we can use the frequency of each n-gram rune-count in the data set to estimate how unusual it is. We have started calling these uncommon n-grams hub-phrases. They should be a good place to start looking for possible words, since they have the fewest possibilities in our corpus.
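Finding hub-phrases then reduces to sorting rune-count patterns by how often they occur in the data set. A minimal sketch, using a few illustrative totals from our 5-gram data:

```python
# Total counts per rune-count pattern (values taken from our 5-gram data).
pattern_counts = {
    (4, 3, 3, 11, 5): 154_086,
    (5, 4, 3, 3, 4): 6_443_072,
    (11, 5, 4, 3, 3): 50_310,
}

# Rarest patterns first: these are the hub-phrases to attack first.
hub_phrases = sorted(pattern_counts, key=pattern_counts.get)
print(hub_phrases[0])  # (11, 5, 4, 3, 3) -- the rarest of the three
```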
Worked Example
The first sentence after the red runes on 0.jpg is:
ᚦᛂᚷᚫ, ᚠᛂᛟ, ᚩᚾᚦ, ᚾᛖᚹᛒᚪᛋᛟᛇᛁᛝᚢ, ᚾᚫᚷᛁᚦ, ᚻᛒᚾᛡ, ᛈᛒᚾ, ᛇᛂᚦ, ᚪᛝᚣᛉ, ᛒᛞᛈ, ᛖᛡᚠᛉᚷᚠ, ᛋᛈᛏᚠᛈᚢᛝᚣᛝᛉᛡ, ᚣᚻ, ᛒᚢ, ᚷᚩᛈ, ᛝᚫᚦ, ᛁᚫᚻᛉᚦᛈᚷ, ᚣᚠᛝᚳᛂ, ᚦᚪᛗᛁᛝᛁᛡᚣ, ᚻᛇ, ᛏᚻᚫᛡ, ᛉᚣ, ᛖᚢᛝ, ᚳᚠᚾ, ᛇᚦᛂᛁᚦ, ᚦᛈ, ᚣᛝᛠ, ᚣᚾᛖᚣ, ᛞᛉᛝᚹ, ᛒᚳᛉᛞᛒᚠ, ᛗᛏᚾᛖ, ᛠᛂᚾᛚᚷᛒ, ᛉᚷᚦ
It has these rune-counts: 4, 3, 3, 11, 5, 4, 3, 3, 4, 3, 6, 11, 2, 2, 3, 3, 7, 5, 8, 2, 4, 2, 3, 3, 5, 2, 3, 4, 4, 6, 4, 6, 3
These 5-gram-rune-counts: (4, 3, 3, 11, 5), (3, 3, 11, 5, 4), (3, 11, 5, 4, 3), (11, 5, 4, 3, 3), (5, 4, 3, 3, 4), (4, 3, 3, 4, 3), (3, 3, 4, 3, 6), (3, 4, 3, 6, 11), (4, 3, 6, 11, 2), (3, 6, 11, 2, 2), (6, 11, 2, 2, 3), (11, 2, 2, 3, 3), (2, 2, 3, 3, 7), (2, 3, 3, 7, 5), (3, 3, 7, 5, 8), (3, 7, 5, 8, 2), (7, 5, 8, 2, 4), (5, 8, 2, 4, 2), (8, 2, 4, 2, 3), (2, 4, 2, 3, 3), (4, 2, 3, 3, 5), (2, 3, 3, 5, 2), (3, 3, 5, 2, 3), (3, 5, 2, 3, 4), (5, 2, 3, 4, 4), (2, 3, 4, 4, 6), (3, 4, 4, 6, 4), (4, 4, 6, 4, 6), (4, 6, 4, 6, 3)
And similarly for the 4- and 2-gram rune-counts.
We now use our n-gram data to find the total number of n-grams with rune-counts (4, 3, 3, 11, 5), (3, 3, 11, 5, 4), etc. Here are the first few:
5-gram | (4, 3, 3, 11, 5) | (3, 3, 11, 5, 4) | (3, 11, 5, 4, 3) | (11, 5, 4, 3, 3) | (5, 4, 3, 3, 4) |
Unique entries | 551 | 589 | 435 | 291 | 23,306 |
Counts | 154,086 | 106,203 | 83,199 | 50,310 | 6,443,072 |
This table tells us that a 5-gram with rune-counts (4, 3, 3, 11, 5) is a factor of roughly 42 less probable than one with rune-counts (5, 4, 3, 3, 4), since 6,443,072 / 154,086 ≈ 42. This is useful information and gives us hope that we can guess the start of the sentence, or at least reduce the number of possibilities to something manageable.
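The factor of 42 quoted above comes straight from the Counts row of the table:

```python
# Total counts per 5-gram rune-count pattern, from the table above.
counts = {
    (4, 3, 3, 11, 5): 154_086,
    (3, 3, 11, 5, 4): 106_203,
    (3, 11, 5, 4, 3): 83_199,
    (11, 5, 4, 3, 3): 50_310,
    (5, 4, 3, 3, 4): 6_443_072,
}

# Ratio of the most common pattern here to the one opening the sentence.
factor = counts[(5, 4, 3, 3, 4)] / counts[(4, 3, 3, 11, 5)]
print(round(factor))  # 42
```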
Results & Summary
Here is a spreadsheet with data for all the red runes and the first sentence after the red runes.
This analysis suggests that the easiest sections in which to guess the words will be the red runes and the first few words on 0.jpg, 3.jpg, 15.jpg, and 33.jpg.
We have devised a method to estimate how easy it might be to guess the phrases of the LP. Of course, these estimates are statistical: they are not a guarantee of success, but they do give a valid estimate.