Post by mortlach on May 6, 2017 16:17:35 GMT
The Words Of The Liber Primus II: 5-grams
Previously, we discussed creating a list of words that are probably in the Liber Primus (LP).
Eventually, with help from #cicadasolvers we decided on a long list of about 10,000 ‘master words.’ The chosen words can be found here.
N-grams
Once we have the individual words, we must group them together into phrases, or n-grams. For this study we have considered 5-grams: phrases of 5 words (or punctuation marks). Google has vast amounts of publicly available n-grams taken from books.
We can use this data to find possible n-grams that match the rune-word lengths in the LP, using the procedure outlined below. First, we need to find the word lengths of each 5-gram in the LP.
The 5-grams of the LP
This is how we calculated the 5-grams in the LP:
1) The LP has been split into sections by red runes and the 5-grams for each section calculated separately. 5-grams that cross into the next red-rune section are not included.
2) We have included 5-grams both with and without punctuation:
a. Without punctuation: simply strip all punctuation before finding the 5-grams
b. With punctuation: see Punctuation and “Page Oddities” below
3) Create a list of all the unique n-grams (with and without punctuation). A short sketch of this calculation is given below.
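As a rough illustration of steps 1 and 2a, here is a minimal Python sketch. It assumes one red-rune section is given as a string of rune words separated by the single-dot separator, with punctuation already stripped; the function and variable names are illustrative, not part of any agreed format.

def five_gram_lengths(section_runes):
    # Rune words of one red-rune section, in reading order
    words = [w for w in section_runes.split('.') if w]
    signatures = set()
    # Slide a window of 5 consecutive words over the section
    for i in range(len(words) - 4):
        window = words[i:i + 5]
        # Record the word-length pattern, e.g. (3, 10, 3, 1, 6)
        signatures.add(tuple(len(w) for w in window))
    return signatures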
Punctuation and “Page Oddities”
At the end of this post is a summary of some of the odd typesetting found in the LP. The meanings of the punctuation marks ” and ‘ are obvious. Four dots is an end-of-sentence marker, so likely a . ? or !, and three dots we assume could be a , ; or : (or something else?). In the LP, quotation marks do not have an accompanying three-dots (such as a comma), whereas in ordinary writing a comma would typically be used, so 5-grams both with and without commas have been added, as that is what is mostly found in the Google data. Both British and American conventions for comma placement have been included. The spurious “7” has not been included, and the weird run-on word through red character 3 on 37.jpg line 5 has been assumed to be a typo (if it is not, this is one small difference in thousands of 5-grams).
Google Data Parsing Procedure
We have:
LP_5-gram_lengths = A list of the word lengths of all possible 5-grams in the LP
master_words = A list of our preferred words
We iterate through each n-gram in the google data and test to see if it could be in the LP and if it contains words in master_words. For example:
For each n-gram in google_data:
    Convert n-gram to runes
    word_lengths = number of runes for each word in n-gram
    If word_lengths is in LP_5-gram_lengths:
        Keep n-gram
        If the words in n-gram are in master_words:
            Add flag is_in_masterwords
Here are some Python hints and tips for processing the data in this manner. On a reasonable desktop PC with a disk read/write speed of a few hundred MB/s, we have found that a single process can parse ~5 GB per hour. Faster implementations are certainly possible.
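To make the loop above concrete, here is a minimal sketch (not the exact script we used): it assumes gzipped tab-separated files with the phrase in the first column, and the digraph list is a simplified stand-in for the full Gematria Primus mapping.

import gzip

# Simplified: these letter pairs count as a single rune (per the Gematria Primus)
DIGRAPHS = ("TH", "NG", "EO", "OE", "AE", "IA", "IO", "EA")

def rune_length(word):
    word = word.upper()
    count, i = 0, 0
    while i < len(word):
        # Consume a digraph as one rune where possible, otherwise one letter
        i += 2 if word[i:i + 2] in DIGRAPHS else 1
        count += 1
    return count

def parse_file(path, lp_5gram_lengths, master_words):
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            phrase = line.split("\t", 1)[0]
            words = phrase.split()
            if len(words) != 5:
                continue
            # Could this phrase fit somewhere in the LP?
            lengths = tuple(rune_length(w) for w in words)
            if lengths in lp_5gram_lengths:
                is_in_masterwords = all(w.lower() in master_words for w in words)
                yield phrase, is_in_masterwords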
Results
From hundreds of GB of Google data we have managed to cut it down to about 4 GB: 3 GB of phrases whose words are all in master_words and 1 GB of other phrases.
LINK TO compressed archive
Much of the data has 'Parts Of Speech' tags (tags indicating nouns, verbs, determiners, etc.). The data also includes letter case (upper and lower), so the starts of sentences can be identified by capital letters. Ends of sentences can be found by looking for the punctuation marks. Here is the readme for this data.
There may be duplication of some n-grams, for example with and without 'Parts Of Speech' tags. This can be further processed later.
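One possible way to do that later processing, sketched here on the assumption that POS tags are appended to words with an underscore (e.g. "time_NOUN", as described in the readme):

def strip_pos(phrase):
    # Drop a trailing '_TAG' suffix from each word, if present
    return " ".join(w.rsplit("_", 1)[0] if "_" in w else w for w in phrase.split())

def dedupe(phrases):
    # Keep the first occurrence of each phrase once tags and case are ignored
    seen = set()
    for p in phrases:
        key = strip_pos(p).lower()
        if key not in seen:
            seen.add(key)
            yield p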
Further Studies
Now that we have this raw data, we can look into:
• Finding ways to search it: maybe a database, search functions, etc. (a rough sketch follows this list)
• Sophisticated searching with added cuts (e.g. word 1 is “The”, word 3 IS NOT in a list of words to ignore)
• With this corpus we can implement ways of combining phrases to build longer n-grams (say 10-grams). This can be done in a number of ways based on natural language processing techniques still to be investigated.
• Start trying to find good cribs for the first sentence of each section. Maybe we will find sections that have very few plausible phrases. Having a good guess of the words should help focus the decrypting efforts. (Assuming we can decrypt before finding the deep-web-page… )
• There are 2-, 3- and 4-grams to add. These can be found from the n-gram data.
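As a rough illustration of the kind of searching meant in the first two points, here is a minimal sketch using SQLite; the table layout, column names, and file name are hypothetical, not an agreed format.

import sqlite3

# One possible layout: one row per 5-gram, one column per word position
con = sqlite3.connect("lp_5grams.db")
con.execute("""CREATE TABLE IF NOT EXISTS grams
               (w1 TEXT, w2 TEXT, w3 TEXT, w4 TEXT, w5 TEXT,
                is_in_masterwords INTEGER)""")

# Example cut: word 1 is "the", word 3 is NOT in a small ignore list
ignore = ("of", "and")
rows = con.execute(
    "SELECT w1, w2, w3, w4, w5 FROM grams "
    "WHERE w1 = ? AND w3 NOT IN (?, ?)",
    ("the", *ignore)).fetchall()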
*Comments, questions, suggestions, omissions, etc.? Please try #cicadasolvers.
Page Oddities
Where + represents four dots and :. represents three dots.
4.jpg line 8, apostrophe:
ᛉᛂᛚ.ᚱᛚᚱᛒᚪᛈᛏᛉᛚᛏ.ᛗᛉᛁ'ᚹ+ᛂᛋᛟᛗᚾᚱ
5.jpg lines 5-6, 4 dots split to 3 then 1, separated by line break:
ᚳᚢ.ᚣᛠᚾᛏᚷᚳᚪ.ᛉᛡᛇ.ᚦᛂᚣᛂᛚᛟᛖᛚ.ᚣ:.
.ᛈᛡ.ᛖᚹᛟ.ᛇᚾᚪ.ᚻᛞᛇᛋ.ᚦᚣᛇᚦᛂᚦᚱᚢ.ᚳᛠ
6.jpg line 10, (open) double quotes:
ᚢᚻ.ᛈᚹᛂᛚᚷᛒ.ᛗᚢᛂᛗ.ᛇᚾᛇ+" ᚫᛚᚪᛚᚷᚪ.ᛋ
7.jpg line 1, (close) double quotes:
ᚻᛝ.ᛚᚦᛒ.ᛋᚳᚢᚳᚩᛡ".ᛚᚳᛂ.ᛉᚪᚾᛇᛉ+ᛠ
10.jpg line 8, non-rune character 7:
ᛉᚩ.ᛇᛁᛡᚠᛟᛒᚦᚠ.ᛋᛒ.ᚠᛞᛇ.ᚩᚦᛏ.7.ᚷ.ᛚᛂᛖᚫ
11.jpg lines 11-12, split 4 dots, 3 then 1, separated by line break:
ᛋᚢᛚᚪᛈᚢᚳᛖᚠᛞᛉ.ᚦᛠᛇᛝᚻ.ᚣᚱᛗ.ᛟᚾᛚ:.
.ᛈᚹᛞᚱᛂ.ᚪᛝᛞ.ᛁᚦᛏᚷᚢᚹᚳᚻᛖᚩᚪᛖ.ᛉᚪᚢ.
20.jpg lines 6-7, split 4 dots, 1 then 3, separated by line break:
ᛗ.ᛝᚦ.ᚣᛈᚠ.ᚪᛞᛚᚪᛖᛚᚩ.ᚱᚷ.ᛚᚳᛇᛏᚷᚣᛟᛗ.
:.ᚪᛁ.ᚷᛂᛒᛡᛗ.ᛞᛈᚪᚳᛠᚷᛋ.ᛏᛈ.ᚩᛋᛏᛗᚱᚣ
21.jpg line 2, apostrophe:
ᛈ.ᚻᚢᚻᚾᛏᚫᛒᛇᚩᛁᛈ.ᚫᚩ'ᚣ.ᛡᚣᛗᚷ.ᚠᚱᛡᛚ
22.jpg line 4, (open) double quotes:
ᚪᚹᛒ.ᚹᛒᛗᚱᚾᛗᚻᛗᛁᚾᚪᛞ."ᛡᛖᚩ.ᚾᚹᛡ.ᚢᛂ
22.jpg line 5, (close) double quotes:
ᚦᛠ.ᛚᚳᚷᛚᛇ.ᛟᛠᛠᚪ."+ᛇᛉᚣᚪ.ᚷᛏᚩ.ᛖ
35.jpg line 4, apostrophe:
ᚦᛚᚫ.ᚳᛋᚪᚩᚷᚹᛚ+ᛈᛖ'ᛏ.ᛂᛉᛝᛚ.ᛏᛉᚩᚣᛝ
36.jpg line 6, red character 1:
1 ᛚᚦᛇᛟ.ᚪᚫᛠ.ᛗᛉᚻᚳᛉᚪᛏᚦ.ᚫᛉ.ᚩᛋᚳᛞ
37.jpg line 1, red character 2:
2 ᚾᚣᛖᛉ.ᚾᚢᛉᛁ.ᛝᛏᛈᚹᛋᚣ.ᛏᛠᛈᛉ.ᚪᛁ
37.jpg line 5, red character 3 (especially odd, no punctuation before):
.ᚠᛋᛝᛞ.ᚳᚪᚱᛒ.ᚹᛈ.ᚾᚩᚦᚳᚦᚾᛗᚩᛖ.ᚣᛇᚾ.ᚠᛒ
3 ᛞᚢᛈ.ᚹᚾᛖᚪ.ᚱᛚᛁᚹ.ᚫᛉ.ᛝᚠᛞᚪᚠ.ᛒᛂᛉ.ᛞ
37.jpg line 10, red character 4:
4 ᛝᛂᛋᛂᛗᚱᛗ.ᚾᛒᛋᛗᛉᛞᚻᛉᛁ.ᚣᛡᚻᚣ
38.jpg line 2, 3 dots:
ᚫᚢᛠᚢᚪ:.ᚾᛝᚳ.ᛖᛈᚹᛉ.ᚢᛉᚫ.ᚾᛈᚳᚻᚱᚣ
38.jpg line 4, red character 5:
4 ᛝᛂᛋᛂᛗᚱᛗ.ᚾᛒᛋᛗᛉᛞᚻᛉᛁ.ᚣᛡᚻᚣ
40.jpg line 4, (open) double quotes:
ᚣᛗᚢᛠ.ᚱᛁᚢᛟᛂᛁ.ᛗᛖᚫ.ᚱᛋᛉᛝ+"ᛠᛈᛚ.
40.jpg line 5, (close) double quotes:
ᛞᚩᛚᛁᛉᛠᛝᛖᚱ".ᚾᛈᛖᚹᛡ.ᚾᛂᛏᚣ+ᛋᚩᛋ
40.jpg line 9, (open) double quotes:
ᚾᛏ."ᚠᛞᚱᛠᚷ.ᛈᚩᛇᚩᛗᛠᛒ.ᛂᛡ.ᛋᛗᚠ.ᛏ
40.jpg line 11, (close) double quotes:
ᚠᛉᚠᛏ.ᚳᛋᛉᛟ.ᚷᚠᛉᚾᛞ.ᛒᛏᛠᛡ"+ᛈᛡ
41.jpg line 10, apostrophe:
.ᛇᚣᛡ.ᚩᛉ.ᚪᛋᚣᛁᛝ.ᛉᛚᛂ'ᚳ.ᛖᚣᚢᛝᚦᛇᚱ.
42.jpg line 3, (open) double quotes:
ᚢᚳᚷ.ᛚᛏᛂᚹᛈ.ᚫᛗᛚ.ᛉᛚᛗᛏᛞᚠᛈᛁ+"ᚠᚳᚦ
42.jpg line 6, (close) double quotes and (open) double quotes:
ᚣᛁᛉᛇᚻᚩᛠ.ᚫᚻᛡᛝᛠᚦ.ᚾᚣ.ᚾᚠᛁᛝ"+"ᛏ
43.jpg line 1, (close) double quotes:
ᚻᛂ.ᛁᛖᛏᛡ.ᚷᛁᚩᚾ.ᚳᚢᚫᛗᛈᛋᚪᛡ.ᚷᛚᚣᚹᛟ".
46.jpg line 5, 3 dots:
ᚩᚳᛞᛒ.ᛟᚩᛠᚾᚠᚪ:.ᛚᛗᛖᛁᚦᚫᚪᛡᛂᛁᚪᚱ.ᚦᚱᛖ
53.jpg line 3, (open) double quotes:
ᛁᚢᛂ.ᚣᛡᛚᚢ.ᛂᛟ+"ᛠᛉᚣᛇᚱ.ᚩᛈᛋᚳᚫᛗ
53.jpg line 6, (close) double quotes:
ᛞᛖᛏ.ᚹᛉᛞᛚ.ᚩᚫᛂ.ᛇᚢᛒ".ᛗᛏ.ᛞᛗᛖ+ᛏ