Tokenization by the good guys
بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ
We need to turn characters into numbers. We can do that with Unicode like this.
Unicode
On Ubuntu Linux, press [CTRL][ALT][T] to open a terminal and start Python:
schmuck@Schmoe:~$ python3
At the >>> prompt, type:
ord('h')
‘h’ has the Unicode code point 104. ord can only take a single character, so to get the code points of many characters we loop over them:
[ord(x) for x in "إِنَّ اللَّهَ اصْطَفَىٰ آدَمَ وَنُوحًا وَآلَ إِبْرَاهِيمَ وَآلَ عِمْرَانَ"]
[1573, 1616, 1606, 1617, 1614, 32, 1575, 1604, 1604, 1617, 1614, 1607, 1614, 32, 1575, 1589, 1618, 1591, 1614, 1601, 1614, 1609, 1648, 32, 1570, 1583, 1614, 1605, 1614, 32, 1608, 1614, 1606, 1615, 1608, 1581, 1611, 1575, 32, 1608, 1614, 1570, 1604, 1614, 32, 1573, 1616, 1576, 1618, 1585, 1614, 1575, 1607, 1616, 1610, 1605, 1614, 32, 1608, 1614, 1570, 1604, 1614, 32, 1593, 1616, 1605, 1618, 1585, 1614, 1575, 1606, 1614]
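As a quick check, chr is the inverse of ord, so a list of code points can be turned back into the original string. This is a minimal sketch; the variable names are ours and do not come from the text above.

```python
# ord maps one character to its code point; chr maps a code point back.
sample = "hello"
code_points = [ord(ch) for ch in sample]
restored = "".join(chr(cp) for cp in code_points)

print(code_points)
print(restored == sample)
```

The same round trip works for the Arabic text, because every character there also has exactly one code point.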
But the Unicode standard keeps changing as new characters are added, so raw code points are not a very stable base for us. Instead we can use UTF-8, an encoding of Unicode that turns each character into a sequence of one to four bytes, like this:
"بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ".encode("utf-8")
b'\xd8\xa8\xd9\x90\xd8\xb3\xd9\x92\xd9\x85\xd9\x90 \xd8\xa7\xd9\x84\xd9\x84\xd9\x91\xd9\x8e\xd9\x87\xd9\x90 \xd8\xa7\xd9\x84\xd8\xb1\xd9\x91\xd9\x8e\xd8\xad\xd9\x92\xd9\x85\xd9\x8e\xd9\xb0\xd9\x86\xd9\x90 \xd8\xa7\xd9\x84\xd8\xb1\xd9\x91\xd9\x8e\xd8\xad\xd9\x90\xd9\x8a\xd9\x85\xd9\x90'
That is not very pretty, so we can turn it into a list of numbers to work with, like this:
list("بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ".encode("utf-8"))
[216, 168, 217, 144, 216, 179, 217, 146, 217, 133, 217, 144, 32, 216, 167, 217, 132, 217, 132, 217, 145, 217, 142, 217, 135, 217, 144, 32, 216, 167, 217, 132, 216, 177, 217, 145, 217, 142, 216, 173, 217, 146, 217, 133, 217, 142, 217, 176, 217, 134, 217, 144, 32, 216, 167, 217, 132, 216, 177, 217, 145, 217, 142, 216, 173, 217, 144, 217, 138, 217, 133, 217, 144]
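To see where those numbers come from, the sketch below encodes a few single characters one at a time. ASCII characters take one byte in UTF-8, while the Arabic letters and vowel marks used here take two. The escapes \u0628 and \u0650 are the first two characters of the phrase above (the letter ba and the kasra mark), written as escapes so they are easy to type; the helper name utf8_bytes is our own.

```python
def utf8_bytes(ch):
    """Return the UTF-8 byte values of a single character as a list."""
    return list(ch.encode("utf-8"))

# Print each character's code point next to its UTF-8 bytes.
for ch in ["h", " ", "\u0628", "\u0650"]:
    print(ord(ch), utf8_bytes(ch))

# Bytes decode back to text, so nothing is lost by working with bytes.
round_trip = bytes([216, 168, 217, 144]).decode("utf-8")
print(round_trip == "\u0628\u0650")
```

This is why the byte list is longer than the code point list: most characters in the phrase are stored as two bytes each.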
Getting the data
Go to https://tanzil.net/download/ and choose ‘Uthmani’ under ‘Quran text type’ and ‘Text’ under ‘Output file format’, tick all boxes except ‘Include sequential tanweens’, then click Download to get the file. Then open it in python3:
text = open("quran-uthmani.txt", 'r').read()
print(text)
We can encode in UTF-8 and get some ugly binary:
tokens = text.encode("utf-8")
print(tokens)
A neater way is to turn it into a list of byte values in the 0-255 range:
tokens = list(map(int, tokens))
quit()
to get out of >>> and back to $
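The loading steps in this section can also be wrapped in a small function. This is a sketch under a few assumptions: the name load_tokens is ours, encoding="utf-8" is passed explicitly so the result does not depend on the system default, and a tiny temporary file stands in for quran-uthmani.txt so the block runs on its own.

```python
import tempfile
from pathlib import Path

def load_tokens(path):
    """Read a UTF-8 text file; return the text and its byte values (0-255)."""
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
    # Iterating over a bytes object already yields ints, so list() is enough.
    return text, list(text.encode("utf-8"))

# Stand-in file (hypothetical); swap in "quran-uthmani.txt" for the real data.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.txt"
    demo_path.write_text("\u0628\u0650\u0633\u0652\u0645\u0650", encoding="utf-8")
    demo_text, demo_tokens = load_tokens(demo_path)

print(len(demo_text), len(demo_tokens))
```

Using with closes the file as soon as it has been read, which the one-line open(...).read() leaves to Python to do later.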
Google Colab
If you have a Google account, or a throwaway SIM card to get a new Google account using a dumb phone for SMS verification, you can start using Google’s Colab for free:
https://colab.research.google.com/
Yes, it is free: if a payment screen appears, click away from it to get the free lab.
Adding your file to Google Colab
Click the file icon, then the dog-eared page with an up arrow, and add your quran-uthmani.txt file from ‘Getting the data’ (above).
The file is deleted when the session ends, so you have to add it again every time you start a Colab session; keep a copy saved somewhere.
Our actor example applied to Python
So, we talked earlier about Feynman and James Clear’s ‘Atomic Habits’ and how actors learn their script.
We are now going to use that to learn our own ‘scripting’, that is learn programming:
*Type everything out and never ever copy and paste code. And never say never.*
We said earlier that we won’t type or write by going back and forth, but in this case we will, as long as we type everything out ourselves. Typing puts the code into our memory, so we know where to find it when we need it in the future. It is also faster: searching for something we saw somewhere months ago takes longer than trusting muscle memory and just typing it.
We say ‘never say never’ because sometimes there is no point in typing out long stretches of ‘boilerplate’ code, meaning code that is always used as-is everywhere. In those rare cases, copy and paste.
Getting the most common pairs
Type in Google Colab:
text = open("quran-uthmani.txt", 'r').read()
and then press [SHIFT][ENTER] to run the ‘code cell’.
This reads as: open() "quran-uthmani.txt" as read-only 'r', read() it, and save as text.
tokens = text.encode("utf-8")
Take text, encode() it with "utf-8", and save as tokens.
tokens = list(map(int, tokens))
def get_stats(ids):
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts
stats = get_stats(tokens)
print(stats)
{(216, 168): 11603, (168, 217): 11593, (217, 144): 46642, (144, 216): 7778, (216, 179): 6122, (179, 217): 6122, (217, 146): 37372, (146, 217): 14675, (217, 133): 27071, (133, 217): 25740, (144, 32): 7712, (32, 217): 45082, (217, 177): 13819, (177, 217): 25239, (217, 132): 38550, (132, 217): 36124, (217, 145): 23016, (145, 217): 23016, (217, 142): 123396, (142, 217): 53930, (217, 135): 14962, (135, 217): 14961, (132, 216): 2316, (216, 177): 12627, (142, 216): 50698, (216, 173): 4364, (173, 217): 4364, (217, 128): 6848, (128, 217): 6808, (217, 176): 9838, (176, 217): 10000, (217, 134): 27380, (134, 217): 22530, (144, 217): 29600, (217, 138): 18334, (138, 217): 16706, (144, 10): 595, (10, 217): 4481, (146, 216): 13852, (216, 175): 5991, (175, 217): 5945, (217, 143): 37320, (143, 32): 6675, (32, 216): 26762, (216, 185): 9405, (185, 217): 9403, (142, 10): 2843, (217, 131): 10497, (131, 217): 10497, (217, 136): 24970, (136, 217): 20377, (10, 216): 1555, (216, 165): 5088, (165, 217): 5260, (216, 167): 25184, (167, 217): 6829, (142, 32): 15924, (143, 216): 5806, (216, 170): 10520, (170, 217): 10504, (143, 10): 331, (167, 32): 10720, (216, 181): 2074, (181, 217): 2071, (176, 216): 3255, (216, 183): 1273, (183, 217): 1266, (217, 130): 7034, (130, 217): 7034, (216, 176): 4932, (216, 163): 8900, (163, 217): 8901, (146, 32): 8751, (216, 186): 1221, (186, 217): 1221, (216, 182): 1686, (182, 217): 1686, (143, 217): 23252, (136, 216): 4289, (217, 147): 5376, (147, 217): 90, (147, 10): 76, (32, 219): 4379, (219, 155): 12, (155, 32): 12, (217, 129): 8747, (129, 217): 8746, (217, 139): 3741, (139, 217): 93, (217, 137): 6603, (137, 32): 3035, (216, 164): 706, (164, 217): 706, (216, 169): 2344, (169, 217): 2344, (216, 178): 1599, (178, 217): 1599, (147, 32): 2459, (134, 216): 1499, (134, 32): 3081, (217, 148): 773, (148, 217): 773, (167, 216): 2953, (216, 174): 2497, (174, 217): 2497, (136, 219): 255, (219, 159): 3988, (159, 217): 268, (147, 216): 2751, (216, 166): 921, (166, 217): 
1085, (137, 217): 3531, (176, 32): 1337, (219, 150): 1682, (150, 32): 1682, (167, 219): 3789, (159, 32): 3704, (216, 161): 2782, (161, 217): 2782, (217, 140): 2519, (140, 32): 1777, (216, 180): 2124, (180, 217): 2124, (216, 184): 853, (184, 217): 853, (140, 10): 605, (133, 32): 1328, (139, 216): 2976, (140, 219): 134, (219, 162): 510, (162, 32): 338, (219, 151): 603, (151, 32): 603, (177, 216): 1197, (170, 32): 17, (216, 172): 3317, (172, 217): 3317, (216, 171): 1414, (171, 217): 1414, (143, 219): 1256, (219, 165): 1257, (165, 32): 1042, (217, 141): 2633, (141, 32): 2080, (219, 154): 1972, (154, 32): 1972, (138, 216): 1618, (139, 32): 556, (144, 219): 957, (219, 166): 957, (166, 32): 791, (219, 153): 68, (153, 32): 68, (10, 219): 199, (219, 158): 199, (158, 32): 199, (219, 152): 22, (152, 32): 22, (134, 219): 270, (162, 216): 158, (132, 32): 110, (141, 10): 454, (141, 219): 99, (219, 173): 99, (173, 32): 84, (168, 32): 9, (136, 32): 49, (128, 219): 40, (219, 167): 38, (139, 219): 106, (175, 216): 38, (181, 219): 3, (219, 156): 7, (156, 217): 2, (165, 216): 19, (175, 32): 8, (219, 160): 66, (160, 32): 62, (159, 216): 14, (177, 32): 9, (138, 219): 10, (167, 10): 931, (159, 10): 2, (140, 216): 3, (162, 10): 14, (183, 216): 7, (137, 219): 1, (171, 32): 1, (219, 169): 15, (169, 10): 15, (177, 219): 1, (219, 170): 1, (142, 219): 1, (219, 171): 1, (129, 32): 1, (156, 10): 2, (137, 10): 36, (185, 32): 2, (135, 10): 1, (176, 10): 178, (146, 10): 94, (219, 168): 1, (168, 216): 1, (160, 10): 4, (173, 10): 15, (156, 32): 3, (219, 172): 1, (172, 216): 1, (133, 10): 3, (219, 163): 1, (139, 10): 10, (165, 10): 24, (166, 10): 2, (168, 10): 1, (10, 10): 2, (10, 35): 28, (35, 32): 18, (32, 80): 6, (80, 76): 1, (76, 69): 1, (69, 65): 1, (65, 83): 1, (83, 69): 2, (69, 32): 3, (32, 68): 1, (68, 79): 1, (79, 32): 1, (32, 78): 2, (78, 79): 2, (79, 84): 2, (84, 32): 4, (32, 82): 1, (82, 69): 1, (69, 77): 1, (77, 79): 1, (79, 86): 1, (86, 69): 1, (32, 79): 2, (79, 82): 1, (82, 32): 1, (32, 
67): 6, (67, 72): 2, (72, 65): 2, (65, 78): 2, (78, 71): 3, (71, 69): 1, (32, 84): 9, (84, 72): 1, (72, 73): 1, (73, 83): 2, (83, 32): 3, (67, 79): 1, (79, 80): 1, (80, 89): 1, (89, 82): 1, (82, 73): 1, (73, 71): 1, (71, 72): 1, (72, 84): 1, (32, 66): 1, (66, 76): 1, (76, 79): 2, (79, 67): 1, (67, 75): 1, (75, 10): 1, (35, 61): 2, (61, 61): 134, (61, 10): 2, (35, 10): 8, (32, 32): 29, (84, 97): 4, (97, 110): 19, (110, 122): 6, (122, 105): 6, (105, 108): 7, (108, 32): 9, (32, 81): 3, (81, 117): 3, (117, 114): 4, (114, 97): 5, (110, 32): 11, (84, 101): 1, (101, 120): 6, (120, 116): 6, (116, 32): 9, (32, 40): 3, (40, 85): 1, (85, 116): 1, (116, 104): 6, (104, 109): 1, (109, 97): 2, (110, 105): 3, (105, 44): 1, (44, 32): 6, (32, 86): 1, (86, 101): 1, (101, 114): 7, (114, 115): 2, (115, 105): 3, (105, 111): 5, (111, 110): 9, (32, 49): 1, (49, 46): 1, (46, 49): 1, (49, 41): 1, (41, 10): 1, (67, 111): 2, (111, 112): 7, (112, 121): 4, (121, 114): 2, (114, 105): 7, (105, 103): 3, (103, 104): 3, (104, 116): 3, (40, 67): 1, (67, 41): 1, (41, 32): 2, (32, 50): 1, (50, 48): 2, (48, 48): 1, (48, 55): 1, (55, 45): 1, (45, 50): 1, (48, 50): 1, (50, 52): 1, (52, 32): 1, (80, 114): 3, (114, 111): 9, (111, 106): 3, (106, 101): 3, (101, 99): 5, (99, 116): 3, (116, 10): 1, (32, 76): 1, (76, 105): 1, (105, 99): 4, (99, 101): 5, (101, 110): 2, (110, 115): 2, (115, 101): 4, (101, 58): 1, (58, 32): 2, (67, 114): 1, (114, 101): 4, (101, 97): 3, (97, 116): 11, (116, 105): 9, (105, 118): 2, (118, 101): 5, (101, 32): 13, (111, 109): 2, (109, 109): 1, (109, 111): 2, (115, 32): 17, (32, 65): 2, (65, 116): 1, (116, 116): 2, (116, 114): 3, (105, 98): 2, (98, 117): 3, (117, 116): 3, (32, 51): 1, (51, 46): 1, (46, 48): 1, (48, 10): 1, (84, 104): 3, (104, 105): 6, (105, 115): 12, (32, 99): 12, (99, 111): 7, (121, 32): 9, (32, 111): 8, (111, 102): 6, (102, 32): 6, (32, 116): 16, (104, 101): 3, (116, 101): 12, (32, 105): 10, (99, 97): 4, (97, 114): 2, (101, 102): 1, (102, 117): 1, (117, 108): 1, (108, 
108): 5, (108, 121): 5, (32, 112): 3, (112, 114): 5, (111, 100): 2, (100, 117): 2, (117, 99): 2, (101, 100): 10, (100, 44): 2, (32, 104): 2, (104, 108): 1, (32, 10): 7, (32, 118): 3, (105, 102): 1, (102, 105): 2, (105, 101): 3, (100, 32): 12, (32, 97): 13, (110, 100): 5, (110, 116): 4, (105, 110): 9, (110, 117): 1, (117, 111): 1, (111, 117): 3, (117, 115): 3, (115, 108): 1, (32, 109): 2, (105, 116): 3, (116, 111): 5, (111, 114): 4, (32, 98): 5, (98, 121): 1, (97, 32): 2, (32, 103): 2, (103, 114): 2, (117, 112): 3, (112, 32): 1, (32, 115): 5, (115, 112): 1, (112, 101): 1, (99, 105): 1, (105, 97): 3, (97, 108): 6, (108, 105): 3, (115, 116): 3, (116, 115): 2, (116, 46): 2, (46, 10): 4, (84, 69): 1, (69, 82): 1, (82, 77): 1, (77, 83): 1, (79, 70): 1, (70, 32): 1, (32, 85): 1, (85, 83): 1, (69, 58): 1, (58, 10): 1, (32, 45): 3, (45, 32): 3, (80, 101): 1, (114, 109): 1, (109, 105): 1, (115, 115): 1, (111, 32): 4, (32, 100): 2, (100, 105): 2, (114, 98): 2, (98, 97): 2, (105, 109): 2, (109, 32): 3, (112, 105): 2, (101, 115): 6, (116, 44): 2, (71, 73): 1, (73, 78): 1, (71, 32): 1, (32, 73): 2, (73, 84): 1, (65, 76): 1, (76, 76): 1, (79, 87): 1, (87, 69): 1, (69, 68): 1, (68, 46): 1, (98, 101): 3, (32, 117): 3, (110, 121): 1, (32, 119): 1, (119, 101): 1, (101, 98): 1, (98, 115): 2, (114, 32): 2, (97, 112): 2, (112, 112): 2, (112, 108): 1, (110, 44): 1, (111, 118): 1, (118, 105): 1, (105, 100): 1, (100, 101): 4, (104, 97): 4, (115, 111): 1, (114, 99): 1, (40, 84): 1, (116, 41): 1, (99, 108): 2, (108, 101): 4, (114, 108): 1, (32, 108): 1, (110, 107): 1, (107, 32): 3, (97, 100): 1, (116, 97): 4, (108, 46): 2, (46, 110): 2, (110, 101): 2, (101, 116): 2, (32, 101): 1, (110, 97): 1, (97, 98): 1, (98, 108): 1, (32, 107): 1, (107, 101): 1, (101, 101): 1, (101, 112): 2, (112, 10): 1, (97, 99): 1, (99, 107): 2, (99, 104): 2, (110, 103): 2, (103, 101): 1, (115, 46): 1, (32, 110): 1, (110, 111): 1, (111, 116): 1, (115, 104): 2, (110, 99): 1, (108, 117): 1, (117, 100): 1, (32, 114): 1, 
(101, 108): 1, (32, 102): 2, (102, 114): 1, (97, 105): 1, (103, 32): 1, (115, 117): 1, (117, 98): 1, (112, 111): 1, (114, 116): 1, (80, 108): 1, (97, 115): 1, (112, 100): 2, (100, 97): 2, (116, 58): 1, (116, 112): 1, (112, 58): 1, (58, 47): 1, (47, 47): 1, (47, 116): 1, (116, 47): 1, (47, 117): 1, (115, 47): 1, (47, 10): 1}
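The same pair counting can be written with collections.Counter from the standard library. This is an alternative sketch, not the version used in the rest of this walkthrough; the names get_stats_counter, demo_ids and demo_counts are ours, and a short list stands in for the real tokens.

```python
from collections import Counter

def get_stats_counter(ids):
    """Count how often each adjacent pair occurs in a list of token ids."""
    # zip(ids, ids[1:]) yields (ids[0], ids[1]), (ids[1], ids[2]), ...
    return Counter(zip(ids, ids[1:]))

demo_ids = [1, 2, 3, 1, 2]  # stand-in data, not the real tokens
demo_counts = get_stats_counter(demo_ids)

print(demo_counts[(1, 2)])
print(demo_counts.most_common(1))
```

A Counter behaves like a dict, so max(stats, key=stats.get) still works on it, and most_common(n) gives the n most frequent pairs without sorting by hand.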
The most common pair is …
print(sorted(((v,k) for k,v in stats.items()), reverse=True))
This is sorted by value v, with reverse=True, to get the most common pairs first.
[(123396, (217, 142)), (53930, (142, 217)), (50698, (142, 216)), (46642, (217, 144)), (45082, (32, 217)), (38550, (217, 132)), (37372, (217, 146)), (37320, (217, 143)), (36124, (132, 217)), (29600, (144, 217)), (27380, (217, 134)), (27071, (217, 133)), (26762, (32, 216)), (25740, (133, 217)), (25239, (177, 217)), (25184, (216, 167)), (24970, (217, 136)), (23252, (143, 217)), (23016, (217, 145)), (23016, (145, 217)), (22530, (134, 217)), (20377, (136, 217)), (18334, (217, 138)), (16706, (138, 217)), (15924, (142, 32)), (14962, (217, 135)), (14961, (135, 217)), (14675, (146, 217)), (13852, (146, 216)), (13819, (217, 177)), (12627, (216, 177)), (11603, (216, 168)), (11593, (168, 217)), (10720, (167, 32)), (10520, (216, 170)), (10504, (170, 217)), (10497, (217, 131)), (10497, (131, 217)), (10000, (176, 217)), (9838, (217, 176)), (9405, (216, 185)), (9403, (185, 217)), (8901, (163, 217)), (8900, (216, 163)), (8751, (146, 32)), (8747, (217, 129)), (8746, (129, 217)), (7778, (144, 216)), (7712, (144, 32)), (7034, (217, 130)), (7034, (130, 217)), (6848, (217, 128)), (6829, (167, 217)), (6808, (128, 217)), (6675, (143, 32)), (6603, (217, 137)), (6122, (216, 179)), (6122, (179, 217)), (5991, (216, 175)), (5945, (175, 217)), (5806, (143, 216)), (5376, (217, 147)), (5260, (165, 217)), (5088, (216, 165)), (4932, (216, 176)), (4481, (10, 217)), (4379, (32, 219)), (4364, (216, 173)), (4364, (173, 217)), (4289, (136, 216)), (3988, (219, 159)), (3789, (167, 219)), (3741, (217, 139)), (3704, (159, 32)), (3531, (137, 217)), (3317, (216, 172)), (3317, (172, 217)), (3255, (176, 216)), (3081, (134, 32)), (3035, (137, 32)), (2976, (139, 216)), (2953, (167, 216)), (2843, (142, 10)), (2782, (216, 161)), (2782, (161, 217)), (2751, (147, 216)), (2633, (217, 141)), (2519, (217, 140)), (2497, (216, 174)), (2497, (174, 217)), (2459, (147, 32)), (2344, (216, 169)), (2344, (169, 217)), (2316, (132, 216)), (2124, (216, 180)), (2124, (180, 217)), (2080, (141, 32)), (2074, (216, 181)), (2071, (181, 
217)), (1972, (219, 154)), (1972, (154, 32)), (1777, (140, 32)), (1686, (216, 182)), (1686, (182, 217)), (1682, (219, 150)), (1682, (150, 32)), (1618, (138, 216)), (1599, (216, 178)), (1599, (178, 217)), (1555, (10, 216)), (1499, (134, 216)), (1414, (216, 171)), (1414, (171, 217)), (1337, (176, 32)), (1328, (133, 32)), (1273, (216, 183)), (1266, (183, 217)), (1257, (219, 165)), (1256, (143, 219)), (1221, (216, 186)), (1221, (186, 217)), (1197, (177, 216)), (1085, (166, 217)), (1042, (165, 32)), (957, (219, 166)), (957, (144, 219)), (931, (167, 10)), (921, (216, 166)), (853, (216, 184)), (853, (184, 217)), (791, (166, 32)), (773, (217, 148)), (773, (148, 217)), (706, (216, 164)), (706, (164, 217)), (605, (140, 10)), (603, (219, 151)), (603, (151, 32)), (595, (144, 10)), (556, (139, 32)), (510, (219, 162)), (454, (141, 10)), (338, (162, 32)), (331, (143, 10)), (270, (134, 219)), (268, (159, 217)), (255, (136, 219)), (199, (219, 158)), (199, (158, 32)), (199, (10, 219)), (178, (176, 10)), (158, (162, 216)), (134, (140, 219)), (134, (61, 61)), (110, (132, 32)), (106, (139, 219)), (99, (219, 173)), (99, (141, 219)), (94, (146, 10)), (93, (139, 217)), (90, (147, 217)), (84, (173, 32)), (76, (147, 10)), (68, (219, 153)), (68, (153, 32)), (66, (219, 160)), (62, (160, 32)), (49, (136, 32)), (40, (128, 219)), (38, (219, 167)), (38, (175, 216)), (36, (137, 10)), (29, (32, 32)), (28, (10, 35)), (24, (165, 10)), (22, (219, 152)), (22, (152, 32)), (19, (165, 216)), (19, (97, 110)), (18, (35, 32)), (17, (170, 32)), (17, (115, 32)), (16, (32, 116)), (15, (219, 169)), (15, (173, 10)), (15, (169, 10)), (14, (162, 10)), (14, (159, 216)), (13, (101, 32)), (13, (32, 97)), (12, (219, 155)), (12, (155, 32)), (12, (116, 101)), (12, (105, 115)), (12, (100, 32)), (12, (32, 99)), (11, (110, 32)), (11, (97, 116)), (10, (139, 10)), (10, (138, 219)), (10, (101, 100)), (10, (32, 105)), (9, (177, 32)), (9, (168, 32)), (9, (121, 32)), (9, (116, 105)), (9, (116, 32)), (9, (114, 111)), (9, (111, 
110)), (9, (108, 32)), (9, (105, 110)), (9, (32, 84)), (8, (175, 32)), (8, (35, 10)), (8, (32, 111)), (7, (219, 156)), (7, (183, 216)), (7, (114, 105)), (7, (111, 112)), (7, (105, 108)), (7, (101, 114)), (7, (99, 111)), (7, (32, 10)), (6, (122, 105)), (6, (120, 116)), (6, (116, 104)), (6, (111, 102)), (6, (110, 122)), (6, (104, 105)), (6, (102, 32)), (6, (101, 120)), (6, (101, 115)), (6, (97, 108)), (6, (44, 32)), (6, (32, 80)), (6, (32, 67)), (5, (118, 101)), (5, (116, 111)), (5, (114, 97)), (5, (112, 114)), (5, (110, 100)), (5, (108, 121)), (5, (108, 108)), (5, (105, 111)), (5, (101, 99)), (5, (99, 101)), (5, (32, 115)), (5, (32, 98)), (4, (160, 10)), (4, (117, 114)), (4, (116, 97)), (4, (115, 101)), (4, (114, 101)), (4, (112, 121)), (4, (111, 114)), (4, (111, 32)), (4, (110, 116)), (4, (108, 101)), (4, (105, 99)), (4, (104, 97)), (4, (100, 101)), (4, (99, 97)), (4, (84, 97)), (4, (84, 32)), (4, (46, 10)), (3, (181, 219)), (3, (156, 32)), (3, (140, 216)), (3, (133, 10)), (3, (117, 116)), (3, (117, 115)), (3, (117, 112)), (3, (116, 114)), (3, (115, 116)), (3, (115, 105)), (3, (111, 117)), (3, (111, 106)), (3, (110, 105)), (3, (109, 32)), (3, (108, 105)), (3, (107, 32)), (3, (106, 101)), (3, (105, 116)), (3, (105, 103)), (3, (105, 101)), (3, (105, 97)), (3, (104, 116)), (3, (104, 101)), (3, (103, 104)), (3, (101, 97)), (3, (99, 116)), (3, (98, 117)), (3, (98, 101)), (3, (84, 104)), (3, (83, 32)), (3, (81, 117)), (3, (80, 114)), (3, (78, 71)), (3, (69, 32)), (3, (45, 32)), (3, (32, 118)), (3, (32, 117)), (3, (32, 112)), (3, (32, 81)), (3, (32, 45)), (3, (32, 40)), (2, (185, 32)), (2, (166, 10)), (2, (159, 10)), (2, (156, 217)), (2, (156, 10)), (2, (121, 114)), (2, (117, 99)), (2, (116, 116)), (2, (116, 115)), (2, (116, 46)), (2, (116, 44)), (2, (115, 104)), (2, (114, 115)), (2, (114, 98)), (2, (114, 32)), (2, (112, 112)), (2, (112, 105)), (2, (112, 100)), (2, (111, 109)), (2, (111, 100)), (2, (110, 115)), (2, (110, 103)), (2, (110, 101)), (2, (109, 111)), (2, (109, 
97)), (2, (108, 46)), (2, (105, 118)), (2, (105, 109)), (2, (105, 98)), (2, (103, 114)), (2, (102, 105)), (2, (101, 116)), (2, (101, 112)), (2, (101, 110)), (2, (100, 117)), (2, (100, 105)), (2, (100, 97)), (2, (100, 44)), (2, (99, 108)), (2, (99, 107)), (2, (99, 104)), (2, (98, 115)), (2, (98, 97)), (2, (97, 114)), (2, (97, 112)), (2, (97, 32)), (2, (83, 69)), (2, (79, 84)), (2, (78, 79)), (2, (76, 79)), (2, (73, 83)), (2, (72, 65)), (2, (67, 111)), (2, (67, 72)), (2, (65, 78)), (2, (61, 10)), (2, (58, 32)), (2, (50, 48)), (2, (46, 110)), (2, (41, 32)), (2, (35, 61)), (2, (32, 109)), (2, (32, 104)), (2, (32, 103)), (2, (32, 102)), (2, (32, 100)), (2, (32, 79)), (2, (32, 78)), (2, (32, 73)), (2, (32, 65)), (2, (10, 10)), (1, (219, 172)), (1, (219, 171)), (1, (219, 170)), (1, (219, 168)), (1, (219, 163)), (1, (177, 219)), (1, (172, 216)), (1, (171, 32)), (1, (168, 216)), (1, (168, 10)), (1, (142, 219)), (1, (137, 219)), (1, (135, 10)), (1, (129, 32)), (1, (119, 101)), (1, (118, 105)), (1, (117, 111)), (1, (117, 108)), (1, (117, 100)), (1, (117, 98)), (1, (116, 112)), (1, (116, 58)), (1, (116, 47)), (1, (116, 41)), (1, (116, 10)), (1, (115, 117)), (1, (115, 115)), (1, (115, 112)), (1, (115, 111)), (1, (115, 108)), (1, (115, 47)), (1, (115, 46)), (1, (114, 116)), (1, (114, 109)), (1, (114, 108)), (1, (114, 99)), (1, (112, 111)), (1, (112, 108)), (1, (112, 101)), (1, (112, 58)), (1, (112, 32)), (1, (112, 10)), (1, (111, 118)), (1, (111, 116)), (1, (110, 121)), (1, (110, 117)), (1, (110, 111)), (1, (110, 107)), (1, (110, 99)), (1, (110, 97)), (1, (110, 44)), (1, (109, 109)), (1, (109, 105)), (1, (108, 117)), (1, (107, 101)), (1, (105, 102)), (1, (105, 100)), (1, (105, 44)), (1, (104, 109)), (1, (104, 108)), (1, (103, 101)), (1, (103, 32)), (1, (102, 117)), (1, (102, 114)), (1, (101, 108)), (1, (101, 102)), (1, (101, 101)), (1, (101, 98)), (1, (101, 58)), (1, (99, 105)), (1, (98, 121)), (1, (98, 108)), (1, (97, 115)), (1, (97, 105)), (1, (97, 100)), (1, (97, 99)), (1, 
(97, 98)), (1, (89, 82)), (1, (87, 69)), (1, (86, 101)), (1, (86, 69)), (1, (85, 116)), (1, (85, 83)), (1, (84, 101)), (1, (84, 72)), (1, (84, 69)), (1, (82, 77)), (1, (82, 73)), (1, (82, 69)), (1, (82, 32)), (1, (80, 108)), (1, (80, 101)), (1, (80, 89)), (1, (80, 76)), (1, (79, 87)), (1, (79, 86)), (1, (79, 82)), (1, (79, 80)), (1, (79, 70)), (1, (79, 67)), (1, (79, 32)), (1, (77, 83)), (1, (77, 79)), (1, (76, 105)), (1, (76, 76)), (1, (76, 69)), (1, (75, 10)), (1, (73, 84)), (1, (73, 78)), (1, (73, 71)), (1, (72, 84)), (1, (72, 73)), (1, (71, 73)), (1, (71, 72)), (1, (71, 69)), (1, (71, 32)), (1, (70, 32)), (1, (69, 82)), (1, (69, 77)), (1, (69, 68)), (1, (69, 65)), (1, (69, 58)), (1, (68, 79)), (1, (68, 46)), (1, (67, 114)), (1, (67, 79)), (1, (67, 75)), (1, (67, 41)), (1, (66, 76)), (1, (65, 116)), (1, (65, 83)), (1, (65, 76)), (1, (58, 47)), (1, (58, 10)), (1, (55, 45)), (1, (52, 32)), (1, (51, 46)), (1, (50, 52)), (1, (49, 46)), (1, (49, 41)), (1, (48, 55)), (1, (48, 50)), (1, (48, 48)), (1, (48, 10)), (1, (47, 117)), (1, (47, 116)), (1, (47, 47)), (1, (47, 10)), (1, (46, 49)), (1, (46, 48)), (1, (45, 50)), (1, (41, 10)), (1, (40, 85)), (1, (40, 84)), (1, (40, 67)), (1, (32, 119)), (1, (32, 114)), (1, (32, 110)), (1, (32, 108)), (1, (32, 107)), (1, (32, 101)), (1, (32, 86)), (1, (32, 85)), (1, (32, 82)), (1, (32, 76)), (1, (32, 68)), (1, (32, 66)), (1, (32, 51)), (1, (32, 50)), (1, (32, 49))]
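Printing every pair produces a very long list. If only the top few matter, a sketch like the following sorts by count and keeps the first n. The helper name top_pairs and the small demo_stats dictionary are ours; the four counts in it are copied from the output above.

```python
def top_pairs(stats, n=10):
    """Return the n most frequent (pair, count) items, most frequent first."""
    return sorted(stats.items(), key=lambda kv: kv[1], reverse=True)[:n]

# A few counts copied from the output above, deliberately out of order.
demo_stats = {
    (217, 144): 46642,
    (142, 217): 53930,
    (217, 142): 123396,
    (142, 216): 50698,
}

for pair, count in top_pairs(demo_stats, n=3):
    print(pair, count)
```

With the real data, top_pairs(stats, 10) would show just the ten most common pairs.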
top_pair = max(stats, key=stats.get)
top_pair
Our most common pair is (217, 142), which is at the top of our list as (123396, (217, 142)), occurring 123396 times.
chr(217), chr(142)
('Ù', '\x8e')
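chr treats its argument as a code point, not as a byte, which is why chr(217) and chr(142) give unrelated-looking characters. Since our tokens are UTF-8 bytes, a sketch like the following decodes the pair as bytes instead. The variable names are ours, and it relies on the pair forming one complete UTF-8 sequence, which holds for (217, 142) but not for every pair in the list.

```python
import unicodedata

pair = (217, 142)
# Treat the two numbers as raw bytes and decode them together as UTF-8.
ch = bytes(pair).decode("utf-8")

print(ord(ch), unicodedata.name(ch))
```

The result is code point 1614, the fatha vowel mark, which also appears many times in the list of code points near the start.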
Swapping the pair for a single token
def get_stats(ids):
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts
def merge(ids, pair, idx):
    newids = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i+1] == pair[1]:
            newids.append(idx)
            i += 2
        else:
            newids.append(ids[i])
            i += 1
    return newids
print(merge([5, 6, 6, 7, 9, 1], (6, 7), 99))
Replace a pair (6, 7) in a list [5, 6, 6, 7, 9, 1] of numbers called ids with a single token idx, 99:
[5, 6, 99, 9, 1]
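It helps to try merge on a few edge cases before running it on the whole text. The sketch below repeats the function so the block runs on its own; the test inputs and result names are made up for illustration.

```python
def merge(ids, pair, idx):
    """Replace every non-overlapping occurrence of pair in ids with idx."""
    newids = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
            newids.append(idx)
            i += 2  # skip both members of the pair
        else:
            newids.append(ids[i])
            i += 1
    return newids

overlapping = merge([6, 6, 6], (6, 6), 99)   # matches are taken left to right
at_the_end = merge([1, 2, 6, 7], (6, 7), 99)  # pair is the last two items
no_match = merge([1, 2, 3], (6, 7), 99)       # nothing to replace

print(overlapping, at_the_end, no_match)
```

Because the scan moves left to right and skips two positions after a match, three equal tokens in a row give one merged token followed by one leftover.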
Our tokens so far only use the values 0-255, so we can replace the most common pair with a new token, 256:
tokens2 = merge(tokens, top_pair, 256)
#print(tokens2)
print("length: ", len(tokens2))
length: 1237147
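vocab_size = 276  # 256 byte values + 20 merged tokens
num_merges = vocab_size - 256
ids = list(tokens)  # copy, so the original tokens list is left untouched
merges = {}  # maps a pair of tokens to the new token that replaces it
for i in range(num_merges):
    stats = get_stats(ids)
    pair = max(stats, key=stats.get)  # most common pair right now
    idx = 256 + i
    print(f'merging {pair} into a new token {idx}')
    ids = merge(ids, pair, idx)
    merges[pair] = idx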
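One way to check that the merges lose no information is to expand the merged tokens back into bytes and compare with the original text. The following is a sketch under stated assumptions: the functions train and decode are our own wrappers and are not defined in the text above, a short repeated phrase stands in for the full file, and decode relies on dictionaries keeping insertion order (Python 3.7 and later).

```python
def get_stats(ids):
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, idx):
    newids = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
            newids.append(idx)
            i += 2
        else:
            newids.append(ids[i])
            i += 1
    return newids

def train(tokens, num_merges):
    """Run the merge loop; return the final ids and the merges table."""
    ids = list(tokens)
    merges = {}
    for i in range(num_merges):
        stats = get_stats(ids)
        if not stats:  # fewer than two tokens left, nothing to merge
            break
        pair = max(stats, key=stats.get)
        idx = 256 + i
        ids = merge(ids, pair, idx)
        merges[pair] = idx
    return ids, merges

def decode(ids, merges):
    """Expand merged tokens back into bytes, then into text."""
    vocab = {i: bytes([i]) for i in range(256)}
    # Merges are replayed in the order they were learned, so both halves
    # of each pair are already in vocab when the pair is reached.
    for (a, b), idx in merges.items():
        vocab[idx] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8")

# Stand-in text (hypothetical): a short phrase repeated a few times.
sample = "\u0628\u0650\u0633\u0652\u0645\u0650 " * 4
raw = list(sample.encode("utf-8"))
ids, merges = train(raw, num_merges=5)

print(len(raw), "->", len(ids))
print(decode(ids, merges) == sample)
```

On the real data, comparing len(tokens) with len(ids) after the loop shows how much shorter the sequence has become.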