What is Nostr?
teachYourselfAndYourChildrenHowtoHack
npub1cr0…8dg8
2024-06-13 17:32:07

Tokenization by the good guys

بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ

We need to turn characters into numbers. We can do that with Unicode like this.

Unicode

On Ubuntu Linux press [CTRL][Shift][T] and in the terminal:

schmuck@Schmoe:~$ python3

to get >>>

ord('h')

‘h’ has the Unicode code point 104. ord can only take a single character, to get the code points of many characters:

[ord(x) for x in "إِنَّ اللَّهَ اصْطَفَىٰ آدَمَ وَنُوحًا وَآلَ إِبْرَاهِيمَ وَآلَ عِمْرَانَ"]

[1573, 1616, 1606, 1617, 1614, 32, 1575, 1604, 1604, 1617, 1614, 1607, 1614, 32, 1575, 1589, 1618, 1591, 1614, 1601, 1614, 1609, 1648, 32, 1570, 1583, 1614, 1605, 1614, 32, 1608, 1614, 1606, 1615, 1608, 1581, 1611, 1575, 32, 1608, 1614, 1570, 1604, 1614, 32, 1573, 1616, 1576, 1618, 1585, 1614, 1575, 1607, 1616, 1610, 1605, 1614, 32, 1608, 1614, 1570, 1604, 1614, 32, 1593, 1616, 1605, 1618, 1585, 1614, 1575, 1606, 1614]

But Unicode is always changing so not very good for us. We can use a type of Unicode called UTF-8 which can turn our characters into binary-data or byte-streams like this:

"بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ".encode("utf-8")

b’\xd8\xa8\xd9\x90\xd8\xb3\xd9\x92\xd9\x85\xd9\x90 \xd8\xa7\xd9\x84\xd9\x84\xd9\x91\xd9\x8e\xd9\x87\xd9\x90 \xd8\xa7\xd9\x84\xd8\xb1\xd9\x91\xd9\x8e\xd8\xad\xd9\x92\xd9\x85\xd9\x8e\xd9\xb0\xd9\x86\xd9\x90 \xd8\xa7\xd9\x84\xd8\xb1\xd9\x91\xd9\x8e\xd8\xad\xd9\x90\xd9\x8a\xd9\x85\xd9\x90’

but it is not very pretty so we can turn it into useful numbers to work with like this:

list("بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ".encode("utf-8"))

[216, 168, 217, 144, 216, 179, 217, 146, 217, 133, 217, 144, 32, 216, 167, 217, 132, 217, 132, 217, 145, 217, 142, 217, 135, 217, 144, 32, 216, 167, 217, 132, 216, 177, 217, 145, 217, 142, 216, 173, 217, 146, 217, 133, 217, 142, 217, 176, 217, 134, 217, 144, 32, 216, 167, 217, 132, 216, 177, 217, 145, 217, 142, 216, 173, 217, 144, 217, 138, 217, 133, 217, 144]

Getting the data

Go to https://tanzil.net/download/ and choose ‘Uthmani’ under Quran text type: , ‘Text’ for Output file format: and tick all boxes except ‘Include sequential tanweens’, then Download to get the file and then using python3 to open:

text = open("quran-uthmani.txt", 'r').read() print(text)

We can encode in UTF-8 and get some ugly binary:

tokens = text.encode("utf-8") print(tokens)

A neater way, so we get 0-255 range of code points:

tokens = list(map(int, tokens))

quit() to get out of >>> and back to $

Google Colab

If you have a Google account, or if you have a throw away SIM card to get a new Google account using a dumb phone for SMS verification, you can start using Google’s Colab for free:

https://colab.research.google.com/

Yes, for free so when at the payment screen, click away from it to get the free lab.

Adding your file to Google Colab

Click the file icon and then the dog-earred page with an up arrow and add your quran-uthmani.txt file from ‘Getting the data’ (above).

image

It will be deleted so you have to add it every session you start Colab, so have it saved somewhere.

Our actor example applied to Python

So, we talked earlier about Fenyman and James Clear’s ‘Atomic Habits’ and how actors learn their script.

We are now going to use that to learn our own ‘scripting’, that is learn programming:

*Type everything out and never ever copy and paste code. And never say never.*

We said we won’t type/write by going back and forth but in this case we will except that we type everything out. It will first enter our memory so we know where to find it when we need it in the future and for speed we will type first because searching for something we saw months ago somewhere takes longer than trying muscle memory out and just typing it.

We say, ‘never say never’ because there are times when there is no point in typing out long paragraphs of code that are ‘boiler plate’ meaning, always used as is everywhere. In those rare cases, copy and paste.

Getting the most common pairs

Type in Google Colab text = open("quran-uthmani.txt", 'r').read() and then press [SHIFT][ENTER] to run the ‘code cell’.

open() "quran-uthmani.txt" as read-only 'r' and save as text.

tokens = text.encode("utf-8")

take text and encode() it with "utf-8" and save as tokens

tokens = list(map(int, tokens))

def get_stats(ids):
  counts = {}
  for pair in zip(ids, ids[1:]):
    counts[pair] = counts.get(pair, 0) + 1
  return counts

stats = get_stats(tokens)

print(stats)

{(216, 168): 11603, (168, 217): 11593, (217, 144): 46642, (144, 216): 7778, (216, 179): 6122, (179, 217): 6122, (217, 146): 37372, (146, 217): 14675, (217, 133): 27071, (133, 217): 25740, (144, 32): 7712, (32, 217): 45082, (217, 177): 13819, (177, 217): 25239, (217, 132): 38550, (132, 217): 36124, (217, 145): 23016, (145, 217): 23016, (217, 142): 123396, (142, 217): 53930, (217, 135): 14962, (135, 217): 14961, (132, 216): 2316, (216, 177): 12627, (142, 216): 50698, (216, 173): 4364, (173, 217): 4364, (217, 128): 6848, (128, 217): 6808, (217, 176): 9838, (176, 217): 10000, (217, 134): 27380, (134, 217): 22530, (144, 217): 29600, (217, 138): 18334, (138, 217): 16706, (144, 10): 595, (10, 217): 4481, (146, 216): 13852, (216, 175): 5991, (175, 217): 5945, (217, 143): 37320, (143, 32): 6675, (32, 216): 26762, (216, 185): 9405, (185, 217): 9403, (142, 10): 2843, (217, 131): 10497, (131, 217): 10497, (217, 136): 24970, (136, 217): 20377, (10, 216): 1555, (216, 165): 5088, (165, 217): 5260, (216, 167): 25184, (167, 217): 6829, (142, 32): 15924, (143, 216): 5806, (216, 170): 10520, (170, 217): 10504, (143, 10): 331, (167, 32): 10720, (216, 181): 2074, (181, 217): 2071, (176, 216): 3255, (216, 183): 1273, (183, 217): 1266, (217, 130): 7034, (130, 217): 7034, (216, 176): 4932, (216, 163): 8900, (163, 217): 8901, (146, 32): 8751, (216, 186): 1221, (186, 217): 1221, (216, 182): 1686, (182, 217): 1686, (143, 217): 23252, (136, 216): 4289, (217, 147): 5376, (147, 217): 90, (147, 10): 76, (32, 219): 4379, (219, 155): 12, (155, 32): 12, (217, 129): 8747, (129, 217): 8746, (217, 139): 3741, (139, 217): 93, (217, 137): 6603, (137, 32): 3035, (216, 164): 706, (164, 217): 706, (216, 169): 2344, (169, 217): 2344, (216, 178): 1599, (178, 217): 1599, (147, 32): 2459, (134, 216): 1499, (134, 32): 3081, (217, 148): 773, (148, 217): 773, (167, 216): 2953, (216, 174): 2497, (174, 217): 2497, (136, 219): 255, (219, 159): 3988, (159, 217): 268, (147, 216): 2751, (216, 166): 921, (166, 217): 1085, (137, 217): 3531, (176, 32): 1337, (219, 150): 1682, (150, 32): 1682, (167, 219): 3789, (159, 32): 3704, (216, 161): 2782, (161, 217): 2782, (217, 140): 2519, (140, 32): 1777, (216, 180): 2124, (180, 217): 2124, (216, 184): 853, (184, 217): 853, (140, 10): 605, (133, 32): 1328, (139, 216): 2976, (140, 219): 134, (219, 162): 510, (162, 32): 338, (219, 151): 603, (151, 32): 603, (177, 216): 1197, (170, 32): 17, (216, 172): 3317, (172, 217): 3317, (216, 171): 1414, (171, 217): 1414, (143, 219): 1256, (219, 165): 1257, (165, 32): 1042, (217, 141): 2633, (141, 32): 2080, (219, 154): 1972, (154, 32): 1972, (138, 216): 1618, (139, 32): 556, (144, 219): 957, (219, 166): 957, (166, 32): 791, (219, 153): 68, (153, 32): 68, (10, 219): 199, (219, 158): 199, (158, 32): 199, (219, 152): 22, (152, 32): 22, (134, 219): 270, (162, 216): 158, (132, 32): 110, (141, 10): 454, (141, 219): 99, (219, 173): 99, (173, 32): 84, (168, 32): 9, (136, 32): 49, (128, 219): 40, (219, 167): 38, (139, 219): 106, (175, 216): 38, (181, 219): 3, (219, 156): 7, (156, 217): 2, (165, 216): 19, (175, 32): 8, (219, 160): 66, (160, 32): 62, (159, 216): 14, (177, 32): 9, (138, 219): 10, (167, 10): 931, (159, 10): 2, (140, 216): 3, (162, 10): 14, (183, 216): 7, (137, 219): 1, (171, 32): 1, (219, 169): 15, (169, 10): 15, (177, 219): 1, (219, 170): 1, (142, 219): 1, (219, 171): 1, (129, 32): 1, (156, 10): 2, (137, 10): 36, (185, 32): 2, (135, 10): 1, (176, 10): 178, (146, 10): 94, (219, 168): 1, (168, 216): 1, (160, 10): 4, (173, 10): 15, (156, 32): 3, (219, 172): 1, (172, 216): 1, (133, 10): 3, (219, 163): 1, (139, 10): 10, (165, 10): 24, (166, 10): 2, (168, 10): 1, (10, 10): 2, (10, 35): 28, (35, 32): 18, (32, 80): 6, (80, 76): 1, (76, 69): 1, (69, 65): 1, (65, 83): 1, (83, 69): 2, (69, 32): 3, (32, 68): 1, (68, 79): 1, (79, 32): 1, (32, 78): 2, (78, 79): 2, (79, 84): 2, (84, 32): 4, (32, 82): 1, (82, 69): 1, (69, 77): 1, (77, 79): 1, (79, 86): 1, (86, 69): 1, (32, 79): 2, (79, 82): 1, (82, 32): 1, (32, 67): 6, (67, 72): 2, (72, 65): 2, (65, 78): 2, (78, 71): 3, (71, 69): 1, (32, 84): 9, (84, 72): 1, (72, 73): 1, (73, 83): 2, (83, 32): 3, (67, 79): 1, (79, 80): 1, (80, 89): 1, (89, 82): 1, (82, 73): 1, (73, 71): 1, (71, 72): 1, (72, 84): 1, (32, 66): 1, (66, 76): 1, (76, 79): 2, (79, 67): 1, (67, 75): 1, (75, 10): 1, (35, 61): 2, (61, 61): 134, (61, 10): 2, (35, 10): 8, (32, 32): 29, (84, 97): 4, (97, 110): 19, (110, 122): 6, (122, 105): 6, (105, 108): 7, (108, 32): 9, (32, 81): 3, (81, 117): 3, (117, 114): 4, (114, 97): 5, (110, 32): 11, (84, 101): 1, (101, 120): 6, (120, 116): 6, (116, 32): 9, (32, 40): 3, (40, 85): 1, (85, 116): 1, (116, 104): 6, (104, 109): 1, (109, 97): 2, (110, 105): 3, (105, 44): 1, (44, 32): 6, (32, 86): 1, (86, 101): 1, (101, 114): 7, (114, 115): 2, (115, 105): 3, (105, 111): 5, (111, 110): 9, (32, 49): 1, (49, 46): 1, (46, 49): 1, (49, 41): 1, (41, 10): 1, (67, 111): 2, (111, 112): 7, (112, 121): 4, (121, 114): 2, (114, 105): 7, (105, 103): 3, (103, 104): 3, (104, 116): 3, (40, 67): 1, (67, 41): 1, (41, 32): 2, (32, 50): 1, (50, 48): 2, (48, 48): 1, (48, 55): 1, (55, 45): 1, (45, 50): 1, (48, 50): 1, (50, 52): 1, (52, 32): 1, (80, 114): 3, (114, 111): 9, (111, 106): 3, (106, 101): 3, (101, 99): 5, (99, 116): 3, (116, 10): 1, (32, 76): 1, (76, 105): 1, (105, 99): 4, (99, 101): 5, (101, 110): 2, (110, 115): 2, (115, 101): 4, (101, 58): 1, (58, 32): 2, (67, 114): 1, (114, 101): 4, (101, 97): 3, (97, 116): 11, (116, 105): 9, (105, 118): 2, (118, 101): 5, (101, 32): 13, (111, 109): 2, (109, 109): 1, (109, 111): 2, (115, 32): 17, (32, 65): 2, (65, 116): 1, (116, 116): 2, (116, 114): 3, (105, 98): 2, (98, 117): 3, (117, 116): 3, (32, 51): 1, (51, 46): 1, (46, 48): 1, (48, 10): 1, (84, 104): 3, (104, 105): 6, (105, 115): 12, (32, 99): 12, (99, 111): 7, (121, 32): 9, (32, 111): 8, (111, 102): 6, (102, 32): 6, (32, 116): 16, (104, 101): 3, (116, 101): 12, (32, 105): 10, (99, 97): 4, (97, 114): 2, (101, 102): 1, (102, 117): 1, (117, 108): 1, (108, 108): 5, (108, 121): 5, (32, 112): 3, (112, 114): 5, (111, 100): 2, (100, 117): 2, (117, 99): 2, (101, 100): 10, (100, 44): 2, (32, 104): 2, (104, 108): 1, (32, 10): 7, (32, 118): 3, (105, 102): 1, (102, 105): 2, (105, 101): 3, (100, 32): 12, (32, 97): 13, (110, 100): 5, (110, 116): 4, (105, 110): 9, (110, 117): 1, (117, 111): 1, (111, 117): 3, (117, 115): 3, (115, 108): 1, (32, 109): 2, (105, 116): 3, (116, 111): 5, (111, 114): 4, (32, 98): 5, (98, 121): 1, (97, 32): 2, (32, 103): 2, (103, 114): 2, (117, 112): 3, (112, 32): 1, (32, 115): 5, (115, 112): 1, (112, 101): 1, (99, 105): 1, (105, 97): 3, (97, 108): 6, (108, 105): 3, (115, 116): 3, (116, 115): 2, (116, 46): 2, (46, 10): 4, (84, 69): 1, (69, 82): 1, (82, 77): 1, (77, 83): 1, (79, 70): 1, (70, 32): 1, (32, 85): 1, (85, 83): 1, (69, 58): 1, (58, 10): 1, (32, 45): 3, (45, 32): 3, (80, 101): 1, (114, 109): 1, (109, 105): 1, (115, 115): 1, (111, 32): 4, (32, 100): 2, (100, 105): 2, (114, 98): 2, (98, 97): 2, (105, 109): 2, (109, 32): 3, (112, 105): 2, (101, 115): 6, (116, 44): 2, (71, 73): 1, (73, 78): 1, (71, 32): 1, (32, 73): 2, (73, 84): 1, (65, 76): 1, (76, 76): 1, (79, 87): 1, (87, 69): 1, (69, 68): 1, (68, 46): 1, (98, 101): 3, (32, 117): 3, (110, 121): 1, (32, 119): 1, (119, 101): 1, (101, 98): 1, (98, 115): 2, (114, 32): 2, (97, 112): 2, (112, 112): 2, (112, 108): 1, (110, 44): 1, (111, 118): 1, (118, 105): 1, (105, 100): 1, (100, 101): 4, (104, 97): 4, (115, 111): 1, (114, 99): 1, (40, 84): 1, (116, 41): 1, (99, 108): 2, (108, 101): 4, (114, 108): 1, (32, 108): 1, (110, 107): 1, (107, 32): 3, (97, 100): 1, (116, 97): 4, (108, 46): 2, (46, 110): 2, (110, 101): 2, (101, 116): 2, (32, 101): 1, (110, 97): 1, (97, 98): 1, (98, 108): 1, (32, 107): 1, (107, 101): 1, (101, 101): 1, (101, 112): 2, (112, 10): 1, (97, 99): 1, (99, 107): 2, (99, 104): 2, (110, 103): 2, (103, 101): 1, (115, 46): 1, (32, 110): 1, (110, 111): 1, (111, 116): 1, (115, 104): 2, (110, 99): 1, (108, 117): 1, (117, 100): 1, (32, 114): 1, (101, 108): 1, (32, 102): 2, (102, 114): 1, (97, 105): 1, (103, 32): 1, (115, 117): 1, (117, 98): 1, (112, 111): 1, (114, 116): 1, (80, 108): 1, (97, 115): 1, (112, 100): 2, (100, 97): 2, (116, 58): 1, (116, 112): 1, (112, 58): 1, (58, 47): 1, (47, 47): 1, (47, 116): 1, (116, 47): 1, (47, 117): 1, (115, 47): 1, (47, 10): 1}

The most common pair is …

print(sorted(((v,k) for k,v in stats.items()), reverse=True))

sorted by valuev to get the most common pairs first

[(123396, (217, 142)), (53930, (142, 217)), (50698, (142, 216)), (46642, (217, 144)), (45082, (32, 217)), (38550, (217, 132)), (37372, (217, 146)), (37320, (217, 143)), (36124, (132, 217)), (29600, (144, 217)), (27380, (217, 134)), (27071, (217, 133)), (26762, (32, 216)), (25740, (133, 217)), (25239, (177, 217)), (25184, (216, 167)), (24970, (217, 136)), (23252, (143, 217)), (23016, (217, 145)), (23016, (145, 217)), (22530, (134, 217)), (20377, (136, 217)), (18334, (217, 138)), (16706, (138, 217)), (15924, (142, 32)), (14962, (217, 135)), (14961, (135, 217)), (14675, (146, 217)), (13852, (146, 216)), (13819, (217, 177)), (12627, (216, 177)), (11603, (216, 168)), (11593, (168, 217)), (10720, (167, 32)), (10520, (216, 170)), (10504, (170, 217)), (10497, (217, 131)), (10497, (131, 217)), (10000, (176, 217)), (9838, (217, 176)), (9405, (216, 185)), (9403, (185, 217)), (8901, (163, 217)), (8900, (216, 163)), (8751, (146, 32)), (8747, (217, 129)), (8746, (129, 217)), (7778, (144, 216)), (7712, (144, 32)), (7034, (217, 130)), (7034, (130, 217)), (6848, (217, 128)), (6829, (167, 217)), (6808, (128, 217)), (6675, (143, 32)), (6603, (217, 137)), (6122, (216, 179)), (6122, (179, 217)), (5991, (216, 175)), (5945, (175, 217)), (5806, (143, 216)), (5376, (217, 147)), (5260, (165, 217)), (5088, (216, 165)), (4932, (216, 176)), (4481, (10, 217)), (4379, (32, 219)), (4364, (216, 173)), (4364, (173, 217)), (4289, (136, 216)), (3988, (219, 159)), (3789, (167, 219)), (3741, (217, 139)), (3704, (159, 32)), (3531, (137, 217)), (3317, (216, 172)), (3317, (172, 217)), (3255, (176, 216)), (3081, (134, 32)), (3035, (137, 32)), (2976, (139, 216)), (2953, (167, 216)), (2843, (142, 10)), (2782, (216, 161)), (2782, (161, 217)), (2751, (147, 216)), (2633, (217, 141)), (2519, (217, 140)), (2497, (216, 174)), (2497, (174, 217)), (2459, (147, 32)), (2344, (216, 169)), (2344, (169, 217)), (2316, (132, 216)), (2124, (216, 180)), (2124, (180, 217)), (2080, (141, 32)), (2074, (216, 181)), (2071, (181, 217)), (1972, (219, 154)), (1972, (154, 32)), (1777, (140, 32)), (1686, (216, 182)), (1686, (182, 217)), (1682, (219, 150)), (1682, (150, 32)), (1618, (138, 216)), (1599, (216, 178)), (1599, (178, 217)), (1555, (10, 216)), (1499, (134, 216)), (1414, (216, 171)), (1414, (171, 217)), (1337, (176, 32)), (1328, (133, 32)), (1273, (216, 183)), (1266, (183, 217)), (1257, (219, 165)), (1256, (143, 219)), (1221, (216, 186)), (1221, (186, 217)), (1197, (177, 216)), (1085, (166, 217)), (1042, (165, 32)), (957, (219, 166)), (957, (144, 219)), (931, (167, 10)), (921, (216, 166)), (853, (216, 184)), (853, (184, 217)), (791, (166, 32)), (773, (217, 148)), (773, (148, 217)), (706, (216, 164)), (706, (164, 217)), (605, (140, 10)), (603, (219, 151)), (603, (151, 32)), (595, (144, 10)), (556, (139, 32)), (510, (219, 162)), (454, (141, 10)), (338, (162, 32)), (331, (143, 10)), (270, (134, 219)), (268, (159, 217)), (255, (136, 219)), (199, (219, 158)), (199, (158, 32)), (199, (10, 219)), (178, (176, 10)), (158, (162, 216)), (134, (140, 219)), (134, (61, 61)), (110, (132, 32)), (106, (139, 219)), (99, (219, 173)), (99, (141, 219)), (94, (146, 10)), (93, (139, 217)), (90, (147, 217)), (84, (173, 32)), (76, (147, 10)), (68, (219, 153)), (68, (153, 32)), (66, (219, 160)), (62, (160, 32)), (49, (136, 32)), (40, (128, 219)), (38, (219, 167)), (38, (175, 216)), (36, (137, 10)), (29, (32, 32)), (28, (10, 35)), (24, (165, 10)), (22, (219, 152)), (22, (152, 32)), (19, (165, 216)), (19, (97, 110)), (18, (35, 32)), (17, (170, 32)), (17, (115, 32)), (16, (32, 116)), (15, (219, 169)), (15, (173, 10)), (15, (169, 10)), (14, (162, 10)), (14, (159, 216)), (13, (101, 32)), (13, (32, 97)), (12, (219, 155)), (12, (155, 32)), (12, (116, 101)), (12, (105, 115)), (12, (100, 32)), (12, (32, 99)), (11, (110, 32)), (11, (97, 116)), (10, (139, 10)), (10, (138, 219)), (10, (101, 100)), (10, (32, 105)), (9, (177, 32)), (9, (168, 32)), (9, (121, 32)), (9, (116, 105)), (9, (116, 32)), (9, (114, 111)), (9, (111, 110)), (9, (108, 32)), (9, (105, 110)), (9, (32, 84)), (8, (175, 32)), (8, (35, 10)), (8, (32, 111)), (7, (219, 156)), (7, (183, 216)), (7, (114, 105)), (7, (111, 112)), (7, (105, 108)), (7, (101, 114)), (7, (99, 111)), (7, (32, 10)), (6, (122, 105)), (6, (120, 116)), (6, (116, 104)), (6, (111, 102)), (6, (110, 122)), (6, (104, 105)), (6, (102, 32)), (6, (101, 120)), (6, (101, 115)), (6, (97, 108)), (6, (44, 32)), (6, (32, 80)), (6, (32, 67)), (5, (118, 101)), (5, (116, 111)), (5, (114, 97)), (5, (112, 114)), (5, (110, 100)), (5, (108, 121)), (5, (108, 108)), (5, (105, 111)), (5, (101, 99)), (5, (99, 101)), (5, (32, 115)), (5, (32, 98)), (4, (160, 10)), (4, (117, 114)), (4, (116, 97)), (4, (115, 101)), (4, (114, 101)), (4, (112, 121)), (4, (111, 114)), (4, (111, 32)), (4, (110, 116)), (4, (108, 101)), (4, (105, 99)), (4, (104, 97)), (4, (100, 101)), (4, (99, 97)), (4, (84, 97)), (4, (84, 32)), (4, (46, 10)), (3, (181, 219)), (3, (156, 32)), (3, (140, 216)), (3, (133, 10)), (3, (117, 116)), (3, (117, 115)), (3, (117, 112)), (3, (116, 114)), (3, (115, 116)), (3, (115, 105)), (3, (111, 117)), (3, (111, 106)), (3, (110, 105)), (3, (109, 32)), (3, (108, 105)), (3, (107, 32)), (3, (106, 101)), (3, (105, 116)), (3, (105, 103)), (3, (105, 101)), (3, (105, 97)), (3, (104, 116)), (3, (104, 101)), (3, (103, 104)), (3, (101, 97)), (3, (99, 116)), (3, (98, 117)), (3, (98, 101)), (3, (84, 104)), (3, (83, 32)), (3, (81, 117)), (3, (80, 114)), (3, (78, 71)), (3, (69, 32)), (3, (45, 32)), (3, (32, 118)), (3, (32, 117)), (3, (32, 112)), (3, (32, 81)), (3, (32, 45)), (3, (32, 40)), (2, (185, 32)), (2, (166, 10)), (2, (159, 10)), (2, (156, 217)), (2, (156, 10)), (2, (121, 114)), (2, (117, 99)), (2, (116, 116)), (2, (116, 115)), (2, (116, 46)), (2, (116, 44)), (2, (115, 104)), (2, (114, 115)), (2, (114, 98)), (2, (114, 32)), (2, (112, 112)), (2, (112, 105)), (2, (112, 100)), (2, (111, 109)), (2, (111, 100)), (2, (110, 115)), (2, (110, 103)), (2, (110, 101)), (2, (109, 111)), (2, (109, 97)), (2, (108, 46)), (2, (105, 118)), (2, (105, 109)), (2, (105, 98)), (2, (103, 114)), (2, (102, 105)), (2, (101, 116)), (2, (101, 112)), (2, (101, 110)), (2, (100, 117)), (2, (100, 105)), (2, (100, 97)), (2, (100, 44)), (2, (99, 108)), (2, (99, 107)), (2, (99, 104)), (2, (98, 115)), (2, (98, 97)), (2, (97, 114)), (2, (97, 112)), (2, (97, 32)), (2, (83, 69)), (2, (79, 84)), (2, (78, 79)), (2, (76, 79)), (2, (73, 83)), (2, (72, 65)), (2, (67, 111)), (2, (67, 72)), (2, (65, 78)), (2, (61, 10)), (2, (58, 32)), (2, (50, 48)), (2, (46, 110)), (2, (41, 32)), (2, (35, 61)), (2, (32, 109)), (2, (32, 104)), (2, (32, 103)), (2, (32, 102)), (2, (32, 100)), (2, (32, 79)), (2, (32, 78)), (2, (32, 73)), (2, (32, 65)), (2, (10, 10)), (1, (219, 172)), (1, (219, 171)), (1, (219, 170)), (1, (219, 168)), (1, (219, 163)), (1, (177, 219)), (1, (172, 216)), (1, (171, 32)), (1, (168, 216)), (1, (168, 10)), (1, (142, 219)), (1, (137, 219)), (1, (135, 10)), (1, (129, 32)), (1, (119, 101)), (1, (118, 105)), (1, (117, 111)), (1, (117, 108)), (1, (117, 100)), (1, (117, 98)), (1, (116, 112)), (1, (116, 58)), (1, (116, 47)), (1, (116, 41)), (1, (116, 10)), (1, (115, 117)), (1, (115, 115)), (1, (115, 112)), (1, (115, 111)), (1, (115, 108)), (1, (115, 47)), (1, (115, 46)), (1, (114, 116)), (1, (114, 109)), (1, (114, 108)), (1, (114, 99)), (1, (112, 111)), (1, (112, 108)), (1, (112, 101)), (1, (112, 58)), (1, (112, 32)), (1, (112, 10)), (1, (111, 118)), (1, (111, 116)), (1, (110, 121)), (1, (110, 117)), (1, (110, 111)), (1, (110, 107)), (1, (110, 99)), (1, (110, 97)), (1, (110, 44)), (1, (109, 109)), (1, (109, 105)), (1, (108, 117)), (1, (107, 101)), (1, (105, 102)), (1, (105, 100)), (1, (105, 44)), (1, (104, 109)), (1, (104, 108)), (1, (103, 101)), (1, (103, 32)), (1, (102, 117)), (1, (102, 114)), (1, (101, 108)), (1, (101, 102)), (1, (101, 101)), (1, (101, 98)), (1, (101, 58)), (1, (99, 105)), (1, (98, 121)), (1, (98, 108)), (1, (97, 115)), (1, (97, 105)), (1, (97, 100)), (1, (97, 99)), (1, (97, 98)), (1, (89, 82)), (1, (87, 69)), (1, (86, 101)), (1, (86, 69)), (1, (85, 116)), (1, (85, 83)), (1, (84, 101)), (1, (84, 72)), (1, (84, 69)), (1, (82, 77)), (1, (82, 73)), (1, (82, 69)), (1, (82, 32)), (1, (80, 108)), (1, (80, 101)), (1, (80, 89)), (1, (80, 76)), (1, (79, 87)), (1, (79, 86)), (1, (79, 82)), (1, (79, 80)), (1, (79, 70)), (1, (79, 67)), (1, (79, 32)), (1, (77, 83)), (1, (77, 79)), (1, (76, 105)), (1, (76, 76)), (1, (76, 69)), (1, (75, 10)), (1, (73, 84)), (1, (73, 78)), (1, (73, 71)), (1, (72, 84)), (1, (72, 73)), (1, (71, 73)), (1, (71, 72)), (1, (71, 69)), (1, (71, 32)), (1, (70, 32)), (1, (69, 82)), (1, (69, 77)), (1, (69, 68)), (1, (69, 65)), (1, (69, 58)), (1, (68, 79)), (1, (68, 46)), (1, (67, 114)), (1, (67, 79)), (1, (67, 75)), (1, (67, 41)), (1, (66, 76)), (1, (65, 116)), (1, (65, 83)), (1, (65, 76)), (1, (58, 47)), (1, (58, 10)), (1, (55, 45)), (1, (52, 32)), (1, (51, 46)), (1, (50, 52)), (1, (49, 46)), (1, (49, 41)), (1, (48, 55)), (1, (48, 50)), (1, (48, 48)), (1, (48, 10)), (1, (47, 117)), (1, (47, 116)), (1, (47, 47)), (1, (47, 10)), (1, (46, 49)), (1, (46, 48)), (1, (45, 50)), (1, (41, 10)), (1, (40, 85)), (1, (40, 84)), (1, (40, 67)), (1, (32, 119)), (1, (32, 114)), (1, (32, 110)), (1, (32, 108)), (1, (32, 107)), (1, (32, 101)), (1, (32, 86)), (1, (32, 85)), (1, (32, 82)), (1, (32, 76)), (1, (32, 68)), (1, (32, 66)), (1, (32, 51)), (1, (32, 50)), (1, (32, 49))]

top_pair = max(stats, key=stats.get)
top_pair

Our most common pair is (217, 142) which is at the top of our list (123396, (217, 142) occuring 123396 times.

chr(217), chr(142)

(‘Ù’, ‘\x8e’)

Swapping the pair for a single token

def get_stats(ids):
  counts = {}
  for pair in zip(ids, idx[1:]):
    counts[pair] = counts.get(pair, 0) + 1
  return counts 

def merge(ids, pair, idx):
  newids = []
  i = 0
  while i < len(ids):
    if i < len(ids) - 1 and ids[i] == pair[0] and ids[i+1] == pair[1]:
      newids.append(idx)
      i += 2
    else:
      newids.append(ids[i])
      i += 1
  return newids
  
print(merge([5, 6, 6, 7, 9, 1], (6, 7), 99))

Replace a pair(6, 7) in a list [5, 6, 6, 7, 9, 1] of numbers called ids with a single token idx 99

[5, 6, 99, 9, 1]

We have 0-255 tokens, to replace the most common pair with a new token 256:

tokens2 = merge(tokens, top_pair, 256)
#print(tokens2)
print("length: ", len(tokens2))

length: 1237147

vocab_size = 276
num_merges = vocab_size - 256
ids = list(tokens)

merges = {}
for i in range(num_merges):
  stats = get_stats(ids)
  pair = max(stats, key=stats.get)
  idx = 256 + i
  print(f'merging {pair} into a new token {idx}')
  ids = merge(ids, pair, idx)
  merges[pair] = idx

Source:

https://en.wikipedia.org/wiki/Arabic_script_in_Unicode

Author Public Key
npub1cr0yq3g9c2sug5tgh7z2agfsyqzl7ppjjsk0auaypjcakmh9wllq5q8dg8