Simon Willison on Nostr: The tokenizers have a strong bias towards English: "The dog eats the apples" is 5 ...
The tokenizers have a strong bias towards English: "The dog eats the apples" is 5 tokens, "El perro come las manzanas" is 8 tokens, and many Japanese characters end up using two integer tokens for each character of text.
Published at
2023-06-08 20:40:11Event JSON
{
"id": "dd8cf1870609f5f3a5e9b669769a7265fc523f02fe18a10254b8fe5f2065d78e",
"pubkey": "8b0be93ed69c30e9a68159fd384fd8308ce4bbf16c39e840e0803dcb6c08720e",
"created_at": 1686256811,
"kind": 1,
"tags": [
[
"e",
"625798861d48b5df0fa62dad302fa8e2b834a24572a9d3c93933a654b13593c4",
"wss://relay.mostr.pub",
"reply"
],
[
"mostr",
"https://fedi.simonwillison.net/users/simon/statuses/110510526403709478"
]
],
"content": "The tokenizers have a strong bias towards English: \"The dog eats the apples\" is 5 tokens, \"El perro come las manzanas\" is 8 tokens, and many Japanese characters end up using two integer tokens for each character of text.",
"sig": "261592941599e8378f6f42ea103a85e85b470db1f13df85dd68e8f88bcde3065ecd901c647af004cf4b7b5390eb93d2563fc45ac2166e245e4fea386eeb1940e"
}