mleku on Nostr:
i have not looked deeply into it but the way Go does it is to convert the variable-length UTF-8 symbols into an array of runes, which are 32-bit code point values (the type can hold ~4bln values, though Unicode itself only defines around 1.1 million code points), and you compute the byte length in UTF-8 from the value of each element of the array, since each rune encodes to between 1 and 4 bytes
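a minimal sketch of what that looks like in Go, using only the standard unicode/utf8 package (the example string is arbitrary):

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	s := "héllo, 世界"

	// converting to []rune decodes the variable-length UTF-8
	// into fixed-size 32-bit code point values
	runes := []rune(s)

	// the UTF-8 byte length of each rune depends on its value:
	// 1 byte up to U+007F, 2 up to U+07FF, 3 up to U+FFFF, 4 beyond
	total := 0
	for _, r := range runes {
		total += utf8.RuneLen(r)
	}

	fmt.Println(len(runes), "runes,", total, "bytes") // total == len(s)
}
```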
i don't know about avoiding breaking up words; word boundaries are another thing again. as you know, in some languages there isn't a notion of a "space" character, each symbol is an atomic unit instead. runes are at most 4 bytes long encoded in UTF-8, so either you are splitting on a space or on a symbol boundary. spaces are not considered symbols except for the byte they take up, which is just one byte, hex 0x20
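a rough sketch of that splitting logic, with a hypothetical byte budget, never cutting mid-rune, and backing up to a space when the script actually uses spaces:

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// truncate cuts s to at most maxBytes bytes without splitting a rune,
// and backs up to the last space if one falls inside the kept prefix.
func truncate(s string, maxBytes int) string {
	if len(s) <= maxBytes {
		return s
	}
	cut := maxBytes
	// step back until we sit on a rune boundary, so we never emit
	// a dangling UTF-8 continuation byte
	for cut > 0 && !utf8.RuneStart(s[cut]) {
		cut--
	}
	prefix := s[:cut]
	// prefer a word boundary when a 0x20 space is in range
	if i := strings.LastIndexByte(prefix, 0x20); i > 0 {
		return prefix[:i]
	}
	return prefix
}

func main() {
	fmt.Println(truncate("hello 世界 again", 10))
}
```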
uniseg looks pretty much like a good choice. i should just point out that one of the Go authors, Rob Pike, co-designed UTF-8 (with Ken Thompson), so Go's handling is especially good. it might help you find a better alternative by searching for his recommendations for C/C++ projects needing to do this
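to make the uniseg suggestion concrete, here's a small sketch assuming the github.com/rivo/uniseg package; it counts user-perceived characters (grapheme clusters) rather than runes or bytes:

```go
package main

import (
	"fmt"

	"github.com/rivo/uniseg"
)

func main() {
	s := "👩‍👩‍👧‍👦 café"

	// bytes, runes, and grapheme clusters all give different answers
	fmt.Println("bytes:", len(s))
	fmt.Println("runes:", len([]rune(s)))
	fmt.Println("graphemes:", uniseg.GraphemeClusterCount(s))

	// iterate one user-perceived character at a time
	gr := uniseg.NewGraphemes(s)
	for gr.Next() {
		fmt.Printf("%q ", gr.Str())
	}
	fmt.Println()
}
```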
most of it is handled already in javascript too, because internationalisation was very important to web systems from very early on, but server and desktop app languages have lived in a bubble of mostly english for a long time.
as regards the truncation length decision, i think visual width should be what you are aiming for here. for the most part, distinctive language scripts have a fairly uniform relative width for their symbols, so the right place to evaluate that question is the typography side of the encodings; the byte length is probably not relevant at all
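a sketch of width-based truncation along those lines, assuming a recent version of rivo/uniseg that exposes StringWidth (the 20-column cut-off is an arbitrary example):

```go
package main

import (
	"fmt"

	"github.com/rivo/uniseg"
)

// truncateToWidth keeps whole grapheme clusters until the rendered
// width (in monospace columns) would exceed maxWidth.
func truncateToWidth(s string, maxWidth int) string {
	var out []byte
	width := 0
	gr := uniseg.NewGraphemes(s)
	for gr.Next() {
		g := gr.Str()
		w := uniseg.StringWidth(g) // wide CJK symbols count as 2 columns
		if width+w > maxWidth {
			break
		}
		out = append(out, g...)
		width += w
	}
	return string(out)
}

func main() {
	fmt.Println(truncateToWidth("mixed ラテン and 漢字 text", 20))
}
```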