npub1yxp…qud4

2024-12-20 19:18:47

in reply to nevent1q…s8qw

> It might help you find a better alternative by searching for his recommendation for C/C++ projects needing to do this

Thanks, I will look into this. Go has much better Unicode support than C++, which basically doesn't have any: you need to pull in a library to do even basic things. This is why I'm doing the hack I mentioned above: I don't want to add a dependency on a library like ICU (and the hack is also very efficient).

OTOH, Perl has outstanding Unicode support. If you don't care about byte length, you can simply pick out the first 100 grapheme clusters with a regexp like this: /^\X{0,100}/. This handles the segmentation according to the TR-29 rules.

Although you're probably right for this use-case, in my experience byte-length limits are quite common and you have to deal with them somehow, ideally without causing weird artifacts. For example in nostr you have byte-size limits on note length, tag values, etc. Another tricky aspect is that theoretically grapheme clusters are unbounded in length. So a single "character" could take up gigabytes of encoded space -- worth keeping in mind due to the DoS risk.
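
To make the byte-limit problem concrete, here is a minimal C++ sketch of the simplest dependency-free approach: cut at a byte limit, but back up so the cut never lands inside a UTF-8 encoded code point. This is not the hack mentioned above, just an illustration. The function name truncate_utf8 is made up for this example, it assumes the input is already valid UTF-8, and it knows nothing about grapheme clusters.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper (name chosen for this example).
// Returns the longest prefix of `s` that is at most `max_bytes` long and
// does not end in the middle of a UTF-8 encoded code point.
// Assumption: `s` is valid UTF-8. Grapheme clusters are NOT respected.
std::string_view truncate_utf8(std::string_view s, std::size_t max_bytes) {
    if (s.size() <= max_bytes) {
        return s;  // already fits, nothing to cut
    }
    std::size_t end = max_bytes;
    // UTF-8 continuation bytes have the bit pattern 10xxxxxx. If the first
    // excluded byte is a continuation byte, the cut would split a code
    // point, so back up until `end` sits on a lead (or ASCII) byte.
    while (end > 0 && (static_cast<unsigned char>(s[end]) & 0xC0) == 0x80) {
        --end;
    }
    return s.substr(0, end);
}
```

Something like truncate_utf8(content, limit) always yields valid UTF-8 if the input was valid, but it can still cut a grapheme cluster in half, for example dropping a combining mark or splitting an emoji sequence joined with ZWJ. That is exactly the kind of weird artifact I mean, and avoiding it requires the TR-29 segmentation rules on top of this.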

Author Public Key

npub1yxprsscnjw2e6myxz73mmzvnqw5kvzd5ffjya9ecjypc5l0gvgksh8qud4

Seen on

wss://nos.lol wss://nostr.wine wss://relay.nostr.band wss://relay.primal.net

Published at

2024-12-20 19:18:47

Kind type

1 Short Text Note

Event JSON

{ "id": "f0cdb312e9c455bbdf03d572a1338ca44b882034e58d001fe6f0954d86e28048", "pubkey": "218238431393959d6c8617a3bd899303a96609b44a644e973891038a7de8622d", "created_at": 1734722327, "kind": 1, "tags": [ [ "client", "oddbean" ], [ "e", "ffca90d2cd45d2cd92d75f00f2eefa08a8ba7424a0adbd302c60ea549b11f6d2", "", "root" ], [ "e", "375cbc2a06d0f3243130ed881e6c3691d400eaf6845dd2fb22b72183f00e8184", "", "reply" ], [ "p", "218238431393959d6c8617a3bd899303a96609b44a644e973891038a7de8622d" ], [ "p", "4c800257a588a82849d049817c2bdaad984b25a45ad9f6dad66e47d3b47e3b2f" ] ], "content": "\u003e It might help you find a better alternative by searching for his recommendation for C/C++ projects needing to do this\n\nThanks, I will look into this. Go has much better Unicode support than C++, which basically doesn't have it at all, or rather, you need to pull in a library to do even basic things. This is why I'm doing the hack I mentioned above: I don't want to add a dependency on a library like ICU (also it is very efficient).\n\nOTOH, Perl has outstanding Unicode support. If you don't care about byte length, then you can simply pick out the first 100 grapheme clusters with a regexp like this: /^\\X{0,100}/. This will handle the segmentation according the the TR-29 rules.\n\nAlthough you're probably right for this use-case, in my experience byte-length limits are quite common and you have to deal with them somehow, ideally without causing weird artifacts. For example in nostr you have byte-size limits on note length, tag values, etc. Another tricky aspect is that theoretically grapheme clusters are unbounded in length. So a single \"character\" could take up gigabytes of encoded space -- worth keeping in mind due to the DoS risk.", "sig": "d40bde7fc86a400b0d07e8e0a9a733dedce7f525d65b4ccd214c9a34c55871df1b4e7393d8d44d4b322b4cf6b0be786c8404d98f945d820c4af0d96274756c64" }