Doug Hoyte /
npub1yxp…qud4
2024-12-20 19:18:47
in reply to nevent1q…s8qw

> It might help you find a better alternative by searching for his recommendation for C/C++ projects needing to do this

Thanks, I will look into this. Go has much better Unicode support than C++, which basically has none built in; you need to pull in a library to do even basic things. This is why I'm doing the hack I mentioned above: I don't want to add a dependency on a library like ICU (and the hack is also very efficient).

OTOH, Perl has outstanding Unicode support. If you don't care about byte length, you can simply pick out the first 100 grapheme clusters with a regexp like this: /^\X{0,100}/. This handles the segmentation according to the TR-29 rules.

Although you're probably right for this use-case, in my experience byte-length limits are quite common and you have to deal with them somehow, ideally without causing weird artifacts. For example, in Nostr you have byte-size limits on note length, tag values, etc. Another tricky aspect is that theoretically grapheme clusters are unbounded in length. So a single "character" could take up gigabytes of encoded space -- worth keeping in mind due to the DoS risk.
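
To make the byte-limit case concrete, here is a minimal C++ sketch of dependency-free truncation. It is not necessarily the hack mentioned above, just one common approach: cut at the byte limit, then back up so the cut never lands inside a UTF-8 encoded code point. The name truncate_utf8 is made up for illustration, and the input is assumed to be valid UTF-8. Note that it only respects code point boundaries, so it can still split a grapheme cluster (for example, separating a base letter from a combining mark).

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: returns the longest prefix of `s` that is at most
// `max_bytes` long and does not end in the middle of a UTF-8 code point.
// Assumes `s` is valid UTF-8. Does NOT respect grapheme cluster boundaries.
std::string_view truncate_utf8(std::string_view s, std::size_t max_bytes) {
    if (s.size() <= max_bytes) return s;  // already fits, nothing to cut

    std::size_t end = max_bytes;
    // UTF-8 continuation bytes have the bit pattern 10xxxxxx. If the first
    // byte we would drop is a continuation byte, the cut falls inside a code
    // point, so back up until the first dropped byte starts a code point.
    while (end > 0 && (static_cast<unsigned char>(s[end]) & 0xC0) == 0x80) {
        --end;
    }
    return s.substr(0, end);
}
```

Since a code point is at most 4 bytes, the loop backs up at most 3 bytes, so the cost stays constant no matter how large the input is. The returned view points into the original buffer, so the source string has to outlive it.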
Author Public Key
npub1yxprsscnjw2e6myxz73mmzvnqw5kvzd5ffjya9ecjypc5l0gvgksh8qud4