Doug Hoyte on Nostr: > It might help you find a better alternative by searching for his recommendation for ...
> It might help you find a better alternative by searching for his recommendation for C/C++ projects needing to do this
Thanks, I will look into this. Go has much better Unicode support than C++, which basically doesn't have it at all, or rather, you need to pull in a library to do even basic things. This is why I'm doing the hack I mentioned above: I don't want to add a dependency on a library like ICU (also it is very efficient).
OTOH, Perl has outstanding Unicode support. If you don't care about byte length, then you can simply pick out the first 100 grapheme clusters with a regexp like this: /^\X{0,100}/. This will handle the segmentation according to the TR-29 rules.
Although you're probably right for this use-case, in my experience byte-length limits are quite common and you have to deal with them somehow, ideally without causing weird artifacts. For example in nostr you have byte-size limits on note length, tag values, etc. Another tricky aspect is that theoretically grapheme clusters are unbounded in length. So a single "character" could take up gigabytes of encoded space -- worth keeping in mind due to the DoS risk.
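To make the byte-limit case concrete, here is a minimal C++17 sketch of one dependency-free approach: cut at the byte limit, then back up until the cut no longer falls inside a UTF-8 code point. This is not necessarily the hack mentioned above; the function name truncate_utf8 is hypothetical, and the code assumes the input is already valid UTF-8. It respects code point boundaries only, not grapheme clusters, so it can still separate a combining mark from its base character.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper (not from any library): returns the longest prefix of
// `s` that is at most `max_bytes` long and does not end in the middle of a
// UTF-8 code point. Assumes `s` is valid UTF-8. Grapheme cluster boundaries
// are NOT considered.
std::string_view truncate_utf8(std::string_view s, std::size_t max_bytes) {
    if (s.size() <= max_bytes) return s;

    std::size_t end = max_bytes;
    // UTF-8 continuation bytes have the form 10xxxxxx. If the first byte we
    // are about to drop is a continuation byte, the cut is inside a code
    // point, so step back until the first dropped byte starts a code point.
    while (end > 0 && (static_cast<unsigned char>(s[end]) & 0xC0) == 0x80) {
        --end;
    }
    return s.substr(0, end);
}
```

For valid input the loop steps back at most 3 bytes, so the cost stays constant no matter how long the string or its final grapheme cluster is. The result is a view into the original buffer, so that buffer has to outlive it.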