Truncating text is complicated.
Today I spent some time fixing some bugs on oddbean.com that I've been putting off for a while. Most just involved uninteresting grunt work, but one of them is a huge rabbit hole and, if you've never thought about it before, you may be surprised at how deep it goes.
On Oddbean, we only show the first ~100 characters of a nostr note and then cut it off ("truncate" it). This is all well and good, except some titles got an unexpected weird character at the end:
Nostr Advent Calendar 2024 の 11 日目の記事を書きました。 2024年のNostrリレ�…
Now, I'm no expert on Japanese script but I'm pretty sure that diamond question mark character is not supposed to be there. What gives?
The answer is that almost all text on the web is encoded in UTF-8, which is a multi-byte Unicode encoding. That means these Japanese characters actually take up 3 bytes each, unlike Latin letters, which take up 1. Oddbean was taking the first 100 bytes and cutting the string off there. Unfortunately, that can leave an incomplete UTF-8 sequence at the end, which the browser replaces with a special replacement character (U+FFFD, the diamond question mark).
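Here's a minimal C++ sketch of the bug (my own illustrative example, not Oddbean's actual code): substr() counts bytes, not code points, so a byte-based cut can land in the middle of a character.

    #include <iostream>
    #include <string>

    int main() {
        // U+306E HIRAGANA LETTER NO ("の") is 3 bytes in UTF-8: E3 81 AE
        std::string s = "\xE3\x81\xAE";

        // Cutting after 2 bytes keeps E3 81 -- an incomplete sequence
        std::string cut = s.substr(0, 2);

        // A UTF-8 terminal or browser renders this as U+FFFD (�)
        std::cout << cut << "\n";
    }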
OK, easy fix right? Just do substr() on the code points (not the UTF-8 bytes). Sure, but that is quite inefficient, requiring a pass over the data just to find the cut point. Fortunately there is a more efficient way to fix this that relies on the fact that UTF-8 is a self-synchronising code, meaning you can always find the nearest code point boundary no matter where in the string you jump to. So that is what I did:
https://github.com/hoytech/strfry/blob/863edcff17834af5f51654b546e34de965382756/src/apps/web/WebUtils.h#L120-L127
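The trick works because every UTF-8 continuation byte has the form 10xxxxxx, while lead bytes and plain ASCII never do. Roughly, the idea looks like this (a simplified sketch of the approach, not the exact code behind the link):

    #include <cstddef>
    #include <string_view>

    std::string_view truncateUtf8(std::string_view s, std::size_t maxBytes) {
        if (s.size() <= maxBytes) return s;
        std::size_t n = maxBytes;
        // s[n] is the first byte we'd drop; if it's a continuation byte
        // (10xxxxxx), the cut would split a code point, so back up until
        // we land on a lead byte, i.e. a code point boundary.
        while (n > 0 && (static_cast<unsigned char>(s[n]) & 0xC0) == 0x80) n--;
        return s.substr(0, n);
    }

Since a code point is at most 4 bytes, the loop backs up at most 3 bytes no matter how long the string is: O(1) instead of a full scan.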
Problem solved right? Well, that depends on your definition of "solved". Notice above I've been referring to "code points" instead of characters? In many languages such as English we can pretty much get away with considering these the same. However in other scripts this is not the case.
Sometimes what we think of as a character can actually require multiple code points. For example, the character 'â' can be represented as 'a' followed by a special ' ̂' combining character. Most common characters such as â *also* have dedicated code points, and which representation is used depends on the Unicode Normal Form. You may also have seen country flags built from two regional-indicator code points, or emoji variations such as skin tone -- it's the same principle. Cutting between such code points will cause truncation artifacts.
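For instance, here are the two encodings of 'â' side by side (the byte values are the standard UTF-8 encodings). Notice that cutting the decomposed form on a code point boundary silently drops the accent, with no U+FFFD to warn you:

    #include <iostream>
    #include <string>

    int main() {
        std::string nfc = "\xC3\xA2";   // U+00E2, precomposed 'â' (NFC form)
        std::string nfd = "a\xCC\x82";  // 'a' + U+0302 combining circumflex (NFD form)

        std::cout << nfc.size() << " vs " << nfd.size() << " bytes\n"; // 2 vs 3

        // Both render identically, but a cut after byte 1 of the NFD form
        // is perfectly valid UTF-8 that displays as a bare 'a'
        std::cout << nfd.substr(0, 1) << "\n";
    }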
So rather than "character" (which is an imprecise notion), Unicode refers to Extended Grapheme Clusters, which correspond as closely as possible with what we think of as individual atoms of text. You can read more than you ever wanted to know about this here: https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
Note that many languages need special consideration when cutting on graphemes (or indeed words, lines, etc.). Korean Hangul is especially interesting, having been designed rather than evolved like most writing systems -- in fact it's quite elegant!
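Hangul makes the code point/grapheme gap very concrete: a syllable block like 한 exists both precomposed (one code point) and as three conjoining jamo (three code points, but still one grapheme cluster). A quick sketch, with byte values straight from the UTF-8 tables:

    #include <iostream>
    #include <string>

    int main() {
        std::string nfc = "\xED\x95\x9C";  // U+D55C, precomposed syllable 한
        std::string nfd = "\xE1\x84\x92"   // U+1112 choseong ᄒ
                          "\xE1\x85\xA1"   // U+1161 jungseong ᅡ
                          "\xE1\x86\xAB";  // U+11AB jongseong ᆫ

        std::cout << nfc.size() << " vs " << nfd.size() << " bytes\n"; // 3 vs 9

        // Cutting the decomposed form at byte 6 lands on a code point
        // boundary, yet turns 한 ("han") into 하 ("ha") -- a different syllable
        std::cout << nfd.substr(0, 6) << "\n";
    }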
So my hack for Oddbean doesn't do all this fancy grapheme truncation, and that's because I know if I tried I would end up in a seriously deep rabbit hole. I know because I have and I did! 10 years ago I published the following Perl module: https://metacpan.org/pod/Unicode::Truncate
I'm pretty proud of this yak shave, because of the implementation. I was able to adapt the regular expressions from Unicode TR29, compose them with a UTF-8 regular expression, and compile it all with the Ragel state machine compiler ( https://www.colm.net/open-source/ragel/ ). As a result, it can both validate UTF-8 and (correctly!) truncate in a single pass.
If you want (a lot) more Unicode trivia, I also made a presentation on this topic: https://hoytech.github.io/truncate-presentation/