Jonathan Underwood [ARCHIVE] on Nostr
Original date posted: 2017-12-11
Original message:
ZmnSCPxj,
Thanks for the reply.
1. I agree that naive implementations that do not follow the UTF-8 spec
should perhaps be mentioned, to help people avoid mojibake.
However, this problem is as old as the internet itself (there are
still Japanese websites in Shift-JIS encoding), and I think the consensus is
that UTF-8 is the standard. Allowing it in the description is fine as long
as readers can decode the raw UTF-8 data correctly.
2. Normalization is only an issue when you need to hash something or
compare hashes, so it is not an issue for the description, where the goal
is to relay information to the user.
(So it could become an issue for the 256-bit description of purpose
of payment (SHA256), but not for the simple description.)
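
To illustrate the point about hashing, here is a minimal Python sketch. The two strings are an assumed example (a precomposed kana versus the same kana written as base character plus combining mark); they are canonically equivalent and display identically, yet their UTF-8 bytes and SHA-256 hashes differ.

```python
import hashlib
import unicodedata

# Assumed example: two spellings of the same visible kana.
composed = "\u304c"          # single precomposed code point
decomposed = "\u304b\u3099"  # base kana followed by a combining voiced mark

composed_bytes = composed.encode("utf-8")
decomposed_bytes = decomposed.encode("utf-8")

# The strings are canonically equivalent: NFC maps one onto the other.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed

# The raw bytes differ, so any hash over them differs as well.
hash_composed = hashlib.sha256(composed_bytes).hexdigest()
hash_decomposed = hashlib.sha256(decomposed_bytes).hexdigest()

print("equivalent after NFC:", same_after_nfc)
print("same bytes:", composed_bytes == decomposed_bytes)
print("same hash:", hash_composed == hash_decomposed)
```

For a description that is only shown to the user, this difference is invisible; it only matters once the bytes are hashed or compared.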
Regarding the purpose commit hash: since there is no specified way
to relay the data behind it, you could solve this by
specifying that the data is fetched from a URL and delivered as
a binary stream containing the exact bytes that were hashed; the UTF-8 parser
in the receiving app then just displays it to the user. (Since the goal is
to display info to the user, it doesn't matter whether a word uses one
byte combination or another, as long as you can verify that the commit hash
matches the hash of the data and can then display the data to the user.)
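
A rough sketch of that receiving-side flow in Python, under stated assumptions: `verify_and_render` is a hypothetical helper name, the commit hash is taken to be a hex-encoded SHA-256 of the exact bytes, and fetching the bytes from a URL is left out.

```python
import hashlib
import hmac


def verify_and_render(data: bytes, commit_hash_hex: str) -> str:
    """Check the received bytes against the commit hash, then decode for display.

    Hypothetical helper: no normalization is applied, because the hash
    covers the exact bytes and the text is only shown to the user.
    """
    actual = hashlib.sha256(data).hexdigest()
    if not hmac.compare_digest(actual, commit_hash_hex.lower()):
        raise ValueError("data does not match the commit hash")
    # Replace undecodable bytes rather than failing; this is display only.
    return data.decode("utf-8", errors="replace")


# Assumed example payload, standing in for bytes fetched by the app.
purpose = "コーヒー 1杯".encode("utf-8")
commit = hashlib.sha256(purpose).hexdigest()
print(verify_and_render(purpose, commit))
```

The hash check runs on the bytes as received, before any decoding, so the choice of code points made by the sender never affects verification.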
Protocols like BIP39 require normalization because when a user inputs data,
they could be using any of millions of IMEs that might use different Unicode
code points to represent the same data as other IMEs... and we need to
ensure that the same human-readable phrase ALWAYS produces the same hash, or else
money is lost (or hard to get to).
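
As a sketch of why normalizing first helps: BIP39 specifies NFKD normalization of the mnemonic and passphrase before PBKDF2-HMAC-SHA512 with 2048 iterations and the salt "mnemonic" plus the passphrase. The function below follows that recipe but skips wordlist and checksum validation, and the two input phrases are assumed stand-ins for what two different IMEs might produce, not real mnemonics.

```python
import hashlib
import unicodedata


def mnemonic_to_seed(mnemonic: str, passphrase: str = "") -> bytes:
    """Derive a 64-byte seed, normalizing to NFKD before hashing.

    Sketch only: no wordlist or checksum validation is performed.
    """
    norm_mnemonic = unicodedata.normalize("NFKD", mnemonic)
    norm_passphrase = unicodedata.normalize("NFKD", passphrase)
    salt = ("mnemonic" + norm_passphrase).encode("utf-8")
    return hashlib.pbkdf2_hmac(
        "sha512", norm_mnemonic.encode("utf-8"), salt, 2048
    )


# Assumed example: the same visible text typed via two different IMEs.
phrase_from_ime_a = "\u304c"          # precomposed code point
phrase_from_ime_b = "\u304b\u3099"    # base character + combining mark

seed_a = mnemonic_to_seed(phrase_from_ime_a)
seed_b = mnemonic_to_seed(phrase_from_ime_b)
print("same seed:", seed_a == seed_b)
```

Without the normalization step the two inputs would hash to different seeds, which is exactly the lost-money case described above.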
For the description: "OMG I LOST 100 BTC BECAUSE THE DESCRIPTION SAID が instead
of が !!!" will never happen.
I would love to hear your thoughts on other aspects as well.
Thanks,
Jon
2017-12-11 20:10 GMT+09:00 ZmnSCPxj <ZmnSCPxj at protonmail.com>:
> Good morning Jonathan,
>
> >3. Descriptions say they can encode ASCII only. Sorry, but this is
> nonsense. Full unicode support via UTF8 should be supported.
>
> I generally agree, but caution is warranted here. In particular, we
> should be precise about which variant of UTF8.
>
> Presumably, a naive implementation that treats 0 bytes specially (as
> would happen if the implementation were naively written in C or C++, where
> strings are by default terminated by a 0 byte) should work correctly
> without having to care whether the encoding is UTF8 or plain 7-bit
> ASCII. This leads to the use of so-called Modified UTF8, as used by
> Java in its native interface: embedded null characters are encoded as
> overlong 2-byte UTF8 sequences (0xC0 0x80), which are normally invalid
> in UTF8, but which naive C and C++ string handling treats (mostly)
> correctly. Should we use Modified UTF8 or simply disallow null
> characters? (Using ASCII does not avoid this, since ASCII has no
> alternative encoding for the null character, which is also the standard
> C string terminator.)
>
> In addition, pulling in UTF8 brings in the issue of Unicode
> normalization. Multiple different byte sequences in UTF8 may render as the
> same sequence of human-readable glyphs. Specifying ASCII avoids this
> issue. Should we specify some Unicode normalization, and should GUIs at
> least try to impose it (even if backends/daemons
> simply ignore the description and hence any normalization issues)?
>
> Regards,
> ZmnSCPxj
>
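
The null-byte question in the quoted message can be sketched in Python. Modified UTF-8 as used by Java writes U+0000 as the two bytes 0xC0 0x80, so the encoded form contains no 0 byte. The helper names below are hypothetical, and the sketch covers only the null-character rule, not Modified UTF-8's separate handling of characters outside the Basic Multilingual Plane.

```python
def encode_with_modified_nulls(text: str) -> bytes:
    """Encode as UTF-8, but write U+0000 as the overlong pair C0 80.

    In standard UTF-8 the byte 0x00 only ever encodes U+0000, so a plain
    byte replacement is safe here.
    """
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")


def is_strict_utf8(data: bytes) -> bool:
    """Return True if a strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


description = "pay\x00load"  # assumed example with an embedded null
standard = description.encode("utf-8")
modified = encode_with_modified_nulls(description)

print(b"\x00" in standard, is_strict_utf8(standard))
print(b"\x00" in modified, is_strict_utf8(modified))
```

The trade-off is visible in the two results: standard UTF-8 stays valid but would be truncated by C string handling, while the modified form survives C strings but is rejected by a strict UTF-8 decoder.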
--
-----------------
Jonathan Underwood
bitbank, Inc.  Chief Bitcoin Officer
-----------------
If you are sending an encrypted message, please use the public key below.
Fingerprint: 0xCE5EA9476DE7D3E45EBC3FDAD998682F3590FEA3