mgorny-nyan (he) :autism:🙀🚂🐧 /
npub1xcf…2zan
2024-07-10 07:02:01

Another curious #portability pitfall: the #UTF-16, UTF-32, UCS-2 and UCS-4 encodings are byte-order dependent. That is, each can be encoded as either big endian or little endian. #Python's plain `utf-16` and `utf-32` codecs use the host byte order when encoding, and write a byte order mark (BOM) at the beginning of the output. When decoding, they transparently read the BOM back to determine the byte order, so everything works fine out of the box.
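A minimal sketch of that round trip, using only the standard `codecs` and `sys` modules (the sample string and variable names are made up for illustration):

```python
import codecs
import sys

text = "zażółć"  # arbitrary sample text

# Plain "utf-16" encodes in host byte order and prepends a BOM.
data = text.encode("utf-16")

# Work out which BOM the host byte order implies.
host_bom = codecs.BOM_UTF16_LE if sys.byteorder == "little" else codecs.BOM_UTF16_BE
starts_with_host_bom = data.startswith(host_bom)

# Decoding with plain "utf-16" consumes the BOM and recovers the text.
round_tripped = data.decode("utf-16")
```

On any host, `starts_with_host_bom` should be true and `round_tripped` should equal `text`; only the bytes in `data` differ between platforms.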

Problems start when you compare the exact byte-level output, e.g. by comparing UTF-16 bytes read from a file with the result of `encode()`. If the file was written on a little endian system (which is commonly the case), and the test is running on a big endian system, you're suddenly going to get different byte strings!
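The mismatch can be reproduced on a single machine by building both byte orders by hand. This is only a sketch: `written_on_le` and `written_on_be` are hypothetical stand-ins for file contents produced on little endian and big endian hosts.

```python
import codecs

text = "abc"  # arbitrary sample text

# Simulate what plain "utf-16" would have written on each kind of host.
written_on_le = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
written_on_be = codecs.BOM_UTF16_BE + text.encode("utf-16-be")

# Both decode to the same text, because the BOM tells the decoder the order.
same_text = written_on_le.decode("utf-16") == written_on_be.decode("utf-16") == text

# At the byte level, though, the host's own output matches only one of them.
host_encoded = text.encode("utf-16")
matches_le = host_encoded == written_on_le
matches_be = host_encoded == written_on_be
```

Exactly one of `matches_le` and `matches_be` is true, depending on the machine running the code, which is why a byte-level test that passes on one architecture can fail on another.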

The "obvious" way to solve this is to force a specific endianness, e.g. use `utf-16-le` rather than plain `utf-16`. However, when you force the endianness, the BOM is no longer written, so the byte-level data now mismatches on the missing BOM. The trick is to add the BOM (`\ufeff`) straight into the #unicode string.
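A short sketch of the trick, assuming the reference data was written on a little endian system (the sample string and variable names are illustrative, not taken from the linked change):

```python
import codecs

text = "abc"  # arbitrary sample text

# Forcing the byte order drops the BOM entirely.
no_bom = text.encode("utf-16-le")
has_bom = no_bom.startswith(codecs.BOM_UTF16_LE)

# Prepending U+FEFF to the string puts the BOM back, in the forced byte order.
portable = ("\ufeff" + text).encode("utf-16-le")

# The result is identical on every host and still decodes with plain "utf-16".
decoded = portable.decode("utf-16")
```

Here `has_bom` is false, while `portable` is `b"\xff\xfea\x00b\x00c\x00"` regardless of the host byte order, and `decoded` equals `text`. Note that decoding `portable` with `utf-16-le` instead would keep the `\ufeff` character at the start of the string.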

https://github.com/python/importlib_resources/pull/313/files

#Gentoo
Author Public Key
npub1xcf8c45mvdddthcrfzdh066wlrp5t0hqy9kr27ey02h2n9vsk2rqnt2zan