Another curious #portability pitfall: the #UTF-16, UTF-32, UCS-2 and UCS-4 encodings are byte-order dependent. That is, they can be encoded as either big endian or little endian. #Python uses the host byte order when encoding, and writes a Byte Order Mark (BOM) at the beginning of the output. When decoding, it transparently reads the BOM back to determine the byte order, so everything works fine out of the box.
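A minimal sketch of that round trip (the string is just a placeholder):

```python
import sys

text = "nyan"  # placeholder string, any text works
data = text.encode("utf-16")  # host byte order, BOM prepended

print(sys.byteorder)          # "little" or "big"
print(data[:2])               # b'\xff\xfe' on little endian, b'\xfe\xff' on big
print(data.decode("utf-16"))  # BOM is read back transparently -> "nyan"
```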
Problems start happening when you compare the exact byte-level output, e.g. by comparing UTF-16 bytes read from a file with the result of `encode()`. If the file was written on a little endian system (which is commonly the case) and the test is running on a big endian system, you're suddenly going to get different byte sequences!
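For example, a test along these lines (file name hypothetical) passes on little endian hosts and fails on big endian ones:

```python
# The file holds BOM + little-endian payload, written on a LE machine.
with open("data-utf16.txt", "rb") as f:
    on_disk = f.read()

# encode() uses the *host* byte order, so this only matches on LE hosts.
assert on_disk == "nyan".encode("utf-16")
```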
The "obvious" way to solve this is to force a specific endianness, e.g. use `utf-16-le` rather than plain `utf-16`. However, when you force endianness, BOM is no longer used — so the byte-level data mismatches on the missing BOM now. The trick is, to add the BOM (`\ufeff`) straight into the #unicode string.
https://github.com/python/importlib_resources/pull/313/files
#Gentoo