npub179e…lz4s

2024-06-24 01:19:23

in reply to nevent1q…5a3s

Bad news first. The "bytes touched" model does not actually reflect real costs, even on the RPi 3 (the lowest-spec machine I'm measuring). Why? Caching is real, and the kind of linear operations that Bitcoin Script consists of are the best case for readahead, write combining and all the other tricks modern CPUs employ. A bytes-touched model would have AND-in-place (A &= B, where A and B are 4MB long) be twice as expensive as invert (A = ~A): on my laptop they're actually comparable.
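To make the comparison concrete, here is a minimal sketch of the two operations as plain byte loops. The function names are hypothetical and this is not the benchmark code itself; it only shows why a bytes-touched model charges AND double (it reads two buffers) while invert reads one.

```c
#include <stddef.h>
#include <stdint.h>

/* A &= B: reads A and B, writes A. A bytes-touched model counts 2n bytes. */
static void and_in_place(uint8_t *a, const uint8_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] &= b[i];
}

/* A = ~A: reads and writes only A. A bytes-touched model counts n bytes. */
static void invert_in_place(uint8_t *a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = (uint8_t)~a[i];
}
```

Both loops walk memory linearly with no dependency between iterations, which is the access pattern that prefetching and vectorization handle best.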

The slowest operation? A 1-bit shift in place! Doing this to two 4MB blobs takes 16ms on the RPi 3, vs 12ms for ADD (twice); on my laptop it's 2.0ms vs 1.1ms. This is because the data dependency cuts down on the ability to pipeline, as the next value depends on the previous one. This is interesting, and more work is needed to find the worst-case add for similar reasons: if the carry pattern never changes (e.g. you never carry, or you overflow on every addition), the branch prediction will always be correct, so I will be benchmarking different patterns (carry every second one? Every eighth?) to find the worst case in real life.
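A rough sketch of where the dependency comes from, assuming little-endian byte strings and hypothetical function names (the real benchmark code may work on wider words and look quite different). In both loops a carry value produced by one iteration is consumed by the next.

```c
#include <stddef.h>
#include <stdint.h>

/* Shift a little-endian byte string left by one bit, in place.
 * Returns the bit shifted out of the top byte. */
static unsigned shift_left_1(uint8_t *a, size_t n)
{
    unsigned carry = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned next = a[i] >> 7;              /* bit leaving this byte */
        a[i] = (uint8_t)((a[i] << 1) | carry);  /* pull in previous bit */
        carry = next;                           /* feeds the next iteration */
    }
    return carry;
}

/* A += B on little-endian byte strings, in place. Returns the final carry.
 * The explicit `if` is written as a branch on purpose; whether a real
 * branch survives depends on the compiler and flags. */
static unsigned add_in_place(uint8_t *a, const uint8_t *b, size_t n)
{
    unsigned carry = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned sum = (unsigned)a[i] + b[i];
        if (carry)
            sum++;
        a[i] = (uint8_t)sum;
        carry = sum >> 8;
    }
    return carry;
}
```

Feeding `add_in_place` inputs whose carries alternate in different patterns is one way to probe how much the carry pattern matters on a given machine.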

Given that we're actually CPU constrained, not memory-bandwidth constrained, you might be able to guess the fastest operations: those optimized to use SIMD instructions! So memcmp, memcpy, memset etc. are blindingly fast: my simple loop-coded memset on the RPi 3 takes 17ms for two 4MB blobs, but the library memset takes 4.1ms! (1.8ms vs 0.45ms on my laptop.)
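For anyone wanting to try this comparison, a minimal sketch follows. The helper names (`loop_memset`, `lib_memset`, `time_fill_ms`) are made up for illustration, and `clock()` is only a coarse CPU timer, so a single call should be repeated many times on large buffers before trusting the numbers.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

typedef void (*fill_fn)(uint8_t *dst, uint8_t value, size_t n);

/* Naive byte-at-a-time fill. Note that an optimizing compiler may
 * recognize this loop and replace it with a call to memset. */
static void loop_memset(uint8_t *dst, uint8_t value, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = value;
}

/* Thin wrapper so the library memset matches the fill_fn signature. */
static void lib_memset(uint8_t *dst, uint8_t value, size_t n)
{
    memset(dst, value, n);
}

/* CPU time in milliseconds for one call of fn on buf. */
static double time_fill_ms(fill_fn fn, uint8_t *buf, uint8_t value, size_t n)
{
    clock_t start = clock();
    fn(buf, value, n);
    clock_t end = clock();
    return 1000.0 * (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling `time_fill_ms(loop_memset, buf, 0, len)` and `time_fill_ms(lib_memset, buf, 0, len)` on the same 4MB buffer gives a rough version of the comparison; checking the generated assembly is the only sure way to know the loop was not itself turned into memset.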

Author Public Key

npub179e9tp4yqtqx4myp35283fz64gxuzmr6n3yxnktux5pnd5t03eps0elz4s

Kind type

1 Short Text Note
