Rusty Russell on Nostr:
Bad news first. The "bytes touched" model does not actually reflect real costs, even on the RPi 3 (the lowest machine specs I'm measuring). Why? Caching is real, and these kinds of linear operations, which Bitcoin Script consists of, are the ideal case for readahead, write combining and all the tricks modern CPUs employ. A bytes-touched model would make AND-in-place (A &= B, where A and B are 4MB long) twice as expensive as invert (A = ~A): on my laptop they're actually comparable.
The slowest operation? 1-bit shift in place! Doing this to two 4MB blobs: RPi3: 16ms, vs ADD (twice) at 12ms. Laptop: 2.0ms vs 1.1ms. This is because the data dependency cuts down on the ability to pipeline, as each value depends on the previous one. This is interesting, and more work is needed on finding the worst-case add for similar reasons: if you never carry (or, conversely, carry on every addition) the branch prediction will be correct, so I will be benchmarking different patterns (carry every second one? Every eighth?) to find the worst case in real life.
Given that we're actually CPU constrained, not memory bandwidth constrained, you might be able to guess the fastest operations: those optimized to use SIMD instructions! So memcmp, memcpy, memset etc. are blindingly fast: my simple loop-coded memset on the RPi3 takes 17ms for two 4MB blobs, but memset takes 4.1ms! (1.8ms vs 0.45ms on my laptop.)