Rusty Russell on Nostr:
Great report on channeld using 100% CPU and getting laggy on giant nodes. A random backtrace via `gdb -ex` showed memmove in the to-gossipd queue.
Ok, it's a bad implementation for big queues, which is easy to fix, but I also added a backtrace the first time we surpass 100k entries.
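The pathology is the classic array-backed FIFO that shifts every remaining entry down with memmove on each pop, so draining a big queue goes quadratic. A hypothetical sketch (not CLN's actual msg_queue code) of the slow pattern next to the easy head-index fix:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical illustration, not CLN's actual msg_queue code.
 * Slow pattern: pop from the front of a flat array by shifting
 * every remaining entry down -- O(n) per pop, O(n^2) to drain. */
struct slow_queue {
	int items[100000];
	size_t len;
};

static int slow_pop(struct slow_queue *q)
{
	int v = q->items[0];
	q->len--;
	/* This memmove is what a random backtrace catches on big queues. */
	memmove(q->items, q->items + 1, q->len * sizeof(q->items[0]));
	return v;
}

/* Easy fix: track a head index instead of shifting,
 * so each pop is O(1). */
struct fast_queue {
	int items[100000];
	size_t head, len;
};

static int fast_pop(struct fast_queue *q)
{
	int v = q->items[q->head++];
	q->len--;
	if (q->len == 0)
		q->head = 0;	/* free compaction once drained */
	return v;
}
```

Both versions return the same values; only the cost per pop differs, which is why the fix is invisible to callers.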
This hit *gossipd* first: and, because it was the queue to the main daemon and I'm an idiot (I wrote the backtrace before setting the "done-once" flag), it recursed infinitely.
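The ordering bug is easy to reproduce in miniature: the warning itself enqueues onto the very queue being checked, so the guard flag must be set *before* doing the work, or the check re-enters itself until the stack blows. A hypothetical sketch (names are illustrative, not CLN's code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical "warn once" guard.  Writing the backtrace adds
 * entries to the queue being warned about, so setting the flag
 * AFTER warning means warn -> enqueue -> check -> warn -> ...
 * recurses forever. */
static bool warned;
static size_t queue_len;
static int warn_calls;

static void enqueue(size_t n);

static void check_queue(void)
{
	if (queue_len > 100000 && !warned) {
		/* Correct order: set the flag BEFORE writing the
		 * backtrace, since writing it grows the queue and
		 * re-triggers this check. */
		warned = true;
		warn_calls++;
		enqueue(10);	/* the backtrace message itself */
	}
}

static void enqueue(size_t n)
{
	queue_len += n;
	check_queue();
}
```

Swapping the `warned = true;` line below the `enqueue(10);` call turns this into unbounded recursion, which is exactly the gossipd failure described above.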
But the root of *this* was a known issue: we spam the logs when we get lots of gossip. That's why a log level below "debug", "trace", was added a while ago.
But we had explicit logic to stop debug messages from making the queue too long, which *wasn't* extended to trace! One-line fix.
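The suppression is conceptually a one-line predicate: drop low-priority log messages once the status queue is already long. A hypothetical sketch (level names are illustrative, not CLN's actual identifiers) of the bug and the fix:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical log levels, ordered from least to most important. */
enum level { TRACE, DEBUG, INFO };

#define QUEUE_LIMIT 1000

/* Buggy version: only DEBUG was suppressed when the queue grew,
 * so a flood of TRACE messages could still blow the queue up. */
static bool should_drop_buggy(enum level l, size_t queue_len)
{
	return l == DEBUG && queue_len > QUEUE_LIMIT;
}

/* One-line fix: anything at or below DEBUG (i.e. TRACE too) is
 * droppable once the queue is too long. */
static bool should_drop_fixed(enum level l, size_t queue_len)
{
	return l <= DEBUG && queue_len > QUEUE_LIMIT;
}
```

The fix is literally changing `==` to `<=`, which matches the "one-line fix" above.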
Now, I went to look at the original issue: connectd flooding gossipd and slowing down. This is definitely possible, if we get a lot of gossip at startup: gossipd may have its hands full keeping up. But actually, I'm pretty sure what was happening was gossipd running slow *because* of the slow performance of its own queue to lightningd!
So there's a trivial commit at the end of the series which raises the queue-size threshold for reporting an issue to 250k, and explains why the other commits (which didn't touch channeld) actually fixed the issue!