Some nice xpay bug reports coming in, from real usage. A nice "please submit a bug report" message came in (obviously, a case I thought would never get hit!).

So my weekend (remember how I said I wouldn't be working all hours now the release is close? Ha!) has been occupied thinking about this.

On the surface, this happens when we exactly fill the maximum capacity of a local channel, then go to add fees and can't fit (if we hit htlc max, we split into two htlcs for this case). We should go back and ask our min-cost-flow solver for another route for the part we can't afford. This is almost certain to fail, though, because there was a reason we were trying to jam the entire thing down that one channel.
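
A rough sketch of that surface-level corner case, just to make the arithmetic concrete. The function name, the msat figures and the error text are all made up for illustration; this is not how xpay actually does its accounting.

```python
def total_with_fees(deliver_msat: int, fee_msat: int, spendable_msat: int) -> int:
    """Return what the first-hop HTLC must carry, or raise if it cannot fit.

    Hypothetical helper: the solver picked `deliver_msat` for this channel
    without fees, and the fees get added on top afterwards.
    """
    total = deliver_msat + fee_msat
    if total > spendable_msat:
        raise ValueError(
            f"cannot add fees: need {total} msat, channel allows {spendable_msat} msat"
        )
    return total


spendable = 1_000_000  # illustrative capacity of the local channel, in msat

# Normal case: the flow leaves headroom, so the fees fit.
print(total_with_fees(900_000, 1_500, spendable))

# Corner case: the flow exactly fills the channel, so any fee overflows it.
try:
    total_with_fees(spendable, 1_500, spendable)
except ValueError as err:
    print(err)
```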

But what's more interesting is what's actually happening: something I managed to accidentally trigger in CI for *another* test. See, we fail a payment at the time we get the peer's sig on the tx with the HTLC removed. But after that, there's another round trip while we clear the HTLC from the peer's tx. The funds in flight aren't *really* available again until that completes.
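
To picture the gap, here is a deliberately simplified model of an outgoing HTLC being removed. The three stage names are invented for this sketch (the real state machine has more steps and different names); the only point is that "failure reported" and "funds reusable" become true at different stages.

```python
from enum import Enum, auto


class RemovalStage(Enum):
    """Hypothetical, collapsed view of an outgoing HTLC's removal."""
    LIVE = auto()                  # HTLC still present in both commitment txs
    REMOVED_FROM_OUR_TX = auto()   # we hold the peer's sig on our tx without it
    REMOVED_FROM_PEER_TX = auto()  # one more round trip: gone from their tx too


def failure_reported(stage: RemovalStage) -> bool:
    # The payment is failed as soon as the HTLC is gone from our own tx.
    return stage is not RemovalStage.LIVE


def funds_reusable(stage: RemovalStage) -> bool:
    # The funds only really come back once the peer's tx is clean as well.
    return stage is RemovalStage.REMOVED_FROM_PEER_TX


for stage in RemovalStage:
    print(stage.name, failure_reported(stage), funds_reusable(stage))
```

The middle stage is the awkward window: the failure is already visible to the caller, but a new HTLC using those funds can still be refused.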

This matters for xpay, which tends to respond to failure by throwing another payment out. This can fail because the previous one hasn't totally finished (in my test, it wasn't out of capacity, but actually hit the total dust limit, but it's the same effect: gratuitous failure on the local channel). Xpay assumes the previous failure is caused by capacity limits, and reduces the capacity estimate of the local channel (it should know the capacity, but other operations or the peer could change it, so it tries not to assume).

Eventually, this capacity estimate becomes exactly the payment we are trying to make, and we hit the "can't add fees" corner case.

There are four ways to fix this:
1. Allow adding a new htlc while the old one is being removed. This seems spec-legal but in practice would need a lot of interop testing.
2. Don't fail htlcs until they're completely cleared. But the sooner we report failure, the sooner we can start calculating more routes.
3. If a local error happens, wait until htlcs are fully clear and try again.
4. Wait inside "injectpaymentonion" until htlcs are clear.

We're at rc2, so I'm going mid-brain on this: wait for a second and retry if this happens! Polling on channel htlcs is possible, but won't win much for this corner case.
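
The shape of that stopgap, as a minimal sketch. The names `LocalChannelError` and `send_with_one_retry` are hypothetical, and a single retry is my assumption here; the real change lives in C inside the plugin, not in Python.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class LocalChannelError(Exception):
    """Hypothetical marker for a failure on our own (first-hop) channel."""


def send_with_one_retry(
    send: Callable[[], T],
    delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try once; on a local-channel failure, wait and try exactly once more."""
    try:
        return send()
    except LocalChannelError:
        # Give the previous HTLC time to finish clearing, then retry.
        sleep(delay_s)
        return send()


# Demo with a fake sender that fails the first time only.
attempts = []


def fake_send() -> str:
    attempts.append(1)
    if len(attempts) == 1:
        raise LocalChannelError("previous htlc still clearing")
    return "sent"


# A no-op sleep keeps the demo instant; real use would keep time.sleep.
print(send_with_one_retry(fake_send, sleep=lambda s: None))
```

Blind waiting is cruder than polling the channel's htlcs, but it is a handful of lines, which is about right this close to a release.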

Longer term, inject could efficiently retry (it can trigger on the htlc vanishing, as it's inside lightningd). But that's more code and nobody will ever care.

Rusty Russell on Nostr: Some nice xpay bug reports coming in, from real usage. A nice "please submit a bug ...