Octez 16.1 archive nodes stall under load when merging store

Archive nodes that are under heavy RPC load while a store merge is running (1) time out all RPC requests, (2) stop processing blocks, and (3) take more than 1 hour to complete the store merge. The actual cause is unclear to me, but I have observed this twice in the past week. It always happens at cycle end, when (a) the store is merged and (b) we send a large number of concurrent RPC requests (16k per backend to fetch endorsing rights block by block, 4M+ to validate account balances). Not all archive nodes get stuck, and the stall does not start immediately at cycle end but a few blocks into the next cycle. It is likely a race or livelock between the merge and RPC-related database lookups. The first time, I killed the node, which corrupted storage, but repair was possible. The second time, I let it run and the merge eventually completed after 1 hr 13 min. The node log is attached.

node-stalled-2023-04-13T03_36_35.000Z-2023-04-13T04_49_55.000Z.txt