
Prevent the DDB from freezing the node

Context

When a node lags a few blocks behind, e.g. when a costly sequential RPC is called and freezes the node for several minutes, the node, while trying to catch up, processes its backlog of peer messages, which contain operations that are no longer available to its peers (e.g. preendorsements). As a result, the set of pending distributed database (DDB) requests grows significantly. Processing this set is then costly and sequential, with the side effect that only one request is removed from the set of pending requests at a time instead of a complete batch.

This MR adds a throttler (implemented as a short sleep) to the clean-up mechanism so that request clean-ups are batched. It also reduces the number of iterations by computing the next timeout during the clean-up pass.

Lastly, while catching up, a node is still considered bootstrapped and handles every backed-up Current_head message (there are many of them), which results in computing and exchanging a locator with its peers for each message. We now ignore these messages while catching up.
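To illustrate the batched clean-up, here is a minimal sketch in plain OCaml. It is not the code of this MR: the `pending` table, `cleanup_pass`, `next_wakeup` and the `throttle` parameter are hypothetical names, and the real DDB requester is Lwt-based and structured differently. The sketch only shows the two ideas: expired requests are removed and the next timeout is computed in the same traversal, and the next wake-up is delayed by a short throttle so that requests expiring close together are cleaned up as one batch.

```ocaml
(* Hypothetical model of the pending-request set: request key -> absolute
   expiry time, in seconds. Not the actual DDB data structure. *)
type pending = (string, float) Hashtbl.t

(* One clean-up pass. Every request expired at [now] is removed, and the
   earliest expiry among the remaining requests is computed during the same
   traversal, so no second iteration is needed to find the next timeout.
   Returns the removed keys (sorted) and the next timeout, if any. *)
let cleanup_pass (table : pending) ~(now : float) : string list * float option =
  let expired, next =
    Hashtbl.fold
      (fun key expiry (expired, next) ->
        if expiry <= now then (key :: expired, next)
        else
          let next =
            match next with
            | Some t when t <= expiry -> Some t
            | _ -> Some expiry
          in
          (expired, next))
      table
      ([], None)
  in
  (* Removal happens after the fold: the table is not mutated while it is
     being traversed. *)
  List.iter (Hashtbl.remove table) expired;
  (List.sort compare expired, next)

(* Throttling, modelled as a minimum delay before the next pass (an
   assumption of this sketch): never wake up earlier than [now +. throttle],
   so that requests expiring shortly after [now] end up in the same batch. *)
let next_wakeup ~(now : float) ~(throttle : float) (next : float option) :
    float option =
  match next with
  | None -> None
  | Some t -> Some (Float.max t (now +. throttle))
```

A caller would run `cleanup_pass`, sleep until the time returned by `next_wakeup` (or until a new request is registered), and repeat.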

Supposedly fixes #5430 (closed)

Manually testing the MR

  • Run a node on a machine (the less performant the machine, the more apparent the bug);
  • Suspend the node while it is bootstrapped (ctrl-z, which sends SIGTSTP; sending SIGSTOP has the same effect);
  • Wait a few minutes;
  • Send a SIGCONT ($ fg);
  • Notice that, without the patch, the node may have trouble catching up or may freeze a lot (e.g. very few block validations, RAM usage growing, ...), and check that the fix in this MR prevents this situation from happening.

Checklist

  • Document the interface of any function added or modified (see the coding guidelines)
  • Document any change to the user interface, including configuration parameters (see node configuration)
  • For new features and bug fixes, add an item in the appropriate changelog (docs/protocols/alpha.rst for the protocol and the environment, CHANGES.rst at the root of the repository for everything else).
  • Select suitable reviewers using the Reviewers field below.
  • Select as Assignee the next person who should take action on that MR
Edited by vbot
