shell: mitigate context errors
What
This MR introduces a new mitigation for context errors.
Depends on !14228 (merged)
Why
The node currently shuts down on a context error at block application. This can be costly for infrastructures where restarting the node takes minutes. By restarting only the external validator, the time to re-apply the block should decrease. This also affects the baker, which shuts down when the node crashes. With this fix, the baker will no longer shut down on a context error.
How
Previously, on a context error at block application, the node exited gracefully. With this MR, the node instead restarts the external validator and then tries to re-apply the block. Restarting the external validator forces a full reload of the context, which should avoid the application error.
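The control flow described above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the node's code: `ExternalValidator`, `ContextError`, `apply_with_mitigation` and the `max_retries` parameter are hypothetical names introduced for illustration. The description does not say what happens if the retry also fails, so the sketch simply re-raises the error in that case.

```python
class ContextError(Exception):
    """Stands in for a context error such as 'unknown inode key (find_value)'."""


class ExternalValidator:
    """Hypothetical stand-in for the external validator process."""

    def __init__(self):
        self.restarts = 0
        # Simulate a context that fails until it is fully reloaded.
        self.context_is_stale = True

    def apply_block(self, block_hash):
        if self.context_is_stale:
            raise ContextError("unknown inode key (find_value)")
        return f"head is now {block_hash}"

    def restart(self):
        # Restarting the validator forces a full reload of the context.
        self.restarts += 1
        self.context_is_stale = False


def apply_with_mitigation(validator, block_hash, max_retries=1):
    """Apply a block; on a context error, restart the validator and retry.

    max_retries is an assumption of this sketch, not a documented setting.
    """
    for attempt in range(max_retries + 1):
        try:
            return validator.apply_block(block_hash)
        except ContextError:
            if attempt == max_retries:
                raise  # assumed behaviour once retries are exhausted
            validator.restart()


validator = ExternalValidator()
result = apply_with_mitigation(validator, "BLsy...")
print(result, "after", validator.restarts, "validator restart(s)")
```

The point of the sketch is that only the validator is restarted; the caller (the node in this analogy) keeps running across the failure.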
Manually testing the MR
This MR has been tested for both bootstrapping and bootstrapped cases.
bootstrapping case
bootstrapped case
On master we have:
Jul 17 11:47:11.286: head is now BLkGQufj81BdEdcnCNCiRB4yAAnDRKT1dKhYw1evEgAjDG6yCnv (7128721)
Jul 17 11:47:12.079: operation opEitzrHdkAWMgFAUgvnc4f4mZJ76QusXxb5qo8ndXbeFxwinRA injected
Jul 17 11:47:16.321: peer idtCrcg3eXxBcJd61rXqpSg5vTEytk disconnected
Jul 17 11:47:16.382: peer idsaSYY9JaspsaP5kiYKUMRuAKMfq4 disconnected
Jul 17 11:47:16.475: operation op1QpCxD1t1MqnXsBcH6PhkuwTFqbrjGLTqHgJhNgYBoqrL3q7R injected
Error:
{"Direct":["CoVYEnSsB7B6x8SGsaaqPaVfwNN9yNDMNPiwfZYSLGcBAsjvUXgk",706898519600,360]}: unknown inode key (find_value)
Jul 17 11:47:16.541: application of block BLk2YfA5oDVeq6RfPyeKiwy7oaM5KKmMhLE5iETd19v7SsPG9cT
Jul 17 11:47:16.541: failed but validation succeeded,
Jul 17 11:47:16.541: Request pushed on 2024-07-17T09:47:16.419-00:00, treated in 1.540ms, completed in 120ms:
Jul 17 11:47:16.541: {"Direct":["CoVYEnSsB7B6x8SGsaaqPaVfwNN9yNDMNPiwfZYSLGcBAsjvUXgk",706898519600,360]}: unknown inode key (find_value)
Jul 17 11:47:16.541: retrying application of block
Jul 17 11:47:16.541: BLk2YfA5oDVeq6RfPyeKiwy7oaM5KKmMhLE5iETd19v7SsPG9cT from
Jul 17 11:47:16.541: idr7JtwhF7Fn2KbCLju3bcLeSgUHJa after context error
Jul 17 11:47:16.542: critical context error: stopping the node gracefully.
Jul 17 11:47:16.543: shutting down the Tezos node
...
Jul 17 11:47:23.811: the Tezos node is now running
...
Jul 17 11:47:24.504: synchronisation status: synced
Jul 17 11:47:24.505: chain is bootstrapped
Jul 17 11:47:26.450: switch branch to BM8dUPsywkxzcaAwQoAP9dRMpEkuXG7qWrU43GbD5k3Lo51hq7Y
Jul 17 11:47:26.450: (7128723)
Jul 17 11:47:33.165: head is now BLvY1KwFjnTEZh4Hr6dwwqUFzKE18mo5Kfy5e1TFy42SNr6AUrj (7128724)
The node crashed and was restarted automatically (it is set up as a service).
With this MR we have:
Jul 22 10:01:35.376: head is now BLs6vnTdiB1A1TMcokw4EE6b8CaYYLiL4V4vNckDvMWw5tkNMn3 (7202674)
Jul 22 10:01:36.000: operation op4wqVwzjYpkiEZ4aWxXKEWVtN5YP8225Ha828b6thCkUyKzUi3 injected
Jul 22 10:01:45.430: peer idrVoifTbjquatCpvGRvnuoM87RMcU disconnected
Jul 22 10:01:45.470: peer idtWtfZGVdEU4Tj4fnUXHxZadnJsKR disconnected
Error:
{"Direct":["CoVnFLCekzXQLWczwqL2QHQuqsw7yEo7JvhKrDUypiu4iiuFa98f",726135632765,360]}: unknown inode key (find_value)
Jul 22 10:01:45.566: Application of block BLsyZzx83erbvDo7h7Q2jdWA54geghMQifskSBW799vEBuLEmAH
Jul 22 10:01:45.566: failed on context error:
Jul 22 10:01:45.566: Error:
Jul 22 10:01:45.566: {"Direct":["CoVnFLCekzXQLWczwqL2QHQuqsw7yEo7JvhKrDUypiu4iiuFa98f",726135632765,360]}: unknown inode key (find_value)
Jul 22 10:01:45.566:
Jul 22 10:01:45.571: shutting down external validator
Jul 22 10:01:45.571: operation oomj4b1qyyQSFuxQkD3s2xoptQVv6pY45BADznb3wP24hgfQH7M injected
Jul 22 10:01:45.609: process terminated abnormally with exit code 1
Jul 22 10:01:45.609: retry block BLsyZzx83erbvDo7h7Q2jdWA54geghMQifskSBW799vEBuLEmAH application
Jul 22 10:01:46.785: validator process started with pid 1760530
Jul 22 10:01:54.582: head is now BLsyZzx83erbvDo7h7Q2jdWA54geghMQifskSBW799vEBuLEmAH (7202675)
Jul 22 10:02:12.602: operation ooWCC6EsHgxF6RqLqir34cYyyLfeRMNYGxdMiiHXQa5Mtv4PpZj injected
Jul 22 10:02:12.664: switch branch to BM9wYxssKMvT8PxCebzZfSvkmZ2mcm2B6c1pDzKEMDKmLrbSPx6
Jul 22 10:02:12.664: (7202675)
The node did not crash and was able to handle the error more smoothly.
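As a rough comparison, the delay between the failed application and the next "head is now" line can be computed from the timestamps in the two excerpts above. The helper names below are hypothetical, and the figures come from a single run on each side, so they are indicative only. Note also that on master a "switch branch" line appears at 11:47:26.450, before the next "head is now" line.

```python
from datetime import datetime


def log_time(line):
    """Parse the time of day from a log line such as 'Jul 17 11:47:16.541: ...'."""
    stamp = line.split()[2].rstrip(":")
    return datetime.strptime(stamp, "%H:%M:%S.%f")


def seconds_between(first_line, second_line):
    """Elapsed seconds between two log lines taken from the same day."""
    return (log_time(second_line) - log_time(first_line)).total_seconds()


# Lines copied from the excerpts above.
master = seconds_between(
    "Jul 17 11:47:16.541: application of block BLk2YfA5oDVeq6RfPyeKiwy7oaM5KKmMhLE5iETd19v7SsPG9cT",
    "Jul 17 11:47:33.165: head is now BLvY1KwFjnTEZh4Hr6dwwqUFzKE18mo5Kfy5e1TFy42SNr6AUrj (7128724)",
)
with_mr = seconds_between(
    "Jul 22 10:01:45.566: Application of block BLsyZzx83erbvDo7h7Q2jdWA54geghMQifskSBW799vEBuLEmAH",
    "Jul 22 10:01:54.582: head is now BLsyZzx83erbvDo7h7Q2jdWA54geghMQifskSBW799vEBuLEmAH (7202675)",
)
print(f"master: {master:.3f} s, with this MR: {with_mr:.3f} s")
# prints "master: 16.624 s, with this MR: 9.016 s"
```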
Checklist
- Document the interface of any function added or modified (see the coding guidelines)
- Document any change to the user interface, including configuration parameters (see node configuration)
- Provide automatic testing (see the testing guide).
- For new features and bug fixes, add an item in the appropriate changelog (docs/protocols/alpha.rst for the protocol and the environment, CHANGES.rst at the root of the repository for everything else).
- Select suitable reviewers using the Reviewers field below.
- Select as Assignee the next person who should take action on that MR

