Agnostic/Baker/Accuser: tzboth instead of pick

What & why

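During the activation of Seoul, we noticed that many accusers got stuck at the migration. Restarting the services mitigated the problem, but that does not explain why the agnostic accuser failed to start the accuser for the new protocol. In the journal excerpt below (newest entries first), the old process logs "stopping 022-PsRiotum daemon" on Sept. 19 at 16:59:18 and then nothing further until the service is restarted on Sept. 20 at 08:48:06.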

sept. 20 08:48:24 vicall-HeroBox octez-baker[263572]: Sep 20 08:48:24.516 NOTICE │   period (remaining period duration 144276)
sept. 20 08:48:24 vicall-HeroBox octez-baker[263572]: Sep 20 08:48:24.516 NOTICE │ new block (BLxoQYx7gnZjv85WKXpX1cpTcjceiSXjUgVic3SbjeuGZRUd41N) on proposal
sept. 20 08:48:16 vicall-HeroBox octez-baker[263572]: Sep 20 08:48:16.466 NOTICE │ block BMJtzwuzwY3xkgNwXWZyViQfAd2c3qbDRJnthKkzuoKUZbw7ujC registered
sept. 20 08:48:16 vicall-HeroBox octez-baker[263572]: Sep 20 08:48:16.437 NOTICE │   period (remaining period duration 144277)
sept. 20 08:48:16 vicall-HeroBox octez-baker[263572]: Sep 20 08:48:16.437 NOTICE │ new block (BMJtzwuzwY3xkgNwXWZyViQfAd2c3qbDRJnthKkzuoKUZbw7ujC) on proposal
sept. 20 08:48:08 vicall-HeroBox octez-baker[263572]: Sep 20 08:48:08.823 NOTICE │   period (remaining period duration 144278)
sept. 20 08:48:08 vicall-HeroBox octez-baker[263572]: Sep 20 08:48:08.823 NOTICE │ new block (BLsD2dqZsbRvo2MyFbStyv9Wzs2JFbHYKtMKPUUPpiPcxFngj1g) on proposal
sept. 20 08:48:08 vicall-HeroBox octez-baker[263572]: Accuser 23.2 (13afca5d) for PtSeouLouXkx started.
sept. 20 08:48:08 vicall-HeroBox octez-baker[263572]: Waiting for protocol 023-PtSeouLo to start...
sept. 20 08:48:08 vicall-HeroBox octez-baker[263572]: Node is bootstrapped.
sept. 20 08:48:08 vicall-HeroBox octez-baker[263572]: Sep 20 08:48:08.809 NOTICE │ baker for protocol PtSeouLouXkx is now running
sept. 20 08:48:08 vicall-HeroBox octez-baker[263572]: Sep 20 08:48:08.216 NOTICE │ starting baker for protocol PtSeouLouXkx
sept. 20 08:48:08 vicall-HeroBox octez-baker[263572]: Sep 20 08:48:08.213 NOTICE │ starting baker daemon
sept. 20 08:48:07 vicall-HeroBox systemd[1]: Started octez-mainnet-accuser.service - Tezos mainnet accuser Service.
sept. 20 08:48:07 vicall-HeroBox systemd[1]: octez-mainnet-accuser.service: Consumed 25min 41.004s CPU time, 722.2M memory peak, 248.6M memory swap peak.
sept. 20 08:48:07 vicall-HeroBox systemd[1]: Stopped octez-mainnet-accuser.service - Tezos mainnet accuser Service.
sept. 20 08:48:07 vicall-HeroBox systemd[1]: octez-mainnet-accuser.service: Failed with result 'exit-code'.
sept. 20 08:48:07 vicall-HeroBox systemd[1]: octez-mainnet-accuser.service: Main process exited, code=exited, status=127/n/a
sept. 20 08:48:06 vicall-HeroBox octez-baker[24144]: Sep 20 08:48:06.858 NOTICE │ stopping baker daemon
sept. 20 08:48:06 vicall-HeroBox octez-baker[24144]: (/home/vicall/Tezos/v23-release/octez-baker) TERM: already in shutdown.
sept. 20 08:48:06 vicall-HeroBox octez-baker[24144]: Shutting down the accuser...
sept. 20 08:48:06 vicall-HeroBox octez-baker[24144]: (/home/vicall/Tezos/v23-release/octez-baker) TERM: triggering shutdown.
sept. 20 08:48:06 vicall-HeroBox systemd[1]: Stopping octez-mainnet-accuser.service - Tezos mainnet accuser Service...
sept. 19 16:59:18 vicall-HeroBox octez-baker[24144]: Sep 19 16:59:18.128 NOTICE │ stopping 022-PsRiotum daemon
sept. 19 16:59:00 vicall-HeroBox octez-baker[24144]: Sep 19 16:59:00.687 NOTICE │ block BLhSiuMazyhqbVtwpwhTf6f23cNKFEfLLbVmejtGyNjRjCzA8rp registered
sept. 19 16:59:00 vicall-HeroBox octez-baker[24144]: Sep 19 16:59:00.656 NOTICE │   period (remaining period duration 1)
sept. 19 16:59:00 vicall-HeroBox octez-baker[24144]: Sep 19 16:59:00.656 NOTICE │ new block (BLhSiuMazyhqbVtwpwhTf6f23cNKFEfLLbVmejtGyNjRjCzA8rp) on adoption
sept. 19 16:58:52 vicall-HeroBox octez-baker[24144]: Sep 19 16:58:52.434 NOTICE │ block BLMutp1Z7fkdaNejfHn38gjv3M5U4Pk7QhHCUqnVZG9bsZFq97k registered
sept. 19 16:58:52 vicall-HeroBox octez-baker[24144]: Sep 19 16:58:52.428 NOTICE │   period (remaining period duration 2)
sept. 19 16:58:52 vicall-HeroBox octez-baker[24144]: Sep 19 16:58:52.428 NOTICE │ new block (BLMutp1Z7fkdaNejfHn38gjv3M5U4Pk7QhHCUqnVZG9bsZFq97k) on adoption
sept. 19 16:58:44 vicall-HeroBox octez-baker[24144]: Sep 19 16:58:44.497 NOTICE │ block BLTLbKCsGs1yad9CkSevXDFosQPxpmLQyJbPp6kLe9V1reTor6R registered
sept. 19 16:58:44 vicall-HeroBox octez-baker[24144]: Sep 19 16:58:44.489 NOTICE │   period (remaining period duration 3)

After investigation, I identified !17905 (merged) as the root cause. That change resolves the current accuser thread as soon as it detects that we are switching protocols. The main loop of the agnostic daemon is not resilient to a thread simply stopping: it expects the thread either to crash or to keep running. As a result, we exited the main loop of the agnostic daemon before the protocol monitoring had a chance to start the accuser for the new protocol. Whether the bug manifests depends on the Lwt scheduler, which explains why not all accusers were affected.

How

The main loop of the agnostic daemon consists of two promises wrapped in a Lwt.pick:

  1. The protocol monitoring, responsible for starting and stopping threads based on the current protocol
  2. The protocol-specific thread

Instead of wrapping them in a Lwt.pick, I now use Lwt_result_syntax.tzboth, so the main loop exits only when both promises resolve or at least one fails. This works because, if the accuser thread resolves as it did during the migration, the protocol monitoring (1.) remains active and eventually restarts the accuser thread.
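
To illustrate the difference, here is a minimal sketch using plain Lwt only. It is not the daemon's code: the names `monitoring`, `accuser`, `main_loop`, `pick_exits_early` and `both_keeps_running` are hypothetical, and `Lwt.both` stands in for `tzboth` on the success path only (error handling in the result monad is not modelled). Promises are resolved by hand with `Lwt.wait`/`Lwt.wakeup`, so no scheduler is involved.

```ocaml
(* Requires the [lwt] library. All names below are illustrative. *)

let is_pending p =
  match Lwt.state p with
  | Lwt.Sleep -> true
  | Lwt.Return _ | Lwt.Fail _ -> false

(* With [Lwt.pick], the combined promise resolves as soon as either
   promise resolves, even though the monitoring is still pending. *)
let pick_exits_early () =
  let monitoring, _monitoring_resolver = Lwt.wait () in
  let accuser, accuser_resolver = Lwt.wait () in
  let main_loop = Lwt.pick [ monitoring; accuser ] in
  (* The accuser thread resolves because of the protocol switch. *)
  Lwt.wakeup accuser_resolver ();
  not (is_pending main_loop)

(* With [Lwt.both], the combined promise stays pending until the
   monitoring resolves too, so the monitoring keeps running. *)
let both_keeps_running () =
  let monitoring, _monitoring_resolver = Lwt.wait () in
  let accuser, accuser_resolver = Lwt.wait () in
  let main_loop = Lwt.both monitoring accuser in
  Lwt.wakeup accuser_resolver ();
  is_pending main_loop
```

In this sketch both functions return `true`: the `Lwt.pick` version has already left its "main loop" once the accuser promise resolves, while the `Lwt.both` version is still waiting on the monitoring promise.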

Manually testing the MR

I recommend running a manual migration test on mainnet with and without this merge request.

Checklist

  • Document the interface of any function added or modified (see the coding guidelines)
  • Document any change to the user interface, including configuration parameters (see node configuration)
  • Provide automatic testing (see the testing guide).
  • For new features and bug fixes, add an item in the appropriate changelog (docs/protocols/alpha.rst for the protocol and the environment, CHANGES.rst at the root of the repository for everything else).
  • Select suitable reviewers using the Reviewers field below.
  • Select as Assignee the next person who should take action on that MR.
