[go: up one dir, main page]

Baker/Accuser: properly monitor processes

What & why

New attempt to fix the bug we had on the accuser. !19364 (merged) is incorrect, I misunderstood how tzboth behaved but surprisingly it fixed the bug I was chasing but introduced new ones :-). It's obvious but tzboth completes only if both promises complete. I need this piece of code to terminate as soon as one of them raise an error. So what happens is that if for instance you forgot to put the argument about DAL --without-dal, the first promise completes with an error, but the second promise remains alive and the baker becomes a zombie process.

But this is the context of why my previous merge request was incorrect. Now let's recap what was initially the issue (because now I have much better understanding).

The agnostic baker is composed of 2 main threads:

  1. baker_thread that monitors the current thread
  2. monitor_voting_periods that monitors the L1 and spawn bakers when needed. Those 2 threads are followed in a Lwt_utils.never_ending in case one of them terminate (this is absolutely wrong).

At protocol migration, the accuser stopped itself, the thread terminated with Ok (). It stopped for "good" reason, but it's not really compatible with our thread monitoring. What happened then, is the accuser thread terminated, the Lwt.pick then stopped the monitor_voting_periods and the daemon was stuck in the Lwt_utils.never_ending

I learned multiple things from this:

  1. Lwt_utils.never_ending should really not exist, if one of the thread must exit, the deamon must exit
  2. baker_thread monitors only the current_thread. However, this thread might change during protocol migrations, but we actually monitor only the very first current_thread, that is the one spawned when the baker is first started.
  3. old_baker is not monitored. Once you have a protocol migration, whatever happens with old_baker is ignored. Let's say the chain is stuck on the last block of the protocol, you need the old_baker to play the consensus, and you have no idea whether it's running.
  4. The code is too much spaghetting, splitting monitor voting periods and baker threads is misleading. You have to understand when and why and how the state is modified, and how they play together.

How

I followed a suggestion of @4dam to have a main scheduler, similar to what has been done in the baker. The main (recursive) loop is now composed of 4 events:

type event =
  | New_head
  | Head_stream_ended
  | Old_baker_stopped
  | Current_baker_stopped

let rec main_loop state =
  let* event = Lwt.choose [ ... ] in
  match event with
  | New_head -> (* monitor voting periods *)
  | Head_stream_ended (* connection is lost *)
  | Old_baker_stopped (* should the old baker be stopped? is it expired? *)
  | Current_baker_stopped (* the baker cannot stop. *)

Therefore:

  1. The main_loop may never stop, unless we have some unexpected completions or errors.
  2. Old baker and current baker are both monitored

Manually testing the MR

I'm doing a lot of manual migration tests. I think there are several scenarios to test, I'm going to list some of them, please request more tests if you have ideas.

Kill the old baker before it is expired

Using the follow patch: you need to activate the new protocol, stop baking blocks (you don't want the old baker to be expired), wait to see if the baker exits.

diff --git a/src/lib_agnostic_baker/daemon.ml b/src/lib_agnostic_baker/daemon.ml
index eaedf0a8c3..501e612622 100644
--- a/src/lib_agnostic_baker/daemon.ml
+++ b/src/lib_agnostic_baker/daemon.ml
@@ -404,7 +404,16 @@ module Make_daemon (Agent : AGENT) :
       | None ->
           let*! () = Lwt_utils.never_ending () in
           assert false
-      | Some old_baker -> old_baker.baker.process.thread
+      | Some old_baker ->
+          let* () =
+            Lwt.pick
+              [
+                old_baker.baker.process.thread;
+                (let*! () = Lwt_unix.sleep 5. in
+                 failwith "sleep");
+              ]
+          in
+          return_unit
     in
     let head_stream : event tzresult Lwt.t =
       Lwt.map

Kill the new baker soon after spawn

Same logic as the previous patch:

diff --git a/src/lib_agnostic_baker/daemon.ml b/src/lib_agnostic_baker/daemon.ml
index eaedf0a8c3..98845a8776 100644
--- a/src/lib_agnostic_baker/daemon.ml
+++ b/src/lib_agnostic_baker/daemon.ml
@@ -182,6 +182,8 @@ module Make_daemon (Agent : AGENT) :
     let agent_thread = Agent.run_command plugin cctxt command in
     Lwt.pick [agent_thread; cancel_promise]
 
+  let counter = ref 0
+
   (** [spawn_baker cctxt command protocol_hash] spawns a new baker process for
       the given [protocol_hash]. *)
   let spawn_baker cctxt command protocol_hash =
@@ -189,13 +191,27 @@ module Make_daemon (Agent : AGENT) :
     let*! () = Events.(emit starting_agent) (Agent.name, protocol_hash) in
     let cancel_promise, canceller = Lwt.wait () in
     let thread =
-      run_thread
-        ~protocol_hash
-        ~cancel_promise
-        ~init_sapling_params:Agent.init_sapling_params
-        cctxt
-        command
+      if !counter = 1 then
+        Lwt.pick
+          [
+            run_thread
+              ~protocol_hash
+              ~cancel_promise
+              ~init_sapling_params:Agent.init_sapling_params
+              cctxt
+              command;
+            (let*! () = Lwt_unix.sleep 5. in
+             failwith "sleep");
+          ]
+      else
+        run_thread
+          ~protocol_hash
+          ~cancel_promise
+          ~init_sapling_params:Agent.init_sapling_params
+          cctxt
+          command
     in
+    incr counter ;
     let*! () = Events.(emit agent_running) (Agent.name, protocol_hash) in
     return {protocol_hash; process = {thread; canceller}}

Have the baker terminates gracefully (old - or previous baker).

Apply the previous patches, but switch the failwith with return_unit.

Checklist

  • Document the interface of any function added or modified (see the coding guidelines)
  • Document any change to the user interface, including configuration parameters (see node configuration)
  • Provide automatic testing (see the testing guide).
  • For new features and bug fixes, add an item in the appropriate changelog (docs/protocols/alpha.rst for the protocol and the environment, CHANGES.rst at the root of the repository for everything else).
  • Select suitable reviewers using the Reviewers field below.
  • Select as Assignee the next person who should take action on that MR

Summary by CodeRabbit

  • New Features

    • Enhanced block monitoring with protocol migration awareness.
    • Added CLI testing for agnostic baker.
  • Bug Fixes

    • Improved baker daemon state management during protocol transitions.
    • Better handling of protocol changes in accuser operations.
  • Chores

    • Removed deprecated daemon stop event.
    • Consolidated baker test suite structure.
Edited by CodeRabbit

Merge request reports

Loading