Baker/Accuser: properly monitor processes
What & why
New attempt to fix the bug we had on the accuser. !19364 (merged) is incorrect, I misunderstood how tzboth behaved but surprisingly it fixed the bug I was chasing but introduced new ones :-). It's obvious but tzboth completes only if both promises complete. I need this piece of code to terminate as soon as one of them raise an error. So what happens is that if for instance you forgot to put the argument about DAL --without-dal, the first promise completes with an error, but the second promise remains alive and the baker becomes a zombie process.
But this is the context of why my previous merge request was incorrect. Now let's recap what was initially the issue (because now I have much better understanding).
The agnostic baker is composed of 2 main threads:
-
baker_threadthat monitors the current thread -
monitor_voting_periodsthat monitors the L1 and spawn bakers when needed. Those 2 threads are followed in aLwt_utils.never_endingin case one of them terminate (this is absolutely wrong).
At protocol migration, the accuser stopped itself, the thread terminated with Ok (). It stopped for "good" reason, but it's not really compatible with our thread monitoring. What happened then, is the accuser thread terminated, the Lwt.pick then stopped the monitor_voting_periods and the daemon was stuck in the Lwt_utils.never_ending
I learned multiple things from this:
- Lwt_utils.never_ending should really not exist, if one of the thread must exit, the deamon must exit
-
baker_threadmonitors only thecurrent_thread. However, this thread might change during protocol migrations, but we actually monitor only the very firstcurrent_thread, that is the one spawned when the baker is first started. -
old_bakeris not monitored. Once you have a protocol migration, whatever happens withold_bakeris ignored. Let's say the chain is stuck on the last block of the protocol, you need theold_bakerto play the consensus, and you have no idea whether it's running. - The code is too much spaghetting, splitting monitor voting periods and baker threads is misleading. You have to understand when and why and how the
stateis modified, and how they play together.
How
I followed a suggestion of @4dam to have a main scheduler, similar to what has been done in the baker. The main (recursive) loop is now composed of 4 events:
type event =
| New_head
| Head_stream_ended
| Old_baker_stopped
| Current_baker_stopped
let rec main_loop state =
let* event = Lwt.choose [ ... ] in
match event with
| New_head -> (* monitor voting periods *)
| Head_stream_ended (* connection is lost *)
| Old_baker_stopped (* should the old baker be stopped? is it expired? *)
| Current_baker_stopped (* the baker cannot stop. *)
Therefore:
- The main_loop may never stop, unless we have some unexpected completions or errors.
- Old baker and current baker are both monitored
Manually testing the MR
I'm doing a lot of manual migration tests. I think there are several scenarios to test, I'm going to list some of them, please request more tests if you have ideas.
Kill the old baker before it is expired
Using the follow patch: you need to activate the new protocol, stop baking blocks (you don't want the old baker to be expired), wait to see if the baker exits.
diff --git a/src/lib_agnostic_baker/daemon.ml b/src/lib_agnostic_baker/daemon.ml
index eaedf0a8c3..501e612622 100644
--- a/src/lib_agnostic_baker/daemon.ml
+++ b/src/lib_agnostic_baker/daemon.ml
@@ -404,7 +404,16 @@ module Make_daemon (Agent : AGENT) :
| None ->
let*! () = Lwt_utils.never_ending () in
assert false
- | Some old_baker -> old_baker.baker.process.thread
+ | Some old_baker ->
+ let* () =
+ Lwt.pick
+ [
+ old_baker.baker.process.thread;
+ (let*! () = Lwt_unix.sleep 5. in
+ failwith "sleep");
+ ]
+ in
+ return_unit
in
let head_stream : event tzresult Lwt.t =
Lwt.map
Kill the new baker soon after spawn
Same logic as the previous patch:
diff --git a/src/lib_agnostic_baker/daemon.ml b/src/lib_agnostic_baker/daemon.ml
index eaedf0a8c3..98845a8776 100644
--- a/src/lib_agnostic_baker/daemon.ml
+++ b/src/lib_agnostic_baker/daemon.ml
@@ -182,6 +182,8 @@ module Make_daemon (Agent : AGENT) :
let agent_thread = Agent.run_command plugin cctxt command in
Lwt.pick [agent_thread; cancel_promise]
+ let counter = ref 0
+
(** [spawn_baker cctxt command protocol_hash] spawns a new baker process for
the given [protocol_hash]. *)
let spawn_baker cctxt command protocol_hash =
@@ -189,13 +191,27 @@ module Make_daemon (Agent : AGENT) :
let*! () = Events.(emit starting_agent) (Agent.name, protocol_hash) in
let cancel_promise, canceller = Lwt.wait () in
let thread =
- run_thread
- ~protocol_hash
- ~cancel_promise
- ~init_sapling_params:Agent.init_sapling_params
- cctxt
- command
+ if !counter = 1 then
+ Lwt.pick
+ [
+ run_thread
+ ~protocol_hash
+ ~cancel_promise
+ ~init_sapling_params:Agent.init_sapling_params
+ cctxt
+ command;
+ (let*! () = Lwt_unix.sleep 5. in
+ failwith "sleep");
+ ]
+ else
+ run_thread
+ ~protocol_hash
+ ~cancel_promise
+ ~init_sapling_params:Agent.init_sapling_params
+ cctxt
+ command
in
+ incr counter ;
let*! () = Events.(emit agent_running) (Agent.name, protocol_hash) in
return {protocol_hash; process = {thread; canceller}}
Have the baker terminates gracefully (old - or previous baker).
Apply the previous patches, but switch the failwith with return_unit.
Checklist
-
Document the interface of any function added or modified (see the coding guidelines) -
Document any change to the user interface, including configuration parameters (see node configuration) -
Provide automatic testing (see the testing guide). -
For new features and bug fixes, add an item in the appropriate changelog ( docs/protocols/alpha.rstfor the protocol and the environment,CHANGES.rstat the root of the repository for everything else). -
Select suitable reviewers using the Reviewersfield below. -
Select as Assigneethe next person who should take action on that MR
Summary by CodeRabbit
-
New Features
- Enhanced block monitoring with protocol migration awareness.
- Added CLI testing for agnostic baker.
-
Bug Fixes
- Improved baker daemon state management during protocol transitions.
- Better handling of protocol changes in accuser operations.
-
Chores
- Removed deprecated daemon stop event.
- Consolidated baker test suite structure.