Non-Blocking RPCs
Objective: Great UX for node users
Key Result: RPCs do not block the node
Update: 2024/04/01
⚠️ This project was revived in the context of %(2024Q2) - Layer 1 - Public RPC endpoint supporting an average of 1k RPS
Update: 2023/12/15
⚠️ STALLED
The project is stalled for now, as a blocking file descriptor leak was discovered in the Cohttp library.
Leak analysis
When a streamed RPC is called on the external RPC process, 3 FDs are allocated:
- a TCP FD, for the connection from the client to the external RPC process,
- 2 UNIX FDs, one for each side of the connection between the external RPC process and the node.
When the client closes the connection, the TCP FD remains until the external RPC process tries to send some data on it. When the external RPC process tries to send data on the closed connection, the associated FD is released.
However, releasing the TCP FD does not release the two UNIX FDs: hence the leak.
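To make this lifecycle concrete, here is a minimal OCaml sketch, assuming `Lwt_unix`, of how the three descriptors come into existence for one streamed call; the function name and the socketpair transport are assumptions of this example, not the actual Octez wiring.

```ocaml
(* Illustrative sketch only: mirrors the three FDs described above,
   assuming Lwt_unix; not the actual Octez code. *)
let allocate_streamed_rpc_fds listening_socket =
  let open Lwt.Syntax in
  (* FD 1: the TCP connection accepted from the client. *)
  let* tcp_fd, _client_addr = Lwt_unix.accept listening_socket in
  (* FDs 2 and 3: one per side of the RPC-process <-> node channel. *)
  let to_node, from_node =
    Lwt_unix.socketpair Unix.PF_UNIX Unix.SOCK_STREAM 0
  in
  (* Closing [tcp_fd] alone leaks [to_node] and [from_node]: all three
     must be released together when the client goes away. *)
  Lwt.return (tcp_fd, to_node, from_node)
```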
Solutions
Improve Cohttp
Record resource creations/deletions and explicitly ask Cohttp to clean the resources up. This requires updating or forking Cohttp.
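A minimal sketch of this record-and-clean idea, assuming `Lwt_unix`; the registry below is hypothetical glue around the HTTP layer, not an existing Cohttp API.

```ocaml
(* Hypothetical per-connection registry: every FD opened for a connection
   is recorded, then torn down explicitly when the connection dies. *)
let registry : Lwt_unix.file_descr list ref = ref []

let track fd =
  registry := fd :: !registry ;
  fd

let release fd =
  (* Physical equality: Lwt_unix.file_descr is an abstract record. *)
  registry := List.filter (fun fd' -> fd' != fd) !registry ;
  Lwt_unix.close fd

(* Called on connection teardown: close every descriptor still recorded,
   so the UNIX pair does not outlive the TCP side. *)
let release_all () = Lwt.join (List.map release !registry)
```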
Do not transfer streamed RPCs
Wrap the streamed RPCs in the RPC process so that a single streamed RPC is opened toward the node and shared among the client requests (see the sketch below). This makes adding streamed RPCs to the node more complex.
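A minimal sketch of the sharing idea, assuming `Lwt_stream`; `open_upstream` and the `string` payload are illustrative assumptions.

```ocaml
(* At most one node-side stream; [upstream] caches it once opened. *)
let upstream : string Lwt_stream.t option ref = ref None

(* Each client gets an independent clone of the single upstream stream,
   so only one streamed RPC is ever opened toward the node. *)
let subscribe ~(open_upstream : unit -> string Lwt_stream.t) () =
  let base =
    match !upstream with
    | Some s -> s
    | None ->
        let s = open_upstream () in
        upstream := Some s ;
        s
  in
  Lwt_stream.clone base
```

Note that an `Lwt_stream` clone only sees elements not yet consumed at the time it is created, so a real implementation would also have to decide how to handle late subscribers.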
Disable streamed RPCs on the RPC process
Allow streamed RPCs on the node's local RPC server only. The user thus needs to run the RPC server for the (potentially blocking) RPCs and the node's local one for the streamed RPCs.
Conclusion
This issue is blocking for the Non-Blocking RPC project. Without resolving it, it won't be possible to bring back the external RPC server by default, which may prevent this feature from being used.
The best solution to consider is to migrate from Cohttp to httpaf -- we have been complaining about Cohttp for a while -- before enabling the external RPC server.
As there are not enough resources for now, we pause the project.
Motivation
Calling RPCs requires the node to do some work, in addition to its essential tasks such as networking, storing data and validating blocks. Most of these RPCs are lightweight and the node can easily handle them among its other tasks. However, when it comes to heavy RPCs, such as requesting endorsing or baking rights, the node may struggle for a while and may delay other essential tasks. At worst, the node may be frozen during the evaluation of a heavy RPC. Avoiding such slowdowns would give node operators responsive and more predictable performance when calling RPCs on their nodes.
Scope
The goal of this milestone is to avoid blocking the node when heavy RPCs are requested. By heavy RPCs, we mean all the computation- or I/O-intensive RPCs, such as:
- baking rights
- endorsing rights
- accounts list
- contract list
- …
To tackle that, we propose to:
- spawn an `octez-rpc-server`, alongside the node, that will be in charge of handling RPCs,
- optimize the `octez-rpc-server` to reuse data and reduce its workload when handling heavy RPCs,
- optimize all the remaining RPCs with a best-effort strategy.
Design
The `octez-rpc-server` will be developed incrementally, by adding features one by one.
First, we will start with a minimal `octez-rpc-server` that only redirects RPCs to the node it is associated with. To do so, we will use the built-in redirection capabilities of the RPC library middleware, as sketched below. The way the `octez-rpc-server` communicates with the node will be improved in a second step, and we will ensure the reliability of the `octez-rpc-server` through tests.
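As an illustration of this redirection step, here is a self-contained sketch of a process forwarding every RPC to its node, assuming `Cohttp_lwt_unix`; the addresses and port are assumptions of this example, and the actual Octez middleware plumbing is omitted.

```ocaml
(* Hedged sketch: forward every incoming RPC to the node and relay the
   answer, assuming Cohttp_lwt_unix; not the actual Octez middleware. *)
open Lwt.Syntax

let node_uri = Uri.of_string "http://127.0.0.1:8732" (* assumed node addr *)

let callback _conn req body =
  (* Rebuild the target URI against the node, keeping path and query. *)
  let uri = Cohttp.Request.uri req in
  let target =
    Uri.with_query (Uri.with_path node_uri (Uri.path uri)) (Uri.query uri)
  in
  (* Relay method, headers and body verbatim; stream the answer back. *)
  let* resp, resp_body =
    Cohttp_lwt_unix.Client.call
      ~headers:(Cohttp.Request.headers req)
      ~body
      (Cohttp.Request.meth req)
      target
  in
  Lwt.return (resp, resp_body)

let () =
  Lwt_main.run
    (Cohttp_lwt_unix.Server.create
       ~mode:(`TCP (`Port 8733)) (* assumed RPC-server port *)
       (Cohttp_lwt_unix.Server.make ~callback ()))
```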
Then, we will start implementing a way to avoid blocking the node when heavy RPCs are queried. To do so, we will first open the storage in the `octez-rpc-server` each time an RPC call is received. Then, to optimize performance, we will switch to a store update function called only when necessary: the store updates its state regularly, but the breaking changes (file descriptor changes) are performed only during store merges -- a maintenance procedure occurring at the end of cycles only. We will also make sure that we observe no critical performance regression. In addition, as RPCs are a central and critical component, we must ensure that the system is reliable (through the success of all automatic tests).
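A small sketch of this "update only when necessary" idea: track the last store merge observed and reload only when a new merge has happened. The `merge_generation` counter and both function names are hypothetical; in Octez the equivalent condition lives in the store itself.

```ocaml
(* Hypothetical sketch: [merge_generation] would be bumped by the node at
   each store merge (end of cycle); the RPC process reopens its store
   only when that generation changed, instead of at every RPC call. *)
let last_seen_generation = ref (-1)

let sync_if_needed ~(merge_generation : unit -> int)
    ~(reload_store : unit -> unit Lwt.t) () =
  let current = merge_generation () in
  if current <> !last_seen_generation then begin
    last_seen_generation := current ;
    (* A merge replaced files on disk: reopen the store's descriptors. *)
    reload_store ()
  end
  else Lwt.return_unit
```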
Finally, we will try to improve as many RPCs as possible with a best-effort strategy.
In addition to that, we will tackle some technical debt introduced by the proxy-mode and proxy-server by cleaning up the code and improving some naming.
Here is the list of RPCs handled by the RPC process (goal).
Minimal `octez-rpc-server` that redirects to the node (ETA: end of August)
- (hours) enable redirection of the RPC middleware
- (hours) `octez-proxy-server` forwards RPCs to the node in a transparent way !8672 (merged)
- (days) spawn and use an `octez-rpc-server` process ~~!8946 (merged)~~ | ~~!9567 (merged)~~ | !9957 (merged)
  - (days) spawn an `octez-rpc-server` process that communicates through RPCs with the node
  - (days) improve the `octez-rpc-server` process communication with the node
  - (days) fix p2p resource leak !9326 (merged)
  - (hours) fix cohttp resource leak cohttp!982
  - (hours) conform to `cohttp.6` (!9392 (merged))
  - (days) wait for the `cohttp.5.2` release cohttp!989
  - (days) wait for the `cohttp.5.2` release to be merged in the opam repository opam-repository!24082
  - (hours) new `resto.1.2` release to be compatible with `cohttp.5.2` resto.1.2
  - (days) wait for `resto.1.2` to be merged in the opam repository opam-repository!24097
  - (days) merge `cohttp.5.2` and `resto.1.2` in tezos/opam-repo opam-repository!434 (merged)
  - (days) merge `cohttp.5.2` and `resto.1.2` !9454 (merged)
- (days) test the reliability of the `octez-rpc-server` process
  - risk: the kill signals are wrongly handled by the CI
  - (days) fix CI flakiness !9587 (merged)
- (weeks) make sure the CI is not flaky (i.e. fix tech. debt)
  - Optimize storage snapshot tezt !9715 (merged)
  - Optimize storage snapshot tezt mk2 !9656 (merged)
  - Tezt: reduce memory consumption of the p2p-swap-disable test by 2 !9587 (merged)
  - Tezt: avoid flakiness by waiting for the minimal block delay !9663 (merged)
  - Non-flaky `propose for` command !9815 (merged)
  - Add missing cloexec flags in the logging system !9525 (merged)
  - Tezt: fix flaky external validator test !9458 (merged)
  - Introduce tezt greedy tests !9650 (merged)
  - Flaky: Nairobi: Testing Full DAC infrastructure (test DAC disconnects from L1) !9993 (merged)
  - Reduce `p2p-maintenance-init-expected_connections` memory usage by switching to a non-default `--local-rpc-server` !9942 (closed)
  - Flaky: move greedy tests to an isolated dedicated pipeline !9650 (merged)
  - Rollup node: better detection of disconnections !10525 (merged)
Improve the `octez-rpc-server` by loading some data internally (ETA: mid-November)
- (days) load the config data to answer config RPCs !9434 (merged)
- (days) load the version data to answer version RPCs !9432 (merged)
- (weeks) load the chain store and `build_rpc_directory` only when required !9490 (merged)
  - (days) specify when the store requires to be reloaded
  - (days) define a `Store.reload`/`refresh` function to reflect store changes
  - (days) plug the `Store.reload`/`refresh` function into the `octez-rpc-server`
  - (weeks) introduce store locks to avoid data races
  - (weeks) plug the store opening in the RPC process
  - (days) ensure RPC consistency
  - (days) test data races and deadlock absence
  - (days) track CI flakiness
  - (days) bench store sync overhead
- (weeks) fix various tezt errors (#6233 (closed))
  - (days) Flaky 'Alpha: storage snapshot export and import' !9490 (merged)
  - (days) Flaky 'Alpha: forge block with wrong payload' !9815 (merged)
  - (days) Flaky 'manually forked migration blocks from nairobi to alpha' !9815 (merged)
  - (days) Flaky 'Alpha: forge block with wrong payload'
  - (days) Flaky 'Oxford: node synchronization (archive / archive)' (@vect0r)
  - (days) Failing 'Alpha: VDF daemon' !9922 (merged)
  - (days) Failing 'amendment: alpha -> injected_test (losers: nairobi)'
  - (days) Flaky 'Nairobi: Manager_restriction_propagation'
  - (hours) Remove deprecated/duplicated RPC to simplify workflow !10967 (merged)
  - (days) Double baking flakiness: Cannot find protocol X `dune exec tezt/tests/main.exe -- --file tezt/tests/double_bake.ml --title 'Alpha: double baking with accuser' --loop-count 1000 --test-timeout 120 --verbose |& tee tezt.log` -- it seems that the protocol table is updated after a new head is promoted, on the shell side; however, the RPC process may receive the notification of an applied block in the meantime. As the protocol table is not yet updated, it fails to get the block.
  - (hours) RPC forward issue !10986 (merged) `--title 'Alpha: RPC process forward'` -- it appears that exceptions and Lwt errors are not handled the same way in the RPC_middleware, leading to unexpectedly non-forwarded RPCs.
  - (hours) Alpha: storage snapshot drag after rolling import -- notify the store head sync only on a new head
  - (hours) inconsistent caboose values because of lazy store invariants sync `--title 'Nairobi: node synchronization (rolling_0 / full)'` -- restarting nodes during tests where the value of the caboose was not well synchronized, as the RPC process `Store.sync` is sometimes faster than the store's merge
  - (hours) fix storage tezt relying on wrong invariants !11019 (merged) `--title 'Alpha: storage snapshot drag after rolling import'`
  - … ?
- (hours) check that the block metadata of the head can be queried
- (hours) check that heavy RPCs do not block the node anymore
  - baking rights
  - endorsing rights
  - accounts list
  - contract list
  - get operations of blocks (many times in a row)
  - get big maps values (many times in a row)
- (days) handle RPC metrics
  - complicated to have non-broken metrics -- deactivate all RPC metrics. See #????
- (days) provide benchmarks (@vivienpe)
  - no regression
  - better performance for typical use cases
(Stretch goal) Tackle some proxy-mode and proxy-server technical debt (ETA: mid-November)
(Done in vicall@remove-proxy-sever but requires the preceding work to be finished before being merged)
- unplug the tezos-proxy-server, aka "proxy-mode" (req. naive build rpc directory)
  - remove the `Server` case of the `mode` type from `src/lib_proxy/proxy.ml`
  - remove the `lib_context` dependency of the proxy
  - remove the `lib_context` dependency of the `octez-client`