Non-Blocking RPCs
Objective: Great UX for node users
Key Result: RPCs do not block the node
Update: 2024/04/01
⚠️ This project was revived in the context of %(2024Q2) - Layer 1 - Public RPC endpoint supporting an average of 1k RPS
Update: 2023/12/15
⚠️ STALLED
The project is stalled for now, as a blocking file descriptor leak was discovered in the Cohttp library.
Leak analysis
When a streamed RPC is called on the external RPC process, 3 FDs are allocated:
- a TCP FD, for the connection from the client to the external RPC process,
- 2 UNIX FDs, one for each side of the connection between the external RPC process and the node.
When the client closes the connection, the TCP FD remains until the external RPC process tries to send some data on it. When the external RPC process tries to send data on the closed connection, the associated FD is released.
However, releasing the TCP FD does not release the two UNIX FDs: hence the leak.
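To make this lifecycle concrete, here is a minimal OCaml sketch, assuming `Lwt_unix`, of how the three descriptors come into existence for one streamed call; the function name and the socketpair transport are assumptions of this example, not the actual Octez wiring.

```ocaml
(* Illustrative sketch only: mirrors the three FDs described above,
   assuming Lwt_unix; not the actual Octez code. *)
let allocate_streamed_rpc_fds listening_socket =
  let open Lwt.Syntax in
  (* FD 1: the TCP connection accepted from the client. *)
  let* tcp_fd, _client_addr = Lwt_unix.accept listening_socket in
  (* FDs 2 and 3: one per side of the RPC-process <-> node channel. *)
  let to_node, from_node =
    Lwt_unix.socketpair Unix.PF_UNIX Unix.SOCK_STREAM 0
  in
  (* Closing [tcp_fd] alone leaks [to_node] and [from_node]: all three
     must be released together when the client goes away. *)
  Lwt.return (tcp_fd, to_node, from_node)
```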
Solutions
Improve Cohttp
Record resource creations/deletions and explicitly ask Cohttp to clean the resources up. This requires updating or forking Cohttp.
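A minimal sketch of this record-and-clean idea, assuming `Lwt_unix`; the registry below is hypothetical glue around the HTTP layer, not an existing Cohttp API.

```ocaml
(* Hypothetical per-connection registry: every FD opened for a connection
   is recorded, then torn down explicitly when the connection dies. *)
let registry : Lwt_unix.file_descr list ref = ref []

let track fd =
  registry := fd :: !registry ;
  fd

let release fd =
  (* Physical equality: Lwt_unix.file_descr is an abstract record. *)
  registry := List.filter (fun fd' -> fd' != fd) !registry ;
  Lwt_unix.close fd

(* Called on connection teardown: close every descriptor still recorded,
   so the UNIX pair does not outlive the TCP side. *)
let release_all () = Lwt.join (List.map release !registry)
```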
Do not transfer streamed RPCs
Wrap the streamed RPCs in the RPC process so that a single streamed RPC is opened toward the node and shared among the client requests (see the sketch below). This makes adding streamed RPCs to the node more complex.
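A minimal sketch of the sharing idea, assuming `Lwt_stream`; `open_upstream` and the `string` payload are illustrative assumptions.

```ocaml
(* At most one node-side stream; [upstream] caches it once opened. *)
let upstream : string Lwt_stream.t option ref = ref None

(* Each client gets an independent clone of the single upstream stream,
   so only one streamed RPC is ever opened toward the node. *)
let subscribe ~(open_upstream : unit -> string Lwt_stream.t) () =
  let base =
    match !upstream with
    | Some s -> s
    | None ->
        let s = open_upstream () in
        upstream := Some s ;
        s
  in
  Lwt_stream.clone base
```

Note that an `Lwt_stream` clone only sees elements not yet consumed at the time it is created, so a real implementation would also have to decide how to handle late subscribers.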
Disable streamed RPCs on the RPC process
Allow streamed RPCs on the node's local RPC server only. The user thus needs to run the RPC server for the (potentially blocking) RPCs and the node's local one for the streamed RPCs.
Conclusion
This issue is blocking for the Non-Blocking RPC project. Without resolving it, it won't be possible to bring back the external RPC server by default, which may prevent this feature from being used.
The best solution to consider is to migrate from Cohttp to httpaf -- we have been complaining about Cohttp for a while -- before enabling the external RPC server.
As there are not enough resources for now, we pause the project.
Motivation
Calling RPCs requires the node to do some work, in addition to its essential tasks such as networking, storing data and validating blocks. Most of these RPCs are lightweight and the node can easily handle them among its other tasks. However, when it comes to heavy RPCs, such as requesting endorsing or baking rights, the node may struggle for a while and may delay other essential tasks. At worst, the node may be frozen during the evaluation of a heavy RPC. Avoiding such slowdowns would give node operators responsive and more predictable performance when calling RPCs on their nodes.
Scope
The goal of this milestone is to avoid blocking the node when heavy RPCs are requested. By heavy RPCs, we mean all the computation- or I/O-intensive RPCs, such as:
- baking rights
- endorsing rights
- accounts list
- contract list
- …
To tackle that, we propose to:
- spawn an `octez-rpc-server`, alongside the node, that will be in charge of handling RPCs,
- optimize the `octez-rpc-server` to reuse data and reduce its workload when handling heavy RPCs,
- optimize all the remaining RPCs with a best-effort strategy.
Design
The `octez-rpc-server` will be developed incrementally, by adding features one by one.
First, we will start with a minimal `octez-rpc-server` that only redirects RPCs to the node it is associated with. To do so, we will use the built-in redirection capabilities of the RPC library middleware, as sketched below. The way the `octez-rpc-server` communicates with the node will be improved in a second step, and we will ensure the reliability of the `octez-rpc-server` through tests.
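As an illustration of this redirection step, here is a self-contained sketch of a process forwarding every RPC to its node, assuming `Cohttp_lwt_unix`; the addresses and port are assumptions of this example, and the actual Octez middleware plumbing is omitted.

```ocaml
(* Hedged sketch: forward every incoming RPC to the node and relay the
   answer, assuming Cohttp_lwt_unix; not the actual Octez middleware. *)
open Lwt.Syntax

let node_uri = Uri.of_string "http://127.0.0.1:8732" (* assumed node addr *)

let callback _conn req body =
  (* Rebuild the target URI against the node, keeping path and query. *)
  let uri = Cohttp.Request.uri req in
  let target =
    Uri.with_query (Uri.with_path node_uri (Uri.path uri)) (Uri.query uri)
  in
  (* Relay method, headers and body verbatim; stream the answer back. *)
  let* resp, resp_body =
    Cohttp_lwt_unix.Client.call
      ~headers:(Cohttp.Request.headers req)
      ~body
      (Cohttp.Request.meth req)
      target
  in
  Lwt.return (resp, resp_body)

let () =
  Lwt_main.run
    (Cohttp_lwt_unix.Server.create
       ~mode:(`TCP (`Port 8733)) (* assumed RPC-server port *)
       (Cohttp_lwt_unix.Server.make ~callback ()))
```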
Then, we will start implementing a way to avoid blocking the node when heavy RPCs are queried. To do so, we will first open the storage in the `octez-rpc-server` each time an RPC call is received. Then, to optimize performance, we will switch to a store update function called only when necessary: the store updates its state regularly, but the breaking changes (file descriptor changes) are performed only during store merges -- a maintenance procedure occurring at the end of cycles only. We will also make sure that we observe no critical performance regression. In addition, as RPCs are a central and critical component, we must ensure that the system is reliable (through the success of all automatic tests).
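A small sketch of this "update only when necessary" idea: track the last store merge observed and reload only when a new merge has happened. The `merge_generation` counter and both function names are hypothetical; in Octez the equivalent condition lives in the store itself.

```ocaml
(* Hypothetical sketch: [merge_generation] would be bumped by the node at
   each store merge (end of cycle); the RPC process reopens its store
   only when that generation changed, instead of at every RPC call. *)
let last_seen_generation = ref (-1)

let sync_if_needed ~(merge_generation : unit -> int)
    ~(reload_store : unit -> unit Lwt.t) () =
  let current = merge_generation () in
  if current <> !last_seen_generation then begin
    last_seen_generation := current ;
    (* A merge replaced files on disk: reopen the store's descriptors. *)
    reload_store ()
  end
  else Lwt.return_unit
```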
Finally, we will try to improve as many RPCs as possible with a best-effort strategy.
In addition to that, we will tackle some technical debt introduced by the proxy-mode and proxy-server by cleaning up the code and improving some naming.
Here is the list of RPCs handled by the RPC process (goal).
Minimal `octez-rpc-server` that redirects to the node (ETA: end of August)
- (hours) enable redirection of the RPC middleware
- (hours) `octez-proxy-server` forwards RPCs to the node in a transparent way !8672 (merged)
- (days) spawn and use an `octez-rpc-server` process ~~!8946 (merged)~~ | ~~!9567 (merged)~~ | !9957 (merged)
  - (days) spawn an `octez-rpc-server` process that communicates through RPCs with the node
  - (days) improve the `octez-rpc-server` process communication with the node
  - (days) fix p2p resource leak !9326 (merged)
  - (hours) fix cohttp resource leak cohttp!982
  - (hours) conform to `cohttp.6` (!9392 (merged))
  - (days) wait for the `cohttp.5.2` release cohttp!989
  - (days) wait for the `cohttp.5.2` release to be merged in the opam repository opam-repository!24082
  - (hours) new `resto.1.2` release to be compatible with `cohttp.5.2` resto.1.2
  - (days) wait for `resto.1.2` to be merged in the opam repository opam-repository!24097
  - (days) merge `cohttp.5.2` and `resto.1.2` in tezos/opam-repo opam-repository!434 (merged)
  - (days) merge `cohttp.5.2` and `resto.1.2` !9454 (merged)
- (days) test the reliability of the `octez-rpc-server` process
  - risk: the kill signals are wrongly handled by the CI
  - (days) fix CI flakiness !9587 (merged)
- (weeks) make sure the CI is not flaky (i.e. fix tech. debt)
  - Optimize storage snapshot tezt !9715 (merged)
  - Optimize storage snapshot tezt mk2 !9656 (merged)
  - Tezt: reduce memory consumption of the p2p-swap-disable test by 2 !9587 (merged)
  - Tezt: avoid flakiness by waiting for the minimal block delay !9663 (merged)
  - Non-flaky `propose for` command !9815 (merged)
  - Add missing cloexec flags in the logging system !9525 (merged)
  - Tezt: fix flaky external validator test !9458 (merged)
  - Introduce tezt greedy tests !9650 (merged)
  - Flaky: Nairobi: Testing Full DAC infrastructure (test DAC disconnects from L1) !9993 (merged)
  - Reduce `p2p-maintenance-init-expected_connections` memory usage by switching to a non-default `--local-rpc-server` !9942 (closed)
  - Flaky: move greedy tests to an isolated dedicated pipeline !9650 (merged)
  - Rollup node: better detection of disconnections !10525 (merged)
Improve the `octez-rpc-server` by loading some data internally (ETA: mid-November)
- (days) load the config data to answer config RPCs !9434 (merged)
- (days) load the version data to answer version RPCs !9432 (merged)
- (weeks) load the chain store and `build_rpc_directory` only when required !9490 (merged)
  - (days) specify when the store requires to be reloaded
  - (days) define a `Store.reload`/`refresh` function to reflect store changes
  - (days) plug the `Store.reload`/`refresh` function into the `octez-rpc-server`
  - (weeks) introduce store locks to avoid data races
  - (weeks) plug the store opening in the RPC process
  - (days) ensure RPC consistency
  - (days) test data races and deadlock absence
  - (days) track CI flakiness
  - (days) bench store sync overhead
- (weeks) fix various tezt errors (#6233 (closed))
  - (days) Flaky 'Alpha: storage snapshot export and import' !9490 (merged)
  - (days) Flaky 'Alpha: forge block with wrong payload' !9815 (merged)
  - (days) Flaky 'manually forked migration blocks from nairobi to alpha' !9815 (merged)
  - (days) Flaky 'Alpha: forge block with wrong payload'
  - (days) Flaky 'Oxford: node synchronization (archive / archive)' (@vect0r)
  - (days) Failing 'Alpha: VDF daemon' !9922 (merged)
  - (days) Failing 'amendment: alpha -> injected_test (losers: nairobi)'
  - (days) Flaky 'Nairobi: Manager_restriction_propagation'
  - (hours) Remove deprecated/duplicated RPC to simplify workflow !10967 (merged)
  - (days) Double baking flakiness: Cannot find protocol X `dune exec tezt/tests/main.exe -- --file tezt/tests/double_bake.ml --title 'Alpha: double baking with accuser' --loop-count 1000 --test-timeout 120 --verbose |& tee tezt.log` -- it seems that the protocol table is updated after a new head is promoted, on the shell side; however, the RPC process may receive the notification of an applied block in the meantime. As the protocol table is not yet updated, it fails to get the block.
  - (hours) RPC forward issue !10986 (merged) `--title 'Alpha: RPC process forward'` -- it appears that exceptions and Lwt errors are not handled the same way in the RPC_middleware, leading to unexpectedly non-forwarded RPCs.
  - (hours) Alpha: storage snapshot drag after rolling import -- notify the store head sync only on a new head
  - (hours) inconsistent caboose values because of lazy store invariants sync `--title 'Nairobi: node synchronization (rolling_0 / full)'` -- restarting nodes during tests where the value of the caboose was not well synchronized, as the RPC process `Store.sync` is sometimes faster than the store's merge
  - (hours) fix storage tezt relying on wrong invariants !11019 (merged) `--title 'Alpha: storage snapshot drag after rolling import'`
  - … ?
- (hours) check that the block metadata of the head can be queried
- (hours) check that heavy RPCs do not block the node anymore
  - baking rights
  - endorsing rights
  - accounts list
  - contract list
  - get operations of blocks (many times in a row)
  - get big maps values (many times in a row)
- (days) handle RPC metrics
  - complicated to have non-broken metrics -- deactivate all RPC metrics. See #????
- (days) provide benchmarks (@vivienpe)
  - no regression
  - better performance for typical use cases
(Stretch goal) Tackle some proxy-mode and proxy-server technical debt (ETA: mid-November)
(Done in vicall@remove-proxy-sever but requires the preceding work to be finished before being merged)
- unplug the tezos-proxy-server, aka "proxy-mode" (req. naive build rpc directory)
  - remove the `Server` case of the `mode` type from `src/lib_proxy/proxy.ml`
  - remove the `lib_context` dependency of the proxy
  - remove the `lib_context` dependency of the `octez-client`