(Q3-2024) Layer 1 - Improve store scalability
Cemented files store cemented blocks (blocks no longer subject to reorgs) in chunks, which usually correspond to protocol cycles. Each file is essentially a concatenation of blocks (Block_repr.t) that we access through file offsets and indexes. These offsets are encoded as unsigned 32-bit integers, so only files up to 4 GiB can be addressed. We must be able to address larger files to avoid a critical and unrecoverable failure as soon as this limit is reached.
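To make the limit concrete, here is a minimal sketch in Python (not Octez code) of an offset index over concatenated blocks. The function names, the block sizes, and the big-endian encoding are assumptions chosen for illustration; only the 32-bit versus wider encoding contrast comes from the description above.

```python
import struct

UINT32_MAX = 2**32 - 1  # largest offset an unsigned 32-bit field can hold


def encode_offset_32(offset: int) -> bytes:
    """Encode a file offset as an unsigned 32-bit integer (big-endian assumed)."""
    return struct.pack(">I", offset)


def encode_offset_64(offset: int) -> bytes:
    """Encode a file offset as an unsigned 64-bit integer (big-endian assumed)."""
    return struct.pack(">Q", offset)


def build_offsets(block_sizes):
    """Return the starting offset of each block in a concatenated file."""
    offsets, position = [], 0
    for size in block_sizes:
        offsets.append(position)
        position += size
    return offsets


# Three hypothetical blocks; the third one starts past the 4 GiB boundary.
offsets = build_offsets([3 * 2**30, 2 * 2**30, 1024])

for off in offsets:
    try:
        encode_offset_32(off)
        status = "fits in 32 bits"
    except struct.error:
        # struct refuses values above UINT32_MAX for the "I" format.
        status = "needs a wider encoding"
    print(off, status)
```

Any block whose starting offset exceeds `UINT32_MAX` cannot be indexed with the 32-bit encoding, while the 64-bit encoding accepts it. This is the situation the new store layout is meant to avoid.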
The metadata are stored in the same way, but compressed to save disk space. camlzip is used for compression: it saves a decent amount of disk space while keeping good performance. Thanks to the zip specification, entries can be accessed randomly, without decompressing the whole file. Currently, the default metadata limits (set by the shell) allow them to grow up to 240 GB for 24,576 blocks (1 cycle).
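The random-access property can be illustrated with a short sketch. It uses Python's standard `zipfile` module rather than camlzip, and the entry naming (one entry per block level) and the payloads are assumptions made for the example, not the actual store layout.

```python
import io
import zipfile

# Hypothetical per-block metadata payloads, keyed by block level.
metadata = {
    level: f"metadata for block {level} ".encode() * 100 for level in range(5)
}

buffer = io.BytesIO()
# allowZip64=True lets zipfile emit ZIP64 records once an archive
# outgrows the classic 4 GiB limits of the zip format.
with zipfile.ZipFile(
    buffer, "w", compression=zipfile.ZIP_DEFLATED, allowZip64=True
) as archive:
    for level, payload in metadata.items():
        archive.writestr(str(level), payload)

# Reading one entry only decompresses that entry: the central directory
# at the end of the archive records where each entry starts.
with zipfile.ZipFile(buffer, "r") as archive:
    one_entry = archive.read("3")
```

Here `one_entry` holds the payload of the entry named `"3"`, and the other entries are never decompressed.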
Regarding the store, merges are known to consume a lot of resources (doubling the memory usage). Currently, merging occurs simultaneously on every node in the network at each new cycle. This can slow nodes down, can cause crashes in the worst case, and can slow down the overall network. Besides, the beginning of a cycle is also when bakers pay their rewards and may query their nodes with heavy RPCs. Moreover, having all nodes merge their stores at the same time makes this high resource consumption predictable, which attackers can exploit. Considering all this, we intend to make store merging unpredictable by having each node independently and randomly select its merge starting point within a cycle (while avoiding the early blocks of the cycle).
Work breakdown
Note: ETAs are estimates based on a project start at the beginning of Q3, that is 2024/07/01. As the project is likely to start on 2024/07/15, the ETAs are expected to shift according to the actual start date.
- (ETA: 2024-07-12) Allow cemented store to handle cycle above 4Gb #7034 (closed)
  - Introduce an upgrade procedure to migrate the former layout to the new one - !14211 (merged)
    - Introduce a new store version
    - Upgrade the cemented blocks files from the store
  - Allow backward compatibility for snapshot import - !14398 (merged)
    - Introduce a new snapshot version
    - Instead of copying the cemented blocks files during snapshot import, upgrade them as well for better efficiency (Raw/Tar versions)
  - Bench the upgrade procedure - !14401 (merged) (storage) and !14440 (merged) (snapshots)
    - Rolling nodes
    - Full/archive nodes
  - Handle backward compatibility - this is no longer required, since we force the storage and snapshot upgrade procedures from above
    - Allow the store to open legacy layout - !14135 (closed)
    - Display a warning message to encourage storage upgrade
- (ETA: 2024-07-05) Allow compressed metadata to handle over 4Gb of data #7035 (closed)
  - Look for alternate compression library
    - Look for an alternate Zip library relying on 64-bit support
    - Look for an alternate compression library implemented in OCaml (matching our requirements)
    - Zip64 implementation (feature request made up to 2 years ago -- released 1 month ago!) - https://github.com/xavierleroy/camlzip/pull/43
  - Implement a metadata splitting mechanism to avoid reaching 4Gb metadata files
  - Check the camlzip64 library backward compatibility
    - camlzip.1.11 files can be opened with camlzip.1.12
    - camlzip.1.12 files can exceed 4Gb
    - camlzip.1.11 files opened with camlzip.1.12 can exceed 4Gb
  - Bump the camlzip.1.11 dep to camlzip.1.12 !14474 (merged)
- (ETA: 2024-07-19) Reconsider the metadata-size-limit default value
  - Consider not storing metadata -- Postponed
    - Understand the impact of not storing metadata any more -- Postponed
    - Bench recomputing metadata instead of reading it from disk -- Postponed
  - Consider reducing the limit -- Postponed
    - Investigation document
- (ETA: 2024-07-31) Introduce non-deterministic storage maintenance at cycle dawn
  - Define a window within which we want the storage maintenance to be triggered
    - Exclusion: not in the first 20th of a cycle -- 3h30 for mainnet
    - Limit: not after the middle of a cycle -- ~1.5 days for mainnet
  - Allow to configure the storage maintenance (auto/custom/off)
    - off !14152 (merged)
    - custom !14279 (merged)
    - auto !14456 (merged)
      - Uses a random seed to delay the storage maintenance
      - Respects the exclusion and limit bounds
  - Auto is the default behaviour !14503 (merged)
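The maintenance window described in the last item can be sketched as follows. This is a minimal illustration in Python, not the Octez implementation: the function name, the per-node integer seed, and the choice to express the delay in blocks are all assumptions. Only the two bounds (skip the first 20th of the cycle, do not go past the middle) and the cycle length of 24,576 blocks come from the text above.

```python
import random

CYCLE_LENGTH = 24_576  # blocks per cycle, as quoted for the metadata limit


def maintenance_delay(cycle_length: int, seed: int) -> int:
    """Pick how many blocks after cycle start the maintenance is triggered.

    The delay is drawn uniformly between the exclusion bound (first 20th
    of the cycle) and the limit bound (middle of the cycle), inclusive.
    """
    exclusion = cycle_length // 20  # early blocks to avoid
    limit = cycle_length // 2       # latest allowed trigger point
    rng = random.Random(seed)       # hypothetical per-node seed
    return rng.randint(exclusion, limit)


# Simulate 1000 hypothetical nodes, each with its own seed.
delays = [maintenance_delay(CYCLE_LENGTH, seed) for seed in range(1000)]
print(min(delays), max(delays))
```

With these numbers the window spans blocks 1,228 to 12,288 of the cycle. Assuming a 10-second block time (an assumption, not stated above), the lower bound is about 3.4 hours, in line with the 3h30 figure given for mainnet. Different seeds spread the nodes across the window, so maintenance no longer starts everywhere at once.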