Investigate memory consumption when importing projects
We need to investigate ways to decrease memory consumption during the import process. We have observed several Sidekiq restarts on GitLab Dedicated, and even on our 3k reference architecture instance, when multiple import jobs run simultaneously.
To provide context: the current advice for GitLab Dedicated and self-managed customers who need to minimize Sidekiq restarts triggered by import jobs is to route those jobs to a separate Sidekiq process with lower concurrency and more available memory. That workaround isn't ideal, as it requires infrastructure changes.
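For reference, here is a minimal sketch of what that routing could look like on a Linux package installation, assuming the documented `sidekiq['routing_rules']`, `sidekiq['queue_groups']`, and `sidekiq['concurrency']` settings and the `feature_category=importers` worker matcher. Exact option names and values vary by GitLab version, so treat this as illustrative rather than prescriptive.

```ruby
# /etc/gitlab/gitlab.rb (illustrative sketch, not a recommended baseline)

# Route all workers tagged with feature_category :importers to a dedicated
# 'importers' queue; everything else keeps going to 'default'.
sidekiq['routing_rules'] = [
  ['feature_category=importers', 'importers'],
  ['*', 'default']
]

# Run a separate Sidekiq process that only listens to the 'importers'
# queue, alongside a process handling everything else.
sidekiq['queue_groups'] = [
  'importers',
  'default,mailers'
]

# Lower worker concurrency so each import job has more memory headroom
# before a restart is triggered. In the Linux package this applies to
# every Sidekiq process, which is part of why the workaround usually
# implies extra infrastructure (for example a separate Sidekiq node
# sized for imports).
sidekiq['concurrency'] = 10
```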
Here are some points we can begin to examine:
- Is the import process retaining excessive information in RequestStore? For instance, whenever a policy is invoked, the data indicating whether a user can perform a specific operation remains in RequestStore. Would periodically deleting those records help? (See the first sketch after this list.)
- `Gitlab::ImportExport::Base::ObjectBuilder::LRU_CACHE_SIZE` is currently set to 1000. Should we consider decreasing this number, at least for Direct Transfer? Direct Transfer initiates many more jobs simultaneously, meaning each job might cache 1000 records in its LRU cache, which could contribute to the elevated memory usage. (See the second sketch after this list.)
  For `Gitlab::Import::SourceUserMapper::LRU_CACHE_SIZE`, we reduced the number to 100 after investigating the high memory usage caused by user contribution mapping.
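To make the first point easier to picture, here is a minimal sketch using the `request_store` gem that `Gitlab::SafeRequestStore` wraps. The cache key shape is hypothetical and only stands in for memoized policy results.

```ruby
require 'request_store'

# Simulate a long-running import job whose policy checks memoize their
# results in the request store. Nothing evicts these entries, so the
# hash keeps growing for as long as the job runs.
RequestStore.begin!

1_000.times do |user_id|
  # Hypothetical key shape; real policy cache keys look different.
  RequestStore.store["policy/can_read_project/user:#{user_id}"] = true
end

puts RequestStore.store.size # => 1000 entries retained in memory

# The idea under investigation: drop accumulated entries periodically
# during the import instead of holding them until the job finishes.
RequestStore.clear!
puts RequestStore.store.size # => 0

RequestStore.end!
```

The trade-off is that anything cleared mid-import has to be recomputed, so periodic clearing would either need to target only policy-related keys or accept the cost of repeated policy evaluations.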
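For the second point, a back-of-the-envelope sketch of how a per-job LRU cache multiplies across Direct Transfer's parallel jobs. The cache class below is a hand-rolled stand-in rather than the actual ObjectBuilder cache, and the per-entry size and job count are illustrative guesses, not measurements.

```ruby
# Hand-rolled LRU cache stand-in; Ruby hashes preserve insertion order,
# so the first key is always the least recently used one.
class TinyLruCache
  def initialize(max_size)
    @max_size = max_size
    @data = {}
  end

  # Return the cached value for key, computing and storing it on a miss.
  def getset(key)
    if @data.key?(key)
      value = @data.delete(key)
      @data[key] = value # re-insert so the entry becomes most recent
    else
      @data.shift if @data.size >= @max_size # evict least recently used
      @data[key] = yield
    end
  end

  def size
    @data.size
  end
end

# Rough model of why 1000 entries per job adds up when Direct Transfer
# fans out into many concurrent jobs. The byte size and job count are
# illustrative assumptions, not measurements.
entries_per_job = 1_000 # current ObjectBuilder LRU_CACHE_SIZE
bytes_per_entry = 2_000 # guessed footprint of one cached record
concurrent_jobs = 50    # guessed Direct Transfer fan-out

total_mb = entries_per_job * bytes_per_entry * concurrent_jobs / 1024.0 / 1024
puts format('~%.0f MB held in LRU caches alone', total_mb) # => ~95 MB

# Shrinking the cache bounds that growth, as was done for SourceUserMapper.
cache = TinyLruCache.new(100)
1_000.times { |i| cache.getset("label-#{i}") { "record #{i}" } }
puts cache.size # => 100; older entries were evicted
```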