Implement Ability to Regenerate All MR Diffs
Release notes
Problem to solve
As an Administrator of a GitLab instance, I want to be able to regenerate MR diff data, so that I can omit this information from my backup and rest easily. I would also like to be able to regenerate this data if I notice a problem operationally - for example, if the storage storing the external_diffs data has been cleared out erroneously.
In the GitLab backup solution, it is possible to specify --skip external_diffs to skip the backup of MR diffs. This is particularly valuable for self-hosted GitLab installations where MR diffs are configured to use object storage (e.g. S3). Skipping external diffs makes the backup more efficient, because it removes the need to fetch and tar every diff from object storage, only to (probably) upload the entire backup to object storage again at the end of the process. In our testing, this has reduced backup times from around an hour to a few minutes.
This does mean that external diffs need to be treated slightly differently on restore, as they are no longer backed up or restored as part of the overall backup process. We have safeguarded our external diff data by enabling S3 Versioning on the bucket containing it, and we also intend to configure AWS Backup rules for this bucket.
However, strictly speaking, it should not be necessary to back this data up at all. Merge request diff information should be regenerable, as it contains no information beyond what is already held within the application.
It would be nice to have a rake task that could be run to say: "regenerate all of the merge request diff data, as we think something's wrong with it". This should be possible, as the repository data, merge request data and everything else should already be present.
It might be possible to add this to the current restore process: regenerate these diffs as part of every GitLab backup restore operation, rather than restoring them as a distinct step (that is, assume diffs are always regenerated rather than restored). This would save customers with large installations a lot of disk space in their backups.
Intended users
Administrators, sysadmins, DevOps / platform / SRE engineers and developers responsible for the operation of self-hosted GitLab instances.
User experience goal
As an administrator, I should be able to run a rake task that kicks off a regeneration of all merge request diff information, recreating it in external object storage if necessary. If I have a broken MR (i.e. with no external files available), it should be fixed after running this.
Proposal
A rake task that can be executed by the administrator would be the ideal solution. It should behave as described in the user experience goal above: regenerate all merge request diff information, recreate it in external object storage if necessary, and repair any broken MR whose external files are missing.
In this way, the rake task could be considered a part of "disaster recovery" should something happen to the storage of diffs.
Further details
It is possible to do this on a "per-repository" basis already - see the note on this support issue: #214356 (comment 737579031)
This feature proposal involves potentially using the above solution but applying it to an entire instance, i.e. a task that goes through each MR, checks that valid diffs exist, and recreates them if they do not.
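The instance-wide sweep described above could look roughly like the following. This is only an illustrative sketch in Python, not GitLab code: `MergeRequestRef`, the `diff_key` layout, the dictionary standing in for object storage, and the `build_diff` callback are all hypothetical names invented for this example. It assumes a diff counts as "missing" when its key is absent from storage or its content is empty.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class MergeRequestRef:
    """Hypothetical stand-in for a merge request record; not a GitLab model."""

    project: str
    iid: int

    @property
    def diff_key(self) -> str:
        # Hypothetical storage key layout, chosen only for this sketch.
        return f"{self.project}/merge_requests/{self.iid}/diff"


def regenerate_missing_diffs(
    merge_requests: Iterable[MergeRequestRef],
    diff_store: Dict[str, bytes],
    build_diff: Callable[[MergeRequestRef], bytes],
) -> List[MergeRequestRef]:
    """Recreate the stored diff for every merge request whose diff is absent or empty.

    `diff_store` stands in for the external diff storage and `build_diff`
    stands in for recomputing a diff from repository data. Returns the
    merge requests that were repaired, in the order they were visited.
    """
    repaired = []
    for mr in merge_requests:
        if diff_store.get(mr.diff_key):
            continue  # A non-empty diff is already stored; leave it untouched.
        diff_store[mr.diff_key] = build_diff(mr)
        repaired.append(mr)
    return repaired
```

Because existing diffs are skipped, running the sweep a second time does nothing, which is the property an administrator would want from a task that may be re-run during disaster recovery.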
Also see GitLab Support #289327 where I (under my organization's email) confirm that regenerating this data is possible.
Permissions and Security
Administrators with permission to take and restore backups are the primary audience, so the roles that already apply to those tasks are suitable here.
Documentation
The Backup and restore architectural documentation would need to be updated to describe this new feature, along with any other relevant backup/restore pages.
Availability & Testing
None; this feature request should make backups smaller and more reliable as it would be possible to regenerate the latest state of this data rather than relying on a "snapshot" from a previous backup which could be old.
Available Tier
I think this should be available in all tiers that can use the backup functionality, but happy to defer to you.
Feature Usage Metrics
Happy to leave this to GitLab to determine if appropriate.
What does success look like, and how can we measure that?
Customers can configure their backups to ignore external_diff data. Reporting on the size of this data, and on the fact that it has been "saved" from backups, might be a valid success metric. This in turn reduces cost to customers, both through object storage bills for backups (e.g. we store backups in S3) and through space in on-premises backup solutions (SAN/NAS space, etc.).
What is the type of buyer?
Those who are responsible for maintenance, backup, restore, availability and operation of GitLab. They will be reassured knowing that features like this are present in backup solutions.
Is this a cross-stage feature?
Unknown - happy to take GitLab's steer here.
What is the competitive advantage or differentiation for this feature?
Straightforward disaster recovery. Faster, more consistent backups?
Links / references
See GitLab support case #289327 for the original enquiry