Add Ability To Download All Account Data

Problem to solve

Currently, GitLab does not offer any easy way to download your account information in full. Instead, we are required to manually pull each repository we own or belong to, one by one. To recover additional data such as wikis, issues, etc., a full export is required from another Git service (e.g. GitHub). Along with this, there is no means to download your profile information or other information that GitLab may collect about us while using the service.

Intended users

The intended users are everyone who makes use of GitLab. This is not limited to any specific group of people; it covers everyone with an account on GitLab.com, and potentially users of self-hosted public instances where people may wish to have this feature.

Further details

In an ideal setup, this would be a button added to the user's profile page somewhere that would allow us to request a full backup creation of our account data, including, but not limited to:

  • All your profile information. (Including SSH keys, and similar.)
  • A list of all the groups/organizations your account is part of.
  • A copy of every repository you have access to. (Including all data attached to that repo you have permission to access.)
  • A copy of any issues/pull requests/forks/etc. you have made on your account.
  • Basically everything that is attached to your account.

Some use cases where this could be useful:

  • In the event of a local data loss, this can be used as a means to restore all local repository data for multiple projects at once.
  • For archival purposes, if a user wishes to back up all their repository information that they may not keep local copies of all the time.
  • For a sense of comfort to be able to do simple self-audits of the data being collected.
  • Compliance with newer laws such as the EU GDPR.

Some benefits to having this available:

  • Being able to recover from data loss.
  • Being able to back up our information more easily than the current one-by-one process.
  • Having the option to stop using GitLab.com in the event that something like the telemetry problems that happened this past week is actually pushed forward, giving us a proper means to opt out, collect our data, and move on.
  • Having a much easier method of moving from GitLab.com to self-hosted instances of GitLab or EE versions.

Proposal

This is implemented in many ways across the web by other major websites such as Facebook, Google, Microsoft, etc. Most commonly, however, they use a general setup as follows:

  • User logs into their account.
  • User visits a page within their account profile settings.
  • User clicks a button that requests a backup be made of their data.
  • UI displays a message and selection screen to allow the user to:
    • Confirm that they wish to continue with generating the backup. (A means to save server resources in the event of a mis-click.)
    • Allow the user to select either a full backup or hand-pick parts to be backed up.
    • Offer an override of how to contact the user when the backup is complete and ready for download. (e.g. a secondary email or phone number)
    • Offer a means to password-protect the archive upon generation for extra security.
  • User confirms their selection.
  • The server then deals with the processing of their request as needed.
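
As a rough illustration of the options collected in the steps above, a backup request could be modelled and validated along these lines. This is only a sketch: the section names, the `BackupRequest` type, and the minimum password length are assumptions made for the example, not anything GitLab defines.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional

# Hypothetical section names; a real list would come from GitLab's data model.
ALL_SECTIONS: FrozenSet[str] = frozenset(
    {"profile", "groups", "repositories", "issues"}
)


@dataclass(frozen=True)
class BackupRequest:
    confirmed: bool                            # explicit confirmation (guards against mis-clicks)
    sections: FrozenSet[str] = ALL_SECTIONS    # full backup by default
    contact_override: Optional[str] = None     # e.g. a secondary email address
    archive_password: Optional[str] = None     # optional password for the archive


def validate_request(req: BackupRequest) -> List[str]:
    """Return a list of problems; an empty list means the request is acceptable."""
    errors = []
    if not req.confirmed:
        errors.append("backup must be explicitly confirmed")
    if not req.sections:
        errors.append("select at least one section")
    unknown = req.sections - ALL_SECTIONS
    if unknown:
        errors.append("unknown sections: " + ", ".join(sorted(unknown)))
    # The minimum length of 8 is an arbitrary value chosen for this sketch.
    if req.archive_password is not None and len(req.archive_password) < 8:
        errors.append("archive password is too short")
    return errors
```

Rejecting unconfirmed requests before any work is queued is what saves server resources in the mis-click case.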

This is usually then set up in the following manner:

  • The server uses a low-resource means of collecting the user's information.
  • The server gathers all requested information and archives it (zip, tar, rar, etc.)
  • The server generates a unique key/token to access this data and sends this to the user via the given means of contact.
    • This token/link can be one-time use to prevent multiple user access.
  • The server marks the link and archive to be expired after a given amount of time, generally 24 hours to one week.
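
The token handling described above might look something like the following sketch. The `DownloadLinks` class and its in-memory dictionary are hypothetical stand-ins for whatever storage the server would really use; only `secrets.token_urlsafe` from the Python standard library is relied on for generating an unguessable token.

```python
import secrets
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional, Tuple


class DownloadLinks:
    """In-memory registry of one-time, expiring download tokens (illustrative only)."""

    def __init__(self, ttl: timedelta = timedelta(hours=24)) -> None:
        self._ttl = ttl
        # token -> (archive path, expiry time)
        self._links: Dict[str, Tuple[str, datetime]] = {}

    def issue(self, archive_path: str, now: Optional[datetime] = None) -> str:
        """Create a random token for an archive and record when it expires."""
        now = now or datetime.now(timezone.utc)
        token = secrets.token_urlsafe(32)  # 32 random bytes, URL-safe text
        self._links[token] = (archive_path, now + self._ttl)
        return token

    def redeem(self, token: str, now: Optional[datetime] = None) -> Optional[str]:
        """Return the archive path once; None if unknown, already used, or expired."""
        now = now or datetime.now(timezone.utc)
        entry = self._links.pop(token, None)  # pop makes the token one-time use
        if entry is None:
            return None
        archive_path, expires_at = entry
        if now >= expires_at:
            return None
        return archive_path
```

The default lifetime of 24 hours matches the lower end of the range mentioned above; a longer window would just be a different `ttl`.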

On the website side of things at this point:

  • The user will see a list of backups they have current access to.
  • The user will have the option to download or delete the backup.

Permissions and Security

Permissions-wise, this should be easy to add to the existing GitLab environment with no additional changes to permissions.

At the user level, this would be no different than accessing our existing profile settings and making changes to our personal profile. At the group level, this would be no different than having group owners/administrators.

An API can be created to allow the user to request backup generations be made from the API level.

For UI, this would require additional UI changes to be implemented, but nothing that would require any new libraries or requirements to the project outside of what it already includes.

The main work here would be the server-side processing of the backup generation, maintaining a list of generated links/tokens/auth of some means, and so on, as well as having a cron job or similar to expire the backups and delete archives that have gone past the allowed timeframe.
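
For the cleanup part, a minimal sketch of what such a periodic job could run is shown below. It assumes, purely for illustration, that archives are stored as `.zip` files in a single directory and that file modification time marks when each archive was generated; the function name and layout are hypothetical.

```python
import time
from pathlib import Path
from typing import List, Optional


def purge_expired_archives(
    archive_dir: Path, max_age_seconds: float, now: Optional[float] = None
) -> List[str]:
    """Delete .zip archives older than max_age_seconds and return their names."""
    now = time.time() if now is None else now
    removed = []
    for path in sorted(archive_dir.glob("*.zip")):
        age = now - path.stat().st_mtime  # seconds since the archive was written
        if age > max_age_seconds:
            path.unlink()
            removed.append(path.name)
    return removed
```

A cron entry or background worker could call this with, say, `7 * 86400` for a one-week retention window.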

Documentation

If included in self-hosted versions of GitLab, this would need to be outlined as being a new feature as well as documentation being added for proper folder permissions to where the backups would be accessible from (or other means of generating the download).

Documentation would need to be made for this involving:

  • Explaining the new feature.
  • Detailing how the feature works.
  • Detailing how the feature can be configured.
  • Detailing how the feature can be enabled/disabled if need be.
  • Detailing how the feature may impact system performance while backups are generated.

Testing

A risk this change can pose is incorrect configuration causing leaks of private/personal information. This could be as little as account details, or as big as full source code backups of private repositories. Special care would need to be taken to ensure that the backup is not accessible to anyone without the proper link/token/access to it.

Testing would need to cover and ensure that:

  • The generated link is unique.
  • The generated link is not easily guessable or identifiable.
  • The generated archive is safely blocked from access without having said link.
  • The generated archive is safely stored (e.g. password-protected if desired).
  • The generated user data does not include full critical information such as:
    • Full credit card numbers.
    • Full social security numbers.
    • Plain-text passwords/password hashes. (These should not be included at all to be honest.)
  • The generated data can be deleted at any time before it auto-expires if the user requests it.
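
One way the "no full critical information" check could be exercised is with a sanitizing pass over the exported data before it is archived. The sketch below is an assumption-heavy example: the key names (`password_hash`, `credit_card_number`, and so on) are hypothetical and do not reflect GitLab's actual schema.

```python
# Hypothetical field names; real ones would depend on the actual data model.
FORBIDDEN_KEYS = {"password", "password_hash"}
MASKED_KEYS = {"credit_card_number", "social_security_number"}


def mask(value, visible=4):
    """Hide all but the last `visible` characters of a value."""
    text = str(value)
    if len(text) <= visible:
        return "*" * len(text)
    return "*" * (len(text) - visible) + text[-visible:]


def sanitize_export(data):
    """Recursively drop forbidden keys and mask sensitive ones in export data."""
    if isinstance(data, dict):
        cleaned = {}
        for key, value in data.items():
            if key in FORBIDDEN_KEYS:
                continue  # passwords and hashes are never exported
            if key in MASKED_KEYS:
                cleaned[key] = mask(value)
            else:
                cleaned[key] = sanitize_export(value)
        return cleaned
    if isinstance(data, list):
        return [sanitize_export(item) for item in data]
    return data
```

A test can then assert that no forbidden key survives anywhere in the generated data and that masked fields expose only their trailing characters.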

Cross-browser testing would be required mainly for the UI features to ensure things are properly rendered on the various available browsers and mobile devices.

What does success look like, and how can we measure that?

Success, for me, with this feature request would be the ability to download all of my data as listed above in a single click, as a protected archive, allowing me to make personal choices with my user data, be it to move all my repos hosted here on GitLab.com to a local, self-hosted instance, or even to shut down my account after backing up all my data.

In terms of measuring what level of success I would put things at, that would ultimately depend on how the data is generated and given to me. How difficult is it to take that data and reimplement it somewhere else? How difficult is it, for example, to recover a repository in full (code, commit history, issues, wiki, etc.) from the backup data? If these are all simple to do in some manner, then I would say this is a high level of success.

Acceptance would be to include all requested data and not just portions. If I wish to back up a repository, I would expect that backup to include all data regarding the repository: full commit history, a full backup of the open and closed issues/pull requests, a full backup of the wiki, webhooks, etc. Literally everything. It should be a full backup so that if I move it to a self-hosted GitLab instance, it is like nothing changed and it was always hosted there instead.

What is the type of buyer?

This should not be a paid feature. This is our own data we are requesting.

Links / references

Edited by 🤖 GitLab Bot 🤖