BulkImports: How to approach deeply nested associations?
Context
~"group::import" is working on a better GitLab-to-GitLab migration experience (&2771). The main goal is to provide a way to migrate between GitLab instances with one click, without having to handle files. To achieve that, the backend engineers decided to use the GitLab GraphQL API to extract data from the source GitLab instance.
The problem: Deeply nested associations
GitLab's data structure is a tree, something like:
- Group # Level 0
  - Group Labels # Level 1
  - Epics # Level 1
    - Labels # Level 2
    - AwardEmoji # Level 2
    - Events # Level 2
    - Notes # Level 2
      - AwardEmoji # Level 3
We usually refer to Level 2+ entities as subrelations.
Level 0 and Level 1 (Example: Migrate Group or Epic)
Up to now, data from these levels has been straightforward to migrate, without big challenges.
Level 2 (Example: Migrate Epic's subrelations)
Let's imagine the following:
- A Group with
  - 5000 Epics (E)
    - about 1000 of these Epics have 5000 Notes (N)
      - about 500 of these Notes have 5000 AwardEmojis (NA)
    - about 1000 of these Epics have 5000 AwardEmojis (A)
    - about 1000 of these Epics have 5000 Events (EE)
Migrating data from this level introduces two main challenges:
- Dependency on the level above (Level N-1)
- Multiple-level pagination. For instance, we have to paginate the epics and then each epic's subrelations
Our current approach is, for each Epic subrelation type (Notes, Events, etc.), to iterate over the Epics and fetch all the subrelation data, which might be paginated. Given that we have 5000 Epics, this means at least 5000 network requests to the source GitLab instance per subrelation type, and more if some Epics have more records than fit in one page. This generates a lot of web requests, which is a performance problem and will probably run into rate limiting as well.
# Example
# Not real code
[Notes, AwardEmojis, Events].each do |subrelation_type|
  group.epics.each do |epic|
    RunMigrationFor(epic, subrelation: subrelation_type) # This might be paginated
  end
end
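To put a rough number on the scenario above, here is a minimal sketch of the request count for a single subrelation type. The page size of 100 and the assumption that the remaining 4000 Epics have no notes are mine, not measured values; an empty collection still costs one request to discover that it is empty.

```ruby
# Rough estimate of requests needed for ONE subrelation type (e.g. Notes).
# PAGE_SIZE is an assumption; adjust it to the source instance's real limit.
PAGE_SIZE = 100

# Requests needed to read `count` records belonging to one parent.
# An empty collection still costs one request.
def pages_for(count, page_size = PAGE_SIZE)
  [(count + page_size - 1) / page_size, 1].max
end

# Assumed distribution: `heavy_parents` epics hold `records_per_heavy_parent`
# records each, every other epic holds none.
def requests_for_subrelation(total_parents:, heavy_parents:, records_per_heavy_parent:)
  light_parents = total_parents - heavy_parents
  heavy_parents * pages_for(records_per_heavy_parent) + light_parents * pages_for(0)
end

notes_requests = requests_for_subrelation(
  total_parents: 5000,
  heavy_parents: 1000,
  records_per_heavy_parent: 5000
)
puts notes_requests
```

Under these assumptions the Notes subrelation alone needs tens of thousands of requests, and the same cost repeats for every other subrelation type.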
Level 3 and beyond (Example: Migrate Epic notes' award emojis)
Basically the same challenges as Level 2 above, but with deeper iteration/pagination.
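The deeper iteration can be sketched with an in-memory stand-in for the source instance. `FakeSource`, `each_record`, and the tiny page size are hypothetical names and values used only for illustration; the point is that every nesting level adds its own pagination loop, and every parent costs at least one request per level.

```ruby
# Hypothetical in-memory stand-in for the source instance. It counts requests.
class FakeSource
  attr_reader :request_count

  # tree: { epic_id => { note_id => [award_emoji, ...] } }
  def initialize(tree, page_size: 2)
    @tree = tree
    @page_size = page_size
    @request_count = 0
  end

  def epics(cursor)
    page(@tree.keys, cursor)
  end

  def notes(epic_id, cursor)
    page(@tree.fetch(epic_id).keys, cursor)
  end

  def award_emoji(epic_id, note_id, cursor)
    page(@tree.fetch(epic_id).fetch(note_id), cursor)
  end

  private

  # Returns [records, next_cursor]; next_cursor is nil on the last page.
  def page(records, cursor)
    @request_count += 1
    start = cursor || 0
    slice = records[start, @page_size] || []
    following = start + @page_size
    [slice, following < records.size ? following : nil]
  end
end

# Follows cursors until the collection is exhausted, yielding each record.
def each_record(fetch_page)
  cursor = nil
  loop do
    records, cursor = fetch_page.call(cursor)
    records.each { |record| yield record }
    break if cursor.nil?
  end
end

source = FakeSource.new({
  1 => { 10 => %w[thumbsup heart tada], 11 => [] },
  2 => { 20 => %w[rocket] },
  3 => {}
})

emojis = []
each_record(->(c) { source.epics(c) }) do |epic_id|             # Level 1
  each_record(->(c) { source.notes(epic_id, c) }) do |note_id|  # Level 2
    each_record(->(c) { source.award_emoji(epic_id, note_id, c) }) do |emoji| # Level 3
      emojis << emoji
    end
  end
end

puts "#{emojis.size} award emojis fetched in #{source.request_count} requests"
```

Even in this toy tree, most requests are spent on parents that have little or nothing to return.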
What was/is being done
- To improve performance and enable partial retries of the migration, we're investing a bit in breaking the migration into concurrent jobs (&5544). This won't reduce the number of web requests, but it can help with rate limits and partial retries.
- There's an investigation/PoC about reducing the number of web requests (!57754 (closed)). However, this wouldn't solve the Level 3 complexity.
Brainstormed solutions within the ~"group::import" BE
- (!57754 (closed)) Fetch subrelations within the relation and fetch subsequent pages only when required
  - Pros
    - Improves performance
  - Cons
    - Does not improve subrelations of subrelations (Level 3)
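A minimal sketch of the "fetch subsequent pages only when required" idea, assuming the Epics query returns each epic with the first page of its notes embedded plus a cursor that is `nil` when nothing is left. `EpicPage`, `collect_notes`, and the `fetch_more` callable are hypothetical names, not part of the PoC.

```ruby
# Hypothetical shape: an epic with the first page of its notes already embedded.
# notes_cursor is nil when the embedded page was the only page.
EpicPage = Struct.new(:id, :notes, :notes_cursor, keyword_init: true)

# fetch_more is a callable: (epic_id, cursor) -> [notes, next_cursor].
# It is only invoked for epics whose embedded page was incomplete.
def collect_notes(epics, fetch_more)
  extra_requests = 0
  notes_by_epic = epics.to_h do |epic|
    notes = epic.notes.dup
    cursor = epic.notes_cursor
    while cursor
      more, cursor = fetch_more.call(epic.id, cursor)
      extra_requests += 1
      notes.concat(more)
    end
    [epic.id, notes]
  end
  [notes_by_epic, extra_requests]
end

# Stubbed follow-up pages standing in for the source instance.
remaining = {
  [2, "c1"] => [%w[n3 n4], "c2"],
  [2, "c2"] => [%w[n5], nil]
}
fetch_more = ->(epic_id, cursor) { remaining.fetch([epic_id, cursor]) }

epics = [
  EpicPage.new(id: 1, notes: %w[n1], notes_cursor: nil),
  EpicPage.new(id: 2, notes: %w[n1 n2], notes_cursor: "c1")
]

notes_by_epic, extra_requests = collect_notes(epics, fetch_more)
puts "#{extra_requests} extra requests"
```

Epics whose notes fit in the embedded page cost no extra request, which is where the saving comes from. The same trick does not extend cleanly to Level 3, since each embedded note would need its own embedded page of award emojis.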
- (!58404 (closed)) NDJSON Files by relation
sequenceDiagram
User->>+Destination: Import From Source
Destination-->>User: OK
Destination->>+Source: Generate epics.ndjson
note over Destination, Source: Requests the epics.ndjson file <br> and passes the callback information <br> where Source can send the file <br> when it's done
Source-->>-Destination: epics.ndjson
Destination-->>-User: Done
  - Pros
    - We're already using NDJSON on Project Import/Export
    - Reduced number of web requests
    - Can generate files in batches to avoid huge files
  - Cons
    - I'm not sure about security concerns
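On the destination side, consuming such a file could look like the sketch below. The field names (`notes`, `award_emoji`) are assumptions for illustration and do not reflect the real export schema; the point is that each line carries one relation with its subrelations already nested, so no follow-up requests are needed.

```ruby
require "json"
require "stringio"

# Yields one parsed relation per non-blank line of an NDJSON stream.
# Reading line by line keeps memory usage flat even for large files.
def each_ndjson_record(io)
  return enum_for(:each_ndjson_record, io) unless block_given?

  io.each_line do |line|
    line = line.strip
    next if line.empty?

    yield JSON.parse(line)
  end
end

# Hypothetical epics.ndjson content; the nested keys are illustrative only.
sample = StringIO.new(<<~NDJSON)
  {"title":"Epic 1","notes":[{"note":"first","award_emoji":[{"name":"thumbsup"}]}]}
  {"title":"Epic 2","notes":[]}
NDJSON

epics = each_ndjson_record(sample).to_a
puts epics.map { |epic| epic["title"] }.inspect
```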
- REST API: Create a REST API that includes subrelations
  /api/v4/projects/gitlab/export with { relation: 'issues', include: [:labels, :notes, :events], per_page: 500, page: 1 }
  - Pros
    - Can be done just for the deeply nested resources
    - Well-known technology
  - Cons
    - Potential request timeouts
- Change GraphQL to enable us to query subrelations directly from the top level:
query groupEpicAwardEmoji($full_path: ID!) {
  group(fullPath: $full_path) {
    awardEmoji(awardable: "Epic") {
      page_info: pageInfo {
        next_page: endCursor
        has_next_page: hasNextPage
      }
      nodes {
        awardableId
        title
      }
    }
  }
}
  - Pros
    - Would be the same approach that's already being used
  - Cons
    - Not sure if it would solve the Level 3 queries
- Change GraphQL to list multiple types in the same Query (!58819 (closed))
- Based on Alex's suggestion (#326757 (comment 543355516))
query exportable($full_path: ID!, $per_page: Int, $cursor: String) {
  group(fullPath: $full_path) {
    id
    exportableEntities(first: $per_page, after: $cursor, types: [Epic, Label]) {
      page_info: pageInfo {
        next_page: endCursor
        has_next_page: hasNextPage
      }
      nodes {
        __typename
        ... on Label {
          title
          color
        }
        ... on Epic {
          title
          description
        }
      }
    }
  }
}
{
  "data": {
    "group": {
      "id": "gid://gitlab/Group/110",
      "exportableEntities": {
        "page_info": {
          "next_page": "MTA",
          "has_next_page": false
        },
        "nodes": [
          {
            "__typename": "Epic",
            "title": "g1-CHILDREN99",
            "description": "Desc g1-CHILDREN99"
          },
          {
            "__typename": "Epic",
            "title": "g1-CHILDREN98",
            "description": "Desc g1-CHILDREN98"
          },
          {
            "__typename": "Label",
            "title": "label::100",
            "color": "#8f7649"
          },
          {
            "__typename": "Label",
            "title": "label::101",
            "color": "#44b2d8"
          }
        ]
      }
    }
  }
}
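A response of this shape mixes several types in one page, so the destination would have to dispatch on `__typename`. The sketch below assumes the aliased field names from the query above (`page_info`, `next_page`, `has_next_page`); `IMPORTERS` and `import_page` are hypothetical names, and the lambdas stand in for real import logic.

```ruby
require "json"

# One hypothetical importer per GraphQL type; real ones would persist records.
IMPORTERS = {
  "Epic"  => ->(node) { "epic:#{node['title']}" },
  "Label" => ->(node) { "label:#{node['title']} (#{node['color']})" }
}.freeze

# Imports every node of one page and returns [results, next_cursor].
# next_cursor is nil when the page reports that no further page exists.
def import_page(response, importers = IMPORTERS)
  entities = response.dig("data", "group", "exportableEntities")
  imported = entities.fetch("nodes").map do |node|
    type = node.fetch("__typename")
    importer = importers.fetch(type) { raise ArgumentError, "no importer for #{type}" }
    importer.call(node)
  end
  page_info = entities.fetch("page_info")
  next_cursor = page_info["has_next_page"] ? page_info["next_page"] : nil
  [imported, next_cursor]
end

# Trimmed copy of the example response above.
body = <<~BODY
  {
    "data": { "group": { "id": "gid://gitlab/Group/110", "exportableEntities": {
      "page_info": { "next_page": "MTA", "has_next_page": false },
      "nodes": [
        { "__typename": "Epic", "title": "g1-CHILDREN99", "description": "Desc g1-CHILDREN99" },
        { "__typename": "Label", "title": "label::100", "color": "#8f7649" }
      ]
    } } }
  }
BODY

imported, next_cursor = import_page(JSON.parse(body))
puts imported.inspect
```

Raising on an unknown `__typename` is a deliberate choice in this sketch: silently skipping a type would hide data loss during a migration.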
- ❓ Any other ideas?