
Draft: Speed up subrelation import during Bulk Import

What does this MR do?

The subrelations we currently import are epic events and epic award emoji. I ran a test import of 6k epics (a similar number of epics to the gitlab group) and the overall import was extremely slow: it still had not finished after almost 2 hours, so I killed the process.

This is because we make 1 network request per epic when importing subrelations, iterating through them one by one. This is very time consuming, and as the number of subrelations we import grows it will only get slower.

To import 6000 epics we make 6000 / 500 = 12 network requests.

To import their award emoji we make 6000 network requests (1 per epic).

To import epic events we make another 6000 network requests.

That is 12 + 6000 + 6000 = 12012 requests in total for a single group. Because of this, we need to find a better way to import subrelations.

Up until this point we have been trying to avoid dealing with nested paginated subrelations, but perhaps it's time we address it for epics, and for any future relation that has subrelations as well.

This MR is a draft of what it might look like. To handle nested pagination with GraphQL we need a way to chain 2 pipelines together, and the overall idea is this:

  1. Go through all epics (500 per page) and import the first page of award emoji for each (currently set to 20, as GraphQL throws a query complexity error if I set it any higher). Page sizes are subject to change, as we might have to strike a better balance here.
  2. Capture and track the epics that have a next page of award emoji in the Context. Each entry is an `[epic_iid, end_cursor]` pair:

```ruby
context.extra[:subrelation_next_pages]
# => [["1", "eyJpZCI6Ijc5NDUifQ"], ["2", "eyJpZCI6Ijc5NDUifQ"], ["3", "eyJpZCI6Ijc5NDUifQ"]]
```
  3. After the current pipeline is complete, start a new one, passing in the tracked epics that require further import of award emoji.
  4. For the new pipeline, compose a query that includes multiple epics at once and sets the corresponding end cursor for each of them. An example query looks like this:
```graphql
{
  group(fullPath: "emoji") {
    epic_15: epic(iid: 15) {
      iid
      awardEmoji(first: 50, after: "eyJpZCI6Ijc5NDUifQ") {
        pageInfo {
          has_next_page: hasNextPage
          next_page: endCursor
        }
        nodes {
          name
          user {
            public_email: publicEmail
          }
        }
      }
    }
    epic_14: epic(iid: 14) {
      iid
      awardEmoji(first: 50, after: "eyJpZCI6IjczNjEifQ") {
        pageInfo {
          has_next_page: hasNextPage
          next_page: endCursor
        }
        nodes {
          name
          user {
            public_email: publicEmail
          }
        }
      }
    }
  ...
}
```

This way we can include more than one epic at a time, speeding things up significantly. We are, however, still constrained by GraphQL query complexity, which depends on the type of data being queried (afaik). The current example fetches 15 epics at a time with 50 award emoji per page; anything above that fails with a query complexity error.
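As a rough illustration of how such a batched query could be composed from the tracked pairs, here is a minimal Ruby sketch. The constants and the `build_batched_query` helper are hypothetical names for this draft, not the actual implementation:

```ruby
# Hypothetical helper: compose one GraphQL query covering a batch of epics,
# each resuming its awardEmoji connection from its tracked end cursor.
EPICS_PER_QUERY = 15 # assumed batch size that stays under the complexity limit
EMOJI_PER_PAGE = 50  # assumed follow-up page size

def build_batched_query(full_path, next_pages)
  # next_pages is an array of [epic_iid, end_cursor] pairs
  epic_fields = next_pages.map do |iid, cursor|
    <<~FIELDS
      epic_#{iid}: epic(iid: #{iid}) {
        iid
        awardEmoji(first: #{EMOJI_PER_PAGE}, after: "#{cursor}") {
          pageInfo {
            has_next_page: hasNextPage
            next_page: endCursor
          }
          nodes {
            name
            user { public_email: publicEmail }
          }
        }
      }
    FIELDS
  end.join

  <<~QUERY
    {
      group(fullPath: "#{full_path}") {
        #{epic_fields}
      }
    }
  QUERY
end
```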

  5. If an epic has another page to process, add it back to the end of the array.
  6. Process all epics this way, 15 at a time, reducing the dataset until no epics are left to process (see the sketch below).
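A sketch of that drain loop under the same assumptions (`execute` and `import_award_emoji` are stand-ins for the real GraphQL client call and persistence step):

```ruby
# Hypothetical drain loop: take a batch of epics per request, import the
# returned emoji page, and re-queue any epic that still has a next page.
def drain_next_pages(full_path, next_pages)
  until next_pages.empty?
    batch = next_pages.shift(EPICS_PER_QUERY)
    response = execute(build_batched_query(full_path, batch)) # assumed client call

    response.dig('data', 'group').each_value do |epic|
      import_award_emoji(epic['awardEmoji']['nodes']) # assumed persistence step

      page_info = epic['awardEmoji']['pageInfo']
      next_pages << [epic['iid'], page_info['next_page']] if page_info['has_next_page']
    end
  end
end
```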

Here are the test results before and after the change.

Data set: 1 group with 6000 epics and 7500 award emoji scattered across those epics, skipping all other pipelines (members, subgroups, epic events, etc.) so that only the group, epics, and epic award emoji pipelines run.

Before the change: 1h+ and it did not finish. Something went wrong around epic ~2000 and I did not see any logs past that point, so I had to kill the process.

After the change: 10 minutes.

This is just one idea for improving the situation. As we add more subrelations, the current approach can only get slower and slower over time.

There are a few things that I don't currently like in the implementation:

  1. The new pipeline operates on a collection instead of 1 item at a time. This is not the first time this has come up, and perhaps we just need to make our framework accept such data (it already does).
  2. The new pipeline is started in an on_complete method instead of being listed explicitly in the Group Importer (see the sketch after this list). There are ways to avoid that, for example by storing the 'next_pages' info in Redis, but at the same time the new pipeline is part of the overall award emoji import for epics, so I think they belong together. Maybe we need a more flexible way to pipe pipelines together.
  3. Fine-tuning the queries so that GraphQL does not complain about query complexity was a tedious process. I would love to process a bigger batch, but I'm not sure we can.
  4. The GraphQL query can be optimized for our needs. Instead of fetching all epics and filtering out those that have no award emoji, we could add a new argument (e.g. hasAwardEmoji: true) to only return epics that do have award emoji. This, however, would require touching EpicsFinder, which is quite complex, so I decided not to do it.
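For reference, the chaining mentioned in point 2 could look roughly like the following; the hook mirrors the on_complete method from this draft, while the pipeline class name is made up for illustration:

```ruby
# Hypothetical on_complete hook on the first award emoji pipeline: if any
# epics still have pages of award emoji left, run a follow-up pipeline on them.
def on_complete(context)
  next_pages = context.extra[:subrelation_next_pages]
  return if next_pages.blank?

  EpicAwardEmojiNextPagePipeline.new(context).run # illustrative class name
end
```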

Finally, there is a lot of duplication between the pipelines, but this is just a draft that we can improve and iterate on.

Related to #326218 (closed)

Screenshots (strongly suggested)

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Security

If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:

  • Label as security and @ mention @gitlab-com/gl-security/appsec
  • The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
  • Security reports checked/validated by a reviewer from the AppSec team