The GitLab API makes it hard to run bulk queries or keep an up-to-date copy of the data

(We're an EE customer on 10.3.6-ee gitlab-ee@45d0f746bffbb52916c1d1069fe781e71e3fe1bb and, in coordination with @teemo, are filing some more general issues we've encountered with GitLab in addition to our more specific ones. These are not specific "this thing is broken" issues, but more general architecture issues with potentially open-ended solutions.)

We have a use-case for extracting info from the API in bulk, e.g. all projects and their deploy keys, or all users and their settings.

The structure of the API is inherently unfriendly to this use-case. You usually need to:

  1. Loop over e.g. /users or /projects one page at a time, N records per page (the hardcoded maximum seems to be 100).
  2. While doing step 1 you may get duplicate data (or missing data?), because you're paginating over a dataset that is not ordered oldest-first, so records inserted between requests shift the page boundaries (this does happen in our scripts).
  3. Once step 1 is done, you need N more API calls to get sub-information for each record, e.g. one call per user to fetch that user's SSH keys.
  4. You often get a huge amount of data back (e.g. from /projects) when all you need is one field.

So just to answer the question "what are the users or projects on GitLab?" you need a loop plus a state machine to de-duplicate data. If you then want e.g. a mapping from user -> ssh_key, you need a further loop over all N users.
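To make the pattern concrete, here is a minimal sketch of the loop and de-duplication state described above. It does not talk to a real GitLab instance: `FakeServer`, `iter_all` and `keys_by_user` are hypothetical names invented for illustration, and the fake server simply assumes a newest-first ordering with one user created while the crawl is in progress.

```python
from typing import Callable, Dict, Iterator, List

Page = List[dict]


def iter_all(fetch_page: Callable[[int, int], Page], per_page: int = 100) -> Iterator[dict]:
    """Walk pages until an empty one comes back, yielding each id only once."""
    seen = set()
    page = 1
    while True:
        batch = fetch_page(page, per_page)
        if not batch:
            break
        for record in batch:
            if record["id"] not in seen:  # drop records repeated across pages
                seen.add(record["id"])
                yield record
        page += 1


def keys_by_user(fetch_users, fetch_keys, per_page: int = 100) -> Dict[str, List[str]]:
    """Build user -> keys with one extra call per user (step 3 above)."""
    return {u["username"]: fetch_keys(u["id"]) for u in iter_all(fetch_users, per_page)}


class FakeServer:
    """Hypothetical in-memory stand-in for the API, listing newest users first."""

    def __init__(self) -> None:
        self.users = [{"id": i, "username": f"user{i}"} for i in range(1, 5)]
        self.key_calls = 0
        self._inserted = False

    def fetch_users(self, page: int, per_page: int) -> Page:
        ordered = sorted(self.users, key=lambda u: u["id"], reverse=True)
        start = (page - 1) * per_page
        batch = ordered[start:start + per_page]
        if page == 1 and not self._inserted:
            # Assumption for the demo: a user signs up right after page 1 is served.
            self.users.append({"id": 5, "username": "user5"})
            self._inserted = True
        return batch

    def fetch_keys(self, user_id: int) -> List[str]:
        self.key_calls += 1
        return [f"ssh-key-of-user{user_id}"]


if __name__ == "__main__":
    server = FakeServer()
    print(keys_by_user(server.fetch_users, server.fetch_keys, per_page=2))
    print("extra calls for keys:", server.key_calls)
```

In this simulation, with `per_page=2`, the raw pages contain user3 twice and never contain user5, so the client-side de-duplication hides the repeat but cannot recover the record that was inserted mid-crawl.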

Proposed improvements (not in the same order as above):

  1. All APIs that return more than one element should have some since=* parameter that returns everything changed since a given point; this would be a SELECT on an updated_at field in the relevant table.
  2. Building on the since=* parameter, there should be some API to ask what changed since a given point across various record types, i.e. "have any projects/users etc. changed since timestamp?".
  3. All APIs that do pagination should do so in such a way that the pagination order corresponds to the insertion order into the table (i.e. usually return things in primary key order), so that API consumers don't need to deal with cases where the same record is returned twice because something got inserted between requests.
  4. You should be able to request just a list of fields to be returned, e.g. path_with_namespace for projects. Currently you get a firehose of data (requiring it to be split into more pages with per_page, and surely more computation on GitLab's side) when all you need is one or two fields.
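As a rough illustration of how proposals 1, 3 and 4 could combine, here is a sketch of a list endpoint modelled as a plain function over in-memory records. Everything in it is an assumption made for illustration: `list_records` is a hypothetical name, `since` and `fields` are the proposed parameters rather than existing ones, and "since" is interpreted as "updated strictly after the given timestamp".

```python
from datetime import datetime
from typing import Iterable, List, Optional


def list_records(
    records: Iterable[dict],
    *,
    since: Optional[datetime] = None,
    fields: Optional[List[str]] = None,
    page: int = 1,
    per_page: int = 100,
) -> List[dict]:
    """Hypothetical list endpoint: id-ordered pages, optional since/fields."""
    # Proposal 3: primary key order, so new rows only ever land on the last page.
    rows = sorted(records, key=lambda r: r["id"])
    # Proposal 1: filter on updated_at (assumed here to mean strictly after).
    if since is not None:
        rows = [r for r in rows if r["updated_at"] > since]
    start = (page - 1) * per_page
    rows = rows[start:start + per_page]
    # Proposal 4: return only the requested fields.
    if fields is not None:
        rows = [{k: r[k] for k in fields if k in r} for r in rows]
    return rows


if __name__ == "__main__":
    projects = [
        {"id": 1, "path_with_namespace": "a/one", "updated_at": datetime(2018, 1, 1)},
        {"id": 2, "path_with_namespace": "a/two", "updated_at": datetime(2018, 1, 5)},
        {"id": 3, "path_with_namespace": "b/three", "updated_at": datetime(2018, 1, 3)},
    ]
    print(list_records(projects, fields=["path_with_namespace"]))
    print(list_records(projects, since=datetime(2018, 1, 2), fields=["id"]))
```

Because pages are cut from an id-ordered list, a record inserted between two requests only extends the final page instead of shifting earlier records onto a page the consumer has already read. Deletions could still shift offsets, which this sketch does not address.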

Customers

https://gitlab.my.salesforce.com/0016100001CXro6

Edited by Patrick Harlan