Database race condition when uploading maven packages

Summary

When using official maven images tagged with eclipse-temurin(e.g. maven:3.9.3-eclipse-temurin-11) it is possible to encounter a race condition when uploading packages, ultimately resulting in the following output within the job and a {"message":"Validation failed: Name has already been taken"} error within our logs:

Could not transfer artifact calebw:maven-test2:pom:0.59.40 from/to gitlab-maven (https://gitlab.com/api/v4/projects/12345678/packages/maven): status code: 400, reason phrase: Bad Request (400)

This seems to only occur when both of the following conditions are true:

An official maven image tagged with eclipse-temurin is used
The flag -DdeployAtEnd=true is passed as a Maven argument, which configures all the packages to be pushed at the end.

After speaking with @10io, it seems like this is occurring due to how these specific images send the requests when the above conditions are met. It looks like we receive two authorize requests sequentially to the /authorize endpoint (one for the pom, one for the jar) before a request to upload either one of these files is made.

Steps to reproduce

Fork this project and run a pipeline. This project utilizes predefined CI variables and should need no configuration in order to build and attempt to publish packages to it's relevant package registry.

This project has a parent/child pom configuration that publishes multiple packages. You will see the job fail with the above error on a random package, and it will fail when attempting to publish either the POM or JAR file. Removing the -DdeployAtEnd=true flag from the .gitlab-ci.ym or changing the image to maven:3.9.3 will allow the job to succeed.

Relevant logs can be found in Kibana after a failed attempt is made by searching on the following: json.path: /api/v4/projects/PROJECT_ID/packages/maven/NAMESPACE/maven-test/*.

Example Project

https://gitlab.com/calebw/mvn-400-bad-request

What is the current bug behavior?

Deploy attempts will fail with a 400 bad request output in the job, and a {"message":"Validation failed: Name has already been taken"} error on the backend.

What is the expected correct behavior?

All packages are deployed without issue.

Relevant logs and/or screenshots

Output of checks

This bug happens on GitLab.com

Possible fixes

Implement support for parallel uploads. This could be challenging but we could have some leads:

Use upsert this way, the database itself will take care of the race condition. The challenge here is that we need to have the proper unique constraints in place. Use an exclusive lease on the backend so that second threads wait for the first one to finish. Not ideal but could work.

Alternatives

From #424238 (comment 1730266533):

Use a dummy exclusive lease to detect and log those race conditions. This would confirm our assumptions.
Investigate how to solve this. Some leads:
- Use a specific response code so that clients "try again or at a later time". I doubt that this is possible.
- Use an exclusive lease with 1 or 2 retries.
- Use a database upsert. If I recall correctly, this is not possible due to the requirement of a UNIQUE constraint that we can't have.
- Use the alternative to upsert: check existence and rescue
  - This one could provide a simple way to solve the issue at hand.

Possible solutions

Use upsert.
- This is the ideal solution. However, I recall seeing constraints on UNIQUE database indexes that were not compatible with the packages_packages table. I could be wrong though.
If the previous solution is not applicable and the amount of race condition situations is very low, we can use check existence and rescue.

Implementation plan

Summarized from this comment:

Investigate which one of the solutions would be applicable here:
- Solution (1.) should be possible but we might need to be cautious because packages can be marked for destruction (status set to pending_destruction) and these should not impact (up)inserting a new row. We might need a new UNIQUE index on maven packages with status set to default.
- Solution (2.) is possible. Here are all uploads in the last 24 hours and here are those that failed with 400 Bad Request (which by the way, can be triggered by other root causes than race conditions). At the time of this writing, they account for 0.007 % of all uploads. That's "low" enough to use the "check existence and rescue" solution.
Implement the chosen solution with a feature flag .

Follow-up issues:

Rollout the feature flag (Needs to be created).
Clean up the feature flag (Needs to be created).

Edited Oct 22, 2024 by Jaime Martinez