KR Database Team - Analyze reverted migrations over the past N months, identify common themes and solutions => 80%
This came from the group database discussion on OKRs - gitlab-com/www-gitlab-com#8300 (comment 387093420)
We should analyze why migrations are reverted and see whether there are common themes we can mitigate. The first step could be an objective like:
- Analyze the reverted migrations over the past N months
- Key result - Identify common themes in reverted migrations
- Key result - Create issues and implement solutions to mitigate common migration failures
- Key result - Measure the migration failure rate to show improvements
List of incidents
- Production - !42277 (merged)
- Staging - gitlab-com/gl-infra/production#2439 (closed)
  - User Permissions - Staging and production deployments were blocked because a migration could not enable a database extension (a sketch of a possible guard follows this incident)
  - I believe this is in response to:
    - Print error message when attempting to create database extension as non-superuser - #243729 (closed) => Andreas: This is targeted at installations with externally managed Postgres clusters which didn’t manually create the required extension before upgrading. On GitLab.com, it would have led to the same state - we’d have seen an error on deploy too, but a more verbose one.
  - @abrandl: Shortcoming of not using the Omnibus-deployed database (which would have set things up automatically)
  - @abrandl: This was caught in staging and never reached production before it was fixed. Marking with staging.
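As an illustration of this failure mode (a hedged sketch, not the actual fix shipped in #243729), a migration that needs a Postgres extension could check for superuser rights first and raise a descriptive error instead of an opaque permission failure. The class name, extension and wording below are placeholders:

```ruby
# frozen_string_literal: true

# Hypothetical migration sketch: enable a Postgres extension, but fail with a
# clear, actionable message when the database user lacks superuser rights.
class EnableBtreeGistExtension < ActiveRecord::Migration[6.0]
  EXTENSION = 'btree_gist'

  def up
    return if extension_enabled?(EXTENSION)

    unless superuser?
      raise <<~MSG
        This migration needs to enable the '#{EXTENSION}' extension, but the
        current database user is not a superuser. Either run
        `CREATE EXTENSION IF NOT EXISTS #{EXTENSION};` as a superuser first,
        or grant the required privileges, then re-run the migration.
      MSG
    end

    enable_extension(EXTENSION)
  end

  def down
    # Leaving the extension in place on rollback is usually safe.
  end

  private

  # `usesuper` in pg_user reports whether the connected role is a superuser.
  def superuser?
    connection.select_value(
      "SELECT usesuper FROM pg_user WHERE usename = CURRENT_USER"
    )
  end
end
```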
- Production - gitlab-com/gl-infra/production#2589 (closed)
  - Second unique identifier caused the problem - gitlab-com/gl-infra/production#2589 (comment 403856863)
  - @abrandl: Ability to test the migration on production data would have surfaced the migration problem
  - @abrandl: This looks like a bug in code rather than a systematic problem with migrations
- Staging - gitlab-com/gl-infra/production#2539 (closed)
  - This is a precursor to the one above. It failed in staging; why did the same file make it to production? (A sketch of a pre-deployment data check follows this incident.)
  - @abrandl: Ability to test the migration on production data would have surfaced the migration problem
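The exact cause in #2589/#2539 isn’t spelled out in these notes. Assuming, purely for illustration, that the migration added a unique index that production data violated, a pre-flight check like the following could be run against a production-like copy (for example a Database Lab thin clone) before the MR merges. Table, column and environment variable names are hypothetical:

```ruby
# frozen_string_literal: true

# Hypothetical pre-flight check, run against production-like data (e.g. a
# Database Lab thin clone), before merging a migration that adds a unique
# index. Table and column names below are placeholders.
require "active_record"

ActiveRecord::Base.establish_connection(ENV.fetch("CLONE_DATABASE_URL"))

duplicate_count = ActiveRecord::Base.connection.select_value(<<~SQL)
  SELECT COUNT(*)
  FROM (
    SELECT external_id          -- column the new unique index would cover
    FROM integrations           -- table the migration touches
    GROUP BY external_id
    HAVING COUNT(*) > 1
  ) AS duplicates
SQL

if duplicate_count.to_i.positive?
  abort "#{duplicate_count} duplicated value(s) would break the unique index; " \
        "the migration needs a data cleanup step first."
else
  puts "No duplicates found - the unique index should apply cleanly."
end
```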
- Staging - gitlab-com/gl-infra/production#2514 (closed) - Staging migrations take over two hours to complete
  - Related to the logical replication issue - gitlab-com/gl-infra/production#2514 (comment 395074719)
  - @abrandl: This has been an issue because the staging Postgres cluster was being used to test setting up logical replication (there are prerequisites for logical replication to work, which we don’t fulfill). Using a separate cluster for testing logical replication would have prevented the conflict with the staging environment and deployments.
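For context on the "prerequisites" mentioned above: PostgreSQL logical replication needs `wal_level = logical` plus spare replication slots and WAL sender processes on the publisher. A rough sketch of checking those settings before attempting such a test on a shared cluster (illustrative only, not the runbook that was used):

```ruby
# frozen_string_literal: true

# Illustrative sketch: verify the Postgres settings that logical replication
# depends on before trying to set it up. Connection details come from the
# environment; thresholds below are the bare minimum, not tuned values.
require "active_record"

ActiveRecord::Base.establish_connection(ENV.fetch("DATABASE_URL"))
conn = ActiveRecord::Base.connection

checks = {
  # WAL must carry enough information for logical decoding.
  "wal_level"             => ->(v) { v == "logical" },
  # The publisher needs at least one free replication slot and WAL sender.
  "max_replication_slots" => ->(v) { v.to_i >= 1 },
  "max_wal_senders"       => ->(v) { v.to_i >= 1 }
}

failures = checks.reject do |setting, ok|
  ok.call(conn.select_value("SHOW #{setting}"))
end

if failures.empty?
  puts "Basic logical replication prerequisites look satisfied."
else
  abort "Not ready for logical replication, check: #{failures.keys.join(', ')}"
end
```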
- Production - gitlab-com/gl-infra/production#2672 (closed)
  - Error - unsupported escape sequence
  - Revert and test on database labs
    - Why was this not done in the first place?
  - Andreas: This is a bug in the code for the background migration. #databaselabs only lets you test individual database queries, but not the corresponding Ruby code.
  - @abrandl: Why didn’t this show in staging? Thread => Staging didn’t have the same data; only a few records were problematic
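Since #databaselabs can only exercise the SQL and not the Ruby side of a background migration, one possible mitigation (a sketch under the assumption that a handful of records carried values PostgreSQL rejects, such as `\u0000` escape sequences) is to make the background migration defensive: process rows in batches, rescue per-row failures, and log the offending ids instead of aborting the whole job. Class, table and column names are illustrative, and the code assumes it runs inside the Rails application, as background migrations do:

```ruby
# frozen_string_literal: true

# Hypothetical sketch of a defensive background migration: isolate per-row
# failures so a few bad records cannot fail the whole batch.
class SanitizeNoteText
  class Note < ActiveRecord::Base
    self.table_name = 'notes'
  end

  def perform(start_id, end_id)
    Note.where(id: start_id..end_id).find_each do |note|
      begin
        backfill!(note)
      rescue ActiveRecord::StatementInvalid => e
        # Record the offending row and keep going; these ids can be handled
        # in a follow-up cleanup migration.
        Rails.logger.warn("Backfill skipped note #{note.id}: #{e.message}")
      end
    end
  end

  private

  def backfill!(note)
    # Strip characters PostgreSQL rejects (e.g. null bytes) before writing
    # into the hypothetical sanitized_note column.
    note.update_column(:sanitized_note, note.note.to_s.delete("\u0000"))
  end
end
```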
Scenarios
Development team successfully tests a migration in lower environments, but it fails in production
- We need a production equivalent database to test on
Opinion - Database Team owns the migrations
- Scaling problems - only 3 developers on the DB team responsible for all migrations
- This implies that the migration design is approved before shipping to production
- @abrandl: We currently perform a thorough database review, so a database maintainer checks migrations before they are merged. Developers own their migrations, just as they own performance and bugs.
Failing in staging is ok
- Does this require Infra support to roll back?
- If so, can we fix that?
How can the database team help
- Work toward providing better means to test against production-type data => enabling development teams to verify migrations and ship more robust migration code
- Provide/work on a different concept for rolling migrations back (see the reversible-migration sketch after this list)
- Education
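The rollback point above is only a stub in these notes. As one possible reading (an assumption, not a decision), migrations written with explicit `up`/`down` steps can be reverted cleanly without manual clean-up or Infra intervention. A minimal sketch with placeholder table and index names:

```ruby
# frozen_string_literal: true

# Illustrative example of a migration that is explicitly reversible, so a
# revert does not need manual intervention. Names are placeholders.
class AddIndexOnIssuesClosedAt < ActiveRecord::Migration[6.0]
  # Concurrent index creation cannot run inside a transaction.
  disable_ddl_transaction!

  INDEX_NAME = 'index_issues_on_closed_at'

  def up
    add_index :issues, :closed_at, name: INDEX_NAME, algorithm: :concurrently
  end

  def down
    remove_index :issues, name: INDEX_NAME, algorithm: :concurrently
  end
end
```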
Possible solutions
- (Promising) Bring staging closer to production https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10756
- (Likely no-go) Postgres.ai feature to test migrations - practically difficult to set up, and cost is prohibitive
- (Feasible?) Add DB migrations to the end-to-end testing pipeline if they are not already covered
- (Feasible?) Require developers to test DB migrations locally when their MRs touch the DB schema
  - @iroussos: This is already a requirement, but most migration issues relate to:
    - Scale → local dev environments are minimal. This is mitigated by using database-labs, but that only covers the querying part
    - Actual data → local dev environments have no real data, or no data at all, for most tables
    - Our deployment process → online updates and gradual deployment of code, canary deployments co-existing with old code in production, post-deployment migrations running at a later stage, etc.
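To make the deployment-process point concrete (an illustrative pattern, not a specific GitLab change): while canary and old code run side by side, the application has to tolerate both the old and the new schema. One common two-step approach is to make the model ignore a column first, then drop the column in a later post-deployment migration:

```ruby
# frozen_string_literal: true

# Step 1 - application change, deployed everywhere first, so old and new code
# both stop reading/writing the column before it disappears:
class Issue < ActiveRecord::Base
  self.ignored_columns = %w[legacy_state]
end

# Step 2 - post-deployment migration, run only after all nodes run the new code:
class RemoveLegacyStateFromIssues < ActiveRecord::Migration[6.0]
  def up
    remove_column :issues, :legacy_state
  end

  def down
    add_column :issues, :legacy_state, :integer
  end
end
```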