
KR Database Team - Analyze reverted migrations over the past N months, identify common themes and solutions => 80%

This came from the group::database discussion on OKRs - gitlab-com/www-gitlab-com#8300 (comment 387093420)

We should perform an analysis to understand why migrations are reverted to see if there are common themes that we can mitigate. So maybe the first step would be to have an objective like:

  • Analyze the reverted migrations over the past N months
    • Key result - Identify common themes in reverted migrations
    • Key result - Create issues and implement solutions to mitigate common migration failures
    • Key result - Measure the migration failure rate to show improvements
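The third key result needs a concrete definition of "failure rate". A minimal sketch, assuming each shipped migration is recorded with the month it shipped and whether it was later reverted. The `MigrationRecord` type, the `revert_rate_by_month` helper, and the sample data are hypothetical and not taken from any existing tooling:

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MigrationRecord:
    """Hypothetical record of one shipped migration."""
    month: str      # month the migration shipped, e.g. "2020-07"
    reverted: bool  # True if the migration was later reverted


def revert_rate_by_month(records):
    """Return {month: reverted / shipped} for every month with at least one migration."""
    shipped = defaultdict(int)
    reverted = defaultdict(int)
    for record in records:
        shipped[record.month] += 1
        if record.reverted:
            reverted[record.month] += 1
    return {month: reverted[month] / shipped[month] for month in sorted(shipped)}


# Illustrative data only, not real measurements.
sample = [
    MigrationRecord("2020-07", False),
    MigrationRecord("2020-07", True),
    MigrationRecord("2020-07", False),
    MigrationRecord("2020-07", False),
    MigrationRecord("2020-08", False),
    MigrationRecord("2020-08", False),
]
rates = revert_rate_by_month(sample)
```

Tracking the rate per month, rather than as a single total, is what would let the trend show whether mitigations are working.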

List of incidents

Scenarios

Development team successfully tests a migration in lower environments, but it fails in production

  • We need a production equivalent database to test on

Opinion - Database Team owns the migrations

  • Scaling problems - only 3 developers on the DB team responsible for all migrations
  • This implies that the migration design is approved before shipping to production
  • @abrandl: We currently perform thorough database review, so a database maintainer checks migrations before they are merged. Developers own their migrations, similar to performance and bugs.

Failing in staging is ok

  • Does this require Infra support to roll back?
  • If so, can we fix that?

How can database team help

  • Work toward providing better means to test against production-like data => Enabling development teams to verify migrations and ship more robust migration code.
  • Work on a different concept for rolling migrations back
  • Education

Possible solutions

  • (Promising) Bring staging closer to production https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10756
  • (Likely no-go) Postgres.ai feature to test migration - practically difficult to set up, and cost is prohibitive
  • (Feasible?) Add DB migrations to the end-to-end testing pipeline, if they are not already included
  • (Feasible?) Require developers to test DB migrations locally when their MRs touch the DB schema
    • @iroussos: This is already a requirement, but most migration issues relate to:
      • Scale → local dev environments are minimal. This is mitigated by using database-labs, but that’s only for the querying part
      • Actual data → local dev environments have no real data or no data at all for most tables.
      • Our deployment process → online updates and gradual deployment of code, canary deployments co-existing with old code in production, post-deployment migrations running at a later stage, etc.
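
The "MRs touch DB schema" condition in the list above could be detected automatically from an MR's changed files. A minimal sketch, assuming migrations live under `db/migrate/` and `db/post_migrate/` and the schema dump is `db/structure.sql`. These paths and the `touches_db_schema` helper are assumptions for illustration, not an existing check:

```python
from pathlib import PurePosixPath

# Assumed locations of schema-changing files; adjust to the repository layout.
MIGRATION_DIRS = (PurePosixPath("db/migrate"), PurePosixPath("db/post_migrate"))
SCHEMA_FILES = (PurePosixPath("db/structure.sql"),)


def touches_db_schema(changed_files):
    """Return True if any changed path is a migration file or the schema dump."""
    for name in changed_files:
        path = PurePosixPath(name)
        if path in SCHEMA_FILES:
            return True
        # path.parents lists every ancestor directory of the changed file.
        if any(directory in path.parents for directory in MIGRATION_DIRS):
            return True
    return False
```

As the comment from @iroussos notes, a check like this only identifies which MRs need local testing; it does not address scale, real data, or deployment-ordering problems.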
Edited by Craig Gomes