CI_JOB_TOKEN and ID_TOKENS invalidated on cancelled jobs

Summary

When CI_JOB_TOKEN or ID_TOKENS are used in the after_script section of a CI/CD job, they currently behave differently in the two scenarios listed in our docs:

after_script commands also run when:

The job is cancelled while the before_script or script sections are still running.

The job fails with failure type of script_failure, but not other failure types.

In the first case, if the token is used for Git operations (e.g. git clone), it raises a permission error in the after_script section.
In the second case, the same Git operations (e.g. git clone) still work in the after_script section.

It's also worth noting that a cancelled job creates a brand new container, whereas a failed job re-uses the existing container. This might warrant a separate issue, but it's likely related to how we handle the validity of the tokens.

# example error
git clone https://gitlab-ci-token:${CI_JOB_TOKEN}@gitlab.com/<group>/<project>.git
Cloning into '<project>'...
remote: HTTP Basic: Access denied. The provided password or token is incorrect or your account has 2FA enabled and you must use a personal access token instead of a password. See https://gitlab.com/help/topics/git/troubleshooting_git#error-on-git-fetch-http-basic-access-denied
fatal: Authentication failed for 'https://gitlab.com/<group>/<project>.git/'

Affected tokens

Steps to reproduce

Create a Dummy project A.
Create a project B.
Add project B to project A's allowlist.
Create the following .gitlab-ci.yml.

stages:          # List of stages for jobs, and their order of execution
  - build

non-cancelled-job:       # This job runs in the build stage, which runs first.
  stage: build
  variables:
    CI_DEBUG_TRACE: true
    TEST_VAR: '${CI_JOB_TOKEN}'
    GIT_STRATEGY: clone
  script:
    - echo "Running non-cancelled-job..."
    - git clone https://gitlab-ci-token:${CI_JOB_TOKEN}@gitlab.com/<groupA>/<project>.git
    - ls -la <project>
    - exit 1 # force the job to fail so that after_script runs
    - sleep 300
  after_script:
    - echo "Execute this command after the `script` section completes."
    - ls -la <project>
    - rm -rf <project>
    - git clone https://gitlab-ci-token:${CI_JOB_TOKEN}@gitlab.com/<groupA>/<project>.git

cancelled-job:       # This job runs in the build stage, which runs first.
  stage: build
  variables:
    CI_DEBUG_TRACE: true
    TEST_VAR: '${CI_JOB_TOKEN}'
    GIT_STRATEGY: clone
  script:
    - echo "Running cancelled-job..."
    - git clone https://gitlab-ci-token:${CI_JOB_TOKEN}@gitlab.com/<groupA>/<project>.git
    - ls -la <project>
    - sleep 300
  after_script:
    - echo "Execute this command after the `script` section completes."
    - ls -la <project>
    - rm -rf <project>
    - git clone https://gitlab-ci-token:${CI_JOB_TOKEN}@gitlab.com/<groupA>/<project>.git

Let the non-cancelled-job fail on its own.
After the cancelled-job starts the sleep command, cancel it.
Observe the permission error in the after_script of the cancelled-job.

Example Project

https://gitlab.com/kballon-bug-report/zd549518_ci_job_token_after_script/-/pipelines/1380128080

What is the current bug behavior?

Tokens encounter a permission error in the after_script of the cancelled job.

What is the expected correct behavior?

Tokens should not encounter a permission error on the cancelled job.

Relevant logs and/or screenshots

Output of checks

This bug happens on GitLab.com

Proposal

We should be able to fix this in Ci::AuthJobFinder, which uses validate_running_job!, by changing that code to accept a job that is either running or canceling. We can use the EXECUTING_STATUSES constant and check executing? instead of running?. The executing? method would need to be defined.

We should also check that the runner side is equipped to authenticate the job in after_script, since it runs in a separate shell. I think it should be, given Kent's description that it works on script failure. Note: currently we only allow running jobs, so this quote from the issue is surprising:

The job fails with failure type of script_failure, but not other failure types.
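
To make the proposed change concrete, here is a minimal, self-contained Ruby sketch of the difference between the current check and the proposed one. It is not the real GitLab code: the SketchCi module, the Job and AuthJobFinder stand-ins, NotRunningJobError, validate_executing_job!, and the token_accepted? helper are all hypothetical names for illustration, and the contents of EXECUTING_STATUSES are assumed to be running and canceling, based only on the description above.

```ruby
# Hypothetical stand-ins for illustration only; these are NOT the real
# GitLab classes, and the status lists are assumptions.
module SketchCi
  NotRunningJobError = Class.new(StandardError)

  class Job
    # Assumed contents: the proposal only says the constant lets us treat
    # a job as still executing while it is running or canceling.
    EXECUTING_STATUSES = %w[running canceling].freeze

    attr_reader :status

    def initialize(status)
      @status = status
    end

    def running?
      status == "running"
    end

    # The predicate the proposal says would need to be defined.
    def executing?
      EXECUTING_STATUSES.include?(status)
    end
  end

  class AuthJobFinder
    def initialize(job)
      @job = job
    end

    # Current behavior as described in the proposal: only running jobs pass.
    def validate_running_job!
      raise NotRunningJobError, "job is not running" unless @job.running?

      @job
    end

    # Proposed behavior (hypothetical method name): a job that is still
    # canceling also passes, so its token keeps working in after_script.
    def validate_executing_job!
      raise NotRunningJobError, "job is not executing" unless @job.executing?

      @job
    end
  end
end

# Returns true when the given check accepts a job in the given status.
def token_accepted?(status, check)
  finder = SketchCi::AuthJobFinder.new(SketchCi::Job.new(status))
  finder.public_send(check)
  true
rescue SketchCi::NotRunningJobError
  false
end

%w[running canceling canceled failed].each do |status|
  current  = token_accepted?(status, :validate_running_job!)
  proposed = token_accepted?(status, :validate_executing_job!)
  puts "#{status}: current=#{current} proposed=#{proposed}"
end
```

Under these assumptions, only the canceling status changes outcome: it is rejected by the running-only check and accepted by the executing check. The sketch rejects a failed job under both checks, so it does not explain why the script_failure case works today; that part still needs to be confirmed against the real code path.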

Edited Nov 04, 2024 by Allison Browne