Identifying GitLab runner failures using v4 API calls

March 22, 2024 - 4 mins read

Recently I was asked to look into a number of pipeline failures in GitLab CI. Specifically, we wanted to know how many jobs were failing because of runner system failures, i.e. not because the job script itself failed. Runner system failures cover a number of failure modes, including DNS resolution failures.

GitLab runners can be polled for failures programmatically through the API, which makes it possible to collect data for further reporting and metric aggregation. A simple script is enough, either in Python (using the python-gitlab package) or in shell using curl. I chose Python because it makes the returned data easier to manipulate directly.

Finding runners with failed jobs

We can find runner failures by polling all the runners that are accessible to the identity associated with the supplied token. This is a bit more efficient than the ‘standard’ way, which involves making requests against all groups and then all projects in each group to find failed jobs. Note that the runners_all manager used below maps to GET /runners/all, which requires administrator access on the instance.

The following is a simple demonstration of using the runners endpoint to accomplish this:

#!/usr/bin/env python3
import os
import gitlab


def get_runner_failures(gitlab_obj):

    system_failures = []

    for runner_it in gitlab_obj.runners_all.list(get_all=True):
        runner_id = runner_it.id  # id of the runner
        runner = gitlab_obj.runners.get(runner_id)  # get runner object for inspection
        # check all the jobs associated with the runner for failures
        # get the job ids for jobs that failed due to runner system failure
        runner_failures = [
            (runner_id, job_it.id)
            for job_it in runner.jobs.list(get_all=True, status="failed")
            if job_it.failure_reason == "runner_system_failure"
        ]

        # extending with an empty list is a no-op, so no emptiness check is needed
        system_failures.extend(runner_failures)

    # list of (runner_id, job_id) where failure reason was runner failure
    return system_failures


def main():
    gitlab_obj = gitlab.Gitlab(
        os.getenv("GITLAB_API"), private_token=os.getenv("GITLAB_API_TOKEN")
    )
    system_failures = get_runner_failures(gitlab_obj)
    print(f"Gitlab runner system failures: {len(system_failures)}")
    print(f"{system_failures}")


if __name__ == "__main__":
    main()

And this will print something like:

Gitlab runner system failures: 6
[(5, 540), (5, 554), (5, 555), (5, 556), (5, 570), (5, 571)]

where each element of the list is a tuple consisting of:

the identity of the runner
the identity of the job that failed
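
Since the goal is reporting and metric aggregation, the (runner_id, job_id) tuples can be rolled up per runner. The following is a minimal sketch, not part of the original script; the helper name failures_per_runner is hypothetical and the sample data is the output shown above.

```python
from collections import Counter


def failures_per_runner(system_failures):
    """Count failed jobs per runner from a list of (runner_id, job_id) tuples."""
    return Counter(runner_id for runner_id, _job_id in system_failures)


# sample data copied from the output shown above
sample = [(5, 540), (5, 554), (5, 555), (5, 556), (5, 570), (5, 571)]

# most_common() orders runners from most to fewest failures
for runner_id, count in failures_per_runner(sample).most_common():
    print(f"runner {runner_id}: {count} system failures")
```

With the sample above this prints a single line, "runner 5: 6 system failures".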

Accessing job logs for failed jobs

We would like to know more about these failed jobs. For example, some of them might have failed due to DNS errors. Unfortunately, finding out more means reading the job logs, and the GitLab API does not expose job logs through the runners endpoint. They are only accessible through the projects endpoint (because jobs are associated with projects and pipelines). The jobs API endpoint specifies the following for accessing job logs:

GET /projects/:id/jobs/:job_id/trace

This means that we first need to make requests to obtain all the project objects using the projects endpoint.
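
To make the endpoint concrete, here is a minimal sketch of requesting the trace of a single job over plain HTTP, assuming the project id and job id are already known. The helper names build_trace_request and fetch_trace are hypothetical, and api_url is assumed to be the base URL of the GitLab instance (the same value the scripts here read from GITLAB_API). The token is sent in the PRIVATE-TOKEN header.

```python
import urllib.parse
import urllib.request


def build_trace_request(api_url, token, project_id, job_id):
    """Build a GET request for /projects/:id/jobs/:job_id/trace."""
    # :id may be a numeric id or a URL-encoded project path
    project = urllib.parse.quote(str(project_id), safe="")
    url = f"{api_url.rstrip('/')}/api/v4/projects/{project}/jobs/{job_id}/trace"
    return urllib.request.Request(url, headers={"PRIVATE-TOKEN": token})


def fetch_trace(api_url, token, project_id, job_id):
    """Return the raw job log as bytes (performs a network call)."""
    request = build_trace_request(api_url, token, project_id, job_id)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()
```

The full script below does the same thing through python-gitlab, and walks every project because the project id of each failed job is not known up front.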

#!/usr/bin/env python3
import os
import logging
import gitlab

logging.basicConfig(filename="runner-failures.log", level=logging.INFO)
logger = logging.getLogger(__name__)


def get_all_failed_job_logs(gitlab_obj):
    # poll all the projects that aren't archived
    for project_it in gitlab_obj.projects.list(iterator=True, archived=False):
        try:
            logger.debug("Polling project %s", project_it.id)
            # get the project for inspection
            project = gitlab_obj.projects.get(id=project_it.id)
            # ask the API for failed jobs only (scope filter), then check
            # each one for runner_system_failure
            for job_it in project.jobs.list(iterator=True, scope="failed"):
                if (
                    job_it.status == "failed"
                    and job_it.failure_reason == "runner_system_failure"
                ):
                    logger.debug(
                        "Runner system failure detected for %s:", job_it.web_url
                    )
                    job = project.jobs.get(job_it.id)
                    msg = (
                        f"Project:{project_it.id}, "
                        f"Job:{job_it.id}@{job_it.web_url}: "
                        f"{job.trace()}"
                    )
                    logger.info(msg)
                    print(msg)
        except gitlab.GitlabListError:
            # your api token probably doesn't have access to *all* projects and jobs
            # 403s should be expected and we'll just continue after making a note
            logger.error(
                "Unable to list jobs in project %s (access denied?)", project_it.id
            )
            continue
        except gitlab.GitlabGetError:
            logger.error("Unable to get project or jobs for %s", project_it.id)
            continue


def main():
    gitlab_obj = gitlab.Gitlab(
        os.getenv("GITLAB_API"), private_token=os.getenv("GITLAB_API_TOKEN")
    )
    get_all_failed_job_logs(gitlab_obj)


if __name__ == "__main__":
    main()

As the script shows, we also need to allow for jobs or projects that are not accessible with the supplied token. Still, this is a relatively straightforward way to collect the logs of jobs that hit runner system failures. The job logs are written to runner-failures.log for further inspection and manipulation, e.g.:

Project:56103180, Job:6457130599@https://gitlab.com/gitlab-qa-sandbox-group-6/qa-test-2024-03-22-13-47-25-22e94c3079080c8b/project-with-secure-64add0e2e0825acf/-/jobs/6457130599: b'\x1b[0KRunning with gitlab-runner 16.9.1 (782c6ecb)\x1b[0;m\n\x1b[0K  on green-1.saas-linux-small-amd64.runners-manager.gitlab.com/default JLgUopmM, system ID: s_deaa2ca09de7\x1b[0;m\n\x1b[0K  feature flags: FF_USE_IMPROVED_URL_MASKING:true\x1b[0;m\nsection_start:1711115280:resolve_secrets\r\x1b[0K\x1b[0K\x1b[36;1mResolving secrets\x1b[0;m\x1b[0;m\nsection_end:1711115280:resolve_secrets\r\x1b[0Ksection_start:1711115280:prepare_executor\r\x1b[0K\x1b[0K\x1b[36;1mPreparing the "docker+machine" executor\x1b[0;m\x1b[0;m\n\x1b[0KUsing Docker executor with image registry.example.com/brakeman:4 ...\x1b[0;m\n\x1b[0KPulling docker image registry.example.com/brakeman:4 ...\x1b[0;m\n\x1b[0;33mWARNING: Failed to pull image with policy "always": Error response from daemon: Get https://registry.example.com/v2/: dial tcp: lookup registry.example.com on 169.254.169.254:53: no such host (manager.go:250:0s)\x1b[0;m\nsection_end:1711115284:prepare_executor\r\x1b[0K\x1b[31;1mERROR: Job failed: failed to pull image "registry.example.com/brakeman:4" with specified policies [always]: Error response from daemon: Get https://registry.example.com/v2/: dial tcp: lookup registry.example.com on 169.254.169.254:53: no such host (manager.go:250:0s)\n\x1b[0;m\n'
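
The trace comes back as raw bytes, including ANSI colour codes, which makes it awkward to search. The following is a minimal sketch of cleaning a trace and flagging likely DNS failures. The helper names are hypothetical, and matching on the string "no such host" is an assumption based on the sample log above; other resolvers or executors may word the error differently.

```python
import re

# matches ANSI control sequences such as \x1b[0K or \x1b[31;1m
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;]*[A-Za-z]")


def clean_trace(raw):
    """Decode a raw job trace and strip ANSI escape sequences."""
    text = raw.decode("utf-8", errors="replace")
    return ANSI_ESCAPE.sub("", text)


def looks_like_dns_failure(text):
    """Heuristic (assumption): Go's resolver reports 'no such host'."""
    return "no such host" in text


# shortened excerpt of the sample trace shown above
sample = (
    b'\x1b[0;33mWARNING: Failed to pull image with policy "always": '
    b"dial tcp: lookup registry.example.com on 169.254.169.254:53: "
    b"no such host\x1b[0;m\n"
)

text = clean_trace(sample)
print(text, end="")
print(looks_like_dns_failure(text))
```

In the script above, the same helpers could be applied to the value returned by job.trace() before logging it.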