Identifying GitLab runner failures using v4 API calls
Recently I was asked to look into a number of pipeline failures in GitLab CI. Specifically, we wanted to find out how many jobs were failing because of runner system failures, i.e. failures not caused by the job script itself. Runner system failures cover a number of failure modes, including DNS resolution failures.
GitLab runners can be polled for failures programmatically through the API, which gives us a way to collect information for further reporting and metric aggregation. A simple script is enough, either in Python (using the python-gitlab package) or in shell using curl. I chose Python in this case because it is easier to manipulate the returned data directly.
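Both scripts in this post read the API URL and token from the GITLAB_API and GITLAB_API_TOKEN environment variables. As a minimal sketch (the helper name load_gitlab_settings is hypothetical and not part of python-gitlab), it can help to validate those variables up front so that a missing value fails loudly instead of surfacing later as a confusing API error:

```python
import os


def load_gitlab_settings(environ=os.environ):
    """Return (url, token) from the environment, or raise if either is missing.

    Hypothetical helper for illustration only; the variable names match
    the ones used by the scripts in this post.
    """
    url = environ.get("GITLAB_API")
    token = environ.get("GITLAB_API_TOKEN")
    missing = [
        name
        for name, value in (("GITLAB_API", url), ("GITLAB_API_TOKEN", token))
        if not value
    ]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return url, token
```

The returned pair can then be passed on as gitlab.Gitlab(url, private_token=token).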
Finding runners with failed jobs
We can find runner failures by polling all the runners that are accessible to the identity associated with the supplied token. This is a bit more efficient than the ‘standard’ way, which involves making requests against all groups and then all projects in each group to find failed jobs.
The following is a simple demonstration of using the runners endpoint to accomplish this:
#!/usr/bin/env python3
import os

import gitlab


def get_runner_failures(gitlab_obj):
    system_failures = []
    for runner_it in gitlab_obj.runners_all.list(get_all=True):
        runner_id = runner_it.id  # id of the runner
        runner = gitlab_obj.runners.get(runner_id)  # get runner object for inspection
        # check all the jobs associated with the runner for failures
        # get the job ids for jobs that failed due to runner system failure
        runner_failures = [
            (runner_id, job_it.id)
            for job_it in runner.jobs.list(get_all=True, status="failed")
            if job_it.failure_reason == "runner_system_failure"
        ]
        # extending with an empty list is a no-op, so no emptiness check is needed
        system_failures.extend(runner_failures)
    # list of (runner_id, job_id) where failure reason was runner failure
    return system_failures


def main():
    gitlab_obj = gitlab.Gitlab(
        os.getenv("GITLAB_API"), private_token=os.getenv("GITLAB_API_TOKEN")
    )
    system_failures = get_runner_failures(gitlab_obj)
    print(f"Gitlab runner system failures: {len(system_failures)}")
    print(f"{system_failures}")


if __name__ == "__main__":
    main()
And this will print something like:
Gitlab runner system failures: 6
[(5, 540), (5, 554), (5, 555), (5, 556), (5, 570), (5, 571)]
where each element of the list is a tuple consisting of:
- the identity of the runner
- the identity of the job that failed
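Since the goal is reporting and metric aggregation, the (runner_id, job_id) tuples can be rolled up into a count per runner. The following is only a sketch of one way to do that; the function name failures_per_runner is an assumption for illustration, and the sample data is copied from the output above:

```python
from collections import Counter


def failures_per_runner(system_failures):
    """Count failed jobs per runner from (runner_id, job_id) tuples."""
    return Counter(runner_id for runner_id, _job_id in system_failures)


# Sample data copied from the example output above
sample = [(5, 540), (5, 554), (5, 555), (5, 556), (5, 570), (5, 571)]
for runner_id, count in failures_per_runner(sample).most_common():
    print(f"runner {runner_id}: {count} system failures")
# prints: runner 5: 6 system failures
```

Runners with a disproportionate share of system failures are then easy to spot, because most_common() sorts them to the top.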
Accessing job logs for failed jobs
We would like to know more about these failed jobs. For example, some of them might have failed due to DNS errors. Unfortunately, to learn more we need the job logs, and the GitLab API does not expose job logs through the runners endpoint. They are only accessible through the projects endpoint (because jobs are associated with projects and pipelines). The jobs API endpoint specifies the following for accessing job logs:
GET /projects/:id/jobs/:job_id/trace
This means that we first need to make requests to obtain all the project objects using the projects endpoint.
#!/usr/bin/env python3
import logging
import os

import gitlab

logging.basicConfig(filename="runner-failures.log", level=logging.INFO)
logger = logging.getLogger(__name__)


def get_all_failed_job_logs(gitlab_obj):
    # poll all the projects that aren't archived
    for project_it in gitlab_obj.projects.list(iterator=True, archived=False):
        try:
            logger.debug("Polling project %s", project_it.id)
            # get the project for inspection
            project = gitlab_obj.projects.get(id=project_it.id)
            # check all the jobs for failures due to runner_system_failure
            for job_it in project.jobs.list(iterator=True):
                if (
                    job_it.status == "failed"
                    and job_it.failure_reason == "runner_system_failure"
                ):
                    logger.debug(
                        "Runner system failure detected for %s:", job_it.web_url
                    )
                    job = project.jobs.get(job_it.id)
                    # job.trace() returns the raw log as bytes
                    msg = (
                        f"Project:{project_it.id}, "
                        f"Job:{job_it.id}@{job_it.web_url}: "
                        f"{job.trace()}"
                    )
                    logger.info(msg)
                    print(msg)
        except gitlab.GitlabListError:
            # your api token probably doesn't have access to *all* projects and jobs
            # 403s should be expected and we'll just continue after making a note
            logger.error(
                "Access denied while polling jobs in project %s", project_it.id
            )
            continue
        except gitlab.GitlabGetError:
            logger.error("Unable to get project or jobs for %s", project_it.id)
            continue


def main():
    gitlab_obj = gitlab.Gitlab(
        os.getenv("GITLAB_API"), private_token=os.getenv("GITLAB_API_TOKEN")
    )
    get_all_failed_job_logs(gitlab_obj)


if __name__ == "__main__":
    main()
As you can see from the above, we also need to account for the fact that some jobs or projects may not be accessible with the supplied token. Still, this is a relatively straightforward way to scrape for runner system failures. The job logs will be dumped into runner-failures.log for further inspection and manipulation, e.g.:
Project:56103180, Job:6457130599@https://gitlab.com/gitlab-qa-sandbox-group-6/qa-test-2024-03-22-13-47-25-22e94c3079080c8b/project-with-secure-64add0e2e0825acf/-/jobs/6457130599: b'\x1b[0KRunning with gitlab-runner 16.9.1 (782c6ecb)\x1b[0;m\n\x1b[0K on green-1.saas-linux-small-amd64.runners-manager.gitlab.com/default JLgUopmM, system ID: s_deaa2ca09de7\x1b[0;m\n\x1b[0K feature flags: FF_USE_IMPROVED_URL_MASKING:true\x1b[0;m\nsection_start:1711115280:resolve_secrets\r\x1b[0K\x1b[0K\x1b[36;1mResolving secrets\x1b[0;m\x1b[0;m\nsection_end:1711115280:resolve_secrets\r\x1b[0Ksection_start:1711115280:prepare_executor\r\x1b[0K\x1b[0K\x1b[36;1mPreparing the "docker+machine" executor\x1b[0;m\x1b[0;m\n\x1b[0KUsing Docker executor with image registry.example.com/brakeman:4 ...\x1b[0;m\n\x1b[0KPulling docker image registry.example.com/brakeman:4 ...\x1b[0;m\n\x1b[0;33mWARNING: Failed to pull image with policy "always": Error response from daemon: Get https://registry.example.com/v2/: dial tcp: lookup registry.example.com on 169.254.169.254:53: no such host (manager.go:250:0s)\x1b[0;m\nsection_end:1711115284:prepare_executor\r\x1b[0K\x1b[31;1mERROR: Job failed: failed to pull image "registry.example.com/brakeman:4" with specified policies [always]: Error response from daemon: Get https://registry.example.com/v2/: dial tcp: lookup registry.example.com on 169.254.169.254:53: no such host (manager.go:250:0s)\n\x1b[0;m\n'
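The raw trace is a bytes object full of ANSI escape codes and section markers, which makes it hard to read or search. As a rough sketch of the "further inspection" step, the helpers below strip that noise and flag traces that look like DNS failures. The function names and the list of DNS hint strings are assumptions for illustration; the hints are simple substring heuristics, not an exhaustive classifier:

```python
import re

# Matches ANSI CSI sequences such as "\x1b[0K" or "\x1b[31;1m"
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;]*[A-Za-z]")
# Matches section markers such as "section_start:1711115280:resolve_secrets"
SECTION_MARKER = re.compile(r"section_(?:start|end):\d+:\w+")
# Heuristic substrings that commonly appear in DNS resolution errors (assumed list)
DNS_HINTS = (
    "no such host",
    "could not resolve host",
    "temporary failure in name resolution",
)


def clean_trace(raw):
    """Decode a raw job trace and strip escape codes and section markers."""
    text = raw.decode("utf-8", errors="replace")
    text = ANSI_ESCAPE.sub("", text)
    text = SECTION_MARKER.sub("", text)
    return text.replace("\r", "")


def looks_like_dns_failure(raw):
    """Return True if the trace contains one of the DNS hint strings."""
    text = clean_trace(raw).lower()
    return any(hint in text for hint in DNS_HINTS)
```

In the script above, looks_like_dns_failure(job.trace()) could be used to tag each logged failure, so that DNS-related failures can be counted separately from other runner system failures.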