Suddenly getting ErrImagePull

I’ve had AWX version 21.12.0 installed and working just fine for over a year. It is installed on an Azure Ubuntu 22.04 virtual machine in our Azure V-Net.
It was quite a feat getting this all working because I know virtually nothing about Kubernetes, Rancher, or any of the other components needed to get AWX installed and configured. Many thanks to a YouTube video from Calvin Remsburg which shows how to do this. ( https://www.youtube.com/watch?v=Nvjo2A2cBxI )

I am suddenly getting this error when attempting to run templates. Here is the job output:
During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/jobs.py", line 593, in run
    res = receptor_job.run()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/receptor.py", line 317, in run
    res = self._run_internal(receptor_ctl)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/receptor.py", line 444, in _run_internal
    raise RuntimeError(detail)
RuntimeError: Error creating pod: container failed to start, ImagePullBackOff

If I try to take a look at my pods while attempting to run the template file in AWX, I see this error:
automation-job-17206-c896b 0/1 ErrImagePull 0

I honestly don’t have any idea how to troubleshoot this ErrImagePull error.
Can anyone provide some guidance?
Thank you.

I’m not sure if I am doing this correctly but using the following command
kubectl get events --all-namespaces

I was able to see that it is attempting to pull the image “awx-ee:latest” from ‘Quay’.

Is ‘Quay’ somewhere on the internet or would this be stored on my local host after installing AWX using awx-operator?

Sorry, all this kubernetes stuff is a bit overwhelming. If this image is off the internet, that would be indicative of internet access being blocked on my AWX Ubuntu host. I am pretty sure that I have not done anything to block this but with my experience in Azure, it is possible that somewhere in the jungle of things required to get hosts talking to the internet that either I or Azure has made a change.

Since this has just worked from the time that I installed it, I had never thought about internet access on this host so I don’t know if that is an issue or not in this case.

Can anyone verify whether or not the AWX installation requires internet access to pull awx-ee:latest image?

I’m right there with you on the overwhelm! I’m getting there and trust you will, too.

Yes, that means it tried to pull the image from quay.io and failed. The automation job is an ephemeral pod spun up to run the actual play, and it failed to pull the execution environment (EE) image it needs to run the play.

Finding out why it’s not pulling is your next step. Running kubectl describe pod <podname> -n <namespace> should get you that piece of data; the reason is usually in the Events section at the bottom of the output.
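In case it helps, here is a rough sketch of that step. It assumes AWX runs in a namespace called awx (adjust NS if yours differs), and failing_pods is just a made-up helper name that picks out the pods stuck on an image pull so you don't have to copy the pod name by hand.

```shell
#!/bin/sh
# Assumption: AWX runs in the "awx" namespace; override with NS=... if not.
NS="${NS:-awx}"

# failing_pods (hypothetical helper): reads `kubectl get pods` output on stdin
# and prints the names of pods whose STATUS column is an image pull failure.
failing_pods() {
  awk 'NR > 1 && ($3 == "ErrImagePull" || $3 == "ImagePullBackOff") { print $1 }'
}

# Only talk to the cluster when kubectl is actually available.
if command -v kubectl >/dev/null 2>&1; then
  for pod in $(kubectl get pods -n "$NS" | failing_pods); do
    # The Events section at the end of the output usually says why the
    # pull failed (DNS, timeout, authentication, disk, ...).
    kubectl describe pod "$pod" -n "$NS"
  done
fi
```

The job pods are short-lived, so this probably needs to be run while a template is launching.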

Hello,

I’ve had this very same problem twice or so when deploying AWX on minikube. If I recall correctly, what I did was log in to my minikube VM and pull the EE image manually with the docker image pull <EE image> command. Redeploying the AWX operator after that did the trick.
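One caveat if you are on k3s rather than minikube: k3s uses containerd by default, so a docker pull would not put the image where the cluster looks for it. A minimal sketch of the equivalent, assuming the full image reference is quay.io/ansible/awx-ee:latest (the thread only shows “awx-ee:latest” from Quay, so treat that as a guess). pick_pull_command is a made-up helper that only prints the command, so nothing changes until you run it yourself.

```shell
#!/bin/sh
# Assumption: full image reference; override with EE_IMAGE=... if yours differs.
EE_IMAGE="${EE_IMAGE:-quay.io/ansible/awx-ee:latest}"

# pick_pull_command (hypothetical helper): print the manual pull command
# for the given runtime ("k3s" or "docker") and image, without running it.
pick_pull_command() {
  case "$1" in
    k3s)    echo "sudo k3s crictl pull $2" ;;   # k3s bundles crictl for containerd
    docker) echo "docker image pull $2" ;;      # minikube with the docker runtime
    *)      echo "unknown runtime: $1" >&2; return 1 ;;
  esac
}

pick_pull_command k3s "$EE_IMAGE"
```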

Also, it might be that in the very moment you tried to deploy the AWX operator, quay.io was down - I’ve seen some people here in the forum commenting on that. So, you may just have to try again a few hours later to make it work.

Edit:

Sorry, I missed your question about whether internet access is required:

Yes, you need access to the Internet to pull the EE image.
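If you want to rule out blocked internet access, a quick check from the AWX host could look like the sketch below. It assumes curl is installed; registry_status and classify_status are made-up helper names. Any HTTP status code, even an error such as 401, means the registry answered, while 000 means curl never got a response (DNS, firewall, proxy and so on).

```shell
#!/bin/sh
# registry_status (hypothetical helper): print the HTTP status code returned
# by the registry's /v2/ endpoint, or 000 if no response arrived in 10 seconds.
registry_status() {
  curl -s -o /dev/null -m 10 -w '%{http_code}' "https://$1/v2/" || true
}

# classify_status (hypothetical helper): turn a status code into a verdict.
classify_status() {
  case "$1" in
    000|"") echo "unreachable" ;;   # no HTTP response at all (or curl missing)
    *)      echo "reachable" ;;     # any HTTP response means the network path works
  esac
}

code=$(registry_status quay.io)
echo "quay.io: $(classify_status "$code") (HTTP $code)"
```

This only tests the path from the host itself, which should be the same path the pull uses on a single-node install.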

Hope it helps!


Thank you.
Unfortunately this error is constant. None of my jobs are working.
I’m struggling to think what could have changed on my Linux host where AWX was installed. This was all working flawlessly for 10 months or so, but suddenly every job is failing with the same error:
Failed to pull image “Quay”: rpc error: code = Unknown desc = failed to pull and unpack image “Quay”: failed to extract layer sha256:b4bbae57b10b4c10585b70e04ce63d4fae26099228dfa62d7124e6b02451cd80: failed to unmount /var/lib/rancher/k3s/agent/containerd/tmpmounts/containerd-mount3462773482: failed to unmount target /var/lib/rancher/k3s/agent/containerd/tmpmounts/containerd-mount3462773482: device or resource busy: unknown


I see… The first thing that comes to mind is that AWX might be configured to pull the EE image every time you run a JT. As a temporary fix, I’d check whether the “pull” option in the EE configuration section is set as in the screenshot (always pull…):

[screenshot: EE configuration with the “pull” option set to always pull…]

If so, try to change it to “missing”. This would avoid pulling and mounting the EE image for each JT you launch.

However, to fix the problem permanently, you will have to investigate what is using the EE container so it cannot be unmounted, as noted in the error log:

Failed to unmount target /var/lib/rancher/k3s/agent/containerd/tmpmounts/containerd-mount3462773482: device or resource busy: unknown

Maybe you have a JT that’s still running, or a workflow template waiting for an approval action? (I’m just guessing here.) If memory serves, you can check whether any stuck task pod is still running with kubectl get pods -n awx
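Alongside the pod check, it might be worth looking at whether one of those temporary containerd mounts is still hanging around on the host. A minimal sketch, assuming a Linux host with /proc/mounts; busy_tmpmounts is a made-up helper name.

```shell
#!/bin/sh
# busy_tmpmounts (hypothetical helper): reads /proc/mounts-style lines on
# stdin and prints mount points under containerd's tmpmounts directory.
busy_tmpmounts() {
  awk '$2 ~ /containerd\/tmpmounts\// { print $2 }'
}

# List any leftover temporary mounts on this host.
if [ -r /proc/mounts ]; then
  busy_tmpmounts < /proc/mounts
fi

# List the pods too, to spot anything stuck (namespace "awx" as above).
if command -v kubectl >/dev/null 2>&1; then
  kubectl get pods -n awx
fi
```

If a mount does show up, something like sudo fuser -vm <mount point> (from the psmisc package) may show which process is holding it, though I have not tried that on this exact error.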
