Hi.
I’m hoping someone here can give me advice on how to troubleshoot this, maybe point me at logs that I haven’t found yet or settings that produce more debugging information, or whatever.
TLDR: During AWX jobs that take 10-15 minutes, Jenkins seems to sometimes lose connection to AWX.
Details:
We have AWX 9 installed in Openshift (as a StatefulSet, or STS).
We also use Jenkins, also installed in Openshift, and Jenkins uses the “Ansible Tower Plugin” to communicate with AWX. This combination has worked fine for years.
At random times, Jenkins is waiting for AWX to complete a job (deploying an image and settings to Openshift) and does not get a valid response back, so Jenkins fails the job even though AWX actually finishes the task successfully.
So there is some kind of communication issue, and I need to track it down. I posted a similar message to the Jenkins mailing list but have not received a response yet, so I am hoping to get some troubleshooting advice here.
All we get in Jenkins is:
“ERROR: Failed to get job status from Tower: Unexpected error code returned (503)”
The second half of the message comes from a Java exception that the plugin catches at runtime while it checks whether the job is complete in AWX.
I can’t find anything related to the issue in the Jenkins logs, but maybe I’m not looking for the right thing.
The only mention of a 503 error in AWX around the time of the issue is this from the awx:web pod in Openshift:
[Ansible-Tower] Building GET request to https://awx/api/v2/jobs/57602/
[Ansible-Tower] Forcing cert trust
[Ansible-Tower] Request completed with (503)
[Ansible-Tower] Deleting oAuth token 15396 for awx
[Ansible-Tower] Forcing cert trust
[Ansible-Tower] Calling for oAuth token delete at https://awx/api/v2/tokens/15396/
[Ansible-Tower] Request completed with (200)
I don’t know if this is related or not.
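To check how often this happens, log excerpts like the one above could be scanned for request/status pairs. This is only a minimal sketch, assuming the log lines always follow the format shown above; the regular expressions and the function name `pair_requests` are my own and not part of the plugin.

```python
import re

# Assumption: these patterns match the "[Ansible-Tower]" lines quoted above.
REQUEST_RE = re.compile(
    r"\[Ansible-Tower\] (?:Building \w+ request to|Calling for oAuth token delete at) (\S+)"
)
STATUS_RE = re.compile(r"\[Ansible-Tower\] Request completed with \((\d{3})\)")


def pair_requests(lines):
    """Pair each logged request URL with the status code that follows it."""
    pairs = []
    current_url = None
    for line in lines:
        match = REQUEST_RE.search(line)
        if match:
            current_url = match.group(1)
            continue
        match = STATUS_RE.search(line)
        if match:
            pairs.append((current_url, int(match.group(1))))
            current_url = None
    return pairs


SAMPLE = """\
[Ansible-Tower] Building GET request to https://awx/api/v2/jobs/57602/
[Ansible-Tower] Forcing cert trust
[Ansible-Tower] Request completed with (503)
[Ansible-Tower] Deleting oAuth token 15396 for awx
[Ansible-Tower] Forcing cert trust
[Ansible-Tower] Calling for oAuth token delete at https://awx/api/v2/tokens/15396/
[Ansible-Tower] Request completed with (200)
""".splitlines()

pairs = pair_requests(SAMPLE)
for url, code in pairs:
    print(code, url)
```

Run over a longer log, the same function would show whether the 503 responses only ever hit the job status URL or other endpoints as well.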
From what I can tell, this seems to happen most often when AWX is waiting for Openshift to finish deploying larger apps that have multiple pods and take 10-15 minutes to complete.
Again, both Openshift and AWX finish their jobs normally.
But Jenkins for some reason loses communication and fails its side of things.
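To find out whether AWX itself answers 503 during these long jobs, independent of Jenkins, a small poller could be run from another pod while a job is in progress. This is only a sketch under assumptions: the URL in the usage comment reuses the host and job ID from the log above, the token is a placeholder, the names `http_fetch` and `wait_for_job` are hypothetical, and certificate handling is left at the Python defaults (the plugin log shows it forces cert trust, so a self-signed certificate would need extra handling). It treats 502/503/504 as temporary and retries a few times instead of failing on the first one.

```python
import json
import time
import urllib.error
import urllib.request

# Job states after which AWX will not change the status any more.
FINAL_STATES = {"successful", "failed", "error", "canceled"}


def http_fetch(url, token):
    """Return (status_code, body) for a GET request using a bearer token."""
    request = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"}
    )
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.status, response.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        # Report HTTP errors such as 503 as a status code instead of raising.
        return err.code, ""


def wait_for_job(fetch, max_retries=5, delay=10.0, sleep=time.sleep):
    """Poll until the job reaches a final state, tolerating short 5xx bursts."""
    consecutive_errors = 0
    while True:
        code, body = fetch()
        print(time.strftime("%H:%M:%S"), "HTTP", code)
        if code == 200:
            consecutive_errors = 0
            status = json.loads(body)["status"]
            if status in FINAL_STATES:
                return status
        elif code in (502, 503, 504):
            consecutive_errors += 1
            if consecutive_errors > max_retries:
                raise RuntimeError(f"gave up after repeated HTTP {code}")
        else:
            raise RuntimeError(f"unexpected HTTP {code}")
        sleep(delay)


# Example call (placeholder token, URL taken from the log above):
# wait_for_job(lambda: http_fetch("https://awx/api/v2/jobs/57602/", "MY_TOKEN"))
```

If the timestamps show 503 responses clustering at a particular point in a long deployment, that would point at AWX or something in front of it rather than at Jenkins.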
We had this issue a year ago. It went away. Now it is back and badly affecting our work.
I have already restarted AWX and Jenkins, and I have updated the Ansible Tower Plugin in Jenkins.
As far as I can tell, there were no changes anywhere leading up to this.
AWX is on version 9.0.0.0 (installed in Openshift as a StatefulSet, or STS) with Ansible 2.8.5.
Jenkins is on version 2.235.1 (also installed in Openshift).
I’ve been given the job of handling AWX and Jenkins for our team with very little training, after the people who installed them left, so I’m feeling a bit lost.
I have looked at all the logs I’ve been able to find, but so far nothing definite has turned up.
Thanks for any help you can give me!