Need help with communication issue from Jenkins

Hi.

I’m hoping someone here can give me advice on how to troubleshoot this, maybe point me at logs that I haven’t found yet or settings that produce more debugging information, or whatever.

TLDR: During AWX jobs that take 10-15 minutes, Jenkins seems to sometimes lose connection to AWX.

Details:

We have AWX 9 installed in Openshift (as STS).

We also use Jenkins, also installed in Openshift, and Jenkins uses the “Ansible Tower Plugin” to communicate with AWX. This combination has worked fine for years.

At random times, a problem will show where Jenkins is waiting for AWX to complete a job (deploying an image and settings to Openshift) and somehow not getting the correct communication back, so Jenkins fails the job even though AWX actually finishes the task successfully.

So there is some kind of communication issue, and I need to track this down. I posted a similar message in the Jenkins mailing list but have not received a response yet, so I am hoping to maybe get some troubleshooting advice here.

All we get in Jenkins is:

“ERROR: Failed to get job status from Tower: Unexpected error code returned (503)”

The second half of the message is a Java exception that the plugin code catches at runtime while it checks whether the job is complete in AWX.

I can’t find anything related to the issue in the Jenkins logs, but maybe I’m not looking for the right thing.

The only mention of a 503 error in AWX around the time of the issue is this from the awx:web pod in Openshift:

[Ansible-Tower] Building GET request to https://awx/api/v2/jobs/57602/
[Ansible-Tower] Forcing cert trust

[Ansible-Tower] Request completed with (503)

[Ansible-Tower] Deleting oAuth token 15396 for awx

[Ansible-Tower] Forcing cert trust

[Ansible-Tower] Calling for oAuth token delete at https://awx/api/v2/tokens/15396/

[Ansible-Tower] Request completed with (200)

I don’t know if this is related or not.

From what I can tell, this seems to happen the most whenever AWX is waiting for Openshift to finish deployment on larger apps that have multiple pods and take 10-15 minutes to complete.

Again, both Openshift and AWX finish their jobs normally.

But Jenkins for some reason loses communication and fails its side of things.

We had this issue a year ago. It went away. Now it is back and badly affecting our work.

I have restarted AWX and Jenkins already as well as updated the Ansible Tower Plugin in Jenkins.

As far as I can tell, there were no changes anywhere leading up to this.

AWX is on version 9.0.0.0 (installed in Openshift as Stateful Set or STS) with Ansible 2.8.5.

Jenkins is on version 2.235.1 (also installed in Openshift).

I’ve been given the job of handling AWX and Jenkins for our team with very little training after the ones who installed them have left us, so I’m feeling a bit lost.

I have looked at all the logs I’ve been able to find, but nothing in them has helped so far.

Thanks for any help you can give me!

Quick update: I just had the pipeline fail NOT during a long wait but while Ansible was setting a variable only a few seconds after the playbook was started, so the issue does NOT appear to have to do with long wait times.

Bump - still hoping someone can help me with this.

Thanks!

What version of the Ansible Tower Jenkins plug-in are you running? A 503 error code sounds like the plug-in is expecting a specific response from AWX and is receiving something different. I was going to ask if there are any firewalls between AWX and Jenkins, but it appears all the network communication is within OpenShift. Do they share a worker node? Are they distributed across datacenters? What does the pod topology look like for network communication between AWX and Jenkins?

Thanks for the reply.

The Ansible Tower Plugin is on version 0.16.0, and I don’t see an update available.

Our servers are distributed in multiple physical locations, and AWX and Jenkins are located in separate locations. So they are not sharing nodes.

It shouldn’t be a firewall issue as we have many apps that have to communicate back and forth. We even have a subdomain installed for the AWX instance.

I suspected a networking issue since we occasionally have glitches, but I just heard back from the network team earlier this week, and they claim that a 503 error is NOT a network issue but an issue with Openshift. Sounds about right, every team blaming another while we are stuck in the middle without a solution…

I don’t know the network topology. I just know that 95-99% of the time, the Jenkins-AWX connection is fine, but when it fails, it’s this 503 error.
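Because a single 503 is enough to fail the Jenkins side while the AWX job itself keeps running, any custom polling code (not the plugin itself) could retry the status check a few times before giving up. A minimal Python sketch, assuming a hypothetical `fetch_status` callable that returns an `(http_status, payload)` tuple; the retry count and delays are arbitrary placeholders, not values taken from the plugin:

```python
import time


def poll_with_retry(fetch_status, retries=5, delay=2.0, sleep=time.sleep):
    """Call fetch_status() until it returns something other than HTTP 503.

    fetch_status is a hypothetical callable returning (http_status, payload).
    Returns the first non-503 result, or the last 503 result once the
    retries are used up.
    """
    result = fetch_status()
    attempt = 0
    while result[0] == 503 and attempt < retries:
        # Exponential backoff: delay, 2*delay, 4*delay, ...
        sleep(delay * (2 ** attempt))
        attempt += 1
        result = fetch_status()
    return result
```

The `sleep` parameter is only there so the waiting can be replaced in a test; in normal use the default `time.sleep` applies.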

Please let me know if there’s any additional information that I can provide that might help.

Thanks!

I see the same, 0.16.0, as the latest version of the Ansible Tower Jenkins plug-in. I’ve been looking, but I’ve been unable to find information on the compatibility of this plug-in with AWX 9.0.0. You mentioned that you had 503 errors starting last year, and I noticed that this plug-in was last updated in 2020, so I wonder if there is any correlation. Since you’re receiving a 503 message from AWX, the network is available; it’s just that AWX is not responding with what the plug-in is expecting. Are you able to pull logs from the AWX container during the time of a 503 error and see how AWX is reacting?

Last year when we had the 503 errors for the first time, we were on the previous version of the Ansible Tower Plugin. As far as I know, there was no update or any other change that happened before the errors started showing up. We updated the plugin in the hopes that it would fix the issue. Then the issue went away for nearly a year, and now it is back. Some days we get so many errors that we can’t get a pipeline to deploy at all, some days there are no errors. I cannot find any consistency in this at all…

As for AWX, I kept checking the logs again and again. Any connection that ends up in a 503 error does not show up in the AWX logs at all, and I have never seen anything out of the ordinary in the log just before the 503 hits.

Here is an example:

Three requests in Jenkins through the Ansible Tower Plugin directed at AWX - the first is successful, the second fails, the third (delete) succeeds:

[Ansible-Tower] Building GET request to https://awxserver/api/v2/jobs/58857/job_events/?id__gt=1334939
[Ansible-Tower] Forcing cert trust

[Ansible-Tower] Request completed with (200)

[Ansible-Tower] {“…json data…”}

[Ansible-Tower] Building GET request to https://awxserver/api/v2/jobs/58857/job_events/?id__gt=1334964

[Ansible-Tower] Forcing cert trust

[Ansible-Tower] Request completed with (503)

[Ansible-Tower] Deleting oAuth token 15964 for awx

[Ansible-Tower] Forcing cert trust

[Ansible-Tower] Calling for oAuth token delete at https://awxserver/api/v2/tokens/15964/

[Ansible-Tower] oAuth Token deleted

Log output in AWX for these requests:

2021-05-27 15:48:39,909 INFO awx.api.authentication User awx performed a GET to /api/v2/jobs/58857/job_events/ through the API using OAuth 2 token 15964.

[pid: 846|app: 0|req: 49/56610] 10.129.44.1 () {42 vars in 701 bytes} [Thu May 27 15:48:39 2021] GET /api/v2/jobs/58857/job_events/?id__gt=1334939 => generated 49884 bytes in 249 msecs (HTTP/1.1 200) 9 headers in 258 bytes (1 switches on core 0)

10.129.44.1 - - [27/May/2021:15:48:40 +0000] "GET /api/v2/jobs/58857/job_events/?id__gt=1334939 HTTP/1.1" 200 49884 "-" "-" "10.49.85.200, 10.204.113.7, 10.204.120.7"

2021-05-27 15:48:42,554 INFO awx.api.authentication User awx performed a DELETE to /api/v2/tokens/15964/ through the API

[pid: 846|app: 0|req: 50/56611] 10.128.50.1 () {42 vars in 646 bytes} [Thu May 27 15:48:42 2021] DELETE /api/v2/tokens/15964/ => generated 0 bytes in 232 msecs (HTTP/1.1 204) 7 headers in 227 bytes (1 switches on core 0)

10.128.50.1 - awx [27/May/2021:15:48:42 +0000] "DELETE /api/v2/tokens/15964/ HTTP/1.1" 204 0 "-" "-" "10.49.85.200, 10.204.113.7, 10.204.120.7"

We see the first request (id__gt=1334939) come through, but there is no mention of the second request (id__gt=1334964). The delete request makes it into the log again.
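The comparison above was done by eye; for a larger log window it could be automated. The following is a rough Python sketch that lists plugin requests which never show up in the AWX uWSGI log. The function names are made up for illustration, and the regular expressions assume the log lines look exactly like the excerpts in this post. It only tracks the "Building ... request" lines, so the token delete call is not covered:

```python
import re

# Plugin lines such as:
# [Ansible-Tower] Building GET request to https://awxserver/api/v2/jobs/58857/
PLUGIN_REQUEST = re.compile(r"Building (\w+) request to https?://[^/\s]+(/\S*)")

# uWSGI lines such as:
# [...] [Thu May 27 15:48:39 2021] GET /api/v2/jobs/58857/ => generated ...
UWSGI_REQUEST = re.compile(r"\] (GET|POST|PUT|PATCH|DELETE) (/\S*) =>")


def requested_paths(jenkins_log):
    """Return (method, path) pairs the plugin says it sent, in order."""
    return [(m.group(1), m.group(2)) for m in PLUGIN_REQUEST.finditer(jenkins_log)]


def served_paths(awx_log):
    """Return the set of (method, path) pairs that uWSGI logged."""
    return {(m.group(1), m.group(2)) for m in UWSGI_REQUEST.finditer(awx_log)}


def missing_from_awx(jenkins_log, awx_log):
    """Requests the plugin sent that never appear in the AWX log."""
    served = served_paths(awx_log)
    return [req for req in requested_paths(jenkins_log) if req not in served]
```

Any request this returns was sent by the plugin but apparently never reached the AWX application, which would point at something in between the two.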

Thanks!

Set AWX to log DEBUG messages in the System Logging Settings and see if there are more details during a 503 response. Enable debugging in the Ansible Tower Jenkins plug-in too; there might be some useful info there. I only recently started using the Jenkins plug-in, so I’m not familiar with its usage with AWX before 17.1.0.

Usually in my experience a 503 error comes from nginx or whatever is in front of AWX itself (e.g. when AWX is not up) and this would explain why you don’t see the erroring request in the AWX logs. Maybe you will find something in your web server or ingress logs?
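One way to check this from the client side is to call the same API URL yourself and keep the headers and body of any 503 you catch, since an error page produced by a proxy usually looks different from a response produced by AWX. A small Python sketch using only the standard library; `describe_response` and `probe` are names invented here, the token is sent as an OAuth 2 Bearer token, and the sketch assumes the AWX certificate is trusted by Python (the plugin log shows "Forcing cert trust", so that may not hold in your setup):

```python
import urllib.error
import urllib.request


def describe_response(status, headers, body, limit=200):
    """Keep the details that hint at which layer produced the response."""
    lowered = {key.lower(): value for key, value in headers.items()}
    return {
        "status": status,
        "server": lowered.get("server", ""),
        "content_type": lowered.get("content-type", ""),
        "body_start": body[:limit].decode("utf-8", errors="replace"),
    }


def probe(url, token, timeout=10):
    """GET the URL once and summarise the response, including error responses."""
    request = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            return describe_response(resp.status, resp.headers, resp.read())
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx; the error object still carries headers and body.
        return describe_response(err.code, err.headers, err.read())
```

Running `probe` in a loop against something like `https://awx/api/v2/jobs/57602/` and saving every result whose `status` is 503 would give the OpenShift team something concrete to look at.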

Debugging is already enabled in the Tower Plugin, but it does not give me any more information…

I will see if I can find a place in AWX to enable more logging, but so far I haven’t found one.

In my case, the layer in front of AWX would be Openshift, and I only have access to areas required for deploying, not for the infrastructure. I’m waiting to hear back from our Openshift team on the issue.