'Error opening pod stream' or 'Stream error running pod' randomly on some jobs

Hi, I have a k3s cluster with AWX installed on it. Everything works fine except for one thing, and it is a very important one.

Sometimes, at random, a job will not start and the output shows this error:

Error opening pod stream: Get "https://ip_of_my_server:10250/containerLogs/awx/automation-job-567099-7hw6l/worker?follow=true": EOF

or more rarely

Stream error running pod: stdin: error dialing backend: EOF, stdout: http2: response body closed

But if I relaunch the job, it starts and runs normally.

I noticed that it happens when several jobs are running at the same time (10 or more), but not always. For example, I have a big workflow with 32 jobs starting at the same time, which runs twice each morning, at 7:00 am and 8:00 am. Sometimes both runs succeed, and sometimes 2 or 3 jobs fail in the first run but the second run is fine. It’s very weird.

I looked for a way to limit the number of jobs running in parallel on AWX but couldn’t find one, and the number of forks in my AWX configuration is 100. My server has 12 CPUs and 20 GB of RAM, so I don’t think it’s a resource issue.

My version of K3S is : v1.24.4+k3s1
My version of AWX is : 21.5.0
My system is : CentOS Linux release 7.9.2009

Does anyone have a solution or a workaround for this? It’s very problematic, because sometimes an entire workflow fails because of a single job hitting this error.

Thanks

Hi,
I had similar behavior, so I downgraded AWX, but I was still having issues, just not as frequently.
For now I am running:
K3s: v1.21.9+k3s1
operator: 0.26.0
AWX: 21.4.0

I have not experienced the issue since, even with several scheduled jobs running, some every 5 minutes.

My setup can be found here: https://github.com/antuelle78/deploy-awx-k3s-ubuntu

Regards,
Antuelle78

If there is any kind of hiccup while transmitting the inputs to ansible-runner in the pod, we cannot resume the connection because the ansible-runner process stdin will be pouted. Because of this, we have to fail the job.

-The AWX Team
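
Since the job has to be failed when the stream breaks, and relaunching it reportedly works, one possible workaround is to relaunch automatically when the error text matches one of the two messages quoted above. The following is only a minimal sketch of that idea, not something AWX provides: `launch` is a hypothetical callable you would write yourself (for example around the AWX API or CLI) that starts one job, waits for it to finish, and returns its status and error text. The status string "successful" and the retry defaults are assumptions.

```python
import time
from typing import Callable, Tuple

# Substrings taken from the two error messages quoted in this thread.
TRANSIENT_MARKERS = (
    "Error opening pod stream",
    "Stream error running pod",
)


def is_transient_stream_error(error_text: str) -> bool:
    """Return True if the error text matches one of the pod stream errors."""
    return any(marker in error_text for marker in TRANSIENT_MARKERS)


def run_with_relaunch(
    launch: Callable[[], Tuple[str, str]],
    max_attempts: int = 3,
    delay_seconds: float = 30.0,
) -> Tuple[str, int]:
    """Run a job, relaunching it only when it hit a pod stream error.

    `launch` is hypothetical: it must start one job, wait for it, and
    return (status, error_text). The "successful" status is an assumption.
    Returns the final status and the number of attempts made.
    """
    status = "error"
    for attempt in range(1, max_attempts + 1):
        status, error_text = launch()
        # Stop on success, or on a failure that is not a stream error:
        # a genuine playbook failure should not be retried blindly.
        if status == "successful" or not is_transient_stream_error(error_text):
            return status, attempt
        if attempt < max_attempts:
            time.sleep(delay_seconds)  # give the node a moment before retrying
    return status, max_attempts
```

Only jobs that are safe to run twice should be wrapped this way, because the failed attempt may already have started some work before the stream broke.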