Hi, I have a k3s cluster with AWX installed on it. Everything works fine except for one thing, and it is an important one.
Occasionally, and seemingly at random, a single job fails to start and its output shows this error:
Error opening pod stream: Get "https://ip_of_my_server:10250/containerLogs/awx/automation-job-567099-7hw6l/worker?follow=true": EOF
or, more rarely:
Stream error running pod: stdin: error dialing backend: EOF, stdout: http2: response body closed
But if I relaunch the job, it starts and runs normally.
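Since a relaunch always works, one stopgap I am considering is an automatic retry for exactly these two errors. Below is a minimal sketch in Python, not something I have running in production. The `launch` argument is a hypothetical callable standing in for whatever actually starts the job and reports back `(ok, message)`; it is not an AWX API. The patterns are simply the two messages quoted above.

```python
import re
import time

# Patterns built from the two error messages quoted above.
TRANSIENT_PATTERNS = [
    re.compile(r"Error opening pod stream: .*EOF"),
    re.compile(r"Stream error running pod: .*error dialing backend: EOF"),
]


def is_transient_stream_error(message: str) -> bool:
    """Return True if the message looks like one of the two stream errors."""
    return any(p.search(message) for p in TRANSIENT_PATTERNS)


def run_with_retry(launch, max_attempts=3, delay_seconds=0.0):
    """Call launch() until it succeeds or max_attempts is reached.

    launch is a hypothetical callable returning (ok, message).
    Only the two known stream errors are retried; any other failure
    is raised immediately. Returns the attempt number that succeeded.
    """
    message = ""
    for attempt in range(1, max_attempts + 1):
        ok, message = launch()
        if ok:
            return attempt
        if not is_transient_stream_error(message):
            raise RuntimeError(message)
        if attempt < max_attempts:
            time.sleep(delay_seconds)
    raise RuntimeError(f"still failing after {max_attempts} attempts: {message}")
```

This only hides the symptom, so I would still like to understand the root cause.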
I noticed that it happens when several jobs are running at the same time (10 or more), but not every time. For example, I have a big workflow with 32 jobs that all start at the same time, and it runs twice each morning, at 7:00 am and 8:00 am. Sometimes both runs succeed; other times 2 or 3 jobs fail in the first run but the second run is fine. It's very weird.
I looked for a way to limit the number of jobs running in parallel on AWX but couldn't find one, and the number of forks in my AWX configuration is 100. My server has 12 CPUs and 20 GB of RAM, so I don't think it's a resource issue.
My k3s version is: v1.24.4+k3s1
My AWX version is: 21.5.0
My OS is: CentOS Linux release 7.9.2009
Does anyone have a solution or a workaround for this? It is a real problem, because a whole workflow can fail because of a single job hitting this error.
Thanks