Bug Summary
Hi Team, I am testing AWX in an AKS cluster. This is a private cluster, so I am unable to deploy with the AWX operator. Instead, I created a separate public cluster, took the Deployments, Secrets, ConfigMaps, ServiceAccount, and RoleBindings from it, and applied them to my private cluster. I was able to bring up AWX in the private cluster this way. However, when I run jobs, I see this error:
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/receptor.py", line 372, in _run_internal
    resultsock = receptor_ctl.get_work_results(self.unit_id, return_sockfile=True)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/receptorctl/socket_interface.py", line 248, in get_work_results
    self.writestr(f"work results {unit_id}\n")
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/receptorctl/socket_interface.py", line 49, in writestr
    self._sockfile.flush()
  File "/usr/lib64/python3.9/socket.py", line 722, in write
    return self._sock.send(b)
BrokenPipeError: [Errno 32] Broken pipe
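For context on what this exception means: BrokenPipeError (EPIPE) is raised when a process writes to a socket whose other end has already been closed, which suggests the receptor side of the control socket went away mid-request. The following is only a minimal sketch that reproduces the same exception on Linux with a plain Unix socket pair. It involves neither AWX nor receptor, the function name `write_after_peer_close` is hypothetical, and the payload merely mirrors the command in the traceback.

```python
import socket


def write_after_peer_close() -> str:
    """Return the name of the exception raised when sending to a closed peer."""
    # A connected pair of Unix stream sockets stands in for client and receptor.
    client, server = socket.socketpair()
    server.close()  # simulate the receptor end of the socket disappearing
    try:
        client.send(b"work results example-unit\n")
    except OSError as exc:
        return type(exc).__name__
    finally:
        client.close()
    return "no error"


print(write_after_peer_close())
```

If this is what happens in my deployment, the Python side is probably only reporting that receptor dropped the connection, and the underlying cause would be on the receptor side.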
In the AWX task logs, I see:
2024-04-18 06:56:20,999 ERROR [9c1dd129d187483c8fd7d2d13aa2739c] awx.main.tasks.receptor An error was encountered while getting status for work unit iv1jPZsF
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/receptor.py", line 356, in _run_internal
ConnectionResetError: [Errno 104] Connection reset by peer
When I investigated further, I see this work unit in receptor:
{
  "iv1jPZsF": {
    "Detail": "Pod created",
    "ExtraData": {
      "Command": "",
      "Image": "",
      "KubeConfig": "",
      "KubeNamespace": "",
      "KubePod": "",
      "Params": "",
      "PodName": "automation-job-44-69vvn"
    },
    "State": 0,
    "StateName": "Pending",
    "StdoutSize": 0,
    "WorkType": "kubernetes-incluster-auth"
  }
}
bash-5.1$ /var/lib/awx/venv/awx/bin/receptorctl --socket /run/receptor/receptor.sock work list
ERROR: [Errno 111] Connection refused
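Errno 111 means the connection attempt was refused, that is, nothing was accepting connections on that socket path at that moment. To tell a missing socket file apart from a socket file left behind by a dead listener, something like the sketch below could be run inside the container. It uses only the Python standard library. The function name `probe_unix_socket` and its return strings are hypothetical, and the path is the one I passed to receptorctl above.

```python
import errno
import os
import socket


def probe_unix_socket(path: str, timeout: float = 2.0) -> str:
    """Classify the state of a Unix domain socket path."""
    if not os.path.exists(path):
        return "missing"  # no socket file at this path at all
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect(path)
    except ConnectionRefusedError:
        # The file exists but no process is accepting on it (Errno 111).
        return "refused"
    except OSError as exc:
        # Any other failure, e.g. permission denied, reported by errno name.
        return f"error: {errno.errorcode.get(exc.errno, exc.errno)}"
    finally:
        sock.close()
    return "listening"


if __name__ == "__main__":
    print(probe_unix_socket("/run/receptor/receptor.sock"))
```

A result of "refused" would be consistent with the receptor process having exited while its socket file stayed on disk.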
Do you know what could be happening here? Any pointers would be really helpful. I am using AWX 21.4, as that is the version we run in our Prod clusters. All of this is deployed on AKS clusters.
I see this error in the AWX EE container:
DEBUG 2024/04/18 08:41:38 Client connected to control service @
DEBUG 2024/04/18 08:41:38 Control service closed
DEBUG 2024/04/18 08:41:38 Client disconnected from control service @
DEBUG 2024/04/18 08:41:38 Client connected to control service @
DEBUG 2024/04/18 08:41:40 Stdout complete - closing channel for: BDE6e0Ef
WARNING 2024/04/18 08:41:40 Could not read in control service: read unix /var/run/receptor/receptor.sock->@: use of closed network connection
DEBUG 2024/04/18 08:41:40 Client disconnected from control service @
WARNING 2024/04/18 08:41:40 Could not close connection: close unix /var/run/receptor/receptor.sock->@: use of closed network connection
DEBUG 2024/04/18 08:41:40 Client connected to control service @
DEBUG 2024/04/18 08:41:40 Control service closed
DEBUG 2024/04/18 08:41:40 Client disconnected from control service @
DEBUG 2024/04/18 08:41:40 Client connected to control service @
DEBUG 2024/04/18 08:41:40 Control service closed
DEBUG 2024/04/18 08:41:40 Client disconnected from control service @
DEBUG 2024/04/18 08:41:40 Client connected to control service @
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x11ba558]
After that, I see this error in awx_task:
2024-04-18 08:41:40,986 DEBUG [a6dbacaa718a4cc6a108738b0a75349b] awx.analytics.job_lifecycle inventoryupdate-64 work unit id assigned
2024-04-18 08:41:43,453 INFO [a6dbacaa718a4cc6a108738b0a75349b] awx.main.commands.run_callback_receiver Starting EOF event processing for Job 64
2024-04-18 08:41:43,453 INFO [a6dbacaa718a4cc6a108738b0a75349b] awx.main.commands.run_callback_receiver Starting EOF event processing for Job 64
2024-04-18 08:41:43,469 ERROR [a6dbacaa718a4cc6a108738b0a75349b] awx.main.tasks.receptor An error was encountered while getting status for work unit BurUgPkP
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/receptor.py", line 356, in _run_internal
    unit_status = receptor_ctl.simple_command(f'work status {self.unit_id}')
I have also raised a GitHub issue for this: "Receptor error when starting jobs in new AWX deployment in AKS", Issue #1831 in ansible/awx-operator.
Thanks and Regards,
Mani