Receptor error when starting jobs in new AWX deployment in AKS

Bug Summary

Hi Team, I am testing AWX in an AKS cluster. This is a private cluster, so I am unable to deploy with the AWX operator. Instead, I created a separate public cluster, took the Deployments, Secrets, ConfigMaps, ServiceAccount, and RoleBindings from it, and applied them to my private cluster. I was able to bring up AWX in the private cluster. However, when I run jobs, I see this error:

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/receptor.py", line 372, in _run_internal
    resultsock = receptor_ctl.get_work_results(self.unit_id, return_sockfile=True)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/receptorctl/socket_interface.py", line 248, in get_work_results
    self.writestr(f"work results {unit_id}\n")
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/receptorctl/socket_interface.py", line 49, in writestr
    self._sockfile.flush()
  File "/usr/lib64/python3.9/socket.py", line 722, in write
    return self._sock.send(b)
BrokenPipeError: [Errno 32] Broken pipe

In the AWX task logs, I see:
2024-04-18 06:56:20,999 ERROR [9c1dd129d187483c8fd7d2d13aa2739c] awx.main.tasks.receptor An error was encountered while getting status for work unit iv1jPZsF
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/receptor.py", line 356, in _run_internal
ConnectionResetError: [Errno 104] Connection reset by peer

When I investigated further, in receptor, I see this:
{
  "iv1jPZsF": {
    "Detail": "Pod created",
    "ExtraData": {
      "Command": "",
      "Image": "",
      "KubeConfig": "",
      "KubeNamespace": "",
      "KubePod": "",
      "Params": "",
      "PodName": "automation-job-44-69vvn"
    },
    "State": 0,
    "StateName": "Pending",
    "StdoutSize": 0,
    "WorkType": "kubernetes-incluster-auth"
  }
}
bash-5.1$ /var/lib/awx/venv/awx/bin/receptorctl --socket /run/receptor/receptor.sock work list
ERROR: [Errno 111] Connection refused

By any chance do you know what could be happening here? Any pointers would be really helpful. I am using AWX 21.4 as that is what we are using in our Prod clusters. We have deployed this in AKS clusters.

I see this error in AWX EE:
DEBUG 2024/04/18 08:41:38 Client connected to control service @
DEBUG 2024/04/18 08:41:38 Control service closed
DEBUG 2024/04/18 08:41:38 Client disconnected from control service @
DEBUG 2024/04/18 08:41:38 Client connected to control service @
DEBUG 2024/04/18 08:41:40 Stdout complete - closing channel for: BDE6e0Ef
WARNING 2024/04/18 08:41:40 Could not read in control service: read unix /var/run/receptor/receptor.sock->@: use of closed network connection
DEBUG 2024/04/18 08:41:40 Client disconnected from control service @
WARNING 2024/04/18 08:41:40 Could not close connection: close unix /var/run/receptor/receptor.sock->@: use of closed network connection
DEBUG 2024/04/18 08:41:40 Client connected to control service @
DEBUG 2024/04/18 08:41:40 Control service closed
DEBUG 2024/04/18 08:41:40 Client disconnected from control service @
DEBUG 2024/04/18 08:41:40 Client connected to control service @
DEBUG 2024/04/18 08:41:40 Control service closed
DEBUG 2024/04/18 08:41:40 Client disconnected from control service @
DEBUG 2024/04/18 08:41:40 Client connected to control service @
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x11ba558]

After that, I see this error in the awx_task logs:
2024-04-18 08:41:40,986 DEBUG [a6dbacaa718a4cc6a108738b0a75349b] awx.analytics.job_lifecycle inventoryupdate-64 work unit id assigned
2024-04-18 08:41:43,453 INFO [a6dbacaa718a4cc6a108738b0a75349b] awx.main.commands.run_callback_receiver Starting EOF event processing for Job 64
2024-04-18 08:41:43,453 INFO [a6dbacaa718a4cc6a108738b0a75349b] awx.main.commands.run_callback_receiver Starting EOF event processing for Job 64
2024-04-18 08:41:43,469 ERROR [a6dbacaa718a4cc6a108738b0a75349b] awx.main.tasks.receptor An error was encountered while getting status for work unit BurUgPkP
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/receptor.py", line 356, in _run_internal
    unit_status = receptor_ctl.simple_command(f'work status {self.unit_id}')

In AWX Task receptor I see this:
{
  "iv1jPZsF": {
    "Detail": "Pod created",
    "ExtraData": {
      "Command": "",
      "Image": "",
      "KubeConfig": "",
      "KubeNamespace": "",
      "KubePod": "",
      "Params": "",
      "PodName": "automation-job-44-69vvn"
    },
    "State": 0,
    "StateName": "Pending",
    "StdoutSize": 0,
    "WorkType": "kubernetes-incluster-auth"
  }
}
bash-5.1$ /var/lib/awx/venv/awx/bin/receptorctl --socket /run/receptor/receptor.sock work list
ERROR: [Errno 111] Connection refused

I have also raised a GitHub issue for this: "Receptor error when starting jobs in new AWX deployment in AKS", Issue #1831 in ansible/awx-operator.

Thanks and Regards,
Mani

We have the Istio service mesh configured in our system. I am wondering whether this could be causing the issue.

Which image is used for the awx-ee container in the awx-task pod? If it is awx-ee:latest, is it an older latest exported from the running Prod environment, or a freshly pulled latest?

We are retrieving the image from our ACR. We are referencing with AWX EE 2.12.5, because AWX version is 21.4. We went with AWX EE 2.12.5 because that was the recommended version for AWX 21.4.

I can see the image being pulled, and I can see that the underlying receptor interaction between awx-task and awx-ee fails on the initial connection, which is strange.

That is why I suspected the underlying AKS networking or the Istio service mesh.
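
One way to check that suspicion is to list the containers in each pod and look for an injected sidecar. The following is a minimal sketch, assuming `kubectl` access to the cluster and assuming the sidecar uses Istio's default container name `istio-proxy`; the namespace `awx` in the usage comment is a placeholder, not something stated in this thread.

```shell
# Sketch only: the namespace is a placeholder for wherever AWX is deployed.

# Print each pod name followed by the names of its containers.
# A pod with an injected Istio sidecar normally lists "istio-proxy".
# Usage: list_pod_containers <namespace>
list_pod_containers() {
  kubectl get pods -n "$1" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].name}{"\n"}{end}'
}

# Example (hypothetical namespace):
#   list_pod_containers awx
```

If `istio-proxy` shows up next to the AWX containers or the automation-job pods, the mesh is in the traffic path and is worth ruling out.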

There is no such tag as 2.12.5 on Quay.

Where did your EE image come from? Is it a mirror of the official quay.io/ansible/awx-ee, or a custom image built with ansible-builder?
Which version of Receptor is running in the awx-ee container in the awx-task pod?

Yes that is accurate. I am using ansible-core 2.12.5 and building a custom AWX EE image.

Can you please let me know how I can check the version for receptor in awx-task and awx-ee pods?

Hi Kurokobo, Can you please let me know how I can check the version for receptor in awx-task and awx-ee pods?

Just run receptor --version in the awx-ee container in awx-task pod.
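
From outside the pod, the same check can be run through `kubectl exec`. This is a minimal sketch: the namespace, pod name, and container name in the usage comment are placeholders and should be replaced with the values from your own deployment.

```shell
# Sketch only: namespace, pod, and container names are placeholders.

# Run "receptor --version" inside a named container of a pod.
# Usage: receptor_version <namespace> <pod> <container>
receptor_version() {
  kubectl exec -n "$1" "$2" -c "$3" -- receptor --version
}

# Example (hypothetical names):
#   receptor_version awx awx-task-xxxxx awx-ee
```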

By the way, for this type of simple question, you can get an answer much faster (in about 5 minutes) by referring to the official receptor docs, asking ChatGPT, or searching Google, rather than waiting 5 days without doing anything. You may also need to make an effort to find solutions yourself. Thanks for your understanding.

Thanks for your response and help, and apologies for the delayed reply. We figured out the issue: Istio was interfering with the connectivity between the AWX and AWX EE pods.

@mani3887
Hi, thanks for updating, glad to hear that the issue has been solved.

It’s okay to close this thread but if possible, could you update here with how you discovered the root cause and how you resolved it?

Istio is a widely used service mesh, so someone else may well face the same issue in the future, and this thread could be a good reference for troubleshooting it if you can provide additional information.

Thanks!

I observed that the AWX task was transmitting messages to the Receptor websocket connection, and AWX EE received these messages. However, the connection closed when AWX EE attempted to respond. AWX is deployed in an AKS cluster using the kubenet plugin. I suspected that an external process was terminating this connection. During my investigation with Azure support for the kubenet plugin, they suggested that using the kubenet plugin might introduce an additional hop. I also considered Istio as a potential cause, since I had installed a sidecar with AWX. After disabling the sidecar, I noticed the problem was resolved.
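
For readers in a similar setup, here is a hedged sketch of one way to turn off sidecar injection for a single Deployment. It is not necessarily how the sidecar was disabled in this thread. It assumes Istio's per-pod `sidecar.istio.io/inject` control, which Istio reads as a pod label or annotation depending on version, so the sketch sets both. The namespace and Deployment name in the usage comment are placeholders.

```shell
# Sketch only: namespace and Deployment name are placeholders.

# Emit a merge patch that sets sidecar.istio.io/inject to "false"
# as both a label and an annotation on the pod template.
no_sidecar_patch() {
  printf '%s' '{"spec":{"template":{"metadata":{"labels":{"sidecar.istio.io/inject":"false"},"annotations":{"sidecar.istio.io/inject":"false"}}}}}'
}

# Apply the patch; the Deployment then rolls out new pods without the sidecar.
# Usage: disable_sidecar <namespace> <deployment>
disable_sidecar() {
  kubectl patch deployment "$2" -n "$1" --type merge -p "$(no_sidecar_patch)"
}

# Example (hypothetical names):
#   disable_sidecar awx awx-task
```

Note that the automation-job pods are created separately from the Deployment, so this patch does not cover them. If injection is enabled for the whole namespace, those pods need their own handling.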


@mani3887
Thanks for your contribution :smiley: