Would anyone be able to help me investigate an issue we are having with long running Ansible jobs that just seem to crash/die in middle of job? I realize that is a vague statement. Right now I can re-produce the problem in both my lab and production environments which are running the same version of AWX on the same version of OpenShift.
Versions
AWX 23.8.0
AWX operator (awx-operator.v2.12.0)
Red Hat OpenShift 4.11.57.
kubectl -n tts-lab-awx exec -it automation-job-1152-mvg7d – env | grep ANSIBLE_RUNNER_KEEPALIVE_SECOND
ANSIBLE_RUNNER_KEEPALIVE_SECONDS=30
kubectl -n tts-lab-awx exec -it automation-job-1152-mvg7d – receptor --version
1.4.4+gc75b1f6
kubectl -n tts-lab-awx exec -it automation-job-1152-mvg7d – ansible-runner --version
2.3.5
I’m happy to provide more information but I am pretty new to AWX.
I did increase our containerLogMaxSize to 200mb for better visibility.
I also set K8S Ansible Runner Keep-Alive Message Interval to 30.
Right now I am just trying to run a simple Ansible playbook that simply pauses for 60 minutes for troubleshooting. This job will always fail.
After anywhere from 3 to 10+ minutes, the AWX Job will fail with: Task was canceled due to receiving a shutdown signal.
I wonder if changing the behavior of RECEPTOR_KUBE_SUPPORT_RECONNECT would help?
I am seeing these errors in the awx-task pod which I assume might be related?
RuntimeError: Task got bad yield: <coroutine object WebSocketRelayManager.cleanup_offline_host at 0x7fb39c53b4c0>
2024-03-04 20:06:15,885 WARN exited: wsrelay (exit status 1; not expected)
2024-03-04 20:06:15,885 WARN exited: wsrelay (exit status 1; not expected)
2024-03-04 20:06:16,888 INFO spawned: 'wsrelay' with pid 241
2024-03-04 20:06:16,888 INFO spawned: 'wsrelay' with pid 241
2024-03-04 20:06:19,309 INFO [-] awx.main.wsrelay Active instance with hostname awx-lab-task-845bbc4f89-w6wkz is registered.
2024-03-04 20:06:39,326 INFO [-] awx.main.wsrelay Adding {'awx-lab-web-6855c4d646-9f8wc'} to websocket broadcast list
2024-03-04 20:06:39,343 WARNING [-] awx.main.wsrelay Connection from awx-lab-task-845bbc4f89-w6wkz to 10.129.6.8 failed for unknown reason: '403, message='Invalid response status', url=URL('http://10.129.6.8:8052/websocket/relay/')'.
Traceback (most recent call last):
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/wsrelay.py", line 89, in connect
async with session.ws_connect(uri, ssl=self.verify_ssl, heartbeat=20) as websocket:
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/aiohttp/client.py", line 1141, in __aenter__
self._resp = await self._coro
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/aiohttp/client.py", line 795, in _ws_connect
raise WSServerHandshakeError(
aiohttp.client_exceptions.WSServerHandshakeError: 403, message='Invalid response status', url=URL('http://10.129.6.8:8052/websocket/relay/')
2024-03-04 20:06:47,352 INFO success: wsrelay entered RUNNING state, process has stayed up for > than 30 seconds (startsecs)
2024-03-04 20:06:47,352 INFO success: wsrelay entered RUNNING state, process has stayed up for > than 30 seconds (startsecs)
2024-03-04 20:06:49,333 INFO [-] awx.main.wsrelay Removing {'awx-lab-web-6855c4d646-9f8wc'} from websocket broadcast list
/usr/lib64/python3.9/asyncio/events.py:80: RuntimeWarning: coroutine 'WebSocketRelayManager.cleanup_offline_host' was never awaited
self._context.run(self._callback, *self._args)
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
Traceback (most recent call last):
File "/usr/bin/awx-manage", line 8, in <module>
sys.exit(manage())
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/__init__.py", line 175, in manage
execute_from_command_line(sys.argv)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/core/management/__init__.py", line 442, in execute_from_command_line
utility.execute()
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/core/management/__init__.py", line 436, in execute
self.fetch_command(subcommand).run_from_argv(self.argv)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/core/management/base.py", line 412, in run_from_argv
self.execute(*args, **cmd_options)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/core/management/base.py", line 458, in execute
output = self.handle(*args, **options)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/management/commands/run_wsrelay.py", line 171, in handle
asyncio.run(websocket_relay_manager.run())
File "/usr/lib64/python3.9/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/usr/lib64/python3.9/asyncio/base_events.py", line 647, in run_until_complete
return future.result()
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/wsrelay.py", line 342, in run
await asyncio.gather(self.cleanup_offline_host(h) for h in deleted_remote_hosts)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/wsrelay.py", line 342, in <genexpr>
await asyncio.gather(self.cleanup_offline_host(h) for h in deleted_remote_hosts)
RuntimeError: Task got bad yield: <coroutine object WebSocketRelayManager.cleanup_offline_host at 0x7f7216d729c0>
2024-03-04 20:06:49,757 WARN exited: wsrelay (exit status 1; not expected)
2024-03-04 20:06:49,757 WARN exited: wsrelay (exit status 1; not expected)
2024-03-04 20:06:50,760 INFO spawned: 'wsrelay' with pid 249
2024-03-04 20:06:50,760 INFO spawned: 'wsrelay' with pid 249
2024-03-04 20:06:53,165 INFO [-] awx.main.wsrelay Active instance with hostname awx-lab-task-845bbc4f89-w6wkz is registered.
2024-03-04 20:07:13,187 INFO [-] awx.main.wsrelay Adding {'awx-lab-web-6855c4d646-9f8wc'} to websocket broadcast list
2024-03-04 20:07:13,209 WARNING [-] awx.main.wsrelay Connection from awx-lab-task-845bbc4f89-w6wkz to 10.129.6.8 failed for unknown reason: '403, message='Invalid response status', url=URL('http://10.129.6.8:8052/websocket/relay/')'.
Traceback (most recent call last):
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/wsrelay.py", line 89, in connect
async with session.ws_connect(uri, ssl=self.verify_ssl, heartbeat=20) as websocket:
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/aiohttp/client.py", line 1141, in __aenter__
self._resp = await self._coro
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/aiohttp/client.py", line 795, in _ws_connect
raise WSServerHandshakeError(
aiohttp.client_exceptions.WSServerHandshakeError: 403, message='Invalid response status', url=URL('http://10.129.6.8:8052/websocket/relay/')
2024-03-04 20:07:21,218 INFO success: wsrelay entered RUNNING state, process has stayed up for > than 30 seconds (startsecs)
2024-03-04 20:07:21,218 INFO success: wsrelay entered RUNNING state, process has stayed up for > than 30 seconds (startsecs)
2024-03-04 20:07:23,195 INFO [-] awx.main.wsrelay Removing {'awx-lab-web-6855c4d646-9f8wc'} from websocket broadcast list
/usr/lib64/python3.9/asyncio/events.py:80: RuntimeWarning: coroutine 'WebSocketRelayManager.cleanup_offline_host' was never awaited
self._context.run(self._callback, *self._args)
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
Traceback (most recent call last):
File "/usr/bin/awx-manage", line 8, in <module>
sys.exit(manage())
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/__init__.py", line 175, in manage
execute_from_command_line(sys.argv)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/core/management/__init__.py", line 442, in execute_from_command_line
utility.execute()
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/core/management/__init__.py", line 436, in execute
self.fetch_command(subcommand).run_from_argv(self.argv)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/core/management/base.py", line 412, in run_from_argv
self.execute(*args, **cmd_options)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/core/management/base.py", line 458, in execute
output = self.handle(*args, **options)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/management/commands/run_wsrelay.py", line 171, in handle
asyncio.run(websocket_relay_manager.run())
File "/usr/lib64/python3.9/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/usr/lib64/python3.9/asyncio/base_events.py", line 647, in run_until_complete
return future.result()
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/wsrelay.py", line 342, in run
await asyncio.gather(self.cleanup_offline_host(h) for h in deleted_remote_hosts)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/wsrelay.py", line 342, in <genexpr>
await asyncio.gather(self.cleanup_offline_host(h) for h in deleted_remote_hosts)