Long running Ansible jobs in AWX timeout/fail

Would anyone be able to help me investigate an issue we are having with long running Ansible jobs that just seem to crash/die in middle of job? I realize that is a vague statement. Right now I can re-produce the problem in both my lab and production environments which are running the same version of AWX on the same version of OpenShift.

Versions
AWX 23.8.0
AWX operator (awx-operator.v2.12.0)
Red Hat OpenShift 4.11.57.

kubectl -n tts-lab-awx exec -it automation-job-1152-mvg7d – env | grep ANSIBLE_RUNNER_KEEPALIVE_SECOND
ANSIBLE_RUNNER_KEEPALIVE_SECONDS=30

kubectl -n tts-lab-awx exec -it automation-job-1152-mvg7d – receptor --version
1.4.4+gc75b1f6

kubectl -n tts-lab-awx exec -it automation-job-1152-mvg7d – ansible-runner --version
2.3.5

I’m happy to provide more information but I am pretty new to AWX.

I did increase our containerLogMaxSize to 200mb for better visibility.
I also set K8S Ansible Runner Keep-Alive Message Interval to 30.

Right now I am just trying to run a simple Ansible playbook that simply pauses for 60 minutes for troubleshooting. This job will always fail.

After anywhere from 3 to 10+ minutes, the AWX Job will fail with: Task was canceled due to receiving a shutdown signal.

I wonder if changing the behavior of RECEPTOR_KUBE_SUPPORT_RECONNECT would help?

I am seeing these errors in the awx-task pod which I assume might be related?

RuntimeError: Task got bad yield: <coroutine object WebSocketRelayManager.cleanup_offline_host at 0x7fb39c53b4c0>
2024-03-04 20:06:15,885 WARN exited: wsrelay (exit status 1; not expected)
2024-03-04 20:06:15,885 WARN exited: wsrelay (exit status 1; not expected)
2024-03-04 20:06:16,888 INFO spawned: 'wsrelay' with pid 241
2024-03-04 20:06:16,888 INFO spawned: 'wsrelay' with pid 241
2024-03-04 20:06:19,309 INFO     [-] awx.main.wsrelay Active instance with hostname awx-lab-task-845bbc4f89-w6wkz is registered.
2024-03-04 20:06:39,326 INFO     [-] awx.main.wsrelay Adding {'awx-lab-web-6855c4d646-9f8wc'} to websocket broadcast list
2024-03-04 20:06:39,343 WARNING  [-] awx.main.wsrelay Connection from awx-lab-task-845bbc4f89-w6wkz to 10.129.6.8 failed for unknown reason: '403, message='Invalid response status', url=URL('http://10.129.6.8:8052/websocket/relay/')'.
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/wsrelay.py", line 89, in connect
    async with session.ws_connect(uri, ssl=self.verify_ssl, heartbeat=20) as websocket:
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/aiohttp/client.py", line 1141, in __aenter__
    self._resp = await self._coro
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/aiohttp/client.py", line 795, in _ws_connect
    raise WSServerHandshakeError(
aiohttp.client_exceptions.WSServerHandshakeError: 403, message='Invalid response status', url=URL('http://10.129.6.8:8052/websocket/relay/')
2024-03-04 20:06:47,352 INFO success: wsrelay entered RUNNING state, process has stayed up for > than 30 seconds (startsecs)
2024-03-04 20:06:47,352 INFO success: wsrelay entered RUNNING state, process has stayed up for > than 30 seconds (startsecs)
2024-03-04 20:06:49,333 INFO     [-] awx.main.wsrelay Removing {'awx-lab-web-6855c4d646-9f8wc'} from websocket broadcast list
/usr/lib64/python3.9/asyncio/events.py:80: RuntimeWarning: coroutine 'WebSocketRelayManager.cleanup_offline_host' was never awaited
  self._context.run(self._callback, *self._args)
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
Traceback (most recent call last):
  File "/usr/bin/awx-manage", line 8, in <module>
    sys.exit(manage())
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/__init__.py", line 175, in manage
    execute_from_command_line(sys.argv)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/core/management/__init__.py", line 442, in execute_from_command_line
    utility.execute()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/core/management/__init__.py", line 436, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/core/management/base.py", line 412, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/core/management/base.py", line 458, in execute
    output = self.handle(*args, **options)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/management/commands/run_wsrelay.py", line 171, in handle
    asyncio.run(websocket_relay_manager.run())
  File "/usr/lib64/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib64/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/wsrelay.py", line 342, in run
    await asyncio.gather(self.cleanup_offline_host(h) for h in deleted_remote_hosts)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/wsrelay.py", line 342, in <genexpr>
    await asyncio.gather(self.cleanup_offline_host(h) for h in deleted_remote_hosts)
RuntimeError: Task got bad yield: <coroutine object WebSocketRelayManager.cleanup_offline_host at 0x7f7216d729c0>
2024-03-04 20:06:49,757 WARN exited: wsrelay (exit status 1; not expected)
2024-03-04 20:06:49,757 WARN exited: wsrelay (exit status 1; not expected)
2024-03-04 20:06:50,760 INFO spawned: 'wsrelay' with pid 249
2024-03-04 20:06:50,760 INFO spawned: 'wsrelay' with pid 249
2024-03-04 20:06:53,165 INFO     [-] awx.main.wsrelay Active instance with hostname awx-lab-task-845bbc4f89-w6wkz is registered.
2024-03-04 20:07:13,187 INFO     [-] awx.main.wsrelay Adding {'awx-lab-web-6855c4d646-9f8wc'} to websocket broadcast list
2024-03-04 20:07:13,209 WARNING  [-] awx.main.wsrelay Connection from awx-lab-task-845bbc4f89-w6wkz to 10.129.6.8 failed for unknown reason: '403, message='Invalid response status', url=URL('http://10.129.6.8:8052/websocket/relay/')'.
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/wsrelay.py", line 89, in connect
    async with session.ws_connect(uri, ssl=self.verify_ssl, heartbeat=20) as websocket:
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/aiohttp/client.py", line 1141, in __aenter__
    self._resp = await self._coro
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/aiohttp/client.py", line 795, in _ws_connect
    raise WSServerHandshakeError(
aiohttp.client_exceptions.WSServerHandshakeError: 403, message='Invalid response status', url=URL('http://10.129.6.8:8052/websocket/relay/')
2024-03-04 20:07:21,218 INFO success: wsrelay entered RUNNING state, process has stayed up for > than 30 seconds (startsecs)
2024-03-04 20:07:21,218 INFO success: wsrelay entered RUNNING state, process has stayed up for > than 30 seconds (startsecs)
2024-03-04 20:07:23,195 INFO     [-] awx.main.wsrelay Removing {'awx-lab-web-6855c4d646-9f8wc'} from websocket broadcast list
/usr/lib64/python3.9/asyncio/events.py:80: RuntimeWarning: coroutine 'WebSocketRelayManager.cleanup_offline_host' was never awaited
  self._context.run(self._callback, *self._args)
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
Traceback (most recent call last):
  File "/usr/bin/awx-manage", line 8, in <module>
    sys.exit(manage())
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/__init__.py", line 175, in manage
    execute_from_command_line(sys.argv)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/core/management/__init__.py", line 442, in execute_from_command_line
    utility.execute()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/core/management/__init__.py", line 436, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/core/management/base.py", line 412, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/core/management/base.py", line 458, in execute
    output = self.handle(*args, **options)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/management/commands/run_wsrelay.py", line 171, in handle
    asyncio.run(websocket_relay_manager.run())
  File "/usr/lib64/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib64/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/wsrelay.py", line 342, in run
    await asyncio.gather(self.cleanup_offline_host(h) for h in deleted_remote_hosts)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/wsrelay.py", line 342, in <genexpr>
    await asyncio.gather(self.cleanup_offline_host(h) for h in deleted_remote_hosts)

This is a known issue in 23.8.0 and it is not recommended to use this version. Please update to 23.8.1.

awx.main.wsrelay records repeated errors due to 403 with ‘Invalid response status’ over ‘/websocket/relay/’ · Issue #14876 · ansible/awx (github.com)

2 Likes

Thanks! I will upgrade to 23.8.1 and test again.
Much appreciated!