AWX 22.5 upgrade on 2 K8s clusters - facing an issue with the load balancer - Urgent, business is down

We run AWX in a two-cluster K8s configuration (with a common external Postgres DB), with container instances as execution environments.

When we start only one cluster, everything works fine.

When we bring up the second cluster, “awx.main.wsrelay” tries to connect from pods in cluster 1 to pods in cluster 2 (and the other way around).
Because it can’t reach the other cluster's pods, the coroutine ‘WebSocketRelayManager.cleanup_offline_host’ fails, and the pod marks itself as failing.

In the end, all task pods are restarted repeatedly until they go into back-off.

Can we somehow isolate the websocket relay system (“heartbeat” and “wsrelay”) and group the pods per cluster?

Or is this behaviour a bug? (https://github.com/ansible/awx/blob/devel/docs/websockets.md)
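To make the "group the pods per cluster" idea concrete, here is a rough sketch of the filtering we have in mind. It is only an illustration: the helper names (cluster_of, group_by_cluster, local_peers) are hypothetical and not part of AWX, and it assumes pod hostnames follow the "<deployment>-web-..." / "<deployment>-task-..." pattern seen in our logs.

```python
from collections import defaultdict


def cluster_of(hostname: str) -> str:
    """Return the deployment prefix before '-web-' or '-task-' (assumed naming)."""
    for marker in ("-web-", "-task-"):
        prefix, found, _ = hostname.partition(marker)
        if found:
            return prefix
    # No known marker: treat the hostname as its own group.
    return hostname


def group_by_cluster(hostnames):
    """Map each deployment prefix to the set of hostnames that carry it."""
    groups = defaultdict(set)
    for name in hostnames:
        groups[cluster_of(name)].add(name)
    return dict(groups)


def local_peers(own_hostname, hostnames):
    """Return only the hostnames that share the caller's deployment prefix."""
    return group_by_cluster(hostnames).get(cluster_of(own_hostname), set())
```

With something like this, a task pod in cluster 2 would only try to relay to web pods whose names start with the same deployment prefix. Whether AWX offers a supported way to restrict the relay like this is exactly what we are asking.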

Logs:

awx-test2-task 2023-07-20 10:34:45,625 INFO success: superwatcher entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
awx-test2-task 2023-07-20 10:34:45,625 INFO success: superwatcher entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
awx-test2-task 2023-07-20 10:34:46,739 INFO [-] awx.main.commands.run_callback_receiver Callback receiver started with pid=50
awx-test2-task 2023-07-20 10:34:46,764 INFO [-] awx.main.wsrelay Active instance with hostname awx-test2-task-5<>7bsn8 is registered.
awx-test2-task 2023-07-20 10:34:46,807 WARNING [-] awx.main.dispatch.periodic periodic beat started
awx-test2-task 2023-07-20 10:34:46,832 INFO [-] awx.main.dispatch Running worker dispatcher listening to queues ['tower_broadcast_all', 'tower_settings_change', 'awx-test2-task-<>-7bsn8']
awx-test2-task 2023-07-20 10:34:56,776 INFO [-] awx.main.wsrelay Adding {'awx-test2-web-6<>d7-tzscp', 'awx-test2-web-6<>7cdd7-xqw29', 'awx-test1-web-6<>c-29wzn', 'awx-test1-web-6<>8
awx-test2-task 2023-07-20 10:34:56,794 INFO [-] awx.main.wsrelay Connection from awx-test2-task-5<>5-7bsn8 to 198.0.0.0 established.
awx-test2-task 2023-07-20 10:34:56,795 INFO [-] awx.main.wsrelay Starting producer for metrics
awx-test2-task 2023-07-20 10:34:56,798 INFO [-] awx.main.wsrelay Connection from awx-test2-task-584bdc44f5-7bsn8 to 198.0.0.0 established.
awx-test2-task 2023-07-20 10:34:56,798 INFO [-] awx.main.wsrelay Starting producer for metrics
awx-test2-task 2023-07-20 10:35:06,780 INFO [-] awx.main.wsrelay Removing {'awx-test1-web-6<>c-29wzn', 'awx-test1-web-68<>fc-zx8sf'} from websocket broadcast list
awx-test2-task /usr/lib64/python3.9/asyncio/events.py:80: RuntimeWarning: coroutine 'WebSocketRelayManager.cleanup_offline_host' was never awaited
awx-test2-task self._context.run(self._callback, *self._args)
awx-test2-task RuntimeWarning: Enable tracemalloc to get the object allocation traceback
awx-test2-task 2023-07-20 10:35:06,789 WARNING [-] awx.main.wsrelay Connection from awx-test2-task-5<>5-7bsn8 to 172.0.0.x cancelled. ->> Cluster1
awx-test2-task 2023-07-20 10:35:06,790 WARNING [-] awx.main.wsrelay Connection from awx-test2-task-5<>5-7bsn8 to 172.x.x.x.x cancelled. ->> Cluster1
awx-test2-task 2023-07-20 10:35:06,791 WARNING [-] awx.main.wsrelay Connection from awx-test2-task-5<>5-7bsn8 to 198.x.x.x cancelled. ->> Cluster2
awx-test2-task 2023-07-20 10:35:06,793 WARNING [-] awx.main.wsrelay Connection from awx-test2-task-5<>5-7bsn8 to 198.x.x.x cancelled. ->> Cluster2

awx-test2-task Traceback (most recent call last):
awx-test2-task   File "/usr/bin/awx-manage", line 8, in
awx-test2-task     sys.exit(manage())
awx-test2-task   File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/__init__.py", line 200, in manage
awx-test2-task     execute_from_command_line(sys.argv)
awx-test2-task   File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/core/management/__init__.py", line 442, in execute_from_command_line
awx-test2-task     utility.execute()
awx-test2-task   File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/core/management/__init__.py", line 436, in execute
awx-test2-task     self.fetch_command(subcommand).run_from_argv(self.argv)
awx-test2-task   File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/core/management/base.py", line 412, in run_from_argv
awx-test2-task     self.execute(*args, **cmd_options)
awx-test2-task   File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/core/management/base.py", line 458, in execute
awx-test2-task     output = self.handle(*args, **options)
awx-test2-task   File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/management/commands/run_wsrelay.py", line 168, in handle
awx-test2-task     asyncio.run(websocket_relay_manager.run())
awx-test2-task   File "/usr/lib64/python3.9/asyncio/runners.py", line 44, in run
awx-test2-task     return loop.run_until_complete(main)
awx-test2-task   File "/usr/lib64/python3.9/asyncio/base_events.py", line 647, in run_until_complete
awx-test2-task     return future.result()
awx-test2-task   File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/wsrelay.py", line 330, in run
awx-test2-task     await asyncio.gather(self.cleanup_offline_host(h) for h in deleted_remote_hosts)
awx-test2-task   File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/wsrelay.py", line 330, in
awx-test2-task     await asyncio.gather(self.cleanup_offline_host(h) for h in deleted_remote_hosts)
awx-test2-task RuntimeError: Task got bad yield: <coroutine object WebSocketRelayManager.cleanup_offline_host at 0x<>40>
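One possible reading of the traceback above, not confirmed against the AWX source: asyncio.gather() takes its awaitables as separate positional arguments, but line 330 hands it a single generator expression. On Python 3.9 that generator appears to be treated as one legacy coroutine, which would explain both the "Task got bad yield" error and the "was never awaited" warning. The sketch below shows the unpacked form; cleanup_offline_host here is a hypothetical stand-in, not the real AWX method, and cleanup_all / run_cleanup are illustrative names.

```python
import asyncio


async def cleanup_offline_host(hostname: str) -> str:
    """Hypothetical stand-in for the real cleanup coroutine."""
    await asyncio.sleep(0)  # yield to the event loop once, as real I/O would
    return hostname


async def cleanup_all(deleted_remote_hosts: set) -> list:
    # The '*' unpacks the generator so gather() receives one awaitable per
    # host instead of a single generator object.
    return await asyncio.gather(
        *(cleanup_offline_host(h) for h in deleted_remote_hosts)
    )


def run_cleanup(hosts: set) -> list:
    """Run the cleanup for every host and return the hostnames, sorted."""
    return sorted(asyncio.run(cleanup_all(hosts)))
```

If this reading is right, the crash is triggered whenever any remote host is removed from the broadcast list, which in our setup happens as soon as the pods of the other cluster become unreachable.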

awx-test2-task 2023-07-20 10:35:08,314 WARN exited: wsrelay (exit status 1; not expected)
awx-test2-task 2023-07-20 10:35:08,314 WARN exited: wsrelay (exit status 1; not expected)
awx-test2-task 2023-07-20 10:35:09,317 INFO spawned: 'wsrelay' with pid 133
awx-test2-task 2023-07-20 10:35:09,317 INFO spawned: 'wsrelay' with pid 133
awx-test2-task 2023-07-20 10:35:11,359 INFO [-] awx.main.wsrelay Active instance with hostname awx-test2-task-58<>5-7bsn8 is registered.

This repeats N times, and then the instance removes itself from capacity:

2023-07-20 11:00:48,825 INFO gave up: wsrelay entered FATAL state, too many start retries too quickly
awx-test2-task Processing Event: ver:3.0 server:supervisor serial:0 pool:superwatcher poolserial:0 eventname:PROCESS_STATE_FATAL len:64
awx-test2-task 2023-07-20 11:00:49,827 WARN received SIGQUIT indicating exit request
awx-test2-task 2023-07-20 11:00:49,827 WARN received SIGQUIT indicating exit request
awx-test2-task 2023-07-20 11:00:49,827 INFO waiting for superwatcher, dispatcher, callback-receiver to die
awx-test2-task 2023-07-20 11:00:49,827 INFO waiting for superwatcher, dispatcher, callback-receiver to die
awx-test2-task 2023-07-20 11:00:49,829 WARNING [24ff42c8c9c64921a6097197bec680a3] awx.main.dispatch received SIGTERM, stopping
awx-test2-task 2023-07-20 11:00:49,828 WARNING [-] awx.main.commands.run_callback_receiver received SIGTERM, stopping
awx-test2-task 2023-07-20 11:00:49,893 WARNING [24ff42c8c9c64921a6097197bec680a3] awx.main.tasks.system Normal shutdown signal for instance awx-test2-task-584bdc44f5-qfs4d, removed self from capacit
awx-test2-task 2023-07-20 11:00:50,432 INFO stopped: dispatcher (exit status 0)

Hello Community,

Any input on how we can resolve the issue? Thank you in advance.