Websockets not working after a little while

Hi,

I’m having an issue where websockets stop working roughly 20-60 minutes after the application has been deployed. This breaks the task container’s ability to post job stdout and makes it impossible to view job details in the UI.

I saw that an issue was opened here: https://github.com/ansible/awx/issues/1861 and have added comments with my findings as I go.

I have enabled verbose logging on Daphne but I’m a little perplexed.

Daphne is showing that the websocket is opened:

2018-06-27 03:19:21,372 DEBUG Upgraded connection daphne.response.XbupPxYRcS!aPmLgJGDZd to WebSocket daphne.response.XbupPxYRcS!hTzJudfDoM

Then, about three seconds later, nginx reports that the client closed the connection:

10.255.0.2 - - [27/Jun/2018:03:19:24 +0000] "GET /websocket/ HTTP/1.1" 499 0 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17682"
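nginx logs status 499 when the client closes the connection before nginx has sent a response. To see whether these pile up at a particular time after a deploy, the access log can be bucketed by hour. This is only a minimal sketch: it assumes the combined log format shown above, and the function name `count_499_per_hour` is made up for illustration.

```python
import re
from collections import Counter

# Pulls timestamp, request path and status out of a combined-format line.
# The quote classes also accept typographic quotes, in case the log was
# copied out of a rendered web page.
LINE = re.compile(
    r'\[(?P<ts>[^\]]+)\] ["“](?P<method>\S+) (?P<path>\S+) [^"”]*["”] (?P<status>\d{3}) '
)


def count_499_per_hour(lines, path="/websocket/"):
    """Count 499 responses for `path`, keyed by 'DD/Mon/YYYY:HH'."""
    counts = Counter()
    for line in lines:
        m = LINE.search(line)
        if m and m.group("path") == path and m.group("status") == "499":
            # First 14 characters of the timestamp are date plus hour.
            counts[m.group("ts")[:14]] += 1
    return counts
```

Feeding it the nginx access log (for example `count_499_per_hour(open("access.log"))`, with the path being whatever your container uses) should show whether the 499s start at a consistent interval after the containers come up.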

And then Daphne reports that the websocket has closed:

2018-06-27 03:19:25,571 DEBUG WebSocket closed for daphne.response.XbupPxYRcS!hTzJudfDoM
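To measure how long each socket actually stays open, the two Daphne DEBUG lines can be paired up by channel name. The following is a rough sketch that assumes the log lines look exactly like the two quoted above; `websocket_lifetimes` is a hypothetical helper, not part of Daphne.

```python
import re
from datetime import datetime

# Patterns for the two Daphne DEBUG lines quoted in this post.
OPENED = re.compile(r"^(\S+ \S+) DEBUG Upgraded connection \S+ to WebSocket (\S+)")
CLOSED = re.compile(r"^(\S+ \S+) DEBUG WebSocket closed for (\S+)")


def parse_ts(text):
    # Daphne timestamps look like "2018-06-27 03:19:21,372".
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S,%f")


def websocket_lifetimes(lines):
    """Return {channel_name: seconds_open} for sockets seen opening and closing."""
    opened = {}
    lifetimes = {}
    for line in lines:
        m = OPENED.match(line)
        if m:
            opened[m.group(2)] = parse_ts(m.group(1))
            continue
        m = CLOSED.match(line)
        if m and m.group(2) in opened:
            delta = parse_ts(m.group(1)) - opened.pop(m.group(2))
            lifetimes[m.group(2)] = delta.total_seconds()
    return lifetimes
```

For the pair of lines above this gives a lifetime of about 4.2 seconds, so the socket is being torn down almost immediately rather than timing out after a long idle period.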

The browser itself reports:

WebSocket connection to ‘wss://…/websocket/’ failed: WebSocket is closed before the connection is established.

And the Task container reports (when running the job):

[2018-07-02 19:03:47,717: DEBUG/Worker-4] using channel_id: 2
2018-07-02 19:03:47,718 ERROR awx.main.models.unified_jobs job 15 (running) failed to emit channel msg about status change
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/awx/main/models/unified_jobs.py", line 1169, in _websocket_emit_status
    emit_channel_notification('jobs-status_changed', status_data)
  File "/usr/lib/python2.7/site-packages/awx/main/consumers.py", line 70, in emit_channel_notification
    Group(group).send({"text": json.dumps(payload, cls=DjangoJSONEncoder)})
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/channels/channel.py", line 88, in send
    self.channel_layer.send_group(self.name, content)
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/asgi_amqp/core.py", line 190, in send_group
    self.send(channel, message)
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/asgi_amqp/core.py", line 95, in send
    self.recover()
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/asgi_amqp/core.py", line 77, in recover
    self.tdata.consumer.revive(self.tdata.connection.channel())
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/kombu/connection.py", line 255, in channel
    chan = self.transport.create_channel(self.connection)
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/kombu/transport/pyamqp.py", line 92, in create_channel
    return connection.channel()
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/amqp/connection.py", line 282, in channel
    return self.Channel(self, channel_id)
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/amqp/channel.py", line 101, in __init__
    self._x_open()
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/amqp/channel.py", line 427, in _x_open
    self._send_method((20, 10), args)
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/amqp/abstract_channel.py", line 56, in _send_method
    self.channel_id, method_sig, args, content,
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/amqp/method_framing.py", line 221, in write_method
    write_frame(1, channel, payload)
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/amqp/transport.py", line 182, in write_frame
    frame_type, channel, size, payload, 0xce,
  File "/usr/lib64/python2.7/socket.py", line 224, in meth
    return getattr(self._sock,name)(*args)
error: [Errno 104] Connection reset by peer

Can anyone suggest what the next troubleshooting steps might be, or what additional logging could be enabled?

We see similar behavior on AWX 1.0.6.x.

We use HAProxy in a separate container in front for SSL.
Reloading HAProxy and the AWX web and task containers used to solve our issue.

Does it fix it permanently or does it come back as an issue shortly after?

I’m having trouble figuring out what to test or try next. After a restart of the containers it works for a short period (normally less than an hour) and then stops, and I can’t find any pattern in when or why it stops working.

No, we used to need to reload every week.