Issues with celery worker pidbox and "Connection reset by peer" for jobs with run times over 5 minutes

Hello,

I'm having an issue where tasks show as "failed" in the UI while they are actually still running in the background. Once all of the processes catch up later, the status somehow gets updated to successful, but it can take an hour or two before the correct status appears.

These tend to be jobs that run for OVER 5 MINUTES; I haven't seen this on jobs with run times under 5 minutes.

Log from "docker logs -f awx_task":

[2018-02-09 03:23:16,883: DEBUG/ForkPoolWorker-5512] using channel_id: 1
[2018-02-09 03:23:16,884: DEBUG/ForkPoolWorker-5512] Channel open
[2018-02-09 03:23:16,887: DEBUG/MainProcess] pidbox received method active_queues() [reply_to:{u'routing_key': u'86ab3717-36cc-3dc5-964e-37b60d736f6c', u'exchange': u'reply.celery.pidbox'} ticket:52a253c6-eb16-48c1-8811-9d59999396ab]
[2018-02-09 03:23:16,889: ERROR/MainProcess] Control command error: error(104, 'Connection reset by peer')
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/celery/worker/pidbox.py", line 42, in on_message
    self.node.handle_message(body, message)
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/kombu/pidbox.py", line 129, in handle_message
    return self.dispatch(**body)
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/kombu/pidbox.py", line 112, in dispatch
    ticket=ticket)
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/kombu/pidbox.py", line 135, in reply
    serializer=self.mailbox.serializer)
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/kombu/pidbox.py", line 265, in _publish_reply
    **opts
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/kombu/messaging.py", line 181, in publish
    exchange_name, declare,
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/kombu/messaging.py", line 203, in _publish
    mandatory=mandatory, immediate=immediate,
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/amqp/channel.py", line 1734, in _basic_publish
    (0, exchange, routing_key, mandatory, immediate), msg
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/amqp/abstract_channel.py", line 50, in send_method
    conn.frame_writer(1, self.channel_id, sig, args, content)
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/amqp/method_framing.py", line 166, in write_frame
    write(view[:offset])
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/amqp/transport.py", line 258, in write
    self._write(s)
  File "/usr/lib64/python2.7/socket.py", line 224, in meth
    return getattr(self._sock,name)(*args)
error: [Errno 104] Connection reset by peer
[2018-02-09 03:23:16,892: DEBUG/MainProcess] Closed channel #3
[2018-02-09 03:23:16,892: DEBUG/MainProcess] using channel_id: 3
[2018-02-09 03:23:16,892: DEBUG/MainProcess] Channel open
[2018-02-09 03:23:17,906: DEBUG/MainProcess] pidbox received method add_consumer(queue=u'Tower Servers', exchange=None, routing_key=None, exchange_type=u'direct') [reply_to:{u'routing_key': u'86ab3717-36cc-3dc5-964e-37b60d736f6c', u'exchange': u'reply.celery.pidbox'} ticket:52302123-3dca-4a2a-ad1d-5351e84fa669]
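From the traceback, the worker is trying to publish a pidbox reply on an AMQP connection that has already been reset, which makes me suspect the broker (or something between the containers) is dropping connections that sit idle while a long job runs. For what it's worth, this is the kind of heartbeat tuning I've been experimenting with on a plain standalone Celery app to try to reproduce it outside of AWX. The module and task names below are made up for illustration, and I haven't confirmed which of these settings AWX itself exposes:

    # celery_probe.py: standalone sketch, not AWX's real configuration
    import time
    from celery import Celery

    app = Celery('probe', broker='amqp://guest:guest@localhost:5672//')

    # Keep the AMQP connection alive with heartbeats so a half-open socket is
    # noticed before a pidbox reply tries to publish on a dead channel.
    app.conf.broker_heartbeat = 30            # seconds between AMQP heartbeats
    app.conf.broker_heartbeat_checkrate = 2   # check twice per heartbeat interval
    app.conf.broker_connection_retry = True
    app.conf.broker_connection_max_retries = None  # keep retrying

    @app.task
    def sleepy(seconds):
        # Dummy task that just sleeps, to mimic a job with a >5 minute run time.
        time.sleep(seconds)
        return seconds

Running something like sleepy.delay(600) against a local RabbitMQ is the closest I can get to the ">5 minute job" window without AWX in the picture.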

./rabbitmq-server status

Status of node rabbit@ff0d8db98a56 ...
[{pid,368},
{running_applications,
[{rabbitmq_management,"RabbitMQ Management Console","3.7.3"},
{rabbitmq_web_dispatch,"RabbitMQ Web Dispatcher","3.7.3"},
{rabbitmq_management_agent,"RabbitMQ Management Agent","3.7.3"},
{rabbit,"RabbitMQ","3.7.3"},
{amqp_client,"RabbitMQ AMQP Client","3.7.3"},
{rabbit_common,
"Modules shared by rabbitmq-server and rabbitmq-erlang-client",
"3.7.3"},
{recon,"Diagnostic tools for production use","2.3.2"},
{ranch_proxy_protocol,"Ranch Proxy Protocol Transport","1.4.4"},
{cowboy,"Small, fast, modern HTTP server.","2.0.0"},
{ranch,"Socket acceptor pool for TCP protocols.","1.4.0"},
{ssl,"Erlang/OTP SSL application","8.2.3"},
{public_key,"Public key infrastructure","1.5.2"},
{asn1,"The Erlang ASN1 compiler version 5.0.4","5.0.4"},
{inets,"INETS CXC 138 49","6.4.5"},
{cowlib,"Support library for manipulating Web protocols.","2.0.0"},
{os_mon,"CPO CXC 138 46","2.4.4"},
{mnesia,"MNESIA CXC 138 12","4.15.3"},
{jsx,"a streaming, evented json parsing toolkit","2.8.2"},
{crypto,"CRYPTO","4.2"},
{xmerl,"XML parser","1.3.16"},
{lager,"Erlang logging framework","3.5.1"},
{goldrush,"Erlang event stream processor","0.1.9"},
{compiler,"ERTS CXC 138 10","7.1.4"},
{syntax_tools,"Syntax tools","2.1.4"},
{sasl,"SASL CXC 138 11","3.1.1"},
{stdlib,"ERTS CXC 138 10","3.4.3"},
{kernel,"ERTS CXC 138 10","5.4.1"}]},
{os,{unix,linux}},
{erlang_version,
"Erlang/OTP 20 [erts-9.2] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:128] [hipe] [kernel-poll:true]\n"},
{memory,
[{connection_readers,328448},
{connection_writers,35480},
{connection_channels,238144},
{connection_other,772368},
{queue_procs,384888},
{queue_slave_procs,0},
{plugins,3553568},
{other_proc,21404792},
{metrics,294408},
{mgmt_db,1782432},
{mnesia,128880},
{other_ets,2232392},
{binary,6181984},
{msg_index,33952},
{code,28290112},
{atom,1123529},
{other_system,13436575},
{allocated_unused,41015552},
{reserved_unallocated,0},
{strategy,rss},
{total,[{erlang,80221952},{rss,113106944},{allocated,121237504}]}]},
{alarms,[]},
{listeners,[{clustering,25672,"::"},{amqp,5672,"::"},{http,15672,"::"}]},
{vm_memory_calculation_strategy,rss},
{vm_memory_high_watermark,0.4},
{vm_memory_limit,4971990220},
{disk_free_limit,50000000},
{disk_free,87438770176},
{file_descriptors,
[{total_limit,65436},
{total_used,22},
{sockets_limit,58890},
{sockets_used,13}]},
{processes,[{limit,1048576},{used,530}]},
{run_queue,0},
{uptime,1500},
{kernel,{net_ticktime,60}}]
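
Since the management plugin is listening on 15672 (see the listeners line above), I have also been dumping the per-connection heartbeat that the clients actually negotiated via the HTTP API. Rough sketch below; the host and the guest/guest credentials are placeholders for my setup, and I am assuming the 'timeout' field is the negotiated heartbeat in seconds:

    # check_connections.py: placeholder host/credentials, adjust for your setup
    import requests

    MGMT = 'http://localhost:15672'
    AUTH = ('guest', 'guest')  # assumed default management credentials

    for conn in requests.get(MGMT + '/api/connections', auth=AUTH).json():
        # 'timeout' should be the negotiated heartbeat interval (0 = disabled)
        print(conn.get('name'), conn.get('state'), 'heartbeat:', conn.get('timeout'))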

I also found this in the RabbitMQ logs:

"operation basic.ack caused a channel exception precondition_failed: unknown delivery tag 1"
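
As far as I can tell, that message just means a basic.ack arrived for a delivery tag the broker no longer has outstanding on that channel, which would line up with the channel being torn down and re-opened after the connection reset above. A trivial way to trigger the same broker-side log line (plain pika, nothing AWX-specific; exception behaviour may differ slightly between pika versions):

    # stale_ack.py: reproduces the broker log message, not the AWX bug itself
    import pika

    conn = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
    ch = conn.channel()
    try:
        # No delivery with tag 1 is outstanding on this fresh channel, so the
        # broker answers with PRECONDITION_FAILED "unknown delivery tag 1"
        # and closes the channel; that is the same message as in the rabbitmq log.
        ch.basic_ack(delivery_tag=1)
        conn.process_data_events(time_limit=1)
    except pika.exceptions.ChannelClosed as exc:
        print('broker closed the channel:', exc)
    finally:
        if conn.is_open:
            conn.close()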