issue with worker nodes not registering and not executing any jobs

Serge_van_Ginderacht · November 16, 2018, 1:53pm

Hi,

I’m seeing worker nodes that do not register and do not execute any jobs. This is AWX running on OpenShift. I tried several variations: version 1.7 (preprod/testing) and 2.1 (a migrated instance of the former, which I’m testing), each with a single pod and scaled to 3 pods. All show the same issue.

The issue is that worker nodes do not seem to register in the (default tower) instance group. Sometimes that can be fixed by scaling to 0, redeploying a single pod, and only in a second step scaling back to three (race condition in RabbitMQ?). Sometimes it just never happens (2.1 setup, starting with one pod, scaled to three later, …)

When I hit the plus button to add instances to a group, I can see the list of pod names that were spawned (including older scaled-down pods, which apparently are not cleaned up). Once added, they show up as members of the group, but are listed as ‘UNAVAILABLE’. Switching them off and on again in that interface makes them show as available…

But even then, jobs don’t start, and stay in this state:

STATUS New or Pending
STARTED Not Started
FINISHED Not Finished

Looking at the (docker) logs, the only error that looks important is this stacktrace, repeating continuously on the worker/task container:

2018-11-16 13:32:11,077 INFO spawned: 'dispatcher' with pid 1428
2018-11-16 13:32:12,079 INFO success: dispatcher entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
Traceback (most recent call last):
  File "/usr/bin/awx-manage", line 9, in <module>
    load_entry_point('awx==2.1.0', 'console_scripts', 'awx-manage')()
  File "/usr/lib/python2.7/site-packages/awx/__init__.py", line 108, in manage
    execute_from_command_line(sys.argv)
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/django/core/management/__init__.py", line 364, in execute_from_command_line
    utility.execute()
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/django/core/management/__init__.py", line 356, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/django/core/management/base.py", line 283, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/django/core/management/base.py", line 330, in execute
    output = self.handle(*args, **options)
  File "/usr/lib/python2.7/site-packages/awx/main/management/commands/run_dispatcher.py", line 122, in handle
    AutoscalePool(min_workers=4)
  File "/usr/lib/python2.7/site-packages/awx/main/dispatch/worker/base.py", line 45, in __init__
    self.pool.init_workers(self.worker.work_loop)
  File "/usr/lib/python2.7/site-packages/awx/main/dispatch/pool.py", line 209, in init_workers
    self.up()
  File "/usr/lib/python2.7/site-packages/awx/main/dispatch/pool.py", line 378, in up
    idx = random.choice(range(len(self.workers)))
  File "/usr/lib64/python2.7/random.py", line 274, in choice
    return seq[int(self.random() * len(seq))] # raises IndexError if seq is empty
IndexError: list index out of range
2018-11-16 13:32:13,175 INFO exited: dispatcher (exit status 1; not expected)

From a brief look at the code, I think this means the worker pool is full: it wants to assign a job to a random worker, but there are no workers…?

I checked memory: in my last test each pod had 8 GB of memory and 4 CPUs, which I believe should be enough.

I could use some pointers on how to troubleshoot this further.
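The last two frames of the traceback can be reproduced outside AWX. The following is a minimal sketch, assuming the pool's `workers` list is empty at the moment `up()` runs. The thread does not confirm why it would be empty, and `FakePool` and `try_up` are hypothetical names used only for illustration, not AWX code. Only the `random.choice(range(len(self.workers)))` expression is taken from the traceback.

```python
import random


class FakePool:
    """Hypothetical stand-in for the dispatcher pool; NOT the AWX class."""

    def __init__(self):
        # Assumption: no workers exist yet when up() is called.
        self.workers = []

    def up(self):
        # Same expression as pool.py line 378 in the traceback above.
        return random.choice(range(len(self.workers)))


def try_up(pool):
    """Return the chosen worker index, or the name of the exception raised."""
    try:
        return pool.up()
    except IndexError as exc:
        return type(exc).__name__


empty_pool = FakePool()
print(try_up(empty_pool))   # choosing from range(0) fails

pool_with_one = FakePool()
pool_with_one.workers.append("worker-0")
print(try_up(pool_with_one))  # only index 0 is available
```

With an empty list, `range(0)` has nothing to choose from, so `random.choice` raises `IndexError`. This matches the exception in the log, which is consistent with "there are no workers" rather than with a full pool.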

Thanks,

Serge

rpetrell · November 16, 2018, 2:00pm

Serge,

It looks to me like you’ve discovered a bug in a recent change we made to AWX. Would you mind reposting these details as a bug report at https://github.com/ansible/awx/issues/new?labels=&template=bug_report.md

Thanks!

Serge_van_Ginderacht · November 16, 2018, 2:19pm

OK, done: https://github.com/ansible/awx/issues/2705

FYI, I just also noticed this on the awx_celery worker container:

sh-4.2$ ps auxwf
USER   PID %CPU %MEM    VSZ    RSS TTY STAT START TIME COMMAND
awx   4847  0.0  0.0  11816   1708 ?   Ss   14:10 0:00 /bin/sh
awx   4854  0.0  0.0  51704   1684 ?   R+   14:10 0:00 \_ ps auxwf
awx      1  0.0  0.0   4352    348 ?   Ss   13:49 0:00 /tini -- /bin/sh -c /usr/bin/launch_awx_task.sh
awx      7  0.0  0.0  11684   1500 ?   S    13:49 0:00 bash /usr/bin/launch_awx_task.sh
awx    186  0.1  0.0 114940  13920 ?   S    13:50 0:01 \_ /usr/bin/python2 /usr/bin/supervisord -c /supervisor_task.conf
awx    189  0.0  0.0 103524  10896 ?   S    13:50 0:00 \_ python /usr/bin/config-watcher
awx    190  0.1  0.3 333232 104492 ?   S    13:50 0:02 \_ python /usr/bin/awx-manage runworker --only-channels websocket.*
awx    191  0.1  0.3 328492 104224 ?   S    13:50 0:02 \_ python /usr/bin/awx-manage run_callback_receiver
awx    223  0.0  0.2 326024  93688 ?   S    13:50 0:00 \_ python /usr/bin/awx-manage run_callback_receiver
awx    225  0.0  0.2 326308  93740 ?   S    13:50 0:00 \_ python /usr/bin/awx-manage run_callback_receiver
awx    226  0.0  0.2 326332  93756 ?   S    13:50 0:00 \_ python /usr/bin/awx-manage run_callback_receiver
awx    227  0.0  0.2 326356  93880 ?   S    13:50 0:00 \_ python /usr/bin/awx-manage run_callback_receiver

I believe that confirms there are no worker nodes spawned?
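The same check can be scripted against captured `ps` output. This is a rough sketch, assuming the dispatcher would appear in the process list with `run_dispatcher` in its command line. That name is inferred from `run_dispatcher.py` in the traceback and is an assumption about what the process looks like. `dispatcher_running` is a hypothetical helper, and the sample text is abbreviated from the listing above.

```python
def dispatcher_running(ps_output):
    """Return True if any line of `ps` output mentions run_dispatcher.

    Assumption: a live dispatcher shows up with 'run_dispatcher' somewhere
    in its command line (name inferred from the traceback, not verified).
    """
    return any("run_dispatcher" in line for line in ps_output.splitlines())


# Command lines abbreviated from the listing in this post.
sample = """\
awx 186 /usr/bin/python2 /usr/bin/supervisord -c /supervisor_task.conf
awx 189 python /usr/bin/config-watcher
awx 190 python /usr/bin/awx-manage runworker --only-channels websocket.*
awx 191 python /usr/bin/awx-manage run_callback_receiver
"""

print(dispatcher_running(sample))
```

For the listing above this reports that no dispatcher process is present. Note that the listing only reflects the instant at which `ps` ran: the log shows supervisord spawning the dispatcher and the dispatcher exiting about two seconds later, so a snapshot can miss a process that is crash-looping.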

Serge
