issue with worker nodes not registering and not executing any jobs

Hi,

I’m seeing issues with worker nodes not registering and not executing any jobs. This is AWX running on OpenShift, in a few variations (version 1.7, which is our preprod/testing instance, and 2.1, which is an instance migrated from the former that I’m testing; I also tried running these with a single pod and scaled up to 3 pods), all showing the same issue.

The issue is that worker nodes do not seem to register in the (default tower) instance group. Sometimes that can be fixed by redeploying a single pod (after scaling to 0) and only scaling back to three in a second step (race condition in RabbitMQ?); sometimes it just never happens (the 2.1 setup, starting with one pod and scaled to three later, …).

When hitting the plus button to add instances to a group, I can see the list of pod names that were spawned (including older, scaled-down pods, which apparently are not cleaned up). When added, they show up as members of the group, but are marked as 'UNAVAILABLE'. Switching them off and on again in that interface makes them show as available…

But even then, jobs don’t start and just stay stuck like this:

STATUS New or Pending
STARTED Not Started
FINISHED Not Finished

Looking at the (docker) logs, the only error that looks important is this stack trace, repeating continuously in the worker/task container:

2018-11-16 13:32:11,077 INFO spawned: 'dispatcher' with pid 1428
2018-11-16 13:32:12,079 INFO success: dispatcher entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
Traceback (most recent call last):
  File "/usr/bin/awx-manage", line 9, in <module>
    load_entry_point('awx==2.1.0', 'console_scripts', 'awx-manage')()
  File "/usr/lib/python2.7/site-packages/awx/__init__.py", line 108, in manage
    execute_from_command_line(sys.argv)
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/django/core/management/__init__.py", line 364, in execute_from_command_line
    utility.execute()
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/django/core/management/__init__.py", line 356, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/django/core/management/base.py", line 283, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/django/core/management/base.py", line 330, in execute
    output = self.handle(*args, **options)
  File "/usr/lib/python2.7/site-packages/awx/main/management/commands/run_dispatcher.py", line 122, in handle
    AutoscalePool(min_workers=4)
  File "/usr/lib/python2.7/site-packages/awx/main/dispatch/worker/base.py", line 45, in __init__
    self.pool.init_workers(self.worker.work_loop)
  File "/usr/lib/python2.7/site-packages/awx/main/dispatch/pool.py", line 209, in init_workers
    self.up()
  File "/usr/lib/python2.7/site-packages/awx/main/dispatch/pool.py", line 378, in up
    idx = random.choice(range(len(self.workers)))
  File "/usr/lib64/python2.7/random.py", line 274, in choice
return seq[int(self.random() * len(seq))] # raises IndexError if seq is empty
IndexError: list index out of range
2018-11-16 13:32:13,175 INFO exited: dispatcher (exit status 1; not expected)

From a brief look at the code, I think this means the worker pool is considered full, so the dispatcher wants to assign the job to a random existing worker, but there are no workers at all…?
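For what it’s worth, that last frame is easy to reproduce in isolation. This is just an illustrative sketch, not the actual AWX code, but it shows why random.choice() over the index range of an empty worker list raises exactly this error on Python 2.7:

import random

workers = []  # stand-in for self.workers when the pool has not spawned any workers
idx = random.choice(range(len(workers)))  # IndexError: list index out of range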

I checked resources: there’s 8 GB of memory and 4 CPUs per pod in my last test, which I believe should be enough.

I could use some pointers on how to troubleshoot this further.

Thanks,

Serge

Serge,

It looks to me like you’ve discovered a bug in a recent change we made to AWX. Would you mind reposting these details as a bug report at https://github.com/ansible/awx/issues/new?labels=&template=bug_report.md

Thanks!

OK, done: https://github.com/ansible/awx/issues/2705

FYI, I also just noticed this in the awx_celery worker container:

sh-4.2$ ps auxwf
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
awx 4847 0.0 0.0 11816 1708 ? Ss 14:10 0:00 /bin/sh
awx 4854 0.0 0.0 51704 1684 ? R+ 14:10 0:00 _ ps auxwf
awx 1 0.0 0.0 4352 348 ? Ss 13:49 0:00 /tini -- /bin/sh -c /usr/bin/launch_awx_task.sh
awx 7 0.0 0.0 11684 1500 ? S 13:49 0:00 bash /usr/bin/launch_awx_task.sh
awx 186 0.1 0.0 114940 13920 ? S 13:50 0:01 _ /usr/bin/python2 /usr/bin/supervisord -c /supervisor_task.conf
awx 189 0.0 0.0 103524 10896 ? S 13:50 0:00 _ python /usr/bin/config-watcher
awx 190 0.1 0.3 333232 104492 ? S 13:50 0:02 _ python /usr/bin/awx-manage runworker --only-channels websocket.*
awx 191 0.1 0.3 328492 104224 ? S 13:50 0:02 _ python /usr/bin/awx-manage run_callback_receiver
awx 223 0.0 0.2 326024 93688 ? S 13:50 0:00 _ python /usr/bin/awx-manage run_callback_receiver
awx 225 0.0 0.2 326308 93740 ? S 13:50 0:00 _ python /usr/bin/awx-manage run_callback_receiver
awx 226 0.0 0.2 326332 93756 ? S 13:50 0:00 _ python /usr/bin/awx-manage run_callback_receiver
awx 227 0.0 0.2 326356 93880 ? S 13:50 0:00 _ python /usr/bin/awx-manage run_callback_receiver

I believe that confirms no dispatcher workers are being spawned?
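Note also that there is no awx-manage run_dispatcher process in that tree at all (only the channels runworker and the callback receiver with its children), which matches the dispatcher crash-looping in supervisord before it ever forks its pool. For comparison, a healthy dispatcher would show its pool workers indented under the parent in ps auxwf; here is a purely illustrative toy sketch (not AWX code, just a plain multiprocessing fork per worker) of that parent/child layout:

import time
from multiprocessing import Process

def work_loop():
    # placeholder loop; each forked child would appear under the parent in `ps auxwf`
    while True:
        time.sleep(60)

if __name__ == '__main__':
    workers = [Process(target=work_loop) for _ in range(4)]  # cf. min_workers=4
    for w in workers:
        w.start()
    for w in workers:
        w.join()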

Serge