Hi,
I’m seeing worker nodes that do not register and do not execute any jobs. This is AWX running on OpenShift, in several variations: version 1.7 (preprod/testing) and 2.1 (an instance migrated from the former, which I’m testing). I also tried running each of these with a single pod and scaled to 3 pods. All show the same issue.
The issue is that worker nodes do not seem to register in the (default tower) instance group. Sometimes this can be fixed by scaling to 0, redeploying a single pod, and only in a second step scaling back to three (race condition in RabbitMQ?). Sometimes it just never happens (2.1 setup, starting with one pod, scaled to three later, …).
When I hit the plus button to add instances to a group, I can see the list of pod names that were spawned (including older, scaled-down pods, which seemingly are not cleaned up). Once added, they appear as members of the group but are shown as ‘UNAVAILABLE’. Switching them off and on again in that interface makes them show as available…
But even then, jobs don’t start and stay in this state:
STATUS New or Pending
STARTED Not Started
FINISHED Not Finished
Looking at the (docker) logs, the only error that looks important is this stack trace, which repeats continuously on the worker/task container:
2018-11-16 13:32:11,077 INFO spawned: 'dispatcher' with pid 1428
2018-11-16 13:32:12,079 INFO success: dispatcher entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
Traceback (most recent call last):
  File "/usr/bin/awx-manage", line 9, in <module>
    load_entry_point('awx==2.1.0', 'console_scripts', 'awx-manage')()
  File "/usr/lib/python2.7/site-packages/awx/__init__.py", line 108, in manage
    execute_from_command_line(sys.argv)
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/django/core/management/__init__.py", line 364, in execute_from_command_line
    utility.execute()
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/django/core/management/__init__.py", line 356, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/django/core/management/base.py", line 283, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/django/core/management/base.py", line 330, in execute
    output = self.handle(*args, **options)
  File "/usr/lib/python2.7/site-packages/awx/main/management/commands/run_dispatcher.py", line 122, in handle
    AutoscalePool(min_workers=4)
  File "/usr/lib/python2.7/site-packages/awx/main/dispatch/worker/base.py", line 45, in __init__
    self.pool.init_workers(self.worker.work_loop)
  File "/usr/lib/python2.7/site-packages/awx/main/dispatch/pool.py", line 209, in init_workers
    self.up()
  File "/usr/lib/python2.7/site-packages/awx/main/dispatch/pool.py", line 378, in up
    idx = random.choice(range(len(self.workers)))
  File "/usr/lib64/python2.7/random.py", line 274, in choice
    return seq[int(self.random() * len(seq))]  # raises IndexError if seq is empty
IndexError: list index out of range
2018-11-16 13:32:13,175 INFO exited: dispatcher (exit status 1; not expected)
From a brief look at the code, I think this means the worker pool considers itself full, so it wants to assign the job to a random existing worker, but there are no workers…?
I checked memory: in my last test there were 8 GB of memory and 4 CPUs per pod, which I believe should be enough.
I could use some pointers on how to troubleshoot this further.
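To illustrate what I mean, here is a minimal sketch of that reading of the traceback. It is not the actual AWX code: the class `TinyPool`, its `max_workers` limit, the `full` property and the `demo` helper are hypothetical names I made up, and only the `random.choice(range(len(self.workers)))` expression is taken from the traceback. The assumption is that a pool which counts as full while holding zero workers falls through to the "reuse a random worker" branch and fails there.

```python
import random


class TinyPool(object):
    """Hypothetical stand-in for a worker pool; NOT the actual AWX implementation."""

    def __init__(self, max_workers):
        self.max_workers = max_workers  # assumed upper limit on workers
        self.workers = []               # the pool starts out empty

    @property
    def full(self):
        # Assumption: the pool is "full" once it holds max_workers workers.
        return len(self.workers) >= self.max_workers

    def up(self):
        if not self.full:
            # Room left: "spawn" a new worker (just a placeholder string here).
            self.workers.append("worker-%d" % len(self.workers))
            idx = len(self.workers) - 1
        else:
            # Pool is full: pick a random existing worker, using the same
            # expression that appears in the traceback.
            idx = random.choice(range(len(self.workers)))
        return idx, self.workers[idx]


def demo(max_workers):
    """Call up() once on a fresh pool and report what happens."""
    pool = TinyPool(max_workers)
    try:
        return pool.up()
    except IndexError as exc:
        return "IndexError: %s" % exc
```

With a positive limit, `demo(2)` spawns the first worker. With a limit of zero, `demo(0)` treats the empty pool as full and ends in an `IndexError`, which is the situation I suspect here. Whether AWX actually reaches that state, and why, is the part I can’t tell.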
Thanks,
Serge