I am running a Docker-based instance of AWX on a Kubernetes box, but while testing the maintenance jobs, the jobs get stuck in the waiting state. When I run the exact same Docker instance on the local development server, the jobs run fine and complete successfully.
I put many debug statements in the code in an effort to see where the Kubernetes deployment was getting stuck. Everything leading up to the system job reaching the waiting state is exactly the same on both deployments.
The following is the log output from the Development box:
`
14:22:32 celeryd.1 | 2018-06-28 14:22:32,325 INFO awx.main.scheduler ***** Disabling activity stream *****
14:22:32 celeryd.1 | [2018-06-28 14:22:32,325: INFO/ForkPoolWorker-3] ***** Disabling activity stream *****
14:22:32 celeryd.1 | 2018-06-28 14:22:32,326 INFO awx.main.models.unified_jobs ***** getting status before the save *****
14:22:32 celeryd.1 | [2018-06-28 14:22:32,326: INFO/ForkPoolWorker-3] ***** getting status before the save *****
14:22:32 celeryd.1 | 2018-06-28 14:22:32,326 INFO awx.main.models.unified_jobs waiting
14:22:32 celeryd.1 | [2018-06-28 14:22:32,326: INFO/ForkPoolWorker-3] waiting
14:22:32 celeryd.1 | 2018-06-28 14:22:32,327 INFO awx.main.models.unified_jobs ***** job exists before, get its information *****
14:22:32 celeryd.1 | [2018-06-28 14:22:32,327: INFO/ForkPoolWorker-3] ***** job exists before, get its information *****
14:22:32 celeryd.1 | 2018-06-28 14:22:32,332 INFO awx.main.models.unified_jobs ***** check job has already in a failed state *****
14:22:32 celeryd.1 | [2018-06-28 14:22:32,332: INFO/ForkPoolWorker-3] ***** check job has already in a failed state *****
14:22:32 celeryd.1 | 2018-06-28 14:22:32,332 INFO awx.main.models.unified_jobs ***** check if job is in running state *****
14:22:32 celeryd.1 | [2018-06-28 14:22:32,332: INFO/ForkPoolWorker-3] ***** check if job is in running state *****
14:22:32 celeryd.1 | 2018-06-28 14:22:32,333 INFO awx.main.models.unified_jobs ***** check if job is in a finished state *****
14:22:32 celeryd.1 | [2018-06-28 14:22:32,333: INFO/ForkPoolWorker-3] ***** check if job is in a finished state *****
14:22:32 celeryd.1 | 2018-06-28 14:22:32,333 INFO awx.main.models.unified_jobs ***** check if job template info is current *****
14:22:32 celeryd.1 | [2018-06-28 14:22:32,333: INFO/ForkPoolWorker-3] ***** check if job template info is current *****
14:22:32 celeryd.1 | 2018-06-28 14:22:32,343 INFO awx.main.models.unified_jobs ***** updating calling super class save method *****
14:22:32 celeryd.1 | [2018-06-28 14:22:32,343: INFO/ForkPoolWorker-3] ***** updating calling super class save method *****
14:22:32 celeryd.1 | 2018-06-28 14:22:32,350 INFO awx.main.models.unified_jobs ***** if status had changed, update the parent info status *****
14:22:32 celeryd.1 | [2018-06-28 14:22:32,350: INFO/ForkPoolWorker-3] ***** if status had changed, update the parent info status *****
14:22:32 celeryd.1 | 2018-06-28 14:22:32,351 INFO awx.main.models.unified_jobs ***** updating parent info status *****
14:22:32 celeryd.1 | [2018-06-28 14:22:32,351: INFO/ForkPoolWorker-3] ***** updating parent info status *****
14:22:32 celeryd.1 | 2018-06-28 14:22:32,368 INFO awx.main.models.unified_jobs ***** returning result *****
14:22:32 celeryd.1 | [2018-06-28 14:22:32,368: INFO/ForkPoolWorker-3] ***** returning result *****
14:22:32 celeryd.1 | 2018-06-28 14:22:32,368 INFO awx.main.models.unified_jobs None
14:22:32 celeryd.1 | [2018-06-28 14:22:32,368: INFO/ForkPoolWorker-3] None
14:22:32 celeryd.1 | 2018-06-28 14:22:32,369 INFO awx.main.scheduler ***** task details saved *****
14:22:32 celeryd.1 | [2018-06-28 14:22:32,369: INFO/ForkPoolWorker-3] ***** task details saved *****
14:22:32 celeryd.1 | [2018-06-28 14:22:32,370: DEBUG/ForkPoolWorker-3] system_job 4 (waiting) consumed 150 capacity units from tower with prior total of 0
14:22:32 celeryd.1 | 2018-06-28 14:22:32,410 INFO awx.main.models.unified_jobs ***** Starting celery task *****
14:22:32 celeryd.1 | [2018-06-28 14:22:32,410: INFO/ForkPoolWorker-3] ***** Starting celery task *****
14:22:32 celeryd.1 | 2018-06-28 14:22:32,410 INFO awx.main.models.jobs ***** Task class is of RunSystemJob *****
14:22:32 celeryd.1 | [2018-06-28 14:22:32,410: INFO/ForkPoolWorker-3] ***** Task class is of RunSystemJob *****
14:22:32 celeryd.1 | 2018-06-28 14:22:32,411 INFO awx.main.models.unified_jobs ***** Task class is *****
14:22:32 celeryd.1 | [2018-06-28 14:22:32,411: INFO/ForkPoolWorker-3] ***** Task class is *****
14:22:32 celeryd.1 | 2018-06-28 14:22:32,411 INFO awx.main.models.unified_jobs <class 'awx.main.tasks.RunSystemJob'>
14:22:32 celeryd.1 | [2018-06-28 14:22:32,411: INFO/ForkPoolWorker-3] <class 'awx.main.tasks.RunSystemJob'>
14:22:32 celeryd.1 | 2018-06-28 14:22:32,414 INFO awx.main.models.unified_jobs ***** Setting task queue *****
14:22:32 celeryd.1 | [2018-06-28 14:22:32,414: INFO/ForkPoolWorker-3] ***** Setting task queue *****
14:22:32 celeryd.1 | 2018-06-28 14:22:32,414 INFO awx.main.models.unified_jobs ***** Applying task asychronously *****
14:22:32 celeryd.1 | [2018-06-28 14:22:32,414: INFO/ForkPoolWorker-3] ***** Applying task asychronously *****
`
The following is the log output from the Kubernetes box:
`
16:40:33 celeryd.1 | [2018-06-28 16:40:33,668: INFO/ForkPoolWorker-2] ***** Disabling activity stream *****
16:40:33 celeryd.1 | 2018-06-28 16:40:33,668 INFO awx.main.models.unified_jobs ***** getting status before the save *****
16:40:33 celeryd.1 | [2018-06-28 16:40:33,668: INFO/ForkPoolWorker-2] ***** getting status before the save *****
16:40:33 celeryd.1 | 2018-06-28 16:40:33,668 INFO awx.main.models.unified_jobs waiting
16:40:33 celeryd.1 | [2018-06-28 16:40:33,668: INFO/ForkPoolWorker-2] waiting
16:40:33 celeryd.1 | 2018-06-28 16:40:33,668 INFO awx.main.models.unified_jobs ***** job exists before, get its information *****
16:40:33 celeryd.1 | [2018-06-28 16:40:33,668: INFO/ForkPoolWorker-2] ***** job exists before, get its information *****
16:40:33 celeryd.1 | 2018-06-28 16:40:33,677 INFO awx.main.models.unified_jobs ***** check job has already in a failed state *****
16:40:33 celeryd.1 | [2018-06-28 16:40:33,677: INFO/ForkPoolWorker-2] ***** check job has already in a failed state *****
16:40:33 celeryd.1 | 2018-06-28 16:40:33,677 INFO awx.main.models.unified_jobs ***** check if job is in running state *****
16:40:33 celeryd.1 | [2018-06-28 16:40:33,677: INFO/ForkPoolWorker-2] ***** check if job is in running state *****
16:40:33 celeryd.1 | 2018-06-28 16:40:33,677 INFO awx.main.models.unified_jobs ***** check if job is in a finished state *****
16:40:33 celeryd.1 | [2018-06-28 16:40:33,677: INFO/ForkPoolWorker-2] ***** check if job is in a finished state *****
16:40:33 celeryd.1 | 2018-06-28 16:40:33,678 INFO awx.main.models.unified_jobs ***** check if job template info is current *****
16:40:33 celeryd.1 | [2018-06-28 16:40:33,678: INFO/ForkPoolWorker-2] ***** check if job template info is current *****
16:40:33 celeryd.1 | 2018-06-28 16:40:33,702 INFO awx.main.models.unified_jobs ***** updating calling super class save method *****
16:40:33 celeryd.1 | [2018-06-28 16:40:33,702: INFO/ForkPoolWorker-2] ***** updating calling super class save method *****
16:40:33 celeryd.1 | 2018-06-28 16:40:33,717 INFO awx.main.models.unified_jobs ***** if status had changed, update the parent info status *****
16:40:33 celeryd.1 | [2018-06-28 16:40:33,717: INFO/ForkPoolWorker-2] ***** if status had changed, update the parent info status *****
16:40:33 celeryd.1 | 2018-06-28 16:40:33,718 INFO awx.main.models.unified_jobs ***** updating parent info status *****
16:40:33 celeryd.1 | [2018-06-28 16:40:33,718: INFO/ForkPoolWorker-2] ***** updating parent info status *****
16:40:33 celeryd.1 | 2018-06-28 16:40:33,756 INFO awx.main.models.unified_jobs ***** returning result *****
16:40:33 celeryd.1 | [2018-06-28 16:40:33,756: INFO/ForkPoolWorker-2] ***** returning result *****
16:40:33 celeryd.1 | 2018-06-28 16:40:33,756 INFO awx.main.models.unified_jobs None
16:40:33 celeryd.1 | [2018-06-28 16:40:33,756: INFO/ForkPoolWorker-2] None
16:40:33 celeryd.1 | 2018-06-28 16:40:33,756 INFO awx.main.scheduler ***** task details saved *****
16:40:33 celeryd.1 | 2018-06-28 16:40:33,756 INFO awx.main.scheduler ***** task details saved *****
16:40:33 celeryd.1 | [2018-06-28 16:40:33,756: INFO/ForkPoolWorker-2] ***** task details saved *****
16:40:33 celeryd.1 | 2018-06-28 16:40:33,756 DEBUG awx.main.scheduler system_job 4 (waiting) consumed 150 capacity units from tower with prior total of 0
16:40:33 celeryd.1 | 2018-06-28 16:40:33,756 DEBUG awx.main.scheduler system_job 4 (waiting) consumed 150 capacity units from tower with prior total of 0
16:40:33 celeryd.1 | [2018-06-28 16:40:33,756: DEBUG/ForkPoolWorker-2] system_job 4 (waiting) consumed 150 capacity units from tower with prior total of 0
16:40:33 celeryd.1 | 2018-06-28 16:40:33,798 INFO awx.main.models.unified_jobs ***** Starting celery task *****
16:40:33 celeryd.1 | [2018-06-28 16:40:33,798: INFO/ForkPoolWorker-2] ***** Starting celery task *****
16:40:33 celeryd.1 | 2018-06-28 16:40:33,805 INFO awx.main.models.unified_jobs ***** Setting task queue *****
16:40:33 celeryd.1 | [2018-06-28 16:40:33,805: INFO/ForkPoolWorker-2] ***** Setting task queue *****
16:40:33 celeryd.1 | 2018-06-28 16:40:33,805 INFO awx.main.models.unified_jobs ***** Applying task asychronously *****
16:40:33 celeryd.1 | [2018-06-28 16:40:33,805: INFO/ForkPoolWorker-2] ***** Applying task asychronously *****
`
As you can see, both logs output exactly the same statements up to the point where the task details are saved. At that point, for some reason, the Kubernetes box logs that statement twice, even though the Celery worker still prints only a single debug line. Similarly, the consumed-capacity line is also printed twice. After those statements, the celery task start runs; however, none of the statements between "Starting celery task" and "Setting task queue" are printed, despite being at the same depth as the debug statements that do appear. None of them are inside any if/else clauses or try/except blocks. The Kubernetes box just seems to skip the identification of the subclass with which the task will be run. The following is the start_celery_task function from awx/main/models/unified_jobs.py:
`
def start_celery_task(self, opts, error_callback, success_callback, queue):
    kwargs = {
        'link_error': error_callback,
        'link': success_callback,
        'queue': None,
        'task_id': None,
    }
    logger.info("***** Starting celery task *****")
    if not self.celery_task_id:
        raise RuntimeError("Expected celery_task_id to be set on model.")
    kwargs['task_id'] = self.celery_task_id
    task_class = self._get_task_class()
    logger.info("***** Task class is *****")
    logger.info(task_class)
    from awx.main.models.ha import InstanceGroup
    ig = InstanceGroup.objects.get(name=queue)
    args = [self.pk]
    if ig.controller_id:
        if self.supports_isolation():  # case of jobs and ad hoc commands
            logger.info("***** instance group id exists and task supports isolation *****")
            isolated_instance = ig.instances.order_by('-capacity').first()
            args.append(isolated_instance.hostname)
        else:  # proj & inv updates, system jobs run on controller
            logger.info("***** Queue set to instance group name *****")
            queue = ig.controller.name
    logger.info("***** Setting task queue *****")
    kwargs['queue'] = queue
    logger.info(task_class)
    logger.info("***** Applying task asychronously *****")
    logger.info("***** Trying to find task class to apply task asynch with following args *****")
    logger.info(args)
    logger.info(opts)
    logger.info(kwargs)
    task_class().apply_async(args, opts, **kwargs)

# The following method is _get_task_class(), from awx/main/models/jobs.py:
@classmethod
def _get_task_class(cls):
    from awx.main.tasks import RunSystemJob
    logger.info("***** Task class is of RunSystemJob *****")
    return RunSystemJob
`
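One possibility I considered for the doubled lines: in Python's logging module, a record is emitted once per handler it reaches, so if the `awx.main.scheduler` logger picks up an extra handler in the Kubernetes pod (e.g. from the logging settings being applied twice) while still propagating records to the root logger, every message gets written twice. A minimal sketch of that effect (this is not AWX's actual logging config, just a reproduction of the symptom):

```python
import io
import logging

buf = io.StringIO()

# Root logger with one handler; records from child loggers propagate up to it.
root = logging.getLogger()
root.handlers = [logging.StreamHandler(buf)]
root.setLevel(logging.INFO)

# A child logger that ALSO has its own handler: each record is emitted
# twice -- once by the child's handler, once by the root's after propagation.
log = logging.getLogger("awx.main.scheduler")
log.handlers = [logging.StreamHandler(buf)]
log.propagate = True

log.info("***** task details saved *****")
print(buf.getvalue().count("task details saved"))  # prints 2
```

That would explain the duplicated file-format lines while the Celery worker's own `[...: INFO/ForkPoolWorker-2]` line still appears once, but it would not by itself explain the missing statements afterwards.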
My question is whether it is this unusual repetition of calls by celery which is causing the issue where the task is just deciding not to run and getting stuck in the waiting state, or is this a compatibility issue between Celery and Kubernetes? Has anyone else encountered this issue before?
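Update: since "Applying task asychronously" is the last line logged, I also wanted to rule out the task being published to a queue that no worker consumes, since a job routed that way would sit in waiting forever. I compared the queue passed to `apply_async()` against the queues reported by `celery inspect active_queues`, using a small helper; the worker and queue names below are made up for illustration, not taken from my deployment:

```python
def unconsumed_queues(published_to, active_queues):
    """Return queue names that tasks were published to but no worker consumes.

    published_to:  queue names passed to apply_async(queue=...)
    active_queues: mapping of worker name -> list of queue names, i.e. a
                   simplified form of the `celery inspect active_queues` reply.
    """
    consumed = {name for names in active_queues.values() for name in names}
    return sorted(set(published_to) - consumed)

# Illustrative data only -- real names come from the workers/broker.
workers = {"celery@awx": ["tower", "awx_private_queue"]}
print(unconsumed_queues(["tower", "awx-controller"], workers))
# prints ['awx-controller'] -- a job routed there would never leave waiting
```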