AWX Instance group workload management

I have an instance group with 2 EC2 instances attached to it. When I execute a workflow template (heavy load), I observe that the job templates get executed on only one of the instances in the instance group. The load is not shared between the instances; instead, the remaining job templates wait for the previous job template to complete, because everything is executed on a single instance. How can we distribute the load so that the overall execution time is reduced?

Are the workflow nodes using the same template? If so, you will want to enable “concurrent jobs” on the job template.

Seth

No, the workflow template uses different individual job templates.

Hi Prateek!

You should still try to have the ‘concurrent jobs’ option enabled in both templates, regardless of whether they are the same or not.
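If there are many job templates in the workflow, toggling the option by hand gets tedious. Here is a minimal sketch of doing it through the REST API, assuming the UI's "concurrent jobs" checkbox corresponds to the `allow_simultaneous` field on `/api/v2/job_templates/<id>/` and that you authenticate with a bearer token. The host name, template IDs and token below are placeholders, not values from this thread.

```python
import json
import urllib.request


def build_concurrency_patch(base_url: str, template_id: int, token: str) -> urllib.request.Request:
    """Build (but do not send) a PATCH that enables concurrent jobs on one job template."""
    url = f"{base_url.rstrip('/')}/api/v2/job_templates/{template_id}/"
    # Assumption: "concurrent jobs" in the UI maps to the allow_simultaneous field.
    body = json.dumps({"allow_simultaneous": True}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Hypothetical host, template IDs and token; replace with your own values.
for jt_id in (11, 12, 13):
    req = build_concurrency_patch("https://awx.example.com", jt_id, "REPLACE_ME")
    print(req.get_method(), req.full_url, req.data.decode())
    # urllib.request.urlopen(req)  # uncomment to actually send the request
```

The request is only built and printed here, so you can check the URL and payload before sending anything to a real AWX instance.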

Let me enable the ‘concurrent jobs’ option and test. Will share the results here.

Below is the structure of the workflow template I’m using.

I have enabled the concurrent jobs option on all the individual job templates. For the entire workflow template I’m passing an inventory with 15 hosts.

In the inventory I have selected the instance group, which is applied to the entire workflow template.

This instance group has 2 instances (A and B). But even after enabling the concurrent jobs option as suggested, I see that all the job templates in the workflow template get executed on instance A, whereas instance B stays idle.

My only observation is that instance A has higher specs in terms of RAM, CPU and memory than instance B.

A couple of things to check:

  1. Make sure Prevent Instance Group Fallback is disabled on the JT (and on the inventory and organization, if IGs are set there too).

  2. You should see information in the task logs as to why pending jobs are not running,
    e.g. “needs capacity”:

2024-06-11 17:53:25,847 INFO     [2f8865a6] awx.analytics.job_lifecycle job-7 needs capacity {"type": "job", "task_id": 7, "state": "needs_capacity", "work_unit_id": null, "task_name": "pause"}

What does it say for you?

  3. Once the workflow has started and jobs are running, what is the current capacity % of instance A and instance B?
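For the capacity question, the numbers shown in the UI can also be read from the API while the workflow is running. Below is a rough sketch, assuming `/api/v2/instances/` returns a `results` list whose entries carry `hostname`, `node_type`, `enabled`, `capacity` and `consumed_capacity`. The host, token and sample values are made up for illustration.

```python
import json
import urllib.request


def fetch_instances(base_url: str, token: str) -> dict:
    """GET /api/v2/instances/ and return the parsed JSON (first page only)."""
    req = urllib.request.Request(
        f"{base_url.rstrip('/')}/api/v2/instances/",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def summarize_capacity(instances_response: dict) -> list:
    """Return one row per instance with the percentage of capacity in use."""
    rows = []
    for inst in instances_response.get("results", []):
        capacity = inst.get("capacity", 0)
        consumed = inst.get("consumed_capacity", 0)
        # An instance with zero capacity (e.g. disabled) is reported as 0% used.
        used_pct = round(100 * consumed / capacity, 1) if capacity else 0.0
        rows.append({
            "hostname": inst.get("hostname"),
            "node_type": inst.get("node_type"),
            "enabled": inst.get("enabled"),
            "used_pct": used_pct,
        })
    return rows


# Made-up sample shaped like the assumed API response; not data from this thread.
sample = {"results": [
    {"hostname": "instance-a", "node_type": "execution", "enabled": True,
     "capacity": 100, "consumed_capacity": 28},
    {"hostname": "instance-b", "node_type": "execution", "enabled": True,
     "capacity": 50, "consumed_capacity": 0},
]}
for row in summarize_capacity(sample):
    print(row)
```

Against a real system you would pass `fetch_instances(...)` to `summarize_capacity` and run it a few times during the workflow. Control and hybrid nodes show up in the same list, so their capacity is worth comparing too.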

It took some time to validate all the above mentioned points and below are the results:

  1. Make sure Prevent Instance Group Fallback is disabled on JT (and inventory and organization if IGs are set there too): disabled and tested. It didn't help.

For the 2nd and 3rd questions, the answers are below:

Yes, when we triggered a workflow template with more than 50 nodes, we observed this “needs capacity” message in the logs:

But the surprising part of this testing is the load on the instance. Below are screenshots of the instance taken at various points in time.

a> Right at the point when the workflow template was triggered:

b> Max load in terms of % (it never went beyond this):

c> When the used capacity was at 28%, these were the jobs in pending:

Observation: even though the log states “needs capacity”, the second instance in the instance group is not utilized at all for the entire duration of the workflow template execution.

Your control node also has capacity that gets consumed in addition to the execution nodes - what does that report?

Can you turn on log level debug and grep the task container logs for “capacity”?

Does it provide extra reasoning for why these jobs need capacity?

Seth
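If a plain grep for “capacity” produces too much output at debug level, a small filter can pull out just the job lifecycle events. This is only a sketch: it assumes the lines look like the `awx.analytics.job_lifecycle` sample quoted earlier in this thread (free text followed by a JSON payload), and it skips any other debug line that mentions capacity, so use grep as well if you want to see everything.

```python
import json


def capacity_events(lines):
    """Return the JSON payloads of job_lifecycle log lines that mention capacity."""
    events = []
    for line in lines:
        if "awx.analytics.job_lifecycle" not in line or "capacity" not in line:
            continue
        start = line.find("{")
        if start == -1:
            continue  # no JSON payload on this line
        try:
            events.append(json.loads(line[start:]))
        except json.JSONDecodeError:
            continue  # payload was truncated or not valid JSON
    return events


# The sample line quoted earlier in this thread, plus one unrelated line.
log_lines = [
    '2024-06-11 17:53:25,847 INFO     [2f8865a6] awx.analytics.job_lifecycle '
    'job-7 needs capacity {"type": "job", "task_id": 7, "state": "needs_capacity", '
    '"work_unit_id": null, "task_name": "pause"}',
    "some unrelated log line",
]
for event in capacity_events(log_lines):
    print(event["task_id"], event["state"], event["task_name"])
```

To use it on real logs, save the task container output to a file and pass the open file object to `capacity_events` in place of `log_lines`.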