Hello,
A few of our jobs in AWX are randomly failing with the message “Task was marked as running but was not present in the job queue, so it has been marked as failed.”
Some info on our setup:
AWX 21.2.0, Operator 0.24.0 on AKS (Azure Kubernetes Service), with an externally managed PostgreSQL database.
This is when I started exploring some of the open issues on GitHub and noticed one that matches.
Since that issue is not resolved yet and there are comments hinting at probable resource issues, I’m trying to understand how instance groups and container groups work.
I’m not seeing any resource crunch on our Kubernetes cluster during the job run (the last time I monitored, the job ended up failing after running for ~40 minutes).
A few questions; I’d appreciate it if someone could help me understand these or point me to the right documentation:
1 - What is the task_resource_requirement setting? Where is this value used? When a job is triggered and the corresponding pod is created, it takes its CPU/memory request values from somewhere else, not from this configuration.
2 - Same for the web_resource and ee_resource requirements: where are they used?
Sometimes the UI doesn’t show the job output - could this be related to low web_resources? For ee_resource, again, a triggered job does not take its container resource reservation from this setting. If that is true, where is this setting expected to be used?
3 - What are instance groups and container groups? What setting controls how many parallel jobs can run?
Regards,
Deepanshu
task_resource_requirement - requests and limits for the task container running in the task pod
web_resource_requirement - affects the web container running in the web pod
ee_resource_requirement - affects the ee container in the task pod
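These map to fields on the AWX custom resource that the operator reconciles; they shape the control plane containers, not the pods your jobs run in. As a rough sketch of where they live (field names follow the awx-operator spec, but double-check against your operator version; the values are only illustrative):

apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
  namespace: awx
spec:
  # requests/limits for the task container in the task pod
  task_resource_requirements:
    requests:
      cpu: 500m
      memory: 1Gi
  # requests/limits for the web container in the web pod
  web_resource_requirements:
    requests:
      cpu: 500m
      memory: 1Gi
  # requests/limits for the ee container in the task pod
  ee_resource_requirements:
    requests:
      cpu: 250m
      memory: 256Mi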
Jobs that run as part of a container group use whatever resources are set up in the container group’s pod spec. You can view these by going to Instance Groups > click the Default container group > Edit > check the customize pod spec option, and you will see the pod spec that is used (and you can change it to whatever you wish).
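As a sketch, a customized container group pod spec with explicit requests and limits could look like the following; the image, namespace and numbers are examples only, so start from whatever spec the UI shows you:

apiVersion: v1
kind: Pod
metadata:
  namespace: awx
spec:
  serviceAccountName: default
  automountServiceAccountToken: false
  containers:
    - name: worker
      image: quay.io/ansible/awx-ee:latest
      args:
        - ansible-runner
        - worker
        - '--private-data-dir=/runner'
      resources:
        requests:
          cpu: 250m
          memory: 100Mi
        limits:
          # example cap for the automation job pod; tune for your playbooks
          memory: 2Gi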
Sometimes the UI doesn’t show the job output - could this be related to low web_resources?
Most likely a different issue is causing the job output not to show. These resources determine whether the pod can be created or not. As long as the pod is running and not restarting, you should be good.
What setting controls how many parallel jobs can run?
For Kubernetes there is no limit by default. You can set a limit by going to Instance Groups > your container group > Max concurrent jobs.
AWX Team
Thank you.
I have a few follow-up questions; thanks in advance for any answers:
- In AWX version 21.2.0 (now updated to 21.3 last week), the task, ee and web containers are all within one pod - which is no issue, but the earlier question remains: what is the role of the task and ee resource requirements? Where are these resources used?
What gets impacted if I reduce or increase the resources given to these two containers?
- The above is still a mystery because, as mentioned in your response as well, the job run takes resource values that are either the defaults or whatever is customized under the container group. If that is true, then where are the ‘task’ and ‘ee’ resources being used?
- The consistent error I’m getting in one of my jobs - ‘Task was marked as running but was not present in the job queue, so it has been marked as failed’ - happens when I connect to 27 VMs to run a few commands on 100 network devices from each VM.
The output is registered in a ‘dictionary’ (per inventory host).
- This issue is consistently reproducible with the number of devices mentioned. I now suspect the job is running out of memory or CPU somewhere as the number of dictionaries grows. The same job runs fine every time with a smaller number of devices (I haven’t rigorously tested the limits).
- So that is why I’m trying to understand which resources and limits might be contributing to my issue.
- I haven’t created an issue on GitHub since I saw one already exists, with not much in the way of updates/progress.
- What’s the role of the ‘control plane’ instance group? It shows certain numbers for max forks etc. Are they relevant on Kubernetes?
Regards,
Deepanshu
Hi,
I’ll try to give you a brief answer to all three questions inline:
- In AWX version 21.2.0 (now updated to 21.3 last week), the task, ee and web containers are all within one pod - which is no issue, but the earlier question remains: what is the role of the task and ee resource requirements? Where are these resources used?
What gets impacted if I reduce or increase the resources given to these two containers?
- The above is still a mystery because, as mentioned in your response as well, the job run takes resource values that are either the defaults or whatever is customized under the container group. If that is true, then where are the ‘task’ and ‘ee’ resources being used?
Starting in Operator version 22, the task and web containers are broken out into their own deployments (pods), so it should be possible to scale these individually. That said, the task container contains the Redis server, which is responsible for holding the events coming from receptor/runner in a queue before they get pushed to the database. If the task container crashes, you can see messages like the one in your question 2. If you reduce the resources given to these containers, you’ll likely see more performance issues and fewer hosts/events before the containers crash. The task container is responsible for all system-level jobs and project updates; ee is where receptor runs (job events).
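If you suspect the control pod is hitting its limits while processing a large volume of job events, one thing you could try is giving the task and ee containers more headroom on the AWX spec. Purely as an illustration (not a sizing recommendation):

spec:
  # more room for the task container (events queued for the database)
  task_resource_requirements:
    requests:
      memory: 2Gi
    limits:
      memory: 4Gi
  # more room for the ee container (receptor)
  ee_resource_requirements:
    requests:
      memory: 512Mi
    limits:
      memory: 1Gi

Then watch those containers for restarts while the problem job runs.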
- The consistent error I’m getting in one of my jobs - ‘Task was marked as running but was not present in the job queue, so it has been marked as failed’ - happens when I connect to 27 VMs to run a few commands on 100 network devices from each VM.
The output is registered in a ‘dictionary’ (per inventory host).
- This issue is consistently reproducible with the number of devices mentioned. I now suspect the job is running out of memory or CPU somewhere as the number of dictionaries grows. The same job runs fine every time with a smaller number of devices (I haven’t rigorously tested the limits).
- So that is why I’m trying to understand which resources and limits might be contributing to my issue.
- I haven’t created an issue on GitHub since I saw one already exists, with not much in the way of updates/progress.
This should be the same as the answer to question 1: resource limits can cause tasks to start failing as containers crash.
- What’s the role of the ‘control plane’ instance group? It shows certain numbers for max forks etc. Are they relevant on Kubernetes?
The control plane instance group is there to provide a stable environment in which system jobs can be run without worrying about user-provided execution environments.
References:
https://www.ansible.com/blog/peeling-back-the-layers-and-understanding-automation-mesh
https://www.ansible.com/blog/scaling-automation-controller-for-api-driven-workloads
Thanks,
AWX Team