I would like to better understand and address how workflow jobs are managed in AWX.
I commented on a GitHub issue where I thought it was appropriate, but I realize the community would have a hard time finding it there.
The GitHub issue has my information and questions, along with an example architecture, but I’m curious how others have managed to do maintenance on AWX on K8s while preventing unexpected job termination within workflows.
Even if a workflow's jobs are on AWX K8s nodes that are not undergoing maintenance, bringing down a controller node in AWX can terminate those workflow jobs on the other nodes.
There is much more information in the relevant GitHub issue below:
https://github.com/ansible/awx/issues/13848
Since I added the documentation tag, I figured I would include relevant docs I have been perusing which may apply here:
Background Tasks in AWX
=======================
In this document, we will go into a bit of detail about how and when AWX runs Python code _in the background_ (_i.e._, **outside** of the context of an HTTP request), such as:
* Any time a Job is launched in AWX (a Job Template, an Ad Hoc Command, a Project
Update, an Inventory Update, a System Job), a background process retrieves
metadata _about_ that job from the database and forks some process (_e.g._,
`ansible-playbook`, `awx-manage inventory_import`)
* Certain expensive or time-consuming tasks running in the background
asynchronously (_e.g._, when deleting an inventory).
* AWX runs a variety of periodic background tasks on a schedule. Some examples
are:
- AWX's "Task Manager/Scheduler" wakes up periodically and looks for
`pending` jobs that have been launched and are ready to start running
- AWX periodically runs code that looks for scheduled jobs and launches
them
- AWX runs a variety of periodic tasks that clean up temporary files, and
performs various administrative checks
- Every node in an AWX cluster runs a periodic task that serves as
# Task Manager System Overview
The task management system is made up of three separate components:
1. Dependency Manager
2. Task Manager
3. Workflow Manager
Each of these runs in a separate dispatched task and can run at the same time as the others.
This system is responsible for deciding when tasks should be scheduled to run. When choosing a task to run, the considerations are:
1. Creation time
2. Job dependencies
3. Capacity
Independent tasks are run in order of creation time, earliest first. Tasks with dependencies are also run in creation time order within the group of task dependencies. Capacity is the final consideration when deciding to release a task to be run by the dispatcher.
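To make the ordering described above concrete, here is a minimal sketch of the three considerations (creation time, dependencies, capacity). It is not AWX's actual task manager code: `PendingTask`, `pick_runnable`, the `impact` field, and the choice to skip a task that does not fit are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class PendingTask:
    # Hypothetical stand-in for a pending job; this is not an AWX model.
    name: str
    created: datetime
    impact: int = 1  # capacity units the task would consume (assumed)
    depends_on: list = field(default_factory=list)  # names that must finish first


def pick_runnable(tasks, finished, free_capacity):
    """Return the names of tasks to release, earliest created first."""
    released = []
    for task in sorted(tasks, key=lambda t: t.created):
        if any(dep not in finished for dep in task.depends_on):
            continue  # a dependency has not finished yet
        if task.impact > free_capacity:
            continue  # capacity is the final gate before release
        free_capacity -= task.impact
        released.append(task.name)
    return released
```

Calling `pick_runnable(tasks, finished=set(), free_capacity=3)` releases only the tasks whose dependencies are done and that still fit in the remaining capacity, in creation order.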
## Dependency Manager
Responsible for looking at each pending task and determining whether it should create a dependency for that task.
Last year we made an attempt at adding more resiliency to AWX around K8s node maintenance.
I made a YouTube video about it https://www.youtube.com/watch?v=EqYl2hDs90c
(like and subscribe and ring the bell )
This feature, in conjunction with what is described in the Kubernetes "Disruptions" documentation, should allow you to do a rolling upgrade of the K8s cluster underneath without disruption to jobs.
I had a draft blog about how to do this (that I didn’t finish >.<)…
(like and subscribe and ring the bell )
(already subscribed!)
I appreciate the feedback!
I have already seen the YouTube video (great video, btw!) and looked at pod disruption budgets.
They all work well, but they do not prevent disruption; they only manage expectations and tolerate some disruption.
My concern is that while job templates are easy enough to handle without killing jobs (since all jobs are exposed as pods), workflow jobs are not exposed in the same way.
So, if you bring down an AWX node that is involved in a workflow, it seems that it can kill the workflow job (which may contain other workflows or regular jobs).
That’s the part I haven’t been able to work around.
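As a rough illustration of the pre-maintenance check this implies, the sketch below lists active jobs that reference a given node as either controller or execution node. The field names `controller_node` and `execution_node` come from the GitHub issue quoted further down; the plain-dict record shape, the status set, the node names, and the `jobs_at_risk` helper are assumptions, and how the records are fetched is left out. It does not close the gap described above, since a workflow job that is not tied to a pod will not necessarily show up this way.

```python
# Statuses treated as "still active" in this sketch (an assumption).
ACTIVE_STATUSES = {"pending", "waiting", "running"}


def jobs_at_risk(jobs, node):
    """Return ids of active jobs that use `node` as controller or execution node."""
    at_risk = []
    for job in jobs:
        if job.get("status") not in ACTIVE_STATUSES:
            continue  # finished jobs cannot be interrupted
        if node in (job.get("controller_node"), job.get("execution_node")):
            at_risk.append(job["id"])
    return at_risk


# Hypothetical job records; real data would come from your AWX instance.
jobs = [
    {"id": 1, "status": "running", "controller_node": "awx-task-0", "execution_node": "awx-task-1"},
    {"id": 2, "status": "successful", "controller_node": "awx-task-0", "execution_node": "awx-task-0"},
    {"id": 3, "status": "running", "controller_node": "awx-task-1", "execution_node": "awx-task-1"},
]
print(jobs_at_risk(jobs, "awx-task-0"))
```

Job 1 is the interesting case: it executes on a different node, yet taking down its controller node still puts it at risk.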
Gotcha. Thanks for pointing out the gap. Let me discuss with the team and see if we can figure something out. Can you put up an RFE with details of how you configure your environment and a “minimal” reproducer?
I did a fairly exhaustive breakdown in the GitHub issue here (including breaking down the architecture with examples):
opened 02:54PM - 12 Apr 23 UTC
type:enhancement
component:api
### Please confirm the following
- [X] I agree to follow this project's [code of conduct](https://docs.ansible.com/ansible/latest/community/code_of_conduct.html).
- [X] I have checked the [current issues](https://github.com/ansible/awx/issues) for duplicates.
- [X] I understand that AWX is open source software provided for free and that I might not receive a timely response.
### Feature type
New Feature
### Feature Summary
Scenario:
- A job that sleeps for several minutes is launched on an execution node, so `controller_node` != `execution_node`
- The job successfully starts and starts producing events
- Mid-run, services are restarted on the job's controller node
Right now, this job will be reaped. We _hate_ it when jobs get reaped.
Details of proposed solution, subject to change:
- replace the `reap` method called from `cluster_node_heartbeat` with a different reconciliation method called from `awx_receptor_workunit_reaper`. This method has access to `receptorctl status`.
- in addition, this method will be given access to the process list, `worker_tasks`
- in addition, this method may pull the list of running jobs from the database, as needed
- In the event that the database status is "running" but the process is missing (timing issues assumed to be worked out), it will send a message back to the dispatcher main process if the _receptor_ status is still active.
- When the dispatcher gets a message that an active job is orphaned, it will launch `RunJob` or its equivalent.
- Instead of starting the job, it will pick up `process`ing from the last line processed, which can be ascertained from the saved events or some other means of tracking.
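The decision at the heart of the proposal above can be sketched as a small pure function. This is only an illustration of the bullets, not AWX code: `Action`, `reconcile`, and the three boolean-style inputs are hypothetical, and the proposal does not say what should happen when the receptor work unit is no longer active, so that branch is labeled `NEEDS_REVIEW` here as an assumption.

```python
from enum import Enum


class Action(Enum):
    NOTHING = "nothing"            # job is healthy or already finished
    RESUME = "resume"              # orphaned: tell the dispatcher to pick processing back up
    NEEDS_REVIEW = "needs_review"  # not covered by the proposal (assumed fallback)


def reconcile(db_status, has_local_process, receptor_active):
    """Decide what to do with one job during reconciliation."""
    if db_status != "running":
        return Action.NOTHING  # only jobs the database considers running matter
    if has_local_process:
        return Action.NOTHING  # a worker is still processing this job
    if receptor_active:
        return Action.RESUME   # database says running, process is gone, receptor still active
    return Action.NEEDS_REVIEW
```

For the scenario in the feature summary (services restarted on the controller node mid-run), the inputs would be `reconcile("running", False, True)`, which yields `Action.RESUME` instead of a reap.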
### Select the relevant components
- [ ] UI
- [X] API
- [ ] Docs
- [ ] Collection
- [ ] CLI
- [ ] Other
### Steps to reproduce
See feature summary
### Current results
Reaper message, job canceled
### Suggested feature result
Jobs should never be reaped - just have processing resumed.
Receptor becomes source of ultimate truth.
### Additional information
_No response_
I’m happy to do an RFE if needed; let me know where to submit it and I can work something out.
Thanks again!
The one created by @AlanCoding works.