I would like to better understand, and address, how Workflow Jobs are managed in AWX.
I made a comment on a GitHub issue where I thought it was appropriate, but I realize the community would have a hard time finding it there.
While the GitHub issue captures my questions and an example architecture, I am curious how others have handled maintenance of AWX on K8s while preventing unexpected job termination within workflows.
Even when a workflow's jobs are running on AWX K8s nodes that are not undergoing maintenance, bringing down a controller node in AWX can still terminate those workflow jobs on the other nodes.
There is much more information in the relevant GitHub issue below:
Opened 12 Apr 2023, 02:54 PM UTC · type:enhancement · component:api
### Please confirm the following
- [X] I agree to follow this project's [code of conduct](https://docs.ansible.com/ansible/latest/community/code_of_conduct.html).
- [X] I have checked the [current issues](https://github.com/ansible/awx/issues) for duplicates.
- [X] I understand that AWX is open source software provided for free and that I might not receive a timely response.
### Feature type
New Feature
### Feature Summary
Scenario:
- A job that sleeps for several minutes is launched on an execution node, so `controller_node` != `execution_node`
- The job successfully starts and starts producing events
- Mid-run, services are restarted on the job's controller node
Right now, this job will be reaped. We _hate_ it when jobs get reaped.
Details of proposed solution, subject to change:
- replace the `reap` method called from `cluster_node_heartbeat` with a different reconciliation method called from `awx_receptor_workunit_reaper`. This method has access to `receptorctl status`.
- in addition, this method will be given access to the process list, `worker_tasks`
- in addition, this method may pull the list of running jobs from the database, as needed
- In the event that the database status is "running" but the process is missing (timing issues assumed to be worked out), it will send a message back to the dispatcher main process if the _receptor_ status is still active.
- When the dispatcher gets a message that an active job is orphaned, it will launch `RunJob` or its equivalent.
- Instead of starting the job, it will pick up `process`ing from the last line processed, which can be ascertained from the saved events or some other means of tracking.
### Select the relevant components
- [ ] UI
- [X] API
- [ ] Docs
- [ ] Collection
- [ ] CLI
- [ ] Other
### Steps to reproduce
See feature summary
### Current results
Reaper message, job canceled
### Suggested feature result
Jobs should never be reaped - just have processing resumed.
Receptor becomes source of ultimate truth.
### Additional information
_No response_
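To make the proposal above a bit more concrete, here is a rough sketch of the reconciliation pass it describes. This is not the actual AWX implementation; every name in it (`JobRecord`, `ReceptorStatus`, `reconcile`, the fields) is an illustrative placeholder, and the real version would live in the dispatcher alongside `awx_receptor_workunit_reaper`:

```python
from dataclasses import dataclass, field

# All names below are illustrative placeholders, not real AWX internals.

@dataclass
class JobRecord:
    id: int
    work_unit_id: str          # receptor work unit backing this job
    status: str                # database status, e.g. "running"
    last_processed_line: int   # last stdout event line already saved

@dataclass
class ReceptorStatus:
    active_units: set = field(default_factory=set)  # parsed from `receptorctl status`

def reconcile(db_jobs, worker_tasks, receptor):
    """Decide, per running job, whether to leave it alone, resume it, or reap it."""
    actions = {}
    for job in db_jobs:
        if job.status != "running":
            continue
        has_local_process = job.id in worker_tasks
        receptor_active = job.work_unit_id in receptor.active_units
        if has_local_process:
            actions[job.id] = "leave alone"            # normal case
        elif receptor_active:
            # Orphaned, not dead: resume event processing from the last saved
            # line instead of reaping the job.
            actions[job.id] = f"resume from line {job.last_processed_line}"
        else:
            actions[job.id] = "reap"                   # receptor lost it too
    return actions

# Example: services restarted on the controller node, the worker process is gone,
# but receptor still reports the work unit as active.
jobs = [JobRecord(id=42, work_unit_id="abc123", status="running", last_processed_line=317)]
print(reconcile(jobs, worker_tasks=set(), receptor=ReceptorStatus(active_units={"abc123"})))
# -> {42: 'resume from line 317'}
```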
Since I added the documentation tag, I figured I would include the relevant docs I have been perusing that may apply here:
Background Tasks in AWX
=======================
In this document, we will go into a bit of detail about how and when AWX runs Python code _in the background_ (_i.e._, **outside** of the context of an HTTP request), such as:
* Any time a Job is launched in AWX (a Job Template, an Ad Hoc Command, a Project
Update, an Inventory Update, a System Job), a background process retrieves
metadata _about_ that job from the database and forks some process (_e.g._,
`ansible-playbook`, `awx-manage inventory_import`)
* Certain expensive or time-consuming tasks running in the background
asynchronously (_e.g._, when deleting an inventory).
* AWX runs a variety of periodic background tasks on a schedule. Some examples
are:
- AWX's "Task Manager/Scheduler" wakes up periodically and looks for
`pending` jobs that have been launched and are ready to start running
- AWX periodically runs code that looks for scheduled jobs and launches
them
- AWX runs a variety of periodic tasks that clean up temporary files, and
performs various administrative checks
- Every node in an AWX cluster runs a periodic task that serves as…
(excerpt truncated)
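As a toy illustration of the pattern that excerpt describes (a background worker retrieves metadata about the job and forks a child process), and not AWX's actual dispatcher code, with a made-up job table standing in for the database:

```python
import subprocess

# Stand-in for the database; in AWX this would be the real job record.
FAKE_JOB_TABLE = {
    1: {"type": "job", "playbook": "site.yml", "inventory": "hosts"},
}

def run_in_background(job_id):
    job = FAKE_JOB_TABLE[job_id]           # retrieve metadata *about* the job
    if job["type"] == "job":
        cmd = ["ansible-playbook", "-i", job["inventory"], job["playbook"]]
    else:
        raise ValueError(f"no handler for job type {job['type']!r}")
    return subprocess.Popen(cmd)           # fork the child and return without blocking

# proc = run_in_background(1)   # would launch ansible-playbook if it is on PATH
```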
# Task Manager System Overview
The task management system is made up of three separate components:
1. Dependency Manager
2. Task Manager
3. Workflow Manager
Each of these runs in a separate dispatched task, and they can run at the same time as one another.
This system is responsible for deciding when tasks should be scheduled to run. When choosing a task to run, the considerations are:
1. Creation time
2. Job dependencies
3. Capacity
Independent tasks are run in order of creation time, earliest first. Tasks with dependencies are also run in creation time order within the group of task dependencies. Capacity is the final consideration when deciding to release a task to be run by the dispatcher.
## Dependency Manager
Responsible for looking at each pending task and determining whether it should create a dependency for that task.
(excerpt truncated)
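A toy sketch of the ordering the Task Manager excerpt describes (creation time first, dependencies grouped with their tasks, capacity as the final gate). This is not the actual task manager code; the task names and costs are made up:

```python
from dataclasses import dataclass, field

@dataclass
class PendingTask:
    name: str
    created: float                          # creation time; earliest runs first
    cost: int = 1                           # capacity units the task would consume
    depends_on: list = field(default_factory=list)

def schedule(pending, capacity):
    """Return the tasks to release to the dispatcher this cycle."""
    released, names_released = [], set()
    # 1. Creation time order, earliest first.
    for task in sorted(pending, key=lambda t: t.created):
        # 2. Only release a task whose dependencies were released earlier in this
        #    pass (a real implementation would also count already-finished ones).
        if any(dep not in names_released for dep in task.depends_on):
            continue
        # 3. Capacity is the final consideration.
        if task.cost > capacity:
            continue
        capacity -= task.cost
        released.append(task)
        names_released.add(task.name)
    return released

tasks = [
    PendingTask("project_update", created=1.0),
    PendingTask("job_run", created=2.0, cost=3, depends_on=["project_update"]),
    PendingTask("cleanup", created=3.0),
]
print([t.name for t in schedule(tasks, capacity=5)])
# -> ['project_update', 'job_run', 'cleanup']
```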
Last year we made an attempt at adding more resiliency to AWX around K8s node maintenance.
I made a YouTube video about it https://www.youtube.com/watch?v=EqYl2hDs90c
(like and subscribe and ring the bell )
This feature, in conjunction with Kubernetes Pod Disruption Budgets (see the "Disruptions" page in the Kubernetes docs), should allow you to do a rolling upgrade of the K8s cluster underneath without disruption to jobs.
I had a draft blog about how to do this (that I didn’t finish >.<)…
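For anyone trying this, here is roughly what the Pod Disruption Budget side can look like using the Kubernetes Python client. The namespace, PDB name, and label selector below are assumptions; match them to the labels your AWX pods actually carry:

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running in-cluster

# Keep at least one matching AWX pod available during voluntary disruptions
# such as `kubectl drain` for node maintenance.
pdb = client.V1PodDisruptionBudget(
    metadata=client.V1ObjectMeta(name="awx-task-pdb", namespace="awx"),
    spec=client.V1PodDisruptionBudgetSpec(
        min_available=1,
        # Placeholder selector: point it at your AWX task/control-plane pods.
        selector=client.V1LabelSelector(
            match_labels={"app.kubernetes.io/name": "awx-task"}
        ),
    ),
)
client.PolicyV1Api().create_namespaced_pod_disruption_budget(namespace="awx", body=pdb)
```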
(like and subscribe and ring the bell )
(already subscribed!)
I appreciate the feedback!
I have already seen the YouTube video (great video, btw!) and looked at pod disruption budgets.
They all work great, but they do not prevent disruption; they just manage expectations and tolerate some disruptions.
The part I am concerned about is that, while job templates are easy enough to handle without killing jobs (since those jobs are exposed as pods), workflow jobs are not exposed in the same way.
So if you bring down an AWX node that is involved in a workflow, it seems it can kill the workflow job (which may contain other workflows or regular jobs).
That's the part I haven't been able to work around.
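One way to gauge exposure before maintenance is to list running jobs and check which controller and execution nodes they are tied to. A quick sketch against the API (the URL and token handling are placeholders for your environment):

```python
import os
import requests

# Placeholders: point these at your AWX instance and a token with read access.
AWX_URL = os.environ.get("AWX_URL", "https://awx.example.com")
HEADERS = {"Authorization": f"Bearer {os.environ['AWX_TOKEN']}"}

# Running jobs report both the execution node (where the job pod runs) and the
# controller node (which can still take the job down if it restarts mid-run).
resp = requests.get(
    f"{AWX_URL}/api/v2/unified_jobs/",
    params={"status": "running"},
    headers=HEADERS,
)
resp.raise_for_status()
for job in resp.json()["results"]:
    print(job["id"], job["type"],
          "controller:", job.get("controller_node"),
          "execution:", job.get("execution_node"))
```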
Gotcha. Thanks for pointing out the gap. Let me discuss this with the team and see if we can figure something out. Can you put up an RFE and provide details of how you configure your environment and a “minimal” reproducer?
I did a fairly exhaustive breakdown in the GitHub issue here (including a walkthrough of the architecture with examples):
(This is the same GitHub issue embedded earlier in the thread.)
I'm happy to do an RFE if needed; let me know where to submit it and I can work something out.
Thanks again!
The one created by @AlanCoding works.