AWX Job Workflow Management

I would like to better understand and address how Workflow Jobs are managed in AWX.

I commented on a GitHub issue where I thought it was appropriate, but I realize the community would have a hard time finding it there.

While the GitHub issue has my information and questions, along with an example architecture, I’m curious how others have managed to do maintenance on AWX on K8s while preventing unexpected job termination within workflows.

Even if a workflow’s jobs are running on AWX K8s nodes that are not undergoing maintenance, bringing down a controller node in AWX can terminate those workflow jobs on the other nodes.

Much more information is in the relevant GitHub issue below:

Since I added the documentation tag, I figured I would include relevant docs I have been perusing which may apply here:

Last year we made an attempt at adding more resiliency to AWX around K8s node maintenance.

I made a YouTube video about it: https://www.youtube.com/watch?v=EqYl2hDs90c

(like and subscribe and ring the bell :wink:)

This feature, in conjunction with pod disruption budgets (see the “Disruptions” page in the Kubernetes docs), should allow you to do a rolling upgrade of the underlying K8s cluster without disruption to jobs.
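For anyone who has not set one up before, here is a minimal sketch of what a PodDisruptionBudget manifest could look like, built in Python and printed as JSON (which `kubectl apply -f -` accepts). The name, namespace, and selector label are hypothetical placeholders, not values AWX is known to use, so check the labels on your own pods first. A PDB only limits voluntary evictions such as a node drain; it does nothing for involuntary disruptions.

```python
import json


def build_pdb(name, namespace, match_labels, min_available=1):
    """Return a policy/v1 PodDisruptionBudget manifest as a plain dict.

    All argument values are supplied by the caller; nothing here is
    specific to AWX.
    """
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Evictions are refused while fewer than this many matching
            # pods would remain available.
            "minAvailable": min_available,
            "selector": {"matchLabels": match_labels},
        },
    }


if __name__ == "__main__":
    # Hypothetical name, namespace, and label: replace with your own.
    pdb = build_pdb("example-awx-pdb", "awx", {"app": "example-awx-pod"})
    print(json.dumps(pdb, indent=2))
```

Piping the output into `kubectl apply -f -` would create the budget; whether it selects the pods you care about depends entirely on the labels you pass in.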

I had a draft blog about how to do this (that I didn’t finish >.<)…


(like and subscribe and ring the bell :wink:)

(already subscribed!)

I appreciate the feedback! :heart_eyes:

I have already seen the YouTube video (great video, btw!) and looked at pod disruption budgets.

They all work great, but they do not prevent disruption; they only manage expectations and tolerate some disruption.

My concern is that, while job templates are easy enough to handle without killing jobs (since all jobs are exposed as pods), workflow jobs are not exposed in the same way.

So, if you bring down an AWX node that is part of a workflow, it seems that it can kill the workflow job (which may contain other workflows or regular jobs).

That’s the part I haven’t been able to work around. :frowning:
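Since workflow jobs do not show up as pods, the only visibility I know of is the AWX REST API. As a rough sketch (not a fix, and not something from the AWX docs as a maintenance procedure), a pre-maintenance check could ask the API how many workflow jobs are running and hold off the drain while the count is above zero. The base URL and token are hypothetical placeholders, and I am assuming an OAuth2 token sent as a Bearer header.

```python
import json
import urllib.request
from urllib.parse import urlencode, urljoin


def running_workflows_url(base_url):
    """Build the AWX API URL listing workflow jobs with status=running."""
    query = urlencode({"status": "running", "page_size": 1})
    endpoint = urljoin(base_url.rstrip("/") + "/", "api/v2/workflow_jobs/")
    return endpoint + "?" + query


def count_running(payload):
    """Read the total from a paginated AWX list response."""
    return int(payload.get("count", 0))


def fetch_running_workflow_count(base_url, token):
    """Query AWX for the number of running workflow jobs.

    Assumes `token` is a valid AWX OAuth2 token (hypothetical here).
    """
    request = urllib.request.Request(
        running_workflows_url(base_url),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return count_running(json.load(response))
```

This only tells you whether it is a bad moment to take a control node down. It does not stop a workflow from starting right after the check, so it narrows the window rather than closing it.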

Gotcha. Thanks for pointing out the gap. Let me discuss with the team and see if we can figure something out. Can you put up an RFE and provide details of how you configure your environment, along with a “minimal” reproducer?

I did a fairly exhaustive breakdown in the GitHub issue here (including breaking down the architecture with examples):

I’m happy to do an RFE if needed; let me know where to submit it and I can work something out.

Thanks again!

The one created by @AlanCoding works.
