Relaunch a failed job automatically in AWX

I can’t seem to find any mechanism for this. What I need is for a failed job to be relaunched automatically (as you can do in the UI).

We run AWX on Kubernetes in AWS, and sometimes jobs fail because a container is killed or some other transient issue occurs. If the job were simply relaunched, it would succeed.

What I am looking for is some flag or setting to relaunch the job on failure.

We’ve solved this by utilizing a workflow template with a “run on fail” step to rerun the task we know can fail.

Other ideas could involve blocks with rescue statements, or using an event-based trigger to relaunch the job by number, but these seem like more hassle than they’re worth. Curious what others have to say, as we’ve run into similar issues with inventory syncs and certain systems where a rerun of the problem task fixes the issue 90% of the time.
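For anyone weighing the rescue route, a minimal sketch of a block-with-rescue retry might look like the following. The task names and the command are placeholders I’ve invented for illustration, not anything from the setup above:

    - name: Run the fragile task, retrying once on failure
      block:
        - name: First attempt
          ansible.builtin.command: /usr/local/bin/fragile-job  # placeholder for the real task
      rescue:
        - name: Relaunch the task after a failure
          ansible.builtin.command: /usr/local/bin/fragile-job  # same placeholder task

Note that this only catches task-level failures within a single job run; it cannot recover from the AWX container itself being killed mid-job.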

Best regards,

Joe


This caught my eye because I had just come from approving a commit from a colleague involving

  register: mw_gitlab_gitlab_restart_result
  retries: 3
  delay: 10
  until: mw_gitlab_gitlab_restart_result.rc == 0

But if your issue is the unexpected termination of the AWX container running the task, then retries (or block/rescue, for that matter) aren’t going to help, since the whole playbook run dies with the container.

We have a few “fragile points” as well, particularly in our post-commit pipelines on Jenkins. It’s just frequent enough to be annoying but not so annoying as to be intolerable, so nobody has taken the time to understand what’s actually failing.

Putting all my blathering aside, the re-run on failure workflow outlined by @trippinnik above is probably your best next step to get this working. After that, try to figure out what’s killing your containers, because working around this symptom won’t fix the problem.


Hi all,

I have a similar scenario where approximately 20-30% of the hosts in my inventory are consistently offline due to specific business constraints. Whenever an AWX job template runs against these hosts, the entire job is marked as Failed even if just one host is unreachable.

Currently, I’m considering creating a workflow template that first generates a dynamic host list, excluding the unreachable hosts, and then runs the actual automation. However, there’s still a possibility that hosts might become unreachable during execution, causing the entire job to be reported as failed.
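As a sketch of that “generate a dynamic host list” step, a preflight play could probe connectivity and collect the responsive hosts into an in-memory group that the real play then targets. The group name and tasks here are illustrative, and this assumes that with `ignore_unreachable` the registered result on a dead host carries an `unreachable` flag:

    - hosts: all
      gather_facts: false
      ignore_unreachable: true
      tasks:
        - name: Probe connectivity
          ansible.builtin.ping:
          register: probe

        - name: Collect the hosts that answered into a group
          ansible.builtin.group_by:
            key: reachable
          when: probe.unreachable is not defined

    - hosts: reachable
      tasks:
        # ... the actual automation, now limited to reachable hosts ...
        - name: Placeholder task
          ansible.builtin.debug:
            msg: "running on {{ inventory_hostname }}"

Hosts that fall over mid-run are still a gap, but the preflight at least keeps known-offline hosts from failing the job at the start.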

Is there a way to mark the job as Successful, but still have visibility into the hosts that failed? Ideally, I’d like to easily identify these failed hosts and re-launch the job specifically against them.

Any suggestions or best practices for handling this scenario?

I found this article: How to ignore errors for unreachable hosts in AWX. I will try to test some of the scenarios there, but I was wondering whether I can handle this in some other, more robust way.

Looking at Ignoring unreachable host errors, have you tried

- hosts: all
  ignore_unreachable: true
  tasks:
    […]

    # then, as your last task,
    - name: Drop unreachable host errors
      ansible.builtin.meta: clear_host_errors

This may have unintended consequences in your case, @RaveoNmooN , if there’s a possibility that unreachable hosts become reachable during execution: a host that comes back mid-run may execute later tasks without having run the earlier ones.