Exit a role without failure (stop processing the role on a condition) without ending the play or stopping execution of other roles in the stack / ensure all roles in a stack run even if one ends with failure

Hi,

I'm trying what I believe to be a common use case where a playbook has a set of roles to configure a system.

It looks something like this:

```yaml
- name: Do X, Y and Z
  hosts: all
  roles:
    - role: roleA
      var1: value1
      var2: value2
    - role: roleB
      var1: value1
      var2: value2
    - role: roleC
      var1: value1
      var2: value2
```

I found two issues with this setup.

1 - I have checks in each role that end_host when there is no need to run the rest of the role. I understand that with a correctly written (idempotent) playbook this is not necessary, but I found it very useful: it saves a lot of execution time, especially when the play is big and the inventory even bigger, and it also helps when you create timestamped audit logs that should only be written if a change happened. I also considered breaking the tasks into more files so they can be included on condition, but depending on the role that can get a bit messy and hard to follow for other contributors.
I've been using end_host, but it ends the play, not the role. Is there a way to end the role and not the play, and most of all not stop processing of the other roles in the list?

2 - If a role fails due to an unhandled error, fail that role but, again, not the play, and most of all do not stop processing of the other roles in the list.

I've looked into ignore_errors, but again that works at the play level and not the role level. If roleA fails, ignore_errors will allow the next play to be executed, but not roles B and C.

So far the only solutions I can think of are:

  • Have a play for each role - not the end of the world, just a lot of unnecessary code
    - in AWX, have a playbook/job template for each role and string them together using a workflow - again a lot of unnecessary code and potentially a lot of job templates.

Is there a way to deal with the two issues mentioned above? What is the standard approach to dealing with roles failing, or exiting roles on a condition, when they are executed in a stack/list?

Thanks!

You can do a sanity check at the start of each role; if the end result is already good, you can skip everything in between by wrapping the tasks in blocks and using a when condition on the blocks. It adds some "unnecessary" code, but it also gives you the ability to "chunk" the role up and add an overall timestamp on each chunk.
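A minimal sketch of that idea; `desired_state_ok` is a hypothetical variable that an earlier sanity-check task would set, not something from this thread:

```yaml
# Chunk of role work, gated by a sanity check and timestamped.
# desired_state_ok is a hypothetical fact set by an earlier check task.
- name: Configuration chunk, skipped when the end state is already good
  when: not (desired_state_ok | default(false))
  block:
    - name: Timestamp this chunk for the audit log
      ansible.builtin.set_fact:
        chunk_start: "{{ now(utc=true).isoformat() }}"

    - name: Actual work for this chunk
      ansible.builtin.debug:
        msg: "work happens here"
```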

Then set a failed_when condition on the items you can safely ignore. Again, it's a bit outside the idempotent method, as you shouldn't be failing, but in some cases that's a generally acceptable way to fail forward.

Generally speaking, I prefer to sanity check for the end result on anything that could "fail" to apply or perform correctly, then use the sanity-check variable in a when clause on the actual work.

i.e.: check for an application version, use failed_when so the check doesn't fail the host, register the output, then use the result in the next task to install the application… 1 extra task, 1 extra line on the actual work task.
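For illustration, a minimal sketch of that pattern; the application name and version are made up:

```yaml
# Sanity check: never fails the host, just records what is installed.
- name: Check installed application version
  ansible.builtin.command: myapp --version   # hypothetical binary
  register: myapp_version
  changed_when: false
  failed_when: false

# Actual work: one extra "when" line driven by the check above.
- name: Install application when missing or outdated
  ansible.builtin.package:
    name: myapp
    state: present
  when: myapp_version.rc != 0 or '1.2.3' not in myapp_version.stdout
```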

Thank you for your suggestion.

I'm afraid it does not help, though. Blocking this way in a large role makes it quite complex and adds a lot of code, making it harder to read and maintain.

Failing, same as end_host, will prevent other roles in the stack from being executed, which is really my main issue.

I do also have checks, which are really the source of this issue. Consider the following: if condition A is true, exit; else continue checking condition B (if true, exit; else continue), and so on. Now you have 8 or 10 of those. It's much cleaner to end the role when a condition is met (or not met, depending on what you are trying to achieve) than to have a block for each of those conditions.
A simple example might be roleA supporting only RHEL while roleB supports only Ubuntu. Let's say we are installing something that is only available on one of them (a more realistic case would involve distro versions, but let's use distros to keep it simple).

Playbook:

```yaml
- name: Install A and B
  hosts: linux_servers
  roles:
    - roleA # Installs software A that only supports RHEL
    - roleB # Installs software B that only supports Ubuntu
```

Both roles have a condition at the top of the main tasks file that checks whether this is a supported OS and exits if it is not.
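A minimal sketch of such a guard at the top of roleA's tasks/main.yml (assuming facts have been gathered); note that end_host ends the remainder of the play for that host, not just the role, which is exactly the problem:

```yaml
# roleA/tasks/main.yml -- guard against unsupported OS.
# end_host stops the whole PLAY for this host, not just this role.
- name: End host if this is not RHEL
  ansible.builtin.meta: end_host
  when: ansible_facts['distribution'] != 'RedHat'
```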

In that case, if we execute roleA on a system that is not RHEL, we will never get to execute roleB on the same server. This means Ubuntu servers will never be managed.

Again, I know this can be done with include_tasks for the specific distro without needing to end the role, but this is a very simple example to illustrate that it often makes more sense to end the role when a condition is (or is not) met than to try to work around it.
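A minimal sketch of that include_tasks approach; the per-distro file names are hypothetical:

```yaml
# Dispatch to a distro-specific task file, e.g. redhat.yml or ubuntu.yml.
# On a distro with no matching file, this include simply fails,
# which is the failure mode described below.
- name: Include distro-specific tasks
  ansible.builtin.include_tasks: "{{ ansible_facts['distribution'] | lower }}.yml"
```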

Even in this simple example, if we had include_tasks for specific distros and suddenly ran on a distro we had not included in our logic, the role would fail. This again results in skipping all other roles in the stack (end_host if not RHEL, where we don't care what distro it actually is, versus include_tasks for RHEL or Ubuntu, where we need one file for each distro we might run the role on, and there are many 🙂).

Hopefully this makes sense 🙂

Lukasz

Have you considered breaking each role out into its own playbook, then combining the playbooks into a single workflow template? You can transfer all the data you need between playbooks with the set_stats module, and you can set each node to run whether the previous one succeeds or fails.
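For illustration, a minimal sketch of such a hand-off; the variable names are made up:

```yaml
# Last task of one playbook in the workflow: publish data that the
# next workflow node can read as ordinary variables.
- name: Pass results to the next job template in the workflow
  ansible.builtin.set_stats:
    data:
      baseline_rolea_done: true
    per_host: false
```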

Yeah, I'd do this as a workflow job and either let smart inventory handle the whole mess or set them all to run regardless of failure state; that's the only way I can see to make something that's easy to maintain and that works.

Otherwise you are going to have to set the variables and do the when checks. Frankly, that isn't as cumbersome as it sounds, because you can reuse the variables, and if you take a holistic approach it isn't a bad option…

Regardless, you are going to be adding/maintaining a larger stack of code.

The question is really HOW you want to be maintaining it for the future.

If you are installing packages, use package instead of rpm or apt. If you are checking for things based on OS version, you have to sanity check it. If your environment is that diverse, you have to go with what makes sense in the long haul.
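For example, a minimal sketch using the generic package module; the package name is just an example:

```yaml
# Works on both RHEL and Ubuntu families without distro branching,
# as long as the package name is the same on both.
- name: Install package regardless of distro family
  ansible.builtin.package:
    name: htop
    state: present
```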

Thank you for all the comments and advice.

I did consider using workflows, and in general that's not the worst idea; however, it requires every role to be a job template, which again might not be the end of the world but is also not always necessary. My biggest problem with this approach is the mess that workflows and slices produce when it comes to logging. In my example I have 11 roles to execute as part of a baseline, and this will only grow with time; each job is executed in 4 slices due to the size of the inventory. That means going through the logs to find information about a particular host across all jobs (of which there will be 44) is not going to be fun. 🙂

Still, I agree all your suggestions are viable options, and probably the best that can be done if there really is no way to just end a role rather than the play. That seems like pretty handy functionality to have. Another option would be a way to change the default behaviour of processing a list of roles, so that hosts for which a role failed/ended are not removed from the inventory (marked as hosts with errors) used by the next role. It would be even nicer to be able to run meta: clear_host_errors between stacked roles.

Once again thanks for replies.

Lukasz

In case anyone finds this helpful, I ended up doing the below. From initial testing it seems to do what I need, but further testing is required.

ignore_errors at the play level allows using end_host inside the roles, and the rescue block clears all failed hosts, making sure the next play targets the full inventory.

```yaml
- name: Play1
  hosts: all
  gather_facts: false
  ignore_errors: true
  tasks:
    - name: Stuff1
      block:
        - name: Include stuff1
          ansible.builtin.include_role:
            name: stuff
      rescue:
        # Reset failed hosts so the following play targets all of them.
        - name: Clear errors
          ansible.builtin.meta: clear_host_errors

- name: Play2
  hosts: all
  gather_facts: false
  ignore_errors: true
  tasks:
    - name: Stuff2
      block:
        - name: Include stuff2
          ansible.builtin.include_role:
            name: stuff2
          vars:
            var1: 'XXX'
      rescue:
        - name: Clear errors
          ansible.builtin.meta: clear_host_errors
```

Lukasz

Well found. Thank you for sharing back. This is a thing of beauty.

Kevin