I opened an issue that was closed because the developer believes the problem is a race condition that can’t be dealt with in code. To demonstrate it, I created a playbook in which two hosts, both mounting the same shared file system, test for the existence of a file. The playbook starts with the file present in the shared file system and then executes the following steps:
1. stat the file on both hosts (output shows the file exists on both)
2. remove the file from host1, using a when option to limit the action to that host (output shows “changed” for host1 and “skipping” for host2)
3. stat the file on both hosts (output shows the file does not exist on either)
4. create the file on host2, using a when option to limit the action to that host (output shows “skipping” and “changed” again, but with the hosts reversed relative to step 2)
5. stat the file on both hosts (output shows the file exists on host2 but not on host1)
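For reference, the sequence above can be sketched roughly as the playbook below. This is a reconstruction rather than the exact playbook from the issue: the group name `shared_fs_hosts`, the path `/mnt/shared/testfile`, and the choice of the `file` module for removing and creating the file are all assumptions made for illustration.

```yaml
# Sketch only. The group name, file path, and registered variable names
# are hypothetical; host1 and host2 are the inventory names used above.
- hosts: shared_fs_hosts
  gather_facts: false
  vars:
    shared_file: /mnt/shared/testfile   # assumed path on the shared file system
  tasks:
    - name: Step 1 - stat the file on both hosts
      stat:
        path: "{{ shared_file }}"
      register: st_before

    - name: Step 2 - remove the file from host1 only
      file:
        path: "{{ shared_file }}"
        state: absent
      when: inventory_hostname == 'host1'

    - name: Step 3 - stat the file on both hosts
      stat:
        path: "{{ shared_file }}"
      register: st_removed

    - name: Step 4 - create the file on host2 only
      file:
        path: "{{ shared_file }}"
        state: touch
      when: inventory_hostname == 'host2'

    - name: Step 5 - stat the file on both hosts
      stat:
        path: "{{ shared_file }}"
      register: st_created

    - name: Report what each host sees after step 5
      debug:
        msg: "{{ inventory_hostname }} sees exists={{ st_created.stat.exists }}"
```

With a sketch like this, the final debug task is where the discrepancy would show up: host2 reporting that the file exists and host1 reporting that it does not.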
I don’t understand how this is a race condition. This isn’t a case where something outside Ansible is creating the file. The task that creates the file clearly completes before the stat task that checks for the file’s existence starts. Further, the check for the file’s existence runs concurrently on both machines, and the stat task on the host that created the file sees it whereas the stat task on the other host does not. A race condition would imply that the machine that doesn’t see the file checked before the task that creates the file had finished.
While it’s possible that I’m being fooled by the order of the “failing” stat output in step 5, past experience tells me that the tasks in step 5 won’t be executed by any host until all hosts have completed step 4. I had a set of WebSphere patches that I needed to apply to both Linux and Windows hosts. Though the patches were installed in exactly the same manner, the tasks differed between the two types of hosts. The Linux task had a when option for the Linux OS type, and it was followed by the Windows task with a when option for the Windows OS type. Though all of the Linux machines processed their task in parallel, the Windows machines didn’t start until the Linux machines had completed. To get both to run in parallel, I had to add an async option to both tasks and then add more tasks to wait for the results.
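The async workaround I’m describing looked roughly like the sketch below. It is not the original task list: the installer paths, timeout values, and registered variable names are made up for illustration, the `when` conditions assume facts have been gathered, and I’m assuming the default behaviour where every host finishes a task before any host starts the next one.

```yaml
# Sketch only. Installer paths, timeouts, and variable names are hypothetical.
- name: Start the patch on Linux hosts without waiting for it to finish
  command: /opt/patches/install.sh
  async: 3600     # allow the job up to an hour
  poll: 0         # return immediately so the play moves on to the next task
  register: linux_patch
  when: ansible_system == 'Linux'

- name: Start the patch on Windows hosts without waiting for it to finish
  win_command: 'C:\patches\install.bat'
  async: 3600
  poll: 0
  register: windows_patch
  when: ansible_os_family == 'Windows'

- name: Wait for the Linux patch job
  async_status:
    jid: "{{ linux_patch.ansible_job_id }}"
  register: linux_result
  until: linux_result.finished
  retries: 120
  delay: 30
  when: ansible_system == 'Linux'

- name: Wait for the Windows patch job
  async_status:
    jid: "{{ windows_patch.ansible_job_id }}"
  register: windows_result
  until: windows_result.finished
  retries: 120
  delay: 30
  when: ansible_os_family == 'Windows'
```

Because the two start tasks return immediately, the Windows hosts begin patching while the Linux jobs are still running, and the two wait tasks then collect the results.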
If I’m confused, please set me straight so I understand how I’m creating a race condition. Otherwise, I’d like to reopen the issue.