Hello
I believe my team and I have come across an Ansible bug where variables passed in via the --extra-vars parameter are being ignored or overwritten in favor of values from previous runs of the same playbook role.
I need some advice on how we can verify if we are seeing a bug, what logs or files we can gather, and if there are any settings we can adjust.
Details:
The most recent example of this issue shed a little more light on the problem.
In this case, we have 2 Ansible playbook automation flows that do different work, but do share some common Ansible roles. However, the variable values passed to the common roles are different because they need different settings.
To illustrate:
Automation flow 1 runs these roles: A, B, C
Automation flow 2 runs these roles: D, B, E
Automation flow 1 ran earlier in the day and completed normally. Automation flow 2 ran later in the day and hit an error.
The common role, “B”, copies a file over to a host from local based on a file name passed in via extra-vars, the “file_name” variable. We just so happen to generate these files with names that contain an ID we can use to tie them to individual automation flow runs.
Example file_name values:
Automation_Flow1_ID123_copyfile
Automation_Flow2_ID456_copyfile
The error returned by Automation flow 2 said that the file could not be found: role “B” was looking for a file named “Automation_Flow1_ID123_copyfile”, which is the “file_name” value from Automation flow 1.
Through our logging and the Ansible artifacts folder command file, we were able to determine that the correct “file_name”, Automation_Flow2_ID456_copyfile, was passed into extra-vars.
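For anyone who wants to repeat this check, here is a minimal sketch of how the extra-vars can be pulled out of the recorded command. It assumes the artifacts "command" file is a JSON document with a "command" key holding the argument list, which is what we see in our artifacts but should be verified for your ansible-runner version. The function names are hypothetical helpers, not part of ansible-runner.

```python
import json
from pathlib import Path


def load_command_file(artifact_dir):
    """Load <artifact_dir>/command (assumed to be JSON, see note above)."""
    return json.loads((Path(artifact_dir) / "command").read_text())


def extra_vars_args(command_doc):
    """Return every value passed via -e / --extra-vars in the recorded argv."""
    argv = command_doc.get("command", [])
    values = []
    for i, arg in enumerate(argv):
        if arg in ("-e", "--extra-vars") and i + 1 < len(argv):
            # Value is the next argument: -e '{"file_name": "..."}'
            values.append(argv[i + 1])
        elif arg.startswith("--extra-vars="):
            # Value is attached: --extra-vars=@/path/to/file
            values.append(arg.split("=", 1)[1])
    return values
```

If more than one value comes back (for example an "@file" reference plus an inline JSON string), that would be worth noting in the bug report, since several extra-vars sources would then be in play for the same run.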
Re-running Automation flow 2 consistently reproduced the error. It kept picking up the “file_name” value from when Automation flow 1 ran the common role “B” earlier. We run the automation on pods, so I redeployed the pod with a clean image and Automation flow 2 functioned properly after that.
We have hit this issue 2 or 3 times in the past few months. It happened in similar situations, but with other roles. Each time, redeploying the pod resolved the issue. There seems to be some cached value that gets propagated to future runs of the same role.
We are not sure how to trigger this issue from a clean state; it just seems to happen after a certain period of time. Once it does happen, we can consistently reproduce the error by re-running the automation flows. I did back up the artifacts folder from the most recent occurrence in case that is helpful for further review.
When we do hit this issue, I would like some advice on which files to look at and what evidence to gather so we can investigate this further.
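One piece of evidence we could collect ourselves is where the stale value is stored on the pod. The sketch below is only a plain recursive text search, not an Ansible feature: it walks a directory tree and reports every file and line containing a given string, such as the old “file_name” value. The function name and the size limit are hypothetical choices.

```python
import os


def find_value_in_tree(root, needle, max_bytes=5_000_000):
    """Return a list of (path, line_number) for lines under root containing needle."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # Skip very large files to keep the scan quick on a busy pod.
                if os.path.getsize(path) > max_bytes:
                    continue
                with open(path, "r", errors="replace") as fh:
                    for lineno, line in enumerate(fh, start=1):
                        if needle in line:
                            hits.append((path, lineno))
            except OSError:
                # Unreadable or vanished files are ignored.
                continue
    return hits
```

Running this against the directory ansible-runner works in (and against the backed-up artifacts folder) with "Automation_Flow1_ID123_copyfile" as the needle should show whether the old value is sitting in a file that a later run could pick up.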
Additional context:
Our use case is managing the lifecycle of hosts and their respective pieces of software at scale: deploy, configure, delete. We run the Ansible automation often and against multiple hosts at once, so scale may be a contributing factor. Maybe there is an overloaded artifacts or caching file somewhere?
We execute Ansible playbooks from Python using ansible-runner, and this all runs in containers/pods.
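One experiment we are considering, to rule out state carried over between runs, is giving every run its own private data directory and run identifier. This is only a sketch under the unconfirmed assumption that a reused directory is involved. `build_run_kwargs` and `run_isolated` are hypothetical helpers, while `private_data_dir`, `project_dir`, `playbook`, `extravars`, and `ident` are existing keyword arguments of `ansible_runner.run`.

```python
import tempfile
import uuid


def build_run_kwargs(playbook, extravars, project_dir):
    """Build ansible_runner.run() kwargs that share no directory with earlier runs."""
    return {
        # Fresh directory per run, so no env/ or artifacts/ content is inherited.
        "private_data_dir": tempfile.mkdtemp(prefix="runner_"),
        "project_dir": project_dir,
        "playbook": playbook,
        "extravars": dict(extravars),
        # Unique run identifier, so artifacts never collide between runs.
        "ident": uuid.uuid4().hex,
    }


def run_isolated(playbook, extravars, project_dir):
    """Run a playbook with isolated state (requires ansible-runner installed)."""
    import ansible_runner  # imported lazily so build_run_kwargs works without it

    return ansible_runner.run(**build_run_kwargs(playbook, extravars, project_dir))
```

If the problem stops appearing with isolated directories, that would point at leftover files rather than at Ansible core, which would change where the bug report belongs.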
I have to be a little vague on the details, as the code my team and I develop is closed source and proprietary. However, what we are doing are common Ansible automation tasks such as logging into a number of hosts, configuring some settings, copying files around, running scripts, etc.
I will provide what information I can. My intent is to open a bug report with enough detail for proper debugging if we can verify that we are facing a real issue.
Versions:
ansible-playbook [core 2.15.13]
python version = 3.10.8
Thanks!