Investigating extra-vars Stale Value Bug

Hello

I believe my team and I have come across an Ansible bug where variables passed in via the --extra-vars parameter are being ignored or overwritten in favor of values from previous runs of the same playbook role.

I need some advice on how we can verify if we are seeing a bug, what logs or files we can gather, and if there are any settings we can adjust.

Details:

The most recent example of this issue shed a little more light on the problem.
In this case, we have 2 Ansible playbook automation flows that do different work, but do share some common Ansible roles. However, the variable values passed to the common roles are different because they need different settings.

To illustrate:

Automation flow 1 runs these roles: A, B, C
Automation flow 2 runs these roles: D, B, E

Automation flow 1 ran earlier in the day and ran normally and Automation flow 2 ran later in the day and hit an error.

The common role, “B”, copies a file over to a host from local based on a file name passed in via extra-vars, the “file_name” variable. We just so happen to generate these files with names that contain an ID we can use to tie them to individual automation flow runs.

Example file_name values:

Automation_Flow1_ID123_copyfile
Automation_Flow2_ID456_copyfile

The error Automation flow 2 returned said that “file_name” could not be found as it was looking for a file named “Automation_Flow1_ID123_copyfile”.

Through our logging and the Ansible artifacts folder command file, we were able to determine that the correct “file_name”, Automation_Flow2_ID456_copyfile, was passed into extra-vars.

Re-running Automation flow 2 consistently reproduced the error. It kept picking up the “file_name” value from when Automation flow 1 ran the common role “B” earlier. We run the automation on pods, so I redeployed the pod with a clean image and Automation flow 2 functioned properly after that.

We have faced this issue 2 or 3 times in the past few months and it happened in similar situations, but other roles. Redeploying the pod resolved the issue. There seems to be some cached value that gets propagated to future runs of the same role.

We are not sure how this issue can be reproduced, it just seems to happen after a certain period of time. Once it does happen, we can consistently reproduce the error when re-running the automation flows. I did backup the artifacts folder from the most recent issue in case that is helpful for further review.

When we do hit this issue, I would like some advice on what files we can look at or know what evidence we can gather to investigate this further.

Additional context:

Our use case is that we are managing the lifecycle of hosts and their respective pieces of software at scale. Deploy, configure, delete. We run the Ansible automation often and against multiple hosts at once. This may be a contributing factor due to scale. Maybe there is an overloaded artifacts or caching file somewhere?

We execute ansible playbooks from python using ansible-runner and this is all running on containers/pods.

I have to be a little vague on the details as the code my team and I develop is closed source and proprietary. However, what we are doing are common ansible automation tasks like, logging into a number of hosts, configuring some settings, copying files around, running scripts, etc.

I will provide what information I can and my intent is to open a bug report with enough details to do proper debug if we can verify we are facing a real issue.

Versions:
ansible-playbook [core 2.15.13]
python version = 3.10.8

Thanks!

I think this is too vague for anyone to really help. If you could provide any examples of your playbook, role B, or the copy task that would help.

You can also try running the playbook with verbose logging to see the values of variables as the play progresses. Either use the debug task to occasionally print a variable or run with -vvvv to dump a lot of info for every task.

1 Like

Thank you for your reply Mike.

Mostly what I am looking for are ideas on how I can gather evidence next time it happens. I know what I am seeing is a bug, but I’m not sure what I can do to effectively gather evidence for someone to investigate the problem.

I think the problem is happening at a level that is not apparent to the end user, so I am kind of just guessing at what files, folders, logs might be useful. It is happening before any playbook code is ran.

It just occurred to me, maybe a better question from my end is the following:

  • When someone executes an Ansible playbook with extra-vars, how does ansible compile all the variables before running any playbooks? Where does it load all the data from?

I know ansible does some work to load variables and various bits of info from other files so maybe that is some homework for me to do.

Debug wise:

From what we have been able to gather through debug task output is that the variable is wrong from the start. Putting a debug statement as the first step shows it is the stale value. We only read the value.

This also only happens uncommonly, most of the time it just works as intended.

Looking at the code there is one thing I notice and remember from previous times we saw this. The problematic playbook was running against a group of hosts, while the one without the issue was running against one host at a time.

Most notably is the difference in how we pass the file_name variable through to the role.

  • The working playbook adds one host and then imports role B and assigns the single host as the target. The role consumes the file_name variable directly.
  • The problematic playbook adds all the hosts and once, but defines as a host variable, the file_name variable since each host needs a different file. The overall playbook still imports role B, but is targeted at the group of hosts. Role B has to consume the file_name variable from the host info. So each host should have a unique value for file_name.

Maybe the above point narrows down the search area, but it still doesn’t explain how two completely separate playbooks run at different times, by different processes had interference with variable values. I just don’t understand how ansible is populating that value. How does it know what the previous value even was?

Playbook 1 is like this

- hosts: localhost
  connection: local
  gather_facts: yes
  tasks:
    - name: add host
      add_host:
        name: example_host
        ansible_ssh_host: "{{ hostname }}"
        ansible_ssh_user: "{{ user }}"
        ansible_ssh_pass: "{{ pass }}"
        ansible_become: yes
      no_log: True

- name: Run role B
  hosts: example_host
  tasks:
    - include_role:
        name: role_B
        tasks_from: main.yml  

Playbook 2 is like this

- hosts: localhost
  connection: local
  gather_facts: yes
  tasks:
    - name: add host
      add_host:
        groups: example_multi_hosts
        name: "{{ item.name }}"
        ansible_ssh_host: "{{ item.hostname }}"
        ansible_ssh_user: "{{ user }}"
        ansible_ssh_pass: "{{ pass }}"
        ansible_become: yes
        file_name: "{{ item.file_name }}"
      no_log: True
      with_items: "{{ multiple_host_info }}"


- name: run against multiple hosts
  hosts: example_multi_hosts
  strategy: free
  tasks:
    - include_role:
        name: role_B
        tasks_from: main.yml

Role B main.yml looks like this

- name: Set file path
  set_fact:
    example_file_path: "/example/file/path/{{ file_name }}"

- name: Show file path
  debug: msg="Example path {{ example_file_path }}"

- name: Copy example file over
  copy:
    src: "{{example_file_path}}"
    dest: /example/destination/path/file_name

..... Do more work

Hope this helps.

Its hard to say definitively how Ansible is getting variable values, since it depends a lot on the playbook, environment, and command line. There is a precedence list here, if you havnt already seen it.

Putting a debug statement as the first step shows it is the stale value.

I would run the playbook with -vvvv. At the start of the playbook, it will print out what extra vars were parsed at the command line. That would be useful to know; is the new input actually being passed in or not.

An extra-var takes precedence over everything for the duration of the playbook, even set_fact later on. So:

  1. If the stale value is not expected to be passed in, why is it being passed in?
  2. If the stale value is expected to be passed in, are we trying to overwrite the variable file_name somewhere (intentionally or unintentionally)?

If the new input actually is being parsed as an extra var, but then it changes value in the first task…well im not sure

The next time we hit the problem, I’ll make sure to give -vvvv a go to see if we can get some more debug info out of it. This is an intermittent problem, but it’ll probably happen again sooner rather than later.

In regards to the questions

If the stale value is not expected to be passed in, why is it being passed in?

In terms of our automation, I am certain that the correct extra-var value is being passed to Ansible. However, I’m very curious where ansible is pulling the stale value from. The stale value comes from an entirely different execution of a different playbook that only ran a common role. From a black box perspective, Ansible should have no idea the previous value existed.

If the stale value is expected to be passed in, are we trying to overwrite the variable file_name somewhere (intentionally or unintentionally)?

The extra-var value we pass in is the expected value, the stale value should not be there and is creating errors or misconfigurations. We are not overwriting it anywhere, beyond setting the value in the host vars, like I showed in my code snippet above. Even then, it should be the same value passed to extra-vars.

Additional notes:

I was able to dig up our previous work items where we saw this problem.
We saw:

  • When adding a host, Ansible used the correct name, but the hostname came from a stale value. i.e. it tried connecting to the wrong host despite being given different host info since it used stale host information
  • Another case where a different variable passed into extra-vars used the stale value.
    • This case also showed us that once this bug happens, all future runs of the playbook are stuck using the stale values.