Investigating extra-vars Stale Value Bug

Hello

I believe my team and I have come across an Ansible bug where variables passed in via the --extra-vars parameter are being ignored or overwritten in favor of values from previous runs of the same playbook role.

I need some advice on how we can verify if we are seeing a bug, what logs or files we can gather, and if there are any settings we can adjust.

Details:

The most recent example of this issue shed a little more light on the problem.
In this case, we have two Ansible playbook automation flows that do different work but share some common Ansible roles. However, the variable values passed to the common roles are different because the flows need different settings.

To illustrate:

Automation flow 1 runs these roles: A, B, C
Automation flow 2 runs these roles: D, B, E

Automation flow 1 ran earlier in the day and completed normally; Automation flow 2 ran later in the day and hit an error.

The common role, “B”, copies a file from the local machine to a host, based on a file name passed in via extra-vars as the “file_name” variable. We happen to generate these files with names that contain an ID we can use to tie them to individual automation flow runs.

Example file_name values:

Automation_Flow1_ID123_copyfile
Automation_Flow2_ID456_copyfile

The error Automation flow 2 returned said the file could not be found, because it was looking for a file named “Automation_Flow1_ID123_copyfile”.

Through our logging and the command file in the Ansible artifacts folder, we were able to determine that the correct “file_name”, Automation_Flow2_ID456_copyfile, was passed into extra-vars.

Re-running Automation flow 2 consistently reproduced the error. It kept picking up the “file_name” value from when Automation flow 1 ran the common role “B” earlier. We run the automation on pods, so I redeployed the pod with a clean image and Automation flow 2 functioned properly after that.

We have faced this issue 2 or 3 times in the past few months; it happened in similar situations, but with other roles. Redeploying the pod resolved the issue each time. There seems to be some cached value that gets propagated to future runs of the same role.

We are not sure how to reproduce this issue from scratch; it just seems to happen after a certain period of time. Once it does happen, we can consistently reproduce the error when re-running the automation flows. I did back up the artifacts folder from the most recent occurrence in case that is helpful for further review.

When we do hit this issue, I would like some advice on what files to look at and what evidence to gather to investigate this further.

Additional context:

Our use case is that we are managing the lifecycle of hosts and their respective pieces of software at scale. Deploy, configure, delete. We run the Ansible automation often and against multiple hosts at once. This may be a contributing factor due to scale. Maybe there is an overloaded artifacts or caching file somewhere?

We execute Ansible playbooks from Python using ansible-runner, and this is all running on containers/pods.
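Roughly, the way we call it looks like the sketch below. The paths, playbook name, and values are placeholders rather than our real code, but the shape is the same: we build a fresh extravars dict for every run and hand it to ansible-runner.

import ansible_runner

# Placeholder values; in the real automation these are generated per run
extra_vars = {
    "file_name": "Automation_Flow2_ID456_copyfile",
    "hostname": "target-host.example.com",
    "user": "automation_user",
    "pass": "********",
}

result = ansible_runner.run(
    private_data_dir="/runner/project",   # placeholder path
    playbook="automation_flow_2.yml",     # placeholder playbook name
    extravars=extra_vars,                 # passed fresh for every run
)

print(result.status, result.rc)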

I have to be a little vague on the details as the code my team and I develop is closed source and proprietary. However, what we are doing are common Ansible automation tasks like logging into a number of hosts, configuring some settings, copying files around, running scripts, etc.

I will provide what information I can, and my intent is to open a bug report with enough detail for proper debugging if we can verify we are facing a real issue.

Versions:
ansible-playbook [core 2.15.13]
python version = 3.10.8

Thanks!

I think this is too vague for anyone to really help. If you could provide any examples of your playbook, role B, or the copy task, that would help.

You can also try running the playbook with verbose logging to see the values of variables as the play progresses. Either use the debug task to occasionally print a variable or run with -vvvv to dump a lot of info for every task.
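For example, a task like this as the very first step of role B would show what value the role actually sees when it starts (I'm using the file_name variable you mentioned; adjust the name to match your role):

- name: Show what file_name the role sees
  debug:
    msg: "file_name is {{ file_name | default('UNDEFINED') }}"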


Thank you for your reply, Mike.

Mostly what I am looking for are ideas on how I can gather evidence next time it happens. I know what I am seeing is a bug, but I’m not sure what I can do to effectively gather evidence for someone to investigate the problem.

I think the problem is happening at a level that is not apparent to the end user, so I am kind of just guessing at what files, folders, and logs might be useful. It is happening before any playbook code is run.

It just occurred to me, maybe a better question from my end is the following:

  • When someone executes an Ansible playbook with extra-vars, how does Ansible compile all the variables before running any playbooks? Where does it load all the data from?

I know Ansible does some work to load variables and various bits of info from other files, so maybe that is some homework for me to do.

Debug wise:

From what we have been able to gather through debug task output, the variable is wrong from the start. Putting a debug statement as the first step shows it is the stale value. We only read the value.

This also happens only occasionally; most of the time it works as intended.

Looking at the code, there is one thing I notice and remember from previous times we saw this: the problematic playbook was running against a group of hosts, while the one without the issue was running against one host at a time.

Most notable is the difference in how we pass the file_name variable through to the role.

  • The working playbook adds one host and then imports role B and assigns the single host as the target. The role consumes the file_name variable directly.
  • The problematic playbook adds all the hosts at once, but defines the file_name variable as a host variable, since each host needs a different file. The overall playbook still imports role B, but is targeted at the group of hosts. Role B has to consume the file_name variable from the host's variables, so each host should have a unique value for file_name.

Maybe the above point narrows down the search area, but it still doesn’t explain how two completely separate playbooks, run at different times by different processes, had interference with variable values. I just don’t understand how Ansible is populating that value. How does it know what the previous value even was?

Playbook 1 is like this

- hosts: localhost
  connection: local
  gather_facts: yes
  tasks:
    - name: add host
      add_host:
        name: example_host
        ansible_ssh_host: "{{ hostname }}"
        ansible_ssh_user: "{{ user }}"
        ansible_ssh_pass: "{{ pass }}"
        ansible_become: yes
      no_log: True

- name: Run role B
  hosts: example_host
  tasks:
    # file_name is not defined per host here; role B reads it directly from
    # the value passed in via --extra-vars
    - include_role:
        name: role_B
        tasks_from: main.yml

Playbook 2 is like this

- hosts: localhost
  connection: local
  gather_facts: yes
  tasks:
    - name: add host
      add_host:
        groups: example_multi_hosts
        name: "{{ item.name }}"
        ansible_ssh_host: "{{ item.hostname }}"
        ansible_ssh_user: "{{ user }}"
        ansible_ssh_pass: "{{ pass }}"
        ansible_become: yes
        file_name: "{{ item.file_name }}"   # per-host value; each host should get its own file_name
      no_log: True
      with_items: "{{ multiple_host_info }}"


- name: run against multiple hosts
  hosts: example_multi_hosts
  strategy: free
  tasks:
    - include_role:
        name: role_B
        tasks_from: main.yml

Role B main.yml looks like this

- name: Set file path
  set_fact:
    example_file_path: "/example/file/path/{{ file_name }}"

- name: Show file path
  debug:
    msg: "Example path {{ example_file_path }}"

- name: Copy example file over
  copy:
    src: "{{ example_file_path }}"
    dest: /example/destination/path/file_name

..... Do more work

Hope this helps.

It's hard to say definitively how Ansible is getting variable values, since it depends a lot on the playbook, environment, and command line. There is a precedence list here, if you haven't already seen it.

Putting a debug statement as the first step shows it is the stale value.

I would run the playbook with -vvvv. At the start of the playbook, it will print out what extra vars were parsed from the command line. That would be useful to know: is the new input actually being passed in or not?

An extra-var takes precedence over everything for the duration of the playbook, even set_fact later on. So:

  1. If the stale value is not expected to be passed in, why is it being passed in?
  2. If the stale value is expected to be passed in, are we trying to overwrite the variable file_name somewhere (intentionally or unintentionally)?

If the new input actually is being parsed as an extra var, but then it changes value in the first task… well, I'm not sure.
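To illustrate the precedence point, here is a minimal sketch (a generic example, not your playbook). If you run it with ansible-playbook demo.yml -e file_name=from_cli, both debug tasks print from_cli, because set_fact cannot override an extra var for the rest of the play:

- hosts: localhost
  gather_facts: no
  tasks:
    - name: Show file_name before set_fact
      debug:
        msg: "file_name is {{ file_name }}"

    - name: Try to overwrite it with set_fact
      set_fact:
        file_name: "from_set_fact"

    - name: Show file_name after set_fact
      debug:
        msg: "file_name is {{ file_name }}"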

The next time we hit the problem, I’ll make sure to give -vvvv a go to see if we can get some more debug info out of it. This is an intermittent problem, but it’ll probably happen again sooner rather than later.

In regard to the questions:

If the stale value is not expected to be passed in, why is it being passed in?

In terms of our automation, I am certain that the correct extra-var value is being passed to Ansible. However, I'm very curious where Ansible is pulling the stale value from. The stale value comes from an entirely different execution of a different playbook; the only thing they share is a common role. From a black box perspective, Ansible should have no idea the previous value even existed.

If the stale value is expected to be passed in, are we trying to overwrite the variable file_name somewhere (intentionally or unintentionally)?

The extra-var value we pass in is the expected value; the stale value should not be there and is causing errors or misconfigurations. We are not overwriting it anywhere, beyond setting the value in the host vars as I showed in my code snippet above. Even then, it should be the same value passed to extra-vars.

Additional notes:

I was able to dig up our previous work items where we saw this problem.
We saw:

  • When adding a host, Ansible used the correct name, but the hostname came from a stale value; i.e., it tried connecting to the wrong host despite being given new host info, because it used stale host information.
  • Another case where a different variable passed into extra-vars used the stale value.
    • This case also showed us that once this bug happens, all future runs of the playbook are stuck using the stale values.

Do you have any caching setup in the ansible.cfg or some other cache-like mechanism?

I would also be curious if the issue happens when the same container/pod runs the playbooks. Is there a way to confirm that your containers are only used for running one job and then they are killed?

When adding a host, Ansible used the correct name, but the hostname came from a stale value; i.e., it tried connecting to the wrong host despite being given new host info, because it used stale host information.

Are you saying the inventory hostname appeared correct (it used a new value that was different from the old one), but the ansible_ssh_host was incorrect (it used the old value)?

Caching:
In terms of caching, ansible.cfg only has a few settings and the only setting around caching might be gathering = smart.

Otherwise, we do have an artifacts folder that Ansible automatically generates. That does seem to contain info on old runs, but it looks like each run has its own unique ID and folder. Each folder has a file that shows what command was run, some host info, and stdout/err. I’m not familiar with how the artifacts folders are used by Ansible.

Containers:
Our containers exist in the medium term and are not killed after running one job. They will run the same playbooks against different targets (different extra-vars) and are only refreshed when we push new code, are manually redeployed, or otherwise go down for one reason or another. This issue typically happens after the container has been running for a longer period of time; I'd guess 1 or 2 weeks, probably not longer than that.

I would also be curious if the issue happens when the same container/pod runs the playbooks.

Yes, even across different processes.

When I was looking at our old work items referencing this problem, there was one case where 2 different target environments were hitting the same stale-value error. This meant that 2 independent processes running the same playbook with different extra-vars values were both consuming the same stale value. Once this bug occurs, the error happens on each re-run until the container is redeployed.

One note, these playbooks get a ton of cycles. We have quite a large set of playbooks that, by the nature of our dev/test cycle, are run and re-run all the time.

Host name issue:

Are you saying the inventory hostname appeared correct (it used a new value that was different from the old one), but the ansible_ssh_host was incorrect (it used the old value)?

Taking a look at the work item again, I misunderstood the error because it dealt with the hostname variable. It is similar to the original issue I posted: the same playbook 2 I posted above, but another variable we define. The interesting thing here is that the target_hostname variable we define contains the same value as the hostname used for making connections. For context, we are delegating to a Windows machine to run some PowerShell for this task.

    - name: add host
      add_host:
        groups: example_multi_hosts
        name: "{{ item.name }}"
        ansible_ssh_host: "{{ item.hostname }}" <======
        ansible_ssh_user: "{{ user }}"
        ansible_ssh_pass: "{{ pass }}"
        ansible_become: yes
        file_name: "{{ item.file_name }}"
 ==>    target_hostname: "{{ item.hostname }}" <======
      no_log: True
      with_items: "{{ multiple_host_info }}"

Here is some of the debugging my team did.

We printed out the hosts, which were correct:

- name: DEBUG - showing hosts in the example_hosts group
  debug:
    msg: "{{ groups['example_hosts'] }}"

Output:
TASK [DEBUG - showing hosts in the example_hosts group] 
ok: [localhost] => {
    "msg": [
        "host_name_1",
        "host_name_2",
        "host_name_3",
        "host_name_4"
    ]
}

But when we printed out the host variable, it was a stale value:

TASK [configure_host : Showing target_hostname passed to powershell command] ****
ok: [host_name_2 -> windows_vm(...)] => {
    "msg": "Calling ps_script.ps1 for host_name_9" <== Totally wrong host name value
}

This is the code that uses target_hostname

- name: Run script on windows
  script: >
    "/file/path/ps_script.ps1"
    -targeted_host "{{ target_hostname }}"

That is very odd. I am not sure I can help much more; it seems like the issue is with parts of Ansible I don’t use much, if at all.

I'd guess some sort of write-to-file is going on, and then the wrong file is being ingested for… some reason.

Using the smart gathering strategy does make use of the caching mechanism, and I think it could be causing this. But you would need to be using a cache that persists; the default caching mechanism is memory and does not persist.
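Just as an illustration of what a persistent setup would look like (not something I know you have configured), an ansible.cfg along these lines would let gathered facts survive between runs:

[defaults]
gathering = smart
fact_caching = jsonfile
fact_caching_connection = /tmp/ansible_fact_cache
fact_caching_timeout = 86400

Note that the fact cache holds gathered facts (and set_fact results marked cacheable), not the extra vars themselves, but it seems worth ruling out.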

The artifacts could also be coming into play. I am really not familiar with them though.

Thank you Mike for taking the time to read my responses and reply with good suggestions and follow up questions. I appreciate it.

I will update this thread if I find out anything else and keep an eye on it if anyone else has any ideas of what might be going on here.