File operations failing on shared file system #67410

I opened an issue that was closed because the developer believes the problem is a race condition that can’t be dealt with in code. I created a playbook where two hosts that both mount a shared file system test for the existence of a file. The playbook starts with the file present in the shared file system. The playbook then executes the following steps:

  1. stat the file on both hosts (output shows the file is there)
  2. remove the file from host1 using a when option to limit the action to the desired host (output shows “skipping & changed”)
  3. stat the file on both hosts (output shows the file does not exist)
  4. create the file on host2 using a when option to limit the action to the desired host (output shows “skipping & changed” again, but on opposite hosts as step 2)
  5. stat the file on both hosts (output shows the file exists on host2 but not on host1)

I don’t understand how this is a race condition. This isn’t a case where something outside Ansible is creating the file. The task that creates the file clearly completes before the stat task that checks for the file’s existence is started. Further, the check for the file’s existence is run concurrently on both machines, and the task run on the host that created the file sees the file whereas the other does not. A race condition would imply that the machine that doesn’t see the file would have had to check before the task that creates the file finished.
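
Piecing together the tasks quoted later in this thread, the whole test playbook looks roughly like this (the shared_dir value is a placeholder; only the remove/create tasks are verbatim from the issue):

```yaml
- name: Test shared file issue
  hosts: dmgr:solr
  vars:
    shared_dir: /mnt/shared          # hypothetical mount point of the shared FS
  tasks:
    - name: stat the file from both             # step 1
      stat:
        path: "{{ shared_dir }}/testfile"

    - name: Remove the file from solr           # step 2
      file:
        path: "{{ shared_dir }}/testfile"
        state: absent
      when: "'solr' in group_names"

    - name: stat the file from both             # step 3
      stat:
        path: "{{ shared_dir }}/testfile"

    - name: create the file on dmgr             # step 4
      shell: 'echo "Hello World" > {{ shared_dir }}/testfile'
      when: "'dmgr' in group_names"

    - name: stat the file from both             # step 5
      stat:
        path: "{{ shared_dir }}/testfile"
```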

While it’s possible that I’m being fooled by the order of output of the “failing” stat output in step 5, past experience tells me that the tasks in step 5 won’t be executed by any host without all hosts in step4 being completed. I had a set of WebSphere patches that I needed to execute against both linux and windows hosts. Though the patches were installed in exactly the same manner, the tasks were different between the two types of hosts. The Linux based task had a when option for the Linux OS type and it was followed by the Windows task with a when option for the Windows OS type. Though all of the Linux machines processed the task in parallel, the Windows machines didn’t start until the Linux machines had completed. In order to get both to operate in parallel, I had to add an async option to both tasks and then add more tasks to wait for the results.
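
The async pattern described above looks roughly like this (a sketch: the command paths, timeouts, and OS-family conditions are illustrative, not from the original patch playbook):

```yaml
- name: Install WebSphere patch on Linux hosts
  command: /opt/patches/install.sh            # hypothetical patch command
  async: 3600                                 # allow up to an hour
  poll: 0                                     # fire and forget, don't block other hosts
  register: linux_job
  when: ansible_os_family != "Windows"

- name: Install WebSphere patch on Windows hosts
  win_command: C:\patches\install.cmd         # hypothetical patch command
  async: 3600
  poll: 0
  register: win_job
  when: ansible_os_family == "Windows"

- name: Wait for the Linux installs to finish
  async_status:
    jid: "{{ linux_job.ansible_job_id }}"
  register: linux_result
  until: linux_result.finished
  retries: 120
  delay: 30
  when: ansible_os_family != "Windows"

- name: Wait for the Windows installs to finish
  async_status:
    jid: "{{ win_job.ansible_job_id }}"
  register: win_result
  until: win_result.finished
  retries: 120
  delay: 30
  when: ansible_os_family == "Windows"
```

With poll: 0 both install tasks return immediately, so the Windows hosts no longer wait for the Linux hosts; the two async_status loops then collect the results.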

If I’m confused, please set me straight so I understand how I’m creating a race condition. Otherwise, I’d like to reopen the issue.

Clarification is needed. Why is the membership of "solr" in the list of
groups "group_names" tested when "solr" is a host!?

Quoting from the issue #67410:

- name: Test shared file issue
  hosts: dmgr:solr
  ...
  tasks:
  ...
  - name: Remove the file from solr
    file:
      path: "{{ shared_dir }}/testfile"
      state: absent
    when: "'solr' in group_names"

Thank you,

  -vlado

solr is not a host, it is a group. It just happens to be a group of one host. Likewise, dmgr is a group of one host. Consider the following inventory:

[dmgr]
host1.example.com

[solr]
host2.example.com

In our application model, dmgr and solr are two functions. In some environments they run on the same host, and in other environments they run on different hosts. I’m using when to limit the tasks to one host or the other. If the two functions happen to run on the same host, the when condition will always match. Where I discovered the problem, the task list was in a role, so creating separate plays with different hosts is not an option. This playbook is just a simplified playbook that exemplifies the problem. The tasks in the play are all executed against the host pattern dmgr:solr.
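
For the case where both functions live on one host, the inventory would instead look something like this (hostname illustrative); both when conditions then match on the same machine:

```ini
[dmgr]
host1.example.com

[solr]
host1.example.com
```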

This justifies the developer's conclusion:
https://github.com/ansible/ansible/issues/67410#issuecomment-590919487

  "This is a race condition between two different hosts modifying the same
  shared file, which is asking for trouble. There isn't much we can do in
  code to fix this."

At the same time, this reveals the problem in the code. The condition is
true for both hosts. (I have no idea why "solr" reports "skipping".)

  - hosts: dmgr:solr

    ...

    - name: create the file on dmgr
      shell: 'echo "Hello World" > {{ shared_dir }}/testfile'
      when: "'dmgr' in group_names"

    TASK [create the file on dmgr]

Disregard this email. Wrong assumption about "group_names". Sorry for the
noise.

  -vlado

> solr is not a host, it is a group. It just happens to be a group of one
> host. Likewise, dmgr is a group of one host. Consider the following
> inventory:
>
> [dmgr]
> host1.example.com
>
> [solr]
> host2.example.com
>
> This justifies the developer’s conclusion:
> https://github.com/ansible/ansible/issues/67410#issuecomment-590919487
>
> “This is a race condition between two different hosts modifying the same
> shared file, which is asking for trouble. There isn’t much we can do in
> code to fix this.”

I still haven’t figured out why this justifies the developer’s conclusion. From the output, the task that creates the file clearly happens before the task that checks for its existence. The output included in the case shows the order in which things happen:

TASK [create the file on dmgr] *************************************************
skipping: [SOLR REDACTED]
changed: [DMGR REDACTED]

TASK [stat the file from both] *************************************************
ok: [SOLR REDACTED] 
ok: [DMGR REDACTED] 

From the output, it seems that the stat task doesn’t get kicked off until the “changed” result completes from the create task above it. Further, I’m basing this on a rather old (2014) Stack Overflow response which states:
> As mentioned before: By default Ansible will attempt to run on all hosts in parallel, but task after task (serial).

So I was assuming that based on the order of the playbook, the stat task couldn’t be started until after the creation task completed. If this playbook is causing a race condition, this may no longer be accurate.
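
For reference, that per-task lockstep is Ansible’s default “linear” strategy, which can be made explicit in the play (a sketch, not from the original playbook):

```yaml
- hosts: dmgr:solr
  strategy: linear    # default: all hosts finish a task before any host starts the next
  tasks:
    - name: create the file on dmgr
      shell: 'echo "Hello World" > {{ shared_dir }}/testfile'
      when: "'dmgr' in group_names"

    - name: stat the file from both
      stat:
        path: "{{ shared_dir }}/testfile"
```

Under strategy: free, by contrast, each host races ahead through the task list independently.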

> At the same time, this reveals the problem in the code. The condition is
> true for both hosts. (I have no idea why "solr" reports "skipping")
>
>   - hosts: dmgr:solr
>
>   - name: create the file on dmgr
>     shell: 'echo "Hello World" > {{ shared_dir }}/testfile'
>     when: "'dmgr' in group_names"
>
> TASK [create the file on dmgr]
>
> skipping: [SOLR REDACTED]
> changed: [DMGR REDACTED]

I’m guessing that my choice of using SOLR to indicate both host and group is confusing. Given the above inventory file, the skipping line would more accurately read skipping: [host2.example.com], and it skips because “dmgr” is not in group_names for host2.
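
A quick way to see this is a debug task (sketch):

```yaml
- name: show each host's group memberships
  debug:
    var: group_names
```

With the inventory above, host1.example.com reports ["dmgr"] and host2.example.com reports ["solr"], so a condition like when: "'dmgr' in group_names" is false (skipped) on host2.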

Anyway, the solution is to fix the conditions and limit the deletion and
creation of the file to the host from one group only. For example:

  - name: Remove the file from solr
    file:
      path: "{{ shared_dir }}/testfile"
      state: absent
    when: inventory_hostname in groups.solr

  - name: create the file on dmgr
    shell: 'echo "Hello World" > {{ shared_dir }}/testfile'
    when: inventory_hostname in groups.dmgr

As a side-note, it’s possible to use “copy” instead of “shell”:

  - name: create the file on dmgr
    copy:
      content: "Hello World\n"
      dest: "{{ shared_dir }}/testfile"
    when: inventory_hostname in groups.dmgr

HTH,

-vlado

The problem with the solution is that it doesn’t solve the actual problem. This is just a simple playbook that illustrates the issue. In the actual playbook, host1 is running a command that exports a file from a store that only exists on host1. host2 needs to import that file. My solution was to run the command against host1 and export the file to the shared file system, then run the import on host2. The import has the effect of removing the file from the shared file system. I had it all working until I added a “removes” option to the shell task that did the import. Then removes indicated the file wasn’t there and skipped the import. That’s when I opened the case.
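
For illustration, the real workflow sounds roughly like this (the export/import commands are placeholders; the removes option is what skips the import when host2 cannot yet see the file):

```yaml
- name: Export the file from the store on host1
  shell: '/opt/app/export.sh "{{ shared_dir }}/transfer.dat"'   # placeholder command
  when: inventory_hostname in groups.dmgr

- name: Import the file on host2 (the import deletes the file when done)
  shell: '/opt/app/import.sh "{{ shared_dir }}/transfer.dat"'   # placeholder command
  args:
    removes: "{{ shared_dir }}/transfer.dat"   # task is skipped if this path is not visible
  when: inventory_hostname in groups.solr
```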

Short answer:

> I opened an issue <https://github.com/ansible/ansible/issues/67410>

I've updated the issue #67410. I guess this might be a problem with the
permissions of the unprivileged user at the NFS client.

You might want to log in to the NFS client and check how the unprivileged
user can access the testfile. It's weird, however, that this user is able to
remove the file but can't see it!?

A next test might be to "sleep" for a couple of seconds after the testfile is
created.
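
Instead of a fixed sleep, the test could pause or poll until the file becomes visible (sketch; the timeout is arbitrary):

```yaml
- name: give the NFS client a moment to notice the new file
  pause:
    seconds: 5

# or, better, poll for the file instead of guessing a delay:
- name: wait until testfile is visible on this host
  wait_for:
    path: "{{ shared_dir }}/testfile"
    timeout: 30
```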

* One task is missing both from the listing of the playbook and from the
  output. There are only 7 tasks reported OK, not 8 as summarised in PLAY
  RECAP. See #67410.

To clarify this: I've set "gather_facts: false". The result is one task fewer.

  -vlado

Because it has nothing to do with Ansible and everything to do with your OS and the network protocol you are using to reach your NAS. Ansible just asks the OS and reports the answer it gets.

If you are using NFS, the default is async mode (some OS/distros change this), so the server is lying to the client, saying a file is created before it's actually created. This is done to increase throughput. You can set it to sync, but that will probably reduce performance.
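
For reference, the export mode is configured on the NFS server, and client-side attribute caching can also make stat results stale; both knobs are sketched below (the export path is a placeholder, and disabling caching with noac costs performance):

```
# /etc/exports on the NFS server -- "sync" forces writes to hit disk
# before the server acknowledges them (placeholder export path)
/export/shared  *(rw,sync,no_subtree_check)

# reload the export table after editing:
#   exportfs -ra

# /etc/fstab on the NFS client -- "noac" disables attribute caching,
# so stat sees changes made by other clients sooner
server:/export/shared  /mnt/shared  nfs  rw,noac  0 0
```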