I am running Windows modules that disrupt the network connection, for instance by installing a network driver or creating a Network Team. The IP address doesn’t change, and the network connection is only out for a few moments. But when these run, my Ansible playbook basically freezes: it just sits there running the task until Ansible times out and the playbook fails. My colleagues tell me Linux handles this gracefully, reconnecting and continuing when the connection is back up. Any idea how I can get this behavior with Windows?
SSH seems to be very tolerant of momentary connection losses, so long as the connection isn’t actually “refused”.
WinRM under the covers is a very different beast (HTTP-based, logical connection instead of a single fixed TCP connection). It might be possible to retry certain parts of the WinRM exchange, but in general it’s not safe to blanket retry requests (eg, you don’t want to accidentally run something twice). The problem case is where a connectivity change like that happens before we receive the HTTP response from the Command/Send actions (retrying Receive would probably be OK).
The “right” way to deal with this would probably be to use async, but that didn’t make it in for Windows for 2.1 (should be in 2.2). Async should be tolerant of most kinds of dodgy/unstable connections…
Depending on what you’re trying to do, doing it as a scheduled task/script might make sense in the interim (eg, see http://docs.ansible.com/ansible/win_scheduled_task_module.html)
Hi Matt,
You mention that async support is going to be in 2.2. Is there any workaround for this problem other than the win_scheduled_task module?
For example, can we use polling/pinging to see whether the connection is back up?
I have tried several methods, but none seem to work. http://pastebin.com/PS82PnBF is what I came up with, but even this freezes after installation.
Appreciate any help/advice.
The module subsystem alone is not resilient (and pretty much cannot safely be made resilient) to modules that interrupt the network connection.
That said, all the bits and pieces are there to do what you need if you’re doing custom work, but you’d have to string them together yourself to make an action/module pair that can be resilient to changes that interrupt the network connection. Take a look at the way the win_reboot action works; you can follow a similar pattern yourself: write a custom action plugin (and stuff it in action_plugins/ next to your playbook). The action plugin can exec the module, catch the network failure, poke at the box until it responds again, then ensure the changes were made correctly. There are several different approaches to doing this that work, depending on how exactly correct you want to be about races, failures that look like successes, etc, but the naive “happy path” case is very simple to implement.
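As a rough illustration of the pattern described above (run the disruptive step, treat a dropped connection as expected, poke at the box until it answers, then check the result), here is a minimal Python sketch. It is not the win_reboot implementation and does not use Ansible's action plugin API. The names `run_resilient`, `run_step`, `is_up`, `verify`, and `port_open` are hypothetical. Probing TCP port 5986 (the default WinRM HTTPS port) is only an assumption about how "responds again" might be detected.

```python
import socket
import time


def port_open(host, port=5986, timeout=2.0):
    """Hypothetical probe: True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def run_resilient(run_step, is_up, verify, wait_timeout=300.0, interval=5.0,
                  sleep=time.sleep, clock=time.monotonic):
    """Run a step that may drop the connection, wait for the host, then verify.

    run_step: callable that triggers the change; may raise on connection loss.
    is_up:    callable returning True once the host answers again.
    verify:   callable returning True if the change actually took effect.
    """
    interrupted = False
    try:
        run_step()
    except OSError:
        # Assumption: a dropped connection surfaces as OSError or a subclass
        # (ConnectionError, TimeoutError, ...). A real plugin would catch
        # whatever its connection layer raises.
        interrupted = True

    deadline = clock() + wait_timeout
    while not is_up():
        if clock() >= deadline:
            raise TimeoutError("host did not come back within %.0fs" % wait_timeout)
        sleep(interval)

    # A failure can look like a success, so always check the end state.
    return {"verified": verify(), "interrupted": interrupted}
```

A caller would pass something like `lambda: port_open("winhost")` as `is_up`. This covers only the naive happy path; it does not address the case where the connection drops before the host has even accepted the command.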
Or you could just wait for async.
Hi Matt,
Thank you for the advice, appreciate it!
I tried doing it the ‘cleaner’ way, using logic similar to win_reboot.py; however, after some initial testing, Ansible still seems to freeze on the connection.
Could you please take a look at my plugin and let me know where I’m going wrong, or whether I’m missing some small detail?
It basically freezes after the install driver command is sent to the Windows host (using pnputil).
Thank you!
Without being able to reproduce what it’s actually doing on my end, I suspect it’s blocking on the WinRM Receive (you could verify that by inserting Fiddler or another proxy in the middle). That should time out eventually when no output comes back within the read timeout window. How long have you waited? (You could also try setting ansible_winrm_read_timeout_sec to a nice low number to make it come back faster.)
Another way you might be able to handle this (as kind of a poor-man’s async) would be to spawn the command in a new process via exec_command, and include a delay to prevent the hang during result fetch/disconnect, like:
start-process -nonewwindow powershell.exe "-command sleep 2; pnputil.exe -i -a driver/path"
Unless you capture and marshal the results to a file yourself, you wouldn’t be able to detect a failure (this is the heavy-lifting that async does for you), but should get the job done on the happy path.
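To make the "capture and marshal the results to a file yourself" part concrete, here is a small Python sketch that builds such a PowerShell command line. It is an illustration under assumptions, not something taken from Ansible: `build_deferred_command` and the result path are hypothetical, the JSON layout is made up for the example, and `ConvertTo-Json` needs PowerShell 3 or later. It passes the script via `-EncodedCommand` (base64 over UTF-16LE) to sidestep nested quoting.

```python
import base64


def build_deferred_command(command, result_path, delay_sec=2):
    """Return a PowerShell line that runs `command` in a delayed child process.

    The child sleeps, runs the command, then writes the exit code and combined
    output as JSON to result_path (a hypothetical file on the Windows host).
    """
    script = (
        "Start-Sleep -Seconds %d; "
        "$out = & { %s } 2>&1 | Out-String; "
        "@{ rc = $LASTEXITCODE; output = $out } | ConvertTo-Json | "
        "Set-Content -Path '%s'"
    ) % (delay_sec, command, result_path.replace("'", "''"))
    # -EncodedCommand expects base64 of the UTF-16LE script text.
    encoded = base64.b64encode(script.encode("utf-16-le")).decode("ascii")
    return (
        "Start-Process -NoNewWindow powershell.exe "
        "-ArgumentList '-NoProfile','-EncodedCommand','%s'" % encoded
    )


if __name__ == "__main__":
    # 'driver/path' is the placeholder from the thread; the result path is made up.
    print(build_deferred_command("pnputil.exe -i -a driver/path",
                                 r"C:\temp\pnputil_result.json"))
```

A later task could read the result file back once the connection is stable again and fail the play if `rc` is non-zero. This does nothing about WinRM cleaning up child processes when the shell exits, so treat it as happy-path only.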
Thanks for the suggestions, Matt. I’ve tried both approaches; no luck, unfortunately.
From what I understand, Ansible seems to be waiting for the result to get added to the results dictionary, even though I have changed the timeouts in win_reconnect (the plugin I wrote).
I’ve tried running it programmatically and through a playbook as well. Same findings on both.
Here’s a bit of a traceback from when I manually stop the run (ctrl + c), if it helps.
I’ve put the code over here (it is very underdeveloped as of now); you might have to modify certain things to make it work if you want to reproduce in your own environment.
Thanks for the time you’ve taken in helping me figure this out!
The closest I’ve been able to come to approximating what you’re doing is using devmanview /disable_enable to bounce my WinRM connection’s NIC. It definitely hangs on the Receive in that case, as I expected. Regardless, it’s a race, so without deferring the action on the Windows side, it’s possible that the NIC could bounce even before you’ve gotten the response from the WinRM Command POST, much less the actual process results via the next WinRM Receive call.
Unless you want to start working with things at the winrm level (trust me, you probably don’t), the trick is going to be deferring the device bounce until after the WinRM session has completed (where I was going with the sleep before the command in a separate process). This is further complicated by WinRM’s aggressive nuking of child processes once the parent shell has exited, so it’s also possible you’re running into that (eg, WinRM call completes, while Powershell subprocess is still sleeping, WinRM helpfully nukes the process for you before/during the action you want).
You might also want to look into doing this in a scheduled job- that would at least let you escape the constraints of the WinRM environment, though it brings a whole host of other problems, too…
Or just wait for async in 2.2.
Thank you for all the help, Matt. I was finally able to solve this today.
I did it using a scheduled job as you had mentioned, basically deferring the install action on the host to a later time so I don’t lose the connection.
Fun times!
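The thread does not show the actual solution, so the following is only one hypothetical way to defer an action with a scheduled task: a Python helper that builds a `schtasks.exe /Create` line to run a command once, a couple of minutes in the future. The function name, task name, and delay are made up for illustration. It assumes `now` is the target host's local time, that the command contains no double quotes, and that the start time does not roll past midnight (no `/SD` date is set because its format is locale-dependent).

```python
from datetime import datetime, timedelta


def build_schtasks_command(task_name, command, now, delay_minutes=2):
    """Return a schtasks.exe line that runs `command` once, shortly after `now`.

    `now` should be the target host's local time; /ST takes 24-hour HH:mm.
    Assumes `command` has no embedded double quotes.
    """
    start = (now + timedelta(minutes=delay_minutes)).strftime("%H:%M")
    return (
        'schtasks.exe /Create /F /TN "{name}" /TR "{cmd}" '
        "/SC ONCE /ST {start} /RU SYSTEM"
    ).format(name=task_name, cmd=command, start=start)


if __name__ == "__main__":
    # 'driver/path' is the placeholder used earlier in the thread.
    print(build_schtasks_command("InstallDriver",
                                 "pnputil.exe -i -a driver/path",
                                 datetime.now()))
```

Because the task runs outside the WinRM session, the connection bounce happens after the play has already moved on; a later task still has to wait for the host and confirm the driver was installed.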
Great. Sorry we had to go so far afield there, but I’ll make sure I try this test on the Windows async stuff so you can hopefully throw it away soon.