2019-Jun-06 devel branch and random hangs using winrm?

Upgraded ansible to

$ ansible --version

ansible 2.9.0.dev0

config file = /Users/tanner/projects/ansible.git/playbooks.git/celadonsystems.com/ansible.cfg

configured module search path = [u’/Users/tanner/.ansible/plugins/modules’, u’/usr/share/ansible/plugins/modules’]

ansible python module location = /Users/tanner/projects/ansible.git/ansible/lib/ansible

executable location = /Users/tanner/projects/ansible.git/ansible/bin/ansible

python version = 2.7.10 (default, Feb 22 2019, 21:55:15) [GCC 4.2.1 Compatible Apple LLVM 10.0.1 (clang-1001.0.37.14)]

And now random playbooks and tasks just hang in unusual places.

Before the upgrade of ansible all this playbooks and tasks worked as expected.

`

  • name: Install programs (choco)
    win_chocolatey:
    name: “{{ item }}”
    state: “present”
    when:
  • chocolatey_packages is defined
    loop: “{{ chocolatey_packages|flatten(levels=1) }}”
    `

The above is a task that randomly hangs.

<win2016.LAB.testing.com> ESTABLISH WINRM CONNECTION FOR USER: Administrator@LAB.testing.com on PORT 5986 TO win2016.LAB.testing.com Using module file /Users/tanner/projects/ansible.git/ansible/lib/ansible/modules/windows/win_chocolatey.ps1 Pipelining is enabled.

I’ve let this sit for 12 hours.

No timeouts. No errors.

$ python -V

Python 2.7.10

Management host: macOS Mojave 10.14.5

Managed host: Windows Server 2016

$ kinit --version

kinit (Heimdal 1.5.1apple1)

Copyright 1995-2011 Kungliga Tekniska Högskolan

Send bug-reports to heimdal-bugs@h5l.org

This feels like a personal problem or a python-2.7 problem (still in progress of upgrading to python3)

I didn’t open an issue because I’d like to learn how to report more(?) better(?) info regarding this issue.

Any help would be appreciated.

Thanks.

I do notice when things hang the python process goes into a zombie state.

501 87898 86460 0 9:37PM ttys000 0:00.00 (python) 2006 0 0 0 0 - Z
501 87900 86460 0 9:37PM ttys000 0:00.00 (python) 2006 0 0 0 0 - Z
501 87909 86460 0 9:38PM ttys000 0:00.00 (python) 2006 0 0 0 0 - Z

If everything worked, what was the reason for upgrading to an unreleased development version?

This isn’t a production environment and I am testing the devel branch by design. Helping with making ansible better?

Hi Bob,

Does it just hang on chocolatey or on other things too?

I think there’s been a recent change which may have broken the ansible<->chocolatey integration although iirc its been fixed by latest chocolatey.

Anything in the event logs on the target machine(s)? I don’t use choco much but is it logging and if so is there anything in the log file for it?

Hope this helps,

Jon

Hi Bob,

Does it just hang on chocolatey or on other things too?

I think there’s been a recent change which may have broken the ansible<->chocolatey integration although iirc its been fixed by latest chocolatey.

Anything in the event logs on the target machine(s)? I don’t use choco much but is it logging and if so is there anything in the log file for it?

Hope this helps,

Predominately win_chocolatey tasks. And, yes, there was a problem with ansible<->chocolatey but 0.15.0 fixed the problem.

Infrequently win_update, which I have always attributed to windows update services being buggy, native patching has problems for me sometimes :frowning:

I’ll be testing more this week.

There’s hasn’t been any changes to the winrm or exec process for the devel branch that isn’t in 2.8 and I haven’t seen any big issues where it is hanging. Some things to check;

  • Do you see the module process actually running on the remote host

  • When it hangs, is the process still running, any left over processes that isn’t powershell.exe

  • Is it easily replicable, or does it only happen randomly, e.g. sometimes a task works but other times it doesn’t

  • I see you are using a MacOS host with Kerberos auth, there is a problem using the builtin kinit process on MacOS which causes a hang. Try using a self managed Kerberos ticket to rule out this problem

The MacOS Kerberos kinit issue is fixed by making sure pexpect is installed on the Python environment for the controller but it’s best to just use your own Kerb ticket to rule it out.

Thanks

Jordan

I am using MacOS with Kerb auth and I do not understand “self managed Kerberos tickets”

On my macOS controller I pretty much follow https://web.mit.edu/Kerberos/www/krb5-devel/doc/user/tkt_mgmt.html

I’ll install pexpect and see if things work better.

I’ll follow up on the other questions after working things with pexpect installed.

  • Do you see the module process actually running on the remote host

I do not see the module process running on the remote host. System Idle Process and Task Manager are the only things taking CPU.

  • When it hangs, is the process still running, any left over processes that isn’t powershell.exe

I do see powershell.exe and choco.exe in the process list.

  • Is it easily replicable, or does it only happen randomly, e.g. sometimes a task works but other times it doesn’t

Easily replicable with this ansible tasks

  • name: update installed chocolatey packages
    win_chocolatey:
    name: all
    state: latest
    tags: win_chocolatey
  • I see you are using a MacOS host with Kerberos auth, there is a problem using the builtin kinit process on MacOS which causes a hang. Try using a self managed Kerberos ticket to rule out this problem

The MacOS Kerberos kinit issue is fixed by making sure pexpect is installed on the Python environment for the controller but it’s best to just use your own Kerb ticket to rule it out.

I installed pexpect and no change in behavior. The “update installed chocolatey packages” task hangs.

Doing a dtruss on the python process on the controller I just get screens of the following.

$ sudo dtruss -p 64569
select(0x0, 0x0, 0x0, 0x0, 0x7FFEE3C1D610) = 0 0
select(0x0, 0x0, 0x0, 0x0, 0x7FFEE3C1D610) = 0 0
select(0x0, 0x0, 0x0, 0x0, 0x7FFEE3C1D610) = 0 0
select(0x0, 0x0, 0x0, 0x0, 0x7FFEE3C1D610) = 0 0
select(0x0, 0x0, 0x0, 0x0, 0x7FFEE3C1D610) = 0 0
select(0x0, 0x0, 0x0, 0x0, 0x7FFEE3C1D610) = 0 0
select(0x0, 0x0, 0x0, 0x0, 0x7FFEE3C1D610) = 0 0
select(0x0, 0x0, 0x0, 0x0, 0x7FFEE3C1D610) = 0 0
select(0x0, 0x0, 0x0, 0x0, 0x7FFEE3C1D610) = 0 0
select(0x0, 0x0, 0x0, 0x0, 0x7FFEE3C1D610) = 0 0
select(0x0, 0x0, 0x0, 0x0, 0x7FFEE3C1D610) = 0 0
select(0x0, 0x0, 0x0, 0x0, 0x7FFEE3C1D610) = 0 0
select(0x0, 0x0, 0x0, 0x0, 0x7FFEE3C1D610) = 0 0
select(0x0, 0x0, 0x0, 0x0, 0x7FFEE3C1D610) = 0 0
select(0x0, 0x0, 0x0, 0x0, 0x7FFEE3C1D610) = 0 0
select(0x0, 0x0, 0x0, 0x0, 0x7FFEE3C1D610) = 0 0
select(0x0, 0x0, 0x0, 0x0, 0x7FFEE3C1D610) = 0 0
select(0x0, 0x0, 0x0, 0x0, 0x7FFEE3C1D610) = 0 0
select(0x0, 0x0, 0x0, 0x0, 0x7FFEE3C1D610) = 0 0
select(0x0, 0x0, 0x0, 0x0, 0x7FFEE3C1D610) = 0 0
select(0x0, 0x0, 0x0, 0x0, 0x7FFEE3C1D610) = 0 0
select(0x0, 0x0, 0x0, 0x0, 0x7FFEE3C1D610) = 0 0
select(0x0, 0x0, 0x0, 0x0, 0x7FFEE3C1D610) = 0 0
select(0x0, 0x0, 0x0, 0x0, 0x7FFEE3C1D610) = 0 0
select(0x0, 0x0, 0x0, 0x0, 0x7FFEE3C1D610) = 0 0
select(0x0, 0x0, 0x0, 0x0, 0x7FFEE3C1D610) = 0 0
select(0x0, 0x0, 0x0, 0x0, 0x7FFEE3C1D610) = 0 0
select(0x0, 0x0, 0x0, 0x0, 0x7FFEE3C1D610) = 0 0
select(0x0, 0x0, 0x0, 0x0, 0x7FFEE3C1D610) = 0 0
select(0x0, 0x0, 0x0, 0x0, 0x7FFEE3C1D610) = 0 0

The self managed kerberos ticket is just you running kinit yourself and disabling the auto kinit process that Ansible runs if you supply the password. Because you actually see the process running on the Windows host then that rules out that problem and you can ignore it.

If it is hanging on choco upgrade all and you still see choco.exe running in the background then there’s not much we can do. Chocolatey seems to be running something and not exiting so win_chocolatey won’t exit for you. You will have to look at Chocolatey’s logs, or even run the upgrade all process yourself manually and see if it’s prompting for anything.

Thanks

Jordan

I am doing the kinit myself.

I did install pexpect but didn’t seem to make a difference, still hung. This time on removal of the chocolatey packages hung, task below:

  • name: remove old/deprecated packages (choco)
    win_chocolatey:
    name: “{{ chocolatey_remove_packages|list }}”
    state: “absent”
    when:
  • chocolatey_remove_packages is defined

Doing a “cup all” there are packages that need to be updated and the update process finishes as expected with 0

Chocolatey upgraded 9/40 packages.
See the log for details (C:\ProgramData\chocolatey\logs\chocolatey.log).

Upgraded:

  • googlechrome v75.0.3770.80
  • winscp v5.15.2
  • flashplayeractivex v32.0.0.207
  • git v2.22.0
  • winscp.install v5.15.2
  • cleanup v4.5.2.20190522
  • vscode v1.35.0
  • flashplayerplugin v32.0.0.207
  • git.install v2.22.0
    PS C:\Users\Administrator.CORP> $LASTEXITCODE
    0

Mostly a linux and MacOS guy so I find the Windows Event Viewer super convoluted.

Where would I look for errors?

I see Information > Ansible stuff but nothing in Critical, Error, or Warning related to Ansible.

I’ll also jump over to the chocolatey group but I think they will say the problem only occurs with win_chocolatey it’s an Ansible issue not a chocolatey issue.

It’s a matter of narrowing down there the hang happens, the best way to do this is to narrow down is to find a single task that is repeatable and will always result in the hang. When you do this you can then debug further to narrow down where exactly the hang occurs, this can be done by;

  • Running the command manually to see if it works properly

  • Checking the logs that relate to the action, this could be Windows Event logs for Windows stuff but it could also just be log files for the program you are calling

  • Use something like win_command/win_shell to run the same step manually and see if the problem exists in the module or in the calling program + possibly WinRM
    If you can’t reliably replicate the hang then you need to do some more work in trying to find common scenarios that cause it to happen. Common scenarios you can try and identify are;

  • Whether this happens when calling a certain program, e.g. chocolatey

  • Using Chocolatey as an example, if it only happens when using it look at the Chocolatey logs on the system, IIRC this is at C:\ProgramData\Chocolatey

  • Going further, I would then try win_command/win_shell to run the same task and see if it hangs

  • I would also run the same command locally (outside of WinRM) and see if it hangs/prompts something

  • Whether this happens only on a certain host, or group of hosts

  • If a group of hosts then do they share anything in common, e.g. PS version, Windows version and so on- If it just happens random for any task, e.g. for win_file, win_stat, any other module then that is good info to know

  • This would indicate a potential problem with Ansible’s exec model but not necessarily- Try out older Ansible versions, e.g. 2.6, 2.7, or 2.8 to see if the problem is also there

  • If it only started appearing after a set date then look at what changed then

  • This could be things like a Windows update

  • A program update

  • Python package change

  • Ansible update

  • GPO policy change

  • So much more

Identifying hangs aren’t easy but without knowing the full story is next to impossible. From what you are saying the problem looks like it is happening on the execution side and unrelated to the Ansible controller so debugging the Python process probably won’t give you any good information.

Thanks

Jordan

Ansible did finally throw an error, this time on a Windows7 systems

TASK [win_chocolatey : update installed chocolatey packages] *******************
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: ConnectionError: HTTPSConnectionPool(host='win7.LAB.testing.com', port=5986): Max retries exceeded with url: /wsman (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x111607d90>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known',))
fatal: [win7.LAB.testing.com]: FAILED! => {"msg": "Unexpected failure during module execution.", "stdout": “"}

Some time stats

real 65m28.845s
user 3m14.159s
sys 0m47.496s

So how does the win_chocolatey module interactive with chocolatey packages what use something like AutoIt for post-install configuration?

(Linux guy here) so does Windows have something like a controlling tty?

Doing some debugging and I think the hang is on upgrading a package called CleanUp! (which I do maintain) <https://chocolatey.org/packages/cleanup> and the chocolateyintall.ps1 uses AutoIt

$ErrorActionPreference = ‘Stop’;

$packageName = ‘CleanUp’
$version = ‘452’
$fullPackage = $packageName + $version + ‘.exe’
$url = ‘http://stevengould.org/downloads/cleanup/’ + $fullPackage
$checkSum = ‘6ac7576c0b48ddee292f85724c7917e11360927ff8e5c5a8d795577a4241c131’
$workSpace = Join-Path $env:TEMP “$packageName-$env:chocolateyPackageVersion”

silent install requires AutoIT

IF remove cancel out (control-c) the Ansible process on the management host and I do a “choco list -lo” on the managed host I get a warning:

“[Pending] Removing incomplete install for ‘cleanup’”

That would seem to hint that the chocolatey was processing the CleanUp! package?

Any recommendations on what to try/do next?

Took CleanUp! out of the package installation and upgrade tasks and now all win_chocolatey tasks complete.

Another chocolatey package I maintain EasyLog USB <https://chocolatey.org/packages/easylogusb&gt;, it also uses AutoIt for installation/updates <https://github.com/basictheprogram/chocolatey-EasyLogUSB/blob/master/tools/chocolateyinstall.ps1&gt; and the win_chocolatey update task hangs on this package too.

So it looks like win_chocolatey modules has problem interacting with AutoIt?

Tested against Windows7, Windows10, Windows Server 2008 R2, Windows Server 2016 and all releases hang on the upgrade of CleanUp!

Looking like the problem with win_chocolatey and AutoIt.

AutoIt can get really fun when dealing with non-interactive sessions that WinRM is run on. The behaviour around console windows in session 0 is not the same as a normal interactive session you get when running it manually. If AutoIt is waiting for specific console state or is trying to click a button in the wrong area it won’t work as expected. So effectively the problem isn’t Chocolatey, Ansible, or win_chocolatey, it’s using AutoIt in a non-interactive session that is a byproduct of WinRM. I don’t know enough about AutoIt to give you tips on your scripts but ultimately you want to try and have some sort of timeout that will fail if it exceeds a certain time. This won’t help with getting it working but it will at least tell people what the issue may be instead of just hanging.

Thanks

Jordan