XML "not well-formed" error with powershell, different behavior in verbose mode

Hi!

I am having an issue with ansible interacting with powershell on a windows-machine. The Problem disappears when using verbose mode, which makes it a bit difficult to debug.

I am using ansible 2.18.1 on a linux machine, connecting via ssh to a windows machine running OpenSSH-Server. The necessary registry-keys are configured so powershell is used as the default shell. Connecting to the target using ssh -i path/to/keyfile user@target works fine and I am able to type and execute powershell commands.

I have played around with ansible-playbooks and ad-hoc-commands as well and at first they were working fine, until a few weeks ago (I can’t think of anything relevant I have changed since), when my playbooks as well as my ad-hoc-commands are failing with the following message:

$ ansible target -m win_ping
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 260
target | FAILED! => {
    "msg": "Unexpected failure during module execution: not well-formed (invalid token): line 1, column 260",
    "stdout": ""
}

This was the output of ansible target -m win_ping (with username and ssh-keyfile being configured in my inventory file), but it is the same issue when I execute -m win_shell -a "whoami", a playbook using the ansible.windows.win_updates-module or anything else.

The error message advices me to use the -vvv-option, and when I execute the same command in verbose mode (starting with -v), it works absolutely fine.

$ ansible -v target -m win_ping
No config file found; using defaults
target | SUCCESS => {
    "changed": false,
    "ping": "pong"
}

When using the ENV-Variable ANSIBLE_DEBUG=true with the command, but without -v, the command still fails.

I have tried this with 3 different Windows-machines as targets, issue is always identical.

I would be happy to see what line 1, column 260 contains, but I have no idea how to debug it. Running all my commands with -v all the time wouldn’t be my prefered solution either…

It sounds like there’s some debug message being emitted by SSH that is breaking the parsing of the CLIXML string. We changed how SSH executes modules for Windows in 2.18 to reduce the number of sub-processes it called and hopefully simplify the CLIXML output but looks like there may potentially be a bug here that we need to fix.

There are two things you can do here

The latter will be the easiest option here as it’ll show you the full stderr from the result without you having to parse it through the chunk output and other debug data. If you can get that info it would be great to see what it looks like.

If you run with -vvv does it show you the SSH command that was run? If so what does that look like? When you set the remote shell, how did you configure that in the registry?

Thank you for your detailed answer.

I inserted the line in ssh.py, afterwards I get the following output:

$ ansible target -m win_ping   
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x81 in position 271: invalid start byte
target | FAILED! => {
    "msg": "Unexpected failure during module execution: 'utf-8' codec can't decode byte 0x81 in position 271: invalid start byte",
    "stdout": ""
}

→ Position 271 is different from 260

to get around the decode-error, I used print(stderr) instead, so here we go:

b'#< CLIXML\r\n<Objs Version="1.1.0.1" xmlns="http://schemas.microsoft.com/powershell/2004/04"><Obj S="progress" RefId="0"><TN RefId="0"><T>System.Management.Automation.PSCustomObject</T><T>System.Object</T></TN><MS><I64 N="SourceId">1</I64><PR N="Record"><AV>Module werden f\x81r erstmalige Verwendung vorbereitet.</AV><AI>0</AI><Nil /><PI>-1</PI><PC>-1</PC><T>Completed</T><SR>-1</SR><SD> </SD></PR></MS></Obj><Obj S="progress" RefId="1"><TNRef RefId="0" /><MS><I64 N="SourceId">2</I64><PR N="Record"><AV>Module werden f\x81r erstmalige Verwendung vorbereitet.</AV><AI>0</AI><Nil /><PI>-1</PI><PC>-1</PC><T>Completed</T><SR>-1</SR><SD> </SD></PR></MS></Obj></Objs>'

The target system is using german localization, so \x81 is an ü. When subtracting the number of the characters #< CLIXML\r\n at the beginning, character 260 is exactly this ü, so I guess it all breaks down to an encoding problem. The sentence translates to Preparing modules for first use, but as my powershell knowledge is rare, I don’t really get the context for that message.

If you run with -vvv does it show you the SSH command that was run?

I can see the ssh-command, execution part is PowerShell -NoProfile -NonInteractive -ExecutionPolicy Unrestricted -EncodedCommand <long Base64-string>, where latter decodes to

try { [Console]::InputEncoding = [Console]::OutputEncoding = New-Object System.Text.UTF8Encoding } catch { $null = $_ }
if ($PSVersionTable.PSVersion -lt [Version]"3.0") {
'{"failed":true,"msg":"Ansible requires PowerShell v3.0 or newer"}'
exit 1
}
$exec_wrapper_str = $input | Out-String
$split_parts = $exec_wrapper_str.Split(@("`0`0`0`0"), 2, [StringSplitOptions]::RemoveEmptyEntries)
If (-not $split_parts.Length -eq 2) { throw "invalid payload" }
Set-Variable -Name json_raw -Value $split_parts[1]
& ([ScriptBlock]::Create($split_parts[0]))

but i still don’t get where the actual command is executed.

When you set the remote shell, how did you configure that in the registry?

I did that according to the official documentation, see here

Hope this helps, let me know if I can assist in this further.

Thanks for the detailed reply, you are right it is the Python element tree module failing to decode ü at position 260 that is causing the problem. We are expecting the module output to be in UTF-8 which worked before 2.18 but since 2.18 we removed the intermediate wrapper process which set the codepage before running the sub-process so anything before our code that outputs like the CLIXML message here will not be in UTF-8 but the machine’s default encoding.

I’ll have to think about the way to solve this problem a bit, we don’t want to go back to having the second process as that slows down the execution, especial on slow hosts but we do need to handle such situations.

Thanks once again for sharing this info. I’ve opened ssh - Improve CLIXML stderr parsing by jborean93 · Pull Request #84569 · ansible/ansible · GitHub which should solve the problem for you and will backport it to 2.18 once merged.