Ansible hangs on package installation for hours before connection to host is lost

Hi,
I’m new to ansible and I’m facing an issue with executing our playbook.

Setup-

  • For 10 hosts on azure

  • all hosts are identical and newly created

  • ansible.cfg-
    [defaults]
    host_key_checking = False

  • portion of playbook that is installing packages

    - name: Set Initscripts to be Installed
      set_fact:
        packages_to_install: "{{ packages_to_install | default() + [ 'initscripts' ] }}"
      tags:
        - common
      when: ansible_distribution_major_version is version_compare('7', '<')

    - name: Set Java to be Installed
      set_fact:
        packages_to_install: "{{ packages_to_install | default() + [ jboss_java_pkg_name ] }}"
      when: install_java|bool
      tags:
        - common

    - name: Set Unzip to be Installed
      set_fact:
        packages_to_install: "{{ packages_to_install | default() + [ 'unzip' ] }}"
      tags:
        - common

    - name: Install packages
      package:
        name: "{{ packages_to_install }}"
        state: present
      register: packages_result
      retries: 5
      until: packages_result is success
At the package installation step, Ansible freezes without any log output for about 2 hours and then finally throws an SSH connection error saying that the host is unreachable.

Logs of the run

  • 9/8/2020 8:18:08 PM : "download_only": false,
  • 9/8/2020 8:18:20 PM : 17403 1599576500.81986: _low_level_execute_command(): executing: /bin/sh -c 'rm -f -r /home/wbuser/.ansible/tmp/ansible-tmp-1599576402.5-17403-228811267706642/ > /dev/null 2>&1 && sleep 0'
  • 9/8/2020 8:18:20 PM : >>><<<
  • 9/8/2020 8:18:20 PM : 17403 1599576500.86101: attempt loop complete, returning result
  • 9/8/2020 8:18:20 PM : 17403 1599576500.86111: dumping result to json
  • 9/8/2020 8:18:20 PM : "invocation": {
  • 9/8/2020 8:18:20 PM : "update_cache": false,
  • 9/8/2020 8:18:20 PM : "unzip-6.0-21.el7.x86_64 providing unzip is already installed",
  • 9/8/2020 10:28:42 PM : 17410 1599584322.50669: stderr chunk (state=3):
  • 9/8/2020 10:28:42 PM : >>>Shared connection to 52.225.188.39 closed.
  • 9/8/2020 10:28:42 PM : <<<
  • 9/8/2020 10:28:42 PM : 17410 1599584322.50769: stderr chunk (state=3):
  • 9/8/2020 10:28:42 PM : >>><<<
  • 9/8/2020 10:28:42 PM : 17410 1599584322.50780: stdout chunk (state=3):
  • 9/8/2020 10:28:42 PM : >>><<<
  • 9/8/2020 10:28:42 PM : <52.225.188.39> (255, '', 'Shared connection to 52.225.188.39 closed.\r\n')
  • 9/8/2020 10:28:42 PM : 17410 1599584322.50832: _low_level_execute_command(): starting
  • 9/8/2020 10:28:42 PM : 17410 1599584322.50843: _low_level_execute_command(): executing: /bin/sh -c 'rm -f -r /home/wbuser/.ansible/tmp/ansible-tmp-1599576402.86-17410-193336400580842/ > /dev/null 2>&1 && sleep 0'
  • 9/8/2020 10:28:42 PM : <52.225.188.39> ESTABLISH SSH CONNECTION FOR USER: wbuser
  • 9/8/2020 10:28:42 PM : <52.225.188.39> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o Port=33000 -o 'IdentityFile="/tmp/AnsibleFiles/id_rsa"' -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="wbuser"' -o ConnectTimeout=10 -o StrictHostKeyChecking=no -o ControlPath=/root/.ansible/cp/cccd982d88 52.225.188.39 '/bin/sh -c '"'"'rm -f -r /home/wbuser/.ansible/tmp/ansible-tmp-1599576402.86-17410-193336400580842/ > /dev/null 2>&1 && sleep 0'"'"''
  • 9/8/2020 10:28:43 PM : 17410 1599584323.14653: stderr chunk (state=2):
  • 9/8/2020 10:28:43 PM : >>><<<
  • 9/8/2020 10:28:43 PM : 17410 1599584323.14667: stdout chunk (state=2):
  • 9/8/2020 10:28:43 PM : >>><<<
  • 9/8/2020 10:28:43 PM : <52.225.188.39> (0, '', '')
  • 9/8/2020 10:28:43 PM : 17410 1599584323.14703: _low_level_execute_command() done: rc=0, stdout=, stderr=
  • 9/8/2020 10:28:43 PM : 17410 1599584323.14742: _execute() done
  • 9/8/2020 10:28:43 PM : 17410 1599584323.14745: dumping result to json
  • 9/8/2020 10:28:43 PM : 17410 1599584323.14750: done dumping result, returning
  • 9/8/2020 10:28:43 PM : 17410 1599584323.14777: done running TaskExecutor() for JbossEAP-26/TASK: jboss_common_role : Install packages [000d3ae7-f205-69c8-6cc1-0000000000c9]
  • 9/8/2020 10:28:43 PM : 17410 1599584323.14799: sending task result for task 000d3ae7-f205-69c8-6cc1-0000000000c9
  • 9/8/2020 10:28:43 PM : 17410 1599584323.15331: done sending task result for task 000d3ae7-f205-69c8-6cc1-0000000000c9
  • 9/8/2020 10:28:43 PM : 17410 1599584323.15339: WORKER PROCESS EXITING
  • 9/8/2020 10:28:43 PM : fatal: [JbossEAP-26]: UNREACHABLE! => {
  • 9/8/2020 10:28:43 PM : "changed": false,
  • 9/8/2020 10:28:43 PM : "msg": "Failed to connect to the host via ssh: Shared connection to 52.225.188.39 closed.",
  • 9/8/2020 10:28:43 PM : "unreachable": true
  • 9/8/2020 10:28:43 PM : }

Can anyone help in understanding why this would be happening and what measures I can take to prevent this?

There is really no point in retrying package installation. Better to look at the output when it fails first.

Also, your until: condition compares a dict with a variable named success, so it will never return a true value.

Regards
         Racke

IIRC this should not work - it will get into an endless loop.

This may or may not be your problem.
It occurs elsewhere below as well.

This playbook works with fewer hosts. As the number of hosts increases, the probability of the error increases. A few runs with fewer hosts have completed successfully, though I don't have logs for them.
I'll take your suggestions and try without until and retries. But if the problem were the dictionary check, it would have failed for all the hosts. Instead it failed for only 1 host and completed for the other 9.

I can attach the complete log file if that helps

There are multiple ways to improve performance. I have noticed several times that the Ansible yum module slows things down.

Try using pipelining or Mitogen.

Please refer to the URL below:

https://www.toptechskills.com/ansible-tutorials-courses/speed-up-ansible-playbooks-pipelining-mitogen/#
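
As a rough illustration of the pipelining suggestion, SSH pipelining can be switched on per play through a connection variable. This is only a sketch: the group name jboss_hosts is hypothetical, the role name is taken from the log above, and Mitogen is not shown because it needs a separate installation.

```yaml
# Sketch only: "jboss_hosts" is a hypothetical inventory group.
- hosts: jboss_hosts
  vars:
    # Connection variable read by the ssh connection plugin.
    # Reduces the number of SSH operations per task.
    ansible_ssh_pipelining: true
  roles:
    - jboss_common_role
```

If the play uses become with sudo, pipelining generally requires that requiretty is not enforced in sudoers on the managed hosts. Pipelining speeds up task execution, but it is not by itself known to fix the hang described in this thread.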

Yes, that is correct: the until condition is valid, because success is a built-in test in Ansible.

But at any rate, when the package install fails, that's a problem with the host (e.g. lack of resources,
installation problems) which needs to be addressed instead of retrying the installation.

Regards
         Racke
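
One way to follow the advice about looking at the output instead of retrying is to drop retries/until, register the result, and print it before failing. A minimal sketch, reusing the variable names from the original playbook; the debug and fail tasks are additions for illustration, not part of the role:

```yaml
- name: Install packages (no retry, so the first failure stays visible)
  package:
    name: "{{ packages_to_install }}"
    state: present
  register: packages_result
  ignore_errors: true   # keep going so the result can be printed

- name: Show what the package module returned
  debug:
    var: packages_result

- name: Stop this host if the installation failed
  fail:
    msg: "Package installation failed, see the registered result above"
  when: packages_result is failed
```

Note that this only helps when the module returns a failure. In the run shown in the logs the host was reported as UNREACHABLE, in which case no module result comes back to inspect.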

I logged in to the VMs while Ansible was stuck, and here are my findings -

  1. I was able to log in from the same container using the same SSH keys.
  2. The packages were installed successfully on the host machines, and yet Ansible never received confirmation from the host machine.

I’m using the redhat-cop/ansible-role-jboss-common repo for setting up the host machine.

Any specific logs that I should look for?

Any reason why this would get stuck in an endless loop?
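
The findings above (packages installed, host still reachable, but no result returned) would be consistent with the SSH session being dropped while it sat idle during the install. That is a guess; nothing in the logs confirms it. Under that assumption, two common mitigations are SSH keepalives and running the task asynchronously so Ansible polls with short connections. A sketch, where the group name jboss_hosts and all timing values are hypothetical:

```yaml
# Sketch only: "jboss_hosts" and the timing values are assumptions.
- hosts: jboss_hosts
  vars:
    # Extra OpenSSH client options: send a keepalive every 30 seconds
    # and give up after 10 unanswered keepalives.
    ansible_ssh_common_args: "-o ServerAliveInterval=30 -o ServerAliveCountMax=10"
  tasks:
    - name: Install packages in the background and poll for the result
      package:
        name: "{{ packages_to_install }}"
        state: present
      async: 1800   # allow up to 30 minutes for the install
      poll: 15      # check on the job every 15 seconds
      register: packages_result
```

With async set, a task that exceeds the limit fails with a timeout instead of hanging silently for hours, which at least makes the problem visible sooner.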

I would try getting help from the authors of the redhat-cop repository first, if only to make sure things can actually work (this is not known).