ssh retry functionality (for flaky network responses)

Hi,

Is there any way, in Ansible 1.9.6, of having Ansible retry a failed connection attempt x times at y intervals? We are seeing flaky network behavior when deploying on AWS, with apparently random ssh failures causing runs to break.

It’s never in the same place twice so it would be nice if we could have Ansible back off and retry before bailing.

If not in 1.9.6 (which we’re stuck on for another few months … OpenShift reasons), how about 2.x?

Jeff

I’ve had moderate success by changing the ControlPersist option in the ssh_args parameter in ansible.cfg:

ssh_args = -o ControlMaster=auto -o ControlPersist=300s

That said, on big playbooks with a couple hundred hosts/tasks, I generally still see at least one or two generic connection failures. As in your case, they happen on different tasks/hosts every time.
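
On the 2.x part of the question: a minimal sketch of what I would try there, assuming the retries option under [ssh_connection] is available in your 2.x release (as far as I know it is not in 1.9.x, so please check the configuration docs for your version). The value 3 is just an illustrative number, and the ssh_args line is the same one as above:

```ini
[ssh_connection]
# Keep the shared control socket alive for 5 minutes between tasks
ssh_args = -o ControlMaster=auto -o ControlPersist=300s
# Assumption: 2.x only. Number of times to retry a failed ssh connection
# before the host is marked unreachable (3 is an arbitrary example value).
retries = 3
```

As far as I know this only sets the retry count, and the delay between attempts is not configurable, so it covers the "x times" half of the request rather than the "y intervals" half.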

A generic retry would be phenomenal. Right now on the OpenShift side, we’ve found success with this config:

# config file for ansible -- http://ansible.com/
# ==============================================
[defaults]
forks = NNN
host_key_checking = False
remote_user = root
roles_path = roles/
gathering = smart
fact_caching = jsonfile
fact_caching_connection = $HOME/ansible/facts
fact_caching_timeout = 600
log_path = $HOME/ansible.log
nocows = 1
callback_whitelist = profile_tasks

[privilege_escalation]
become = True

[ssh_connection]
ssh_args = -o ControlMaster=auto -o ControlPersist=600s
control_path = %(directory)s/%%h-%%r
pipelining = True
timeout = 10

We give the ansible host as much memory as we possibly can (often 64G or so) for very large deployments where we want a lot of parallelism.

Thanks Ryan, I’ll take a look at that parameter.

Jeff

Thanks Jeremy, will compare and contrast with what we have currently.

If it goes away for more than a week, I’ll holler :)

Jeff