ssh retry functionality (for flaky network responses)

Jeff_Richards · July 29, 2016, 4:49pm

Hi,

Is there any way, in Ansible 1.9.6, of having Ansible retry a failed connection attempt x times at y intervals? We are seeing flaky network behavior when deploying on AWS, with apparently random ssh failures causing runs to break.

It’s never in the same place twice so it would be nice if we could have Ansible back off and retry before bailing.

If not in 1.9.6 (which we’re stuck on for another few months … OpenShift reasons), how about 2.x?

Jeff

Ryan_Groten · August 3, 2016, 3:58pm

I’ve had moderate success by raising the ControlPersist value in the ssh_args parameter in ansible.cfg:

ssh_args = -o ControlMaster=auto -o ControlPersist=300s

That said, on big playbooks with a couple hundred hosts/tasks I generally still have at least one or two generic connection failures. Like you, I see it happen on different tasks/hosts every time.
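
For anyone comparing notes, here is a minimal sketch of where these settings sit in ansible.cfg. The ssh_args line is the one quoted above. The retries key is an assumption about Ansible 2.x (as far as I know it is also exposed as the ANSIBLE_SSH_RETRIES environment variable) and may not be available in 1.9.6, so check the docs for your version before relying on it. The value 3 is purely illustrative.

```ini
[ssh_connection]
# Keep the SSH control connection open for 5 minutes so tasks can reuse it
ssh_args = -o ControlMaster=auto -o ControlPersist=300s

# ASSUMPTION: Ansible 2.x setting; number of times to retry a failed SSH connection
retries = 3
```

This only covers retrying the connection itself. It does not retry a task that fails after the connection is up.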

jeder · August 3, 2016, 4:35pm

A generic retry would be phenomenal. Right now on the openshift side, we’ve found success with this config:

# config file for ansible -- http://ansible.com/
# ==============================================
[defaults]
forks = NNN
host_key_checking = False
remote_user = root
roles_path = roles/
gathering = smart
fact_caching = jsonfile
fact_caching_connection = $HOME/ansible/facts
fact_caching_timeout = 600
log_path = $HOME/ansible.log
nocows = 1
callback_whitelist = profile_tasks

[privilege_escalation]
become = True

[ssh_connection]
ssh_args = -o ControlMaster=auto -o ControlPersist=600s
control_path = %(directory)s/%%h-%%r
pipelining = True
timeout = 10

We give the ansible host as much memory as we possibly can (often 64G or so) for very large deployments where we want a lot of parallelism.

Jeff_Richards · August 3, 2016, 7:54pm

Thanks Ryan, I’ll take a look at that parameter.

Jeff

Jeff_Richards · August 3, 2016, 7:55pm

Thanks Jeremy, will compare and contrast with what we have currently.

If it goes away for more than a week, I’ll holler

Jeff
