We use Ansible to deploy code updates across a small fleet (~8 machines). At least a few times a week, a network hiccup causes the SSH connection to a random EC2 instance to drop, which fails the entire playbook run. Sometimes that leaves us with an incomplete deploy, which is no fun. In almost all cases we can immediately re-launch the playbook and the errant instance is fine the second time around. The interruptions appear to be very short, it's usually only one instance out of the fleet at a time, and there's no rhyme or reason as to which instance is affected.
What kinds of strategies is everyone using to deal with this sort of sporadic SSH failure that causes the whole playbook run to fail prematurely?
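(For concreteness, the kind of thing we're wondering about is whether people just raise the SSH retry/timeout knobs in ansible.cfg – a rough sketch below, with the values picked arbitrarily – or whether there's a better pattern.)

    [defaults]
    # connection timeout in seconds (default is 10)
    timeout = 30

    [ssh_connection]
    # retry the SSH connection a few times before marking the host unreachable
    retries = 5
    ssh_args = -o ControlMaster=auto -o ControlPersist=60s -o ConnectTimeout=15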
Hmm…
We run a lot of EC2 plays from our integration tests – originating outside EC2 and running almost constantly – and don't really see this. Some other providers, yes. I'd be curious whether others do.
You can definitely consider running the Ansible control machine inside EC2, where connections will be more reliable (and also faster), which is something I usually recommend to folks.
Another thing: when spinning up new instances using the "wait_for" trick, be sure to put a sleep in after the wait_for. The SSH port can come up before sshd is quite ready, which gives the appearance of an SSH failure. I'm wondering if that might be part of it, or if you're seeing connection issues at effectively random points rather than just there.
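Something like the following, as a rough sketch (assuming the provisioning task's result was registered as "ec2"):

    - name: wait for SSH to start answering on the new instances
      local_action: wait_for host={{ item.public_dns_name }} port=22 delay=10 timeout=320 state=started
      with_items: ec2.instances

    # the port being open doesn't always mean sshd is fully ready, so give it a little slack
    - name: sleep a bit after the port opens
      pause: seconds=15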
> You can definitely consider running the Ansible control machine *inside* EC2, where connections will be more reliable (and also faster), which is something I usually recommend to folks.
We run an Ansible Tower instance in EC2 that runs these tasks, and that's where we are seeing the issues. We've tried running the playbooks from a few different hosts in EC2, but we always eventually hit the sporadic SSH failure, and re-running the playbook works.
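(When that happens we can at least limit the re-run to the host(s) that failed by pointing --limit at the retry file Ansible writes out on failure – roughly like this, with the playbook name made up and the .retry path depending on retry_files_save_path:)

    ansible-playbook deploy.yml --limit @deploy.retry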
> Another thing: when spinning up new instances using the "wait_for" trick, be sure to put a sleep in after the wait_for. The SSH port can come up before sshd is quite ready, which gives the appearance of an SSH failure. I'm wondering if that might be part of it, or if you're seeing connection issues at effectively random points rather than just there.
While we do use Ansible for provisioning new instances, that's not where we're seeing the issue. It's our playbooks for rolling out code updates: we're just SSH'ing into each (existing) app server, transferring the updated code, and restarting the process. By the time we run these playbooks the instances could be hours, days, or months old, so the port readiness issue is a non-factor.
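To make that concrete, the plays are roughly this shape (the group name, paths, and service name are made up for illustration):

    - hosts: app_servers
      tasks:
        - name: push the updated code to the app server
          synchronize: src=/srv/releases/current/ dest=/opt/app/current/

        - name: restart the app process
          service: name=myapp state=restarted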
Most of the time the EC2 network is fast and reliable, but we deploy frequently and do run into these issues from time to time. This is consistent with the errors we've seen when our app servers are temporarily unable to reach ElastiCache instances. Failure is just one of those things we have to live with and build for in EC2.
Yeah, this is most definitely not a Tower-specific thing, since it's just running Ansible underneath – but it's not something we have been seeing.
I’d say run things periodically and avoid use of the Atlantis or Pompeii availability zones?
us-east-1 definitely falls under this description. We have at least one or two small 10-15 second hiccups each week between app servers and the DB/cache instances, or even between specific ELBs and their child instances. The disruptions are usually over so fast that it's not a big deal, but that's a somewhat different case from a deploy.