Ansible hanging

Hi,

I’ve been able to run my playbook using fireball and a custom inventory script (which queries the EC2 API) reasonably well with 5 hosts.

To completely apply a playbook from scratch takes about 160 seconds with paramiko and 65 seconds with fireball.

However, when I scale this out to 100 hosts, the playbook starts off quite quickly and outputs lots of successful tasks, but after about a minute it just grinds to a halt, with no apparent diagnostics.

I’m using ansible devel @ 8099e4ac.

What I have noticed is that specifying the inventory script on the command line, and having it be dynamically evaluated, slows ansible down a lot. To mitigate this, I ran the inventory script once and piped its output into a text file, so that I could avoid hitting the EC2 API on every run.
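Roughly like this (the script name here is made up for illustration; the point is just to snapshot the inventory once):

  $ ./ec2_inventory.py > hosts.txt    # run the dynamic inventory once, capture a static host list
  $ wc -l hosts.txt                   # sanity-check that all hosts made it in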

Then I ran this playbook to get fireballs running on each host:

  - hosts: all
    user: root
    gather_facts: false
    connection: paramiko
    tasks:
      - action: fireball

Running it like this:

$ ansible-playbook -i hosts.txt boot-fireball.yml -f 100 -vvv

took about 5 seconds to complete for 100 hosts.

Then I tried to run my actual playbook via fireball using the same hosts.txt and forks=100, and it kicks off very quickly, successfully executing many tasks.

However, after about a minute, it completely grinds to a halt, without any diagnostics.

Is there anything that I can prod to get it to tell me where it is hanging?

Of note: switching to paramiko appears to work, i.e. it is able to complete the entire playbook against all 100 hosts.

Cheers,

Ben

Hmm, 0mq has had some reported history of being temperamental.

Do you have anything in syslog on any of the hosts that hung, perhaps?

note -- we currently /do/ have a ticket open that exception logging
for fireball needs to be improved.

Hmm, 0mq has had some reported history of being temperamental.

Do you have anything in syslog on any of the hosts that hung, perhaps?

Unfortunately not - I’ve just burned those images. But I will be spinning this up again next week, so I’ll have a look into syslog then.

note -- we currently /do/ have a ticket open that exception logging
for fireball needs to be improved.

OK, good to know - although if the app is blocked on some kind of network state, it is difficult to log anything.

For now, running the playbook with ssh is probably good enough, although I do see the attraction of fireball. For a single run it more than halved the playbook time, so it did achieve some mileage.

Having said this, one naive question occurs to me - why do you need to make multiple network calls to execute a playbook? Would it be impractical to do the whole translation phase on the localhost first (i.e. resolve and bind all variables to tasks and templates), and then send the resulting command stream to the target host?

Or would it be even crazier to ship the ansible code to the target host and execute everything remotely (i.e. ship the code and keep the data local)?

Ansible can decide what to do later based on what went before, and
even relative to what went before on other hosts, in rather trivial
ways. It's quite useful.

If you don't want that, ansible-pull is there if it appeals to you.

As long as you are pushing locally and not doing it all over the
internet, it should be quite reasonable to push :)

0mq being random is likely just a 0mq bug. Folks should be certain
they want to fireball, and not everyone needs to.

Having said this, one naive question occurs to me - why do you need to make
multiple network calls to execute a playbook?

BTW, missed this the first time.

Look into "-c ssh" with ControlPersist if you want to keep connections open.

OK, that looks like it could bear fruit. As I indicated before, just using plain paramiko is fast enough, so this might get us a little bit further without having to debug fireball mode. On balance, this is going to be good enough for our purposes - there is no concrete need to go as fast as fireball goes.

Thanks for the heads up.

Ansible can decide what to do later based on what went before, and
even relative to what went before on other hosts, in rather trivial
ways. It's quite useful.

Good to know. Can you give us some simple examples of this in action, so that we can get an impression of how it works under the covers?

If you don’t want that, ansible-pull is there if it appeals to you.

Having ported my orchestration from Chef, I’d like to stick with the push model that attracted me to ansible in the first place.

My musings about shipping the code to the remote host were inspired by a halfway house between pure push and pure pull. I was wondering whether you could run the orchestration in pure Python on the remote side without installing ansible or any dependencies remotely. But this is probably not very practical, so sorry for bringing it up.

As long as you are pushing locally and not doing it all over the
internet, it should be quite reasonable to push :)

Unfortunately the use case is to configure 100+ nodes on AWS. Potentially I might need to spin up an orchestrator node on AWS from which I run ansible.

0mq being random is likely just a 0mq bug. Folks should be certain
they want to fireball, and not everyone needs to.

As indicated before, ssh is probably fast enough, so fireball is just a nice-to-have.

Good to know. Can you give us some simple examples of this in action, so
that we can get an impression of how it works under the covers?

Look up "register" and "only_if" / "when" in the advanced playbooks
doc for some of this.
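A minimal sketch of the idea (the paths and commands here are made up for illustration): the first task records its result, and the second runs only on hosts where that result says it should.

  - hosts: all
    tasks:
      - name: check whether the app is already deployed
        command: test -e /opt/myapp/current
        register: app_check
        ignore_errors: true

      - name: deploy only where the check failed
        command: /usr/local/bin/deploy_myapp.sh
        when: app_check.rc != 0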

My musings about shipping the code to the remote host were inspired by a
half way house between pure push and pure pull. I was wondering whether you
could run the orchestration in pure Python on the remote side without
installing ansible and any dependencies remotely. But this is probably not
very practical, so sorry for bringing it up.

There's room for this in the future: if we flagged a playbook to run
that way, wrote a wrapping module, had ansible-playbook output pure
JSON, and ansible was also installed on the remote end, it could work.
It's technically possible, just a little involved.

Seth and I originally called this idea the "ansi-ball" (and I don't
think we even need the tarball part of the original theory).

However, for most people something like ControlPersist with -c ssh is
a nicer way to go that achieves similar ends.

use case is to configure 100+ nodes on AWS. Potentially I
might need to spin up an orchestrator node on AWS from which I run ansible.

Sure thing.

Tons of people are doing that. Reaching AWS from the outside is
clearly slower, so running ansible inside is a really good idea, and
you also save bandwidth!

Hi Ben,
I’m in the process of prototyping/assessing a pull implementation with ansible. Your comment caught my eye - would you mind sharing what kinds of issues you ran into with implementing a pull model with Chef that made you transition to Ansible and/or consider a hybrid push/pull approach?

Regarding whether something worked out with Chef or not: I'd rather
not discuss tool X vs tool Y, or 'should I adopt Ansible vs X', on
this mailing list, folks.

This is largely out of respect for people working on those other
tools, but also because, eventually, I'm going to tell someone how I
really feel about tool X or Y :)

Please reply to each other directly if you want to discuss that.

Fair enough; the question was really with regard to push vs pull vs hybrid (I should’ve been clearer) - not Chef vs Ansible vs X.
I will email you directly about the latter question :)