Parallel execution of a playbook across multiple hosts

I don't know if this is a lack of memory; that normally produces a
kernel message about killing off processes. This looks like something
much worse that is causing segfaults all over.

[2220566.328031] kernel BUG at /build/buildd/linux-3.13.0/mm/memory.c:1838!
[2220566.328031] invalid opcode: 0000 [#3] SMP

looks like some nasty kernel bug related to memory allocation.

How do I upgrade to the latest version of ansible, and how do I solve the following forks issue? I am really struggling with it.
ansible-playbook ssh.yml --force-handlers --forks=100

PLAY [Transfer and execute a script.] *****************************************

TASK: [Transfer the script] ***************************************************
changed: [dsrv493 → 127.0.0.1]
changed: [dsrv487 → 127.0.0.1]
changed: [dsrv486 → 127.0.0.1]
changed: [dsrv209 → 127.0.0.1]
changed: [dsrv488 → 127.0.0.1]
changed: [dsrv531 → 127.0.0.1]
Process SyncManager-1:
Traceback (most recent call last):
File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python2.7/multiprocessing/managers.py", line 558, in _run_server
server.serve_forever()
File "/usr/lib/python2.7/multiprocessing/managers.py", line 184, in serve_forever
t.start()
File "/usr/lib/python2.7/threading.py", line 745, in start
_start_new_thread(self.__bootstrap, ())
error: can't start new thread
Traceback (most recent call last):
File "/usr/lib/pymodules/python2.7/ansible/runner/__init__.py", line 85, in _executor_hook
Process Process-85:
Traceback (most recent call last):
File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
Traceback (most recent call last):
File "/usr/bin/ansible-playbook", line 324, in <module>
self.run()
sys.exit(main(sys.argv[1:]))
File "/usr/bin/ansible-playbook", line 264, in main
File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/pymodules/python2.7/ansible/runner/__init__.py", line 81, in _executor_hook
result_queue.put(return_data)
pb.run()
File "/usr/lib/pymodules/python2.7/ansible/playbook/__init__.py", line 348, in run
File "<string>", line 2, in put
if not self._run_play(play):
File "/usr/lib/pymodules/python2.7/ansible/playbook/__init__.py", line 789, in _run_play
File "/usr/lib/python2.7/multiprocessing/managers.py", line 758, in _callmethod
if not self._run_task(play, task, False):
File "/usr/lib/pymodules/python2.7/ansible/playbook/__init__.py", line 497, in _run_task
results = self._run_task_internal(task, include_failed=include_failed)
File "/usr/lib/pymodules/python2.7/ansible/playbook/__init__.py", line 439, in _run_task_internal
results = runner.run()
File "/usr/lib/pymodules/python2.7/ansible/runner/__init__.py", line 1485, in run
Process Process-86:
while not job_queue.empty():
File "<string>", line 2, in empty
File "/usr/lib/python2.7/multiprocessing/managers.py", line 755, in _callmethod
conn.send((self._id, methodname, args, kwds))
results = self._parallel_exec(hosts)
File "/usr/lib/pymodules/python2.7/ansible/runner/__init__.py", line 1393, in _parallel_exec
IOError: [Errno 32] Broken pipe
prc.start()
File "/usr/lib/python2.7/multiprocessing/process.py", line 130, in start
self._connect()
File "/usr/lib/python2.7/multiprocessing/managers.py", line 742, in _connect
conn = self._Client(self._token.address, authkey=self._authkey)
File "/usr/lib/python2.7/multiprocessing/connection.py", line 175, in Client
Traceback (most recent call last):
self._popen = Popen(self)
File "/usr/lib/python2.7/multiprocessing/forking.py", line 121, in __init__
File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
changed: [dsrv449 → 127.0.0.1]

Can you please tell me how to solve this issue? It is causing me a lot of problems when running against these 100 servers.
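(For reference, two generic things that are often tried in this situation, sketched here only as a starting point: upgrading ansible with pip, and lowering the fork count so the control host spawns fewer worker processes. Neither of these addresses the kernel BUG shown above.)

    # upgrade ansible (assumes pip is available on the control host)
    $ sudo pip install --upgrade ansible

    # run with fewer forks
    $ ansible-playbook ssh.yml --force-handlers --forks=20

    # or set a lower default in ansible.cfg
    [defaults]
    forks = 20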

I'm not sure your issues are ansible related; they are just triggered
by ansible forks. I don't think upgrading to the latest version will
solve anything for you: you need to track down why your kernel is
hitting those segfault issues.

@Anand: you clearly have issues with your control host (hardware/OS). Try testing the hardware and get rid of these messages before trying again with ansible (or any other software) on this host.
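(For illustration, a couple of generic checks along those lines; the most thorough memory check is still booting memtest86+ from the boot menu:)

    # look for earlier memory/segfault messages in the kernel log
    $ dmesg | grep -iE 'bug|segfault|oops'

    # quick userspace memory test (assumes the memtester package is installed)
    $ sudo memtester 512M 1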

@Brian:
TL;DR: Good news: I have no DNS issue anymore. Bad news: it seems the root cause of my previous observations is that ansible has trouble dealing with host errors (unreachable hosts, etc.), and that hurts its parallelism very badly.

1- Not a DNS issue

I got rid of all faulty hosts and ran a new bunch of tests. Execution times are now identical between an inventory only with hostnames and an inventory with explicit IP addresses in ansible_ssh_host variable. It seems that was environmental. I’ll keep an eye on it on my side.
[reminder: still with pipelining enabled but ControlMaster disabled]
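(Roughly, that corresponds to something like the following in ansible.cfg; the exact option syntax may vary slightly between ansible versions:)

    [ssh_connection]
    pipelining = True
    ssh_args = -o ControlMaster=no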

  • $ time ansible all -i inventory_without_ip -m ping
    real 0m9.145s
    user 0m8.239s
    sys 0m2.787s
  • $ time ansible all -i inventory_with_ip -m ping
    real 0m9.040s
    user 0m7.570s
    sys 0m2.723s

I additionally checked the items you suggested:

  • DNS resolution seems fine (on one host there is a local DNS cache; the other host talks directly to the DNS servers):
  • time dig @127.0.1.1 -f inventory.yml → real 0m0.114s
  • time dig @datacenter_dns_server -f inventory.yml → real 0m0.118s
  • conclusion: so none of the control machines has DNS issues.
  • RAM: no problem there (one control host has 1 GB with 0 swap used during the tests, the other has 8 GB).
  • CPU: one core can briefly be maxed out, but most of the time CPU usage is <10% of one core. That explains only the slight delta in execution time between my 2 control hosts (the first has 1 core, the other 4).

Anyway, DNS topic closed.

2- Ansible handling host errors

What I can easily see, though, is that ansible does not handle host errors gracefully, and that badly hurts the intended parallelism: there is a time penalty for each host in error (unreachable, etc.). To simulate this, I added fake host entries to my inventories (format: '<non_existing_hostname> ansible_ssh_host='), and that gives the following results (forks=100, so everything should be executed in parallel):
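(For illustration, such a test inventory could look like the snippet below; the names and the address are made-up placeholders:)

    # reachable hosts
    realhost01
    realhost02
    # fake hosts that never answer
    fakehost01 ansible_ssh_host=192.0.2.1
    fakehost02 ansible_ssh_host=192.0.2.2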

  • 0 fake hosts: time ansible all -i inventory_test -m ping
    real 0m8.942s
    user 0m7.208s
    sys 0m2.830s

  • 1 fake host: time ansible all -i inventory_test -m ping
    real 0m18.951s
    user 0m7.337s
    sys 0m2.733s
    The SSH timeout is set to 10 in ansible.cfg. It seems to add 10 seconds to the previous execution time. OK, fair enough.

  • 2 fake hosts: time ansible all -i inventory_test -m ping
    real 0m21.471s
    user 0m7.720s
    sys 0m2.910s
    Now there is something weird.

  • 3 fake hosts: time ansible all -i inventory_test -m ping
    real 0m31.229s
    user 0m7.600s
    sys 0m2.832s
    Ouch!

  • 4 fake hosts: time ansible all -i inventory_test -m ping
    real 0m41.139s
    user 0m7.591s
    sys 0m2.847s
    Ok, there is a pattern now.

  • 5 fake hosts: time ansible all -i inventory_test -m ping
    real 0m51.172s
    user 0m7.563s
    sys 0m2.939s
    It’s confirmed.

That is not what I expected: the duration of the whole run should not grow past a certain value (the maximum of the slowest healthy host and the timeout for a host in error). Instead, each additional faulty host adds roughly the whole timeout, which strongly suggests some sequential handling rather than parallel execution.
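(In numbers: with the 10 s SSH timeout, the run takes roughly 21 s, 31 s, 41 s, 51 s for 2, 3, 4, 5 fake hosts, i.e. about one extra full timeout per faulty host, whereas a fully parallel run should stay near max(~9 s baseline, 10 s timeout) no matter how many hosts are unreachable.)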

3- Conclusion

That may explain my original observations: faulty hosts increase execution time linearly, showing a "serial/sequential" behaviour instead of handling all hosts in parallel.

Finally, consistent facts to work on! :)

Brian, are you able to reproduce this?

Regards,

Florent.

When making first contact with hosts, ansible must handle them
sequentially, as it might need to update the known_hosts file;
otherwise that file would get corrupted. That is probably what you
are seeing with your 'invalid hosts'.
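(For anyone who wants to rule that serialization out in a test environment, host key checking can be disabled so ansible does not have to touch known_hosts; a sketch, with the usual security caveats:)

    # ansible.cfg
    [defaults]
    host_key_checking = False

    # or for a single run
    $ ANSIBLE_HOST_KEY_CHECKING=False ansible all -i inventory_test -m ping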

Thank you Brian for the explanation.