Hi
We are testing AWX_task's ability to handle concurrent Ansible jobs. We ran a single Ansible job against 1000 virtual hosts with various numbers of forks (100, 200, 500, etc.). The Ansible server has ample hardware (8 CPUs and 64 GB of memory). The job is simple: its shell command is "echo start; sleep 20; echo".
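For a rough sense of how long one run should take, here is a minimal Python sketch. It assumes that up to `forks` hosts are worked on at the same time and that the 20-second sleep dominates each host's runtime. The function name is ours, and SSH setup and module transfer overhead are ignored, so the result is only a lower bound.

```python
import math

def min_runtime_seconds(hosts, forks, task_seconds=20):
    """Lower bound on runtime: hosts are handled in waves of at most `forks`."""
    waves = math.ceil(hosts / forks)
    return waves * task_seconds

# Fork counts mentioned in the post, against 1000 hosts.
for forks in (100, 200, 500):
    print(forks, "forks ->", min_runtime_seconds(1000, forks), "seconds minimum")
```

A run that finishes much faster or much slower than this bound may be worth a closer look.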
We ran the test 10 times directly on AWX_task (without going through AWX_Web), using 100 forks. In one of the runs, the job did not complete on all 1000 hosts: it did not complete on one host (hostname: myhost369, IP: 20.0.1.119). We captured the TCP traffic, and it appears that Ansible connected to the SSH port successfully but did not execute any commands, and then requested to close the TCP connection. 20.0.1.119.csv is the log of the failed connection, taken from the TCP dump captured on the AWX server (IP address 20.0.0.254; the awx_task Docker container's IP is 172.70.0.6). 20.0.1.10.csv and 20.0.1.120.csv are logs of successful connections. 20.0.1.119, 20.0.1.10, and 20.0.1.120 are the IP addresses of hosts that Ansible connects to.
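To compare the failed capture with the successful ones, something like the following Python sketch could help. It assumes the CSV files use Wireshark's default export columns ("No.", "Time", "Source", "Destination", "Protocol", "Length", "Info"), which may not match the attached files. The function name and the sample rows are invented for illustration and are not taken from the attachments.

```python
import csv
import io
from collections import Counter

def summarize_capture(csv_text):
    """Count packets per protocol and list TCP teardown (FIN/RST) packets."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    protocols = Counter(row["Protocol"] for row in rows)
    teardown = [
        (row["Time"], row["Source"], row["Info"])
        for row in rows
        if "[FIN" in row["Info"] or "[RST" in row["Info"]
    ]
    return {"packets": len(rows), "protocols": dict(protocols), "teardown": teardown}

# Invented sample data (documentation-only addresses), NOT from the real capture.
SAMPLE = '''\
"No.","Time","Source","Destination","Protocol","Length","Info"
"1","0.000000","192.0.2.1","192.0.2.2","TCP","74","40000 > 22 [SYN] Seq=0"
"2","0.000300","192.0.2.2","192.0.2.1","TCP","74","22 > 40000 [SYN, ACK] Seq=0 Ack=1"
"3","0.000900","192.0.2.2","192.0.2.1","SSHv2","87","Server: Protocol (SSH-2.0-OpenSSH_7.4)"
"4","0.500000","192.0.2.1","192.0.2.2","TCP","66","40000 > 22 [FIN, ACK] Seq=1 Ack=22"
'''

summary = summarize_capture(SAMPLE)
print(summary["packets"], summary["protocols"])
for time, source, info in summary["teardown"]:
    print("teardown at", time, "from", source, ":", info)
```

Running it on each file (for example `summarize_capture(open("20.0.1.119.csv", encoding="utf-8").read())`) would show which side sent the first FIN or RST and how many SSH packets were exchanged before it.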
Ansible version is 2.4.2
Docker version 17.0.9.0-ce
AWX version 1.0.1.268
The Ansible config is the default, so the log settings are unchanged and no logs are written.
To emulate multiple hosts, we increased the SSH server's limits on simultaneous sessions:
MaxSessions 1000
MaxStartups 1000:30:2000
UseDNS no
StrictHostKeyChecking no
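As a sanity check on the MaxStartups value, here is a minimal Python sketch of how we understand the start:rate:full form from the sshd_config manual: below `start` unauthenticated connections nothing is refused, at `start` connections are refused with probability rate/100, and the probability rises linearly to 100% at `full`. The function name is ours and this is a model of the documented behaviour, not sshd's actual code.

```python
def refuse_probability(unauthenticated, start=1000, rate=30, full=2000):
    """Modelled chance that sshd refuses a new connection under MaxStartups."""
    if unauthenticated < start:
        return 0.0
    if unauthenticated >= full:
        return 1.0
    base = rate / 100
    # Linear ramp from `base` at `start` up to 1.0 at `full`.
    return base + (1 - base) * (unauthenticated - start) / (full - start)

# 100 is the fork count of the failing test; the others probe the ramp.
for pending in (100, 1000, 1500, 2000):
    print(pending, "unauthenticated ->", refuse_probability(pending))
```

Under this model, a run with 100 forks stays well below the `start` threshold of 1000, so random refusal by MaxStartups would not be expected unless other clients were connecting at the same time.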
In the Ansible playbook, we set gather_facts: no.
To avoid simultaneous file write issues, we did not write to a file and only used echo.
The servers run on VMware ESXi 6.5, and the managed nodes are on the virtual server.
For your reference, here is the complete Wireshark capture file.
https://drive.google.com/file/d/18gCtC-mkzsKFFay7N8a2eVhdPIJIR7D-/view?usp=sharing
What do you suggest we do to find the cause of this error?
Thanks
(attachments)
20.0.1.119.csv (6.48 KB)
20.0.1.120.csv (14.5 KB)
20.0.1.10.csv (12.8 KB)