We are testing AWX’s ability to handle concurrent Ansible jobs. We ran a single Ansible job against 1000 hosts with various numbers of forks (100, 200, 500, etc.). The AWX server is well provisioned (8 CPUs and 64 GB of memory). The job is a simple one whose shell command is “echo start; sleep 20; echo”. In the results, AWX_web intermittently shows no response from some of the 1000 hosts, so the whole job stays stuck in the running state. While it is in this state, any other job I launch does not get executed either, so this is a serious problem. Also, the larger the number of forks, the more likely responses are to go missing.
To analyze the problem, we captured the packets exchanged between AWX_task and the 1000 hosts. Based on the capture, all of the 1000 hosts completed their jobs and closed the SSH session. That means AWX_task also knows that all of the jobs were completed, but AWX_web still shows that some of the results have not been received. I think there might be a problem between AWX_web, AWX_task and RabbitMQ.
What could be the cause of this error? What logs would you need to diagnose this problem?
Can you include more experimental setup information?
Ansible version
Default Ansible config?
1,000 real hosts? Virtualized hosts? Hardware specs of the 1,000 hosts.
Is the postgres db running on the same machine?
What kind of storage is backing the postgres db?
Have you tried running the equivalent test without Tower? Just invoke Ansible on the command line and observe its behavior under the large fork workload.
Does Tower succeed with a smaller fork count and 1,000 hosts?
When trying to emulate multiple hosts with a single host you often run into default configuration limitations. The playbook in https://github.com/chrismeyersfsu/ansible-examples/tree/master/large_number_of_hosts_on_single_machine should help you get past the system configuration limitations. These configuration changes are for your managed nodes (the machine that Ansible connects to, not the box Tower is running on). We’ve seen the symptoms you describe under this exact scenario; the small SSH and system limit tweaks should get you past this.
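For the command-line test suggested above, here is a rough sketch of how it could be scripted. It only builds and runs an ad-hoc `ansible` invocation with the shell module at a few fork counts. The inventory file name `hosts.ini` and the helper names are assumptions for illustration, not anything from this thread; the shell command is the one from the original report.

```python
import shlex
import subprocess


def build_adhoc_command(inventory, forks, pattern="all",
                        shell_cmd="echo start; sleep 20; echo"):
    """Return the argv list for an ad-hoc `ansible` run using the shell module."""
    return [
        "ansible", pattern,
        "-i", inventory,      # inventory file (hypothetical name, see below)
        "-m", "shell",        # same module the AWX job uses
        "-a", shell_cmd,      # module arguments: the command to run
        "-f", str(forks),     # number of parallel forks
    ]


def run_adhoc(inventory, forks):
    """Run the ad-hoc command and return (exit code, stdout, stderr)."""
    argv = build_adhoc_command(inventory, forks)
    proc = subprocess.run(argv, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                          universal_newlines=True)
    return proc.returncode, proc.stdout, proc.stderr


if __name__ == "__main__":
    # Print the commands that would be run; "hosts.ini" is an assumed file name.
    for forks in (100, 200, 500):
        argv = build_adhoc_command("hosts.ini", forks)
        print(" ".join(shlex.quote(a) for a in argv))
```

Running the same fork counts outside Tower should show whether the missing results come from Ansible itself or from the layers above it.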
Ansible version is 2.4.1
No changes to the Ansible config. We are using it as AWX has set it.
We are using 1000 virtual hosts: we allocated multiple IP addresses on a single server.
That server has 8 CPUs and 64 GB of RAM and runs CentOS.
The Postgres DB is on the same server as AWX.
The server is using internal SSD storage
We have not tried without Tower, but we will try it out.
As the Ansible config is the default, the log settings are unchanged, so no logs are written.
As for the multiple-host emulation, we increased the SSH server’s limit on simultaneous sessions:
MaxSessions 1000
MaxStartups 1000:30:2000
UseDNS no
StrictHostKeyChecking no
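To make the MaxStartups value above easier to reason about: the sshd_config man page describes the `start:rate:full` form as random early drop, where sshd starts refusing unauthenticated connections with probability rate/100 once `start` of them are pending, increasing linearly until everything is refused at `full`. The sketch below just evaluates that description. The function names are made up for illustration, and sshd's own arithmetic may round differently.

```python
def parse_max_startups(value):
    """Parse a MaxStartups value such as '1000:30:2000' into (start, rate, full)."""
    parts = [int(p) for p in value.split(":")]
    if len(parts) == 1:
        # Single-number form: a hard limit, no random early drop.
        return parts[0], 100, parts[0]
    if len(parts) != 3:
        raise ValueError("expected 'N' or 'start:rate:full'")
    return parts[0], parts[1], parts[2]


def drop_probability(unauthenticated, start, rate, full):
    """Approximate chance (0.0 to 1.0) that a new connection is refused."""
    if unauthenticated < start:
        return 0.0
    if unauthenticated >= full:
        return 1.0
    # Linear ramp from rate% at `start` up to 100% at `full`.
    extra = (100 - rate) * (unauthenticated - start) / (full - start)
    return (rate + extra) / 100.0


if __name__ == "__main__":
    start, rate, full = parse_max_startups("1000:30:2000")
    for pending in (100, 500, 1000, 1500, 2000):
        print(pending, drop_probability(pending, start, rate, full))
```

Under this reading, with 100 to 500 forks the number of pending unauthenticated connections should stay well below 1000, so MaxStartups alone would not be expected to drop anything.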
In the Ansible playbook, we set gather_facts: no.
To avoid simultaneous file write issues, we did not write to a file and only used echo.
The servers run on VMware ESXi 6.5, and the managed nodes are on the virtual server.
Our test doesn’t fail all the time, but the failure occurs unpredictably. We are increasing the CPU and memory to find out the conditions where it runs successfully.
After discovering failures with 8 CPU 64GB RAM, we tried it again with 2 CPU and 4GB. The result is that it fails occasionally with 100 forks.
We believe that the operation is actually successful but AWX_Web is in a waiting state.
So we have already increased the SSH session limits, but the test still fails.
What do you suggest we do to find the cause of this error?
Hi, as requested we ran the test directly through AWX_task without going through AWX_Web. We ran the test 10 times with 100 forks. In one of the runs, it failed to complete all of the 1000 jobs: one job on one host (hostname: myhost369, IP: 20.0.1.119) did not complete. We captured the TCP traffic, and it seems that Ansible opened the SSH port successfully but did not execute any commands, and then requested to close the TCP connection. 20.0.1.119.csv is the failure log from the TCP dump captured on the AWX server (whose IP address is 20.0.0.254; the awx_task Docker container’s IP is 172.70.0.6). 20.0.1.10.csv and 20.0.1.120.csv are logs from successful connections. 20.0.1.119, 20.0.1.10 and 20.0.1.120 are the IP addresses of hosts that Ansible connects to.
Do you need more information, and what could be the cause of this error?
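When repeating runs like the ones described above, a small helper can point out which hosts never reported a result. This is only a sketch: it assumes ad-hoc output where each host's result starts with a line like `myhost1 | SUCCESS | rc=0 >>`, and the sample text in the demo is made up for illustration, not taken from our logs.

```python
import re

# Assumed shape of a per-host result line in ad-hoc `ansible` output.
RESULT_LINE = re.compile(r"^(\S+) \| (SUCCESS|CHANGED|FAILED|UNREACHABLE)\b")


def hosts_with_results(output):
    """Map each host that reported a result to its reported status."""
    seen = {}
    for line in output.splitlines():
        match = RESULT_LINE.match(line)
        if match:
            seen[match.group(1)] = match.group(2)
    return seen


def missing_hosts(expected, output):
    """Return the expected hosts that have no result line in the output."""
    seen = hosts_with_results(output)
    return sorted(host for host in expected if host not in seen)


if __name__ == "__main__":
    # Illustrative sample only; real output would come from an actual run.
    sample = (
        "myhost1 | SUCCESS | rc=0 >>\n"
        "start\n"
        "\n"
        "myhost2 | UNREACHABLE! => {\n"
    )
    print(missing_hosts(["myhost1", "myhost2", "myhost369"], sample))
```

A host that is listed as missing, rather than FAILED or UNREACHABLE, would match the symptom where the SSH connection opens but no command is executed.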
Since Ansible “fails” outside of AWX it sounds like this is a core Ansible issue. I think you should send a message to the Ansible mailing list about this issue.
We would like to look into the issues you have mentioned (connection issues with the recent 1.0.2.x release).
Can you please share with us the details of the issue?
Looks like it was an environment issue. I had mixed versions of the Docker libraries installed on my host nodes, and that caused intermittent connection issues on the bridge network. I corrected that, and it seems to have resolved the problem.