Hey all,
I am facing a timeout issue while trying to run a job template. This is our current setup:
AWX Version - 22.5.0 (AWX is running on OKD and is deployed using AWX Operator)
OKD Version - 4.11.0-0.okd-2022-12-02-145640 (Update Channel: Stable-4)
OpenSSH version on bastion host:
openssh-server-7.4p1-23.el7_9.x86_64
openssh-7.4p1-23.el7_9.x86_64
openssh-clients-7.4p1-23.el7_9.x86_64
OpenSSH version on remote server:
openssh-8.7p1-30.el9_2.x86_64
openssh-clients-8.7p1-30.el9_2.x86_64
openssh-server-8.7p1-30.el9_2.x86_64
The traffic flow is as follows:
AWX on OKD → Bastion Host/Jumpbox → Remote Server
Problem Statement:
When I run a job template, the first few tasks run successfully. After that, the server becomes unreachable, and I see "Timeout Before Authentication" in the SSH logs on the remote server. Here's an example:
Identity added: /runner/artifacts/25/ssh_key_data (AWX)
Certificate added: /runner/artifacts/25/ssh_key_data-cert.pub (CA:sshca_2020_2 USER:awx VALID:1696849513-1696936093)
SSH password:
[WARNING]: Invalid characters were found in group names but not replaced, use
-vvvv to see details
PLAY [Setting up hosts] ********************************************************
TASK [Gathering Facts] *********************************************************
ok: [SERVER1]
TASK [hosts : create hosts] ****************************************************
ok: [SERVER1]
PLAY [Setting up resolv.conf] **************************************************
TASK [resolv : Configure resolv.conf] ******************************************
ok: [SERVER1]
PLAY [Setting up chronyd/ntp & timezone] ***************************************
TASK [chrony : Ensure that the chrony package is installed] ********************
ok: [SERVER1]
TASK [chrony : Attempting to overlay chrony configurations] ********************
ok: [SERVER1] => (item=chrony.conf)
failed: [SERVER1] (item=chronyd) => {"ansible_loop_var": "item", "item": {"dst": "/etc/sysconfig/chronyd", "mode": 420, "src": "chronyd.sysconfig.j2"}, "msg": "Failed to connect to the host via ssh: kex_exchange_identification: Connection closed by remote host\r\nConnection closed by UNKNOWN port 65535", "unreachable": true}
fatal: [SERVER1]: UNREACHABLE! => {"changed": false, "msg": "All items completed", "results": [{"ansible_loop_var": "item", "changed": false, "checksum": "6f9d06e122ab7a370d9baa26c923ecc850718b49", "dest": "/etc/chrony.conf", "diff": {"after": {"path": "/etc/chrony.conf"}, "before": {"path": "/etc/chrony.conf"}}, "failed": false, "gid": 0, "group": "root", "invocation": {"module_args": {"_diff_peek": null, "_original_basename": "chrony.conf.j2", "access_time": null, "access_time_format": "%Y%m%d%H%M.%S", "attributes": null, "dest": "/etc/chrony.conf", "follow": true, "force": false, "group": "root", "mode": "420", "modification_time": null, "modification_time_format": "%Y%m%d%H%M.%S", "owner": "root", "path": "/etc/chrony.conf", "recurse": false, "selevel": null, "serole": null, "setype": null, "seuser": null, "src": null, "state": "file", "unsafe_writes": false}}, "item": {"dst": "/etc/chrony.conf", "mode": 420, "src": "chrony.conf.j2"}, "mode": "0420", "owner": "root", "path": "/etc/chrony.conf", "size": 186, "state": "file", "uid": 0}, {"ansible_loop_var": "item", "item": {"dst": "/etc/sysconfig/chronyd", "mode": 420, "src": "chronyd.sysconfig.j2"}, "msg": "Failed to connect to the host via ssh: kex_exchange_identification: Connection closed by remote host\r\nConnection closed by UNKNOWN port 65535", "unreachable": true}]}
PLAY RECAP *********************************************************************
SERVER1 : ok=4 changed=0 unreachable=1 failed=0 skipped=0 rescued=0 ignored=0
As the output above shows, the first few tasks run successfully, and then a later task fails because the host becomes unreachable. I have tried different playbooks as well, and the same problem persists.
Output of the /var/log/secure:
What I have tried so far:
- Added the following Ansible variables:
  - ansible_ssh_args: '-o ControlMaster=auto -o ControlPersist=600s -o ConnectTimeout=600s -o ProxyCommand="ssh -o ConnectTimeout=600s -o StrictHostKeyChecking=no -W %h:%p -l awx BASTION_HOST_NAME"'
  - ansible_ssh_timeout: 120
  - ansible_command_timeout: 120
  - ansible_timeout: 120
- Added AWX_TASK_ENV['ANSIBLE_TIMEOUT'] = '120' in /etc/tower/settings.py
- The playbook runs fine when I run it with the ansible-playbook command on the bastion host
- I have tried various combinations of the above variables but still get the same issue. I even set the values as high as 1200!
- The IPs are whitelisted on all firewalls
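
For reference, here is roughly how the connection variables listed above look when written out as YAML. This is only a sketch: the file path is a placeholder, and the quoting around the nested ProxyCommand (single quotes outside, double quotes inside) is an assumption about how the string is escaped. BASTION_HOST_NAME stands in for the real bastion hostname.

```yaml
# group_vars/all.yml  (hypothetical path; the same variables could live in the AWX inventory instead)
# Assumed quoting: outer single quotes so the inner double-quoted ProxyCommand survives intact.
ansible_ssh_args: '-o ControlMaster=auto -o ControlPersist=600s -o ConnectTimeout=600s -o ProxyCommand="ssh -o ConnectTimeout=600s -o StrictHostKeyChecking=no -W %h:%p -l awx BASTION_HOST_NAME"'

# Timeout values (in seconds) that were tried; raising them up to 1200 made no difference.
ansible_ssh_timeout: 120
ansible_command_timeout: 120
ansible_timeout: 120
```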
Any help would be highly appreciated. Please let me know if anything else is needed from my side.
Thanks,
Shrihari