Timeout Before Authentication - SSH

Hey all,

I am facing a timeout issue while trying to run a job template. This is our current setup:

AWX Version - 22.5.0 (AWX is running on OKD and is deployed using AWX Operator)

OKD Version - 4.11.0-0.okd-2022-12-02-145640 (Update Channel: Stable-4)

OpenSSH version on bastion host:

openssh-server-7.4p1-23.el7_9.x86_64
openssh-7.4p1-23.el7_9.x86_64
openssh-clients-7.4p1-23.el7_9.x86_64

OpenSSH version on remote server:

openssh-8.7p1-30.el9_2.x86_64
openssh-clients-8.7p1-30.el9_2.x86_64
openssh-server-8.7p1-30.el9_2.x86_64

The traffic flow is as follows:

AWX on OKD → Bastion Host/Jumpbox → Remote Server

Problem Statement:

When I run a template, the first few tasks complete successfully, but after that the server becomes unreachable and the SSH logs on the remote server show "Timeout Before Authentication". Here's an example:


Identity added: /runner/artifacts/25/ssh_key_data (AWX)
Certificate added: /runner/artifacts/25/ssh_key_data-cert.pub (CA:sshca_2020_2 USER:awx VALID:1696849513-1696936093)
SSH password:
[WARNING]: Invalid characters were found in group names but not replaced, use
-vvvv to see details

PLAY [Setting up hosts] ********************************************************

TASK [Gathering Facts] *********************************************************
ok: [SERVER1]

TASK [hosts : create hosts] ****************************************************
ok: [SERVER1]

PLAY [Setting up resolv.conf] **************************************************

TASK [resolv : Configure resolv.conf] ******************************************
ok: [SERVER1]

PLAY [Setting up chronyd/ntp & timezone] ***************************************

TASK [chrony : Ensure that the chrony package is installed] ********************
ok: [SERVER1]

TASK [chrony : Attempting to overlay chrony configurations] ********************
ok: [SERVER1] => (item=chrony.conf)
failed: [SERVER1] (item=chronyd) => {"ansible_loop_var": "item", "item": {"dst": "/etc/sysconfig/chronyd", "mode": 420, "src": "chronyd.sysconfig.j2"}, "msg": "Failed to connect to the host via ssh: kex_exchange_identification: Connection closed by remote host\r\nConnection closed by UNKNOWN port 65535", "unreachable": true}
fatal: [SERVER1]: UNREACHABLE! => {"changed": false, "msg": "All items completed", "results": [{"ansible_loop_var": "item", "changed": false, "checksum": "6f9d06e122ab7a370d9baa26c923ecc850718b49", "dest": "/etc/chrony.conf", "diff": {"after": {"path": "/etc/chrony.conf"}, "before": {"path": "/etc/chrony.conf"}}, "failed": false, "gid": 0, "group": "root", "invocation": {"module_args": {"_diff_peek": null, "_original_basename": "chrony.conf.j2", "access_time": null, "access_time_format": "%Y%m%d%H%M.%S", "attributes": null, "dest": "/etc/chrony.conf", "follow": true, "force": false, "group": "root", "mode": "420", "modification_time": null, "modification_time_format": "%Y%m%d%H%M.%S", "owner": "root", "path": "/etc/chrony.conf", "recurse": false, "selevel": null, "serole": null, "setype": null, "seuser": null, "src": null, "state": "file", "unsafe_writes": false}}, "item": {"dst": "/etc/chrony.conf", "mode": 420, "src": "chrony.conf.j2"}, "mode": "0420", "owner": "root", "path": "/etc/chrony.conf", "size": 186, "state": "file", "uid": 0}, {"ansible_loop_var": "item", "item": {"dst": "/etc/sysconfig/chronyd", "mode": 420, "src": "chronyd.sysconfig.j2"}, "msg": "Failed to connect to the host via ssh: kex_exchange_identification: Connection closed by remote host\r\nConnection closed by UNKNOWN port 65535", "unreachable": true}]}

PLAY RECAP *********************************************************************
SERVER1 : ok=4 changed=0 unreachable=1 failed=0 skipped=0 rescued=0 ignored=0


As you can see in the above output, the first few tasks ran successfully, but a subsequent task then fails. I have tried different playbooks as well; the same problem persists.

Output of the /var/log/secure:



What I have tried so far:

  1. Added the following Ansible variables:
  • ansible_ssh_args: '-o ControlMaster=auto -o ControlPersist=600s -o ConnectTimeout=600s -o ProxyCommand="ssh -o ConnectTimeout=600s -o StrictHostKeyChecking=no -W %h:%p -l awx BASTION_HOST_NAME"'

  • ansible_ssh_timeout: 120

  • ansible_command_timeout: 120

  • ansible_timeout: 120

  • Added AWX_TASK_ENV['ANSIBLE_TIMEOUT'] = '120' in /etc/tower/settings.py

  2. The playbook runs absolutely fine when I run it with the ansible-playbook command on the bastion host.
  3. I have played with various combinations of the above variables but am still getting the same issue. I even set the values as high as 1200!
  4. The IPs are whitelisted on all firewalls.

Any help would be highly appreciated. Please let me know if anything else is needed from my side.

Thanks,
Shrihari

Hi, if you use a module that supports it, like shell, I would look into async and poll.
https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_async.html

I use this in nearly all playbooks, because the connections can be unpredictable.
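
For reference, a minimal sketch of what that could look like on a long-running shell task (the task name, script path, and timings below are placeholders, not from the original playbooks):

# Hypothetical example: start the script in the background on the remote
# host and poll for its status, instead of holding one SSH session open
# for the whole duration of the task.
- name: Run a long maintenance script asynchronously
  ansible.builtin.shell: /usr/local/bin/long_maintenance.sh   # placeholder path
  async: 600   # allow up to 10 minutes of runtime on the remote host
  poll: 10     # Ansible checks back every 10 seconds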

Hi,

Here are a few suggestions you can try / things you can look at, off the top of my head:

  • Output of the /var/log/secure:

    Ok, these are sshd logs on your remote server. Can you post the ones from your bastion as well?

  • Is there a load balancer or some proxy in the way (between your control node and bastion, or between your bastion and remote server)? If so, have a look at their logs.

  • Could you run your playbook in verbose mode (-vvvv, to get connection info)?

  • The playbook runs absolutely fine when I run it using ansible-playbook command on the bastion host

    You could also re-run this command in verbose mode and compare it with the output of the same playbook run from AWX in verbose mode.

  • Could you post your sshd configs (bastion and remote server) ?

  • What about trying a manual ssh connection from multiple points: AWX pod → remote server, AWX pod → bastion, bastion → remote server, and seeing if you can reproduce the issue? Run in verbose mode (-vvv should be enough) and with the same ssh args you defined in ansible_ssh_args. Warning: you're using ControlMaster, so either close open sessions first with ssh -O exit ... or delete the sockets in your ControlPath, to avoid reusing existing connections. You could also run another test without using ControlMaster at all.

  • Same idea on sshd (again, both bastion and remote server): Stop sshd then start daemon again in debug mode: sudo sh -c '$(which sshd) -d'

  • Is there an IDS, antivirus or some tool like fail2ban running on your bastion or remote server ?

  • Does your playbook always fail on the same task, or is it irregular? Never mind, I had missed this:

    I have tried different playbooks as well, the same problem persists.

  • Could you also print output from: ansible-config dump | grep -Ei 'time|delay|ssh'

  • Can you also print the AWX node's ~/.ssh/config content?

I find it intriguing that you lose the connection after roughly one minute. Maybe it's some sshd parameter like LoginGraceTime:

The server disconnects after this time if the user has not
successfully logged in. If the value is 0, there is no
time limit. The default is 120 seconds.

Though it seems to me auth did come through. Perhaps a bad interaction with ControlMaster, as there shouldn't be any need to reauthenticate between commands, but logins do occur nonetheless. Do you happen to use pipelining in your Ansible config?

Hey @ptn,
Thanks for your suggestions. I see that when I change LoginGraceTime to 0, the problem is resolved. But here are a few things I wanted to highlight:

  1. The playbook takes about 10 minutes to finish in AWX, whereas if I run it using ansible-playbook it finishes in 4-5 minutes. I see in the logs that sometimes AWX just PAUSES: it closes the connection from the previous task and just sits there!

  2. I cannot keep LoginGraceTime as 0. It is a production box and we maintain strict security compliance. :slight_smile:

Is there a load balancer or some proxy in the way (between your control node and bastion, or between your bastion and remote server) ? If so, have a look to logs.

No we don’t have any LBs in between. There’s AWX on OKD, bastion and the remote server. That’s it

Could you run your playbook in verbose mode (-vvvv to connection infos) ?

I do have the output file, but I am not able to attach it as only images are allowed to be attached. I can't paste it either, as it crosses the word limit.

Could you post your sshd configs (bastion and remote server) ?

BASTION:

# CIS 6.2.1 Set SSH Protocol to 2
Protocol 2
# CIS 5.2.5 Set LogLevel to INFO
LogLevel INFO

HostKey /etc/ssh/ssh_host_rsa_key
HostKey /etc/ssh/ssh_host_ecdsa_key
HostKey /etc/ssh/ssh_host_ed25519_key
SyslogFacility AUTHPRIV
AuthorizedKeysCommand /usr/bin/sss_ssh_authorizedkeys
AuthorizedKeysCommandUser nobody
AuthorizedKeysFile  .ssh/authorized_keys
PasswordAuthentication no
ChallengeResponseAuthentication no
GSSAPIAuthentication yes
GSSAPICleanupCredentials no

# CIS 5.2.16
UsePAM yes
# CIS 5.2.6 Disable SSH X11 Forwarding
X11Forwarding no
# CIS 5.2.7 Set MaxAuthTries to 4 or Less
MaxAuthTries 4
# CIS 5.2.8 Set SSH IgnoreRhosts to Yes
IgnoreRhosts yes
# CIS 5.2.9 Set SSH HostbasedAuthentication to No
HostbasedAuthentication no
# CIS 5.2.10 Disable SSH Root Login
# Currently we use this when we first setup boxes and as failsafe
# Allow auth from Bastion
PermitRootLogin without-password
TrustedUserCAKeys /etc/ssh/ca.pub
AuthorizedPrincipalsFile /etc/ssh/auth_principals/%u
# End Bastion additions
# PermitRootLogin no
# CIS 5.2.11 Set SSH PermitEmptyPasswords to No
# Currently we disable the password for root not sure
PermitEmptyPasswords no
# CIS 5.2.12 Do not allow users to set environment options
PermitUserEnvironment no

# CIS 5.2.14 Ensure SSH LoginGraceTime is set to one minute or less
LoginGraceTime 0

UsePrivilegeSeparation sandbox      # Default for new installations.
#AcceptEnv LANG LC_CTYPE LC_NUMERIC LC_TIME LC_COLLATE LC_MONETARY LC_MESSAGES
#AcceptEnv LC_PAPER LC_NAME LC_ADDRESS LC_TELEPHONE LC_MEASUREMENT
#AcceptEnv LC_IDENTIFICATION LC_ALL LANGUAGE
#AcceptEnv XMODIFIERS

# CIS 6.2.11 Use Only Approved Cipher in Counter Mode
Ciphers aes256-ctr,aes192-ctr,aes128-ctr

# CIS 5.2.13 Set Idle Timeout Interval for User Login
ClientAliveInterval 900
ClientAliveCountMax 3

# CIS 6.2.13 Limit Access via SSH
# Limiting access is currently handled by coreldap system

# CIS 5.2.15 Set SSH Banner
# We currently do not have any banner text to send
Banner /etc/issue.net

# CIS 5.2.17 Disable AllowTcpForwarding
AllowTcpForwarding no

# CIS 5.2.18 Set MaxStartups
MaxStartups 10:30:60

# CIS 5.2.19 MaxSessions is set to 4 or less
MaxSessions 4

Subsystem   sftp    /usr/libexec/openssh/sftp-server

REMOTE SERVER:

# Ansible managed
Port 22
AddressFamily any
ListenAddress IP_ADDRESS
Protocol 2
HostKey /etc/ssh/ssh_host_rsa_key
AcceptEnv LANG LC_CTYPE LC_NUMERIC LC_TIME LC_COLLATE LC_MONETARY LC_MESSAGES
AcceptEnv LC_PAPER LC_NAME LC_ADDRESS LC_TELEPHONE LC_MEASUREMENT
AcceptEnv LC_IDENTIFICATION LC_ALL
AllowTcpForwarding yes
AuthenticationMethods publickey
AuthorizedKeysCommand /usr/bin/sss_ssh_authorizedkeys
AuthorizedKeysCommandUser nobody
AuthorizedKeysFile .ssh/authorized_keys
AuthorizedPrincipalsFile /etc/ssh/auth_principals/%u
Banner /etc/issue.net
ChallengeResponseAuthentication no
Ciphers aes256-ctr,aes192-ctr,aes128-ctr
GSSAPIAuthentication yes
GSSAPICleanupCredentials yes
HostbasedAuthentication no
IgnoreRhosts yes
IgnoreUserKnownHosts no
LogLevel INFO
LoginGraceTime 60
MACs hmac-sha2-512,hmac-sha2-256
MaxAuthTries 4
PasswordAuthentication no
PermitEmptyPasswords no
PermitRootLogin without-password
PermitTunnel no
PermitUserEnvironment no
PrintMotd yes
PubkeyAuthentication yes
ServerKeyBits 1024
Subsystem sftp /usr/libexec/openssh/sftp-server
SyslogFacility AUTHPRIV
TCPKeepAlive yes
TrustedUserCAKeys /etc/ssh/ca.pub
UseDNS yes
UsePAM yes
X11Forwarding no
X11UseLocalhost yes
XAuthLocation /usr/bin/xauth
ClientAliveCountMax 0
ClientAliveInterval 900
Match User nagios,ansible,obox-rundeck,developer
  AuthenticationMethods publickey

What about trying a manual ssh connection from multiple points:

As far as SSH is concerned, inbound connections to the remote server are only allowed from the bastion host. I have tried manual SSH from the AWX pod to both the bastion and the remote server, and it worked perfectly fine.

Is there an IDS, antivirus or some tool like fail2ban running on your bastion or remote server ?

No. But if that were the case, it should have caused a similar issue when running Ansible commands from the bastion host as well, right?

ansible.cfg:


# ==============================================
# This config file provides examples for running
# the OpenShift playbooks with the provided
# inventory scripts.

[defaults]
# Set the log_path
log_path = ~/openshift-ansible.log

# Additional default options for OpenShift Ansible
forks = 20
host_key_checking = False
retry_files_enabled = False
retry_files_save_path = ~/ansible-installer-retries
nocows = True
#private_key_file = /home/ansible/.ssh/ansible_id_rsa
roles_path = roles/
gathering = smart
fact_caching = jsonfile
fact_caching_connection = $HOME/ansible/facts
fact_caching_timeout = 600
callback_whitelist = profile_tasks
inventory_ignore_extensions = secrets.py, .pyc, .cfg, .crt, .ini

# work around privilege escalation timeouts in ansible:
timeout = 30

# Uncomment to use the provided example inventory
#inventory = inventory/hosts.example

[inventory]
# fail more helpfully when the inventory file does not parse (Ansible 2.4+)
unparsed_is_failed=true

# Additional ssh options for OpenShift Ansible
[ssh_connection]
pipelining = True
ssh_args = -C -o ControlMaster=auto -o ControlPersist=600s -o ConnectTimeout=600s
# proxy_command = "ssh -o StrictHostKeyChecking=no -W %h:%p -l awx BASTION_HOST"
timeout = 1200
# shorten the ControlPath which is often too long; when it is,
# ssh connection reuse silently fails, making everything slower.
control_path = %(directory)s/%%h-%%r
# if True, make ansible use scp if the connection type is ssh. (default is sftp)
scp_if_ssh = True

[persistent_connection]
command_timeout = 1200
connect_timeout = 1200

Can you also print AWX node’s ~/.ssh/config content ?


cat ssh_config
#       $OpenBSD: ssh_config,v 1.35 2020/07/17 03:43:42 dtucker Exp $

# This is the ssh client system-wide configuration file.  See
# ssh_config(5) for more information.  This file provides defaults for
# users, and the values can be changed in per-user configuration files
# or on the command line.

# Configuration data is parsed as follows:
#  1. command line options
#  2. user-specific file
#  3. system-wide file
# Any configuration value is only changed the first time it is set.
# Thus, host-specific definitions should be at the beginning of the
# configuration file, and defaults at the end.

# Site-wide defaults for some commonly used options.  For a comprehensive
# list of available options, their meanings and defaults, please see the
# ssh_config(5) man page.

# Host *
#   ForwardAgent no
#   ForwardX11 no
#   PasswordAuthentication yes
#   HostbasedAuthentication no
#   GSSAPIAuthentication no
#   GSSAPIDelegateCredentials no
#   GSSAPIKeyExchange no
#   GSSAPITrustDNS no
#   BatchMode no
#   CheckHostIP yes
#   AddressFamily any
#   ConnectTimeout 0
#   StrictHostKeyChecking ask
#   IdentityFile ~/.ssh/id_rsa
#   IdentityFile ~/.ssh/id_dsa
#   IdentityFile ~/.ssh/id_ecdsa
#   IdentityFile ~/.ssh/id_ed25519
#   Port 22
#   Ciphers aes128-ctr,aes192-ctr,aes256-ctr,aes128-cbc,3des-cbc
#   MACs hmac-md5,hmac-sha1,umac-64@openssh.com
#   EscapeChar ~
#   Tunnel no
#   TunnelDevice any:any
#   PermitLocalCommand no
#   VisualHostKey no
#   ProxyCommand ssh -q -W %h:%p gateway.example.com
#   RekeyLimit 1G 1h
#   UserKnownHostsFile ~/.ssh/known_hosts.d/%k
#
# This system is following system-wide crypto policy.
# To modify the crypto properties (Ciphers, MACs, ...), create a  *.conf
#  file under  /etc/ssh/ssh_config.d/  which will be automatically
# included below. For more information, see manual page for
#  update-crypto-policies(8)  and  ssh_config(5).
Include /etc/ssh/ssh_config.d/*.conf

AWX NODE:


sh-5.1# pwd
/etc/ssh/ssh_config.d
sh-5.1# 
sh-5.1# 
sh-5.1# ls -l
total 4
-rw-r--r--. 1 root root 581 Jul 20 10:18 50-redhat.conf
sh-5.1# 
sh-5.1# 
sh-5.1# cat 50-redhat.conf 
# The options here are in the "Match final block" to be applied as the last
# options and could be potentially overwritten by the user configuration
Match final all
        # Follow system-wide Crypto Policy, if defined:
        Include /etc/crypto-policies/back-ends/openssh.config

        GSSAPIAuthentication yes

# If this option is set to yes then remote X11 clients will have full access
# to the original X11 display. As virtually no X11 client supports the untrusted
# mode correctly we set this to yes.
        ForwardX11Trusted yes

# Uncomment this if you want to use .local domain
# Host *.local

Configurations being passed as variables:


---
ansible_ssh_user: root
ansible_ssh_args: '-o ControlMaster=auto -o ControlPersist=600s -o ConnectTimeout=600s -o ProxyCommand="ssh -o ConnectTimeout=600s -o StrictHostKeyChecking=no -W %h:%p -l awx BASTION_HOST"'
ansible_user: root
ansible_become: no
ansible_become_user: root
ansible_ssh_timeout: 120
ansible_command_timeout: 120

Can you suggest a workaround for this LoginGraceTime? Since tweaking it resolved the issue, I suspect the problem is related to it, but CIS compliance is in the way. Your help is deeply appreciated :slight_smile:

Thanks

Hey @disi

Now that I have narrowed down the issue, here is the update. After setting LoginGraceTime to 0, the issue is resolved. But when I inspected the logs, it looks like asynchronous behaviour might be a factor here: the connection is being closed by the remote host before the job is complete. Can you think of anything that might work here other than async and poll? Asking because we have hundreds of playbooks, and changing each one to use async and poll would take forever! :sweat_smile:

I have pasted all the required details as a reply to the other comment below. Please have a look :slight_smile:

There is a lot to unpack here; I had a quick look but will probably come back when I have more time to read it more carefully.

A few notes though:

  • I see that the playbook takes about 10 mins to finish in AWX. Whereas if I run the playbook using ansible-playbook it’ll finish in less than 4-5 mins.

    Well, it's probably because authentication takes longer now that you've removed the 1-minute timeout. I think you should look closely at your authentication config, starting with PAM. You might find something like a user database lookup along the way, or authorization mechanisms that add to the login time.

  • I cannot keep LoginGraceTime as 0. It is a production box and we maintain strict security compliance. :slight_smile:

    Sure, and I wasn't suggesting that either :). You should always look for the root cause and try to fix it before addressing the symptoms.

  • I do have the output file, but I am not able to attach it as only images are allowed to be attached. Can’t paste it either as it crosses the word limit

    I get that. Note that you can always use an online paste service like gist or pastebin.

  • No. But if this was the case, it should have caused a similar issue while running ansible commands from the bastion host as well, right?

    Not necessarily; it would depend on where the tool was running. For instance, Fail2ban on the bastion could block inbound traffic from your control node, but you would still be able to send outbound requests from the bastion.

  • nocows = True

    What!?? You don't use cowsay on your production systems? :smile:

  • Can you suggest me some workaround for this LoginGraceTime ?

    Not at this point, because we haven't found the root cause yet. You could try running sshd on your remote server in debug mode and watch its logs as you run your playbook from AWX.

    Perhaps have a look at the MaxStartups and MaxSessions sshd parameters, which might play a role there. I'm picturing Ansible trying to open lots of SSH sessions (even though you use pipelining) with a long auth process (for a reason unknown as of now), and sshd refusing to open new sessions once some limit is reached.
    You use ControlMaster client-side, so in theory there shouldn't be new auth requests, but I'm not sure how that behaves with ProxyCommand; perhaps give ProxyJump a try instead, if the OpenSSH versions in use permit it.

    Maybe look for lingering ssh sessions: sockets in your ControlPath folder as well as open sessions on the remote machine and the bastion (see below).

    While running your playbook again, you could also watch and count ssh sessions with something like watch "lsof -Pn -i | grep -Ei '^ssh\b' | grep -Eiv '\(LISTEN\)$'" (or with ss, netstat or even conntrack) on your bastion and remote server, though I think your bastion might well be the culprit here.
    Would it be possible to deploy a new server identical to your remote server and try running your playbook on it from AWX without going through your bastion? Just so we can rule out issues on either the AWX node or the remote server.

@ptn

I get that. Note that you can always use an online paste service like gist or pastebin.

Is there an IDS, antivirus or some tool like fail2ban running on your bastion or remote server ?

Nope, there is nothing like that in place.

What !?? You don’t use cowsay on your production systems ? :smile:

Hahaha… I will make sure cowsay gets its place in production :smiley:

While running your playbook again, you could also watch and count ssh…

Okay, I think I found the root of the problem here.

Ansible Network FAQ — Ansible Documentation

It clearly states that when ProxyCommand is used, every task opens a new connection to the remote server. I verified this with the lsof command you mentioned and via the logs. A playbook run from the CLI opens only one connection for all the tasks (when pipelined), but it's inevitable that every task opens a new SSH connection when proxying.

Would it be possible to deploy a new and identical server as your remote server and try to run your playbook on it from AWX without going through your bastion ? Just so we can rule out issues on either AWX node or remote server.

Yes, I did this. I removed the ProxyCommand and everything worked perfectly fine!

We have strict PCI compliance requirements that mandate the use of jumpboxes. What do you suggest I do in this case?

What's happening here is that, with LoginGraceTime set to 0, the AWX node gets all the time it needs to authenticate. When there are quite a few tasks and each task requires its own authentication, at some point (with LoginGraceTime set to, say, 60) a connection is bound to exceed the time limit and fail. This is what I see happening in the logs.

What do you suggest me to do in this case?

Have you tried using ProxyJump instead of ProxyCommand? I'm not sure if it behaves the same way.

That being said, I'm not sure why there would be further authentications past the first one, since you use ControlMaster and thus reuse open sessions through sockets.
I'm guessing some parameter in the bastion's sshd config is forcing a reauth every time, or something like that. Have a closer look at sockets and lingering sessions, perhaps.

Another suggestion that might help you find something would be to restart sshd on the bastion in debug mode and closely monitor ssh sessions while running your playbook from AWX.

I also looked at the playbook execution output you linked in debug mode, and haven't found anything useful we didn't already know :confused::

<SERVER1> (255, b'', b'Connection timed out during banner exchange\\r\\nConnection to UNKNOWN port 65535 timed out\\r\\n')

I’m really starting to think your issue stems from the authentication pipeline. I see you’re using GSSAPIAuthentication, so perhaps an Active Directory auth delegation through sssd or something similar. I’d focus on this part. You can start by monitoring not only sshd logs but also others; just run journalctl -f on your bastion and you might catch something.
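
If you want a quick way to test that theory, one possible sketch (these are standard OpenSSH client options; the ProxyCommand and the BASTION_HOST placeholder mirror your current variables, so treat it as an assumption to adapt) is to turn off GSSAPI for the Ansible connection and see whether the login delay and the timeouts disappear:

# Hypothetical test: force publickey/certificate auth only, skipping
# GSSAPI/Kerberos negotiation on both hops, to check whether the slow
# authentication comes from GSSAPI/sssd lookups.
ansible_ssh_args: '-o GSSAPIAuthentication=no -o PreferredAuthentications=publickey -o ControlMaster=auto -o ControlPersist=600s -o ProxyCommand="ssh -o GSSAPIAuthentication=no -o PreferredAuthentications=publickey -o StrictHostKeyChecking=no -W %h:%p -l awx BASTION_HOST"'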

Cool, I will try all the solutions you have offered and get back to you on Monday. :smiley:
