Receptor nodes can see each other, but work never leaves 'Pending' and AWX health checks fail

Hey folks!

TL;DR

I think I’m able to establish a Receptor mesh network between my AWX instance and remote execution nodes, but I can’t submit work to the edge node from my AWX Receptor instances.

Environment Context

I’m (hopefully) on the last leg of setting up a distributed Ansible deployment system. I have an AWX instance living in the cloud and I intend to run remote execution instances on edge networks.

For an execution environment, I created a custom image using ansible-builder. I used the community EE’s configuration as a starting point and then added a couple of Galaxy collections needed by my playbooks. The configuration I have is nearly identical to what’s defined there.

I’ve taken this execution environment image and I’m running it on a machine in my edge networks using Podman (it’s also set as my EE image in AWX). I override the entrypoint of the image ansible-builder creates so that Receptor is the container process instead of the dummy init script (I intend to use this container as a hybrid node, at least to get this up and running). I am ignoring the playbooks in the instance bundle AWX generates, but I am loading the certs generated by that bundle (they get mounted into the edge node container as a volume).

Mesh Network Established?

I have verified that the edge Receptor node can be seen by the execution nodes based in the cloud. From within the edge container:

# receptorctl status
Node ID: edge_node_ip
Version: 1.4.11+git2a07c9c
System CPU Count: 4
System Memory MiB: 3915

Connection                Cost
cloud_node_pod_name       1

Known Node               Known Connections
edge_node_ip             cloud_node_pod_name: 1
cloud_node_pod_name      edge_node_ip: 1

Route                     Via
cloud_node_pod_name       cloud_node_pod_name

Node                          Service   Type       Last Seen             Tags
edge_node_ip                  unix      StreamTLS  2024-11-27 04:42:29   {'type': 'Control Service'}
cloud_node_pod_name           control   Stream     2024-11-27 04:42:21   {'type': 'Control Service'}

Node                      Work Types
edge_node_ip              ansible-runner, local
cloud_node_pod_name       local, kubernetes-runtime-auth, kubernetes-incluster-auth

The same information is corroborated from the same command on the cloud node.

I’ve also verified a few more things. I can successfully run traceroute from the cloud node to the edge node (and vice versa):

# receptorctl traceroute edge_node_ip
0: cloud_node_pod_name in 96.302µs
1: edge_node_ip in 204.497033ms

The ping command works:

# receptorctl ping edge_node_ip
Reply from edge_node_ip in 93.602159ms
Reply from edge_node_ip in 64.784004ms
Reply from edge_node_ip in 83.656497ms
Reply from edge_node_ip in 68.753141ms

I can successfully submit work locally from the edge node:

# command produces a job which completes successfully
receptorctl  work submit ansible-runner --no-payload

When debug logging is enabled on the edge node, I see these entries regularly coming in:

DEBUG 2024/11/27 05:11:17 Sending service advertisement: &{edge_node_ip unix 2024-11-27 05:11:17.918769814 +0000 UTC m=+2405.032386098 2 map[type:Control Service] [{ansible-runner false} {local false}]}
DEBUG 2024/11/27 05:11:21 Received service advertisement from cloud_node_pod_name
DEBUG 2024/11/27 05:11:24 Received routing update Xr0bHzUe from cloud_node_pod_name via cloud_node_pod_name
DEBUG 2024/11/27 05:11:24 Sending routing update A2im9yqi. Connections: cloud_node_pod_name(1.00)
DEBUG 2024/11/27 05:11:34 Received routing update HXBCwpAw from cloud_node_pod_name via cloud_node_pod_name

I am able to connect to the edge node from the cloud node and run commands directly:

receptorctl --rootcas /etc/receptor/tls/ca/ connect --tls-client tlsclient {edge_node_ip} unix

# I can successfully run `work list` from here; verified by submitting local work units
# from the edge node and seeing them from the cloud node

And finally, while I’ve not installed NTP clients on each system, I have verified that the datetimes reported by the two systems are within one second of each other.

So, everything looks to be connected, except…

The Problem

The instance doesn’t get recognized by AWX and I can’t submit work to it from my cloud node.

  • AWX lists the instance as “unavailable”
  • The cloud execution node logs errors like this as frequently as the remote execution node claims to receive routing updates:
2024-11-27 05:18:17,713 WARNING  [hash] awx.main.tasks.system Execution node attempting to rejoin as instance edge_node_ip.
2024-11-27 05:18:37,856 INFO     [hash] awx.main.tasks.system Failed to find capacity of new or lost execution node edge_node_ip, errors:
Receptor error from edge_node_ip, detail:
Work unit expired on Wed Nov 27 05:18:37
min_value in DecimalField should be Decimal type.
  • And work units submitted to the edge node from the cloud node using receptorctl directly never get received by the remote node:
# The local equivalent of this command executed on the edge node directly 
# will create a job that completes successfully.  I've also tried this without TLS.
receptorctl work submit --tls-client tlsclient --node edge_node_ip ansible-runner --no-payload

# `receptorctl work list` output (this job never changes state, never gets seen by edge node):
{
    "Fr7dL57u": {
        "Detail": "Starting Worker",
        "ExtraData": {
            "Expiration": "2024-11-27T05:24:28.317660252Z",
            "LocalCancelled": false,
            "LocalReleased": false,
            "RemoteNode": "edge_node_id",
            "RemoteParams": {},
            "RemoteStarted": false,
            "RemoteUnitID": "",
            "RemoteWorkType": "ansible-runner",
            "SignWork": false,
            "TLSClient": "tlsclient"
        },
        "State": 0,
        "StateName": "Pending",
        "StdoutSize": 0,
        "WorkType": "remote"
    }
}
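
To make the "stuck" condition above easier to spot across many work units, here is a rough sketch that scans the JSON from `receptorctl work list` for remote units that are still `Pending` with `RemoteStarted` false, and reports how long each has until its `Expiration`. It relies only on the field names visible in the output above; the helper names `parse_expiration` and `stuck_remote_units` are hypothetical, and it assumes timestamps always end in `Z` (UTC).

```python
import json
from datetime import datetime, timezone


def parse_expiration(stamp):
    """Parse a UTC timestamp with nanoseconds, e.g. 2024-11-27T05:24:28.317660252Z."""
    base, _, frac = stamp.rstrip("Z").partition(".")
    micros = int((frac + "000000")[:6])  # datetime only keeps microseconds
    parsed = datetime.strptime(base, "%Y-%m-%dT%H:%M:%S")
    return parsed.replace(microsecond=micros, tzinfo=timezone.utc)


def stuck_remote_units(work_list, now):
    """Return (unit_id, remote_node, seconds_until_expiry) for each remote
    unit that is still Pending and was never started on the far side."""
    stuck = []
    for unit_id, unit in work_list.items():
        extra = unit.get("ExtraData") or {}
        if unit.get("WorkType") != "remote":
            continue
        if unit.get("StateName") != "Pending" or extra.get("RemoteStarted"):
            continue
        remaining = (parse_expiration(extra["Expiration"]) - now).total_seconds()
        stuck.append((unit_id, extra.get("RemoteNode"), remaining))
    return stuck
```

One way to use it would be to load the command's output with `json.loads(...)` and call `stuck_remote_units(units, datetime.now(timezone.utc))`; a negative remaining time means the unit has already passed its expiration, which would line up with the "Work unit expired" message in the AWX log.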

I’m not seeing any error logs on the edge node, either.

Edge Node receptor.conf

- node:
        id: edge_node_ip

- log-level:
        level: debug

- control-service:
        service: unix
        filename: /tmp/receptor.sock
        tls: receptor-tls-server

# These are the AWX-generated certs 
- tls-server:
        name: receptor-tls-server
        clientcas: /etc/receptor/certs/ca/mesh-CA.crt
        cert: /etc/receptor/certs/receptor.crt
        key: /etc/receptor/certs/receptor.key
        requireclientcert: true

# Names and relative paths kept identical to what is generated by AWX
- tls-client:
        name: receptor-tls-client
        rootcas: /etc/receptor/certs/ca/mesh-CA.crt
        insecureskipverify: false
        cert: /etc/receptor/certs/receptor.crt
        key: /etc/receptor/certs/receptor.key

- tcp-listener:
        bindaddr: 0.0.0.0
        port: 27199
        tls: receptor-tls-server

- work-command:
        worktype: ansible-runner
        command: ansible-runner
        params: worker
        allowruntimeparams: true

# Experimental...saw a `local` command must be advertised.  Also tried `default`.
- work-command:
        worktype: local
        command: ansible-runner
        params: worker
        allowruntimeparams: true
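
As a sanity check on the config above, here is a small sketch that verifies every `tls:` reference points at a `tls-server` or `tls-client` name defined in the same file. It assumes the file has already been parsed into a list of single-key mappings (for example with PyYAML's `yaml.safe_load`, which is not part of the standard library); the function name `undefined_tls_references` is made up for illustration. It deliberately does not distinguish server from client definitions.

```python
def undefined_tls_references(config):
    """Return (section, name) pairs whose `tls:` value is not a defined
    tls-server or tls-client name.

    `config` is the parsed receptor.conf: a list of single-key mappings.
    """
    defined = set()
    for entry in config:
        for section, body in entry.items():
            if section in ("tls-server", "tls-client"):
                defined.add(body["name"])

    missing = []
    for entry in config:
        for section, body in entry.items():
            ref = body.get("tls") if isinstance(body, dict) else None
            if ref and ref not in defined:
                missing.append((section, ref))
    return missing
```

For the config in this post the result should be an empty list, since both `control-service` and `tcp-listener` reference `receptor-tls-server`, which is defined.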

Final Thoughts

I’m really at a loss here and I believe I’ve exhausted all documentation available to me, including a number of posts on these forums. I’d sincerely appreciate any help, even if that is pointing me at Red Hat support (validation that RH would be the logical next step is enough for me at this point…).

Thanks in advance, folks.