Hey folks!
TL;DR
I appear to have established a Receptor mesh network between my AWX instance and a remote execution node, but I can’t submit work to the edge node from my AWX Receptor instances.
Environment Context
I’m (hopefully) on the last leg of setting up a distributed Ansible deployment system. I have an AWX instance living in the cloud, and I intend to run remote execution instances on edge networks.
For an execution environment, I created a custom image using ansible-builder. I used the community EE’s configuration as a starting point and then added a couple of Galaxy collections needed by my playbooks. The configuration I have is nearly identical to what’s defined there.
I’ve taken this execution environment image and I’m running it on a machine in my edge network using Podman (it’s also set as my EE image in AWX). I override the entrypoint of the image that ansible-builder creates so that Receptor is the container process instead of the dummy init script (I intend to use this container as a hybrid node, at least to get this up and running). I am ignoring the playbooks from the instance bundle AWX generates, but I am loading the certs generated by that bundle (they get mounted into the edge node container as a volume).
Mesh Network Established?
I have verified that the edge Receptor node can be seen by the execution nodes based in the cloud. From within the edge container:
# receptorctl status
Node ID: edge_node_ip
Version: 1.4.11+git2a07c9c
System CPU Count: 4
System Memory MiB: 3915

Connection            Cost
cloud_node_pod_name   1

Known Node            Known Connections
edge_node_ip          cloud_node_pod_name: 1
cloud_node_pod_name   edge_node_ip: 1

Route                 Via
cloud_node_pod_name   cloud_node_pod_name

Node                  Service   Type        Last Seen             Tags
edge_node_ip          unix      StreamTLS   2024-11-27 04:42:29   {'type': 'Control Service'}
cloud_node_pod_name   control   Stream      2024-11-27 04:42:21   {'type': 'Control Service'}

Node                  Work Types
edge_node_ip          ansible-runner, local
cloud_node_pod_name   local, kubernetes-runtime-auth, kubernetes-incluster-auth

The same information is corroborated from the same command on the cloud node.
I’ve also verified a few more things. I can successfully run traceroute from the cloud node (and vice versa):
# receptorctl traceroute edge_node_ip
0: cloud_node_pod_name in 96.302µs
1: edge_node_ip in 204.497033ms
The ping command works:
# receptorctl ping edge_node_ip
Reply from edge_node_ip in 93.602159ms
Reply from edge_node_ip in 64.784004ms
Reply from edge_node_ip in 83.656497ms
Reply from edge_node_ip in 68.753141ms
I can successfully submit work locally from the edge node:
# command produces a job which completes successfully
receptorctl work submit ansible-runner --no-payload
When debug logging is enabled on the edge node, I see these entries regularly coming in:
DEBUG 2024/11/27 05:11:17 Sending service advertisement: &{edge_node_ip unix 2024-11-27 05:11:17.918769814 +0000 UTC m=+2405.032386098 2 map[type:Control Service] [{ansible-runner false} {local false}]}
DEBUG 2024/11/27 05:11:21 Received service advertisement from cloud_node_pod_name
DEBUG 2024/11/27 05:11:24 Received routing update Xr0bHzUe from cloud_node_pod_name via cloud_node_pod_name
DEBUG 2024/11/27 05:11:24 Sending routing update A2im9yqi. Connections: cloud_node_pod_name(1.00)
DEBUG 2024/11/27 05:11:34 Received routing update HXBCwpAw from cloud_node_pod_name via cloud_node_pod_name
I am able to connect to the edge node from the cloud node and run commands directly:
receptorctl --rootcas /etc/receptor/tls/ca/ connect --tls-client tlsclient {edge_node_ip} unix
# I can successfully run `work list` from here; verified by submitting local work units
# from the edge node and seeing them from the cloud node
And finally, while I’ve not installed NTP clients on either system, I have verified that the date and time reported by the two systems agree to within one second.
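For anyone wanting to repeat the clock comparison, here is a minimal sketch of how the skew could be computed. It assumes each node's time was captured with something like `date -u +%Y-%m-%dT%H:%M:%SZ`; the function name and the sample timestamps are hypothetical, not values from my systems.

```python
from datetime import datetime, timezone

# Assumed timestamp format: the output of `date -u +%Y-%m-%dT%H:%M:%SZ`.
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def clock_skew_seconds(ts_a: str, ts_b: str) -> float:
    """Return the absolute difference between two UTC timestamps, in seconds."""
    a = datetime.strptime(ts_a, TIMESTAMP_FORMAT).replace(tzinfo=timezone.utc)
    b = datetime.strptime(ts_b, TIMESTAMP_FORMAT).replace(tzinfo=timezone.utc)
    return abs((a - b).total_seconds())


# Hypothetical readings taken from the cloud node and the edge node.
cloud_reading = "2024-11-27T05:18:17Z"
edge_reading = "2024-11-27T05:18:18Z"
print(clock_skew_seconds(cloud_reading, edge_reading))  # prints 1.0
```

The two readings are taken at slightly different moments, so the result also includes however long it took to run the second command; it is only a rough upper bound on the real skew.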
So, everything looks to be connected, except…
The Problem
The instance doesn’t get recognized by AWX and I can’t submit work to it from my cloud node.
- AWX lists the instance as “unavailable”
- The cloud execution node logs errors like this as frequently as the remote execution node claims to receive routing updates:
2024-11-27 05:18:17,713 WARNING [hash] awx.main.tasks.system Execution node attempting to rejoin as instance edge_node_ip.
2024-11-27 05:18:37,856 INFO [hash] awx.main.tasks.system Failed to find capacity of new or lost execution node edge_node_ip, errors:
Receptor error from edge_node_ip, detail:
Work unit expired on Wed Nov 27 05:18:37
min_value in DecimalField should be Decimal type.
- And work units submitted to the edge node from the cloud node using receptorctl directly never get received by the remote node:
# The local equivalent of this command executed on the edge node directly
# will create a job that completes successfully. I've also tried this without TLS.
receptorctl work submit --tls-client tlsclient --node edge_node_ip ansible-runner --no-payload
# `receptorctl work list` output (this job never changes state, never gets seen by edge node):
{
    "Fr7dL57u": {
        "Detail": "Starting Worker",
        "ExtraData": {
            "Expiration": "2024-11-27T05:24:28.317660252Z",
            "LocalCancelled": false,
            "LocalReleased": false,
            "RemoteNode": "edge_node_id",
            "RemoteParams": {},
            "RemoteStarted": false,
            "RemoteUnitID": "",
            "RemoteWorkType": "ansible-runner",
            "SignWork": false,
            "TLSClient": "tlsclient"
        },
        "State": 0,
        "StateName": "Pending",
        "StdoutSize": 0,
        "WorkType": "remote"
    }
}
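To pick out units stuck in this state from a longer `work list`, a small filter like the following could help. This is only a sketch: it assumes the JSON shape shown above (a top-level object keyed by unit ID, with `WorkType`, `StateName`, and `ExtraData.RemoteStarted` fields), and `find_stalled_remote_units` is a name made up for illustration.

```python
import json


def find_stalled_remote_units(work_list_json: str) -> list[str]:
    """Return IDs of remote work units that are Pending and never started remotely."""
    units = json.loads(work_list_json)
    stalled = []
    for unit_id, unit in units.items():
        # ExtraData may be missing or null for non-remote units (assumption).
        extra = unit.get("ExtraData") or {}
        if (
            unit.get("WorkType") == "remote"
            and unit.get("StateName") == "Pending"
            and not extra.get("RemoteStarted", False)
        ):
            stalled.append(unit_id)
    return stalled


# Trimmed copy of the work unit shown above.
sample = """
{
    "Fr7dL57u": {
        "Detail": "Starting Worker",
        "ExtraData": {"RemoteNode": "edge_node_id", "RemoteStarted": false},
        "State": 0,
        "StateName": "Pending",
        "WorkType": "remote"
    }
}
"""
print(find_stalled_remote_units(sample))  # prints ['Fr7dL57u']
```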
I’m not seeing any error logs on the edge node, either.
Edge Node receptor.conf
- node:
    id: edge_node_ip
- log-level:
    level: debug
- control-service:
    service: unix
    filename: /tmp/receptor.sock
    tls: receptor-tls-server
# These are the AWX-generated certs
- tls-server:
    name: receptor-tls-server
    clientcas: /etc/receptor/certs/ca/mesh-CA.crt
    cert: /etc/receptor/certs/receptor.crt
    key: /etc/receptor/certs/receptor.key
    requireclientcert: true
# Names and relative paths kept identical to what is generated by AWX
- tls-client:
    name: receptor-tls-client
    rootcas: /etc/receptor/certs/ca/mesh-CA.crt
    insecureskipverify: false
    cert: /etc/receptor/certs/receptor.crt
    key: /etc/receptor/certs/receptor.key
- tcp-listener:
    bindaddr: 0.0.0.0
    port: 27199
    tls: receptor-tls-server
- work-command:
    worktype: ansible-runner
    command: ansible-runner
    params: worker
    allowruntimeparams: true
# Experimental...saw a `local` command must be advertised. Also tried `default`.
- work-command:
    worktype: local
    command: ansible-runner
    params: worker
    allowruntimeparams: true
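One cheap sanity check on a config like this is confirming that every `tls:` reference names a TLS block that is actually defined. The sketch below assumes the YAML has already been loaded into a list of single-key mappings (mirrored here as a Python literal, trimmed to the relevant entries); `undefined_tls_refs` is a hypothetical helper, and it only compares names, not certificates or paths.

```python
def undefined_tls_refs(config: list[dict]) -> list[tuple[str, str]]:
    """Return (section, tls_name) pairs whose `tls` value matches no defined TLS block."""
    # Collect the names declared by tls-server and tls-client sections.
    defined = set()
    for entry in config:
        for kind, body in entry.items():
            if kind in ("tls-server", "tls-client"):
                defined.add(body["name"])

    # Flag any other section that points at a name not declared above.
    missing = []
    for entry in config:
        for kind, body in entry.items():
            if kind in ("tls-server", "tls-client"):
                continue
            ref = body.get("tls")
            if ref is not None and ref not in defined:
                missing.append((kind, ref))
    return missing


# Trimmed mirror of the edge node config above.
edge_config = [
    {"node": {"id": "edge_node_ip"}},
    {"control-service": {"service": "unix", "tls": "receptor-tls-server"}},
    {"tls-server": {"name": "receptor-tls-server"}},
    {"tls-client": {"name": "receptor-tls-client"}},
    {"tcp-listener": {"bindaddr": "0.0.0.0", "port": 27199, "tls": "receptor-tls-server"}},
]
print(undefined_tls_refs(edge_config))  # prints []
```

An empty list only means the names line up within this one file; it says nothing about whether the TLS client name used on the cloud node is defined in that node's own config.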
Final Thoughts
I’m really at a loss here and I believe I’ve exhausted all documentation available to me, including a number of posts on these forums. I’d sincerely appreciate any help, even if that is pointing me at Red Hat support (validation that RH would be the logical next step is enough for me at this point…).
Thanks in advance, folks.