In AWX 23.0.0 we introduced the ability to add hop nodes to AWX deployed on a Kubernetes cluster
(shout out to the community members that contributed to that effort @tanganellilore @kurokobo @fosterseth @djyasin @thedoubl3j )
Currently there's a limitation in the implementation: the first hop/execution node must be reached through an outbound connection from the AWX receptor mesh deployed on the Kubernetes cluster.
This topic aims to discuss the design/implementation that would allow that first hop/execution node to peer directly into the AWX receptor mesh deployed on the Kubernetes cluster.
@AWX
kdelee
(Elijah)
August 30, 2023, 5:46pm
2
Would we consider having a new pod in the AWX deployment that is a receptor pod with one container, plus an associated Service mapped to the port the receptor process listens on? It would act like an internal hop node for the cluster, with a Service that points to that one pod.
Or is that extra steps, and we should just expose a new Service that points to the receptor port and selects the task pods instead of the web pods, like https://github.com/ansible/awx-operator/blob/ea5fb823f957557e6bc9976c023d7c9c691702e1/roles/installer/templates/networking/service.yaml.j2#L48 ?
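A minimal sketch of that second option, for illustration only. The Service name, the `awx` namespace, and the `awx-task` label value are assumptions, as is the idea that the receptor in the task pods has a `tcp-listener` on port 27199; none of that is confirmed in this thread.

```yaml
# Hypothetical Service that selects the task pods and exposes a receptor port.
apiVersion: v1
kind: Service
metadata:
  name: awx-receptor            # hypothetical name
  namespace: awx                # assumed namespace
spec:
  type: ClusterIP
  ports:
  - name: receptor
    protocol: TCP
    port: 27199                 # assumed receptor listener port
    targetPort: 27199
  selector:
    app.kubernetes.io/name: awx-task   # assumed label; check the labels on your task pods
```

With more than one task pod, a Service like this would balance inbound connections across the replicas, so an external node could not choose which control node it peers with.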
For a PoC, here's a rough YAML that stands up a hop node in OpenShift. After applying it, we can register the hop node in AWX and the control-plane EE will connect to it.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: awx-hop-node
  namespace: awx
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: awx-hop-node
  template:
    metadata:
      labels:
        app.kubernetes.io/name: awx-hop-node
    spec:
      containers:
      - args:
        - /bin/sh
        - -c
        - |
          hostname=awx-hop-node #hardcoded to deployment name
          receptor --cert-makereq bits=2048 commonname=$hostname dnsname=$hostname nodeid=$hostname outreq=/etc/receptor/tls/receptor.req outkey=/etc/receptor/tls/receptor.key
          receptor --cert-signreq req=/etc/receptor/tls/receptor.req cacert=/etc/receptor/tls/ca/mesh-CA.crt cakey=/etc/receptor/tls/ca/mesh-CA.key outcert=/etc/receptor/tls/receptor.crt verify=yes
          exec receptor --config /etc/receptor/receptor.conf
        image: quay.io/haoliu/awx-ee:v1.4.1
        imagePullPolicy: Always
        name: awx-hop-node
        resources:
          requests:
            cpu: 50m
            memory: 64M
        volumeMounts:
        - mountPath: /etc/receptor/receptor.conf
          name: awx-hop-node-config
          subPath: receptor.conf
        - mountPath: /etc/receptor/tls/ca/mesh-CA.crt
          name: awx-receptor-ca
          readOnly: true
          subPath: tls.crt
        - mountPath: /etc/receptor/tls/ca/mesh-CA.key
          name: awx-receptor-ca
          readOnly: true
          subPath: tls.key
        - mountPath: /etc/receptor/tls/
          name: awx-receptor-tls
      restartPolicy: Always
      schedulerName: default-scheduler
      serviceAccount: awx
      serviceAccountName: awx
      volumes:
      - name: awx-receptor-tls
      - name: awx-receptor-ca
        secret:
          defaultMode: 420
          secretName: awx-receptor-ca
      - configMap:
          defaultMode: 420
          items:
          - key: receptor_conf
            path: receptor.conf
          name: awx-hop-node-configmap
        name: awx-hop-node-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: awx-hop-node-configmap
  namespace: awx
data:
  receptor_conf: |
    ---
    - node:
        id: awx-hop-node
    - log-level: debug
    - local-only: null
    - tcp-listener:
        port: 27199
        tls: tlsserver
    - tls-server:
        cert: /etc/receptor/tls/receptor.crt
        key: /etc/receptor/tls/receptor.key
        name: tlsserver
        clientcas: /etc/receptor/tls/ca/mesh-CA.crt
        requireclientcert: true
        mintls13: false
---
apiVersion: v1
kind: Service
metadata:
  name: awx-hop-node
  namespace: awx
spec:
  type: LoadBalancer
  ports:
  - name: receptor
    protocol: TCP
    port: 27199
    targetPort: 27199
  selector:
    app.kubernetes.io/name: awx-hop-node
I will follow up with some challenges we discovered after doing this.
Based on the OpenShift documentation ("Configuring ingress cluster traffic using an Ingress Controller", OpenShift Container Platform 4.13):
An Ingress Controller is configured to accept external requests and proxy them based on the configured routes. This is limited to HTTP, HTTPS using SNI, and TLS using SNI, which is sufficient for web applications and services that work over TLS with SNI.
So it does not seem like a Route can be used to expose raw TCP traffic.
NodePort is one way to expose a port and allow external TCP connections to a Service, but it comes with a few specific requirements and drawbacks:
- Worker nodes must have a resolvable/reachable IP or hostname (in OpenShift deployed on AWS, worker nodes only have internal IP addresses by default).
- When the worker node that the external receptor connects to gets removed, we have to manually reconfigure the external receptor to use a new node.
- The internal port (that the control-plane EE connects to) and the external node port will NOT be the same. This causes problems for how we want to express this in the database and generate receptor config for other receptor nodes.
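To make the port mismatch concrete, here is a rough sketch of a NodePort Service in front of the hop node. The Service name and the `nodePort` value are made up for illustration; the selector and port 27199 come from the PoC manifest in this thread.

```yaml
# Hypothetical NodePort Service for the hop node.
apiVersion: v1
kind: Service
metadata:
  name: awx-hop-node-nodeport   # hypothetical name
  namespace: awx
spec:
  type: NodePort
  ports:
  - name: receptor
    protocol: TCP
    port: 27199        # port used inside the cluster (control-plane EE side)
    targetPort: 27199  # port the receptor container listens on
    nodePort: 30199    # illustrative; must fall in the NodePort range (30000-32767 by default)
  selector:
    app.kubernetes.io/name: awx-hop-node
```

In-cluster peers would dial `awx-hop-node-nodeport:27199`, while an external receptor would have to dial `<worker-node-address>:30199`, which is the two-port problem described above.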
kurokobo
(kurokobo)
September 1, 2023, 1:38am
6
Since Receptor supports WebSocket as its backend and major Ingress controllers also support WebSocket (with additional configurations), Ingress and Route may also be used for inbound connection if the backend can be WebSocket.
Of course I didn’t test anything yet
“WebSocket backend is highly untested” - @fosterseth
fosterseth
(Seth Foster)
September 1, 2023, 3:26pm
8
Yeah, the WebSocket idea is interesting.
I was able to use MetalLB + ingress-nginx to expose a receptor TCP service.
I was on kind, so I basically just followed the kind "LoadBalancer" guide and the Ingress-Nginx Controller guide "Exposing TCP and UDP services".
Still very untested. I was able to run
socat - TCP4:172.19.255.200:5433
from outside the cluster and see that connection attempts were made on the running receptor hop node inside the cluster.
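For readers who have not used that ingress-nginx feature, the TCP mapping is a ConfigMap whose keys are external ports and whose values are `<namespace>/<service>:<port>`. The sketch below is a guess at what such a mapping could look like for this setup: external port 5433 is taken from the `socat` command, while the ConfigMap name, the `ingress-nginx` namespace, and the `awx/awx-hop-node:27199` target are assumptions based on the PoC manifest, not the exact config that was used.

```yaml
# Hypothetical tcp-services ConfigMap for ingress-nginx.
apiVersion: v1
kind: ConfigMap
metadata:
  name: tcp-services          # must match the controller's --tcp-services-configmap flag
  namespace: ingress-nginx    # assumed controller namespace
data:
  # "<external port>": "<namespace>/<service name>:<service port>"
  "5433": "awx/awx-hop-node:27199"
```

The controller has to be started with `--tcp-services-configmap` pointing at this ConfigMap, and the controller's own Service also needs to expose port 5433 for the MetalLB address to accept connections on it.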
fosterseth
(Seth Foster)
September 2, 2023, 4:42am
9
[sbf@fedora awx]$ k get all
NAME                                READY   STATUS    RESTARTS        AGE
pod/awx-hop-node-66b644d5f7-xgll6   1/1     Running   0               8h
pod/awx-postgres-13-0               1/1     Running   1 (38h ago)     2d9h
pod/awx-task-65d6c69fdb-h5sb2       4/4     Running   0               82m
pod/awx-web-78d6849757-6wlwm        3/3     Running   0
awx-hop-node is deployed inside of the cluster.
ex1 is a remote execution node, running on my local machine outside of the cluster.
here is the full guide on what I did
Alright we made some progress with the WebSocket backend
here’s the configuration we have so far
apiVersion: apps/v1
kind: Deployment
metadata:
  name: awx-hop-node
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: awx-hop-node
  template:
    metadata:
      labels:
        app.kubernetes.io/name: awx-hop-node
    spec:
      containers:
      - args:
        - /bin/sh
        - -c
        - |
          internal_hostname=awx-hop-node #hardcoded to deployment name
          external_hostname=awx-hop-node-saas-dev.apps.controller-dev.testing.ansible.com #hardcoded to the route name
          receptor --cert-makereq bits=2048 commonname=$internal_hostname dnsname=$internal_hostname dnsname=$external_hostname nodeid=$internal_hostname outreq=/etc/receptor/tls/receptor.req outkey=/etc/receptor/tls/receptor.key
          receptor --cert-signreq req=/etc/receptor/tls/receptor.req cacert=/etc/receptor/tls/ca/mesh-CA.crt cakey=/etc/receptor/tls/ca/mesh-CA.key outcert=/etc/receptor/tls/receptor.crt verify=yes
          exec receptor --config /etc/receptor/receptor.conf
        image: quay.io/haoliu/awx-ee:v1.4.1
        imagePullPolicy: Always
        name: awx-hop-node
        resources:
          requests:
            cpu: 50m
            memory: 64M
        volumeMounts:
        - mountPath: /etc/receptor/receptor.conf
          name: awx-hop-node-config
          subPath: receptor.conf
        - mountPath: /etc/receptor/tls/ca/mesh-CA.crt
          name: awx-receptor-ca
          readOnly: true
          subPath: tls.crt
        - mountPath: /etc/receptor/tls/ca/mesh-CA.key
          name: awx-receptor-ca
          readOnly: true
          subPath: tls.key
        - mountPath: /etc/receptor/tls/
          name: awx-receptor-tls
      restartPolicy: Always
      schedulerName: default-scheduler
      serviceAccount: awx
      serviceAccountName: awx
      volumes:
      - name: awx-receptor-tls
      - name: awx-receptor-ca
        secret:
          defaultMode: 420
          secretName: awx-receptor-ca
      - configMap:
          defaultMode: 420
          items:
          - key: receptor_conf
            path: receptor.conf
          name: awx-hop-node-configmap
        name: awx-hop-node-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: awx-hop-node-configmap
data:
  receptor_conf: |
    ---
    - node:
        id: awx-hop-node
    - log-level: debug
    - ws-listener:
        port: 27198
        tls: tlsserver
    - tcp-listener:
        port: 27199
        tls: tlsserver
    - tls-server:
        cert: /etc/receptor/tls/receptor.crt
        key: /etc/receptor/tls/receptor.key
        name: tlsserver
        clientcas: /etc/receptor/tls/ca/mesh-CA.crt
        requireclientcert: true
        mintls13: false
---
apiVersion: v1
kind: Service
metadata:
  name: awx-hop-node
spec:
  type: ClusterIP
  ports:
  - name: tcp
    port: 27199
    targetPort: 27199
  - name: ws
    port: 27198
    targetPort: 27198
  selector:
    app.kubernetes.io/name: awx-hop-node
---
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  annotations:
    openshift.io/host.generated: "true"
  name: awx-hop-node
  namespace: saas-dev
spec:
  host: awx-hop-node-saas-dev.apps.controller-dev.testing.ansible.com
  port:
    targetPort: ws
  tls:
    insecureEdgeTerminationPolicy: None
    termination: passthrough
  to:
    kind: Service
    name: awx-hop-node
    weight: 100
  wildcardPolicy: None
fosterseth
(Seth Foster)
September 5, 2023, 5:51pm
11
Modification to the remote execution node's receptor.conf, after running the install bundle:
- ws-peer:
    address: wss://awx-hop-node-saas-dev.apps.controller-dev.testing.ansible.com/
    redial: true
    tls: tls_client
Then restart the receptor service.
The internal connection from the control-plane EE to the hop node is made via a TCP connection.
The external connection through the Route is made via a WebSocket connection.
Currently we are hardcoding the hostnames for the hop node's receptor certificate.
I think when we get around to implementing this, we can create the Service and Route first; once the Route has a hostname, we can create the Deployment.
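That ordering could be sketched as Ansible tasks, since the operator is Ansible-based. This is only an illustration of the sequence, not the operator's actual role: the template file names and the `hop_namespace` and `external_hostname` variables are hypothetical, and it assumes the `kubernetes.core` collection is available.

```yaml
# Hypothetical task sequence: Service and Route first, Deployment last.
- name: Create the hop node Service and Route
  kubernetes.core.k8s:
    state: present
    namespace: "{{ hop_namespace }}"        # hypothetical variable
    template: "{{ item }}"
  loop:
    - hop-node-service.yaml.j2              # hypothetical template
    - hop-node-route.yaml.j2                # hypothetical template

- name: Wait until the Route has been assigned a hostname
  kubernetes.core.k8s_info:
    api_version: route.openshift.io/v1
    kind: Route
    name: awx-hop-node
    namespace: "{{ hop_namespace }}"
  register: hop_route
  until: >-
    hop_route.resources | length > 0 and
    (hop_route.resources[0].spec.host | default('')) | length > 0
  retries: 30
  delay: 2

- name: Create the Deployment, passing the Route hostname to the template
  kubernetes.core.k8s:
    state: present
    namespace: "{{ hop_namespace }}"
    template: hop-node-deployment.yaml.j2   # hypothetical template
  vars:
    # The template would use this as the extra dnsname= in --cert-makereq
    external_hostname: "{{ hop_route.resources[0].spec.host }}"
```

The point of the ordering is that the certificate request needs the external hostname as a `dnsname`, and that hostname is only known after the Route exists.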
The next step for us is to design the API/model change for Instance and also the CRD change for AWX.
Our investigation right now is very much focused on OpenShift, so I would love the community's help in making the final implementation more generally applicable.
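As one possible starting point outside OpenShift, a passthrough Route could in principle be approximated with an Ingress on ingress-nginx. The following is an untested sketch: the hostname is a placeholder, it assumes the `awx-hop-node` Service with its `ws` port from the configuration in this thread, and it assumes the controller was started with `--enable-ssl-passthrough`.

```yaml
# Hypothetical ingress-nginx equivalent of the passthrough Route (untested).
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: awx-hop-node
  annotations:
    # Hand the TLS stream to receptor untouched so mutual TLS still works
    nginx.ingress.kubernetes.io/ssl-passthrough: "true"
spec:
  ingressClassName: nginx
  rules:
  - host: awx-hop-node.example.com   # placeholder hostname
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: awx-hop-node
            port:
              name: ws               # the WebSocket listener port (27198)
```

Passthrough matters here because the receptor `tls-server` requires a client certificate, which an edge-terminating proxy would not forward to the backend as part of the TLS handshake.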
fosterseth
(Seth Foster)
September 13, 2023, 2:09am
15
I turned the awx-hop-node into a replica set and scaled it up to 2.
The control nodes connect to the cluster hop nodes via the awx-hop-node service.
I see an issue though: the mesh can become disjoint, as the nodes are not fully connected.
For example, the remote execution node is successfully connected to hop-1, but hop-1 is not currently connected to the control nodes.
fosterseth
(Seth Foster)
September 13, 2023, 2:38pm
16
If we scale up the cluster hop node stateful set, we can make sure hop-{1…N-1} is connected to hop-0 via tcp-peer. So if hop-2 came online, it would also connect to hop-0.
In this way, we ensure there is a valid route from any control node to any remote execution node.
Add this to the receptor conf config map:
- tcp-peer:
    address: awx-hop-node-0.headless:27199
    redial: true
    tls: tlsclient
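The `awx-hop-node-0.headless` address implies a headless Service named `headless` governing the stateful set. A minimal sketch of such a Service is below; it is inferred from the address rather than copied from the actual setup, and it assumes the stateful set sets `serviceName: headless` and runs in the same namespace as the pods that dial it.

```yaml
# Hypothetical headless Service giving each hop pod a stable DNS name.
apiVersion: v1
kind: Service
metadata:
  name: headless
spec:
  clusterIP: None        # headless: DNS resolves to individual pods
  ports:
  - name: tcp
    port: 27199
    targetPort: 27199
  selector:
    app.kubernetes.io/name: awx-hop-node
```

With this in place, pod `awx-hop-node-0` is resolvable as `awx-hop-node-0.headless` from within the namespace, which is what the `tcp-peer` entry relies on.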
There are other ways to go about this. For example, we can have each control node just directly peer to each cluster hop node.
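A rough sketch of that alternative, as it might appear in a control node's receptor config with two hop replicas. The per-pod addresses assume the same `headless` Service naming as the `tcp-peer` entry in this post, and `tlsclient` is assumed to be the name of a `tls-client` entry already defined in that config.

```yaml
# Hypothetical control node config: peer directly with every hop replica.
- tcp-peer:
    address: awx-hop-node-0.headless:27199
    redial: true
    tls: tlsclient
- tcp-peer:
    address: awx-hop-node-1.headless:27199
    redial: true
    tls: tlsclient
```

The trade-off is that this list has to be regenerated on every control node whenever the hop node stateful set is scaled.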
I feel like this creates an unnecessary bottleneck at hop-0.
Imagine if hop-0 goes down for whatever reason: then all traffic and connections to the external mesh would be severed.
Am I misunderstanding the design here?
Let's expand this diagram a bit and add the Service and Route to it.
gwmngilfen
(Greg Sutcliffe)
September 13, 2023, 3:46pm
19
Would it help if I enabled the GraphViz plugin … ?