Ability to allow inbound connections to the AWX receptor mesh on Kubernetes

In AWX 23.0.0 we introduced the ability to add hop nodes to AWX deployed on a Kubernetes cluster
(shout out to the community members who contributed to that effort: @tanganellilore @kurokobo @fosterseth @djyasin @thedoubl3j)

Currently there’s a limitation in the implementation: the connection to the first hop/execution node must be an outbound connection from the AWX receptor mesh deployed on the Kubernetes cluster.

This topic aims to discuss the design/implementation that would allow that first hop/execution node to peer directly into the AWX receptor mesh deployed on the Kubernetes cluster.

@AWX

Would we consider having a new pod in the AWX deployment that is a receptor pod with 1 container, with a Service associated with it that maps to the receptor process listening there? It would be like an internal hop node in the cluster, with a Service that points to that one pod.

Or is that extra steps, and could we just expose a new Service pointing to the receptor port and selecting the task pods instead of the web pods, like https://github.com/ansible/awx-operator/blob/ea5fb823f957557e6bc9976c023d7c9c691702e1/roles/installer/templates/networking/service.yaml.j2#L48 ?
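
A minimal, untested sketch of that second option. It assumes the task pods carry the label `app.kubernetes.io/name: awx-task` and that receptor in the task pod listens on TCP 27199. Both are assumptions, as are the Service name and the namespace.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: awx-receptor          # hypothetical name
  namespace: awx              # assumed namespace
spec:
  type: ClusterIP
  ports:
  - name: receptor
    protocol: TCP
    port: 27199               # assumed receptor listener port
    targetPort: 27199
  selector:
    app.kubernetes.io/name: awx-task   # assumed task pod label; check the operator templates
```

With more than one task pod, a Service like this would spread inbound peers across all of them, which may or may not be what we want.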

apiVersion: apps/v1
kind: Deployment
metadata:
  name: awx-hop-node
  namespace: awx
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: awx-hop-node
  template:
    metadata:
      labels:
        app.kubernetes.io/name: awx-hop-node
    spec:
      containers:
      - args:
        - /bin/sh
        - -c
        - |
          hostname=awx-hop-node #hardcoded to deployment name
          receptor --cert-makereq bits=2048 commonname=$hostname dnsname=$hostname nodeid=$hostname outreq=/etc/receptor/tls/receptor.req outkey=/etc/receptor/tls/receptor.key
          receptor --cert-signreq req=/etc/receptor/tls/receptor.req cacert=/etc/receptor/tls/ca/mesh-CA.crt cakey=/etc/receptor/tls/ca/mesh-CA.key outcert=/etc/receptor/tls/receptor.crt verify=yes
          exec receptor --config /etc/receptor/receptor.conf
        image: quay.io/haoliu/awx-ee:v1.4.1
        imagePullPolicy: Always
        name: awx-hop-node
        resources:
          requests:
            cpu: 50m
            memory: 64M
        volumeMounts:
        - mountPath: /etc/receptor/receptor.conf
          name: awx-hop-node-config
          subPath: receptor.conf
        - mountPath: /etc/receptor/tls/ca/mesh-CA.crt
          name: awx-receptor-ca
          readOnly: true
          subPath: tls.crt
        - mountPath: /etc/receptor/tls/ca/mesh-CA.key
          name: awx-receptor-ca
          readOnly: true
          subPath: tls.key
        - mountPath: /etc/receptor/tls/
          name: awx-receptor-tls
      restartPolicy: Always
      schedulerName: default-scheduler
      serviceAccount: awx
      serviceAccountName: awx
      volumes:
      - name: awx-receptor-tls
        emptyDir: {} # scratch space for the generated key, request, and certificate
      - name: awx-receptor-ca
        secret:
          defaultMode: 420
          secretName: awx-receptor-ca
      - configMap:
          defaultMode: 420
          items:
          - key: receptor_conf
            path: receptor.conf
          name: awx-hop-node-configmap
        name: awx-hop-node-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: awx-hop-node-configmap
  namespace: awx
data:
  receptor_conf: |
    ---
    - node: 
        id: awx-hop-node
    - log-level: debug
    - local-only: null
    - tcp-listener:
        port: 27199
        tls: tlsserver
    - tls-server:
        cert: /etc/receptor/tls/receptor.crt
        key: /etc/receptor/tls/receptor.key
        name: tlsserver
        clientcas: /etc/receptor/tls/ca/mesh-CA.crt
        requireclientcert: true
        mintls13: false
---
apiVersion: v1
kind: Service
metadata:
  name: awx-hop-node
  namespace: awx
spec:
  type: LoadBalancer
  ports:
  - name: receptor
    protocol: TCP
    port: 27199
    targetPort: 27199
  selector:
    app.kubernetes.io/name: awx-hop-node

For the PoC, here’s a rough YAML that will stand up a hop node in OpenShift. After this we can register it in AWX, and the control-plane EE will connect to it.

I will follow up with some challenges we discovered after doing this

Based on the OpenShift documentation (Configuring ingress cluster traffic using an Ingress Controller, OpenShift Container Platform 4.13):

An Ingress Controller is configured to accept external requests and proxy them based on the configured routes. This is limited to HTTP, HTTPS using SNI, and TLS using SNI, which is sufficient for web applications and services that work over TLS with SNI.

so it does not seem like a Route can be used to expose plain TCP traffic

NodePort is one way to expose a port and allow external TCP connections to a service, but it has a few specific requirements:

  • worker nodes must have a resolvable/reachable IP/hostname (in OpenShift deployed on AWS, worker nodes only have internal IP addresses by default)
  • when the worker node that the external receptor connects to gets removed, we have to manually reconfigure the external receptor to use a new node
  • the internal port (that controlplane-ee connects to) and the external node port will NOT be the same (this causes problems for how we want to express this in the database and generate receptor config for other receptor nodes)
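
To make the port mismatch concrete, here is a rough sketch of a NodePort Service in front of the hop node Deployment shown earlier. The Service name and the `nodePort` value are made up for illustration.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: awx-hop-node-nodeport   # hypothetical name
  namespace: awx
spec:
  type: NodePort
  ports:
  - name: receptor
    protocol: TCP
    port: 27199        # in-cluster port that controlplane-ee would use
    targetPort: 27199
    nodePort: 30199    # external port; must be in the cluster's NodePort range (30000-32767 by default)
  selector:
    app.kubernetes.io/name: awx-hop-node
```

An external receptor would have to peer to `<worker node address>:30199`, while nodes inside the cluster keep using port 27199.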

Since Receptor supports WebSocket as a backend and major Ingress controllers also support WebSocket (with additional configuration), Ingress and Route could also be used for inbound connections if the backend can be WebSocket.
Of course I haven’t tested anything yet :laughing:

“WebSocket backend is highly untested” - @fosterseth

Yeah, the websocket idea is interesting.

I was able to use MetalLB + ingress-nginx to expose a receptor TCP service.

I was on kind, so I basically just followed the kind "LoadBalancer" guide and the Ingress-Nginx Controller "Exposing TCP and UDP services" guide.

Still very untested. I was able to run

socat - TCP4:172.19.255.200:5433

from outside the cluster, and see that connection attempts were made on the running receptor hop node inside the cluster.
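
For reference, the ingress-nginx side of a setup like this is usually a `tcp-services` ConfigMap along these lines. This is only a sketch: the `awx` namespace and the `awx-hop-node` Service on port 27199 are assumed from the earlier YAML, and the external port 5433 is inferred from the socat command above.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: tcp-services
  namespace: ingress-nginx
data:
  # "<external port>": "<namespace>/<service name>:<service port>"
  "5433": "awx/awx-hop-node:27199"
```

The controller has to be started with `--tcp-services-configmap` pointing at this ConfigMap, and the controller's own Service also needs to expose port 5433.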

[sbf@fedora awx]$ k get all
NAME                                READY   STATUS    RESTARTS      AGE
pod/awx-hop-node-66b644d5f7-xgll6   1/1     Running   0             8h
pod/awx-postgres-13-0               1/1     Running   1 (38h ago)   2d9h
pod/awx-task-65d6c69fdb-h5sb2       4/4     Running   0             82m
pod/awx-web-78d6849757-6wlwm        3/3     Running   0       

awx-hop-node is deployed inside of the cluster.

ex1 is a remote execution node, running on my local machine outside of the cluster.

here is the full guide on what I did

Alright, we made some progress with the WebSocket backend.

Here’s the configuration we have so far:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: awx-hop-node
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: awx-hop-node
  template:
    metadata:
      labels:
        app.kubernetes.io/name: awx-hop-node
    spec:
      containers:
      - args:
        - /bin/sh
        - -c
        - |
          internal_hostname=awx-hop-node #hardcoded to deployment name
          external_hostname=awx-hop-node-saas-dev.apps.controller-dev.testing.ansible.com #hardcoded to the route name
          receptor --cert-makereq bits=2048 commonname=$internal_hostname dnsname=$internal_hostname dnsname=$external_hostname nodeid=$internal_hostname outreq=/etc/receptor/tls/receptor.req outkey=/etc/receptor/tls/receptor.key
          receptor --cert-signreq req=/etc/receptor/tls/receptor.req cacert=/etc/receptor/tls/ca/mesh-CA.crt cakey=/etc/receptor/tls/ca/mesh-CA.key outcert=/etc/receptor/tls/receptor.crt verify=yes
          exec receptor --config /etc/receptor/receptor.conf
        image: quay.io/haoliu/awx-ee:v1.4.1
        imagePullPolicy: Always
        name: awx-hop-node
        resources:
          requests:
            cpu: 50m
            memory: 64M
        volumeMounts:
        - mountPath: /etc/receptor/receptor.conf
          name: awx-hop-node-config
          subPath: receptor.conf
        - mountPath: /etc/receptor/tls/ca/mesh-CA.crt
          name: awx-receptor-ca
          readOnly: true
          subPath: tls.crt
        - mountPath: /etc/receptor/tls/ca/mesh-CA.key
          name: awx-receptor-ca
          readOnly: true
          subPath: tls.key
        - mountPath: /etc/receptor/tls/
          name: awx-receptor-tls
      restartPolicy: Always
      schedulerName: default-scheduler
      serviceAccount: awx
      serviceAccountName: awx
      volumes:
      - name: awx-receptor-tls
        emptyDir: {} # scratch space for the generated key, request, and certificate
      - name: awx-receptor-ca
        secret:
          defaultMode: 420
          secretName: awx-receptor-ca
      - configMap:
          defaultMode: 420
          items:
          - key: receptor_conf
            path: receptor.conf
          name: awx-hop-node-configmap
        name: awx-hop-node-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: awx-hop-node-configmap
data:
  receptor_conf: |
    ---
    - node: 
        id: awx-hop-node
    - log-level: debug
    - ws-listener:
        port: 27198
        tls: tlsserver
    - tcp-listener:
        port: 27199
        tls: tlsserver
    - tls-server:
        cert: /etc/receptor/tls/receptor.crt
        key: /etc/receptor/tls/receptor.key
        name: tlsserver
        clientcas: /etc/receptor/tls/ca/mesh-CA.crt
        requireclientcert: true
        mintls13: false
---
apiVersion: v1
kind: Service
metadata:
  name: awx-hop-node
spec:
  type: ClusterIP
  ports:
  - name: tcp
    port: 27199
    targetPort: 27199
  - name: ws
    port: 27198
    targetPort: 27198
  selector:
    app.kubernetes.io/name: awx-hop-node
---
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  annotations:
    openshift.io/host.generated: "true"
  name: awx-hop-node
  namespace: saas-dev
spec:
  host: awx-hop-node-saas-dev.apps.controller-dev.testing.ansible.com
  port:
    targetPort: ws
  tls:
    insecureEdgeTerminationPolicy: None
    termination: passthrough
  to:
    kind: Service
    name: awx-hop-node
    weight: 100
  wildcardPolicy: None

Modification to the remote execution node's receptor.conf, after running the install bundle:

- ws-peer:
    address: wss://awx-hop-node-saas-dev.apps.controller-dev.testing.ansible.com/
    redial: true
    tls: tls_client

then restart the receptor service
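
The `tls: tls_client` value in the ws-peer entry refers to a `tls-client` entry with that name in the same receptor.conf. The install bundle should already provide one; if it is missing, it would look roughly like the sketch below. The file paths are assumptions and need to match wherever the bundle placed the node certificate, key, and mesh CA.

```yaml
- tls-client:
    name: tls_client                           # must match the `tls:` value in ws-peer
    cert: /etc/receptor/tls/receptor.crt       # assumed path
    key: /etc/receptor/tls/receptor.key        # assumed path
    rootcas: /etc/receptor/tls/ca/mesh-CA.crt  # assumed path to the mesh CA certificate
```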

The internal connection from controlplane-ee to the hop node is made via TCP.
The external connection through the Route is made via WebSocket.

currently we are hardcoding the hostnames for the hop node receptor certificate

I think when we get around to implementing this, we can create the Service and Route first; once the Route has a hostname, we can create the Deployment
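
A rough sketch of what "Route first" could look like: leave `spec.host` out so that OpenShift generates the hostname, then read it back from the created Route and use it as the extra `dnsname` when generating the hop node certificate. Names are taken from the configuration above; nothing here is tested.

```yaml
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: awx-hop-node
spec:
  # spec.host intentionally omitted so that OpenShift assigns a hostname
  port:
    targetPort: ws
  tls:
    termination: passthrough
  to:
    kind: Service
    name: awx-hop-node
```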

The next step for us is to design the API/model change for Instance and also the CRD change for AWX

Our investigation right now is very much focused on OpenShift, so I would love the community to help us get the final implementation more generally applicable

I turned the awx-hop-node into a replica set, and scaled up to 2

The control nodes connect to the cluster hop nodes via the awx-hop-node service

I see an issue though: the mesh can become disjoint, as the nodes are not fully connected.

For example, the remote execution node is successfully connected to hop-1, but hop-1 is not currently connected to the control nodes.

If we scale up the cluster hop node StatefulSet, we can make sure each of hop-{1…N-1} is connected to hop-0 via tcp-peer. So if hop-2 came online, it would also connect to hop-0.

In this way, we ensure there is a valid route from any control node to any remote execution node

Add this to the receptor conf ConfigMap:

- tcp-peer:
    address: awx-hop-node-0.headless:27199
    redial: true
    tls: tlsclient

There are other ways to go about this. For example, we can have each control node just directly peer to each cluster hop node.
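
The `awx-hop-node-0.headless` address in the tcp-peer entry relies on a headless Service named `headless` that the StatefulSet references through `serviceName: headless`. A minimal sketch, with the name inferred from that address and the selector taken from the earlier Deployment:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: headless               # name implied by awx-hop-node-0.headless
spec:
  clusterIP: None              # headless: each StatefulSet pod gets its own DNS record
  ports:
  - name: tcp
    port: 27199
  selector:
    app.kubernetes.io/name: awx-hop-node
```

The `tls: tlsclient` value also needs a matching `tls-client` entry named `tlsclient` in the hop node receptor config; the ConfigMap shown earlier only defines a `tls-server`.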

I feel like this creates an unnecessary bottleneck at hop-0

Imagine if hop-0 goes down for whatever reason: then all traffic and connections to the external mesh would be severed

Am I misunderstanding the design here?

Let's expand this diagram a bit and add the Service and Route to it

Would it help if I enabled the GraphViz plugin … ? :slight_smile:

Does Discourse support Mermaid? (mermaid-js/mermaid on GitHub: generation of diagrams like flowcharts or sequence diagrams from text, in a similar manner to Markdown)
