Ability to allow inbound connection to AWX receptor mesh on Kubernetes

TheRealHaoLiu · September 14, 2023, 8:17pm

gwmngilfen · September 14, 2023, 9:47pm

There is a mermaid plugin but I’m not sure if we get it on our Enterprise plan

TheRealHaoLiu · September 15, 2023, 2:06am

Hummm not sure if I can have svc selecting unique pod by name? If not, it doesn’t matter: we can have 2 Deployments and it would be the same concept.

More importantly. I do question even if we have multiple internal hop node exposed separately does it even provide redundancy?

For example, say I have 2 internal hop nodes exposed via 2 separate svc/route pairs.

I have 7 external execution nodes connecting to both internal hop nodes. If they are all configured the same, all 7 will favor the first hop node.

If the hop node that the external execution nodes are using dies, the connections die (yes, receptor will reroute to the other hop node after a certain point). But even without the second hop node, kube will heal and bring the hop node back up, and we will be back to normal?

@kurokobo @jlanda @fosterseth @tanganellilore what do you think? Is it worth having multiple internal hop nodes at all?

kurokobo · September 15, 2023, 12:25pm

Hummm not sure if I can have svc selecting unique pod by name?

For statefulset, we can use following selector. Refer to: StatefulSets | Kubernetes

  selector:
    statefulset.kubernetes.io/pod-name: hop-1
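As a rough illustration of the selector above, the following Python sketch builds one Service manifest per StatefulSet pod. The helper names (service_for_pod, services_for_statefulset), the StatefulSet name hop, and the port 27199 are assumptions for illustration only; nothing here is confirmed AWX Operator behavior.

```python
import json

# Label that Kubernetes sets on every pod created by a StatefulSet.
POD_NAME_LABEL = "statefulset.kubernetes.io/pod-name"


def service_for_pod(pod_name: str, port: int = 27199) -> dict:
    """Build a Service manifest that selects exactly one StatefulSet pod."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": pod_name},
        "spec": {
            "selector": {POD_NAME_LABEL: pod_name},
            "ports": [{"name": "receptor", "port": port, "targetPort": port}],
        },
    }


def services_for_statefulset(name: str, replicas: int) -> list[dict]:
    """One Service per pod; pods are named <statefulset-name>-<ordinal> from 0."""
    return [service_for_pod(f"{name}-{i}") for i in range(replicas)]


if __name__ == "__main__":
    print(json.dumps(services_for_statefulset("hop", 2), indent=2))
```

Because the label value is the pod name, each Service can only ever match a single pod, regardless of how many replicas the StatefulSet has.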

I do question even if we have multiple internal hop node exposed separately does it even provide redundancy?

In my understanding, multiple internal hop nodes can reduce downtime, especially for node failure. By default, Kubernetes waits five minutes or more to restart pods after a node failure. Also, especially since a StatefulSet guarantees the number of pods, the pods on the failed node may not be re-created even after five minutes or more unless they are forcibly deleted.

Therefore, multiple internal hop nodes can reduce downtime of the automation mesh that is caused by node failure.
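For context, the roughly five-minute delay comes from the default tolerations that Kubernetes adds to pods for the node.kubernetes.io/not-ready and node.kubernetes.io/unreachable taints (tolerationSeconds: 300). A minimal sketch, assuming one wanted a hop pod to be evicted from a failed node sooner; the helper name and the 30-second value are illustrative assumptions, not a recommendation made in this thread.

```python
def fast_failover_tolerations(seconds: int = 30) -> list[dict]:
    """Tolerations that shorten how long a pod stays bound to a failed node."""
    if seconds < 0:
        raise ValueError("seconds must be non-negative")
    keys = ("node.kubernetes.io/not-ready", "node.kubernetes.io/unreachable")
    return [
        {
            "key": key,
            "operator": "Exists",
            "effect": "NoExecute",
            "tolerationSeconds": seconds,
        }
        for key in keys
    ]


# Fragment that would be merged into a pod template's spec.
pod_spec_fragment = {"tolerations": fast_failover_tolerations(30)}
```

This only changes when the pod is evicted. It does not change the StatefulSet behavior described above, where the replacement pod is not created until the old one is confirmed gone.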

In my opinion, it would be simplest to define a new CRD (Hop? MeshProxy?) in AWX Operator that would allow a set of Deployment (replicas fixed at 1), Service, and Ingress/Route to be deployed. If users need more than one internal hop, simply deploy multiple CRs.

TheRealHaoLiu · September 15, 2023, 8:09pm

For statefulset, we can use following selector. Refer to: StatefulSets | Kubernetes

  selector:
    statefulset.kubernetes.io/pod-name: hop-1

TIL, thanks @kurokobo!

On to the next problem with StatefulSet… since each of the svc/route will have unique hostnames and that’s information we need to provide to the pod for generating certificates, I don’t see how we can achieve that with a single StatefulSet.

The original “rough” thought was to create the svc and route first, get the information, then create the Deployment and pass both the internal and external hostnames in as env vars at creation time.

In my understanding, multiple internal hop nodes can reduce downtime, especially for node failure. By default, Kubernetes waits five minutes or more to restart pods after a node failure. Also, especially since a StatefulSet guarantees the number of pods, the pods on the failed node may not be re-created even after five minutes or more unless they are forcibly deleted.

VERY good point!

In my opinion, it would be simplest to define a new CRD (Hop? MeshProxy?) in AWX Operator that would allow a set of Deployment (replicas fixed at 1), Service, and Ingress/Route to be deployed. If users need more than one internal hop, simply deploy multiple CRs.

I agree with defining a new CRD, the current CRD is getting very crowded

@shanemcd @rooftopcellist thoughts?

kurokobo · September 16, 2023, 12:57am

As far as I know, it is difficult without an Operator to automatically deploy Service and Ingress depending on the number of replicas in StatefulSet.

By the way, do you have any special reason to use StatefulSet instead of Deployment?

My rough thought (not tested):

- AWX Operator handles a new CRD (a set of Deployment, Service, and Ingress)
  - Internal and external hostnames can be defined through CR
- User can specify both internal and external hostnames via AWX UI if Pod is specified as Hop Type (this requires UI enhancement)
- AWX generates manifest files containing certificates, keys, CRs, and kustomization.yaml as an installation bundle
  - User can deploy internal hop node by applying the manifests
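To make the first item more concrete, here is a minimal sketch of how one CR could expand into a Deployment (replicas fixed at 1), a Service, and an Ingress. Everything in it is hypothetical: the HopSpec fields, the env var names, the image reference, and the port are assumptions for illustration and do not reflect an actual AWX Operator CRD.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HopSpec:
    """Hypothetical CR spec; every field name here is illustrative."""
    name: str
    internal_hostname: str  # reused as the in-cluster Service name
    external_hostname: str  # host that the Ingress exposes
    image: str
    port: int = 27199


def render(spec: HopSpec) -> list[dict]:
    """Expand one spec into Deployment, Service, and Ingress manifests."""
    labels = {"app": spec.name}
    container = {
        "name": "receptor",
        "image": spec.image,
        "ports": [{"containerPort": spec.port}],
        # Both hostnames are passed in so the pod can generate certificates.
        "env": [
            {"name": "INTERNAL_HOSTNAME", "value": spec.internal_hostname},
            {"name": "EXTERNAL_HOSTNAME", "value": spec.external_hostname},
        ],
    }
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": spec.name},
        "spec": {
            "replicas": 1,  # fixed: one CR describes exactly one hop node
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [container]},
            },
        },
    }
    service = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": spec.internal_hostname},
        "spec": {
            "selector": labels,
            "ports": [{"port": spec.port, "targetPort": spec.port}],
        },
    }
    backend = {"service": {"name": spec.internal_hostname, "port": {"number": spec.port}}}
    ingress = {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        "metadata": {"name": spec.name},
        "spec": {"rules": [{
            "host": spec.external_hostname,
            "http": {"paths": [{"path": "/", "pathType": "Prefix", "backend": backend}]},
        }]},
    }
    return [deployment, service, ingress]


# Placeholder values; a second hop node would simply be a second HopSpec.
manifests = render(
    HopSpec("hop-a", "hop-a", "hop-a.example.com", "registry.example.com/receptor:latest")
)
```

Since each spec carries its own hostnames, deploying several CRs gives several independent hop nodes without any per-replica bookkeeping.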

gwmngilfen · September 18, 2023, 9:47am

OK, we should have Mermaid support:

graph LR;
  A[Old thing] -- Migration --> B[New thing!];

Note that Mermaid diagrams will not be sent in emails, so keep that in mind please!

TheRealHaoLiu · September 18, 2023, 3:05pm

there’s no specific reason why we were thinking StatefulSet

I agree with your design (although we are probably not going to do the UI bits in the first go-around)

TheRealHaoLiu · September 18, 2023, 3:27pm

naming is the hardest thing in software engineering…

I’m thinking something like AWXMeshIngress. Does anyone have any good ideas?

kurokobo · September 18, 2023, 3:53pm

How about AWXMeshGateway

rooftopcellist · September 18, 2023, 8:05pm

“Gateway” is going to conflict with the naming of other components in flight in the space, and would probably just cause confusion.

We talked about it more in a meeting just now; AWXHopNodes and AWXMeshProxy were both suggested, but we settled on AWXMeshIngress for the time being. Please voice any concerns with that name here. The decision is not final yet, but that will be our placeholder as we start to work on this.

I am a fan of having this be a separate CRD for the AWX Operator, I think that makes a lot of sense and will keep us from growing the “AWX” CRD more than we have to.

TheRealHaoLiu · September 20, 2023, 1:22am

Congratulations on your first post on the forum

fosterseth · September 29, 2023, 7:47pm

We are considering this API change on the api/v2/instance endpoint:

{
   "id": 33,
   "receptor_node_id": "ex1",
   "receptor_addresses": [
       {
           "id": 23,
           "address": "awx-hop-node",
           "port": 27199,
           "protocol": "tcp",
           "internal": true
       },
       {
           "id": 24,
           "address": "awx-hop-node.route",
           "protocol": "ws",
           "path": "/path",
           "internal": false
       }
   ]
}

- hostname becomes receptor_node_id, and no longer needs to be a DNS-resolvable address.
- Each instance can have a list of receptor_addresses. If the instance does not have a receptor listener, this list is empty.
- Each receptor_address entry includes the data needed to form a proper address that can route traffic on the network to a receptor service for that instance.
- An address with internal set to true is only resolvable from inside the Kubernetes cluster. For example, control nodes would use these addresses to connect to the ingress (internal) hop nodes.
- This representation also allows for the full suite of receptor backends: tcp, udp, and websockets.
- When adding remote nodes in the API, users need to select which receptor address to use when peering to that node.
- peers currently is a list of hostnames. With this change, peers would be a list of receptor_addresses primary keys. E.g. peers = [23, 25]
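As a rough sketch of how a client might consume this shape, the Python below joins each receptor_addresses entry into a single string and resolves a peers list of primary keys back to addresses. The helper names and the protocol://address:port/path string format are assumptions for illustration; they are not the actual AWX implementation or receptor’s configuration syntax.

```python
# The instance payload from the proposal above, as a Python dict.
INSTANCE = {
    "id": 33,
    "receptor_node_id": "ex1",
    "receptor_addresses": [
        {"id": 23, "address": "awx-hop-node", "port": 27199,
         "protocol": "tcp", "internal": True},
        {"id": 24, "address": "awx-hop-node.route",
         "protocol": "ws", "path": "/path", "internal": False},
    ],
}


def format_address(entry: dict) -> str:
    """Join one receptor_address entry into a URL-like string."""
    host = entry["address"]
    if entry.get("port") is not None:
        host = f"{host}:{entry['port']}"
    return f"{entry['protocol']}://{host}{entry.get('path', '')}"


def addresses_for(instance: dict, internal: bool) -> list[str]:
    """Addresses usable from inside (internal=True) or outside the cluster."""
    return [
        format_address(e)
        for e in instance["receptor_addresses"]
        if e["internal"] == internal
    ]


def resolve_peers(peer_ids: list[int], instances: list[dict]) -> list[tuple[str, str]]:
    """Map receptor_address primary keys to (receptor_node_id, address)."""
    by_id = {
        e["id"]: (inst["receptor_node_id"], format_address(e))
        for inst in instances
        for e in inst["receptor_addresses"]
    }
    # An unknown primary key raises KeyError rather than being silently skipped.
    return [by_id[pk] for pk in peer_ids]


if __name__ == "__main__":
    print(addresses_for(INSTANCE, internal=False))
    print(resolve_peers([23, 24], [INSTANCE]))
```

Keying peers on address primary keys rather than hostnames lets a remote node pick the external address while control nodes pick the internal one for the same instance.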

TheRealHaoLiu · October 2, 2023, 9:26pm

https://github.com/ansible/awx-operator/pull/1576

starting to sketch out what the AWX operator changes look like

input welcome

currently I’m focusing on the OpenShift-specific implementation. If anyone from the community wants to hop in and help us sketch out how this would work on non-OpenShift platforms, that would be awesome!

TheRealHaoLiu · October 3, 2023, 6:04pm

btw @kurokobo we ended up deciding to use a StatefulSet, because this way we have a static pod name for the Service to select on. If replicas were “accidentally” increased, no traffic would go to the extra pods.

kurokobo · October 4, 2023, 1:17am

I will completely leave the final decision to you and your team, but my concerns are as follows:

With StatefulSet, pods may not be automatically rescheduled when a node goes down due to a failure.
In this case, since manual intervention is required, the outage time will be considerably greater than the time it takes for the unintentionally increased replicas to be reduced by the Operator.

If you have time, power off the node where the pod for AWXMeshIngress is running and see how it works.

Of course, it is difficult to compare the frequency of replicas increasing due to accidents with the frequency of node failures

TheRealHaoLiu · October 4, 2023, 1:24am

Fair trade-off. You’re right: coming back up when things go wrong (uncontrolled) is probably more important than protecting people from their own mistakes.

tanganellilore · November 4, 2023, 6:01pm

@TheRealHaoLiu What is the final “design” of inbound connection to AWX?

Because to me it is not very clear why we need a hop node inside the deployment if we go with an Ingress/Service that supports WebSocket.

In my mind, if we are talking about WebSocket and Ingress, I’d expect that a “hop” node pod would not be required, since AWX itself can manage ws.

Execution nodes outside AWX can reach it via the Ingress URL, and AWX itself can manage this connection directly, without requiring any changes in case of replicas and so on.

Am I missing something? (I’m not much of an expert on OpenShift, so my post is k8s oriented.)

Thanks for your explanation

kurokobo · November 5, 2023, 5:15am

Execution nodes outside AWX can reach it via the Ingress URL, and AWX itself can manage this connection directly, without requiring any changes in case of replicas and so on.

That is correct, but note that in the current implementation the backend connection between AWX and the Exec Node is outbound: it is initiated from AWX to the Exec Node.

For example, if there are 2 replicas of AWX and the backend connection is outbound, as in the current implementation, there is no problem, because each of the two AWX pods connects to the external Exec Node.

On the other hand, if we try to change this to an inbound connection, the Exec Node can only connect to whichever AWX pod the Ingress/Svc load balances it to. The AWX pod that is not connected to the Exec Node cannot join the mesh network and will not be able to send jobs to the Exec Node.

Technical background: Receptor has no way to present multiple actual nodes as a single, redundant logical node. Therefore, every node participates in the mesh network as a single, independent node. In other words, as the number of replicas of a Receptor node increases, each pod joins the mesh as an independent node.

To solve this, we need to assign a single pod as the Receptor to receive inbound connections from the outside. This is the purpose of this proposal, in my understanding.

In addition to this, there are several other factors that prevent simply making AWX inbound connection from working well, such as dealing with the fact that the node name changes with each pod startup, certificate and port number considerations, etc.

fosterseth · November 6, 2023, 4:39pm

yeah @kurokobo that is right. Each task pod has a running receptor, but those receptor instances aren’t peered to each other.

Peering them to the same internal hop node ensures that from any task pod, we can reach any external remote node.

Initially I did have a POC that just creates a listener on each task pod and then daisy-chains them together, so that from any task pod, you can reach any execution node.

But given our task pods can come up and down at will, doing this is kind of messy.

Instead it is simpler to just create a separate hop node that everything connects to.
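The difference between the two topologies can be sketched as a small reachability check. This is a toy model, not receptor itself: node names are made up, and peer connections are treated as undirected links on the assumption that traffic can be routed both ways once a connection is established.

```python
from collections import deque


def reachable(edges: list[tuple[str, str]], start: str) -> set[str]:
    """Return every node reachable from start over undirected peer links."""
    graph: dict[str, set[str]] = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Inbound through a load-balanced Service: the execution node lands on
# exactly one task pod, and the task pods are not peered to each other.
load_balanced = [("exec-1", "task-a")]

# Dedicated hop node: every task pod and every execution node peers with it.
with_hop = [("task-a", "hop"), ("task-b", "hop"), ("exec-1", "hop")]

if __name__ == "__main__":
    print("exec-1" in reachable(load_balanced, "task-b"))  # False
    print("exec-1" in reachable(with_hop, "task-b"))       # True
```

In the first topology task-b has no path to exec-1, which is the failure mode described above. With the hop node in the middle, task pods can come and go without changing what the execution nodes connect to.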
