Why does awx-operator scale my awx-task and awx-web to 0 replicas after startup?

I’m trying to deploy AWX with awx-operator. When I apply my kustomization, everything starts up, but then the operator terminates all the pods and only restarts postgres. The awx-task and awx-web deployments are left with replicas set to 0, and I have to manually scale them back to 1 before those containers start. If I make any change to my config and reapply, the same thing happens. How can I get it to keep awx-task and awx-web running?

K8s 1.27 (k3s)

kustomization.yaml

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  # Find the latest tag here: https://github.com/ansible/awx-operator/releases
  - github.com/ansible/awx-operator/config/default?ref=2.19.1
  # - secrets.yaml
  - tls.yaml
  - awx.yaml

# Set the image tags to match the git version from above
images:
  - name: quay.io/ansible/awx-operator
    newTag: 2.19.1

# Specify a custom namespace in which to install AWX
namespace: sea

awx.yaml

---
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx4
spec:
  hostname: awx4.k8s.test.example.com
  ingress_type: ingress
  ingress_annotations: |
    cert-manager.io/issuer: awx4-issuer
    traefik.ingress.kubernetes.io/router.middlewares: default-bastion-office-vpn@kubernetescrd
  ingress_tls_secret: awx4-acme-le-tls-cert
  service_type: ClusterIP
  extra_settings:
    - setting: TOWER_URL_BASE
      value: "'awx4.k8s.test.example.com"
  postgres_data_volume_init: true

tls.yaml

---
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: awx4-issuer
spec:
  acme:
    privateKeySecretRef:
      name: awx4-acme-le-key
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    solvers:
      - http01:
          ingress:
            ingressClassName: traefik

I tried adding replicas: 1 to the spec, but that didn’t seem to affect anything. Am I missing something here?
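
For what it’s worth, the AWX CR does expose per-component replica fields in addition to the top-level replicas; a minimal sketch of what that would look like in awx.yaml (web_replicas/task_replicas are my reading of the operator’s spec, values illustrative):

---
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx4
spec:
  hostname: awx4.k8s.test.example.com
  # Top-level count, plus per-deployment overrides for web and task
  replicas: 1
  web_replicas: 1
  task_replicas: 1

That said, none of these help if the operator’s reconcile run fails before it reaches the step that scales the deployments back up, which is what the error further down points at.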

This may be unrelated, but when I deleted my namespace and tried to recreate it, awx-task won’t start at all. It waits endlessly for database migrations, but no migration job is ever created, so it seems completely stuck.

I see this error in the awx-operator logs

{"level":"error","ts":"2024-09-17T14:21:20Z","msg":"Reconciler error","controller":"awx-controller","object":{"name":"awx","namespace":"sea"},"namespace":"sea","name":"awx","reconcileID":"1fa7a53a-ca3a-46db-9049-dcd78f6e1cbb","error":"event runner on failed","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227"}

Not sure what this means.

I did notice it had this in the logs too:

[installer : Scale down Deployment for migration]

Maybe it intentionally scaled it down? But why didn’t it scale it back up after the migration?

I think this is the problem, but I don’t know what to do about it.

fatal: [localhost]: FAILED! => {"msg": "The task includes an option with an undefined variable. The error was: 'web_manage_replicas' is undefined. 'web_manage_replicas' is undefined. 'web_manage_replicas' is undefined. 'web_manage_replicas' is undefined\n\nThe error appears to be in '/opt/ansible/roles/installer/tasks/resources_configuration.yml': line 248, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n- name: Apply deployment resources\n  ^ here\n"}

I saw an issue report on GitHub saying this was a problem when upgrading from 2.18 to 2.19, but I didn’t upgrade from 2.18; I installed 2.19.1 fresh. If I install version 2.18 instead, I don’t get this error and the deployments are scaled back up after the initial “migration”.

Hi, I see the same error trying to go from 2.15.0 to 2.19.1. I tried adding this to the AWX spec in my awx-operator Helm deployment:

web_manage_replicas: true

but it still says the variable is not defined.
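
For reference, in 2.19.x those flags live on the AWX resource itself, and as far as I can tell the 2.19 CRD declares them with a default of true, which is why a stale CRD leaves the variable undefined in the installer role. A sketch of what that would look like on the CR, assuming the 2.19.1 CRD is actually installed (task_manage_replicas is my assumption by symmetry with the web_manage_replicas error):

---
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx4
spec:
  hostname: awx4.k8s.test.example.com
  # Only takes effect if the CRD that defines these fields is installed;
  # an older structural CRD will silently prune unknown fields.
  web_manage_replicas: true
  task_manage_replicas: true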

P.S. The awxs.awx.ansible.com CRD needs to be forcibly upgraded, or AWX completely redeployed.

Yes, the problem for me was that someone else had already installed an older awx-operator on the cluster with Helm, so the CRDs were wrong. Even after I updated the CRDs, the Helm install would revert them while my AWX was deploying, so whether things worked was a bit random, depending on when the CRDs got broken.
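
One way to break that cycle (short of completely redeploying AWX) is to apply the operator’s CRDs explicitly, pinned to the same ref you deploy, so they can never lag behind whatever Helm left on the cluster. A sketch, assuming the repo keeps the standard operator-sdk config/crd overlay:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  # Just the CRDs, pinned to the same version as the operator image
  - github.com/ansible/awx-operator/config/crd?ref=2.19.1

Applying this server-side (kubectl apply --server-side -k .) also avoids the annotation-size problem that very large CRDs like awxs.awx.ansible.com can hit with client-side apply.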

