Why does awx-operator scale my awx-task and awx-web to 0 replicas after startup

I’m trying to deploy AWX with awx-operator. When I apply my kustomization, everything starts up, but then the operator terminates all the pods and only restarts postgres. It sets the replicas for awx-task and awx-web to 0, and I have to manually scale them back to 1 (commands below) before those containers start. If I make any change to my config and reapply, the same thing happens. How can I get it to keep awx-task and awx-web running?

K8s 1.27 (k3s)
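
This is what I mean by scaling them back up by hand (the awx4-task / awx4-web deployment names are what I get from my CR name awx4; adjust if yours differ):

kubectl -n sea get deployments
kubectl -n sea scale deployment awx4-task awx4-web --replicas=1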

kustomization.yaml

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  # Find the latest tag here: https://github.com/ansible/awx-operator/releases
  - github.com/ansible/awx-operator/config/default?ref=2.19.1
  # - secrets.yaml
  - tls.yaml
  - awx.yaml

# Set the image tags to match the git version from above
images:
  - name: quay.io/ansible/awx-operator
    newTag: 2.19.1

# Specify a custom namespace in which to install AWX
namespace: sea

awx.yaml

---
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx4
spec:
  hostname: awx4.k8s.test.example.com
  ingress_type: ingress
  ingress_annotations: |
    cert-manager.io/issuer: awx4-issuer
    traefik.ingress.kubernetes.io/router.middlewares: default-bastion-office-vpn@kubernetescrd
  ingress_tls_secret: awx4-acme-le-tls-cert
  service_type: ClusterIP
  extra_settings:
    - setting: TOWER_URL_BASE
      value: "'awx4.k8s.test.example.com"
  postgres_data_volume_init: true

tls.yaml

---
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: awx4-issuer
spec:
  acme:
    privateKeySecretRef:
      name: awx4-acme-le-key
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    solvers:
      - http01:
          ingress:
            ingressClassName: traefik

I tried adding replicas: 1 to the spec, but that didn’t seem to affect anything. Am I missing something here?
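
For reference, this is roughly what that looked like in awx.yaml. Whether the per-component web_replicas / task_replicas fields are the right knobs here is my reading of the operator’s spec, not something I’ve confirmed on 2.19.1:

spec:
  replicas: 1
  # or per component (assumed field names):
  web_replicas: 1
  task_replicas: 1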

This may be unrelated, but when I deleted my namespace and tried to recreate it, awx-task won’t start at all. It waits endlessly for database migrations, but no migration job is ever created, so it seems completely stuck there.
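
In case the exact commands help, this is how I’m checking for the migration job and the operator’s own logs (the deployment and container names are the awx-operator defaults from config/default; adjust if yours differ):

kubectl -n sea get jobs
kubectl -n sea logs deployment/awx-operator-controller-manager -c awx-manager --tail=200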

I see this error in the awx-operator logs

{"level":"error","ts":"2024-09-17T14:21:20Z","msg":"Reconciler error","controller":"awx-controller","object":{"name":"awx","namespace":"sea"},"namespace":"sea","name":"awx","reconcileID":"1fa7a53a-ca3a-46db-9049-dcd78f6e1cbb","error":"event runner on failed","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227"}

Not sure what this means.

I did notice it had this in the logs too:

[installer : Scale down Deployment for migration]

Maybe it intentionally scaled it down? But why didn’t it scale it back up post migration?

I think this is the problem, but I don’t know what to do about it.

fatal: [localhost]: FAILED! => {"msg": "The task includes an option with an undefined variable. The error was: 'web_manage_replicas' is undefined. 'web_manage_replicas' is undefined. 'web_manage_replicas' is undefined. 'web_manage_replicas' is undefined\n\nThe error appears to be in '/opt/ansible/roles/installer/tasks/resources_configuration.yml': line 248, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n- name: Apply deployment resources\n  ^ here\n"}

I saw an issue report on GitHub saying it was a problem upgrading from 2.18 to 2.19, but I didn’t upgrade from 2.18. I installed fresh from 2.19.1. If I install version 2.18, I don’t get this error and the deployments get scaled back up after the initial “migration”.
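
For anyone who wants to try the same downgrade test, this is the kustomization change I used, with the other entries unchanged (2.18.0 is just the 2.18 release I tested; check the releases page for the exact tag you want):

resources:
  - github.com/ansible/awx-operator/config/default?ref=2.18.0
images:
  - name: quay.io/ansible/awx-operator
    newTag: 2.18.0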

Hi, I see the same error trying to go from 2.15.0 to 2.19.1. I tried adding this to the AWX spec in the awx-operator Helm deployment:

web_manage_replicas: true

but it still says it is not defined.
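
If it helps anyone else debugging: this is how I checked whether the CRD installed in the cluster actually contains the new field in its schema (my working assumption is that the installer picks up the default for web_manage_replicas from the CRD, so an old CRD leaves the variable undefined):

kubectl get crd awxs.awx.ansible.com -o yaml | grep -c web_manage_replicas

A count of 0 means the old schema is still installed.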

p.s. The awxs.awx.ansible.com CRD needs to be forcefully upgraded, or AWX completely redeployed.
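
Something like this is what I mean by forcefully upgrading the CRDs (config/crd is the standard layout in the awx-operator repo; verify the path for your ref, and note that --server-side --force-conflicts takes field ownership away from whatever applied the CRD before):

kubectl apply --server-side --force-conflicts -k "github.com/ansible/awx-operator/config/crd?ref=2.19.1"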

Yes, the problem for me was that someone else had already installed an older awx-operator on the cluster with Helm, so the CRDs were out of date. Even when I updated the CRDs, the Helm release would revert them while my AWX was deploying, so whether things worked was somewhat random, depending on when the CRDs got broken.
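
One quick way to tell whether a Helm release still owns the CRD (these are the standard Helm ownership annotations and labels, nothing AWX-specific):

kubectl get crd awxs.awx.ansible.com -o yaml | grep -E 'meta\.helm\.sh|app\.kubernetes\.io/managed-by'

If that shows managed-by: Helm and an old release name, upgrades of that release can put the old CRD back underneath you, which is the reverting behaviour described above.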