Why does awx-operator scale my awx-task and awx-web to 0 replicas after startup?

I’m trying to deploy AWX with awx-operator. When I apply my kustomization, everything starts up, but then the operator terminates all the pods and only restarts postgres, leaving the replicas for awx-task and awx-web set to 0. I have to manually scale them back up to 1 before those containers start (the command I’m using is shown below). If I make any change to my config and reapply, the same thing happens. How can I get it to keep awx-task and awx-web running?
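For reference, this is the manual workaround I run each time; the deployment names follow the AWX resource name (awx4 here), so adjust if yours differ:

kubectl scale deployment awx4-task awx4-web --replicas=1 -n sea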

K8s 1.27 (k3s)

kustomization.yaml

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  # Find the latest tag here: https://github.com/ansible/awx-operator/releases
  - github.com/ansible/awx-operator/config/default?ref=2.19.1
  # - secrets.yaml
  - tls.yaml
  - awx.yaml

# Set the image tags to match the git version from above
images:
  - name: quay.io/ansible/awx-operator
    newTag: 2.19.1

# Specify a custom namespace in which to install AWX
namespace: sea

awx.yaml

---
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx4
spec:
  hostname: awx4.k8s.test.example.com
  ingress_type: ingress
  ingress_annotations: |
    cert-manager.io/issuer: awx4-issuer
    traefik.ingress.kubernetes.io/router.middlewares: default-bastion-office-vpn@kubernetescrd
  ingress_tls_secret: awx4-acme-le-tls-cert
  service_type: ClusterIP
  extra_settings:
    - setting: TOWER_URL_BASE
      value: "'awx4.k8s.test.example.com"
  postgres_data_volume_init: true

tls.yaml

---
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: awx4-issuer
spec:
  acme:
    privateKeySecretRef:
      name: awx4-acme-le-key
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    solvers:
      - http01:
          ingress:
            ingressClassName: traefik

I tried adding replicas: 1 to the spec, but that didn’t seem to affect anything. Am I missing something here?
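Concretely, something like this in awx.yaml; the operator docs also list per-component web_replicas / task_replicas fields, which I assume would take precedence over the plain replicas value:

spec:
  replicas: 1
  # per-component variants from the operator docs; assumed to
  # override the plain replicas field when set
  web_replicas: 1
  task_replicas: 1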

This may be unrelated, but when I deleted my namespace and tried to recreate it, awx-task now won’t start at all. It waits endlessly for database migrations, but no migration job is ever created, so it seems completely stuck.

I see this error in the awx-operator logs:

{"level":"error","ts":"2024-09-17T14:21:20Z","msg":"Reconciler error","controller":"awx-controller","object":{"name":"awx","namespace":"sea"},"namespace":"sea","name":"awx","reconcileID":"1fa7a53a-ca3a-46db-9049-dcd78f6e1cbb","error":"event runner on failed","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227"}

Not sure what this means.
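To dig further, I pulled the full Ansible output from the operator’s manager container; the deployment and container names below are what the default kustomize install uses, so adjust if yours differ:

kubectl logs -n sea deployment/awx-operator-controller-manager -c awx-manager --tail=500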

I did notice it had this in the logs too:

[installer : Scale down Deployment for migration]

Maybe it intentionally scaled them down for the migration? But then the reconcile seems to fail before it ever scales them back up.

I think this is the actual problem, but I don’t know what to do about it:

fatal: [localhost]: FAILED! => {"msg": "The task includes an option with an undefined variable. The error was: 'web_manage_replicas' is undefined. 'web_manage_replicas' is undefined. 'web_manage_replicas' is undefined. 'web_manage_replicas' is undefined\n\nThe error appears to be in '/opt/ansible/roles/installer/tasks/resources_configuration.yml': line 248, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n- name: Apply deployment resources\n  ^ here\n"}

I saw an issue report on GitHub saying this was a problem when upgrading from 2.18 to 2.19, but I didn’t upgrade from 2.18; I installed 2.19.1 fresh. If I install version 2.18 instead, I don’t get this error and the deployments get scaled back up after the initial “migration”.
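For now I’ve pinned the operator back by changing the ref and image tag in kustomization.yaml (assuming 2.18.0 is the release you want):

resources:
  - github.com/ansible/awx-operator/config/default?ref=2.18.0
images:
  - name: quay.io/ansible/awx-operator
    newTag: 2.18.0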

Hi, I see the same error trying to go from 2.15.0 to 2.19.1. I tried adding this to the spec in the awx-operator Helm deployment:

web_manage_replicas: true

But it still says the variable is undefined.

P.S. The awxs.awx.ansible.com CRD needs to be forcibly upgraded, or AWX completely redeployed.
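In case it helps, force-refreshing the CRDs in place looked roughly like this for me; the config/crd path is assumed from the operator repo layout, so verify it against the release you’re on:

kubectl apply --server-side --force-conflicts -k "github.com/ansible/awx-operator/config/crd?ref=2.19.1"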