AWX Controller died over the weekend

It’s pretty clear what happened: the AWX Controller “latest” upgraded to 2.13.1 over the weekend, and now my system will not start. I’d like to get it running on 2.13.1.

The error is:

TASK [Apply deployment resources] ********************************
fatal: [localhost]: FAILED! => {"msg": "The task includes an option with an undefined variable. The error was: 'web_liveness_period' is undefined. 'web_liveness_period' is undefined. 'web_liveness_period' is undefined. 'web_liveness_period' is undefined\n\nThe error appears to be in '/opt/ansible/roles/installer/tasks/resources_configuration.yml': line 248, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n- name: Apply deployment resources\n ^ here\n"}
-------------------------------------------------------------------------------
{"level":"error","ts":"2024-03-18T15:51:14Z","logger":"logging_event_handler","msg":"","name":"awx-sandbox","namespace":"awx","gvk":"awx.ansible.com/v1beta1, Kind=AWX","event_type":"runner_on_failed","job":"6227022932347892420","EventData.Task":"Apply deployment resources","EventData.TaskArgs":"","EventData.FailedTaskPath":"/opt/ansible/roles/installer/tasks/resources_configuration.yml:248","error":"[playbook task failed]","stacktrace":"github.com/operator-framework/ansible-operator-plugins/internal/ansible/events.loggingEventHandler.Handle\n\tansible-operator-plugins/internal/ansible/events/log_events.go:111"}
{"level":"error","ts":"2024-03-18T15:51:15Z","logger":"runner","msg":"\u001b[0;34mansible-playbook [core 2.15.8]\u001b[0m\r\n\u001b[0;34m config file = /etc/ansible/ansible.cfg\u001b[0m\r\n\u001b[0;34m configured module search path = ['/usr/share/ansible/openshift']\u001b[0m\r\n\u001b[0;34m ansible python module location = /usr/local/lib/python3.9/site-packages/ansible\u001b[0m\r\n\u001b[0;34m ansible collection location = /opt/ansible/.ansible/collections:/usr/share/ansible/collections\u001b[0m\r\n\u001b[0;34m executable location = /usr/local/bin/ansible-playbook\u001b[0m\r\n\u001b[0;34m python version = 3.9.18 (main, Sep 22 2023, 17:58:34) [GCC 8.5.0 20210514 (Red Hat 8.5.0-20)] (/usr/bin/python3)\u001b[0m\r\n\u001b[0;34m jinja version = 3.1.3\u001b[0m\r\n\u001b[0;34m libyaml = True\u001b[0m\r\n\u001b[0;34mUsing /etc/ansible/ansible.cfg as config file\u001b[0m\r\n\u001b[0;34mSkipping callback 'awx_display', as we already have a stdout callback.\u001b[0m\n\u001b[0;34mSkipping callback 'default', a...
----- Ansible Task Status Event StdOut (awx.ansible.com/v1beta1, Kind=AWX, awx-sandbox/awx) -----
PLAY RECAP *********************************************************************
localhost : ok=69 changed=0 unreachable=0 failed=1 skipped=66 rescued=0 ignored=0
----------
{"level":"error","ts":"2024-03-18T15:51:15Z","msg":"Reconciler error","controller":"awx-controller","object":{"name":"awx-sandbox","namespace":"awx"},"namespace":"awx","name":"awx-sandbox","reconcileID":"d22706a7-5a30-471d-9549-1db542c31b4c","error":"event runner on failed","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227"}

I ran:

[user@jump ~]$ oc apply --server-side -k github.com/ansible/awx-operator/config/crd?ref=2.13.1
customresourcedefinition.apiextensions.k8s.io/awxbackups.awx.ansible.com serverside-applied
customresourcedefinition.apiextensions.k8s.io/awxmeshingresses.awx.ansible.com serverside-applied
customresourcedefinition.apiextensions.k8s.io/awxrestores.awx.ansible.com serverside-applied
customresourcedefinition.apiextensions.k8s.io/awxs.awx.ansible.com serverside-applied

The error captured above continues after the CRD command was applied.

Is there a path forward from here?

Kevin

To be slightly more clear, the Controller pod is running and the postgres pod is running, but the web/task/ee pod will not start.

Hi,
Try this:

oc apply --server-side --force-conflicts -k github.com/ansible/awx-operator/config/crd?ref=2.13.1

Source: ansible/awx-operator issue #1771 on GitHub, “2.13.1 web and task won't install”.

Thank you, yes. I had already run that command without the --force-conflicts flag and received no errors. To be thorough, I just ran it again with --force-conflicts, again received no errors, and again it did not help. The problem with the missing variable remains.

The error was: 'web_liveness_period' is undefined.

Thank you for the suggestion.

Kevin

Was the CRD actually applied on the server side?

I’m sorry. I don’t know what you are asking. I ran the command and I got back the results pasted above. Does that equal “being applied”? Or is there some other magic I can try to run?
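
One way to answer that question is to look for the new field in the CRD as it is stored on the cluster, rather than relying on the "serverside-applied" message alone. The following is a minimal sketch, assuming the 2.13.1 CRD schema declares web_liveness_period (which is what the linked issue suggests) and that you are logged in with oc. The helper name crd_has_field is hypothetical and not part of oc.

```shell
# Hypothetical helper: succeeds if the text on stdin mentions the given field name.
crd_has_field() {
  grep -q -- "$1"
}

# Usage against a live cluster (commented out; needs a logged-in oc session):
# oc get crd awxs.awx.ansible.com -o yaml | crd_has_field web_liveness_period \
#   && echo "field present in CRD" || echo "field missing from CRD"
```

If the field is reported missing, the cluster is probably still serving the older CRD, which would be consistent with the undefined-variable error.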

We do see the right chart versions applied.

oc describe awx
Name:         awx-sandbox
Namespace:    awx
Labels:       app.kubernetes.io/component=awx
              app.kubernetes.io/instance=ns-awx
              app.kubernetes.io/managed-by=awx-operator
              app.kubernetes.io/operator-version=2.13.1
              app.kubernetes.io/part-of=awx-sandbox

Bizarrely, the errors are still there, but now the instance is up and running. Any guesses what that’s about?

Suspicions: It may be that we had another namespace running to which I did not apply the CRDs. More likely, the cause was this in the code for the namespace having the problem:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: "200"
  name: awx-operator-test
spec:
  destination:
    name: in-cluster
    namespace: test-namespace
    server: ''
  source:
    path: config/default
    # Find the latest tag here: https://github.com/ansible/awx-operator/releases
    repoURL: 'https://github.com/ansible/awx-operator.git'
    targetRevision: 2.12.1
  project: infra
  syncPolicy:
    automated:
      prune: false
      selfHeal: false

The targetRevision of 2.12.1 pinned here was probably in conflict with the 2.13.1 build auto-pulled over the weekend. Since I did not direct the pull to happen, I did nothing to prepare for it.
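
A quick way to test that suspicion is to compare the tag of the operator image that is actually running with the targetRevision pinned in the Argo CD Application. This is a rough sketch only. It assumes the operator deployment uses the default name awx-operator-controller-manager and lives in the awx namespace seen in the logs; image_tag is a hypothetical helper.

```shell
# Hypothetical helper: print the tag of an image reference, i.e. the text after
# the last colon (the input is returned unchanged if it contains no colon).
image_tag() {
  printf '%s\n' "${1##*:}"
}

# Usage against a live cluster (commented out; deployment name and namespace are assumptions):
# for image in $(oc get deployment awx-operator-controller-manager -n awx \
#     -o jsonpath='{.spec.template.spec.containers[*].image}'); do
#   image_tag "$image"
# done
```

If the running tag is 2.13.1 while the Application still pins 2.12.1, the two sources disagree about which operator version should be deployed.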

Anyway, we are running now.

Thank you.