AWX Backup Fails

My AWX backup is failing. I am running AWX ver 23.7.0. I have applied my manifest and the job runs until it completes with an error. I don’t understand what the output from the Ansible job is telling me. Can anyone assist with troubleshooting?

Also see post: Backup Postgres DB from cluster #1 and restore DB to cluster #2

---
apiVersion: awx.ansible.com/v1beta1
kind: AWXBackup
metadata:
  name: awxbackup-2024-08-06
  namespace: awx
spec:
  deployment_name: awx-operator-controller-manager
  backup_pvc:  awx-backup
  clean_backup_on_delete: true

localhost                  : ok=13   changed=1    unreachable=0    failed=1    skipped=8    rescued=0    ignored=0

----------
{"level":"error","ts":"2024-08-06T16:59:09Z","msg":"Reconciler error","controller":"awxbackup-controller","object":{"name":"awxbackup-08-05-2024","namespace":"awx"},"namespace":"awx","name":"awxbackup-08-05-2024","reconcileID":"746a5fdc-2c6d-4fef-9af0-71f4d4ba8722","error":"event runner on failed","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:274\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:235"}
{"level":"info","ts":"2024-08-06T16:59:10Z","logger":"logging_event_handler","msg":"[playbook debug]","name":"awxbackup-08-05-2024","namespace":"awx","gvk":"awx.ansible.com/v1beta1, Kind=AWXBackup","event_type":"runner_on_ok","job":"68433906028693526","EventData.TaskArgs":""}

Hi, could you please share your full logs (or at least enough from the end of the logs) from the operator?
It is necessary to carefully search the complete logs to determine which tasks actually failed.

I will get them loaded here ASAP. I have to get approvals for logs from our security team. It may take a day or two. Thanks for your patience.

If uploading the log here is a hassle, you can try to find the failed task on your side.
The log should look a lot like the standard playbook log, with a line somewhere like fatal: [localhost]: FAILED!.
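
If it helps, a rough way to pull the tail of the operator log and search for that line might look like this (the awx-manager container name is an assumption and can differ between operator versions):

# Dump the recent operator log and show the context around the failed task
kubectl logs -n awx deployment/awx-operator-controller-manager -c awx-manager --tail=2000 \
  | grep -B 20 'fatal: \[localhost\]: FAILED!'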

Basically, what I’m trying to do is identify the failed tasks from the logs, check the implementation of those parts in the GitHub code, and guess the reasons for the failures.

The role that runs during backup starts from this main.yml: awx-operator/roles/backup/tasks/main.yml at 2.11.0 · ansible/awx-operator · GitHub

So, it’s really no different from troubleshooting when a regular playbook fails. The implementation of the role might be a bit complex, but I think you can do the same thing too!

OK, I think I see the issue. The backup job is trying to pull the postgres:13 container image and getting a not-ready status.

{"message": "backoff pulling image \"postgres:13\""}
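
Describing the failing pod should show the exact image reference and pull error in its events (the pod name below is taken from the output later in this thread; adjust to your backup name):

# Show the pod's events, including the full image reference the kubelet cannot pull
kubectl describe pod -n awx awxbackup-08-05-2024-db-management
# Or list the most recent events in the namespace
kubectl get events -n awx --sort-by=.lastTimestamp | tail -n 20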

I am running an air-gapped system with a private repo and needed to pull my images for AWX from there. How do I modify the job to get the containers pulled from it?

My private repo is something like this:
gsil-docker1.idm.gsil.org:5001/postgres:13
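
To double-check that the image is actually reachable from a cluster node, a manual pull with the node’s runtime tooling should succeed (crictl here, assuming a CRI-based runtime; use docker pull if that is what the nodes run):

# Run on a cluster node: try pulling the image from the private registry directly
crictl pull gsil-docker1.idm.gsil.org:5001/postgres:13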

You can specify postgres_image: gsil-docker1.idm.gsil.org:5001/postgres:13 in the spec of your AWXBackup object.

Ok, great. Let me make that modification, watch the job and report back. Thanks! :smiley:

OK, I modified the manifest for my backup and added the postgres parameter as suggested. I am still seeing the image backoff error. I am trying to remember what I need to do in this case. I don’t completely trust my Kubernetes knowledge yet as I am still pretty new to it. I tried to helm delete my deployment and all the pods stopped with the exception of the awxbackup-db-management pod. I can delete that pod, but that didn’t help. Which component must I modify or remove? I would have thought applying the updated manifest would have done it…

OK, I think I figured that part out. I had to delete the deployment followed by deleting the pod. Then I see that an automation job pod runs for a bit and then goes away. I presume that is my backup trying to run?
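
One way to see what that short-lived pod is doing before it disappears (the db-management pod name is an assumption based on the <backup name>-db-management pattern that shows up later in this thread):

# Watch pods come and go in the awx namespace
kubectl get pods -n awx -w
# If the backup management pod appears, stream its logs before it terminates
kubectl logs -n awx -f pod/awxbackup-2024-08-06-db-management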

Ah, my bad, the correct spec is:

spec:
  ...
  postgres_image: gsil-docker1.idm.gsil.org:5001/postgres
  postgres_image_version: 13

Ref.: awx-operator/config/crd/bases/awx.ansible.com_awxbackups.yaml at 2.11.0 · ansible/awx-operator · GitHub

No worries, I’ll make that change and apply it.

OK, I initially got an error when applying that and figured out it needed to be single-quoted. Otherwise Kubernetes barked at me about an invalid spec file.

spec:
  ...
  postgres_image: gsil-docker1.idm.gsil.org:5001/postgres
  postgres_image_version: '13'
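
For what it’s worth, the quoting is needed because the CRD appears to declare postgres_image_version as a string, so an unquoted 13 is parsed as an integer and rejected. One way to confirm the expected type, assuming the AWXBackup CRD is installed:

# Inspect the declared type of the field straight from the CRD schema
kubectl explain awxbackup.spec.postgres_image_version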

Anyway,
I don’t see any new files or folders at my defined PVC location:
/var/lib/postgres/backup/

Hmmm… :thinking: any thoughts?

First of all, check the operator’s log to see if the PLAY RECAP completes with failed=0.
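
Something like this should show the recap (awx-manager is the usual name of the manager container in the operator pod, but that is an assumption and may differ by operator version):

# Tail the operator log and pull out the PLAY RECAP of the most recent run
kubectl logs -n awx deployment/awx-operator-controller-manager -c awx-manager --tail=500 \
  | grep -A 2 'PLAY RECAP'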

OK, part of my issue is the fact that I misread the spec. I forgot to drop the :13 at the end of postgres_image. I did that now. How do I check the operator log? Also, should I be seeing the operator listed as a deployment within my cluster now after having deleted it earlier in this post?


[root@gsil-kube01 ~]# kubectl get all -n awx
NAME                            READY   STATUS    RESTARTS   AGE
pod/awx-postgres-13-0           1/1     Running   0          60m
pod/awx-task-667c9bbf9d-5sbzw   4/4     Running   0          60m
pod/awx-task-667c9bbf9d-tfsng   4/4     Running   0          60m
pod/awx-web-c7bcbd976-6q4v2     3/3     Running   0          59m
pod/awx-web-c7bcbd976-gzzn2     3/3     Running   0          59m

NAME                                                      TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)        AGE
service/awx-operator-controller-manager-metrics-service   ClusterIP   x.x.216.205   <none>        8443/TCP       60m
service/awx-postgres-13                                   ClusterIP   None             <none>        5432/TCP       60m
service/awx-service                                       NodePort    x.x.18.162     <none>        80:30080/TCP   60m

NAME                       READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/awx-task   2/2     2            2           60m
deployment.apps/awx-web    2/2     2            2           59m

NAME                                  DESIRED   CURRENT   READY   AGE
replicaset.apps/awx-task-667c9bbf9d   2         2         2       60m
replicaset.apps/awx-web-c7bcbd976     2         2         2       59m

NAME                               READY   AGE
statefulset.apps/awx-postgres-13   1/1     60m

Since the backup role is executed by the operator, nothing will happen if the operator is not running.
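
You can check whether the operator is actually there and running with something like this (names based on your namespace and the usual operator deployment name):

# Is the operator deployment present and ready?
kubectl get deployment -n awx awx-operator-controller-manager
# The operator pod normally carries the kubebuilder default label (assumption)
kubectl get pods -n awx -l control-plane=controller-manager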

Here’s what you need to do:

  1. Add the following spec to AWXBackup.
    spec:
      ...
      postgres_image: gsil-docker1.idm.gsil.org:5001/postgres
      postgres_image_version: '13'
    
  2. Start the operator.
  3. Observe the logs of the operator’s pod and make sure that the playbook succeeds with failed=0.

1. Affirmative, that is what I have.
2. Yes, I stopped and started AWX and the operator is running now.
3. I can see the logs of the operator pod and the job is failing.

I have noted that for whatever reason I have two pods that are running as the backup job. That may be part of my issue. I was not expecting to see two pods running. One is for an 8/5 date, the other is for an 8/8 date. I suspect I have some old data that is fouling things up, but I am unsure where to look. When I applied my backup manifest, where does that go?

Here is what I am seeing right now:

[root@gsil-kube01 ~]# kubectl get all -n awx
NAME                                                   READY   STATUS             RESTARTS   AGE
pod/awx-operator-controller-manager-6ffc56f846-qrg8j   2/2     Running            0          70m
pod/awx-postgres-13-0                                  1/1     Running            0          69m
pod/awx-task-667c9bbf9d-k7k52                          4/4     Running            0          69m
pod/awx-task-667c9bbf9d-t5gfw                          4/4     Running            0          69m
pod/awx-web-c7bcbd976-bgn8j                            3/3     Running            0          69m
pod/awx-web-c7bcbd976-tqj82                            3/3     Running            0          69m
pod/awxbackup-08-05-2024-db-management                 0/1     ImagePullBackOff   0          30s
pod/awxbackup-08-08-2024-db-management                 1/1     Terminating        0          9s

NAME                                                      TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)        AGE
service/awx-operator-controller-manager-metrics-service   ClusterIP   10.96.48.105     <none>        8443/TCP       70m
service/awx-postgres-13                                   ClusterIP   None             <none>        5432/TCP       69m
service/awx-service                                       NodePort    10.110.109.166   <none>        80:30080/TCP   69m

NAME                                              READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/awx-operator-controller-manager   1/1     1            1           70m
deployment.apps/awx-task                          2/2     2            2           69m
deployment.apps/awx-web                           2/2     2            2           69m

NAME                                                         DESIRED   CURRENT   READY   AGE
replicaset.apps/awx-operator-controller-manager-6ffc56f846   1         1         1       70m
replicaset.apps/awx-task-667c9bbf9d                          2         2         2       69m
replicaset.apps/awx-web-c7bcbd976                            2         2         2       69m

NAME                               READY   AGE
statefulset.apps/awx-postgres-13   1/1     69m

[root@gsil-kube01 ~]# kubectl get secret -n awx
NAME                             TYPE                 DATA   AGE
awx-admin-password               Opaque               1      91d
awx-app-credentials              Opaque               3      72m
awx-broadcast-websocket          Opaque               1      91d
awx-custom-certs                 Opaque               2      80d
awx-ldap-password                Opaque               1      80d
awx-postgres-configuration       Opaque               6      91d
awx-receptor-ca                  kubernetes.io/tls    2      91d
awx-receptor-work-signing        Opaque               2      91d
awx-secret-key                   Opaque               1      91d
redhat-operators-pull-secret     Opaque               1      72m
sh.helm.release.v1.gsil-awx.v1   helm.sh/release.v1   1      73m

---
apiVersion: awx.ansible.com/v1beta1
kind: AWXBackup
metadata:
  name: awxbackup-2024-08-06
  namespace: awx
spec:
  deployment_name: awx-operator-controller-manager
  backup_pvc:  awx-backup
  clean_backup_on_delete: true
  postgres_image: gsil-docker1.idm.gsil.org:5001/postgres
  postgres_image_version: '13'

I looked at the top of my post and realized I forgot to mention that I am deploying AWX via a Helm chart, so much of the spec lives in the YAML values file that I pass to the Helm chart. Will that make a difference? Do I need to integrate the backup with the YAML file I am using at Helm deployment time? Right now, I am running this to start up AWX:
helm install -n awx gsil/gsil-awx /root/awx-operator/ -f awxvalues.yml

My backup-awx.yml exists as a separate file from awxvalues.yml.
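
As far as I can tell, the AWXBackup CR does not have to be part of the Helm values at all; it can be applied on its own once the deployment is up, roughly like this (file name taken from this thread):

# Apply the backup CR separately from the Helm release
kubectl apply -n awx -f backup-awx.yml
# Confirm the CR was created
kubectl get awxbackups.awx.ansible.com -n awx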

@kurokobo
Can you comment further? See my previous reply. Thanks!

@jeremytourville You likely have 2 backup pods in your previous post because you scheduled a backup job while the operator was offline and before fixing the image used in the spec. So the first pod is stuck with ImagePullBackOff, while the second one was Terminating (and hopefully succeeded?). The Helm chart is probably not doing anything to clean up the old backup job.

You should be able to see your backups with kubectl get -n awx awxbackups.awx.ansible.com. Note that these are the CRs for the operator to trigger backups, not the data itself. However, you can run kubectl describe to find out the status, which should also show where the backup lives.
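
Concretely, with the backup name from your manifest above:

# List the AWXBackup custom resources the operator knows about
kubectl get -n awx awxbackups.awx.ansible.com
# Inspect one to see its status, backup claim, and backup directory
kubectl describe -n awx awxbackup awxbackup-2024-08-06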

Sample output:

Status:
  Backup Claim:      awx-snd-backup-claim
  Backup Directory:  /backups/tower-openshift-backup-2024-04-17-165512
  Conditions:
    Last Transition Time:  2024-04-17T16:55:38Z
    Reason:
    Status:                False
    Type:                  Failure
    Last Transition Time:  2024-04-17T16:54:52Z
    Reason:                Successful
    Status:                True
    Type:                  Running
    Last Transition Time:  2024-08-04T16:41:45Z
    Reason:                Successful
    Status:                True
    Type:                  Successful

The real data is inside of the Backup Claim at the Backup Directory.

As for your stuck pod, I would simply delete the related awxbackups.awx.ansible.com and deployment resources from Kubernetes.
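
For example, using the names from your earlier output (note that clean_backup_on_delete: true means deleting an AWXBackup CR also removes its backup data, so only delete the one you no longer need):

# Delete the stuck AWXBackup CR (the 08-05 one that never pulled its image)
kubectl delete -n awx awxbackup awxbackup-08-05-2024
# The operator should clean up the related db-management pod; if it lingers, delete it too
kubectl delete -n awx pod awxbackup-08-05-2024-db-management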