My AWX backup is failing. I am running AWX 23.7.0. I have applied my manifest, and the job runs until it completes with an error. I don't understand what the output from the Ansible job is telling me. Can anyone assist with troubleshooting?
Hi, could you please share your full logs (or at least enough from the end of them) from the operator?
It is necessary to carefully search the complete logs to determine which tasks actually failed.
If uploading the log here is a hassle, you can try to find the failed task on your side.
The log should look a lot like the standard playbook log, with a line somewhere like fatal: [localhost]: FAILED!.
Basically, what I’m trying to do is identify the failed tasks from the logs, check the implementation of those parts in the GitHub code, and guess the reasons for the failures.
So, it’s really no different from troubleshooting when a regular playbook fails. The implementation of the role might be a bit complex, but I think you can do the same thing too!
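For example, something along these lines will dump the operator log and show the context around the failure (this assumes the default operator deployment name and the awx namespace; adjust both to match your install):

kubectl logs -n awx deployment/awx-operator-controller-manager -c awx-manager | grep -B 30 'FAILED!'

The task name printed just above the fatal line is the one to look up in the awx-operator backup role on GitHub.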
I am running a private repo on an air-gapped system and need to pull my AWX images from there. How do I modify the backup job so its containers are pulled from my registry?
My private repo is something like this: gsil-docker1.idm.gsil.org:5001/postgres:13
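In case it is useful as a reference, a minimal AWXBackup spec pointing the db-management pod at a private registry might look roughly like this (the metadata name and deployment_name below are assumptions, and this assumes your operator version honors postgres_image / postgres_image_version overrides on the backup CR; note the version is quoted so it is treated as a string):

apiVersion: awx.ansible.com/v1beta1
kind: AWXBackup
metadata:
  name: awxbackup-2024-08-08
  namespace: awx
spec:
  deployment_name: awx
  postgres_image: gsil-docker1.idm.gsil.org:5001/postgres
  postgres_image_version: '13'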
OK, I modified the manifest for my backup and added the postgres parameter as suggested, but I am still seeing the ImagePullBackOff error. I am trying to work out what I need to do in this case; I don't completely trust my Kubernetes knowledge yet, as I am still pretty new to Kubernetes. I tried a helm delete of my deployment and all the pods stopped, with the exception of the awxbackup-db-management pod. I can delete that pod, but that didn't help. Which component must I modify or remove? I would have thought applying the updated manifest would have done it…
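One way to narrow that down (the pod name here is just taken from your description; substitute whatever kubectl get pods -n awx actually shows) is to describe the stuck pod and look at the Image field and the Events at the bottom:

kubectl describe pod -n awx awxbackup-db-management

If the image it is trying to pull is still the old one, the pod was created from the old backup spec, and the backup CR itself likely needs to be updated or recreated rather than just deleting the pod.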
OK, I think I figured that part out. I had to delete the deployment followed by deleting the pod. Then I see that an automation job pod runs for a bit and then goes away. I presume that is my backup trying to run?
OK, I initially got an error when applying that and figured out it needed to be single quoted. Otherwise Kubernetes barked at me about an invalid spec file.
OK, part of my issue is the fact that I misread the spec. I forgot to drop the 13 at the end of postgres_image; I have done that now. How do I check the operator log? Also, should I be seeing the operator listed as a deployment within my cluster now, after having deleted it earlier in this post?
[root@gsil-kube01 ~]# kubectl get all -n awx
NAME                            READY   STATUS    RESTARTS   AGE
pod/awx-postgres-13-0           1/1     Running   0          60m
pod/awx-task-667c9bbf9d-5sbzw   4/4     Running   0          60m
pod/awx-task-667c9bbf9d-tfsng   4/4     Running   0          60m
pod/awx-web-c7bcbd976-6q4v2     3/3     Running   0          59m
pod/awx-web-c7bcbd976-gzzn2     3/3     Running   0          59m

NAME                                                       TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)        AGE
service/awx-operator-controller-manager-metrics-service   ClusterIP   x.x.216.205   <none>        8443/TCP       60m
service/awx-postgres-13                                    ClusterIP   None          <none>        5432/TCP       60m
service/awx-service                                        NodePort    x.x.18.162    <none>        80:30080/TCP   60m

NAME                       READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/awx-task   2/2     2            2           60m
deployment.apps/awx-web    2/2     2            2           59m

NAME                                  DESIRED   CURRENT   READY   AGE
replicaset.apps/awx-task-667c9bbf9d   2         2         2       60m
replicaset.apps/awx-web-c7bcbd976     2         2         2       59m

NAME                               READY   AGE
statefulset.apps/awx-postgres-13   1/1     60m
Yes, I stopped and started AWX and the operator is running now. I can see the logs of the operator pod, and the job is failing.
I have noticed that, for whatever reason, I have two pods running as the backup job. That may be part of my issue; I was not expecting to see two pods. One is for an 8/5 date, the other is for an 8/8 date. I suspect I have some old data that is fouling things up, but I am unsure where to look. When I applied my backup manifest, where did that go?
Here is what I am seeing right now:
[root@gsil-kube01 ~]# kubectl get all -n awx
NAME                                                   READY   STATUS             RESTARTS   AGE
pod/awx-operator-controller-manager-6ffc56f846-qrg8j   2/2     Running            0          70m
pod/awx-postgres-13-0                                  1/1     Running            0          69m
pod/awx-task-667c9bbf9d-k7k52                          4/4     Running            0          69m
pod/awx-task-667c9bbf9d-t5gfw                          4/4     Running            0          69m
pod/awx-web-c7bcbd976-bgn8j                            3/3     Running            0          69m
pod/awx-web-c7bcbd976-tqj82                            3/3     Running            0          69m
pod/awxbackup-08-05-2024-db-management                 0/1     ImagePullBackOff   0          30s
pod/awxbackup-08-08-2024-db-management                 1/1     Terminating        0          9s

NAME                                                       TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)        AGE
service/awx-operator-controller-manager-metrics-service   ClusterIP   10.96.48.105     <none>        8443/TCP       70m
service/awx-postgres-13                                    ClusterIP   None             <none>        5432/TCP       69m
service/awx-service                                        NodePort    10.110.109.166   <none>        80:30080/TCP   69m

NAME                                              READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/awx-operator-controller-manager   1/1     1            1           70m
deployment.apps/awx-task                          2/2     2            2           69m
deployment.apps/awx-web                           2/2     2            2           69m

NAME                                                         DESIRED   CURRENT   READY   AGE
replicaset.apps/awx-operator-controller-manager-6ffc56f846   1         1         1       70m
replicaset.apps/awx-task-667c9bbf9d                          2         2         2       69m
replicaset.apps/awx-web-c7bcbd976                            2         2         2       69m

NAME                               READY   AGE
statefulset.apps/awx-postgres-13   1/1     69m

[root@gsil-kube01 ~]# kubectl get secret -n awx
NAME                             TYPE                 DATA   AGE
awx-admin-password               Opaque               1      91d
awx-app-credentials              Opaque               3      72m
awx-broadcast-websocket          Opaque               1      91d
awx-custom-certs                 Opaque               2      80d
awx-ldap-password                Opaque               1      80d
awx-postgres-configuration       Opaque               6      91d
awx-receptor-ca                  kubernetes.io/tls    2      91d
awx-receptor-work-signing        Opaque               2      91d
awx-secret-key                   Opaque               1      91d
redhat-operators-pull-secret     Opaque               1      72m
sh.helm.release.v1.gsil-awx.v1   helm.sh/release.v1   1      73m
I looked at the top of my post and realized I forgot to mention that I am deploying AWX via a Helm chart, so much of the spec is located in the YAML values file that I deploy with the chart. Will that make a difference? Do I need to integrate the backup into the YAML file I am using at Helm deployment time? Right now, I am running this to start up AWX: helm install -n awx gsil/gsil-awx /root/awx-operator/ -f awxvalues.yml
My backup-awx.yml exists as a separate file from awxvalues.yml.
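For what it's worth, the AWXBackup CR normally doesn't have to live in the Helm values at all; as long as the operator is running, a separate file applied directly should be picked up, for example (filename taken from your post):

kubectl apply -n awx -f backup-awx.yml

The operator watches for AWXBackup objects regardless of how AWX itself was installed, so keeping it outside awxvalues.yml shouldn't be a problem by itself.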
@jeremytourville You likely have two backup pods in your previous post because you scheduled a backup job while the operator was offline and before fixing the image used in the spec. So the first pod is stuck in ImagePullBackOff, while the second one was Terminating (and hopefully succeeded?). The Helm chart is probably not doing anything to clean up the old backup job.
You should be able to see your backups with kubectl get -n awx awxbackups.awx.ansible.com. Note that these are the CRs that tell the operator to trigger backups, not the backup data itself. However, you can run kubectl describe on them to find out the status, which should also show where the backup lives.
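For example (the CR names below are guesses based on your pod names; use whatever the get command actually returns):

kubectl get -n awx awxbackups.awx.ansible.com
kubectl describe -n awx awxbackups.awx.ansible.com awxbackup-08-05-2024
kubectl delete -n awx awxbackups.awx.ansible.com awxbackup-08-05-2024

Describing a backup CR should show its status and where the data was written, and deleting the stale one stops the operator from retrying that failed backup.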
I was able to clean up the backup job. I still don't see any data being created in the backup folder. Instead, I exec'd into the Postgres pod and manually created a backup of the AWX DB. In my new cluster I created an AWX restore job. I was able to transfer the .sql file to the new cluster, so it is ready to be restored/ingested by Postgres.
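(For reference, a manual dump like that can be done along these lines; the pod name is from the output above, the database name and user are the AWX defaults, so check the awx-postgres-configuration secret for the real values, and the password may need to be supplied via PGPASSWORD:

kubectl exec -n awx awx-postgres-13-0 -- pg_dump -U awx -d awx > awx-manual-backup.sql

The resulting .sql file can then be copied to the new cluster and restored there with psql or picked up by the restore process.)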
I am having issues with the correct folder structure being created. I created a storage.yml for my system, but I don't see the folder listed when I browse to the path via the CLI. The PV and PVC look fine: no errors when running kubectl describe, and both show a status of Bound. Is there something I need to add to the spec file?
Here is my spec file that I am deploying with helm at install time: