AWX Backup Fails

Please clarify your situation.

I can’t understand what’s going on in your environment, so I can’t tell whether your troubleshooting is on the right track.

You were asking us how to restore, but why is AWX already running in the new cluster and you’re trying to do a backup instead of a restore? Also, why did you delete the directories for DB and backups?

Instead of pasting these fragmented errors, I want to emphasize again that you should focus on identifying the failed tasks from the logs.

No, I am trying to do the restore in the new cluster.

Why delete the dirs?

I found two different directories for PG data and I wasn’t positive which one held the “current” data. It didn’t really matter anyway because there wasn’t any production data there. I just wanted to ensure I had a clean start before running the restore. I also had to delete and recreate the directories because I was troubleshooting why my PV and PVC weren’t being created properly. Again, going back to STIG settings issues.

Understood but that’s the issue. I don’t fully understand what the log is telling me from the failed task and in order to release the log for posting I have to get my security team involved with approvals. It’s all a balancing act for me.

It looks like the failed task is:
Create management pod from templated deployment config.

init.yml

- name: Create management pod from templated deployment config
  k8s:
    name: "{{ ansible_operator_meta.name }}-db-management"
    kind: Deployment
    state: present
    definition: "{{ lookup('template', 'management-pod.yml.j2') }}"
    wait: true

I know not seeing the log puts you at a real disadvantage. Sorry but that’s the process I have to go through.

Does this mean restore job?

Oh, yes. Sorry for my typo causing confusion. I just edited that post to say restore.

Okay, thanks for updating.
If it fails on this task, you should investigate why the db-management pod cannot be created.

You can follow the operator logs in real time with the following command:

kubectl -n awx logs -f deployments/awx-operator-controller-manager

While the task Create management pod from templated deployment config is running, you can find the db-management pod and investigate why it fails to be created.

kubectl -n awx get pod
kubectl -n awx describe pod <the name of the db-management pod>

Thanks, never noticed that, I would be happy to be one of the first who earned the badge :smiley:
Probably there are many posts that have been forgotten to be marked as solution, so the actual number may already be over 50 :stuck_out_tongue:

kubectl describe says:
MountVolume.NewMounter initialization failed for volume "awx-backup" : path "/var/lib/postgresql/backup" does not exist.

I presume this must be for the path inside the pod?

PV and PVC already say that they are bound and the file path exists on the node where PG is running.

The PVC used to store backup files is utilized by the db-management pod, which is different from the PostgreSQL pod.

The path must exist on the node where the db-management pod will be running.

Acknowledged, but I am a little confused. I thought applying my storage.yml file should have created that file path. Yet, I don’t see that path on all my nodes. (Only the node where I manually created that path) Do I need to create that path manually or is there something more fundamentally wrong with my storage? (Again, possibly pointing back to a STIG/security setting issue.)

I am seeing a task failure for: Check to make sure backup directory exists on PVC.

Kubectl describe shows me that the pod should be running on kube05. I have created that file path on kube05 and placed the backup inside there.

ie: /var/lib/postgresql/backup/tower-openshift-backup-<date>-<time>
with the three files inside that folder.

No, paths are not created automatically by the no-provisioner storage class. You need to create them manually.

Just because the db-management pod started on kube05 once, there is no guarantee it will start on kube05 the next time unless affinity is set.
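
One way to keep the db-management pod on the node that actually holds the backup files is to narrow the nodeAffinity of the backup PV to that single node. The following is only a minimal sketch, assuming the files live on gsil-kube05 alone as described in this thread. The other values mirror the storage.yaml shown elsewhere in this thread, except that `namespace` is left out because PersistentVolumes are cluster-scoped.

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: awx-backup
spec:
  capacity:
    storage: 2Gi
  volumeMode: Filesystem
  accessModes:
  - ReadWriteMany
  persistentVolumeReclaimPolicy: Delete
  storageClassName: local-storage
  local:
    # This directory is not created for you; it must already exist on the node below.
    path: /var/lib/postgresql/backup
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          # Assumption: only this node holds the backup files.
          - gsil-kube05.idm.gsil.org
```

A pod that uses a local PV can only be scheduled onto nodes matching the PV's nodeAffinity, so listing a single node keeps the pod and the files together.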

Also, how did you configure the AWXRestore resource?

Gotcha, I’m still learning something new all the time. Thanks for that info.
I can confirm that the pod is running on that node after reviewing kubectl describe. But, yes I get it, the pod could move around.


apiVersion: awx.ansible.com/v1beta1
kind: AWXRestore
metadata:
  name: restore-awx
  namespace: awx
spec:
  deployment_name: awx
  backup_name: awxbackup-09-19-2024
  postgres_image: gsil-docker1.idm.gsil.org:5001/postgres
  postgres_image_version: '13'

Hey, don’t ignore my previous comment :frowning:

Sorry, I grabbed the old version from earlier in this thread. Shame on me, my mistake. Here is my actual restore:

---
apiVersion: awx.ansible.com/v1beta1
kind: AWXRestore
metadata:
  name: restore-awx
  namespace: awx
spec:
  deployment_name: awx
  postgres_image: gsil-docker1.idm.gsil.org:5001/postgres
  postgres_image_version: '13'

All of the backup_* parameters are missing :upside_down_face:

Well, of course I suppose it’s actually set. Anyway, please double-check the environment once more.

If the files exist at the correct path on the correct node, the PV is properly published at that path, the db-management pod can mount that PV, and the correct parameters are given to AWXRestore, then the task shouldn’t fail.

OK, but I was following directions from post #37 where you said to remove backup_name: parameter :wink:

After further review of documentation and following the link, I see that parameter should be included.

Now my question is: what is the correct name to include? Would it be whatever name I gave the backup in my backup.yml I applied to my old cluster? Something else?

backup_name: <something_here>

image

OK, wow :blush: So easy to miss something! Thanks for pointing out my error!

I’m getting a bit tired of all the miscommunication and pointless troubleshooting until now :frowning: It’s exhausting :frowning:

Here is what I have right now after carefully checking things. I don’t see what I am doing wrong but maybe you can spot the issue?

kubectl logs:

[root@gsil-kube04 ~]# kubectl logs -f -n awx awx-operator-controller-manager-6ffc56f846-2pn8n

... <output> ...

--------------------------- Ansible Task StdOut -------------------------------

 TASK [Check to make sure backup directory exists on PVC] ********************************
fatal: [localhost]: FAILED! => {"changed": true, "rc": 1, "return_code": 1, "stderr": "stat: missing operand\nTry 'stat --help' for more information.\n", "stderr_lines": ["stat: missing operand", "Try 'stat --help' for more information."], "stdout": "", "stdout_lines": []}

restore file:

[root@gsil-kube04 ~]# cat restore-awx.yaml
---
apiVersion: awx.ansible.com/v1beta1
kind: AWXRestore
metadata:
  name: restore-awx
  namespace: awx
spec:
  deployment_name: awx
  postgres_image: gsil-docker1.idm.gsil.org:5001/postgres
  postgres_image_version: '13'
  backup_pvc: awx-backup
  backup_directory: /backups/tower-openshift-backup-2024-09-19-175222

PV & PVC status:

NAME          CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                               STORAGECLASS    REASON   AGE
awx-backup    2Gi        RWX            Delete           Bound    awx/awx-backup                      local-storage            6s
postgres-pv   2Gi        RWX            Delete           Bound    awx/postgres-13-awx-postgres-13-0   local-storage            84d


[root@gsil-kube04 ~]# kubectl get pvc -n awx
NAME                            STATUS   VOLUME        CAPACITY   ACCESS MODES   STORAGECLASS    AGE
awx-backup                      Bound    awx-backup    2Gi        RWX            local-storage   12s
postgres-13-awx-postgres-13-0   Bound    postgres-pv   2Gi        RWX            local-storage   84d

directory structure on kube05:

[root@gsil-kube05 backup]# pwd
/var/lib/postgresql

[root@gsil-kube05 postgresql]# ls -lah
total 4.0K
drwx------.  4 root root   32 Sep 24 15:35 .
drwxr-xr-x. 69 root root 4.0K Jul 31 14:15 ..
drwxr-xr-x.  3 root root   54 Sep 23 14:57 backup
drwxr-xr-x.  3 root root   18 Jul  1 15:42 data

[root@gsil-kube05 postgresql]# cd backup/
[root@gsil-kube05 backup]# ls -lah
total 0
drwxr-xr-x. 3 root root 54 Sep 23 14:57 .
drwx------. 4 root root 32 Sep 24 15:35 ..
drwxr-xr-x. 2 root root 59 Sep 23 14:57 tower-openshift-backup-2024-09-19-175222

[root@gsil-kube05 backup]# cd tower-openshift-backup-2024-09-19-175222/
[root@gsil-kube05 tower-openshift-backup-2024-09-19-175222]# ls -lah
total 151M
drwxr-xr-x. 2 root root   59 Sep 23 14:57 .
drwxr-xr-x. 3 root root   54 Sep 23 14:57 ..
-rwxr-xr-x. 1 root root 3.0K Sep 23 14:57 awx_object
-rwxr-xr-x. 1 root root  51K Sep 23 14:57 secrets.yml
-rwxr-xr-x. 1 root root 150M Sep 23 14:57 tower.db

cluster storage configuration:

[root@gsil-kube04 ~]# cat storage.yaml
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    storageClass.kubernetes.io/is-default-class: "true"
  name: local-storage
  namespace: awx
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: Immediate
#volumeBindingMode: WaitForFirstConsumer

---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: postgres-pv
  namespace: awx
spec:
  capacity:
    storage: 2Gi
  volumeMode: Filesystem
  accessModes:
  - ReadWriteMany
  persistentVolumeReclaimPolicy: Delete
  storageClassName: local-storage
  local:
    path: /var/lib/postgresql/data
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - gsil-kube04.idm.gsil.org
          - gsil-kube05.idm.gsil.org
          - gsil-kube06.idm.gsil.org

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-13-awx-postgres-13-0
  namespace: awx
spec:
  storageClassName: local-storage
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 2Gi

---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: awx-backup
  namespace: awx
spec:
  capacity:
    storage: 2Gi
  volumeMode: Filesystem
  accessModes:
  - ReadWriteMany
  persistentVolumeReclaimPolicy: Delete
  storageClassName: local-storage
  local:
    path: /var/lib/postgresql/backup
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - gsil-kube04.idm.gsil.org
          - gsil-kube05.idm.gsil.org
          - gsil-kube06.idm.gsil.org

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: awx-backup
  namespace: awx
spec:
  storageClassName: local-storage
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 2Gi

kubectl describe restore-awx-db-management:

[root@gsil-kube04 localadm]# kubectl describe po/restore-awx-db-management -n awx
Name:                      restore-awx-db-management
Namespace:                 awx
Priority:                  0
Service Account:           default
Node:                      gsil-kube05.idm.gsil.org/x.x.8.38
Start Time:                Tue, 24 Sep 2024 15:48:15 +0000
Labels:                    app.kubernetes.io/component=awx
                           app.kubernetes.io/managed-by=awx-operator
                           app.kubernetes.io/operator-version=2.11.0
                           app.kubernetes.io/part-of=restore-awx
Annotations:               <none>
Status:                    Terminating (lasts <invalid>)
Termination Grace Period:  30s
IP:                        x.x.1.214
IPs:
  IP:  x.x.1.214
Containers:
  restore-awx-db-management:
    Container ID:  containerd://f705f7796e4517061ba1c5d34dcd73cf5fba53d9cd985776922f04a889d61c1d
    Image:         gsil-docker1.idm.gsil.org:5001/postgres:13
    Image ID:      gsil-docker1.idm.gsil.org:5001/postgres@sha256:5f4b5af578e8d63f371b724f7b83230125230793282cd2e08d221452dbb1fffe
    Port:          <none>
    Host Port:     <none>
    Command:
      sleep
      infinity
    State:          Running
      Started:      Tue, 24 Sep 2024 15:48:15 +0000
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  4Gi
    Requests:
      cpu:        25m
      memory:     32Mi
    Environment:  <none>
    Mounts:
      /backups from restore-awx-backup (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-vgls5 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  restore-awx-backup:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  awx-backup
    ReadOnly:   false
  kube-api-access-vgls5:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  11s   default-scheduler  Successfully assigned awx/restore-awx-db-management to gsil-kube05.idm.gsil.org
  Normal  Pulled     11s   kubelet            Container image "gsil-docker1.idm.gsil.org:5001/postgres:13" already present on machine
  Normal  Created    11s   kubelet            Created container restore-awx-db-management
  Normal  Started    11s   kubelet            Started container restore-awx-db-management
  Normal  Killing    1s    kubelet            Stopping container restore-awx-db-management

Helm configuration and deployment for AWX:

[root@gsil-kube04 ~]# cat awxvalues.yaml
AWX:
  # enable use of awx-deploy template
  enabled: true
  name: awx
  spec:
    replicas: 2
    service_type: NodePort
    nodeport_port: 30080
    admin_user: admin
    hostname: awx.idm.gsil.org
    image: gsil-docker1.idm.gsil.org:5001/quay.io/ansible/awx
    image_version: 23.7.0
    init_container_image: gsil-docker1.idm.gsil.org:5001/quay.io/ansible/awx-ee
    init_container_image_version: latest
    ee_images:
    - name: AWX EE
      image: gsil-docker1.idm.gsil.org:5001/quay.io/ansible/awx-ee:23.7.0
    ee_extra_env: |
      - name: RECEPTOR_KUBE_SUPPORT_RECONNECT
        value: enabled
    postgres_image: gsil-docker1.idm.gsil.org:5001/postgres
    postgres_image_version: "13"
    postgres_selector: |
      nodefor: psql
    control_plane_ee_image: gsil-docker1.idm.gsil.org:5001/quay.io/ansible/awx-ee:23.7.0
    redis_image: gsil-docker1.idm.gsil.org:5001/redis
    redis_image_version: "7"
    bundle_cacert_secret: awx-custom-certs
    ldap_cacert_secret: awx-custom-certs
    ldap_password_secret: awx-ldap-password
    extra_settings:
    - setting: AUTH_LDAP_SERVER_URI
      value: >-
       ... <secret_something_here> ...

customVolumes:
  postgres:
    enabled: true
    hostPath: /var/lib/postgresql
    size: 2Gi
    storageClassName: local-storage
  projects:
    enabled: true
    hostPath: /opt/projects/data
    size: 5Gi

In your AWXRestore, the parameter should be backup_dir instead of backup_directory.
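
With that one key renamed, the restore file from the earlier post would look roughly like this. It is a sketch built only from values already shown in this thread. The path assumes the PVC is mounted at /backups inside the db-management pod, as the kubectl describe output indicates.

```yaml
---
apiVersion: awx.ansible.com/v1beta1
kind: AWXRestore
metadata:
  name: restore-awx
  namespace: awx
spec:
  deployment_name: awx
  postgres_image: gsil-docker1.idm.gsil.org:5001/postgres
  postgres_image_version: '13'
  backup_pvc: awx-backup
  # backup_dir, not backup_directory
  backup_dir: /backups/tower-openshift-backup-2024-09-19-175222
```

If the operator does not recognize backup_directory, backup_dir would presumably stay empty, which would be consistent with the "stat: missing operand" error in the failed task.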
