AWX Backup Fails

The AWXRestore is designed to restore backup files created by the AWXBackup, so it won’t work if there’s only a manually created SQL file.

I previously suggested checking the Operator’s logs to identify failing tasks, and I’d recommend doing that again. Please take a look at your logs.

Acknowledged, and I previously suggested that getting logs approved for release might take me a little time to get done. Please know that I appreciate the assistance! :grinning: This whole thing has been a balancing act for me to give enough info without compromising security. Thanks for understanding.

I am trying to be sure I have done everything I can from my perspective before going to my security team to get approvals to release detailed logs.

Here is what I can put up here right now.

TASK [Get PostgreSQL configuration]
fatal: [localhost]: FAILED! => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result"}

The next section of the log is what will take some time to get approvals on. (Maybe up to 5 working days).
I can confirm that the earlier work we did to the spec file did fix several things and now we are dealing with a different error than before.

I’m not trying to force you to share the logs here :smiley: If the logs are thoroughly investigated, that’s good enough for me. It doesn’t matter who does the investigation.

TASK [Get PostgreSQL configuration]

If this task has failed, you can check what it was supposed to do by looking at the code on GitHub. Also, any censored logs can be revealed by specifying no_log: false on the CR.
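
A minimal sketch of what that could look like, assuming the failing CR is an AWXBackup and that the operator version in use exposes no_log in its spec. The metadata name is hypothetical, and deployment_name must be the name of your own AWX CR:

```yaml
---
apiVersion: awx.ansible.com/v1beta1
kind: AWXBackup
metadata:
  name: awxbackup-debug      # hypothetical name, pick your own
  namespace: awx
spec:
  deployment_name: awx       # assumption: the AWX CR is named "awx"
  no_log: false              # show the output of tasks that are otherwise censored
```

With no_log disabled, secrets such as database credentials can end up in the Operator's logs, so it is probably best to set it back to true once the failing task has been identified.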

However, before diving deeper into the investigation, could you please clarify exactly what you’re trying to do?
Also, what about my earlier question? Which operation is the Operator failing on: the AWX CR, the AWXRestore CR, or the AWXBackup CR?

I understand you’re not trying to force me to share logs. My security team was able to get the logs reviewed quicker than expected and I’ll be able to post those here shortly. I still had questions about what I am seeing in the logs and thought it might be best to post them here after getting the “OK” from security.

My intent is to back up the data in one cluster and restore the data to my new cluster. For example:
kube01 (control-plane)
kube02 (node)
kube03 (node)
##This is the old cluster

kube04 (control-plane)
kube05 (node)
kube06 (node)
##This is the new cluster

So right now my focus is on getting a proper backup to work. As you stated, I can’t manually do a backup. I must use the operator with its built-in Ansible to get the job done. I am learning a lot as we go based on your guidance in this process. You are a good teacher kurokobo!

Checking GitHub code is somewhat helpful but again I am lacking the confidence to be sure it is right. This is why I appreciate the guidance everyone has given. I thought I could use the code to manually create a backup based on what was intended in the code. Nope! I was wrong… :smile:

Hope this helps to clarify things.

In that case, this other thread might be helpful. They were similarly working on migrating AWX from one cluster to another, and that is somewhat documented.

Yes, I have seen and reviewed that thread. It does help a little but I still have some other questions. I guess I’ll just take things one step at a time. I’ll focus on getting the backup to work.

Got it, thanks.

TASK [Get PostgreSQL configuration]

So we should dive into this, here is the code: awx-operator/roles/backup/tasks/postgres.yml at 2.11.0 · ansible/awx-operator · GitHub

- name: Get PostgreSQL configuration
  k8s_info:
    kind: Secret
    namespace: '{{ ansible_operator_meta.namespace }}'
    name: "{{ this_awx['resources'][0]['status']['postgresConfigurationSecret'] }}"
  register: pg_config
  no_log: "{{ no_log }}"

I believe, almost certainly, the value of this_awx is incorrect. this_awx is configured here: awx-operator/roles/backup/tasks/init.yml at 2.11.0 · ansible/awx-operator · GitHub

- name: Look up details for this deployment
  k8s_info:
    api_version: "{{ api_version }}"
    kind: "AWX"
    name: "{{ deployment_name }}"
    namespace: "{{ ansible_operator_meta.namespace }}"
  register: this_awx

I think there’s a high chance that the deployment_name for AWXBackup is incorrect.

I took another look at the OP of this thread, and it’s clear that having deployment_name: awx-operator-controller-manager is wrong. If you’re still using this value, please change it to deployment_name: awx and give it another try.
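
To illustrate the relationship, here is a minimal sketch. It assumes the AWX CR is named awx in the awx namespace; the AWXBackup name is hypothetical. The lookup task quoted above searches for kind: AWX with name: deployment_name, so the value has to be the name of the AWX custom resource, not the name of the Operator's own Deployment:

```yaml
---
# Assumed existing AWX instance (shown as comments only, for reference):
#   apiVersion: awx.ansible.com/v1beta1
#   kind: AWX
#   metadata:
#     name: awx              # <- this is the value deployment_name must match
#     namespace: awx
apiVersion: awx.ansible.com/v1beta1
kind: AWXBackup
metadata:
  name: awxbackup-example    # hypothetical name
  namespace: awx
spec:
  deployment_name: awx       # name of the AWX CR, not awx-operator-controller-manager
```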

That was it. The job is successful! :joy: I now have a folder in my defined location with a folder structure that has the date and time. In the folder are three files: awx_object, secrets.yml and tower.db

I am going to transfer the folder to my new cluster, modify my awx restore file accordingly and give it a go. Fingers crossed :crossed_fingers:

OK, so the backup is good. Now my restore is failing. Here are my .yml files for both the backup and restore:

backup:

---
apiVersion: awx.ansible.com/v1beta1
kind: AWXBackup
metadata:
  name: awxbackup-09-19-2024
  namespace: awx
spec:
  deployment_name: awx
  backup_pvc: awx-backup
  clean_backup_on_delete: true
  postgres_image: gsil-docker1.idm.gsil.org:5001/postgres
  postgres_image_version: '13'

restore:

---
apiVersion: awx.ansible.com/v1beta1
kind: AWXRestore
metadata:
  name: restore-awx
  namespace: awx
spec:
  deployment_name: awx
  backup_name: awxbackup-09-19-2024
  postgres_image: gsil-docker1.idm.gsil.org:5001/postgres
  postgres_image_version: '13'

My error message is:

--------------------------- Ansible Task StdOut -------------------------------

 TASK [Fail early if pvc is defined but does not exist] ********************************
fatal: [localhost]: FAILED! => {"changed": false, "msg": "Cannot read the backup status variables for AWXBackup awxbackup-09-19-2024."}

The backup produces a folder called:
tower-openshift-backup-<date>-<time>

Is my backup name in the spec section wrong for the backup? I couldn’t tell from the other post.

I’m not sure what you did to transfer the “folder”, but it looks like you didn’t create a PVC called awxbackup-09-19-2024 on the new cluster.

Edit: It doesn’t see an AWXBackup called awxbackup-09-19-2024 on the new cluster. The AWXRestore CR is designed to restore from AWXBackup, not just a PVC… looking around for something…

So my PVC should be named awxbackup-09-19-2024? Do I understand that correctly?

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: awxbackup-09-19-2024
  namespace: awx
spec:
  storageClassName: local-storage
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 2Gi

I edited my post because I didn’t fully read the error message. It’s looking for kind: AWXBackup by that name, not the PVC.

You might be able to cheat it by creating an AWXBackup on the new cluster (wait for it to complete successfully), and replacing the backup PVC on the new cluster with the one you copied from the old cluster.

Delete the AWXRestore until the AWXBackup is ready. I don’t know if the Operator will continue retrying the restore, so we don’t want it to kick off too soon.

So I think you are telling me that my PV or PVC are not named as expected. Is that correct?

What should they be? Basically, I am trying to understand the mapping between names, jobs, pv, pvc, etc.

The PVC is a secondary item. That’s where your physical backup data is.

What I am talking about is the AWXBackup CR itself. You used this to create the backup in the first place on the old cluster. Then, on the new cluster, you created an AWXRestore to restore from that backup. When the operator processes the restore, it looks for the AWXBackup to check its status but can’t find it, because it only exists on the other cluster.

So the workaround that I’m suggesting here is to run a backup on the new cluster. This involves creating an AWXBackup (using the same specs as before). It won’t contain any data we care about, since it’s backing up a fresh instance, but we can replace the physical data in the PVC that gets created by the AWXBackup.

Then the AWXRestore should find not only the CR, but the PVC with data you really want.

Alternatively, you could expose the postgres pod on your old cluster and run a migration on the new AWX with the old_database_secret pointed to your old cluster and the exposed PostgreSQL port.
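
For reference, a rough sketch of what that migration setup might look like. This is an assumption based on my reading of the Operator's migration docs, which (as far as I recall) name the spec field old_postgres_configuration_secret. The Secret name, host, port, and credentials below are all placeholders, so check the docs for your Operator version before relying on it:

```yaml
---
# Hypothetical Secret describing how to reach the OLD cluster's database
apiVersion: v1
kind: Secret
metadata:
  name: awx-old-postgres-configuration   # placeholder name
  namespace: awx
stringData:
  host: "<address of the old cluster's exposed postgres service>"
  port: "<exposed port, 5432 by default>"
  database: "<database name>"
  username: "<database user>"
  password: "<database password>"
type: Opaque
---
# The NEW AWX instance, pointed at the old database for a one-time migration
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
  namespace: awx
spec:
  # assumption: field name as I remember it from the migration docs
  old_postgres_configuration_secret: awx-old-postgres-configuration
```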

The AWXRestore is designed to cover the following two scenarios:

  • A) Restoring from an existing AWXBackup CR
    • the backup_name param is for this scenario
  • B) Restoring from existing backup files in the PVC
    • the backup_dir param is for this scenario

You have performed (A), but this time you should proceed to (B). The idea from @Denney-tech is technically possible, but it would be a bit complicated.

So you should:

  1. Create the PV and PVC in the new cluster to place backup files
  2. Place your backup directory (tower-openshift-backup-<date>-<time>) on the root of the PV
  3. Specify the following params for AWXRestore
    spec:
      deployment_name: awx
      # backup_name: awxbackup-09-19-2024  <- remove this
      postgres_image: gsil-docker1.idm.gsil.org:5001/postgres
      postgres_image_version: '13'
    
      # add following params
      backup_pvc: "<the name of the PVC that you created on the new cluster in step 1>"
      backup_dir: "/backups/<the name of the backup directory, e.g. tower-openshift-backup-<date>-<time>>"
                 # ^^^^^^^^^ note: `/backups` is mandatory since your PV will be mounted as `/backups`
    

Refer to the README of the restore role for details: awx-operator/roles/restore at 2.11.0 · ansible/awx-operator · GitHub
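
Putting the three steps together, a minimal sketch could look like the following. The PV/PVC names and the host path are hypothetical, the hostPath volume type and ReadWriteOnce access mode are assumptions about the storage setup, and the `<date>-<time>` placeholder has to be replaced with the real directory name:

```yaml
---
# Step 1: statically provisioned PV and PVC to hold the backup files
apiVersion: v1
kind: PersistentVolume
metadata:
  name: awx-restore-pv                 # hypothetical name
spec:
  capacity:
    storage: 2Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  hostPath:
    path: /data/awx-restore            # hypothetical path; step 2: copy the backup directory here
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: awx-restore-pvc                # hypothetical name
  namespace: awx
spec:
  storageClassName: local-storage
  volumeName: awx-restore-pv           # bind explicitly to the PV above
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi
---
# Step 3: restore from files in the PVC (scenario B), no backup_name
apiVersion: awx.ansible.com/v1beta1
kind: AWXRestore
metadata:
  name: restore-awx
  namespace: awx
spec:
  deployment_name: awx
  postgres_image: gsil-docker1.idm.gsil.org:5001/postgres
  postgres_image_version: '13'
  backup_pvc: awx-restore-pvc
  backup_dir: "/backups/tower-openshift-backup-<date>-<time>"
```

The directory on the host must exist and be readable before the restore runs, since the restore expects to find awx_object, secrets.yml and tower.db inside backup_dir.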

I didn’t realize B) was an option here. Always learning something from you.

P.S. @kurokobo You’re going to get that know-it-all badge soon. Probably as soon as @jeremytourville finishes migrating and marks one of your replies as the answer.

  1. Yes, I was already doing that
  2. Yes, I was already doing that
  3. This is VERY helpful. I understand much better what I should do.

Let me try the restore with the correct parameters and see what I get. BRB with some results…

I have some troubleshooting to do on my cluster. My PV and PVC are not getting created. I am 99.9% certain this is due to STIG (security settings) I had to apply to my system. I will have to go back and review those settings to see which one is causing the volume creation to fail.

Update:
OK, I had to create the correct path referenced in my storage and chmod -R 755 that directory. So AWX is running in the new cluster again after I had deleted my deployment and the postgres folders for the DB and Backup. Now, I applied the restore job and the job fails.

The Ansible logs show a reconciler error and an “event runner on failed” message.