AWX Backup Fails

kurokobo · September 18, 2024, 3:48pm

The AWXRestore is designed to restore backup files created by the AWXBackup, so it won’t work if there’s only a manually created SQL file.

I previously suggested checking the Operator’s logs to identify failing tasks, and I’d recommend doing that again. Please take a look at your logs.

jeremytourville · September 18, 2024, 4:38pm

Acknowledged. As I mentioned previously, getting the logs approved for release might take me a little time. Please know that I appreciate the assistance! This whole thing has been a balancing act for me to give enough info without compromising security. Thanks for understanding.

I am trying to be sure I have done everything I can from my perspective before going to my security team to get approvals to release detailed logs.

Here is what I can put up here right now.

TASK [Get PostgreSQL configuration]
fatal: [localhost]: FAILED! => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result"}

The next section of the log is what will take some time to get approvals on. (Maybe up to 5 working days).
I can confirm that the earlier work we did to the spec file did fix several things and now we are dealing with a different error than before.

kurokobo · September 19, 2024, 2:29pm

I’m not trying to force you to share the logs here. If the logs are thoroughly investigated, that’s good enough for me. It doesn’t matter who does the investigation.

TASK [Get PostgreSQL configuration]

If this task has failed, you can check what it was supposed to do by looking at the code on GitHub. Also, any censored logs can be revealed by specifying no_log: false on the CR.
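As a rough sketch of what that looks like on an AWXBackup CR (the metadata values here are placeholders I made up, and I’m assuming the AWX instance is named awx):

```yaml
# Sketch only: names below are hypothetical, adjust them to your environment.
apiVersion: awx.ansible.com/v1beta1
kind: AWXBackup
metadata:
  name: awxbackup-example   # hypothetical name
  namespace: awx            # assumed namespace
spec:
  deployment_name: awx      # assumed name of the AWX CR
  no_log: false             # show task output that is otherwise censored
```

Uncensored output may contain secrets, so it is worth setting this back to true once the failing task has been identified.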

However, before diving deeper into the investigation, could you please clarify exactly what you’re trying to do?
And what about the concern I raised earlier: which operation is the Operator failing on? The AWX CR, the AWXRestore CR, or the AWXBackup CR?

jeremytourville · September 19, 2024, 2:43pm

I understand you’re not trying to force me to share logs. My security team was able to get the logs reviewed quicker than expected and I’ll be able to post those here shortly. I still had questions about what I am seeing in the logs and thought it might be best to post them here after getting the “OK” from security.

My intent is to back up the data in one cluster and restore the data to my new cluster. For example:
kube01 (control-plane)
kube02 (node)
kube03 (node)
##This is the old cluster

kube04 (control-plane)
kube05 (node)
kube06 (node)
##This is the new cluster

So right now my focus is on getting a proper backup to work. As you stated, I can’t manually do a backup. I must use the operator with its built-in Ansible to get the job done. I am learning a lot as we go based on your guidance in this process. You are a good teacher kurokobo!

Checking GitHub code is somewhat helpful but again I am lacking the confidence to be sure it is right. This is why I appreciate the guidance everyone has given. I thought I could use the code to manually create a backup based on what was intended in the code. Nope! I was wrong…

Hope this helps to clarify things.

Denney-tech · September 19, 2024, 3:18pm

In that case, this other thread might be helpful. They were similarly working on migrating AWX from one cluster to another, and that is somewhat documented.

jeremytourville · September 19, 2024, 3:24pm

Yes, I have seen and reviewed that thread. It does help a little but I still have some other questions. I guess I’ll just take things one step at a time. I’ll focus on getting the backup to work.

kurokobo · September 19, 2024, 3:45pm

Got it, thanks.

TASK [Get PostgreSQL configuration]

So let’s dive into this. Here is the code: awx-operator/roles/backup/tasks/postgres.yml at 2.11.0 · ansible/awx-operator · GitHub

- name: Get PostgreSQL configuration
  k8s_info:
    kind: Secret
    namespace: '{{ ansible_operator_meta.namespace }}'
    name: "{{ this_awx['resources'][0]['status']['postgresConfigurationSecret'] }}"
  register: pg_config
  no_log: "{{ no_log }}"

I believe, almost certainly, the value of this_awx is incorrect. this_awx is configured here: awx-operator/roles/backup/tasks/init.yml at 2.11.0 · ansible/awx-operator · GitHub

- name: Look up details for this deployment
  k8s_info:
    api_version: "{{ api_version }}"
    kind: "AWX"
    name: "{{ deployment_name }}"
    namespace: "{{ ansible_operator_meta.namespace }}"
  register: this_awx

I think there’s a high chance that the deployment_name for AWXBackup is incorrect.

I took another look at the OP of this thread, and it’s clear that having deployment_name: awx-operator-controller-manager is wrong. If you’re still using this value, please change it to deployment_name: awx and give it another try.
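To illustrate the relationship, here is a minimal sketch, assuming the AWX CR is named awx and lives in the awx namespace (the AWXBackup name is a placeholder). The first document only shows where the name comes from and is not meant to be applied as-is:

```yaml
# Illustration only: the AWX instance, assumed to be named "awx".
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx                 # <- the value that deployment_name has to match
  namespace: awx
---
apiVersion: awx.ansible.com/v1beta1
kind: AWXBackup
metadata:
  name: awxbackup-example   # hypothetical name
  namespace: awx
spec:
  # Must be the metadata.name of the AWX CR above,
  # not the name of the Operator's own Deployment.
  deployment_name: awx
```

The backup role looks up a resource of kind AWX by this name, which is why pointing it at awx-operator-controller-manager leaves this_awx empty.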

jeremytourville · September 19, 2024, 4:02pm

That was it. The job is successful! I now have a folder in my defined location with a folder structure that has the date and time. In the folder are three files: awx_object, secrets.yml and tower.db

I am going to transfer the folder to my new cluster, modify my awx restore file accordingly and give it a go. Fingers crossed

jeremytourville · September 19, 2024, 6:45pm

OK, so the backup is good. Now my restore is failing. Here are my .yml files for both the backup and restore:

backup:

---
apiVersion: awx.ansible.com/v1beta1
kind: AWXBackup
metadata:
  name: awxbackup-09-19-2024
  namespace: awx
spec:
  deployment_name: awx
  backup_pvc: awx-backup
  clean_backup_on_delete: true
  postgres_image: gsil-docker1.idm.gsil.org:5001/postgres
  postgres_image_version: '13'

restore:

---
apiVersion: awx.ansible.com/v1beta1
kind: AWXRestore
metadata:
  name: restore-awx
  namespace: awx
spec:
  deployment_name: awx
  backup_name: awxbackup-09-19-2024
  postgres_image: gsil-docker1.idm.gsil.org:5001/postgres
  postgres_image_version: '13'

My error message is:

--------------------------- Ansible Task StdOut -------------------------------

 TASK [Fail early if pvc is defined but does not exist] ********************************
fatal: [localhost]: FAILED! => {"changed": false, "msg": "Cannot read the backup status variables for AWXBackup awxbackup-09-19-2024."}

The backup produces a folder called:
tower-openshift-backup-<date>-<time>

Is my backup name in the spec section wrong for the backup? I couldn’t tell from the other post.

Denney-tech · September 19, 2024, 6:53pm

~~I’m not sure what you did to transfer the “folder”, but it looks like you didn’t create a PVC called awxbackup-09-19-2024 on the new cluster.~~

Edit: It doesn’t see an AWXBackup called awxbackup-09-19-2024 on the new cluster. The AWXRestore CR is designed to restore from AWXBackup, not just a PVC… looking around for something…

jeremytourville · September 19, 2024, 7:05pm

So my PVC should be named awxbackup-09-19-2024? Do I understand that correctly?

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: awxbackup-09-19-2024
  namespace: awx
spec:
  storageClassName: local-storage
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 2Gi

Denney-tech · September 19, 2024, 7:07pm

I edited my post because I didn’t fully read the error message. It’s looking for kind: AWXBackup by that name, not the PVC.

You might be able to cheat it by creating an AWXBackup on the new cluster (wait for it to complete successfully), and replacing the backup PVC on the new cluster with the one you copied from the old cluster.

Denney-tech · September 19, 2024, 7:10pm

Delete the AWXRestore until the AWXBackup is ready. I don’t know if the Operator will continue retrying the restore, so we don’t want it to kick off too soon.

jeremytourville · September 19, 2024, 7:28pm

So I think you are telling me that my PV or PVC are not named as expected. Is that correct?

What should they be? Basically, I am trying to understand the mapping between names, jobs, pv, pvc, etc.

Denney-tech · September 19, 2024, 7:36pm

The PVC is a secondary item. That’s where your physical backup data is.

What I am talking about is the AWXBackup CR itself. You used this to create the backup in the first place on the old cluster. Then, on the new cluster, you created an AWXRestore to restore from that backup. When the operator processes the restore, it looks for the AWXBackup to check its status, but it can’t find it because it’s on the other cluster.

So the workaround that I’m suggesting here is to run a backup on the new cluster. This involves creating an AWXBackup (using the same specs as before). This won’t have any data we care about since it’s backing up a fresh instance. However, we can replace the physical data in the PVC that gets created by the AWXBackup.

Then the AWXRestore should find not only the CR, but the PVC with data you really want.

Denney-tech · September 19, 2024, 7:42pm

Alternatively, you could expose the postgres pod on your old cluster and run a migration on the new AWX with old_postgres_configuration_secret pointing to your old cluster and the exposed PostgreSQL port.
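A rough sketch of that migration route, going from memory of the operator’s migration docs. I’m assuming the spec parameter is old_postgres_configuration_secret; the secret name and every value in angle brackets are placeholders:

```yaml
# Sketch only: all names and values here are assumptions or placeholders.
apiVersion: v1
kind: Secret
metadata:
  name: awx-old-postgres-configuration   # hypothetical name
  namespace: awx
stringData:
  host: <address where the old cluster exposes postgres>
  port: "<exposed port>"
  database: <database name on the old instance>
  username: <database user on the old instance>
  password: <database password on the old instance>
type: Opaque
---
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
  namespace: awx
spec:
  # Points the new AWX at the old database so the Operator can migrate the data.
  old_postgres_configuration_secret: awx-old-postgres-configuration
```

If I remember correctly, the new instance also needs the old instance’s secret key for the migrated data to be decryptable, so check the migration docs before trying this.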

kurokobo · September 19, 2024, 11:55pm

The AWXRestore is designed to cover the following two scenarios:

A) Restoring from an existing AWXBackup CR
- the backup_name param is for this scenario
B) Restoring from existing backup files in the PVC
- the backup_dir param is for this scenario

You have performed (A), but this time you should proceed to (B). The idea from @Denney-tech is technically possible, but it would be a bit complicated.

So you should:

1. Create the PV and PVC in the new cluster to hold the backup files
2. Place your backup directory (tower-openshift-backup-<date>-<time>) at the root of the PV
3. Specify the following params for AWXRestore

spec:
  deployment_name: awx
  # backup_name: awxbackup-09-19-2024  <- remove this
  postgres_image: gsil-docker1.idm.gsil.org:5001/postgres
  postgres_image_version: '13'

  # add the following params
  backup_pvc: "<the name of the PVC that you created on the new cluster in step 1>"
  backup_dir: "/backups/<the name of the backup directory, e.g. tower-openshift-backup-<date>-<time>>"
             # ^^^^^^^^^ note: `/backups` is mandatory since your PV will be mounted as `/backups`

Refer to the README of the restore role for details: awx-operator/roles/restore at 2.11.0 · ansible/awx-operator · GitHub

Denney-tech · September 20, 2024, 4:35am

I didn’t realize B) was an option here. Always learning something from you.

P.s. @kurokobo You’re going to get that know-it-all badge soon. Probably as soon as @jeremytourville finishes migrating and marks one of your replies as the answer.

jeremytourville · September 20, 2024, 11:37am

Yes, I was already doing that
Yes, I was already doing that
This is VERY helpful. I understand much better what I should do.

Let me try the restore with the correct parameters and see what I get. BRB with some results…

jeremytourville · September 20, 2024, 12:33pm

I have some troubleshooting to do on my cluster. My PV and PVC are not getting created. I am 99.9% certain this is due to STIG (security settings) I had to apply to my system. I will have to go back and review those settings to see which one is causing the volume creation to fail.

Update:
OK, I had to create the correct path referenced in my storage and chmod -R 755 that directory. So AWX is running in the new cluster again after I had deleted my deployment and the postgres folders for the DB and Backup. Now, I applied the restore job and the job fails.

There is a “reconciler error” and an “event runner on failed” message in the Ansible logs.
