Acknowledged, and I previously suggested that getting logs approved for release might take me a little time to get done. Please know that I appreciate the assistance! This whole thing has been a balancing act for me to give enough info without compromising security. Thanks for understanding.
I am trying to be sure I have done everything I can from my perspective before going to my security team to get approvals to release detailed logs.
Here is what I can put up here right now.
TASK [Get PostgreSQL configuration]
fatal: [localhost]: FAILED! => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result"}
The next section of the log is what will take some time to get approvals on. (Maybe up to 5 working days).
I can confirm that the earlier work we did to the spec file did fix several things and now we are dealing with a different error than before.
I’m not trying to force you to share the logs here. If the logs are thoroughly investigated, that’s good enough for me. It doesn’t matter who does the investigation.
TASK [Get PostgreSQL configuration]
If this task has failed, you can check what it was supposed to do by looking at the code on GitHub. Also, any censored logs can be revealed by specifying no_log: false on the CR.
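As a rough illustration of what that would look like (a sketch under assumptions: the CR name and namespace here are placeholders for your actual deployment):

```yaml
# Hedged sketch: disabling output censoring on the AWX CR so failed
# operator tasks show their full error output.
# metadata.name and namespace are assumptions; match them to your deployment.
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
  namespace: awx
spec:
  no_log: false   # operator tasks will no longer hide their output
```

Remember to set it back to true (or remove it) once you are done troubleshooting, since the uncensored output can include sensitive values.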
However, before diving deeper into the investigation, could you please clarify exactly what you’re trying to do?
And what about my earlier question: which operation is the Operator failing on, the AWX CR, the AWXRestore CR, or the AWXBackup CR?
I understand you’re not trying to force me to share logs. My security team was able to get the logs reviewed quicker than expected and I’ll be able to post those here shortly. I still had questions about what I am seeing in the logs and thought it might be best to post them here after getting the “OK” from security.
My intent is to backup the data in one cluster and restore the data to my new cluster. For example:
kube01 (control-plane)
kube02 (node)
kube03 (node)
##This is the old cluster
kube04 (control-plane)
kube05 (node)
kube06 (node)
##This is the new cluster
So right now my focus is on getting a proper backup to work. As you stated, I can’t manually do a backup. I must use the operator with its built-in Ansible to get the job done. I am learning a lot as we go based on your guidance in this process. You are a good teacher kurokobo!
Checking GitHub code is somewhat helpful but again I am lacking the confidence to be sure it is right. This is why I appreciate the guidance everyone has given. I thought I could use the code to manually create a backup based on what was intended in the code. Nope! I was wrong…
In that case, this other thread might be helpful. They were similarly working on migrating AWX from one cluster to another, and that is somewhat documented.
Yes, I have seen and reviewed that thread. It does help a little but I still have some other questions. I guess I’ll just take things one step at a time. I’ll focus on getting the backup to work.
- name: Look up details for this deployment
  k8s_info:
    api_version: "{{ api_version }}"
    kind: "AWX"
    name: "{{ deployment_name }}"
    namespace: "{{ ansible_operator_meta.namespace }}"
  register: this_awx
I think there’s a high chance that the deployment_name for AWXBackup is incorrect.
I took another look at the OP of this thread, and it’s clear that having deployment_name: awx-operator-controller-manager is wrong. If you’re still using this value, please change it to deployment_name: awx and give it another try.
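To make the fix concrete, the corrected AWXBackup might look roughly like this (a sketch under assumptions: the metadata name and namespace are placeholders, and I’m assuming the AWX CR itself is named awx):

```yaml
# Hedged sketch of a corrected AWXBackup CR.
# metadata.name and namespace are assumptions; adjust to your environment.
apiVersion: awx.ansible.com/v1beta1
kind: AWXBackup
metadata:
  name: awxbackup-09-19-2024
  namespace: awx
spec:
  # deployment_name must match metadata.name of the AWX CR,
  # NOT the operator's Deployment (awx-operator-controller-manager).
  deployment_name: awx
```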
That was it. The job is successful! I now have a folder in my defined location with a folder structure that has the date and time. In the folder are three files: awx_object, secrets.yml and tower.db
I am going to transfer the folder to my new cluster, modify my awx restore file accordingly and give it a go. Fingers crossed
--------------------------- Ansible Task StdOut -------------------------------
TASK [Fail early if pvc is defined but does not exist] ********************************
fatal: [localhost]: FAILED! => {"changed": false, "msg": "Cannot read the backup status variables for AWXBackup awxbackup-09-19-2024."}
The backup produces a folder called: tower-openshift-backup-<date>-<time>
Is my backup name in the spec section wrong for the backup? I couldn’t tell from the other post.
I’m not sure what you did to transfer the “folder”, but it looks like you didn’t create a PVC called awxbackup-09-19-2024 on the new cluster.
Edit: It doesn’t see an AWXBackup called awxbackup-09-19-2024 on the new cluster. The AWXRestore CR is designed to restore from AWXBackup, not just a PVC… looking around for something…
I edited my post because I didn’t fully read the error message. It’s looking for kind: AWXBackup by that name, not the PVC.
You might be able to cheat it by creating an AWXBackup on the new cluster (wait for it to complete successfully), and replacing the backup PVC on the new cluster with the one you copied from the old cluster.
Delete the AWXRestore until the AWXBackup is ready. I don’t know if the Operator will continue retrying the restore, so we don’t want it to kick off too soon.
The PVC is a secondary item. That’s where your physical backup data is.
What I am talking about is the AWXBackup CR itself. You used this to create the backup in the first place on the old cluster. Then on the new cluster, you’ve created an AWXRestore to restore the backup from, but when the operator processes the restore, it’s looking for the AWXBackup to check its status, but can’t find it because it’s on the other cluster.
So the workaround that I’m suggesting here is to run a backup on the new cluster. This involves creating an AWXBackup (using the same specs as before). This won’t have any data we care about since it’s backing up a fresh instance, however, we can replace the physical data in the PVC that gets created by the AWXBackup.
Then the AWXRestore should find not only the CR, but the PVC with data you really want.
Alternatively, you could expose the postgres pod on your old cluster and run a migration on the new AWX with the old_database_secret pointed to your old cluster and its exposed postgresql port.
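For reference, in the awx-operator docs this migration path is driven by a Secret holding the old database's connection details (the CR parameter is documented as old_postgres_configuration_secret; the wording "old_database_secret" above is shorthand). A rough sketch, with every value a placeholder to fill in for your environment:

```yaml
# Hedged sketch: a Secret with the old cluster's database connection details,
# referenced from the AWX CR via old_postgres_configuration_secret.
# All names and values here are placeholder assumptions.
apiVersion: v1
kind: Secret
metadata:
  name: awx-old-postgres-configuration
  namespace: awx
stringData:
  host: <externally reachable address of the old cluster's postgres>
  port: "5432"
  database: awx
  username: awx
  password: <password from the old cluster's postgres secret>
```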
The AWXRestore is designed to cover the following two scenarios:
A) Restoring from an existing AWXBackup CR
the backup_name param is for this scenario
B) Restoring from existing backup files in a PVC
the backup_dir param is for this scenario
You have performed (A), but this time you should proceed with (B). The idea from @Denney-tech is technically possible, but it would be a bit complicated.
So you should:
Create the PV and PVC in the new cluster to hold the backup files
Place your backup directory (tower-openshift-backup-<date>-<time>) at the root of the PV
Specify the following params for the AWXRestore
spec:
  deployment_name: awx
  # backup_name: awxbackup-09-19-2024 <- remove this
  postgres_image: gsil-docker1.idm.gsil.org:5001/postgres
  postgres_image_version: '13'
  # add the following params
  backup_pvc: "<the name of the pvc you created on the new cluster in step 1>"
  backup_dir: "/backups/<the name of the backup directory, e.g. tower-openshift-backup-<date>-<time>>"
  # ^ note: `/backups` is mandatory since your PV will be mounted as `/backups`
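For step 1, the PV/PVC pair could look roughly like the sketch below. This assumes hostPath storage; the names, capacity, and path are all placeholders to adapt to your storage setup (e.g. an NFS or CSI volume instead).

```yaml
# Hedged sketch of a PV/PVC to hold the copied backup files.
# hostPath storage, names, capacity, and paths are assumptions.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: awx-backup-pv
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: /data/awx-backup   # place tower-openshift-backup-<date>-<time> here
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: awx-backup-pvc       # use this name as backup_pvc in the AWXRestore
  namespace: awx
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  volumeName: awx-backup-pv  # bind explicitly to the PV above
```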
I didn’t realize B) was an option here. Always learning something from you.
P.s. @kurokobo You’re going to get that know-it-all badge soon. Probably as soon as @jeremytourville finishes migrating and marks one of your replies as the answer.
I have some troubleshooting to do on my cluster. My PV and PVC are not getting created. I am 99.9% certain this is due to STIG (security settings) I had to apply to my system. I will have to go back and review those settings to see which one is causing the volume creation to fail.
Update:
OK, I had to create the correct path referenced in my storage and chmod -R 755 that directory. So AWX is running in the new cluster again after I had deleted my deployment and the postgres folders for the DB and Backup. Now, I applied the restore job and the job fails.
There is a reconciler error and event runner on failed in the Ansible logs.