Hi
As planned in my previous post, I attempted to upgrade my AWX setup using Helm: awx-operator from 2.9.0 to 2.19.0, and therefore AWX from 23.5.1 to 24.6.0.
Note: I'm a K8s beginner, so my question here may well be a dumb one.
1st attempt
Upgrade using helm upgrade my-awx-operator awx-operator/awx-operator
I had exactly this issue about PostgreSQL 15 not being able to create a directory, but I wasn't able to fix it by myself. So I decided to first do a Helm rollback, then upgrade to a version lower than 2.19.0.
2nd attempt
I rolled back to 2.9.0 using helm rollback my-awx-operator, since 2.9.0 was the previous release.
My current issue is that the awx-xxx-web pod fails because it apparently doesn't know how to connect to the PostgreSQL DB:
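For anyone following along, the whole sequence was roughly this (the awx namespace and the explicit --version flag are shown here for clarity and may differ from what you need):
helm repo update
helm upgrade my-awx-operator awx-operator/awx-operator -n awx --version 2.19.0
helm rollback my-awx-operator -n awx   # with no revision given, Helm rolls back to the previous one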
...brutally killing workers...
2024-07-05 09:12:35,115 INFO stopped: nginx (exit status 0)
2024-07-05 09:12:35,115 INFO stopped: nginx (exit status 0)
2024-07-05 09:12:35,119 WARNING [-] awx.conf.settings Database settings are not available, using defaults. error: connection is bad: Name or service not known
2024-07-05 09:12:35,119 WARNING Database settings are not available, using defaults. error: connection is bad: Name or service not known
2024-07-05 09:12:35,107 INFO [-] daphne.server Killed 0 pending application instances
2024-07-05 09:12:35,107 INFO Killed 0 pending application instances
2024-07-05 09:12:35,638 INFO stopped: daphne (exit status 0)
2024-07-05 09:12:35,638 INFO stopped: daphne (exit status 0)
worker 1 buried after 1 seconds
worker 2 buried after 1 seconds
worker 3 buried after 1 seconds
worker 4 buried after 1 seconds
worker 5 buried after 1 seconds
binary reloading uWSGI...
At the moment, the original postgres-13-0 pod is still up with its data (I'm able to connect using DBeaver).
The Kubernetes secret (postgres-configuration) is also still present with the correct credentials.
It seems like the web pod doesn't know how to get this information, but I'm a bit stuck with my limited K8s knowledge…
Is there any advice, or anything to look at, to make sure my web pod is configured correctly?
Have the hostnames included in postgres-configuration been rolled back to point to PSQL 13? Please check if they have been reverted to 13 after being changed to 15 during the upgrade.
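Something like the following should show which hostname the web pod will use; the secret and namespace names below are just guesses for a typical setup, so adjust them to yours:
kubectl -n awx get secret awx-infra-postgres-configuration -o jsonpath='{.data.host}' | base64 -d
kubectl -n awx get svc | grep postgres   # check that a service with that hostname actually exists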
Hi
The hostname was awx-infra-postgres-15 (base64 encoded), so I just updated this value to awx-infra-postgres-13 (base64 encoded).
I also manually changed the annotations from awx-infra-postgres-15 to awx-infra-postgres-13.
Finally it came back online and is working! Huge thanks!
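For anyone hitting the same issue, the change was essentially this (the secret name and namespace are what I'd expect for a setup like mine, so double-check yours first):
kubectl -n awx patch secret awx-infra-postgres-configuration \
  -p "{\"data\":{\"host\":\"$(echo -n awx-infra-postgres-13 | base64)\"}}"
# the annotations can then be fixed by hand with:
kubectl -n awx edit secret awx-infra-postgres-configuration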
One more question: I saw on GitHub that there are issues/PRs about the upgrade from PSQL 13 to PSQL 15, and it looks like UID 26 should have write permissions on the PV, right?
In my case, should I launch the upgrade again and temporarily mount the PV to change those permissions, or is there a workaround?
@kurokobo as your advice didn't work in my case, I finally found a workaround, as mentioned on GitHub here and here:
Create a temporary pod (a rough sketch is shown after these steps)
Map the existing postgres-15 PV to that pod
Change directory ownership and permissions:
chown 26:0 data/
chmod 700 data/
Kill the temporary pod, then kill the failing postgres-15 pod
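For reference, a minimal sketch of what I mean by the temporary pod, assuming the namespace is awx and the PVC is named postgres-15-awx-infra-postgres-15-0 (check yours with kubectl -n awx get pvc):
kubectl -n awx apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: pg15-pv-fix
spec:
  containers:
  - name: shell
    image: busybox
    command: ["sleep", "3600"]
    volumeMounts:
    - name: pgdata
      mountPath: /mnt/pgdata
  volumes:
  - name: pgdata
    persistentVolumeClaim:
      claimName: postgres-15-awx-infra-postgres-15-0   # guess based on my instance name, adjust to yours
EOF
kubectl -n awx exec pg15-pv-fix -- chown 26:0 /mnt/pgdata/data
kubectl -n awx exec pg15-pv-fix -- chmod 700 /mnt/pgdata/data
kubectl -n awx delete pod pg15-pv-fix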
The pod regenerates by itself and then starts performing the migration. I had some errors in the logs that needed to be fixed with kubectl apply --server-side -k "github.com/ansible/awx-operator/config/crd?ref=2.19.1", as mentioned in the same issue 1907 at the beginning of this post.
Conclusion: everything went very fast, and the DB migration from PGSQL 13 to PGSQL 15 seems to have run smoothly.
I just have to delete the PV/PVC related to the old PGSQL 13 and I'm 100% done.
Thanks again for your help! (here, but also all the times I've read you on GitHub)
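For completeness, the cleanup should be something like this (the PVC name is a guess based on my instance name; check with kubectl -n awx get pvc first):
kubectl -n awx get pvc | grep postgres-13
kubectl -n awx delete pvc postgres-13-awx-infra-postgres-13-0
# the bound PV is then released or deleted, depending on its reclaim policy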
If postgres_data_volume_init is set to true, then the chmod and chown commands you manually executed should have been performed automatically, so I'm not sure why that didn't resolve the issue.
However, even if you add parameters and perform a helm upgrade, it might take some time for the AWX Operator to actually start a new reconciliation loop using those new parameters.
Also, depending on the failed tasks, there might be cases where condition checks are not correctly performed in the next loop, and that could be the reason why the parameters were not properly applied.
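Just for completeness, if you ever want to retry with that parameter set directly on the AWX resource instead of waiting for the Helm values to propagate, something like this should do it (the resource name awx-infra is just my guess from your hostnames):
kubectl -n awx patch awx awx-infra --type merge -p '{"spec":{"postgres_data_volume_init":true}}'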
In any case, I’m glad it got resolved! Thank you for updating this topic!
Yeah, I really think it should have resolved it, as it seems to resolve all the related issues according to many GitHub issues.
Anyway, I succeeded in performing the migration, even if it wasn't 100% automated; the result is here!
Thanks again a lot for your precious help and advice!