AWX upgrade stuck

Hi
As planned in my previous post, I attempted to upgrade my AWX setup using Helm: awx-operator from 2.9.0 to 2.19.0, and therefore AWX from 23.5.1 to 24.6.0.

Note: I’m a K8s beginner, so my question here may be a basic one.

1st attempt
I upgraded using helm upgrade my-awx-operator awx-operator/awx-operator
I hit exactly this issue, where Postgres 15 cannot create its data directory, and I wasn’t able to fix it myself. So I decided to do a Helm rollback first, then upgrade to a version lower than 2.19.0.

2nd attempt
I rolled back to 2.9.0 using helm rollback my-awx-operator, since 2.9.0 was the previous release.

My current issue is that the awx-xxx-web pod fails, apparently because it doesn’t know how to connect to the Postgres DB:

...brutally killing workers...
2024-07-05 09:12:35,115 INFO stopped: nginx (exit status 0)
2024-07-05 09:12:35,115 INFO stopped: nginx (exit status 0)
2024-07-05 09:12:35,119 WARNING  [-] awx.conf.settings Database settings are not available, using defaults. error: connection is bad: Name or service not known
2024-07-05 09:12:35,119 WARNING  Database settings are not available, using defaults. error: connection is bad: Name or service not known
2024-07-05 09:12:35,107 INFO     [-] daphne.server Killed 0 pending application instances
2024-07-05 09:12:35,107 INFO     Killed 0 pending application instances
2024-07-05 09:12:35,638 INFO stopped: daphne (exit status 0)
2024-07-05 09:12:35,638 INFO stopped: daphne (exit status 0)
worker 1 buried after 1 seconds
worker 2 buried after 1 seconds
worker 3 buried after 1 seconds
worker 4 buried after 1 seconds
worker 5 buried after 1 seconds
binary reloading uWSGI...

At the moment, the original postgres-13-0 pod is still up with its data (I’m able to connect using DBeaver).
The Kubernetes secret (postgres-configuration) is also still present, with the correct credentials.

It seems the web pod doesn’t know how to get this information, but I’m stuck because of my limited K8s knowledge…

Is there any advice, or something I should look at, to make sure my web pod is configured correctly?

Best regards

Gael

Have the hostnames included in postgres-configuration been rolled back to point to PSQL 13? Please check if they have been reverted to 13 after being changed to 15 during the upgrade.
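
A minimal sketch of how that check could be done. The secret name and namespace below are placeholders (assumptions, not confirmed values), and the sketch assumes the hostname is stored under the host key. Values in a Secret are base64 encoded, so they have to be decoded to read them and re-encoded to change them:

```shell
#!/bin/sh
# Placeholders (assumptions, adjust to your deployment).
NAMESPACE="ppr-awx"
SECRET="awx-infra-postgres-configuration"

# Print the decoded database host stored in the secret.
show_db_host() {
  kubectl get secret "$SECRET" -n "$NAMESPACE" \
    -o jsonpath='{.data.host}' | base64 -d
}

# Encode a hostname the way a Secret's data field expects it
# (printf avoids the trailing newline that echo would add).
encode_host() {
  printf '%s' "$1" | base64
}

# Run manually once the names are correct:
#   show_db_host
#   encode_host awx-infra-postgres-13
```

If show_db_host prints a hostname ending in -15 while only the Postgres 13 pod is running, that would match the "Name or service not known" error in the web pod log.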

Hi
The hostname was awx-infra-postgres-15 (base64 encoded), so I updated this value to awx-infra-postgres-13 (base64 encoded).
I also manually changed the annotations from awx-infra-postgres-15 to awx-infra-postgres-13.
It finally came back online and is working! Huge thanks!

One more question: I saw on GitHub that there are issues/PRs about the upgrade from psql13 to psql15. It looks like UID 26 should have permission to write to the PV, right?

In my case, should I launch the upgrade again and then temporarily mount the PV to change those permissions, or is there a workaround?

Correct.

Add postgres_data_volume_init: true to your AWX’s spec. This will fix the permissions on the directory in the PV automatically.
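
As a rough sketch, assuming the AWX resource is named awx-infra in the ppr-awx namespace (both placeholders) and is not rendered by the Helm chart (if it is, the value belongs under AWX.spec in the chart values instead), the parameter could be set with a merge patch:

```shell
#!/bin/sh
# Placeholders (assumptions, adjust to your deployment).
NAMESPACE="ppr-awx"
AWX_NAME="awx-infra"

# Build a JSON merge patch that sets postgres_data_volume_init.
build_patch() {
  printf '{"spec":{"postgres_data_volume_init":%s}}' "$1"
}

# Apply the patch to the AWX custom resource.
enable_volume_init() {
  kubectl patch awx "$AWX_NAME" -n "$NAMESPACE" \
    --type merge -p "$(build_patch true)"
}

# Run manually once the names are correct:
#   enable_volume_init
```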

OK, I’ll try to find where I can add that in the Helm chart and let you know what happens. Fingers crossed!

Just gave it a try and got the same error.

I used to install or upgrade AWX without custom values, using this command (the one from the attempt that failed):

 helm upgrade my-awx-operator awx-operator/awx-operator --namespace ppr-awx --version 2.19.1 

I simply added postgres_data_volume_init: true to AWX.spec in the following values.yaml:

AWX:

  # enable use of awx-deploy template
  enabled: true
  name: awx
  spec:
    admin_user: admin
    postgres_data_volume_init: true

  # configurations for external postgres instance
  postgres:
    enabled: false
    host: Unset
    port: 5678
    dbName: Unset
    username: admin
    # for secret management, pass in the password independently of this file
    # at the command line, use --set AWX.postgres.password
    password: Unset
    sslmode: prefer
    type: unmanaged

Then I upgraded using:

helm upgrade my-awx-operator awx-operator/awx-operator --namespace ppr-awx --version 2.19.1 -f .\values.yaml

I guess I set it in the wrong place?

EDIT:
I also tried adding the following, as seen here:

postgres_init_container_commands: |
  chown 26:0 /var/lib/pgsql/data
  chmod 700 /var/lib/pgsql/data

But I got the same error: mkdir: cannot create directory '/var/lib/pgsql/data/userdata': Permission denied

@kurokobo, as your advice didn’t work in my case, I finally found a workaround, as mentioned on GitHub here and here.

  • Create a temporary pod
  • Mount the existing postgres-15 PV in that pod
  • Change the directory ownership and permissions:
chown 26:0 data/
chmod 700 data/
  • Kill my temporary pod, then kill the failing postgres-15 pod
  • The pod regenerates by itself, then starts performing the migration. I had some errors in the logs that needed to be fixed with kubectl apply --server-side -k "github.com/ansible/awx-operator/config/crd?ref=2.19.1", as mentioned in the same issue 1907 referenced at the beginning of this post.
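
The steps above could look roughly like the following sketch. The pod name, the busybox image, and the /mnt/data mount path are assumptions, not taken from this thread, and the PVC name must be looked up first with kubectl get pvc. It also assumes the volume root is the directory that Postgres sees as /var/lib/pgsql/data. The pod runs as root so that it is allowed to change ownership, applies the same chown and chmod as in the list, then exits.

```shell
#!/bin/sh
# Render a one-shot pod manifest that mounts an existing PVC and fixes
# ownership and mode of the volume root for UID 26.
# $1: PVC name (hypothetical, check with `kubectl get pvc`).
render_fix_pod() {
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: pg15-perm-fix
spec:
  restartPolicy: Never
  containers:
    - name: fix
      image: busybox
      securityContext:
        runAsUser: 0
      command: ["sh", "-c", "chown 26:0 /mnt/data && chmod 700 /mnt/data"]
      volumeMounts:
        - name: data
          mountPath: /mnt/data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: $1
EOF
}

# Usage (run manually, replacing <pvc-name> with the real claim name):
#   render_fix_pod <pvc-name> | kubectl apply -n ppr-awx -f -
#   kubectl delete pod pg15-perm-fix -n ppr-awx
```

Rendering the manifest from a function keeps the PVC name as the only input and makes it easy to inspect the YAML before applying it.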

Conclusion: everything went very fast, and the DB migration from PGSQL13 to PGSQL15 seems to have run smoothly.
I only need to delete the PV/PVC related to the old PGSQL13 and I’m 100% done.

Thanks again for your help! (Here, but also all the times I’ve read your answers on GitHub.)

@motorbass
Sorry for the delayed response.

If postgres_data_volume_init is set to true, then the chmod and chown commands you manually executed should have been performed automatically, so I’m not sure why that didn’t resolve the issue.

However, even if you add parameters and perform a helm upgrade, it might take some time for the AWX Operator to actually start a new reconciliation loop using those new parameters.
Also, depending on the failed tasks, there might be cases where condition checks are not correctly performed in the next loop, and that could be the reason why the parameters were not properly applied.
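
One way to check both points, sketched under assumptions: the AWX resource name awx-infra is a placeholder, and awx-operator-controller-manager and awx-manager are the deployment and container names I would expect from a default operator install, which may differ in a customized setup.

```shell
#!/bin/sh
# Placeholders (assumptions, adjust to your deployment).
NAMESPACE="ppr-awx"
AWX_NAME="awx-infra"

# Show whether the parameter actually reached the AWX resource.
show_volume_init() {
  kubectl get awx "$AWX_NAME" -n "$NAMESPACE" \
    -o jsonpath='{.spec.postgres_data_volume_init}'
}

# Follow the operator logs to see when a new reconciliation loop starts.
follow_operator_logs() {
  kubectl logs -f deployment/awx-operator-controller-manager \
    -c awx-manager -n "$NAMESPACE"
}

# Run manually once the names are correct:
#   show_volume_init
#   follow_operator_logs
```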

In any case, I’m glad it got resolved! Thank you for updating this topic!

Yeah, I really think it should have resolved it, as it seems to resolve all the related issues according to many GitHub issues.
Anyway, I succeeded in performing the migration, even if it wasn’t 100% automated. The result is here!
Thanks again for your valuable help and advice!
