Re-import postgres database

Hi everyone

I’ve recently upgraded my AWX Operator instance from 0.22.0 to 0.28.0, which included a switch from postgres 9 to postgres 13.
The migration wasn’t seamless: I had some problems with the new PV, so I had to restart the pod and the migration process a few times. I think this might have caused some corruption in the database.

Some tasks seem to randomly fail with this error message:
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/utils.py", line 82, in _execute
    return self.cursor.execute(sql)
psycopg2.errors.InternalError: unexpected data beyond EOF in block 166 of relation base/16384/2664
HINT: This has been seen to occur with buggy kernels; consider updating your system.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/jobs.py", line 481, in run
    self.pre_run_hook(self.instance, private_data_dir)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/jobs.py", line 1287, in pre_run_hook
    super(RunProjectUpdate, self).pre_run_hook(instance, private_data_dir)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/jobs.py", line 417, in pre_run_hook
    create_partition(instance.event_class._meta.db_table, start=instance.created)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/utils/common.py", line 1163, in create_partition
    cursor.execute(
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/utils.py", line 66, in execute
    return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/utils.py", line 75, in _execute_with_wrappers
    return executor(sql, params, many, context)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/utils.py", line 90, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/utils.py", line 82, in _execute
    return self.cursor.execute(sql)
django.db.utils.InternalError: unexpected data beyond EOF in block 166 of relation base/16384/2664
HINT: This has been seen to occur with buggy kernels; consider updating your system.

My question is: is there some kind of check I can run on the database? Or is it possible to re-import the database? I have backups from before the migration.

You might attempt a restore from your 0.22.0 backup by creating an AWXRestore object; see this documentation: https://github.com/ansible/awx-operator/blob/devel/roles/restore/README.md

We also have a tool that might help debug DB connectivity issues, “awx-manage check_db”, which you may find useful.

Do different tasks hit this error, or the same one (at random times)? Do you ever see a different stack trace when hitting an error?

AWX Team

  • It’s always the same task that fails, which is a “Source Control Update” task. The actual task then fails because it needs this update before running.
  • The stack trace seems to be the same; the only thing that changes is the block number, e.g. “block 173 of relation base/16384/2664”.
  • “awx-manage check_db” seems to return no errors. These errors don’t happen all the time, so it is fair to say that connectivity with the DB works fine most of the time.
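
Since the relation path stays the same and only the block number changes, one possible check is to find out which table or index base/16384/2664 actually is. The following is a minimal sketch, not an AWX tool: the helper name `lookup_queries` is made up for illustration, and it assumes the relation lives in the database's default tablespace (which is what the `base/` prefix indicates). It only builds the SQL; the queries would still have to be run with psql, the second one while connected to the database named by the first.

```python
import re

# Matches the path PostgreSQL prints for a relation in the default tablespace:
# base/<database OID>/<relfilenode>
RELATION_PATH = re.compile(r"relation base/(\d+)/(\d+)")


def lookup_queries(error_text):
    """Return (database_query, relation_query) for the first relation path found.

    Hypothetical helper for illustration; it does not connect to any database.
    """
    match = RELATION_PATH.search(error_text)
    if match is None:
        raise ValueError("no 'relation base/<db>/<filenode>' path in error text")
    db_oid, filenode = match.groups()
    # Which database does the directory base/<db_oid> belong to?
    database_query = f"SELECT datname FROM pg_database WHERE oid = {db_oid};"
    # Tablespace 0 stands for the database's default tablespace.
    relation_query = f"SELECT pg_filenode_relation(0, {filenode});"
    return database_query, relation_query


if __name__ == "__main__":
    message = "unexpected data beyond EOF in block 166 of relation base/16384/2664"
    for query in lookup_queries(message):
        print(query)
```

If the relation turns out to be an index, rebuilding it with REINDEX may be worth trying, although that would not address whatever is happening at the storage layer.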

Thanks for the link to the Restore role, I wasn’t aware of it! I’ll try to do a restore in another namespace and check for errors

Hi!

Did running the restore role help resolve the issue? Also, it might be worth looking at the postgres pod logs around the time the project update fails to see if anything stands out as problematic.

AWX Team

We haven’t been able to run the restore role because we weren’t making backups with the backup role. We were only doing a pg_dumpall on the postgres pod.
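
A plain pg_dumpall file is an SQL script, so it can in principle be fed to psql on a fresh postgres pod in a test namespace. A rough sketch of how that could be scripted follows; the function names, the namespace and pod arguments, and the default user `awx` are all assumptions for illustration, not values taken from our deployment.

```python
import subprocess


def build_restore_command(namespace, pod, user="awx", database="postgres"):
    """Build a kubectl command that runs psql in the pod and reads SQL from stdin.

    The user and database defaults are assumptions; adjust them to the target pod.
    """
    return [
        "kubectl", "exec", "-i",
        "-n", namespace,
        pod,
        "--",
        "psql", "-U", user, "-d", database,
    ]


def restore_dump(dump_path, namespace, pod, user="awx"):
    """Stream a pg_dumpall file into psql inside the target pod."""
    command = build_restore_command(namespace, pod, user)
    with open(dump_path, "rb") as dump:
        # check=True raises CalledProcessError if kubectl or psql exits non-zero.
        subprocess.run(command, stdin=dump, check=True)
```

Calling `restore_dump("dump.sql", "awx-test", "some-postgres-pod")` with placeholder names like these would stream the file into the pod; the role used needs enough privileges to recreate the roles and databases contained in the dump.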

We are running the storage on GlusterFS, which led me to this KB article: https://access.redhat.com/solutions/3673761. We applied the group recommended there, but we’re still seeing the same error, just not as often. For now the workaround has been to disable inventory sync before each task.
Sadly, we don’t have enough log retention to see the output of the pods during the migration.