Reduce Postgres memory usage for AWX

My AWX deployment is returning HTTP 499 errors from the AWX UI. I suspect the cause is slow responses from Postgres.

I had one big job run that took about 10 hours to finish. While this job was running, Postgres’s memory usage was quite high and the 499 errors were visible. The memory usage did not go back down even after the job finished.

Should I modify any configs, or add a VACUUM FULL job for PG?

2024-02-13T01:48:07.889762187Z stdout F X.X.X.X - - [13/Feb/2024:01:48:07 +0000] "GET /api/v2/metrics/ HTTP/1.1" 499 0 "http://X.X.X.X:8052/api/v2/metrics" "Prometheus/2.48.0" "-"
2024-02-13T01:47:59.115374436Z stdout F X.X.X.X - - [13/Feb/2024:01:47:59 +0000] "GET / HTTP/1.1" 499 0 "-" "Blackbox Exporter/0.24.0" "X.X.X.X"

This causes slow AWX UI responses, and Prometheus can’t scrape metrics.
My platform: AKS
AWX-operator: 2.10.0
AWX: 23.6.0
AWX-EE: 23.6.0

My current pg config:

  postgres_resource_requirements:
    requests:
      cpu: 2000m
      memory: 8Gi
    limits:
      cpu: 4000m
      memory: 16Gi
  postgres_storage_requirements:
    requests:
      storage: 64Gi
    limits:
      storage: 100Gi
  postgres_image: docker.io/library/postgres
  postgres_image_version: "13.13"
  postgres_extra_args:
    - "-c"
    - "max_locks_per_transaction=1024"
    - "-c"
    - "max_connections=4000"
    - "-c"
    - "maintenance_work_mem=512MB"
    - "-c"
    - "shared_buffers=2048MB"
    - "-c"
    - "min_wal_size=2048MB"
    - "-c"
    - "max_wal_size=4096MB"
    - "-c"
    - "effective_io_concurrency=200"
    - "-c"
    - "checkpoint_completion_target=0.9"
    - "-c"
    - "effective_cache_size=6144MB"
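
A side note on the settings above: `max_connections=4000` is very high for a single instance. Every allowed connection reserves per-backend memory, and each backend can additionally use up to `work_mem` per sort or hash operation, so a large connection ceiling can inflate memory usage even when most slots are idle. Here is a hedged sketch of more conservative values (the numbers are illustrative assumptions, not tested recommendations for this workload):

```yaml
  postgres_extra_args:
    - "-c"
    - "max_connections=500"        # assumption: AWX rarely needs thousands of backends
    - "-c"
    - "shared_buffers=4096MB"      # ~25% of the 16Gi limit is a common starting point
    - "-c"
    - "autovacuum_vacuum_scale_factor=0.05"  # vacuum large, churny tables more often
```

On the VACUUM question: plain `VACUUM` (which autovacuum runs for you) reclaims dead rows for reuse without an exclusive lock, while `VACUUM FULL` rewrites the whole table under an exclusive lock and blocks all access, so it is usually a last resort rather than a scheduled job.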

Do you have any idea why one job running too long can cause Postgres CPU throttling? :thinking: My theory is that because this job ran so long, it kept inserting job output into the database, and that caused the high CPU.

Does your long-running job spit out a lot of output events?
Those all get stored in the database and can definitely be expensive.
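
If you want to check whether the event tables are what is taking the space, a query like this against the awx database should list the biggest tables (in recent AWX schemas, job output rows live in `main_jobevent`; treat that name as an assumption for your version):

```sql
-- Run inside the awx database, e.g. via psql.
SELECT relname,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
       n_live_tup
FROM pg_stat_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;
```

AWX also ships a built-in "Cleanup Job Details" system job that purges old job events, which is the supported way to keep those tables from growing without bound.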

Sorry for the late response. I’ve been quite busy recently with a ton of work :sweat_smile:
Here was my Ansible job:

    - name: Start restore database
      ansible.windows.win_powershell:
        script: |
          ### do restore here
      async: 86400
      poll: 10

The output looked something like this:

ASYNC POLL on XXXX: jid=j422747873893.12560 started=1 finished=0
ASYNC POLL on XXXX: jid=j422747873893.12560 started=1 finished=0
ASYNC POLL on XXXX: jid=j422747873893.12560 started=1 finished=0
ASYNC POLL on XXXX: jid=j422747873893.12560 started=1 finished=0
ASYNC POLL on XXXX: jid=j422747873893.12560 started=1 finished=0
ASYNC POLL on XXXX: jid=j422747873893.12560 started=1 finished=0
ASYNC POLL on XXXX: jid=j422747873893.12560 started=1 finished=0
ASYNC POLL on XXXX: jid=j422747873893.12560 started=1 finished=0

The job executed for 10 hours.
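
Each of those `ASYNC POLL` lines becomes a job event row in the database. With `poll: 10` over roughly 10 hours (36,000 seconds), that is on the order of 3,600 events for this single task. If fine-grained progress isn't needed, raising the poll interval cuts the event volume proportionally; a hedged tweak of the task above:

```yaml
    - name: Start restore database
      ansible.windows.win_powershell:
        script: |
          ### do restore here
      async: 86400
      poll: 300   # check every 5 minutes instead of every 10 s: ~120 poll events instead of ~3600
```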

What I’m wondering is: this job executes on a remote server, so why was Postgres CPU usage so high (~3.11 cores)? The number of write requests shouldn’t be very large if it only writes a few lines of logs to the DB. Normally it only consumes about 1.5 cores.
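
One way to see what Postgres is actually spending CPU on while such a job runs is to look at the currently active queries. A hedged sketch (requires a role that can read `pg_stat_activity`, e.g. the awx database owner or a superuser):

```sql
-- Show the longest-running non-idle queries at this moment.
SELECT pid,
       now() - query_start AS duration,
       state,
       left(query, 80) AS query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY duration DESC NULLS LAST
LIMIT 10;
```

Running this a few times during the job should show whether the load comes from event inserts, from UI/API reads, or from something else entirely.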