Reduce Postgres memory usage for AWX

My AWX platform is experiencing error 499 from awx UI. I think the issue comes from the slow response from Postgres.

I had one big job run, it took about 10 hours to finish. However, when this job started, Postgres’s memory use was quite high and error 499 was visible . But it did not go down even after this task was finished.

Should I modify any configs? Or add vacuum full job for PG?

2024-02-13T01:48:07.889762187Z stdout F X.X.X.X - - [13/Feb/2024:01:48:07 +0000] "GET /api/v2/metrics/ HTTP/1.1" 499 0 "http://X.X.X.X:8052/api/v2/metrics" "Prometheus/2.48.0" "-"
2024-02-13T01:47:59.115374436Z stdout F X.X.X.X - - [13/Feb/2024:01:47:59 +0000] "GET / HTTP/1.1" 499 0 "-" "Blackbox Exporter/0.24.0" "X.X.X.X"

This causes: AWX UI response slow, prometheus can’t scape metrics.
My platform: AKS
AWX-operator: 2.10.0
AWX: 23.6.0
AWX-EE: 23.6.0

My current pg config:

      cpu: 2000m
      memory: 8Gi
      cpu: 4000m
      memory: 16Gi
      storage: 64Gi
      storage: 100Gi
  postgres_image_version: "13.13"
    - "-c"
    - "max_locks_per_transaction=1024"
    - "-c"
    - "max_connections=4000"
    - "-c"
    - "maintenance_work_mem=512MB"
    - "-c"
    - "shared_buffers=2048MB"
    - "-c"
    - "min_wal_size=2048MB"
    - "-c"
    - "max_wal_size=4096MB"
    - "-c"
    - "effective_io_concurrency=200"
    - "-c"
    - "checkpoint_completion_target=0.9"
    - "-c"
    - "effective_cache_size=6144MB"

Do you have any idea why one job run too long can cause throttling for postgres’s CPU? :thinking: I have one theory that this job run too long and it insert some jobs output then it caused high CPU.

does your long running job spit out a lot of output events?
those all get store in the database and can definitely be expensive