memory problems

Hi,

…we’re experiencing problems running AWX jobs, due to memory exhaustion I guess. The VM has decent memory (16G) and our jobs ran well for quite some time. But right now it’s almost a 50:50 chance that jobs fail (the output just stops, with no error in the details of the job output). When I look at the system, I can see that there’s almost no memory left:

top - 07:26:54 up 2 days, 22:48, 3 users, load average: 0.00, 0.04, 0.07
Tasks: 289 total, 1 running, 287 sleeping, 0 stopped, 1 zombie
%Cpu(s): 1.1 us, 0.2 sy, 0.0 ni, 98.5 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
MiB Mem : 16002.2 total, 572.8 free, 11127.4 used, 4302.0 buff/cache
MiB Swap: 4096.0 total, 4082.4 free, 13.6 used. 4404.5 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4452 u032359 20 0 3227024 3.0g 21540 S 0.0 19.5 0:21.45 uwsgi
4453 u032359 20 0 2733744 2.6g 22308 S 0.0 16.5 0:34.42 uwsgi
4454 u032359 20 0 1689112 1.6g 17964 S 0.0 10.2 0:08.28 uwsgi
4450 u032359 20 0 1192644 1.1g 21752 S 0.0 7.1 0:05.32 uwsgi
4451 u032359 20 0 806172 782648 17308 S 0.0 4.8 0:03.96 uwsgi
841 root 20 0 1394412 514280 91092 S 8.9 3.1 277:29.50 k3s-server
4488 u032359 20 0 175364 145524 9556 S 0.0 0.9 1:46.60 awx-manage
4475 u032359 20 0 678812 141732 17236 S 0.0 0.9 4:41.08 awx-manage
214943 u032359 20 0 682804 138636 11212 S 0.0 0.8 0:00.65 awx-manage
214881 u032359 20 0 682872 138616 11212 S 0.0 0.8 0:00.68 awx-manage
4489 u032359 20 0 167396 137672 9548 S 0.0 0.8 1:50.24 awx-manage
215195 u032359 20 0 682688 136520 9352 S 0.0 0.8 0:00.39 awx-manage
215273 u032359 20 0 682428 136324 9348 S 0.0 0.8 0:00.25 awx-manage
4476 u032359 20 0 155472 134056 17168 S 0.0 0.8 1:54.50 awx-manage
4490 u032359 20 0 163384 132588 9544 S 0.0 0.8 1:50.01 awx-manage
4491 u032359 20 0 163064 132340 9512 S 0.0 0.8 1:52.48 awx-manage
4447 u032359 20 0 210160 118848 17620 S 0.0 0.7 1:55.35 awx-manage
4486 u032359 20 0 153708 108940 6328 S 0.0 0.7 3:49.42 awx-manage
4446 u032359 20 0 195720 105356 17296 S 0.0 0.6 1:19.55 daphne
3409 999 20 0 213128 84176 82252 S 0.0 0.5 0:02.74 postgres
2656 65532 20 0 799396 72800 55004 S 0.0 0.4 1:32.21 traefik
485 root 19 -1 145496 69596 68420 S 0.0 0.4 0:04.88 systemd-journal
1118 root 20 0 768336 61364 30612 S 0.0 0.4 31:43.44 containerd
3596 u032359 20 0 750832 47664 28832 S 0.0 0.3 19:25.71 metrics-server
3021 root 20 0 754780 43756 30116 S 0.0 0.3 8:12.12 coredns
3091 isarnet 20 0 742924 39912 22044 S 0.0 0.2 7:31.85 ansible-operato
4057 u032359 20 0 2368144 36252 21900 S 1.0 0.2 13:31.51 receptor
521 root rt 0 289460 27244 9076 S 0.0 0.2 0:22.01 multipathd
3927 u032359 20 0 29316 25156 9300 S 0.0 0.2 0:47.68 supervisord
2668 root 20 0 734716 24912 18076 S 0.0 0.2 1:39.18 local-path-prov
3981 u032359 20 0 29032 24760 9068 S 0.0 0.2 0:58.72 supervisord

Any hint on how to troubleshoot this further? It seems like the uwsgi processes consume way too much memory. AWX is running on k3s:

awx-operator: 0.23.0
k3s version v1.23.6+k3s1 (418c3fa8)
go version go1.17.5

AWX 21.2.0

Jobs are using dynamic inventory from nautobot.

Thanks,

Andreas

To clarify this: we observed that uwsgi starts to use more and more memory when we look at the job output. Initially everything looks OK, but as soon as we start viewing the job output, we run into problems :frowning:

Any hint on how to troubleshoot that?

thanks,

Andreas

How many events are in the job output of the job in question? Also, are the jobs in a finished state (“succeeded” or “error”) or running state?

AWX Team

Hi,

…just discovered that when I look at the job output of a successful job, memory seems to be almost stable, as expected. But then there are those failed jobs: looking at the output of a failed job bumps the memory from ~200 MB to 2.2 GB! Maybe it is related to the job itself, because the job (which runs once a day) currently has a 50:50 chance of failing, without a proper error message. From the log it looks like it just aborted:

PLAY [localhost] ***************************************************************

TASK [get golden_config device status] *****************************************

ok: [localhost]

TASK [check devices for recent backup] *****************************************

skipping: [localhost] => (item=xxx - 2022-09-19T01:01:00.818174Z - 4 hours old)

skipping: [localhost] => (item=xxx - 2022-09-19T01:01:00.818174Z - 4 hours old)

skipping: [localhost] => (item=xxx - 2022-09-19T01:01:00.818174Z - 4 hours old)

skipping: [localhost] => (item=xxx - 2022-09-19T01:01:00.818174Z - 4 hours old)

skipping: [localhost] => (item=xxxesd0011-2812 - 2022-09-19T01:01:00.818174Z - 4 hours old)

skipping: [localhost] => (item=xxxesd0012-2811 - 2022-09-19T01:01:00.818174Z - 4 hours old)

skipping: [localhost] => (item=xxxesx0011-2809 - 2022-09-19T01:01:00.818174Z - 4 hours old)

skipping: [localhost] => (item=xxxesx0012-2810 - 2022-09-19T01:01:00.818174Z - 4 hours old)

skipping: [localhost] => (item=xxxesx0021-2811 - 2022-09-19T01:01:00.818174Z - 4 hours old)

skipping: [localhost] => (item=xxxesx0022-2812 - 2022-09-19T01:01:00.818174Z - 4 hours old)

[ *** removed ~200 lines showing the same skipping: … lines *** ]

skipping: [localhost] => (item=xxxt172d - 2022-09-19T01:31:00.840784Z - 3 hours old)

skipping: [localhost] => (item=xxxt172e - 2022-09-19T01:31:00.840784Z - 3 hours old)

skipping: [localhost] => (item=xxxt172f - 2022-09-19T01:31:00.840784Z - 3 hours old)

skipping: [localhost] => (item=xxxt172g - 2022-09-19T01:31:00.840784Z - 3 hours old)

skipping: [localhost] => (item=xxxt172h - 2022-09-19T01:31:00.840784Z - 3 hours old)

skipping: [localhost] => (item=xxxt172i - 2022-09-19T01:31:00.840784Z - 3 hours old)

skipping: [localhost] => (item=xxxt172j - 2022-09-19T01:31:00.840784Z - 3 hours old)

skipping: [localhost] => (item=xxxt174a-1 - 2022-09-19T01:31:00.840784Z - 3 hours old)

skipping: [localhost] => (item=xxxt174b - 2022-09-19T01:31:00.840784Z - 3 hours old)

and that’s it: the output just stops. The failure reason is "Job terminated due to error".

In the logs I can see:

awx.main.dispatch job 4848 (error) encountered an error (rc=None), please see task stdout for details.

awx.main.dispatch task 34f3c74d-4394-4c33-8e70-0cbd52c7e385 starting awx.main.tasks.system.handle_work_error(*['34f3c74d-4394-4c33-8e70-0cbd52c7e385'])

awx.main.tasks.system Executing error task id 34f3c74d-4394-4c33-8e70-0cbd52c7e385, subtasks: [{'type': 'job', 'id': 4848}]

Just reconfirmed: if I look at the output of successful runs of the job, memory stays stable. As soon as I look at a failed job’s output, memory usage increases.

Thanks,

Andreas

Hi,

I believe this AWX issue describes the same problem: https://github.com/ansible/awx/issues/12644. For now, can you please bump that issue thread and summarize your issue there?

AWX Team

I believe this is a real bug and could use some help tracking it down

what does "/api/v2/jobs/<id>/job_events/children_summary" return for the failed job?

what does that same endpoint return for the same job, but when successful?

Do you get 502 errors when viewing the job detail page in the UI?
open the browser dev tools > network tab. When you go to the job output page in the UI, don’t scroll down. Do any requests return a 502 or take a long time to respond?

e.g., note the status and time columns in the network tab.

Thanks for any information you can provide
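For anyone following along, the endpoint above can also be fetched outside the browser. Here is a minimal sketch (the host and token are placeholders, and the exact URL shape is taken from the question above; adjust for your AWX instance):

```python
import json
import urllib.request

def children_summary_url(base, job_id):
    # Build the per-job children_summary path mentioned above.
    return f"{base}/api/v2/jobs/{job_id}/job_events/children_summary"

def fetch_children_summary(base, job_id, token):
    # Placeholder auth: an AWX OAuth2/personal access token.
    req = urllib.request.Request(
        children_summary_url(base, job_id),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Example usage (placeholders, don't run as-is):
# summary = fetch_children_summary("https://awx.example.com", 4848, "TOKEN")
# print(summary["event_processing_finished"])
```

This just saves repeated clicking in the dev tools while comparing a failed and a successful run of the same job.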

Hi,

thanks for looking into this; I will provide the requested information tomorrow. Should I still bump the old issue as suggested in the other mail (https://github.com/ansible/awx/issues/12644)?

Thanks,

Andreas


When you collect the data, you can probably put it in the issue (and here too, or at least a link to your GitHub comment) so everything is in one place. Plus there’s a better chance of other devs taking a look on GitHub.

OK - I added the data to the GitHub issue (see https://github.com/ansible/awx/issues/12644#issuecomment-1256998192).

My findings were:

→ what does "/api/v2/jobs/<id>/job_events/children_summary" return for the failed job?

result:

{
  "children_summary": {
    "2": {
      "rowNumber": 1,
      "numChildren": 282
    },
    "3": {
      "rowNumber": 2,
      "numChildren": 281
    },
    "4": {
      "rowNumber": 3,
      "numChildren": 2
    },
    "7": {
      "rowNumber": 6,
      "numChildren": 277
    }
  },
  "meta_event_nested_uuid": {},
  "event_processing_finished": true,
  "is_tree": true
}
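As a rough sanity check, the child counts in that payload can be summed to estimate how many events the UI may try to load when the collapsed parent rows are expanded (a sketch; the payload is the one pasted above):

```python
import json

# The children_summary payload returned for the failed job (pasted above).
payload = json.loads("""
{
  "children_summary": {
    "2": {"rowNumber": 1, "numChildren": 282},
    "3": {"rowNumber": 2, "numChildren": 281},
    "4": {"rowNumber": 3, "numChildren": 2},
    "7": {"rowNumber": 6, "numChildren": 277}
  },
  "meta_event_nested_uuid": {},
  "event_processing_finished": true,
  "is_tree": true
}
""")

# Total child events across all collapsible parent rows.
total_children = sum(e["numChildren"] for e in payload["children_summary"].values())
print(total_children)  # prints 842
```

So the job has on the order of 850 events, which matches the ~200+ skipping lines shown earlier; not an extreme event count by itself.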

→ what does that same endpoint return for the same job, but when successful?

Result:

{"detail": "Nicht gefunden."}
(which means "not found")

→ Do you get 502 errors when viewing the job detail page in the UI? Open the browser dev tools > network tab. When you go to the job output page in the UI, don’t scroll down. Do any requests return a 502 or take a long time to respond?

Result: No. No 502s, and a reasonable response time (~5 s):

But if I grab the scrollbar with the mouse and drag it halfway down, I can see a lot of requests being sent to the backend, and then there is a 502. Additionally, I can see the response size is between 1 and 7 MB, which seems a bit large for loading the events.

Here’s a screenshot, after I moved the scrollbar down:

Once more - thanks for looking into this!

Best regards,

Andreas

Just to let you know: reducing the data received from the inventory and tuning container log rotation made our jobs run stable again!

Thanks for your support!

Thanks Andy, can you let us know which container log rotation setting you changed? Was it just "max-size"?

Hi,

container-log-max-files and container-log-max-size, modified in /etc/systemd/system/k3s.service:

ExecStart=/usr/local/bin/k3s \
    server \
    '--kubelet-arg' \
    'container-log-max-files=4' \
    '--kubelet-arg' \
    'container-log-max-size=50Mi' \

(After editing the unit file, run systemctl daemon-reload and restart the k3s service so the new limits take effect.)