JOB FAILED - without failed logs

Rijo_Thomas · November 26, 2024, 11:36am

Hello,
I am currently experiencing an issue with AWX where a job ends with the status “failed.” The job output does not include any error message to help diagnose the problem: the logs stop after the last successful task, so I am unable to determine the cause of the failure.
If anyone has encountered a similar issue or has suggestions on how to troubleshoot this further, I would greatly appreciate your input.

Attaching a screenshot for your reference.

NB: we have deployed the application on an OCP cluster.

cnfrancis · November 28, 2024, 4:31am

A few things you can do:
1- Download the logs using the download icon and see if you get the full report
2- Refresh the UI a few times
3- Assuming this is k8s, find the pod that was used to run your job (the naming usually follows automation-job--) and check the logs for that pod

alvaroc20 · November 28, 2024, 1:43pm

Hi!

Could you increase the log level for that job? Any additional information you provide could be really helpful to assist you!

Rijo_Thomas · November 29, 2024, 10:21am

We tried all three options, but no luck.

The downloaded log files don't have any details about the failed task or why it failed.
I tried refreshing the UI multiple times.
It's running in OCP, and I have checked the automation pod output as well, but I was not able to find any output related to the failed task there either.

Rijo_Thomas · November 29, 2024, 10:22am

I tried setting the log level to 3 (debug), but the result is the same. I was able to see the detailed output of all tasks, but only up to the last successful one.
The output of the failed task itself is not shown.

Denney-tech · November 29, 2024, 10:03pm

Can you share what your tasks look like at the point of failure? It might help to know a little more context about what Ansible is trying to do.

From what it looks like, you’re running a sync operation on a pulp-rpm repository (are you using TheForeman/RedHatSatellite?), which may be a long-running task. It also looks like this is happening as an included task in a loop from the previous task (). Something weird could be happening with the loop vars, or perhaps the sync takes a long time to run and is causing the task to hang/timeout in a weird manner.

Have you confirmed that at least one of the pulp repo syncs actually occurs, and that the syncs finish successfully? Does pulp have timestamps to show how long the sync takes?

cnfrancis · November 30, 2024, 11:49pm

What were the stats of the pod? Was it being OOM-killed or throttled?
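One way to check for an OOM kill is to look at the container statuses in the pod's JSON, for example the output of `oc get pod <pod-name> -o json` (or the `kubectl` equivalent). The sketch below assumes that JSON has already been saved or captured; the `sample_pod` document and the helper name `oom_killed_containers` are made up for illustration, and only the `status.containerStatuses[].state` / `lastState` fields it reads come from the Kubernetes pod status.

```python
import json


def oom_killed_containers(pod: dict) -> list[str]:
    """Return names of containers whose current or previous state
    was terminated with reason "OOMKilled"."""
    names = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        for key in ("state", "lastState"):
            terminated = (status.get(key) or {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                names.append(status["name"])
                break  # do not list the same container twice
    return names


# Hypothetical, heavily trimmed pod document; a real one has many more fields.
sample_pod = json.loads("""
{
  "status": {
    "containerStatuses": [
      {
        "name": "worker",
        "state": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
        "lastState": {}
      }
    ]
  }
}
""")

print(oom_killed_containers(sample_pod))
```

An empty list means no container in that pod reported an OOM kill; throttling would have to be checked separately through the cluster's metrics.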

Rijo_Thomas · December 2, 2024, 2:01pm

Can you share what your tasks look like at the point of failure? It might help to know a little more context about what Ansible is trying to do.

It's not one specific task that is failing, and AWX does not report any failure. The job syncs packages from the Red Hat CDN and runs every day at midnight. The day before yesterday AWX showed output up to the 3rd sync task, whereas yesterday it showed output up to the 8th sync task.

on 30th Nov 2024

on 1st of December.

From what it looks like, you’re running a sync operation on a pulp-rpm repository (are you using TheForeman/RedHatSatellite?), which may be a long-running task. It also looks like this is happening as an included task in a loop from the previous task (). Something weird could be happening with the loop vars, or perhaps the sync takes a long time to run and is causing the task to hang/timeout in a weird manner.

Have you confirmed that at least one of the pulp repo syncs actually occurs, and that the syncs finish successfully? Does pulp have timestamps to show how long the sync takes?

Yes, the sync worked for multiple repos. The same loop was working some time back (almost a month ago). Packages are updated incrementally, so a single task won't take that long.

Rijo_Thomas · December 2, 2024, 2:28pm

The automation pod is showing OOMKilled and is later terminated.

cnfrancis · December 2, 2024, 3:08pm

Ahh, you have your answer then. Do you know what memory resources are defined for your instance groups/container groups?
You first need to check whether the memory request is large enough, then increase the memory limit to support bursts (e.g. perhaps 2x your request). You can do so via a pod spec override in your container group/instance group.
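As a rough sketch of that sizing rule, the snippet below takes an existing pod spec and sets a memory request plus a limit of twice the request on every container. The `base_spec` here is a hypothetical placeholder (the container name and image are invented); in practice you would start from the pod spec your container group already shows and change only the `resources` block. The 1024 Mi figure is illustrative, not a recommendation.

```python
import copy
import json


def with_memory(pod_spec: dict, request_mi: int, burst_factor: int = 2) -> dict:
    """Return a copy of pod_spec with a memory request of request_mi MiB
    and a memory limit of request_mi * burst_factor MiB on each container."""
    updated = copy.deepcopy(pod_spec)
    for container in updated["spec"]["containers"]:
        resources = container.setdefault("resources", {})
        resources.setdefault("requests", {})["memory"] = f"{request_mi}Mi"
        resources.setdefault("limits", {})["memory"] = f"{request_mi * burst_factor}Mi"
    return updated


# Hypothetical starting point; replace with the spec from your own container group.
base_spec = {
    "apiVersion": "v1",
    "kind": "Pod",
    "spec": {
        "containers": [
            {"name": "worker", "image": "example.invalid/ee:latest"}
        ]
    },
}

override = with_memory(base_spec, request_mi=1024)
print(json.dumps(override, indent=2))
```

The helper works on a copy, so the original spec stays available for comparison. Whether 2x is the right burst factor depends on how much memory the sync tasks actually use at peak.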

Denney-tech · December 2, 2024, 3:19pm

@cnfrancis Good call on the OOMKill.
