Jobs inconsistently report "Error" although run to completion

I am testing a new AWX installation (AWX Operator 1.1.3, all defaults, AWX 21.10.2) on EKS (1.24). I am running the same job over and over - simple file update job.

When I run it, sometimes it completes successfully, but more often than not the job ends with “Error” and truncated stdout output, even though the AWX event output and the results show the job actually succeeded.

The job details show “status” as “error” and “job_explanation” as “Job terminated due to error”, but at the end of the result_traceback JSON I can see the playbook results reporting the expected updates and “status”: “successful”. No actual errors can be seen.

It appears as if AWX is falsely determining that the job has errored out when it hasn’t…how can I debug this further?

Hi there,

Are you running anything else on this installation?
If it’s possible for you to share the pod logs for awx-task that would probably be really handy, because there may be a traceback error there.
This behavior definitely isn’t what we expect and any information and context you can share with us for a recreation on our end would be helpful.
If you want to keep the pods up and running for testing purposes, you can tail the pods after the job runs and see the status.
If you want to do this, you can add this to your AWX spec (WARNING: this is for testing only and is a super bad idea if you’re doing literally anything other than testing):

extra_settings:
  - setting: RECEPTOR_RELEASE_WORK
    value: "False"
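In case it helps, here is a minimal sketch of where that block sits in the AWX custom resource (the metadata values below are placeholders, not from your deployment):

```yaml
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx-demo        # placeholder name
  namespace: awx        # placeholder namespace
spec:
  extra_settings:
    - setting: RECEPTOR_RELEASE_WORK
      value: "False"    # testing only: leaves finished job pods around for inspection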

Thanks,
AWX Team

This is a pristine EKS cluster only running AWX at this point. I am working on automation to get it all set up.

I was able to get the log from the executing automation-job pod and have scrubbed and attached it. Judging by the detailed output, the entire run seems successful. I also attached the AWX job log itself (less the result traceback, which is the first file), which shows what AWX thinks of the job.

I don’t seem to have a pod called awx-task.

(attachments)

automation-job-27-pod-log.txt (47.5 KB)
awx_job_27.txt (8.03 KB)

If your AWX CR was:

Ah, OK…I tore down the cluster for the long weekend but will check again next week.

OK, I was able to find the awx-demo-awx container, and the log from that isn’t giving any stdout it seems, unless I’m not doing it correctly:

kubectl logs -n awx awx-demo-8fbbf9578-5wpvf -c awx-demo-task --tail 1 -f
2023-01-18 17:19:04,923 INFO [0655c659edb74433be0617eeaa767eac] awx.analytics.job_lifecycle job-39 waiting
2023-01-18 17:19:05,561 INFO [0655c659edb74433be0617eeaa767eac] awx.analytics.job_lifecycle job-39 pre run
2023-01-18 17:19:07,179 INFO [0655c659edb74433be0617eeaa767eac] awx.analytics.job_lifecycle job-39 preparing playbook
2023-01-18 17:19:07,349 INFO [0655c659edb74433be0617eeaa767eac] awx.analytics.job_lifecycle job-39 running playbook
2023-01-18 17:19:15,835 INFO [0655c659edb74433be0617eeaa767eac] awx.analytics.job_lifecycle job-39 work unit id received
2023-01-18 17:19:15,883 INFO [0655c659edb74433be0617eeaa767eac] awx.analytics.job_lifecycle job-39 work unit id assigned
2023-01-18 17:19:27,935 INFO [0655c659edb74433be0617eeaa767eac] awx.main.commands.run_callback_receiver Starting EOF event processing for Job 39
2023-01-18 17:19:32,224 INFO [0655c659edb74433be0617eeaa767eac] awx.analytics.job_lifecycle job-39 post run
2023-01-18 17:19:32,550 INFO [0655c659edb74433be0617eeaa767eac] awx.analytics.job_lifecycle job-39 finalize run
2023-01-18 17:19:33,249 WARNING [0655c659edb74433be0617eeaa767eac] awx.main.dispatch job 39 (error) encountered an error (rc=None), please see task stdout for details.

Were you able to append this setting to your AWX spec?

extra_settings:
  - setting: RECEPTOR_RELEASE_WORK
    value: "False"

That way you can tail the actual job pod. It seems the jobs are running successfully, but somewhere they are being detected as failed. This could be an issue in receptor or the controller.

With that setting, you can also kubectl exec into the control node task container and run "receptorctl --socket /var/run/awx-receptor/receptor.sock work list". That output might be helpful to include in a response.

AWX Team

The logs from the job pods were what I sent before.

Opening a shell to the task container and running receptorctl (after finding the actual command and socket paths, as they seemed to differ), I got the attached output.
The command I ran:

/var/lib/awx/venv/awx/bin/receptorctl --socket /run/receptor/receptor.sock work list

automation-job-40-rc825, automation-job-39-llr22 and automation-job-38-wzv5h were failed jobs. automation-job-36-jz58q and automation-job-37-fr9x6 succeeded. The entries in the output look pretty much the same in either case.
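For comparing those entries side by side, a small sketch like this can group work units by state. The field names (`StateName`, `Detail`) and the sample values are assumptions based on typical receptorctl JSON output, so adjust them to whatever your dump actually contains:

```python
import json

# Sample shaped like `receptorctl ... work list` output; the field names and
# details here are illustrative assumptions, not taken from the real dump.
sample = json.loads("""
{
  "automation-job-36-jz58q": {"StateName": "Succeeded", "Detail": "Finished"},
  "automation-job-39-llr22": {"StateName": "Failed", "Detail": "exit status 1"}
}
""")

# Bucket work unit IDs by their reported state
by_state = {}
for unit, info in sample.items():
    by_state.setdefault(info.get("StateName", "Unknown"), []).append(unit)

for state, units in sorted(by_state.items()):
    print(f"{state}: {', '.join(sorted(units))}")
```

With the real dump you would replace `sample` with `json.load(open("receptor-output-json.txt"))` (or pipe the receptorctl output straight in).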

(attachments)

receptor-output-json.txt (4.14 KB)

Hi Team,
Facing the exact same issue. I have gone through the logs shared by Brian O’Neill; they are pretty much the same as mine.
Environment:
Single-node K8s cluster (bare metal) v1.23.5
AWX Operator versions tried: 1.1.3, 1.1.0, 0.30.0

With a 1-master/3-worker setup, jobs don’t throw the truncated error message (AWX Operator: 0.30.0). I am facing this only on the single-node cluster.

But I really want the setup to run seamlessly on a single-node cluster.
I tried a couple of workarounds, like increasing the log size at the kubelet config level (containerLogMaxSize: 800Mi) and increasing job_timeout: 20000 in env/settings (awx-rsys-ee container).
Still the same issue. [Note: the workarounds might make no sense, I’m not sure; I’m just a beginner.]

It would be great if you could help with this.

doconeill,

Thanks for sharing that info. Can you please run

/var/lib/awx/venv/awx/bin/receptorctl --socket /run/receptor/receptor.sock work results qI2QXT7R

to get the stdout of that failed job and share it here? (Redact sensitive info first.)

Thanks

AWX Team

Hello - we are seeing this exact same error on inventory sync - both from the Amazon EC2 and Source From a Project source types.

We are on awx-operator 1.1.3
AWX 21.10.2
AWS EKS v 1.24

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/jobs.py", line 589, in run
    res = receptor_job.run()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/receptor.py", line 319, in run
    res = self._run_internal(receptor_ctl)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/receptor.py", line 394, in _run_internal
    res = processor_future.result()
  File "/usr/lib64/python3.9/concurrent/futures/_base.py", line 446, in result
    return self.__get_result()
  File "/usr/lib64/python3.9/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
  File "/usr/lib64/python3.9/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/utils/common.py", line 1214, in wrapper_cleanup_new_process
    return func(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/receptor.py", line 453, in processor
    return ansible_runner.interface.run(
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/ansible_runner/interface.py", line 210, in run
    r.run()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/ansible_runner/streaming.py", line 254, in run
    self.artifacts_callback(data)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/ansible_runner/streaming.py", line 233, in artifacts_callback
    unstream_dir(self._input, length, self.artifact_dir)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/ansible_runner/utils/streaming.py", line 71, in unstream_dir
    data = source.read(chunk_size)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/ansible_runner/utils/base64io.py", line 282, in read
    results.write(base64.b64decode(data))
  File "/usr/lib64/python3.9/base64.py", line 87, in b64decode
    return binascii.a2b_base64(s)
binascii.Error: Invalid base64-encoded string: number of data characters (247181) cannot be 1 more than a multiple of 4
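The final error is easy to reproduce in isolation: base64 data whose length is 1 more than a multiple of 4 can never be valid, which is what you get when a streamed artifact is cut partway through a chunk. A standalone sketch (not AWX code), just to show the failure mode:

```python
import base64
import binascii

# A valid base64 payload, then simulate a stream that was cut mid-chunk
payload = base64.b64encode(b"some artifact bytes")  # length is a multiple of 4
cut = payload[:len(payload) - 3]                    # length % 4 == 1: never valid

try:
    base64.b64decode(cut)
except binascii.Error as exc:
    # Raises the same "cannot be 1 more than a multiple of 4" error seen above
    print(exc)
```

Dropping 1 or 2 characters instead would give the “Incorrect padding” variant of the error, which also shows up later in this thread.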

The stdout from the inventory sync job actually shows that the inventory is parsed successfully, and in verbose output I can see it updating server group names, but for some reason the inventory job throws an error.


OK, attached is the output. This appears to be from running ansible-galaxy for the project. I see references to an error of “ERROR! Failed to download collection tar from ‘server0’: <urlopen error [Errno -2] Name or service not known”, so it appears to be having problems downloading content from galaxy.ansible.com, but I don’t know why it would. It downloads the project fine from bitbucket, etc. And the play itself seems to run.

(attachments)

failed.json.txt (69.6 KB)

Hello - do you have any updates on the information provided by the other author and myself? Thanks for your help.

I also see this when the inventory sync is scheduled and errors out:

"created_by": "None", "started": "2023-01-23T02:30:16.206523+00:00", "finished": "2023-01-23T02:41:55.925345+00:00", "status": "error", "traceback":

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/jobs.py", line 589, in run
    res = receptor_job.run()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/receptor.py", line 319, in run
    res = self._run_internal(receptor_ctl)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/receptor.py", line 394, in _run_internal
    res = processor_future.result()
  File "/usr/lib64/python3.9/concurrent/futures/_base.py", line 446, in result
    return self.__get_result()
  File "/usr/lib64/python3.9/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
  File "/usr/lib64/python3.9/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/utils/common.py", line 1214, in wrapper_cleanup_new_process
    return func(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/receptor.py", line 453, in processor
    return ansible_runner.interface.run(
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/ansible_runner/interface.py", line 210, in run
    r.run()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/ansible_runner/streaming.py", line 254, in run
    self.artifacts_callback(data)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/ansible_runner/streaming.py", line 233, in artifacts_callback
    unstream_dir(self._input, length, self.artifact_dir)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/ansible_runner/utils/streaming.py", line 76, in unstream_dir
    with zipfile.ZipFile(tmp.name, "r") as archive:
  File "/usr/lib64/python3.9/zipfile.py", line 1266, in __init__
    self._RealGetContents()
  File "/usr/lib64/python3.9/zipfile.py", line 1333, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file
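This variant fails one step later than the base64 error: the artifact stream decoded, but the resulting temp file is not a readable zip archive. A standalone sketch (not AWX code) showing the same failure mode:

```python
import tempfile
import zipfile

# Write garbage where a zip archive is expected, as a truncated or corrupt
# artifact stream would leave behind
with tempfile.NamedTemporaryFile(suffix=".zip") as tmp:
    tmp.write(b"definitely not a zip archive")
    tmp.flush()
    try:
        zipfile.ZipFile(tmp.name, "r")
    except zipfile.BadZipFile as exc:
        print(exc)  # File is not a zip file
```

Both symptoms point at the same underlying problem: the artifact stream coming back from the job pod is getting mangled or truncated before ansible-runner unpacks it.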

Here is another error thrown when trying to sync with the Amazon EC2 source type. I am trying with the new awx-ee that was published in quay (21.11.90) and with latest; both are still throwing these Python errors.

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/jobs.py", line 589, in run
    res = receptor_job.run()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/receptor.py", line 319, in run
    res = self._run_internal(receptor_ctl)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/receptor.py", line 394, in _run_internal
    res = processor_future.result()
  File "/usr/lib64/python3.9/concurrent/futures/_base.py", line 446, in result
    return self.__get_result()
  File "/usr/lib64/python3.9/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
  File "/usr/lib64/python3.9/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/utils/common.py", line 1214, in wrapper_cleanup_new_process
    return func(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/receptor.py", line 453, in processor
    return ansible_runner.interface.run(
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/ansible_runner/interface.py", line 210, in run
    r.run()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/ansible_runner/streaming.py", line 254, in run
    self.artifacts_callback(data)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/ansible_runner/streaming.py", line 233, in artifacts_callback
    unstream_dir(self._input, length, self.artifact_dir)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/ansible_runner/utils/streaming.py", line 71, in unstream_dir
    data = source.read(chunk_size)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/ansible_runner/utils/base64io.py", line 282, in read
    results.write(base64.b64decode(data))
  File "/usr/lib64/python3.9/base64.py", line 87, in b64decode
    return binascii.a2b_base64(s)
binascii.Error: Incorrect padding

I upgraded the awx operator to 1.1.4 and awx 21.11.0 this morning after the recent post. I still am seeing problems on inventory sync jobs - this traceback is what we received AFTER the updates we performed:

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/jobs.py", line 593, in run
    res = receptor_job.run()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/receptor.py", line 319, in run
    res = self._run_internal(receptor_ctl)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/receptor.py", line 394, in _run_internal
    res = processor_future.result()
  File "/usr/lib64/python3.9/concurrent/futures/_base.py", line 446, in result
    return self.__get_result()
  File "/usr/lib64/python3.9/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
  File "/usr/lib64/python3.9/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/utils/common.py", line 1214, in wrapper_cleanup_new_process
    return func(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/receptor.py", line 462, in processor
    return ansible_runner.interface.run(
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/ansible_runner/interface.py", line 211, in run
    r.run()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/ansible_runner/streaming.py", line 254, in run
    self.artifacts_callback(data)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/ansible_runner/streaming.py", line 233, in artifacts_callback
    unstream_dir(self._input, length, self.artifact_dir)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/ansible_runner/utils/streaming.py", line 71, in unstream_dir
    data = source.read(chunk_size)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/ansible_runner/utils/base64io.py", line 282, in read
    results.write(base64.b64decode(data))
  File "/usr/lib64/python3.9/base64.py", line 87, in b64decode
    return binascii.a2b_base64(s)
binascii.Error: Invalid base64-encoded string: number of data characters (453985) cannot be 1 more than a multiple of 4

For now, I am going to use a different EE for the control plane until the latest issues are resolved. This is what I added to my AWX spec:

apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: some name
  namespace: awx
spec:
  image_pull_policy: Always
  control_plane_ee_image: novumrgi/awx-ee:stable

So I rebuilt the database from scratch, which did not change things. Everything I can look at says the playbook runs succeed, but they get marked as failed anyway about 50% of the time. I’m at a loss.

Yeah, sorry folks - we are seeing some other reports that sound similar to what is being reported here: https://github.com/ansible/awx/issues/13469

I plan to investigate this further

Do you see anything in the receptor logs (from awx-ee container) with this message “client rate limiter Wait returned an error: context canceled”?

I don’t see any message like that, but I may not have that specific log, and I’ve torn down the cluster for now. That issue looks pretty much identical to mine, although I haven’t really seen the demo template fail, only the one I have that executes against other systems.

Hi Brian,

Add this to your specs file for the AWX deployment and it should work for now until they fix the awx-ee image.

spec:
  image_pull_policy: Always
  control_plane_ee_image: novumrgi/awx-ee:stable