Question about receptor service (why is it missing in awx-task)

Hi,

Out of technical curiosity, I am looking into whether I can get an execution/hop node working on AWX.
I thought this wouldn't be that hard:
just overwrite /etc/receptor/receptor.conf with your own version and run some awx-manage commands lifted from the docker-compose guide.

Correct me if I am wrong:

To my surprise, there is no receptor service running in the awx-task container on K8S:

https://github.com/ansible/awx/blob/devel/tools/ansible/roles/dockerfile/templates/supervisor_task.conf.j2

But in the docker-compose dev guide there is a receptor service:

https://github.com/ansible/awx/blob/devel/tools/docker-compose/supervisor.conf

[program:awx-receptor]
command = receptor --config /etc/receptor/receptor.conf

So why is this missing in the K8S version? Is the “official” answer that I need to rebuild the containers with receptor in them? I can't understand why this just isn't implemented in the AWX (k8s) version. Am I missing something? It is strange that the /etc/receptor/receptor.conf config is present.

I believe it is not needed for inter-K8S communication? But that limits the functionality it could have (for the technically curious).

Kind regards

Stefan

Hi,

Just found that the receptor service is in the awx-ee container. When overmounting the /etc/receptor/receptor.conf file with my own version, I am able to receptorctl ping my “hop node”. Still need to work some things out.
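For anyone following along, the connectivity check mentioned here is roughly the following (the control-socket path is an assumption based on the default awx-ee receptor config; adjust both values to match yours):

```shell
# From inside the awx-ee container; socket path and node name are placeholders.
receptorctl --socket /var/run/receptor/receptor.sock ping hop-node
```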

We don’t have an official method for setting this up at the moment, but it’s technically feasible. I’m really curious how this turns out; please let us know your findings :slight_smile:

You’ll certainly have to use the management command provision_instance to populate the db with information about your remote instance.
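The docker-compose guide drives this with awx-manage; a rough sketch (the hostname, node type, and queue name below are placeholders, and exact flags may differ by AWX version):

```shell
# Run inside the awx-task pod (values are placeholders for your environment).
awx-manage provision_instance --hostname=remote-exec-1 --node_type=execution
awx-manage register_queue --queuename=remote --hostnames=remote-exec-1
```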

You may need to look into the work-signing field in receptor configs.

Seth

As promised, I have (somewhat incoherently) written down my progress.

https://malfunceddie.github.io/article/awx-hop-nodes/hopnodes/

Hope someone can give me some pointers.

Kind regards,

Nice write-up. For the very last step, did you disable the control plane node instance?

The control plane node instance still needs to be active, because it still plays a role in preparing the task to run on the remote execution node.

So keep your control node instance enabled. To force the task to run on the execution node, create a job template and, for the instance group, choose the “remote” group (where your execution node is located).

Also, I see your comments about work-signing keys; yes, you probably need them. They are pretty easy to configure.

Your control node receptor config should have:

- work-signing:
    privatekey: /etc/receptor/work_private_key.pem
    tokenexpiration: 1m

- work-verification:
    publickey: /etc/receptor/work_public_key.pem

Your execution node should have:

- work-verification:
    publickey: /etc/receptor/work_public_key.pem

To create the keys, just run openssl commands; you can see them invoked in the plays here:
https://github.com/ansible/awx/blob/a86740c3c9eaf9a551e850341d8adec5a3962dd5/tools/docker-compose/ansible/roles/sources/tasks/main.yml#L84
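A minimal sketch of that key generation (filenames match the config above; the 4096-bit key size is an assumption, check the linked play for the exact invocation):

```shell
# Generate an RSA keypair for receptor work signing.
# The private key stays on the control node; the public key is
# distributed to nodes that verify signed work.
openssl genrsa -out work_private_key.pem 4096
openssl rsa -in work_private_key.pem -pubout -out work_public_key.pem
```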

Hi,

I did not delete the control plane node and I see it in the Instance Groups list with the awx node as an instance.

I also added the work-signing and work-verification but I still get the same error message.

2022-05-09 19:28:57,748 DEBUG [90cb4ebd2f4e495d8181bbd5cd592e50] awx.main.scheduler Running task manager.
2022-05-09 19:28:57,753 DEBUG [90cb4ebd2f4e495d8181bbd5cd592e50] awx.main.tasks.system Last scheduler run was: 2022-05-09 19:28:27.211789+00:00
2022-05-09 19:28:57,809 DEBUG [90cb4ebd2f4e495d8181bbd5cd592e50] awx.main.dispatch task 9fe0c975-b200-4dc9-931f-97baacdfd748 starting awx.main.analytics.analytics_tasks.send_subsystem_metrics(*)
2022-05-09 19:28:57,820 DEBUG [90cb4ebd2f4e495d8181bbd5cd592e50] awx.main.scheduler Starting Scheduler
2022-05-09 19:28:57,911 DEBUG [90cb4ebd2f4e495d8181bbd5cd592e50] awx.main.scheduler Skipping group remote, task cannot run on control plane
2022-05-09 19:28:57,912 DEBUG [90cb4ebd2f4e495d8181bbd5cd592e50] awx.analytics.job_lifecycle adhoccommand-14300 needs capacity

2022-05-09 19:28:57,912 DEBUG [90cb4ebd2f4e495d8181bbd5cd592e50] awx.main.scheduler ad_hoc_command 14300 (pending) couldn't be scheduled on graph, waiting for next cycle

receptor.conf and keys.
bash-4.4$ ls
receptor.conf work-private-key.pem work-public-key.pem
bash-4.4$ cat receptor.conf

I think I am running into this:

https://github.com/ansible/awx/commit/f850f8d3e0ebc97e3efed7d5001395bae85c997b

This means that when IS_K8S=True, any Job Template associated with an Instance Group will never actually go from pending → running (because there’s no capacity - all playbooks must run through Container Groups). For an improved ux, our intention is to introduce logic into the operator install process such that the default group that’s created at install time is a Container Group that’s configured to point at the K8S cluster where awx itself is deployed.

It looks like this global flag completely blocks running execution/hop/hybrid nodes on AWX, since “only container groups are intended to be used”.
I tried editing the awx-awx-configmap and setting IS_K8S=False.
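One way to make that edit, sketched with kubectl (the namespace, ConfigMap name, and deployment name are assumptions from a default operator install, and this is not an officially supported change):

```shell
# Edit the settings ConfigMap and flip IS_K8S = True to False,
# then restart the AWX pods so they pick up the change.
kubectl -n awx edit configmap awx-awx-configmap
kubectl -n awx rollout restart deployment awx
```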

Result:

The control node is now a hybrid node? But my job is trying to run on the execution node :open_mouth:

I still got an error (no output from the job) but:

WARNING 2022/05/10 08:31:36 Received unreachable message from axw-ee
WARNING 2022/05/10 08:31:36 Received unreachable message from axw-ee
ERROR 2022/05/10 08:31:36 Error locating unit: VECXTy1J
ERROR 2022/05/10 08:31:36 unknown work unit VECXTy1J

Will try to debug coming nights.

Hi,

https://github.com/ansible/awx/blob/78660ad0a22798ace9210b37ba5be9429603e49d/awx/main/tasks/jobs.py#L153

Since I set IS_K8S to false, this code path applies:

params = {
    "container_image": image,
    "process_isolation": True,
    "process_isolation_executable": "podman",  # need to provide, runner enforces default via argparse
    "container_options": ["--user=root"],
}

It requires podman for process isolation. This concurs with the logs.

2022-05-24 09:57:55,367 DEBUG [40bb4e64544740da86feaa7e786076b0] awx.main.tasks.jobs ad_hoc_command 14398 (running) finished running, producing 0 events.
Unable to find process isolation executable: podman

The strange thing is that podman is installed on my execution node, so I am not sure why it is saying this.

It could also be that this is on the pod? I will look into what happens when I set DEFAULT_CONTAINER_RUN_OPTIONS = ['--process_isolation', False]

Hello Stefan,
I'm stuck in a situation identical to yours.
When I start a job on the AWX control node, I see these messages in the log of the receptor node (built with docker-compose):

DEBUG 2023/05/17 15:40:17 Client connected to control service awx-ee:lsqpVoOx
DEBUG 2023/05/17 15:40:17 Client connected to control service awx-ee:8CacMHFc
ERROR 2023/05/17 15:40:17 Error locating unit: emZ4a2SV
ERROR 2023/05/17 15:40:17 unknown work unit emZ4a2SV
WARNING 2023/05/17 15:40:18 Received unreachable message from awx-ee
WARNING 2023/05/17 15:40:18 remote service 8CacMHFc to node awx-ee is unreachable
WARNING 2023/05/17 15:40:18 Received unreachable message from awx-ee
WARNING 2023/05/17 15:40:18 Received unreachable message from awx-ee
WARNING 2023/05/17 15:40:18 Received unreachable message from awx-ee

In the log of the job shown in the AWX web interface I get the following message:

WARN[0000] "/" is not a shared mount, this could cause issues or missing mounts with rootless containers
Error: mounting overlay failed "/etc/pki/ca-trust": chown /var/lib/awx/.local/share/containers/storage/overlay-containers/53533cf575c7f9852a687b51ddfef3d9f3c0a210ef9375fc0f021573024fa261/userdata/overlay/805202068/upper: invalid argument

Did you manage to solve it?
What am I missing?

Thank you in advance,
Emanuele

This ended up being a major feature we released in AWX 21.7.0.

Please see these docs I wrote up on adding a remote execution node to the k8s control plane:

https://github.com/ansible/awx/blob/devel/docs/execution_nodes.md

What steps did you take to deploy the remote node?

AWX Team

I found out what the problem was: in the AWX control plane, under Settings → Job Settings, I had this pair of mounts:
[
  "/etc/pki/ca-trust:/etc/pki/ca-trust:O",
  "/usr/share/pki:/usr/share/pki:O"
]
The :O option was not accepted by podman on the receptor node; removing it, or substituting an option that podman does accept, resolved the problem.
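For example (the read-only suffix here is an assumption; any option the node's podman accepts would do), the corrected pair could look like:

```json
[
  "/etc/pki/ca-trust:/etc/pki/ca-trust:ro",
  "/usr/share/pki:/usr/share/pki:ro"
]
```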
The remote receptor node was deployed with docker-compose, following this link:
https://github.com/ansible/awx/blob/devel/tools/docker-compose/README.md

Thank you for the docs, I'll follow them; they seem much simpler.
Emanuele

Hi,

I get the same errors as in the logs above.

WARNING 2023/09/12 08:38:45 Backend connection failed (will retry): dial tcp: lookup tools_receptor_hop on 127.0.0.11:53: server misbehaving
WARNING 2023/09/12 08:39:05 Backend connection failed (will retry): dial tcp: lookup tools_receptor_hop on 127.0.0.11:53: server misbehaving
WARNING 2023/09/12 08:39:25 Backend connection failed (will retry): dial tcp: lookup tools_receptor_hop on 127.0.0.11:53: server misbehaving
WARNING 2023/09/12 08:39:46 Backend connection failed (will retry): dial tcp: lookup tools_receptor_hop on 127.0.0.11:53: server misbehaving
WARNING 2023/09/12 08:40:06 Backend connection failed (will retry): dial tcp: lookup tools_receptor_hop on 127.0.0.11:53: server misbehaving
WARNING 2023/09/12 08:40:26 Backend connection failed (will retry): dial tcp: lookup tools_receptor_hop on 127.0.0.11:53: server misbehaving
WARNING 2023/09/12 08:40:46 Backend connection failed (will retry): dial tcp: lookup tools_receptor_hop on 127.0.0.11:53: server misbehaving
WARNING 2023/09/12 08:41:06 Backend connection failed (will retry): dial tcp: lookup tools_receptor_hop on 127.0.0.11:53: server misbehaving
WARNING 2023/09/12 08:41:26 Backend connection failed (will retry): dial tcp: lookup tools_receptor_hop on 127.0.0.11:53: server misbehaving
WARNING 2023/09/12 08:41:46 Backend connection failed (will retry): dial tcp: lookup tools_receptor_hop on 127.0.0.11:53: server misbehaving
WARNING 2023/09/12 08:42:06 Backend connection failed (will retry): dial tcp: lookup tools_receptor_hop on 127.0.0.11:53: server misbehaving
INFO 2023/09/12 08:42:26 Connection established with receptor-hop
INFO 2023/09/12 08:42:26 Known Connections:
INFO 2023/09/12 08:42:26 receptor-1: receptor-hop(1.00)
INFO 2023/09/12 08:42:26 receptor-hop: awx_1(1.00) receptor-1(1.00) receptor-2(1.00)
INFO 2023/09/12 08:42:26 receptor-2: receptor-hop(1.00)
INFO 2023/09/12 08:42:26 awx_1: receptor-hop(1.00)
INFO 2023/09/12 08:42:26 Routing Table:
INFO 2023/09/12 08:42:26 awx_1 via receptor-hop
INFO 2023/09/12 08:42:26 receptor-hop via receptor-hop
INFO 2023/09/12 08:42:26 receptor-2 via receptor-hop
time="2023-09-12T08:43:37Z" level=error msg="Joining network namespace for container 86c0c9321344384f3f8aa8786d4155b5c1949f1095202550d421ab6c08c59d5f: error retrieving network namespace at /run/netns/netns-7c08be18-bb60-b360-0e1e-74e298145251: unknown FS magic on \"/run/netns/netns-7c08be18-bb60-b360-0e1e-74e298145251\": 794c7630"
time="2023-09-12T08:43:37Z" level=error msg="Joining network namespace for container 9df3294dd4f9e1dab30de3624bdd05f39ed6a9155aad782b5461ac8455b9e925: error retrieving network namespace at /run/netns/netns-4bc93030-52fe-e2ce-8314-c081f0f50402: unknown FS magic on \"/run/netns/netns-4bc93030-52fe-e2ce-8314-c081f0f50402\": 794c7630"
time="2023-09-12T08:43:37Z" level=error msg="Joining network namespace for container a9a589fa03b535e3904a854ef8add36c3f455b02d54add814b4a82dcb59616e7: error retrieving network namespace at /run/netns/netns-53503e17-b868-32ce-4eab-cf8a09e3b550: unknown FS magic on \"/run/netns/netns-53503e17-b868-32ce-4eab-cf8a09e3b550\": 794c7630"
stopped 86c0c9321344384f3f8aa8786d4155b5c1949f1095202550d421ab6c08c59d5f
cannot stop container 86c0c9321344384f3f8aa8786d4155b5c1949f1095202550d421ab6c08c59d5f: error joining network namespace of container 86c0c9321344384f3f8aa8786d4155b5c1949f1095202550d421ab6c08c59d5f: error retrieving network namespace at /run/netns/netns-7c08be18-bb60-b360-0e1e-74e298145251: unknown FS magic on "/run/netns/netns-7c08be18-bb60-b360-0e1e-74e298145251": 794c7630
INFO 2023/09/12 08:43:37 Connection established with receptor-hop
INFO 2023/09/12 08:43:37 Known Connections:
INFO 2023/09/12 08:43:37 receptor-1: receptor-hop(1.00)
INFO 2023/09/12 08:43:37 receptor-hop: awx_1(1.00) receptor-1(1.00)
INFO 2023/09/12 08:43:37 Routing Table:
INFO 2023/09/12 08:43:37 receptor-hop via receptor-hop
INFO 2023/09/12 08:43:37 Running control service control
INFO 2023/09/12 08:43:37 Initialization complete
INFO 2023/09/12 08:43:37 Known Connections:
INFO 2023/09/12 08:43:37 receptor-hop: receptor-2(1.00) awx_1(1.00) receptor-1(1.00)
INFO 2023/09/12 08:43:37 receptor-2: receptor-hop(1.00)
INFO 2023/09/12 08:43:37 receptor-1: receptor-hop(1.00)
INFO 2023/09/12 08:43:37 Routing Table:
INFO 2023/09/12 08:43:37 receptor-hop via receptor-hop
INFO 2023/09/12 08:43:37 receptor-2 via receptor-hop
ERROR 2023/09/12 08:43:42 Error translating data to message struct: hash not found
ERROR 2023/09/12 08:43:43 Error translating data to message struct: hash not found
ERROR 2023/09/12 08:43:43 Error translating data to message struct: hash not found
ERROR 2023/09/12 08:43:43 Error translating data to message struct: hash not found
ERROR 2023/09/12 08:43:43 Error translating data to message struct: hash not found
ERROR 2023/09/12 08:43:44 Error translating data to message struct: hash not found
ERROR 2023/09/12 08:43:44 Error translating data to message struct: hash not found
ERROR 2023/09/12 08:43:45 Error translating data to message struct: hash not found
ERROR 2023/09/12 08:43:45 Error translating data to message struct: hash not found
INFO 2023/09/12 08:43:46 Known Connections:
INFO 2023/09/12 08:43:46 receptor-1: receptor-hop(1.00)
INFO 2023/09/12 08:43:46 receptor-hop: awx_1(1.00) receptor-1(1.00) receptor-2(1.00)
INFO 2023/09/12 08:43:46 receptor-2: receptor-hop(1.00)
INFO 2023/09/12 08:43:46 awx_1: receptor-hop(1.00)
INFO 2023/09/12 08:43:46 Routing Table:
INFO 2023/09/12 08:43:46 receptor-hop via receptor-hop
INFO 2023/09/12 08:43:46 receptor-2 via receptor-hop
INFO 2023/09/12 08:43:46 awx_1 via receptor-hop
ERROR 2023/09/12 08:43:51 Error locating unit: cBA0NTcI
ERROR 2023/09/12 08:43:51 unknown work unit cBA0NTcI
time="2023-09-12T08:44:07Z" level=error msg="Joining network namespace for container 86c0c9321344384f3f8aa8786d4155b5c1949f1095202550d421ab6c08c59d5f: error retrieving network namespace at /run/netns/netns-7c08be18-bb60-b360-0e1e-74e298145251: unknown FS magic on \"/run/netns/netns-7c08be18-bb60-b360-0e1e-74e298145251\": 794c7630"
time="2023-09-12T08:44:07Z" level=error msg="Joining network namespace for container 9df3294dd4f9e1dab30de3624bdd05f39ed6a9155aad782b5461ac8455b9e925: error retrieving network namespace at /run/netns/netns-4bc93030-52fe-e2ce-8314-c081f0f50402: unknown FS magic on \"/run/netns/netns-4bc93030-52fe-e2ce-8314-c081f0f50402\": 794c7630"
time="2023-09-12T08:44:07Z" level=error msg="Joining network namespace for container a9a589fa03b535e3904a854ef8add36c3f455b02d54add814b4a82dcb59616e7: error retrieving network namespace at /run/netns/netns-53503e17-b868-32ce-4eab-cf8a09e3b550: unknown FS magic on \"/run/netns/netns-53503e17-b868-32ce-4eab-cf8a09e3b550\": 794c7630"
stopped 86c0c9321344384f3f8aa8786d4155b5c1949f1095202550d421ab6c08c59d5f
cannot stop container 86c0c9321344384f3f8aa8786d4155b5c1949f1095202550d421ab6c08c59d5f: error joining network namespace of container 86c0c9321344384f3f8aa8786d4155b5c1949f1095202550d421ab6c08c59d5f: error retrieving network namespace at /run/netns/netns-7c08be18-bb60-b360-0e1e-74e298145251: unknown FS magic on "/run/netns/netns-7c08be18-bb60-b360-0e1e-74e298145251": 794c7630
WARNING 2023/09/12 08:44:07 Backend connection failed (will retry): dial tcp: lookup tools_receptor_hop on 127.0.0.11:53: server misbehaving
INFO 2023/09/12 08:44:07 Running control service control
INFO 2023/09/12 08:44:07 Initialization complete
INFO 2023/09/12 08:44:12 Connection established with receptor-hop
INFO 2023/09/12 08:44:12 Known Connections:
INFO 2023/09/12 08:44:12 receptor-1: receptor-hop(1.00)
INFO 2023/09/12 08:44:12 receptor-hop: receptor-1(1.00)
INFO 2023/09/12 08:44:12 Routing Table:
INFO 2023/09/12 08:44:12 receptor-hop via receptor-hop
INFO 2023/09/12 08:44:12 Known Connections:
INFO 2023/09/12 08:44:12 receptor-hop: awx_1(1.00) receptor-1(1.00)
INFO 2023/09/12 08:44:12 receptor-1: receptor-hop(1.00)
INFO 2023/09/12 08:44:12 Routing Table:
INFO 2023/09/12 08:44:12 receptor-hop via receptor-hop
INFO 2023/09/12 08:44:13 Known Connections:
INFO 2023/09/12 08:44:13 receptor-1: receptor-hop(1.00)
INFO 2023/09/12 08:44:13 receptor-hop: awx_1(1.00) receptor-1(1.00) receptor-2(1.00)
INFO 2023/09/12 08:44:13 receptor-2: receptor-hop(1.00)
INFO 2023/09/12 08:44:13 Routing Table:
INFO 2023/09/12 08:44:13 receptor-2 via receptor-hop
INFO 2023/09/12 08:44:13 receptor-hop via receptor-hop
INFO 2023/09/12 08:44:18 Known Connections:
INFO 2023/09/12 08:44:18 receptor-1: receptor-hop(1.00)
INFO 2023/09/12 08:44:18 receptor-hop: awx_1(1.00) receptor-1(1.00) receptor-2(1.00)
INFO 2023/09/12 08:44:18 receptor-2: receptor-hop(1.00)
INFO 2023/09/12 08:44:18 awx_1: receptor-hop(1.00)
INFO 2023/09/12 08:44:18 Routing Table:
INFO 2023/09/12 08:44:18 receptor-hop via receptor-hop
INFO 2023/09/12 08:44:18 receptor-2 via receptor-hop
INFO 2023/09/12 08:44:18 awx_1 via receptor-hop
WARNING 2023/09/12 08:46:14 Received unreachable message from awx_1
WARNING 2023/09/12 08:46:14 Received unreachable message from awx_1
ERROR 2023/09/12 08:47:21 Error locating unit: W7taWtHZ
ERROR 2023/09/12 08:47:21 unknown work unit W7taWtHZ
When I start a job, the CPU usage in the receptor_1 container goes up to 400% and the jobs take very long.

Can you help me fix this situation?

I am also sending the topology of our AWX environment.
I would be very grateful if you can help me.

On Thursday, May 18, 2023, at 04:44:19 UTC+3, AWX Project wrote:


Hi,

Can you help us?

On Tuesday, September 12, 2023, at 12:27:57 UTC+3, hulya kayikci wrote: