Connection plugin with slow _connect

Hi all,

I'm working on a guestfs [1] connection plugin and looking for design
advice.

libguestfs provides a set of command line tools that can be used to
operate on virtual machine disk images and modify their contents.

For every task, the connection plugin does the following (a rough code
sketch follows this list):

  1. Starts guestfish in --remote mode on a remote host over ssh and adds
     a disk (passed as a parameter to the guestfs connection).

  2. Runs the supermin appliance [2][3]. It typically takes two to four
     seconds to spin up the appliance VM.

  3. Mounts the root filesystem partition (the partition number is passed as
     a parameter to the guestfs connection)

  4. Performs the task:

     Some implementation details:
     - put_file/fetch_file are implemented using the copy-in/copy-out [4][5]
       guestfish commands

     - there's an intermediate copy to/from the remote host over ssh (to
       enable remote guestfs operation)

     - exec_command is implemented using the "command" [6] guestfish command

  5. Stops the supermin appliance / guestfish instance
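
To make the per-task flow concrete, here is a rough sketch of the steps
above. This is not the actual plugin code: the _ssh() helper and the
run_task() function are illustrative placeholders, and only the guestfish
--listen/--remote usage itself comes from the guestfish man page.

    import shlex
    import subprocess

    def _ssh(host, cmd):
        """Run a shell command on the remote hypervisor and return stdout."""
        return subprocess.run(["ssh", host, cmd], check=True,
                              capture_output=True, text=True).stdout

    def run_task(host, disk_path, root_partnum, task_cmd):
        # 1./2. Start a remote guestfish instance, add the disk and launch
        #       the supermin appliance (the 2-4 second delay happens here).
        out = _ssh(host, "guestfish --listen")
        pid = out.split("GUESTFISH_PID=")[1].split(";")[0]
        gf = "guestfish --remote=" + pid
        _ssh(host, gf + " add-drive " + shlex.quote(disk_path))
        _ssh(host, gf + " run")
        # 3. Mount the root filesystem partition given as a connection option
        #    (guestfish exposes the first added disk as /dev/sda).
        _ssh(host, gf + " mount /dev/sda" + str(root_partnum) + " /")
        # 4. Perform the task, e.g. exec_command via the "command" verb;
        #    put_file/fetch_file would use copy-in/copy-out plus an
        #    intermediate scp to/from the remote host.
        result = _ssh(host, gf + " command " + shlex.quote(task_cmd))
        # 5. Stop the appliance / guestfish instance.
        _ssh(host, gf + " exit")
        return result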

Here's an example of how it looks in a playbook:

    - name: Add disk image to inventory
      add_host:
        name: "{{ vm_disk_path }}"
        ansible_host: "{{ ansible_host }}"
        ansible_connection: guestfs
        ansible_guestfs_disk_path: "{{ vm_disk_path }}"
        ansible_guestfs_root_partnum: "{{ root_partnum }}"
      changed_when: false

    - name: Test guestfs
      ping:
      delegate_to: "{{ vm_disk_path }}"

The ping command is performed using the execution environment from
within the disk image on the remote host:

  TASK [Add disk image to inventory] ******************************************
  ok: [remote-hypervisor]

  TASK [Test guestfs] *********************************************************
  ok: [remote-hypervisor -> /home/user/test.qcow2]

Likewise, a role can be delegated to the guestfs disk image.

The problem is that _connect() spins up the supermin VM for every task and
stops it afterwards. So it takes at least two seconds just to perform
_connect(). Obviously this is very slow for plays with a lot of tasks and
roles.

The question is: how can this be optimized to avoid the costly _connect()
caused by the appliance start?

I'm considering the following approaches:

1. Introduce a separate module that starts or stops the guestfs appliance
   and remove that logic from the connection plugin (a rough module
   skeleton is sketched after this list)

   Pros: similar to the lxd, docker, and virt connections, which have
         separate tasks for starting/stopping the containers/VMs
   Cons: extra tasks need to be added to every play to start/stop guestfish

2. Add a separate meta task that closes the connection, plus a connection
   flag so that guestfish is not stopped after the first task

   The meta task 'close_connection' could be added either as a separate
   module or as an extension to the builtin meta module.

   Cons:
     - it looks flaky: guestfish might unintentionally be left running
       somewhere in the middle of the play in case of an error. Extra
       care (i.e. blocks) might be needed to always close the guestfs
       connection.

3. Extend the persistent connection framework [7]. There could be a new
   mode that keeps the connection open for a sequence of tasks running on
   the same connection, without an explicit timeout. This mode would look
   like this:

   task 1 on a guestfs connection - implicit _connect
   task 2 on the same guestfs connection - no _connect
   ...
   task n on the same guestfs connection - no _connect
   task z on any other connection, or the end of play - implicit close()
                                                        of the guestfs connection

   Pros: reliable, tidy - no need for extra tasks/blocks
   Cons:
     - need to modify ansible core (task_executor, etc.) :-)
     - not sure whether Ansible is able to persist connections across roles
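
For approach 1, the start/stop module could be a thin wrapper around the
same guestfish remote-control commands. Below is a rough skeleton of such a
module; the module name (guestfs_appliance), its options and its return
value are hypothetical illustrations, not an existing module.

    #!/usr/bin/python
    # Hypothetical "guestfs_appliance" module skeleton:
    #   state=started launches guestfish in remote-control mode on the
    #   target host and returns its PID; state=stopped tells it to exit.
    from ansible.module_utils.basic import AnsibleModule

    def main():
        module = AnsibleModule(
            argument_spec=dict(
                disk_path=dict(type='path'),
                state=dict(type='str', default='started',
                           choices=['started', 'stopped']),
                guestfish_pid=dict(type='int'),
            ),
            required_if=[('state', 'started', ['disk_path']),
                         ('state', 'stopped', ['guestfish_pid'])],
        )
        if module.params['state'] == 'started':
            rc, out, err = module.run_command(
                ['guestfish', '--listen', '-a', module.params['disk_path']])
            if rc != 0:
                module.fail_json(msg='failed to start guestfish', stderr=err)
            # Output looks like: GUESTFISH_PID=1234; export GUESTFISH_PID
            pid = int(out.split('GUESTFISH_PID=')[1].split(';')[0])
            module.exit_json(changed=True, guestfish_pid=pid)
        else:
            rc, out, err = module.run_command(
                ['guestfish', '--remote=%d' % module.params['guestfish_pid'],
                 'exit'])
            module.exit_json(changed=(rc == 0))

    if __name__ == '__main__':
        main()

A play would then have to register the returned guestfish_pid and hand it
to the connection (or to the stop task), which is exactly the extra
bookkeeping listed as a con above.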

Looking forward to feedback on which of the approaches is the most
solid/sane.

1. https://libguestfs.org
2. https://libguestfs.org/guestfs-internals.1.html#architecture
3. https://libguestfs.org/supermin.1.html
4. https://libguestfs.org/guestfish.1.html#copy-in
5. https://libguestfs.org/guestfish.1.html#copy-out
6. https://libguestfs.org/guestfish.1.html#command
7. https://www.ansible.com/deep-dive-with-network-connection-plugins

Thanks,
Roman

> Hi all,
>
> I'm working on a guestfs [1] connection plugin and looking for design
> advice.
>
> libguestfs provides a set of command line tools that can be used to
> operate on virtual machine disk images and modify their contents.
>
> For every task, the connection plugin:
>
> 1. Starts guestfish in --remote mode on a remote host over ssh and adds
> a disk (passed as a parameter to the guestfs connection).
>
> 2. Runs the supermin appliance [2][3]. It typically takes two to four
> seconds to spin up the appliance VM.

Depending on the target, simply running something like

  guestfish -a /dev/null run

will create and cache an appliance in /var/tmp/.guestfs-$UID/ (and
it's safe if two processes run in parallel). Once the appliance is
cached, new libguestfs instances will use the cached appliance without
any delay.

Doesn't this mechanism work?

Hi Rich,

Appliance caching indeed works. If I remove the cache, it takes around 20
seconds to rebuild a new appliance, which is then used by new libguestfs
instances.

I was rather talking about the inherent latency caused by the instance/VM
start. In the current implementation of the guestfs plugin, the appliance
is started before each task and stopped afterwards.

My intent is to find a way to run multiple Ansible tasks on the same
libguestfs instance. That saves 2-4 seconds per task.

Nevertheless, for virt-v2v we have something similar, because virt-v2v
is a long-running process that we want to start and query status from.
My colleague wrote a wrapper (essentially a sort of daemon) which manages
virt-v2v, and which I guess may be useful to look at:

https://github.com/ManageIQ/manageiq-v2v-conversion_host/tree/master/wrapper

I'm doing something similar, except I'm running guestfish --remote under
nohup, remembering its PID and then interacting with it. If we find a way
to pass the PID associated with a connection from task to task in Ansible,
and to kill it when it's no longer needed (to be able to start a real VM
with the disk image), then we can achieve very fast and reliable task
execution on the disk images.
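
A minimal sketch of that mechanism, assuming a per-disk PID file kept on
the controller; the file location and helper names are my own illustration,
and only the guestfish --listen/--remote/exit usage comes from the man page.

    import hashlib
    import os
    import shlex
    import subprocess

    def _ssh(host, cmd):
        """Run a shell command on the remote hypervisor and return stdout."""
        return subprocess.run(["ssh", host, cmd], check=True,
                              capture_output=True, text=True).stdout

    def _pid_file(disk_path):
        # One small state file per disk image, kept on the controller.
        key = hashlib.sha256(disk_path.encode()).hexdigest()[:16]
        return os.path.expanduser("~/.ansible/guestfs-%s.pid" % key)

    def get_or_start(host, disk_path):
        """Return the PID of the remote guestfish, starting it if needed."""
        path = _pid_file(disk_path)
        if os.path.exists(path):
            # A real implementation would also verify the PID is still alive.
            with open(path) as f:
                return int(f.read())
        out = _ssh(host, "nohup guestfish --listen -a %s < /dev/null"
                         % shlex.quote(disk_path))
        pid = int(out.split("GUESTFISH_PID=")[1].split(";")[0])
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w") as f:
            f.write(str(pid))
        return pid

    def shutdown(host, disk_path):
        """Stop the remote guestfish, e.g. before booting a real VM."""
        path = _pid_file(disk_path)
        if os.path.exists(path):
            with open(path) as f:
                pid = int(f.read())
            _ssh(host, "guestfish --remote=%d exit" % pid)
            os.unlink(path)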

Thanks,
Roman

> Appliance caching indeed works. If I remove the cache, it takes around 20
> seconds to rebuild a new appliance, which is then used by new libguestfs
> instances.
>
> I was rather talking about the inherent latency caused by the instance/VM
> start. In the current implementation of the guestfs plugin, the appliance
> is started before each task and stopped afterwards.

Oh I see, yes that's right.

> My intent is to find a way to run multiple Ansible tasks on the same
> libguestfs instance. That saves 2-4 seconds per task.

You shouldn't really reuse the same appliance across trust boundaries
(e.g. if processing two disks which are owned by different tenants of
your cloud), since it means one tenant would be able to interfere with
or extract secrets from the other tenant. The 2-4 seconds is the price
you pay here, I'm afraid :-/

If all the disks you are processing are owned by the same tenant, then
there's no worry about security.

Right, I'm only trying to optimize access to the same disk by a set of
consecutive Ansible tasks in the same playbook (a disk typically belonging
to a VM owned by a specific user), so the trust boundaries are preserved.

Thanks,
Roman

> I'm considering the following approaches:
>
> 1. Introduce a separate module that starts or stops the guestfs appliance
>    and remove that logic from the connection plugin
>
>    Pros: similar to the lxd, docker, and virt connections, which have
>          separate tasks for starting/stopping the containers/VMs
>    Cons: extra tasks need to be added to every play to start/stop guestfish

Brian Coca from the Ansible team said on IRC that this is the way to go.

I'm all set,
Thanks!