unarchive always unzips

Is there any reasoning behind the fact that the unarchive module, when using zip, never checks the destination to see whether the files already exist? This is quite inconvenient when extracting very large files.

I understand that tar has a --diff argument which makes this easy, but it would be fairly trivial to implement with Python's zipfile, which is already in use in unarchive.py:

https://github.com/ansible/ansible-modules-core/blob/devel/files/unarchive.py

We could use ZipFile.infolist() to get filenames and file sizes and check them against the destination. There's already a method which decides that zip files are never considered unarchived:

    def is_unarchived(self, mode, owner, group):
        return dict(unarchived=False)
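As a rough sketch of what such a check might look like (function name and layout are mine, not from unarchive.py): compare each member's name and uncompressed size from ZipFile.infolist() against the destination tree, and only report "already unarchived" when everything matches.

```python
import os
import zipfile

def zip_already_extracted(zip_path, dest):
    """Return True if every zip member exists at dest with a matching size."""
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            target = os.path.join(dest, info.filename)
            if info.filename.endswith('/'):
                # Directory entry: only its existence can be checked.
                if not os.path.isdir(target):
                    return False
            elif (not os.path.isfile(target)
                  or os.path.getsize(target) != info.file_size):
                return False
    return True
```

This only proves the plumbing is cheap: infolist() reads the central directory, so no member is decompressed.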

I could definitely contribute if this sounds plausible.

  • t-m

Try adding a creates: clause to the unarchive task; that might do what you want.

Thanks for the answer.

I tried that, but while it doesn't overwrite the existing files, it takes the same amount of time as just overwriting.

It's not that usable when extracting multiple GBs of binaries.

tar, however, skips the whole thing with the --diff argument.

I believe it is just that no one has implemented that yet; however, I vaguely recall looking into it once and finding that it would be harder to implement than I had originally thought. If you take a look at what tar --compare checks (https://www.gnu.org/software/tar/manual/tar.html#SEC66), it seems to be file size, mode, owner, modification date, and contents. IIRC, checking some of those with zip wasn't as easy as looking into data structures available from the ZipFile API (some may have been extensions to the zip standard which would need a fallback, and others may not have existed at all... I can't recall). It would certainly be nice if we could do this, but if zip files don't carry enough information inside to do that, it might be easier to figure out why creates= is too slow for you and fix that.

-Toshio

Ah, checking the creates clause _after_ extracting the file seems like a bug.

That said, mine tend to be around the 50 MB mark - but are you unarchiving from a local file on the node, or directly off the network / control machine?

Even with files that small, up/downloading each time was pretty slow. Instead we do something like:

- name: download kafka {{ kafka_version }}
  get_url: url={{ kafka_tarball_url }}
           dest=/root/{{ kafka_version }}.tgz

- name: extract tarball to {{ kafka_dir }}
  unarchive: src=/root/{{ kafka_version }}.tgz dest=/opt/
             copy=no creates={{ kafka_dir }}

I believe it's because zip doesn't support it the way tar does. So I suppose there is no fix for that other than manually comparing the contents with the Python ZipFile API. Unfortunately, zip lacks most of the Unixy features tar provides, like user/group ownership, permission modes, modification dates, and so on.

So the final question is: do we choose convenience over control? Just check file sizes from the headers and don't extract if they match. Is this reasonable or unacceptable?
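One nuance worth noting here: zip can carry Unix permission bits, but only as a platform extension (the high 16 bits of external_attr, which are meaningful when the entry was created on a Unix system), and it stores no owner/group at all - exactly the "needs a fallback" situation described earlier in the thread. A sketch of reading those bits (the helper name is mine, not from any module):

```python
import stat
import zipfile

def member_mode(info):
    """Return the Unix permission bits for a ZipInfo, or None if absent."""
    if info.create_system != 3:       # 3 == Unix, per the zip appnote
        return None
    mode = info.external_attr >> 16   # high 16 bits carry st_mode on Unix
    return stat.S_IMODE(mode) if mode else None
```

Archives built on Windows or by tools that don't set these bits would return None, so any idempotence check would need a fallback for that case.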

Checking just those two would be too big a change in behavior. We'd need to support at least contents as well. If zip supports permissions as an extension, we should probably support that when present, too.

-Toshio
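On the contents point: a contents check need not re-extract anything, since the zip central directory stores a CRC-32 for every member. A sketch (helper name is mine, not from the module) that checksums the destination file and compares it against the stored value:

```python
import os
import zipfile
import zlib

def contents_match(info, dest_root):
    """Compare a destination file's CRC-32 to the one stored in the zip."""
    target = os.path.join(dest_root, info.filename)
    crc = 0
    with open(target, 'rb') as f:
        # Stream in chunks so multi-GB files don't need to fit in memory.
        for chunk in iter(lambda: f.read(65536), b''):
            crc = zlib.crc32(chunk, crc)
    return (crc & 0xFFFFFFFF) == info.CRC
```

This still reads the whole destination file, but skips decompression and the network transfer, which is where the thread's multi-GB pain comes from.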

I must've misunderstood how 'creates=' is implemented, sorry.

It makes no sense to me that the type of archive would make any difference; surely if a 'guard' is there (one that just checks whether a file path exists), it should prevent the task from running regardless of what the task is?

Can anyone with a better understanding explain this?

I implemented exactly what you needed.

Could you test the following pull request and provide feedback?

     https://github.com/ansible/ansible-modules-core/pull/3307

Next I would like to make unarchive completely idempotent, using the native zipfile and tarfile modules. A lot of the code that makes zip support idempotent can simply be reused; however, it also means we have to implement whatever functionality tar has ourselves (with the added benefit that this functionality will work identically for all other archive formats).
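As an illustration of why the native modules make a cross-format approach feasible: Python's tarfile already exposes, per member, the same fields tar --compare consults (size, mode, owner, modification time), so the comparison logic can live in Python rather than shelling out. A minimal sketch (function name is mine, not from the PR):

```python
import tarfile

def describe(tar_path):
    """List (name, size, mode, uid, gid, mtime) for every tar member."""
    with tarfile.open(tar_path) as tf:
        return [(m.name, m.size, m.mode, m.uid, m.gid, m.mtime)
                for m in tf.getmembers()]
```

The same tuple-per-member shape could then feed one shared "is unarchived?" routine for both tar and zip, with zip filling in None where its format has no data.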

So first things first, improve the zip support as-is.

More information in the PR description linked above.