Very Inefficient Remote Copy

The copy module seems to always copy the file to the remote host, even
if it is not necessary. Example:

Part of my play:
  - name: copy installation archive file
    action: copy src=files/foo.tar.gz dest=/var/cache/ansible/foo.tar.gz owner=root group=root mode=0644

Output (first run):
  TASK: [copy installation archive file] *********
  ok: [myhost] => copy src=/var/tmp/ansible.4Po0nW/foo.tar.gz dest=/var/cache/ansible/foo.tar.gz

Output (second run - the dest file already exists from the first run):
  ok: [myhost] => copy src=/var/tmp/ansible.6xPjhj/foo.tar.gz dest=/var/cache/ansible/foo.tar.gz

Because the file is quite big, this takes a very long time!

Similar behavior with "only_if: False": it takes a long time (probably
copying the file over the network) and only then(!!!) does it skip,
with the following output:
  TASK: [copy installation archive file] *********
  skipping: [myhost]
  err: [myhost] => NoneNone

Is this a bug, or do I have to use it in a different way?

Frank

I've noticed this as well (although I usually copy moderately small
files) and it slows down re-running of plays rather dramatically.

Also, as an addendum (could be another email thread): is recursive
directory copy something that's in the works, or just not in the docs
yet?

Thanks!

Q1: Having it take a remote md5sum before deciding to do the remote copy, at least when the file size is above a certain threshold, is reasonable.
(It’s already smart enough not to replace the file on the remote end, but that’s beside the point.)
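
A rough sketch of what that pre-check could look like, expressed as plain shell (the host and paths are just the ones from the example above; the size-threshold logic is left out):

  # hypothetical pre-check before pushing the file (not what the module does today)
  local_md5=$(md5sum files/foo.tar.gz | cut -d' ' -f1)
  remote_md5=$(ssh myhost "md5sum /var/cache/ansible/foo.tar.gz 2>/dev/null | cut -d' ' -f1")
  if [ "$local_md5" = "$remote_md5" ]; then
      echo "checksums match - no transfer needed"
  else
      echo "checksums differ - do the copy"
  fi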

Q2: Not in the works that I know of; it would be nice.

Patches accepted for both.
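
Until something like recursive copy exists, one workaround is to ship the directory as a tarball and unpack it remotely. A sketch using the existing copy and shell actions (the paths are made up, and creates= assumes your version supports it):

  - name: copy the directory as a tarball
    action: copy src=files/mydir.tar.gz dest=/tmp/mydir.tar.gz owner=root group=root mode=0644

  - name: unpack it on the remote host (creates= skips this once /opt/mydir exists)
    action: shell tar xzf /tmp/mydir.tar.gz -C /opt creates=/opt/mydir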

I believe you’ve previously thrown some cold water on the idea of incorporating rsync into ansible’s core functionality, and I don’t doubt that there are good reasons for that. It sure is a sweet tool when it comes to stuff like this, though. Given ansible’s push orientation, it would possibly be helpful for scaling purposes if it could incorporate some of rsync’s efficiency cleverness. I’m curious what the pluses and minuses of incorporating rsync into the file-transfer components of ansible would be from your perspective.

Note that I’m not questioning your choice not to include it. I’m just curious what the tradeoffs were/are.

Because it wouldn’t be using paramiko, and thus sudo and various other options would work in completely different ways.

I expect people with a large number of files to use NFS, use packages, etc.

Could you give a bit more information about what you mean when you say "use NFS"?

Are you thinking that people would mount up a share, then run a local
copy command? Or some other method?

I (perhaps incorrectly) had assumed people would not be moving “large” files through Ansible, but I also, maybe incorrectly, come from the school where war files are ideally delivered via yum or git or NFS or … something else rather than something that resembles rsync. That all being said, optimizing the ansible copy routines to be better with large files is still fine.

In the case of NFS, I’d generally expect the files to be mounted in the right place and the config files to point to them, so it would be unnecessary to copy the data at all; that’s the whole point of mounting it on network storage instead of using local disk.

I think he is saying the copy would happen once, and then be reflected on all servers mounting the share.

I think I’m saying Ansible isn’t involved in moving the data: set up NFS first, though Ansible may configure the mount point.
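
To make that concrete, the Ansible side could be limited to something like the following sketch (the server name and paths are invented, and a real setup would put the mount in fstab rather than run a raw mount command):

  - name: make sure the mount point exists
    action: shell mkdir -p /data/releases

  - name: mount the NFS share that holds the big files (skip if already mounted)
    action: shell mountpoint -q /data/releases || mount -t nfs fileserver:/export/releases /data/releases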

I think this has been an assumption that every configuration
management system makes, and then it runs quickly into the standard
sysadmin item... yes, I could go get the hammer, but if I turn over
this screwdriver it will work just as well. I have run into this with
cfengine2 and early versions of puppet, where the answer from the
developer was usually the same "why would you want to do that?", and
the answers went anywhere from "I get one tool at a time on my
network.." to "well my systems are all over the world.. shared storage
doesn't work well" to "well why did you give me a recurse option if
you didn't want me to recurse" (my favorite one from the cfengine1 and
cfengine2 days :)).

LOL… yeah. Hammer analogy is well put.

Puppet’s early fileserver encased transmitted files in XML, so I think we’ve done at least better than that. For multiple geos, I’d probably propose something like NetApp SnapMirror, etc., if it gets really crazy :)

Early server preview shows we are working with different levels of systems being managed – some people have 10, others hundreds.

The way you do things if you have 10 machines in one location is going to be different than if you have 10 locations, or even 100 machines in one.

I think it’s usually best to talk in terms of use cases; if we can better understand what you are trying to do, the best solution usually ends up showing itself. (And it won’t always be Ansible in terms of pushing content, but Ansible can have a role to play.)

–Michael

Well, I was more trying to explain why I see this question come up
over and over again in various remote configuration tools. This
conversation has identical twins from cfengine's early days. Hard
experience from cfengine and early puppet has made it clear that if I
want to push files, I put them in an rpm and use yum (or deb and apt,
or tgz, or whatever).

My use case: if I need to mirror large sets of data, I will use
ansible like I would func... call a helper script which rsyncs it from
a tree. If there ever is a python-rsync module that uses paramiko,
then maybe I would look at using it inside of ansible. Until then, I
will find my hammer and use the screwdriver as a chisel like god
intended :).
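
For what it's worth, the helper-script approach could look roughly like this (the rsync server, module name, and paths are all invented):

  #!/bin/sh
  # /usr/local/bin/sync-tree.sh - runs on the managed host, pulls the tree from an rsync server
  rsync -az --delete rsync://mirror.example.com/trees/webapp/ /var/www/webapp/

and then kicked off from a play with something like:

  - name: mirror the large tree via the helper script
    action: command /usr/local/bin/sync-tree.sh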

I was thinking about this some for the playbook case.

Would it be possible to run a command (like setup), after the
playbook's tasks are parsed, that goes through and expands out all the
files it will be impacting per host and generates their checksums? I
guess that could get expensive/heavy if we're recursing...

Anyway - I was just thinking - then we'd be doing the md5sums for a
whole range of files (copy or template, for example) at once and
getting back one output to parse, versus making many connections.
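
A rough sketch of that batched check, as a single remote invocation (the host and file list are only illustrative):

  # one connection, one md5sum line per file to compare against the local checksums
  ssh myhost 'md5sum /var/cache/ansible/foo.tar.gz /etc/httpd/conf/httpd.conf 2>/dev/null'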

-sv

If I'm forced to use NFS or my own package repo, then in my opinion
this violates the simple and secure infrastructure idea of "Ansible
just uses SSH".

I haven’t forced anyone to do anything.

If people would like to upgrade the md5 handling, submit a patch.