get_url support for comparing remote file size against local file size

Hi Guys
I submitted a pull request today, but wanted to provide some background on the use case. I ran into situations where it was desirable to update an artifact in a content repository (say, Artifactory) and then re-run ansible as-is and have the updated artifact pulled down. Currently I don't see a way to do this with get_url, since it will skip the download if it already sees the file on disk. So after talking it over with the team, we came up with an idea to detect these changes. The method is low-cost: it doesn't require any MD5 or SHA hashing of the local file on each run, and on the remote side it just makes an HTTP HEAD request and reads the file size from the headers. There is a small chance this method could return a false positive (treating a changed file as unchanged) when two different files happen to be the same size, but that's the trade-off for speed. Looking forward to your feedback!

https://github.com/ansible/ansible/pull/5538
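
To make that concrete, here's a rough Python 3 sketch of the check (illustrative only, not the actual code in the PR; the helper name and the missing-header fallback are my own placeholders):

    import os
    import urllib.request

    def remote_size_differs(url, dest):
        # HEAD fetches only the headers, so nothing is downloaded.
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req) as resp:
            content_length = resp.headers.get("Content-Length")
        if content_length is None or not os.path.exists(dest):
            # No advertised size, or no local copy: download to be safe.
            return True
        # One stat() call locally; no hashing of a 500M+ file on every run.
        return os.stat(dest).st_size != int(content_length)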

Thanks,

William

Was the time taken to do something SHA-related actually the problem, or was it more the problem of needing to calculate the SHA?

The latter; if we have to SHA/MD5 a 500M+ file every time we run ansible, the thought was that would be too slow.

Seems you would want to run it once and then put that value in the playbook.

If the SHA/MD5 was of a build product, maybe it could be generated by Jenkins as an artifact?

Just playing Devil’s advocate somewhat – I suspect people may raise the “but… but… file sizes aren’t secure” complaint without seeing the SHA option. We could also of course just make sure this was very very very very obvious in the docs, but I’m also a little wary of including a feature for a specific use case, so if there’s a more mainstream way, we should perhaps see if it’s workable first?

Thoughts welcome.

Those are good points. I guess the challenge is how much time you are willing to spend, when running playbooks, computing SHA/MD5s of files on disk. If we are OK spending that time, then we could just as easily have the conditionals do that. I was looking at the file module to see what approach is used elsewhere, and from what I can tell, ansible isn't computing hashes; it is just asserting that the file isn't already there. So doing this in get_url seems to be a new thing for file-related modules.
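
For a sense of what that cost looks like, the cheapest a local hash can get is a chunked read like this (a generic sketch, not ansible's actual code; the helper name is made up):

    import hashlib

    def file_md5(path, chunk_size=1 << 20):
        # Read in 1 MiB chunks so a 500M+ file never has to fit in memory;
        # the whole file is still read from disk on every single run.
        digest = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()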

What about adding both options, file length and MD5/SHA, for checking files? Both would be optional, and each offers a different trade-off of cost at playbook run time versus level of accuracy.

This is just me thinking out loud so please point out any oversight in my logic.

File sizes can be tricky: most systems report how much space the file occupies, which is normally larger than the exact byte count of the file itself (be it filesystem blocks, packets, or base64/MIME-encapsulated/armored encodings, etc.).

I hit the wrong reply option previously. Sorry for the duplicate in private, Brian.

I see this pretty rarely. It’s normally the length of the file in bytes.

I think we're overthinking this. I had originally suggested to William that comparing the timestamps and length would be a super-inexpensive way to determine conclusively that we need to download the file, without having to download the hash file and do a hash locally. Obviously, if all of those indicators agree, we'd still need to do the hash anyway.
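
Something like this is what I have in mind (a hypothetical helper, hand-waving over missing headers and error handling):

    import os
    import urllib.request
    from email.utils import parsedate_to_datetime

    def must_download(url, dest):
        # Cheap indicators first: remote size and modification time from a
        # single HEAD request, compared against one local stat() call.
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req) as resp:
            remote_size = int(resp.headers["Content-Length"])
            remote_mtime = parsedate_to_datetime(
                resp.headers["Last-Modified"]).timestamp()
        if not os.path.exists(dest):
            return True
        st = os.stat(dest)
        # Any disagreement conclusively means we need to download; if all
        # the indicators agree we would still fall back to hashing.
        return st.st_size != remote_size or st.st_mtime < remote_mtime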

If the math is too hard, just hash it each time.

Brian,

Are you sure that’s true of os.stat() and not just the shell commands?

Seems like it would not be.

I think I'm OK with adding a bytes= parameter if we can get around this question.

I've been burned by this before; stat is supposed to return the size of the file in bytes.

I haven't really checked in a long time, as I've grown accustomed to not relying on size for file comparisons, but my issues stemmed from some tools/implementations using number of blocks * block_size to measure the file size.
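
It's easy to see both numbers side by side, at least on a POSIX system where st_blocks is in 512-byte units:

    import os

    st = os.stat("some_file")
    print(st.st_size)          # exact length of the file in bytes
    print(st.st_blocks * 512)  # space allocated on disk; usually >= st_size,
                               # but smaller for sparse files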

After more thought on this and further discussion, it seems that MD5 is the right way to do this. The use case I am designing for, however, is built around the repository manager Artifactory, because the check can rely on the MD5 sums Artifactory generates for each artifact (other similar tools likely have the same feature). With that in mind, it might make more sense to create an artifactory module that incorporates the MD5 sum but is otherwise essentially the same functionality as get_url. This also addresses the concern about the increased cost of running get_url with an MD5 sum, because the documentation can make it clear that the artifactory module does this validation, as opposed to straight get_url.

It also sets up a framework for future Artifactory-specific features by giving them a module of their own.
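
Roughly, I picture the module's check looking something like this (a sketch only: it assumes Artifactory's X-Checksum-Md5 response header, and the function name is a placeholder):

    import hashlib
    import os
    import urllib.request

    def artifact_changed(url, dest):
        # Artifactory exposes an MD5 for each artifact in a response header,
        # so the remote side costs one HEAD request instead of a download.
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req) as resp:
            remote_md5 = resp.headers.get("X-Checksum-Md5")
        if remote_md5 is None or not os.path.exists(dest):
            return True
        digest = hashlib.md5()
        with open(dest, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
        return digest.hexdigest() != remote_md5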

What do you all think about this?