This is a little thread to discuss the issue that I opened on GitHub.
The proposal is a cache_valid_time option for the git module. It would let people specify how long the local cached copy of a repo is valid for (the default would be 0, meaning things are always pulled from the remote).
This would mean that if version: was set to a branch or tag name (or maybe even a hash) we would only check out the most recent version in the local repo, unless the cache had expired.
This would give people the option to drastically speed up Ansible playbooks that make heavy use of the git module.
mpdehaan was having a little trouble understanding my first explanation and suggested I post / discuss here, so I have!
To explain a little further:
Example 1 - If the cache_valid_time of a git module task is set to 0, the repo would be pulled / updated every time the task is run.
Example 2 - If the cache_valid_time of a git module task is set to 1d (1 day), the repo would be pulled / updated at most once a day.
In example 1 you are guaranteed to always have, for example, the newest commit from a branch when your version is set to a branch.
In example 2 you are not guaranteed to have the latest commit from a branch, but you are guaranteed that your branch will be at most 1 day behind.
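To make the two examples concrete, here is a minimal sketch in Python of the check I have in mind. It is not code from the git module: the function names, the accepted unit suffixes (s, m, h, d) and the idea of passing in the time of the last pull are all assumptions for illustration.

```python
import re

# Assumed unit suffixes; the proposal itself only mentions "0" and "1d".
_UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_cache_valid_time(value):
    """Convert a value such as 0, "30m" or "1d" into seconds."""
    text = str(value).strip().lower()
    match = re.fullmatch(r"(\d+)([smhd]?)", text)
    if match is None:
        raise ValueError("unrecognised duration: %r" % (value,))
    number, unit = match.groups()
    # A bare number is treated as seconds.
    return int(number) * _UNITS[unit or "s"]


def should_pull(last_pull_epoch, now_epoch, cache_valid_time):
    """Return True if the local copy is too old and a pull is needed."""
    if last_pull_epoch is None:
        # Never pulled before, so there is no cache to trust.
        return True
    max_age = parse_cache_valid_time(cache_valid_time)
    # With max_age == 0 this is always True, i.e. pull on every run.
    return now_epoch - last_pull_epoch >= max_age
```

With cache_valid_time set to 0 the function always asks for a pull (example 1). With "1d" it only asks for one once a full day has passed since the last pull (example 2).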
I am aware git doesn’t really have a concept of caching in this way, but I see no issue with building this nice feature into Ansible.
As written in my issue on GitHub, this gives users the option to drastically speed up playbooks that have a lot of git module tasks (for example, one playbook that I work on has over 100 git module tasks).
Currently that playbook tags all of its git module tasks with the tag ‘slow’. The cron job that runs the playbook uses --skip-tags="slow" at 10, 20, 30, 40 and 50 minutes past each hour, and runs the whole playbook (including the slow tasks) on the hour. My idea is a much nicer way of effectively doing the same!
Does anyone have any comments, further thoughts or different ideas?
So I don’t think the git module should do anything that isn’t git native.
In this case, the time to check to see if something is up to date in git should be pretty minimal.
An unrelated request was to just have it check and see if the SHA was the latest, and not pull – though I think that will still take some time.
I suspect this might only be an extra few seconds per repo. Might this be a good time for a quick coffee break?
You could also emulate this yourself by saving a timestamp file on the system and using register, and then seeing if it was from today or this hour, or similar?
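The timestamp-file idea boils down to something like the following. This is a rough sketch of the logic in Python rather than actual playbook tasks, and the helper names and the stamp file are hypothetical.

```python
import os
import time


def is_stale(stamp_path, max_age_seconds, now=None):
    """Return True if the stamp file is missing or older than max_age_seconds."""
    now = time.time() if now is None else now
    try:
        last_update = os.path.getmtime(stamp_path)
    except OSError:
        # No stamp file yet, so the repo has never been recorded as updated.
        return True
    return now - last_update >= max_age_seconds


def mark_updated(stamp_path):
    """Create the stamp file if needed and set its mtime to the current time."""
    with open(stamp_path, "a"):
        pass
    os.utime(stamp_path, None)
```

The pull would only be run when is_stale() returns True, followed by mark_updated() once it succeeds.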
To me this is git native, especially when tracking branches.
We could rename cache_valid_time to only_update_git_every_x if it feels better.
When tracking a branch in git I don’t do a git pull / update every time I want to work on a branch; I will, for example, do it daily, or even weekly.
This is similar to apt, when we say we want the latest, but we only want the latest according to our local copy of what is latest.
This would be the same in git: we only want the latest version of a branch according to our local copy.
Doing a git pull on hundreds of repos is a time-consuming process, which is why, when I am working locally on hundreds of repos, I do not git pull before working on them; I do a mass pull every few days.
In my opinion this is a great thing to include in the module, limit pulling!
I could probably do it locally within my playbook but I feel that this is a feature that anybody else doing large numbers of tasks using the git module would appreciate.
To me this is git native, especially when tracking branches.
What I mean is, can you show me the pointers in the git module where this is a
feature? It's not really.
Right now this would still be workable via logic in your playbook, using
shell commands / register, and so on.
We could rename cache_valid_time to only_update_git_every_x if it feels
better.
When tracking a branch in git I don't do a git pull / update every time
I want to work on a branch; I will, for example, do it daily, or even weekly.
Ansible only needs to issue changes when there are changes upstream, but it
still needs to check.