So this week, among other things I've been working on ansible playbooks
and how to structure a repository for a bunch of systems. For our
purposes let's say a bunch is not a massive number but a middle-range
number - 100-150ish servers of various kinds on multiple distros - but
nothing dramatically different.
I've also been working on a couple of scripts that use the ansible api
to give me a simple programmatic interface to ssh.
I talked to some collaborators and coworkers about ansible and fleshed
out some areas where it needs improvement or changes. I wanted to
articulate those thoughts.
So let's start with the positive items:
1. the api and the async modules work pretty well for letting me
communicate with a bunch of boxes (or with a bunch of boxes about
different things). I think I can use this effectively for a number of
projects. Not the least of them being post-kickstart provisioning and
sanely scripting a large-ish number of processes where doing them via
puppet is not convenient or just not possible and doing them manually
is error-prone.
The api is a great addition on top of paramiko. It is what I think
people want paramiko to be, and I like that a lot. Even if you just
used the simplest module (command) and added async to it - it gives you
A LOT of power in a sensible mechanism for communication.
2. the modules - most of them work well - some need some love but b/c
of how they run fixing and testing them is trivially easy. That's good.
In comparison to some other tools the ansible modules have a way to go
- and that's tricky. On the one hand I think moving that distance is
not hard - otoh I think it would be worthwhile to discuss
a common module basis with other python-based projects so we could
stop duplicating code. I've looked at a bunch of the salt modules and
the func modules and some of how bcfg2 runs things and I do think all
of these tools could gain ground with a common way of editing fstab, a
common set of service/chkconfig callers, a common yum/apt module, etc.
3. the inventory - the host inventory is a nice and straightforward
mechanism. I like that. I also like the idea of easily combining it
with my existing inventory system. I think there is room for growth
there.
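
To make the inventory point concrete, here is a rough sketch of how an
existing inventory system could be exposed to ansible through an
external inventory script. It assumes the script convention as I
understand it: called with --list it prints a JSON hash of group names
to host lists, and called with --host <name> it prints a JSON hash of
variables for that host. The EXISTING_HOSTS data, the hostnames and the
role/distro grouping are all made up for illustration; a real script
would query whatever system you already have.

```python
import json

# Hypothetical stand-in for an existing inventory system:
# hostname -> attributes. None of these hosts are real.
EXISTING_HOSTS = {
    "web01.example.com": {"role": "web", "distro": "rhel6"},
    "web02.example.com": {"role": "web", "distro": "fedora17"},
    "db01.example.com": {"role": "db", "distro": "rhel6"},
}


def build_groups(hosts):
    """Group hostnames by role and by distro, sorted for stable output."""
    groups = {}
    for name, attrs in hosts.items():
        for key in ("role", "distro"):
            groups.setdefault(attrs[key], []).append(name)
    return {group: sorted(members) for group, members in groups.items()}


def main(argv):
    """Return the JSON text an inventory script would print for argv."""
    if len(argv) == 2 and argv[1] == "--list":
        return json.dumps(build_groups(EXISTING_HOSTS), indent=2)
    if len(argv) == 3 and argv[1] == "--host":
        # Unknown hosts get an empty hash of variables.
        return json.dumps(EXISTING_HOSTS.get(argv[2], {}))
    raise ValueError("usage: inventory (--list | --host <hostname>)")
```

A real script would end with something like print(main(sys.argv)) and be
marked executable; I left that out so the functions can be imported and
poked at directly.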
Things I'm concerned about:
1. playbooks/dsl - the playbooks and the yaml-ishness are tricky.
There are still a fair number of ways to make the playbook/yaml parser
traceback and sometimes figuring out what the syntax issues are can be
tricky. Also I worked through a number of "does this variable exist
here or not" issues this week (and some of them have been fixed,
rapidly, so I acknowledge it is improving) but I'm concerned about how
far it needs to go and if there are growing pains that might be
onerous. The issue in general is that I think there is some resistance
to calling this a language - but, ultimately, it is, and that
resistance makes the problems harder to admit. This could be one reason
to keep the playbooks around but write a module that uses some other
project's DSL on the system, or it could be a reason to bite the bullet
and have it be a language. I think everyone is concerned about the long
term implications of writing and maintaining all the modules AND a
language.
Coming from func I know how problematic maintaining the modules can
become.
2. performance scale. One of the reasons I like ansible is the
push-mechanism being ssh. Now - for "I just kickstarted 5 systems - go
provision them with this playbook" that is more or less fine. Running
that same playbook on all my systems - even forked - is gonna take a
while, though. I did a preliminary test and a bunch of the modules and
multiple-item execution are going to need a fair bit of work to keep
the ssh connection time from eating us alive. To be clear - I don't
think ansible was intended for the 1000+ machine scale - but if I'm
going to be learning a dsl I'd rather learn the same dsl that I can use
for c&c/post-provisioning mechanisms AND for my
maintenance-mode-run-every-30-minutes mgmt tooling. Right now if I have
150 systems and I fork off 25 forks on my master system to run the
playbooks, then it will take 6 cycles of 25 to cover them all. The
partial playbook I've implemented took 47s to provision one system -
and I suspect I wrote about 1/5th of what we normally do in puppet to
provision the system - so I'm guessing about 5 minutes per server for
the full thing. So if we figure 5 minutes per server and 6 cycles to
cover all 150 - we're talking a half hour to get all of it done. That's
a concern.
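
The back-of-the-envelope numbers above can be written out as a small
sketch. The inputs (150 hosts, 25 forks, 47s for roughly 1/5th of the
work, 5 minutes per server) come straight from the paragraph; the
function name and the assumption that every cycle takes the full
per-server time are only for illustration. Note that scaling 47s by 5
gives a bit under 4 minutes, so the 5 minute figure is a rounded-up
guess.

```python
import math


def full_run_estimate(hosts, forks, minutes_per_host):
    """Return (cycles, total_minutes) for pushing to all hosts.

    Assumes each cycle of `forks` hosts runs in parallel and takes
    `minutes_per_host` from start to finish.
    """
    cycles = math.ceil(hosts / forks)
    return cycles, cycles * minutes_per_host


# 47s covered roughly 1/5th of the provisioning, so scale by 5.
partial_seconds = 47
scaled_seconds = partial_seconds * 5      # 235 seconds
scaled_minutes = scaled_seconds / 60      # a little under 4 minutes

# Using the rounded guess of 5 minutes per server.
cycles, total_minutes = full_run_estimate(hosts=150, forks=25,
                                          minutes_per_host=5)
print(cycles, total_minutes)
```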
Things I'm intrigued by (as answers to what I'm concerned with :)):
Ansible Pull:
So the ansible pull mode that Stephen Fromm has worked on seems
great - but it needs the dsl and the modules to be improved to really
be more useful. I am concerned about making my whole git repo available
to each node, though. I'm more inclined to want to say "take this
playbook, collect all the files, modules and templates that it will
need to run for each node and put that into a tarball or a git repo or
whatever PER NODE and shove just that at each node" - my reasoning is
simple:
- my git repo(s) will eventually contain certs/keys/passwords - all
sorts of random stuff for all my systems - and I definitely do not
want all of that on every system.
- I also do not want a node to be able to get to anything other than
what it has explicitly been given.
I think that set of changes should actually be quite simple to achieve.
If that's the case then many of my playbook performance concerns just
go away.
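
A minimal sketch of the per-node bundle idea, assuming something else
has already worked out which files a given node's playbook needs. That
step is the hard part and is not shown; the function name and its
arguments are hypothetical and nothing here is part of ansible. It only
uses the python tarfile module to pack the listed files, and nothing
else from the repo, into a tarball for one node.

```python
import os
import tarfile


def build_node_bundle(repo_root, relative_paths, bundle_path):
    """Write a gzipped tarball holding only the listed repo files.

    relative_paths are paths relative to repo_root (playbook, files,
    templates, modules) that one node needs; everything else in the
    repo stays out of the bundle.
    """
    repo_root = os.path.realpath(repo_root)
    with tarfile.open(bundle_path, "w:gz") as tar:
        for rel in sorted(set(relative_paths)):
            full = os.path.realpath(os.path.join(repo_root, rel))
            # Refuse anything that resolves outside the repo,
            # e.g. "../secrets" or a symlink pointing elsewhere.
            if os.path.commonpath([repo_root, full]) != repo_root:
                raise ValueError("path escapes repo: %s" % rel)
            tar.add(full, arcname=rel)
    return bundle_path
```

Something like build_node_bundle("/srv/repo", ["site.yml",
"files/motd"], "/tmp/node1.tar.gz") would then produce the one file to
shove at that node (paths made up for the example).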
Ansible API:
As I said before I like the power of the api. If the python api is
stabilizing - and I can see a handful of other items people may want,
but it looks stablish to me - then I think I would want to start
writing other non-playbook-based programs with it. I have the start of
a couple already:
http://fedorapeople.org/gitweb?p=skvidal/public_git/scripts.git;a=tree;f=ansible;hb=HEAD
and I have 2 or 3 more that I think will be ports of things I wrote
for func.
Is there any reason to think it's not stabilizing at this point?
So those are a collection of my (and other folks') thoughts after
spending most of a week trying to figure out how to make this all
function together and what portions of the infrastructure I help
maintain can benefit from this.
I'm fairly hopeful about the possibilities here. Hope that comes across
in my mail.
thanks,
-sv