Hadoop Example

I've put together some Ansible playbooks and templates that I've
successfully used to install and manage Hadoop (CDH4) on a cluster.
I'd really appreciate feedback, especially from anyone who wants to
try it.

https://github.com/jkleint/ansible-contrib

Thanks!

Nice. Very glad to see a complex example. Pretty easy to follow.

I think we have enough to start that contrib repo of examples (via git submodules, so everybody can host their own stuff) now.

Style-wise, I think a blank line between tasks, and before and after the line that says “tasks:”, is a good idea. I’m personally not a big fan of YAML anchors: playbooks happen to CURRENTLY be written in YAML, but eventually parsing and implementation are going to be separated out a bit.

I don’t understand how the anchor, the key, and the list are combined here, in particular:

dfs_datanode_data_dir: &datadir
    - /data1/hdfs
    - /data2/hdfs

That first one looks like a hash element, but then it's followed up by two list elements.

Organization wise, I think it’s clearer to not mix playbooks, templates, and included handlers all in the same directory.

This is how I would have it, if you assume a directory named “playbooks” exists one level up. It would be nice if everyone adopted a similar standard (or proposed a better one), such that folks hopping between examples can easily grok each other’s content.

https://github.com/mpdehaan/ansible-examples

I would also expect install_hadoop to be a pretty minimal playbook, with install-hadoop and install-java task files that it simply includes. This promotes applying a playbook that says “be this role”, rather than having to remember to run 3 playbooks to get things in line. It also means install-java is more or less reusable by other things that need Java (maybe).

Since tasks are intended to be idempotent, it’s safe to apply “java” to things that already have Java, too.

See the monitoring.yml in my example about how the playbook is mostly just includes.
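
As a rough sketch of the "mostly just includes" layout: the group name and the file names under tasks/ and handlers/ are hypothetical (not taken from either repository), and the include syntax is the one playbooks use at the time of this thread.

```yaml
---
# install_hadoop.yml: a thin playbook that just says "be this role".
# The "hadoop" group and all file names below are assumptions for illustration.
- hosts: hadoop
  user: root

  tasks:
    # Reusable by anything else that needs Java.
    - include: tasks/install-java.yml
    # Hadoop-specific packages and configuration.
    - include: tasks/install-hadoop.yml

  handlers:
    - include: handlers/hadoop.yml
```

Because the Java tasks live in their own file, another playbook that needs Java could include tasks/install-java.yml without touching the Hadoop pieces.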

Nice. Very glad to see a complex example. Pretty easy to follow.

Thanks!

I think we have enough to start that contrib repo of examples (via git submodules, so everybody can host their own stuff) now.

I'm all for it; let me know how to get it going.

Style-wise, I think a blank line between tasks, and before and after the line that says "tasks:", is a good idea. I'm personally not a big fan of YAML anchors: playbooks happen to CURRENTLY be written in YAML, but eventually parsing and implementation are going to be separated out a bit.

Yeah, the anchors are a hack to be able to use a list in both
with_items and templates.

I don't understand how the anchor, the key, and the list are combined here, in particular:

dfs_datanode_data_dir: &datadir
    - /data1/hdfs
    - /data2/hdfs

That first one looks like a hash element, but then it's followed up by two list elements.

Yeah, YAML's a bit weird. dfs_datanode_data_dir is a list of two
items, and &datadir is a pointer to it. Why YAML doesn't just use
*dfs_datanode_data_dir to reference the variable directly, I dunno.
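
To make the anchor mechanics concrete, here is a minimal sketch. The second key, directories_to_create, is a made-up name used only to show the alias.

```yaml
# The key maps to a two-item list; &datadir attaches a label (anchor)
# to that list node so it can be referred to again later in the same file.
dfs_datanode_data_dir: &datadir
    - /data1/hdfs
    - /data2/hdfs

# Hypothetical second key: *datadir is an alias that resolves to the
# very same list, so this key also holds /data1/hdfs and /data2/hdfs.
directories_to_create: *datadir
```

The anchor name is independent of the key name, which is why the alias is *datadir rather than *dfs_datanode_data_dir.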

Organization wise, I think it's clearer to not mix playbooks, templates, and included handlers all in the same directory.

This is how I would have it, if you assume a directory named "playbooks" exists one level up. It would be nice if everyone adopted a similar standard (or proposed a better one), such that folks hopping between examples can easily grok each other's content.

https://github.com/mpdehaan/ansible-examples

That's a good idea, I'll re-organize a bit.

I would also expect install_hadoop to be a pretty minimal playbook, with install-hadoop and install-java task files that it simply includes. This promotes applying a playbook that says "be this role", rather than having to remember to run 3 playbooks to get things in line. It also means install-java is more or less reusable by other things that need Java (maybe).

Totally agreed. I really wanted to make small chunks of reusable
functionality, so people could re-use playbooks unmodified and just
tweak a vars file. The problem is I have lists of directories that I
need to both create and stuff in a config file template. In order to
loop over the directories using with_items, they have to be in a real
YAML list declared in the same YAML file with an anchor that I can
reference. Support for list variables in with_items would be really
great here.
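
For anyone following along, the workaround looks roughly like the sketch below. The host group and task name are hypothetical, and the $item and action: forms are the ones playbooks use at the time of this thread.

```yaml
---
# Sketch of the anchor workaround; "datanodes" is an assumed group name.
- hosts: datanodes

  vars:
    # Declared as a real YAML list so templates can see the variable,
    # and anchored so with_items can reuse the same list below.
    dfs_datanode_data_dir: &datadir
      - /data1/hdfs
      - /data2/hdfs

  tasks:
    - name: create the HDFS data directories
      action: file dest=$item state=directory
      # The alias is resolved by the YAML parser, so it only works
      # when the anchor is defined earlier in this same file.
      with_items: *datadir
```

That same-file restriction is what keeps the list from living in a separate vars file that several playbooks could share.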

Since tasks are intended to be idempotent, it's safe to apply "java" to things that already have Java, too.

See the monitoring.yml in my example about how the playbook is mostly just includes.

My install-java is a bit "opinionated" about which Java to install, so
I didn't want to foist it on everybody. ;) Plus, even if an action is
a no-op because of idempotence, it still takes time to run, and my
playbook is already too slow as it is: install-hadoop does double
duty as "update-hadoop-config" (because of the aforementioned
limitations), and waiting for yum to run a dozen times with no effect
gets old fast.

Thanks for the feedback.

-John

I’ve updated my Hadoop example to use some of the new features of 0.6. You can find it here:

https://github.com/jkleint/ansible-contrib/tree/master/playbooks/jkleint-hadoop

I tried to follow the best practices. The improvements since the last go-round really have helped a lot.

Host groups and group variables are awesome, so you can just target playbooks at a “hadoop” group and people can define that however they want in their inventory without having to edit half a dozen playbooks. Group variables work great as host aliases, so you can define “namenode=mymaster” in your inventory and not have to edit playbooks.

Being able to use with_items: $list_from_vars_file is awesome: I can define a list once and use it in multiple playbooks and templates. Plus, not launching a separate task for each item is much faster.
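
A minimal sketch of what that looks like, assuming a vars file named vars/hadoop.yml (a hypothetical name) and using the 0.6-era $variable syntax. The two YAML documents below stand for two separate files.

```yaml
---
# File 1 (hypothetical name: vars/hadoop.yml)
# The list is defined once, with no anchors needed.
dfs_datanode_data_dir:
  - /data1/hdfs
  - /data2/hdfs

---
# File 2 (hypothetical name: install.yml)
- hosts: hadoop

  vars_files:
    - vars/hadoop.yml

  tasks:
    - name: create the HDFS data directories
      action: file dest=$item state=directory
      # The list variable comes straight from the vars file.
      with_items: $dfs_datanode_data_dir
```

Any other playbook or template that loads the same vars file sees the same list, so the directories only have to be edited in one place.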

Tags, and wholesale tagging of included tasks, are awesome: I can say ansible-playbook install.yml --tags=pkgs to just update the RPMs, or ansible-playbook install.yml --tags=config to just update the config files, which saves a lot of time.
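
Roughly, the tagging looks like the sketch below. The task file names are made up for illustration and are not necessarily the ones in the repository.

```yaml
---
# install.yml (sketch): each include carries a tag, so every task
# inside the included file is tagged wholesale.
- hosts: hadoop

  tasks:
    # Selected by: ansible-playbook install.yml --tags=pkgs
    - include: tasks/packages.yml tags=pkgs
    # Selected by: ansible-playbook install.yml --tags=config
    - include: tasks/config.yml tags=config
```

Running the playbook without --tags still applies everything.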

I’d appreciate feedback from anyone who’s interested. Thanks for such a great tool.

I’ve updated my Hadoop example to use some of the new features of 0.6. You can find it here:

https://github.com/jkleint/ansible-contrib/tree/master/playbooks/jkleint-hadoop

I tried to follow the best practices. The improvements since the last go-round really have helped a lot.

Host groups and group variables are awesome, so you can just target playbooks at a “hadoop” group and people can define that however they want in their inventory without having to edit half a dozen playbooks. Group variables work great as host aliases, so you can define “namenode=mymaster” in your inventory and not have to edit playbooks.

FYI – I think you no longer get group variables as $groups[varname]; you get the list of hosts in the group instead.

If you want group variables exposed this way, it will need to be patched back in (might suggest group_vars to go with host_vars).

I had a need to get the list of systems in a group and I thought it returned a hash with each system’s variables in it.

Being able to use with_items: $list_from_vars_file is awesome: I can define a list once and use it in multiple playbooks and templates. Plus, not launching a separate task for each item is much faster.

Tags, and wholesale tagging of included tasks, are awesome: I can say ansible-playbook install.yml --tags=pkgs to just update the RPMs, or ansible-playbook install.yml --tags=config to just update the config files, which saves a lot of time.

I’d appreciate feedback from anyone who’s interested. Thanks for such a great tool.

(Speaking for everyone who has worked on stuff, which includes you, of course) … You are welcome!

Offhand, if you want to, you could put all those tasks in one playbook and tag each play. By default it would do everything, and if given a tag it would do just those things.

That would perhaps make it a bit easier if you knew you were going the full nine yards, since you wouldn't have to remember which steps to run in what order.
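
Something along these lines, as a sketch only. The playbook name, tag names, and task file names are placeholders, not the actual layout of the repository.

```yaml
---
# hadoop.yml (hypothetical): one playbook, one tagged play per step.
# With no --tags option, all plays run top to bottom in order.

- hosts: hadoop
  tags:
    - pkgs
  tasks:
    - include: tasks/packages.yml

- hosts: hadoop
  tags:
    - config
  tasks:
    - include: tasks/config.yml
```

With this layout, ansible-playbook hadoop.yml would do the full install, while adding --tags=config would limit the run to the configuration play.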

Looking good, I’ve open sourced my playbook as well:

https://github.com/analytically/hadoop-ansible