I've come up with what I think is a safe way to do a rolling restart of an Elasticsearch cluster using Ansible handlers.
Why is this needed?
Even if you use a serial setting to limit the number of nodes processed at one time, Ansible will restart Elasticsearch nodes and continue processing as soon as the elasticsearch service restart reports itself complete. This pushes the cluster into a red state because multiple data nodes end up restarting at once, and it can cause performance problems.
Solution:
Perform a rolling restart of each elasticsearch node and wait for the cluster to stabilize before continuing processing nodes. This set of chained handlers restarts each node while keeping the cluster from thrashing on reallocating shards during the process.
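A minimal sketch of what such a handler chain can look like (not the exact playbook; the handler names, the es_http_port variable, and the allocation setting are illustrative, and the exact setting name depends on your Elasticsearch version):

```yaml
handlers:
  # Chain: notifying "disable shard allocation" walks through the whole
  # sequence, because each handler notifies the next one.
  - name: disable shard allocation
    uri:
      url: "http://localhost:{{ es_http_port }}/_cluster/settings"
      method: PUT
      body: '{"transient": {"cluster.routing.allocation.enable": "none"}}'
      # body_format needs a reasonably recent uri module; older versions can
      # set a Content-Type header instead.
      body_format: json
    notify: restart elasticsearch

  - name: restart elasticsearch
    service:
      name: elasticsearch
      state: restarted
    notify: wait for node to come back

  - name: wait for node to come back
    uri:
      url: "http://localhost:{{ es_http_port }}/_cluster/health"
      method: GET
    register: node_health
    until: node_health.status == 200
    retries: 30
    delay: 10
    notify: re-enable shard allocation

  - name: re-enable shard allocation
    uri:
      url: "http://localhost:{{ es_http_port }}/_cluster/settings"
      method: PUT
      body: '{"transient": {"cluster.routing.allocation.enable": "all"}}'
      body_format: json
    notify: wait for cluster to stabilize

  - name: wait for cluster to stabilize
    uri:
      url: "http://localhost:{{ es_http_port }}/_cluster/health"
      method: GET
    register: cluster_health
    until: cluster_health.json.status in ['yellow', 'green']
    retries: 30
    delay: 30
```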
Brian Coca said the following on 9/17/2014 10:22 PM:
check the wait_for module, it was created for these situations.
How do you make wait_for wait on a URL? I know you can have it wait on a socket to come up and look for data on the port connection, but I need more than that to ensure the cluster is in the correct state before proceeding.
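For reference, this is roughly all wait_for can do here (the port variable is illustrative):

```yaml
# wait_for can block until the TCP port answers (optionally matching the
# response with search_regex), but it cannot inspect an HTTP response body,
# so it cannot tell whether the cluster has actually recovered.
- name: Wait for Elasticsearch to accept connections
  wait_for: host=localhost port={{ es_http_port }} delay=5 timeout=300
```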
I'm operating a Solr cluster of 50 servers and I have been confronted with the same question as you. The warmup of the Solr index can take up to 30 seconds, and wait_for on the port was not enough.
At first, I wrote a new module to launch a query every X seconds and expect a result (wait_for_result command='wget -O - "http://{{ ansible_fqdn }}:{{ active_country_port[country].port }}/searchsolrnode{{ country }}/{{ country }}/select?q=*:*&rows=0"' result="numFound" delay=10 timeout=600), until Ansible came up with its do…until loop, which is perfect in this case.
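Roughly, the do…until version of that check looks like this. It is only a sketch: the URL and variables are the ones from the command above, and the retries/delay are assumptions chosen to match the 10-second interval and 600-second timeout:

```yaml
# Retry the Solr query until the response body contains "numFound".
# 60 retries x 10s delay roughly matches the old 600s timeout.
- name: Wait for the Solr index warmup to finish
  command: wget -O - "http://{{ ansible_fqdn }}:{{ active_country_port[country].port }}/searchsolrnode{{ country }}/{{ country }}/select?q=*:*&rows=0"
  register: solr_result
  until: "'numFound' in solr_result.stdout"
  retries: 60
  delay: 10
```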
I had that configuration initially after I realized wait_for would not do what I needed. Then I realized setting a timeout on the curl call (-m 2) gets me effectively the same thing. If the node isn't back up yet, the curl call times out and the grep fails. No need to run two tasks.
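In task form, that single check might look something like the following; the es_http_port variable is the same one used in the example further down, and the grep pattern is just an assumption about what to look for in the health output:

```yaml
# curl gives up after 2 seconds if the node is still down, grep fails if the
# status is not yet yellow/green, and until retries the whole pipeline.
- name: Wait for the local node's cluster health check to pass
  shell: curl -s -m 2 http://localhost:{{ es_http_port }}/_cluster/health | grep -qE '"status"\s*:\s*"(yellow|green)"'
  register: health_check
  until: health_check.rc == 0
  retries: 60
  delay: 5
```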
I've written two rolling Elasticsearch playbooks: one for restarting the cluster, the other for updating it. I use the feature of Ansible that intelligently takes JSON output in a registered variable and makes it addressable (not sure if I'm using the right terminology). Because Elasticsearch returns JSON when you query its API, you can combine the registered output of the uri module with a do-until loop to wait for a certain condition before continuing. So instead of searching result.stdout or result.content, you can use result.json.[field] to be very precise:
```yaml
- name: Wait for cluster health to return to yellow
  uri: url=http://localhost:{{ es_http_port }}/_cluster/health method=GET
  register: response
  until: "response.json.status == 'yellow'"
  retries: 5
  delay: 30
```
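For the update playbook, the same pattern can be made stricter. This is just a sketch, and es_expected_nodes is an assumed variable holding the cluster's node count:

```yaml
# Wait until every node has rejoined and the cluster has fully recovered.
- name: Wait for every node to rejoin and the cluster to go green
  uri: url=http://localhost:{{ es_http_port }}/_cluster/health method=GET
  register: response
  until: "response.json.status == 'green' and response.json.number_of_nodes == es_expected_nodes|int"
  retries: 30
  delay: 30
```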