Rolling Restart an Elasticsearch Cluster with Ansible

Lance_A_Brown · September 17, 2014, 7:40pm

I've come up with what I think is a safe way to rolling restart an Elasticsearch cluster using Ansible handlers.

Why is this needed?

Even if you use a serial setting to limit the number of nodes processed at one time, Ansible restarts the Elasticsearch service on a node and continues processing as soon as the service restart reports itself complete. This can push the cluster into a red state because multiple data nodes end up being restarted at once, and it can cause performance problems.

Solution:

Perform a rolling restart of each Elasticsearch node and wait for the cluster to stabilize before moving on to the next node. This set of chained handlers restarts each node while keeping the cluster from thrashing on shard reallocation during the process.
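
As a rough illustration of that sequence (this is not the content of the gist), the steps for one node might look like the sketch below. It is written as ordinary tasks under serial: 1 rather than as chained handlers. The group name es_nodes, the service name elasticsearch, and the API address localhost:9200 are assumptions, and the dictionary-style uri body needs an Ansible version that supports body_format.

```yaml
# Sketch only. Assumed names: group "es_nodes", service "elasticsearch",
# HTTP API on localhost:9200.
- hosts: es_nodes
  serial: 1
  tasks:
    - name: Disable shard allocation so the cluster does not start rebalancing
      uri:
        url: http://localhost:9200/_cluster/settings
        method: PUT
        body_format: json
        body:
          transient:
            cluster.routing.allocation.enable: "none"

    - name: Restart Elasticsearch on this node
      service:
        name: elasticsearch
        state: restarted

    # Retried because the node needs some time before it accepts requests.
    - name: Re-enable shard allocation
      uri:
        url: http://localhost:9200/_cluster/settings
        method: PUT
        body_format: json
        body:
          transient:
            cluster.routing.allocation.enable: "all"
      register: settings
      until: (settings.status | default(0)) == 200
      retries: 60
      delay: 5

    - name: Wait for green before the play moves to the next node
      uri:
        url: http://localhost:9200/_cluster/health
      register: health
      until: health.json is defined and health.json.status == 'green'
      retries: 60
      delay: 10
```

Because serial: 1 only limits how many hosts are in flight, it is the final wait task that actually holds the play on the current node until the cluster has recovered.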

A gist of the handlers/main.yml file for my Elasticsearch role is at https://gist.github.com/labrown/5341ebec47bfba6dd7d4

I welcome any comments and/or suggestions.

--[Lance]

Brian_Coca1 · September 18, 2014, 2:22am

check the wait_for module, it was created for these situations.

Lance_A_Brown · September 18, 2014, 2:42am

Brian Coca said the following on 9/17/2014 10:22 PM:

check the wait_for module, it was created for these situations.

How do you make wait_for wait on an URL? I know you can have it wait on
a socket to come up and look for data on the port connection, but I need
more than that to ensure the cluster is in the correct state before
proceeding.

--[Lance]

Brian_Coca1 · September 18, 2014, 3:06am

well, in that case I would use wait_for to let the server come up and
then get_url inside a do-until loop to get the 'green light'.
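
A minimal sketch of that two-step idea follows. It uses uri instead of get_url for the polling step so the response body can be inspected directly; port 9200 and the _cluster/health endpoint are assumptions about the setup.

```yaml
# Step 1: wait_for only proves that something is listening on the port.
- name: Wait for the Elasticsearch HTTP port to open
  wait_for:
    host: localhost
    port: 9200
    timeout: 300

# Step 2: poll in a do-until loop until the cluster itself reports green.
- name: Poll cluster health until it reports green
  uri:
    url: http://localhost:9200/_cluster/health
  register: health
  until: health.json is defined and health.json.status == 'green'
  retries: 60
  delay: 5
```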

Ludovic_Petetin · September 18, 2014, 12:47pm

Hi,

I operate a Solr cluster of 50 servers and have faced the same question as you. Warming up the Solr index can take up to 30 seconds, so wait_for on the port was not enough.
At first, I wrote a new module to launch a query every X seconds and expect a result (wait_for_result command='wget -O - "http://{{ ansible_fqdn }}:{{ active_country_port[country].port }}/searchsolrnode{{ country }}/{{ country}}/select?q=*:*&rows=0"' result="numFound" delay=10 timeout=600), until Ansible came up with its do...until loop, which is perfect in this case:

- shell: wget -O - "http://{{ ansible_fqdn }}:{{ active_country_port[country].port }}/searchsolrnode{{ country }}/{{ country}}/select?q=*:*&rows=0"
  register: result
  until: result.stdout.find("numFound") != -1
  retries: 20
  delay: 5

Lance_A_Brown · September 18, 2014, 1:21pm

I had that configuration initially after I realized wait_for would not do what I needed. Then I realized setting a timeout on the curl call (-m 2) gets me effectively the same thing. If the node isn't back up yet, the curl call times out and the grep fails. No need to run two tasks.

--[Lance]

- name: wait node {{ansible_hostname}}-{{service}}
  local_action: "shell curl -s -m 2 'localhost:9200/_cat/nodes?h=name' | tr -d ' ' | grep -E '^{{ansible_hostname}}-{{service}}$' "
  register: result
  until: result.rc == 0
  retries: 200
  delay: 3
  notify:
    - cluster routing all {{ansible_hostname}}-{{service}}

James_Martin1 · September 18, 2014, 2:05pm

I think a more “ansiblish” approach would be:

- name: wait node {{ansible_hostname}}-{{service}}
  uri: url=http://localhost:9200/_cat/nodes?h=name timeout=120 return_content=yes
  register: result
  until: (ansible_hostname + '-' + service) in result.content
  retries: 200
  delay: 3

Lance_A_Brown · September 18, 2014, 2:17pm

Oh! I like that. Thanks!

--[Lance]

Sam_Doran · September 22, 2014, 4:23pm

I’ve written two rolling Elasticsearch playbooks: one for restarting the cluster, the other for updating the cluster. I use the feature of Ansible that will intelligently take JSON output in a register variable and make it addressable (not sure if I’m using the right terminology). Because Elasticsearch returns JSON when you query the API, you can use the registered output of the uri module and a do-until loop to wait for a certain condition before continuing. So instead of using result.stdout or result.content with a search, you can use result.json.[field] to be very precise.

- name: Wait for cluster health to return to yellow
  uri: url=http://localhost:{{ es_http_port }}/_cluster/health method=GET
  register: response
  until: "response.json.status == 'yellow'"
  retries: 5
  delay: 30

The contents of the registered response variable, including response.json, are:

ok: [node01.acme.com] => {
    "response": {
        "changed": false,
        "content_length": "227",
        "content_location": "http://localhost:9500/_cluster/health",
        "content_type": "application/json; charset=UTF-8",
        "invocation": {
            "module_args": "url=http://localhost:9500/_cluster/health method=GET",
            "module_name": "uri"
        },
        "json": {
            "active_primary_shards": 405,
            "active_shards": 810,
            "cluster_name": "elasticsearch",
            "initializing_shards": 0,
            "number_of_data_nodes": 3,
            "number_of_nodes": 4,
            "relocating_shards": 0,
            "status": "green",
            "timed_out": false,
            "unassigned_shards": 0
        },
        "redirected": false,
        "status": 200
    }
}

I’m not sure of the value of using pre and post tasks with a serial of one in my playbooks. It could all most likely be one list of tasks.

I based the playbooks on the Elasticsearch documentation for updating 1.0 and later and rolling restart.
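
As an illustration of how precise such a condition can be, here is a sketch of a stricter wait that combines several of the fields from the health output above. The variable es_expected_nodes is hypothetical (the node count you expect once every node has rejoined), and the "is defined" guard is an assumption to keep the loop retrying while the node is still unreachable.

```yaml
# Sketch only: es_expected_nodes is a hypothetical variable, not part of
# the playbooks described above.
- name: Wait for a fully settled cluster
  uri:
    url: "http://localhost:{{ es_http_port }}/_cluster/health"
    method: GET
  register: response
  until: >-
    response.json is defined
    and response.json.status == 'green'
    and response.json.relocating_shards == 0
    and response.json.initializing_shards == 0
    and response.json.number_of_nodes == (es_expected_nodes | int)
  retries: 20
  delay: 30
```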
