Ansible playbook crashes after archiving with tar gz

Hello!
So my problem is, I need to make a compressed archive of quite a lot of files (about 14 GB for one machine, and it could go up to maybe 30…)

I tried the Ansible archive module, the command module, the shell module, and finally a script that I copy onto the target node and execute in a shell.

The problem is, when I do it manually, directly in the target machine's shell, it works fine (though it takes 30 min to complete with tar -czvf). Every time I try archiving from within the playbook, it crashes regardless of the method.
Here’s my code:

Script in role-archive/files/save_dir.sh

#!/bin/bash
# Archive /app/tools into $1/tools.tar.gz, skipping unreadable files
dir="$1"
tar --ignore-failed-read -zcvf "$dir"/tools.tar.gz /app/tools

Role in role-archive/tasks/main.yml

- name: Copy archive script on targeted node
  become: true
  become_method: sudo
  copy:
    src: "save_dir.sh"
    dest: "/scripts/"
    mode: '0755'
    owner: root
    group: root

- name: Execute script
  become: true
  become_method: sudo
  shell: "./save_dir.sh {{ tools_dir }}"
  args:
    chdir: "/scripts"
  async: 3600
  poll: 60

Ansible version: 2.9.10
Python version: 2.7.5
OS: Linux (obviously)

Also, tar+gzip may not be the quickest way to compress, but that's not my main concern; I just want my playbook to stop crashing, finish the archive properly, and continue.

Let me know if you have any advice, tips, or comments, or if you need further details to help me.
Thank you in advance :slight_smile:

Hi axellere669. What error message are you getting? You can run Ansible with -vvv to show debug info.

Hi @bvitnik, thank you for the quick answer.
I tried to capture the trace earlier, but because of tar's verbose option and the huge number of files, the output was so long that it scrolled past before I could get back to the Ansible error message.
Here's the trace (sorry for not including it earlier):

[list of files to be archived], "cmd": "./save_dir.sh /archive",
"... file changed as we read it\ntar: /app/tools/log/x/: file changed as we read it", "rc": 1,
"invocation": {"module_args": {"creates": null, "executable": null, "_uses_shell": true, "_raw_params": "./save_dir.sh /archive" …
'Shared connection to x1d1 closed.'

It's probably because the logs are constantly updated, though I'm not sure that's actually the case. Regardless, these logs really are needed. I'd like an archive format/mode that could “capture” the logs at the very moment the archive is being made and avoid returning an error, especially because:

  1. I already pass --ignore-failed-read and it doesn't seem to have any effect;
  2. Once again, when I create the archive manually with the exact same command, it works perfectly fine, though it takes a long time. :confused:

Maybe I should switch back to the Ansible archive module and try a different format than tar with gzip? It might cause fewer Ansible-related errors.
Sorry for the rant, and thank you again!

Edit: in the end, it seems the archive does complete, and it's the exact same size as when I compress manually. The main problem is that the play still crashes and refuses to continue with the next task. Maybe add ignore_errors: true at the task level? But it must fail for a reason…
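
For reference, I mean something like this on my existing task (the failure would be masked, which is why I'm hesitant):

- name: Execute script
  become: true
  become_method: sudo
  shell: "./save_dir.sh {{ tools_dir }}"
  args:
    chdir: "/scripts"
  async: 3600
  poll: 60
  ignore_errors: true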

You can suppress the warning with

--warning=no-file-changed

but I doubt that changes the return code.
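
Applied to the tar line in your script, that would be:

tar --ignore-failed-read --warning=no-file-changed -zcvf "$dir"/tools.tar.gz /app/tools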

You could extend your save_dir.sh script to count the files in the new tar.gz and exit 0 if there seem to be enough of them, exit 1 otherwise. That is a bit of a hack, but it makes the “Execute script” task tolerant of a condition you know will occur on almost every run.
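
A rough sketch of what I mean (the 1000 threshold is just a placeholder; pick a number based on how many files you expect under /app/tools):

#!/bin/bash
dir="$1"
tar --ignore-failed-read --warning=no-file-changed -zcvf "$dir"/tools.tar.gz /app/tools

# Count the entries that actually made it into the archive.
archived=$(tar -tzf "$dir"/tools.tar.gz | wc -l)

# Placeholder threshold: succeed only if "enough" files were archived.
if [ "$archived" -ge 1000 ]; then
    exit 0
else
    exit 1
fi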

1 Like

From “man tar”

RETURN VALUE
       Tar exit code indicates whether it was able to successfully perform the requested operation, and if not, what kind of error occurred.

       0      Successful termination.

       1      Some  files  differ.   If  tar  was invoked with the --compare (--diff, -d) command line option, this means that some files in the archive differ from their disk
              counterparts.  If tar was given one of the --create, --append or --update options, this exit code means that some files were changed while being archived and  so
              the resulting archive does not contain the exact copy of the file set.

       2      Fatal error.  This means that some fatal, unrecoverable error occurred.

Therefore I guess you can do something like

- name: Execute script
  become: true
  become_method: sudo
  shell: "./save_dir.sh {{ tools_dir }}"
  args:
    chdir: "/scripts"
  async: 3600
  poll: 60
  register: save_dir
  # Per the man page above: rc 1 = some files changed, rc 2 = fatal error
  failed_when: save_dir.rc == 2

Hope it makes sense

4 Likes

@axellere669 as already mentioned in previous posts, you can suppress or ignore the error in multiple ways; of these, I would recommend @matteoclc's approach. You can also test for (and ignore) a specific error message in tar's output with something like:

failed_when: save_dir.rc != 0 and 'file changed as we read it' not in save_dir.stderr
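
In context, with the register the condition needs, that would look something like this (a sketch reusing your task from above):

- name: Execute script
  become: true
  become_method: sudo
  shell: "./save_dir.sh {{ tools_dir }}"
  args:
    chdir: "/scripts"
  async: 3600
  poll: 60
  register: save_dir
  # Fail only when tar fails for a reason other than the known
  # "file changed as we read it" message.
  failed_when: save_dir.rc != 0 and 'file changed as we read it' not in save_dir.stderr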

On the other hand, I would recommend thinking a bit about what is going on in the background. You are trying to archive a moving target, which conceptually cannot yield a consistent result. Let's assume only log files are problematic and only they are being changed while you are trying to create the archive. In that case, you have two choices:

  • Stop the app/service that is creating the logs before you run tar and start it afterwards. This assumes that stopping the app/service is acceptable.
  • Rotate your logs and let tar ignore the live logs by using --exclude (see the sketch after this list). You can either rotate the logs at regular intervals (e.g. daily) or rotate them explicitly just before you run tar. Yes, your archive will be missing some fresh log messages, but some messages will always be missing no matter what you do, because archiving is not an atomic operation.
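
For the second option, the tar invocation could look something like this. It's a sketch: I'm assuming from your error output that the live logs sit under /app/tools/log, so adjust the pattern to your actual layout:

# Exclude the live log directory that keeps changing during the run.
tar --exclude='/app/tools/log/*' -zcvf "$dir"/tools.tar.gz /app/tools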

This way you do not have to ignore the error. Ignoring errors is generally not a good practice if you can work around your problem in other ways.

4 Likes

Hi again everyone, and thank you for your suggestions. I've thought about them, but after the last message, let me rephrase my concern.
When I execute my tar command in the shell of the target node, it works fine. The archive completes, with no error message. Live logs are never a problem in this case and no error is thrown.
When I do it via Ansible, whether with the command or the shell module, and whether I use the free form of the shell module or call a script that runs the exact same command, it does not work. Regardless of the method, the same error is thrown.
I was quite taken with the idea of ignoring the error only when certain conditions are met, but now that I've tried my tar again via Ansible (command module), my playbook crashed as always, and the archive is smaller than expected. So it's not a good workaround (thank you again everyone; it could have been, but the outcome of the command is not stable enough).
If I could understand why Ansible throws an error and stops the playbook when the shell doesn't, that would help me move forward. :slight_smile:

Is it possible that Ansible's own activity is contributing to the logs' contents, and that's why they change during an Ansible run but not when you run the command manually?

I’m concerned by this recurring terminology, “…it still crashes…”, “my playbook crashed as always”, etc. Please reassure me that we’re talking about a “normal” occurrence of a task entering a failed state rather than Ansible itself actually crashing. I mean, it’s understandable that you might feel it’s a “crash” when you don’t get the results you wanted, but is it not the case that Ansible is behaving correctly, just not as desired?

There’s one other part of the error message we have not discussed, and perhaps it is relevant: “Shared connection to x1d1 closed.” What do you suppose “x1d1” is?

1 Like

The bottom line is that this is not a deterministic problem, and your experience where manual archiving works but Ansible does not could be just a coincidence (the logs simply did not change at that moment). Also, when you run tar manually, you are probably still getting exit code 1 but choosing to ignore it, with the error/warning message hidden somewhere in the lengthy terminal output. Ansible cannot be aware of all the different exit codes of various shell commands and their meaning or importance, so any exit code other than 0 is treated as a failure.
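
You can easily check this the next time you run the command by hand (using the same paths as in your script and trace):

tar --ignore-failed-read -zcvf /archive/tools.tar.gz /app/tools
echo $?    # prints 1 if some files changed while being archived, even though the archive was written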

All in all, I don’t see any problem here. Everything is as it should be :slight_smile:.

Either that, or there is something specific to your system that we have not covered here.

1 Like