Is this a known issue?

I finally rebuilt all of my servers with Ansible, with one acting as the central node, pushing out updates.

Before unleashing cron to run things automatically, I tried running the script by hand. It hung on a setup run.

I ran with -vvv, and it seemed to hang after running the script that setup uploaded. I considered running that uploaded script by hand to see if something was breaking, and discovered a whole bunch of files in /root/.ansible/tmp.

Thinking these were probably left around from previous runs that I aborted, I just rm'd the .ansible directory completely. Everything worked after that.

Wondering if this is a known issue, and whether it might make sense to clean up ~/.ansible on interrupt? I know that Ansible can't guarantee any external results in that case, but at the very least it should clean up after itself if the artifacts it leaves behind can cause hangs. Maybe hitting Ctrl-C twice could abort the cleanup if that's desired for whatever reason, or maybe Ansible could clean this directory when launching so it starts from a clean slate?

As soon as I removed .ansible on one host, the same hang appeared on another host and the same fix worked there. This is under 1.0.0.

Thanks.

Curious why that setup step hung, actually?

Can you debug with the ./hacking/test-module script on the remote node perhaps?
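For example, something along these lines from an Ansible source checkout on that node (the module paths and arguments are just placeholders for whatever you suspect is hanging):

# exercise the setup module directly on the remote node
./hacking/test-module -m ./library/setup
# a module that takes arguments can be driven with -a
./hacking/test-module -m ./library/command -a "/bin/echo hello"

That would at least tell you whether the module itself is wedging or whether it's the transport.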

Ansible removes intermediate files by appending a "; rm ..." to the remote execution, so they get cleaned up whether the module fails or not.

If you are control-C'ing runs, though, that explains it; the run would never have gotten to the removal.
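Roughly speaking (the tmp path below is made up for illustration; the real one is generated per run), the module execution and the cleanup are chained into one remote shell command, so an interrupt partway through never reaches the rm:

# illustrative shape only, not the exact command Ansible builds
python /root/.ansible/tmp/ansible-XXXXXXXXXX/setup ; rm -rf /root/.ansible/tmp/ansible-XXXXXXXXXX/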

Not as critical, but I agree, leaving them around is suboptimal.

I am thinking it may be reasonable for the setup module to clean up old
tmp files under ~/.ansible if it can, so that they at least don't accumulate.

What do folks think about that?

It wouldn't cause any extra traffic or work to try that at that point.

What if two separate administrators were using Ansible to modify different, unrelated aspects of a server at the same time, say, one modifying the webserver configuration and the other the yum repos? That seems like a legitimate usage pattern that this would break.

K

Could it be that you are trying to use fireball mode? When I was testing fireball mode for the first time, it also hung somewhere around the setup step (I think) without any information about what it was waiting for, because the firewall was not yet opened for fireball mode. Regards, Robert

Correct, yeah, we probably don't want to do that.

Perhaps just having a cleanup task/example would be sufficient.
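An untested sketch of what I mean, with the age threshold left to taste; the -mtime +1 keeps it away from anything recent, including the tmp directory of the run that is executing it:

- hosts: all
  user: root
  tasks:
    # prune stale tmp dirs left behind by interrupted runs
    - name: clean up old ansible tmp directories
      action: shell find /root/.ansible/tmp -maxdepth 1 -type d -name 'ansible-*' -mtime +1 -exec rm -rf {} +

Running that from cron every so often would keep things from accumulating without touching anything a live run is using.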

Another common thing can be that you're not logging in as the user you
/think/ you are logging in as (specify -u if it's not the current user
account), or sudo requires a password and --ask-sudo-pass wasn't set.

I am too, it did seem a bit odd.

At the time I just wanted to get this working, especially since I've been Ansible-izing everything for a while now and kind of want to get on with the business of letting automated things be automatic. :slight_smile: If it happens again, I'll try debugging with the hacking/test-module script.

Running fireball, but the firewall should be open. I have an explicit setup playbook that runs first everywhere, that opens up the firewall so fireball can connect.

In this instance, I was logging in as root, running the script by hand, and the play in question specifies "user: root". Also, as soon as $HOME/.ansible was removed, everything seemed to work with no other changes.

OK, a bit more information:

I'm experiencing this again. In syslog, I see that fireball can't continue because of an exception, but the exception itself isn't logged. I just killed a bunch of hung ansible-playbook processes.

The message shows up during unattended runs, so the exception isn't from me interrupting.

Maybe this exception should be logged as well? Or is there already a way to do that?

Fireball is a bit new so it's possible we have some things to work out
still. Open a ticket please and include whatever info you get, and
how to reproduce it if you can.

I figured out what was happening.

I have a 00setup play that runs before all others, which launches fireball and performs other bootstrap steps that generally need to run everywhere. Unfortunately, that play hadn't completed on some hosts, so fireball never started, and some ansible-playbook runs were hanging for hours until I killed them.

Is there a timeout on fireball connects?

It also wasn't logging the exception, which would have helped. It just noted that there was one. Not sure whether a timeout makes sense (seems like it should, but maybe there's some complexity I'm missing) but exception logging does. I opened a ticket.

Thanks.

You need to run the fireball play to completion prior to the others, like so:

ansible-playbook engage-fireball.yml do-fireball-stuff.yml

That would do it.

If you're already doing something like that and still seeing the problem, let me know.

I suspect what you have found is there is no timeout, but there really
needs to be one :slight_smile:

Yeah, I just went through today and renamed my playbooks such that their filenames imply an order. So 00setup.yml does generic bootstrapping and starts fireball, and everything after uses fireball. Originally I'd had that play in _setup.yml, thinking the _ would come before other characters, but that was naive.

Now everything works more or less smoothly.
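So the cron script can just hand the whole directory to ansible-playbook in filename order, something along the lines of (names here are only examples):

# a shell glob expands in lexical order, so 00setup.yml always runs first
ansible-playbook *.yml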

Cool. Well, starting a new fireball as the same user kills the existing fireball, so you will be ok.

Also feel free to include the fireball stanza at the top of each
playbook if you want, and that way while it may re-init for each play,
you're guaranteed to have it set up.
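The sort of stanza I mean is roughly this; adjust hosts, user, and so on to match your setup:

# play 1: start the fireball daemon over the normal ssh transport
- hosts: all
  connection: ssh
  user: root
  gather_facts: False
  tasks:
    - action: fireball

# later plays in the same playbook can then use the fireball transport
- hosts: all
  connection: fireball
  tasks:
    - action: shell echo "fireball transport is up"

The first play has to go over plain ssh since there's no fireball daemon to talk to yet; after that the rest of the file can stay on the fireball connection.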