FreeBSD buildworld/buildkernel crashes host OS??

Hi,

I am trying to automate “jail” creation under FreeBSD, which includes building the OS from sources - a rather lengthy process.

I’m using “polling” while doing it:

  - name: build world
    shell: executable=/bin/sh chdir=/usr/src SRCCONF={{ build_conf }} make buildworld > /tmp/build.log 2>&1
    async: 45
    poll: 30

  - name: install world
    shell: executable=/bin/sh chdir=/usr/src SRCCONF={{ install_conf }} make installworld DESTDIR={{ disk_mount_point }} >> /tmp/build.log 2>&1
    poll: 30

  - name: make distribution
    shell: executable=/bin/sh chdir=/usr/src SRCCONF={{ install_conf }} make distribution DESTDIR={{ disk_mount_point }} >> /tmp/build.log 2>&1
    poll: 30

  - name: build kernel
    shell: executable=/bin/sh chdir=/usr/src SRCCONF={{ build_conf }} make buildkernel DESTDIR={{ disk_mount_point }} KERNCONFIG={{ kernel_config }} >> /tmp/build.log 2>&1
    poll: 30

  - name: install kernel
    shell: executable=/bin/sh chdir=/usr/src SRCCONF={{ install_conf }} make installkernel DESTDIR={{ disk_mount_point }} KERNCONFIG={{ kernel_config }} >> /tmp/build.log 2>&1
    poll: 30
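A side note on the snippet above: in Ansible, `async` is the maximum allowed runtime in seconds, and `poll` only has an effect when `async` is set, so `async: 45` looks far too low for a build that takes 1-2 hours. A sketch of how the first task could be written in YAML dict syntax with a more realistic ceiling (the timeout values here are assumptions, not tested against this playbook):

```yaml
- name: build world
  shell: SRCCONF={{ build_conf }} make buildworld > /tmp/build.log 2>&1
  args:
    executable: /bin/sh
    chdir: /usr/src
  async: 10800   # maximum runtime in seconds; 3 hours is an assumed upper bound
  poll: 30       # controller checks the job status every 30 seconds
```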

However, every time I hit “build kernel”, my builder VM crashes. When running the same task manually, I have no such issue (whether using screen or not).

Has anybody seen something similar in the wild or experienced similar issues?

Any logs available?

Likely not an ansible issue, though I’m not sure how it would be different.

> Any logs available?

Nothing that would identify the culprit. I’m re-running the playbook at the moment to test some other aspects of it. If anything pops up, I’ll post it here.

The thing is, Ansible just “sits there” while the VM has rebooted itself (naturally, since there’s a “poll interval” involved), so it’s not as though Ansible is crashing on the controller end. I’m just wondering whether something is leaking memory on the host side. buildworld produces a lot of output; however, I did pipe it to a file to avoid exactly that. I’m not sure what else could be causing it.

> Likely not an ansible issue, though I’m not sure how it would be different.

See above. I don’t have any hard evidence one way or the other; however, indirect evidence suggests that something about Ansible is affecting it. I ran the exact same commands via a straight SSH session and via “screen” - in both cases, no problem.

Question: I didn’t look at the code (yet); however, due to the polling, I’m assuming the Python script on the host side will be running a sub-process with redirected outputs, etc. Could there be a memory leak due to the significant number of polls within that time (build time is about 1-2 hours on that box)?

Am I using the right strategy for this? Since I can’t go fully async with that task, should I forgo “poll” completely? My only worry is that intermittent network issues might terminate the task, and “poll” may prevent that. Am I correct here?
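One way to guard against connection drops without blocking on `poll` is Ansible’s fire-and-forget pattern: start the job with `poll: 0`, then check on it with `async_status`. A sketch, with assumed timeout values, using the same `build_conf` variable as the playbook above:

```yaml
- name: build world (fire and forget)
  shell: SRCCONF={{ build_conf }} make buildworld > /tmp/build.log 2>&1
  args:
    executable: /bin/sh
    chdir: /usr/src
  async: 10800        # assumed 3-hour ceiling for the build
  poll: 0             # return immediately, leave the job running remotely
  register: build_job

- name: wait for buildworld to finish
  async_status:
    jid: "{{ build_job.ansible_job_id }}"
  register: job_result
  until: job_result.finished
  retries: 360        # 360 retries * 30 s delay = 3 hours
  delay: 30
```

Because the build keeps running on the remote host between status checks, a transient SSH failure during one check should not kill it.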

"
Question: I didn’t look at the code (yet) however due to the polling I’m assuming python script on the host side will be running sub-process with redirected outputs etc. could there be a memory leak due to a significant number of polls within that time (build time is about 1-2h on that box). "

That shouldn’t be the case - the poll options don’t really save previous results.
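For illustration (this is the general async-job pattern, not Ansible’s actual code): the long-running job writes its state to a small status file, and each poll simply re-reads that file, so nothing accumulates between polls:

```python
# Illustrative sketch only -- not Ansible internals. A background job
# overwrites a tiny JSON status file; each poll re-reads it fresh, so
# polling many times over 1-2 hours does not hold on to old results.
import json
import os
import tempfile

def write_status(path, finished, rc=None):
    """What the background wrapper does: overwrite the status file."""
    with open(path, "w") as f:
        json.dump({"finished": finished, "rc": rc}, f)

def poll_once(path):
    """What one poll does: read the status file and return its contents."""
    with open(path) as f:
        return json.load(f)

status_file = os.path.join(tempfile.mkdtemp(), "job_status.json")
write_status(status_file, finished=False)           # job starts
assert poll_once(status_file)["finished"] is False  # a poll sees "running"
write_status(status_file, finished=True, rc=0)      # job completes
result = poll_once(status_file)
print(result["finished"], result["rc"])             # True 0
```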

I should say, though (this is unrelated), that Ansible is not meant to be a build system. The norm is to use something like Jenkins to produce build products and then have Ansible deploy the artifacts from it. If a deploy process is taking 1-2 hours, that sounds really strange to me, and I think there may be better ways to optimize things. That being said, I haven’t seen a 2-hour build since an unoptimized Java compile back in ~2004, so it’s been a while, and you might have other reasons, or I might not be understanding the use case.

If you can dig more and find out what’s up, I’d be very interested in findings.

The worst thing has happened - my last re-run went through just fine. :wink:

I do not like “inconsistent”, but I have no idea what is happening or why. I did update the playbook, but not the section that was “crashing”. I’ll dig deeper the next time the problem surfaces. I’ll probably be re-running the playbook within the next week, so that should give it a chance to crash again. :slight_smile:

Jenkins won’t help as far as I know (I could be wrong), as I would still need to fire off the build as part of the playbook and wait for its artifacts to be produced.

For posterity: I’ve tracked down the issue - it was FreeBSD’s UFS “journaling” that was crashing things. It looks like Ansible was able to thrash the system well enough to expose issues with that filesystem feature. After disabling it, things seem to be running as expected.
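For anyone hitting the same thing: soft updates journaling (SU+J) on UFS can be toggled with tunefs(8). A sketch of the commands involved - the device name is a placeholder, and the filesystem must be unmounted or mounted read-only (e.g. from single-user mode) when changing the flag:

```sh
# Inspect the current UFS tuning flags (device path is a placeholder):
tunefs -p /dev/ada0p2

# Turn off soft updates journaling on that filesystem:
tunefs -j disable /dev/ada0p2
```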

Exciting!