AIX execution time disastrous, too much system calls - EBADF

Hello,

We use Ansible to manage our +/- 450 AIX lpars & wpars + some linux and some Solaris server.

For the big picture :

The ansible server is an AIX lpar with 5 CPU POWER9 in SMT8 mode ( 40 threads ), 128 Gb of ram. we use ssh multiplexing, SAN disks, etc.

All is well configured and any other program that we run for admin on this LPAR is lightspeed execution. The ansible setup is done by a collogue of mine.

Some day ago I have to run a role that I’ve write to simply copy a a file on multiple LPARs and… it was slow as hell!

Here the result of time …
real 57m59.909s
user 21m35.657s
sys 40m17.558s

The system time is, for me, way too high.

One more information: we use vaults.

I’ve made some truss and I see that the process do an inimaginable number of errors “close” with the following code: EBADF

22610386: 88670545: close(9) Err#9 EBADF

The number of of syscall is almsot 2Million / minute.

I’ve done a truss -fc for the ansible command and the profile is the following :
( I’ve stopped it, running for 3hours and not finished )

signals ------------
SIGHUP 17
SIGTERM 5
SIGCHLD 55
total: 77

syscall seconds calls errors
kfork 86.39 10130
execve 74.58 9109 1
thread_create .00 613
kill .00 8 1
getuidx .06 126144
getgidx .03 82074
checkpnt_block .00 38382
_exit .00 9859
kwaitpid .01 82043 54
getprocs64 .00 2
getprocs .01 8313
appgetrlimit .00 2
getrpid .00 8315
_ksetjmp .00 60
sigaction .08 215799
sigprocmask .06 164428
thread_waitlock
.00 2090
thread_unlock
.03 627360
thread_twakeup .00 9352
thread_tsleep .05 14477
kthread_ctl .00 602
thread_setstate .00 613
thread_setmymask_fast .03 83114
thread_setmystate_fas .00 2113
thread_setmystate .01 80825
thread_init .00 8315
times .00 3
_nsleep .00 10 2
ksetcontext_sigreturn .00 55
getgroups .00 15
getkerninfo .00 1
getpeereid .00 16
ngetpeername .00 48
ngetsockname .00 70
getsockopt .14 233929 5
setsockopt .01 16782
shutdown .00 129
_enrecvmsg .00 52 4
_erecv .01 17694 2807
_enrecvfrom .00 30
_ensendmsg .03 24852
_esend .01 12216
accept1 .00 16
listen .00 19
bind .00 19
socket .02 16869
connext .05 16850 64
gethostname .00 8314
unicount .00 1
close 214.11597170646 595396092
kwrite .02 36669
kread 7.40 669885 29591
klseek .05 219649
lseek .57 1965085 56688
kloadquery .00 2
_select .14 508357
_poll 1.20 39916 20
sysconfig .00 1
sys_parm .00 9045
chmod .00 1
finfo .00 1
getdirent64 .14 335910
getdirent .03 151048
kioctl .10 323190 258064
link .00 20
mkdir .00 3 1
mknodat .00 3
mntctl .06 134364
kopen .48 474148 27041
rename .00 3
fstatx .13 497636
statx .57 991522 79975
readlink .00 6 1
umask .00 9086
uname .00 9048
unlink .00 42 1
_sem_wait .00 3654 3
_sem_open .00 641
sem_unlink .00 641
sem_post .00 3650
sem_getvalue .00 3650
read_sysconfig 1.20 9108
__libc_sbrk .03 44142
sbrk .01 26586
sigaltstack .00 1
appulimit .00 8315
_setsid .00 5
_getpgrp .00 2
_getppid .00 733
_thread_self .08 252144
_getpid .06 208113
pipe .01 35446
mprotect .00 3529
munmap .09 15101
mmap .04 18081
vmgetinfo .01 17423
corral_getcid .00 1
smcr_procattr .00 8315
access .00 8439 4
kfcntl .24 961488 10
kusla .00 2263
kload .02 2888 629
accessx .09 155777 63642
__statxacl .00 1
getpriv .00 9045
_kgrantpt .00 1346
loadquery .02 49878
__loadx .88 1015019


sys totals: .00608342843 595914700
usr time: .00
elapsed: .00

You can see the number of close error. I don’t have word to describe that.
If I run ansible with only one host, the close is not present.

I’ve search on the net for any kind of information about that but I don’t find any.

Here the ansible --vesion

/opt/ansible$ ansible --version
ansible 2.9.14
config file = /opt/ansible/ansible.cfg
configured module search path = [‘/home/XXXX/.ansible/plugins/modules’, ‘/usr/share/ansible/plugins/modules’]
ansible python module location = /opt/freeware/lib/python3.7/site-packages/ansible
executable location = /usr/bin/ansible
python version = 3.7.11 (default, Sep 6 2021, 10:46:05) [GCC 8.3.0]

Do some one have an idea about that ?

Thank you!

EBADF is “Bad file descriptor”, which on a close() is supposed to mean it’s not open.
On the truss output, prior to the ‘close(9)’, do you see an open() that returns the 9 file descriptor ?