MPI4PY BUG

MPI4PY

mpi4py is a python wrapper of mpi.

I was testing the MPI4PY unit tests in mpi4py/test, and found there might be a bug here.

What I tested is test_spawn.py. I have 3 test cases as following:

  1. It works fine when python test_spawn.py
  2. It still works fine when mpirun --oversubscribe -np 2 -H host1,host2 python test_spawn.py or make it running in background, like mpirun --oversubscribe -np 2 -H host1,host2 python test_spawn.py &. I set both host1 and host2 with the same environment. (Although there might be some tmp file mismatch, it still could pass a few tests)
  3. Here comes the bug (I think it is): I use two scripts, namely script.sh and run.sh, respectively.

script.sh is like

mpirun --oversubscribe -np 2 -H host1,host2 python test_spawn.py

run.sh is just:

sh script.sh &

In this case, there will be a timeout exception:

[@nmyjs_104_22 test]$ [nmyjs_104_37:34159] OPAL ERROR: Timeout in file base/pmix_base_fns.c at line 193
E[nmyjs_104_22:23226] OPAL ERROR: Timeout in file base/pmix_base_fns.c at line 193
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_dpm_dyn_init() failed
  --> Returned "Timeout" (-15) instead of "Success" (0)
--------------------------------------------------------------------------
[warn] Epoll ADD(4) on fd 35 failed.  Old events were 0; read change was 0 (none); write change was 1 (add): Bad file descriptor
[warn] Epoll ADD(4) on fd 51 failed.  Old events were 0; read change was 0 (none); write change was 1 (add): Bad file descriptor
[warn] Epoll ADD(4) on fd 48 failed.  Old events were 0; read change was 0 (none); write change was 1 (add): Bad file descriptor
[warn] Epoll ADD(4) on fd 30 failed.  Old events were 0; read change was 0 (none); write change was 1 (add): Bad file descriptor

I’m thinking that this is a bug, because the only difference between case2 and case3 is that case3 invoke mpirun ... in background from a bash script.

This might be a bug that mpi4py team should fix. This might also be a bug that openmpi team should fix.

Published: April 04 2017

  • tags:
blog comments powered by Disqus