Horovod/openmp error on GPU nodes
Biswajit Paria
bparia at cs.cmu.edu
Fri Sep 28 18:21:10 EDT 2018
Hi,
I am facing the following error when trying to run Horovod (which uses
Open MPI) with TensorFlow on the GPU nodes. Interestingly, the error does
not appear immediately: my code runs fine for some time, and then the
errors start appearing, after which I have to move to a new GPU node. I
suspect this is again related to NFS and permissions, like the previous
GPU issue.
Please let me know if you have a solution to this. Thanks.
--------------------------------------------------------------------------------------------------------------
/zfsauton/home/bparia/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
[gpu6.int.autonlab.org:17401] PMIX ERROR: OUT-OF-RESOURCE in file gds_dstore.c at line 1178
[gpu6.int.autonlab.org:17401] PMIX ERROR: OUT-OF-RESOURCE in file gds_dstore.c at line 1313
[gpu6.int.autonlab.org:17401] PMIX ERROR: OUT-OF-RESOURCE in file gds_dstore.c at line 2331
[gpu6.int.autonlab.org:17401] PMIX ERROR: OUT-OF-RESOURCE in file gds_dstore.c at line 3148
[gpu6.int.autonlab.org:17401] PMIX ERROR: OUT-OF-RESOURCE in file gds_dstore.c at line 3180
[gpu6.int.autonlab.org:17401] PMIX ERROR: OUT-OF-RESOURCE in file server/pmix_server.c at line 2151
[gpu6.int.autonlab.org:17406] PMIX ERROR: OUT-OF-RESOURCE in file client/pmix_client.c at line 228
[gpu6.int.autonlab.org:17406] OPAL ERROR: Error in file pmix2x_client.c at line 109
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[gpu6.int.autonlab.org:17406] Local abort before MPI_INIT completed
completed successfully, but am not able to aggregate error messages, and
not able to guarantee that all other processes were killed!
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[30728,1],0]
Exit code: 1
----------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------------------
Biswajit Paria
PhD student
MLD CMU