<div dir="ltr">Hi Predrag,<div><br></div><div>Thanks a lot for clarification.</div><div><br></div><div>In my original question, I should have stated that I have already implemented my code via mpi4py to make it work distributedly. I should have asked the question in another way directly related to the cross-node InfiniBand communication (gladly you provided this info. in your response so I am clear right now). I have already checked a lot and tried to set up jobs running across nodes in different ways but failed. At the time I was sending you the email, I am 99% sure that the infrastructure doesn't support this but I just want to confirm with you.</div><div><br></div><div>I feel guilty that due to my vagueness, you end up with such a long feedback. Thank you for your detailed and informative explanation. Very appreciated.</div><div><br></div><div>Most sincerely,</div><div>Zhe</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Apr 5, 2021 at 8:01 PM Predrag Punosevac <<a href="mailto:predragp@andrew.cmu.edu">predragp@andrew.cmu.edu</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr">Hi Zhe,<div><br></div><div>I hope you don't mind me replying to the mailing list as you are asking a question that might be of concern to others. </div><div><br></div><div>I hold a terminal degree in pure mathematics and work in math-physics when I have time. I am sure most of you know infinitely more about computing than I. Please take my answer with a grain of salt. There are two things in pre GPU era that people who were involved in HPC (high-performance computing) had to understand. One is OpenMPI and the second one is MPI computing. </div><div><br></div><div>OpenMP is a way to program on shared memory devices. This means that parallelism occurs where every parallel thread has access to all of your<br>data. You can think of it as: parallelism can happen during the execution of a specific for a loop by splitting up the loop among the different threads.</div><div>Our CPU computing nodes are built for multi-threading computing. Unfortunately, most of you are using Python. Python doesn't support multi-threading due to GIL (global-interpreter-lock). Thus you have people spawning numerous scripts and crushing machines. I don't know about R which is essentially a fancy wrapper on the top of pure C to tell you how efficient it is. I do know enough about Julia to tell you that multi-threading is built in. Julia uses Threads.@threads macro to parallelize loops and Threads.@spawn to launch tasks on separate system threads. Use locks or atomic values to control the parallel execution. </div><div><br></div><div><a href="https://docs.julialang.org/en/v1/manual/parallel-computing/" target="_blank">https://docs.julialang.org/en/v1/manual/parallel-computing/</a><br></div><div> </div><div>MPI is a way to program on distributed memory devices. This means that parallelism occurs where every parallel process is working in its<br>own memory space in isolation from the others. You can think of it as every bit of code you've written is executed independently by every process. The parallelism occurs because you tell each process exactly which part of the global problem they should be working on based entirely on their process ID. 
Historically, with the exception of a short Hadoop period when we ran the Rocks cluster, which comes pre-configured for distributed computing, we have not used distributed computing. If you force me to speculate about why, I think it is because the primary method of hardware acquisition in our lab was (and still is) accretion. Our infrastructure was too inhomogeneous, put together in an ad hoc fashion rather than by careful design. Blame it on the funding sources. We have never had the luxury of spending half a million dollars on a carefully designed cluster built around InfiniBand. These days our hardware is homogeneous enough, and 40 Gigabit InfiniBand gear is dirt cheap now that the national labs have largely migrated to 100 Gigabit, so I could cobble together a few CPU or even GPU clusters if I got a few thousand dollars for used InfiniBand hardware. IIRC, Python uses the multiprocessing library for process-based parallelism (a minimal single-node sketch appears further down in this message)

https://docs.python.org/3.8/library/multiprocessing.html

and does support distributed computing

https://wiki.python.org/moin/ParallelProcessing

but I am not familiar with it. Julia, which I use, does have native support for distributed computing; please see the link above.

The way in which you write an OpenMP and an MPI program is, of course, also very different.

MPI stands for Message Passing Interface. It is a set of API declarations for message passing (send, receive, broadcast, etc.) and for the behavior expected from implementations. I have not done enough C and Fortran programming to know how to use MPI correctly. Also, for the record, I don't know C++. People who know me well are well aware of how irritated I get when C and C++ are used interchangeably in a single sentence.

The idea of "message passing" is rather abstract. It could mean passing messages between local processes or between processes distributed across networked hosts, etc. Modern implementations try very hard to be versatile and to abstract away the multiple underlying mechanisms (shared memory access, network IO, etc.).

OpenMP is an API that is all about making it (presumably) easier to write shared-memory multi-processing programs. There is no notion of passing messages around. Instead, with a set of standard functions and compiler directives, you write programs that execute local threads in parallel, and you control the behavior of those threads (what resources they should have access to, how they are synchronized, etc.). OpenMP requires compiler support, so you can also look at it as an extension of the supported languages.

And it is not uncommon for an application to use both MPI and OpenMP.

I am afraid that if you were hoping for a pre-configured distributed environment that would let you execute a single magic command like mpiexec, you will be disappointed. This is an instance where using the Pittsburgh Supercomputing Center is probably more appropriate. There are limits to the one-man IT department model currently used by the Auton Lab. You have just exposed the ugly truth.
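Here is the single-node multiprocessing sketch I mentioned above. Again, I am not a Python person, so treat it as an illustration of the pattern rather than a recommendation; Pool and its map method are from the standard library, while the worker function is made up.

    # Single-node, process-based parallelism with Python's multiprocessing module.
    # Workers are separate processes, so the GIL is not an issue, but nothing here
    # crosses node boundaries.
    from multiprocessing import Pool

    def work(x):
        # stand-in for a CPU-bound task
        return x * x

    if __name__ == "__main__":
        with Pool(processes=8) as pool:           # eight worker processes on one machine
            results = pool.map(work, range(100))  # inputs are split among the workers
        print(sum(results))

For anything that has to span machines this is not enough; that is where MPI and a proper interconnect come in.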
For the record, I would be far happier spending more time on genuine HPC and never being bothered with trivialities, but budgetary constraints are the major obstacle.

Most Kind Regards,
Predrag

P.S. Please don't get me started on HPC GPU computing :-)

On Mon, Apr 5, 2021 at 5:00 PM Zhe Huang <zhehuang@cmu.edu> wrote:

Hi Predrag,

Sorry to bother you. I have been trying to run my experiment across multiple nodes (e.g., on both gpu16 and gpu17) in a distributed manner. I saw that an MPI backend is pre-installed on the Auton cluster. However, I tested it and it didn't seem to work at all (I was using this command on gpu16 to run jobs on gpu17: mpiexec -n 8 -hosts gpu17.int.autonlab.org echo "hello").
Is there no cross-node communication on the cluster at all, or did I do something wrong? If it is the latter, could you point me to a one-line working example? Thanks.

Sincerely,
Zhe