gpu24 and gpu25 added to the cluster

Ifigeneia Apostolopoulou iapostol at andrew.cmu.edu
Thu Dec 16 12:40:27 EST 2021


Hi all,

Has anyone tried to test the new servers?

I have not managed to run either PyTorch or TensorFlow processes on them. I
am getting the following errors:

tensorflow: CUDA runtime implicit initialization on GPU:0 failed. Status:
device kernel image is invalid

pytorch: RuntimeError: CUDA error: no kernel image is available for
execution on the device

I am not sure whether this is a CUDA installation issue or an
incompatibility. Either way, both PyTorch and TensorFlow fail on the new
nodes, while the same processes run fine on the rest of the servers.
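
One common cause of "no kernel image is available" errors is a framework
build that was not compiled for the GPU's compute capability (the RTX A6000
reports 8.6). The following is only a minimal sketch of how that could be
checked from PyTorch, not a confirmed diagnosis of these nodes.
arch_supported is a hypothetical helper written for illustration; the
torch.cuda calls are standard PyTorch and only run if PyTorch and a GPU are
present.

```python
def arch_supported(capability, arch_list):
    """Return True if a build compiled for arch_list can run on a GPU
    with the given (major, minor) compute capability.

    Hypothetical helper: it only understands plain entries such as
    "sm_86" or "compute_70" and ignores anything else.
    """
    major, minor = capability
    for arch in arch_list:
        kind, _, number = arch.partition("_")
        if not number.isdigit():
            continue  # skip entries this sketch does not understand
        a_major, a_minor = divmod(int(number), 10)
        # Native binary code runs on the same major version with an
        # equal or higher minor version.
        if kind == "sm" and a_major == major and a_minor <= minor:
            return True
        # PTX can be JIT-compiled by the driver for any equal or newer
        # compute capability.
        if kind == "compute" and (a_major, a_minor) <= (major, minor):
            return True
    return False


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None
    if torch is not None and torch.cuda.is_available():
        cap = torch.cuda.get_device_capability(0)
        archs = torch.cuda.get_arch_list()
        print("CUDA version PyTorch was built with:", torch.version.cuda)
        print("GPU 0 compute capability:", cap)
        print("Architectures in this build:", archs)
        print("Build covers this GPU:", arch_supported(cap, archs))
```

If the helper returns False for (8, 6), the installed build probably
predates support for these GPUs, and a PyTorch/TensorFlow build targeting
CUDA 11.1 or newer would be the first thing to try.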

thanks!



On Wed, Dec 15, 2021 at 10:26 PM Predrag Punosevac <predragp at andrew.cmu.edu>
wrote:

> Dear Autonians,
>
> I just finished provisioning two new GPU nodes. The purchase was approved
> by Dr. Schneider in July, but due to CMU internal issues the order was not
> placed until late August, just in time to be affected by supply chain
> disruptions. The servers were finally shipped on 11/24/2021 and received
> last Wednesday, 12/8/2021. To add insult to injury, the nodes were not
> tagged until Monday afternoon; I literally had to hunt down people to do
> the work. I spent half a day yesterday getting power cables and other
> misc supplies, so the nodes were only finished today. However, I think
> they are definitely worth the trouble.
>
> Each server comes with 8 NVIDIA RTX A6000 GPUs connected by NVIDIA's
> high-speed GPU interconnect links in addition to PCIe. Each server has 2
> AMD EPYC 7502 32-core processors, for a total of 128 threads per server.
> These CPUs are almost as fast as your desktop processors (3.5 GHz).
> Each server has 512GB of RAM and 2TB of scratch. These servers have
> 24 2.5" HDD bays, so they could potentially be used as storage space. I
> don't have 2.5" HDDs in the lab right now to populate the bays.
>
> There is one thing which is, for now, done suboptimally: the servers
> were shipped with a 1 Gb/s copper NIC and a 10 Gb/s fiber-optic NIC. I
> could not locate long enough optical cables in our lab yesterday, but I
> will try to address this issue soon. I have exactly 2 optical connectors
> on the switch, so it is down to cabling.
>
> Have fun, and sorry for the long delay.
>
> Predrag
>