gpu24 and gpu25 added to the cluster

Thu Dec 16 13:27:25 EST 2021

Just to add to this info. The installed version of CUDA is

cuda-11.5.1-1.x86_64

We already have a bunch of servers using CUDA 11.1, but perhaps nothing
newer than 11.3. Rolling back to the EOL CUDA 10 is an option of last
resort.

I installed /opt/miniconda-py39

which is Python 3.9.5. Most older servers run the Python 3.8 branch or even
the 3.7 branch.

I would like everyone to keep in mind that the OS packaging problem is
NP-hard, so rolling things back to some "sweet spot" might be a
prohibitively expensive approach.
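
Regarding the two errors quoted below: one possible cause (a guess, not a
confirmed diagnosis) is that the framework build in a given environment
does not ship kernels for these GPUs. The RTX A6000 reports compute
capability 8.6, and a build compiled only for older architectures would
fail with exactly this kind of "no kernel image" message. The sketch below
checks that for PyTorch. The helpers parse_arch and build_supports are
hypothetical names made up for illustration; torch.cuda.get_arch_list()
and torch.cuda.get_device_capability() are the actual PyTorch calls. The
compatibility rule is simplified: tags that do not look like sm_NN or
compute_NN are ignored.

```python
import re

_TAG = re.compile(r"^(sm|compute)_(\d{2,3})")


def parse_arch(tag):
    """Split a tag such as 'sm_86' into ('sm', 8, 6); return None if unrecognized."""
    m = _TAG.match(tag)
    if m is None:
        return None
    kind, digits = m.group(1), m.group(2)
    return kind, int(digits[:-1]), int(digits[-1])


def build_supports(capability, arch_list):
    """Return True if a build with the given arch tags can run on the device.

    capability is a (major, minor) tuple, e.g. (8, 6) for an RTX A6000.
    """
    major, minor = capability
    for tag in arch_list:
        parsed = parse_arch(tag)
        if parsed is None:
            continue  # simplification: skip tags we do not understand
        kind, a_major, a_minor = parsed
        if kind == "sm":
            # Native binaries run only within the same major version,
            # on devices whose minor version is at least the build's.
            if a_major == major and a_minor <= minor:
                return True
        else:
            # 'compute' means embedded PTX, which the driver can JIT-compile
            # for any device of equal or higher capability.
            if (a_major, a_minor) <= (major, minor):
                return True
    return False


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None
    if torch is None or not torch.cuda.is_available():
        print("PyTorch with a usable CUDA device was not found here.")
    else:
        archs = torch.cuda.get_arch_list()
        print("torch", torch.__version__, "built for CUDA", torch.version.cuda)
        print("arch list:", archs)
        for i in range(torch.cuda.device_count()):
            cap = torch.cuda.get_device_capability(i)
            print("GPU", i, "capability", cap, "supported:", build_supports(cap, archs))
```

Run it inside the same conda environment that produced the error. If it
prints "supported: False" on gpu24 or gpu25, installing a framework build
made against CUDA 11.x in that environment is likely a cheaper fix than
changing anything on the server side.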

Predrag

On Thu, Dec 16, 2021 at 12:40 PM Ifigeneia Apostolopoulou <
iapostol at andrew.cmu.edu> wrote:

> Hi all,
>
> Has anyone tried to test the new servers?
>
> I have not managed to run either pytorch or tensorflow processes. I am
> getting the following errors:
>
> tensorflow: CUDA runtime implicit initialization on GPU:0 failed. Status:
> device kernel image is invalid
>
> pytorch: RuntimeError: CUDA error: no kernel image is available for
> execution on the device
>
> I am not sure whether this is a CUDA installation issue / incompatibility
> (however, I am facing a problem with both pytorch and tensorflow processes
> that can run on the rest of the servers).
>
> thanks!
>
>
>
> On Wed, Dec 15, 2021 at 10:26 PM Predrag Punosevac <
> predragp at andrew.cmu.edu> wrote:
>
>> Dear Autonians,
>>
>> I just finished provisioning two new GPU nodes. The purchase was approved
>> by Dr. Schneider in July, but the order was not placed until late August
>> due to CMU internal issues, just in time to be affected by supply chain
>> disruption. The servers were finally shipped on 11/24/2021
>> and received last Wednesday, 12/8/2021. To add insult to injury, the
>> nodes were not tagged until Monday afternoon. I literally had to
>> hunt down people to do the work.
>> I spent half a day yesterday getting power cables and other misc
>> supplies. Thus they are only done today. However, I think they are
>> definitely worth the trouble.
>>
>> Each server comes with 8 NVIDIA RTX A6000 GPUs connected by high-speed GPU
>> interconnect NVIDIA links in addition to PCIe. Each server has 2 AMD EPYC
>> 7502 32-core processors for a total of 128 threads per server. These CPUs
>> are almost as fast as your desktop processors (3.5 GHz).
>> Each server has 512GB of RAM and 2TB of scratch. These servers have
>> 24 2.5" HDD bays, so they could potentially be used as storage space. I
>> don't have 2.5" HDDs in the lab right now to populate the bays.
>>
>> There is one thing which is for now done suboptimally. Namely, the servers
>> were shipped with a 1 Gb/s copper NIC and a 10 Gb/s fiber optical NIC. I
>> could not locate long enough optical cables in our lab yesterday, but I
>> will try to address this issue soon. I have exactly 2 optical connectors
>> on the switch, so it is down to cabling.
>>
>> Have fun, and sorry for the long delay.
>>
>> Predrag
>>
>