gpu24 and gpu25 added to the cluster

Fri Dec 17 10:09:42 EST 2021

cuDNN is not installed right now. It is proprietary software, and I have to
locate my NVIDIA developer credentials to get the RPMs. If someone can
download it quickly for me, I will install it.
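
Once cuDNN is installed, its version can be read from the header macros
rather than asked for on the list. The following is a minimal sketch, not an
official tool. It assumes the headers land in one of the include directories
named in the quoted message below, and that the version macros (CUDNN_MAJOR,
CUDNN_MINOR, CUDNN_PATCHLEVEL) live in cudnn_version.h on cuDNN 8 and in
cudnn.h on older releases. The function names are made up for illustration.

```python
import re
from pathlib import Path

# Assumption: candidate include directories, copied from the quoted message.
INCLUDE_DIRS = [
    "/usr/include",
    "/usr/local/cuda/include",
    "/usr/local/cuda-11/include",
    "/usr/local/cuda-11.5/include",
]

MACROS = ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL")


def parse_cudnn_version(header_text):
    """Return (major, minor, patch) parsed from header text, or None."""
    values = []
    for name in MACROS:
        match = re.search(
            r"^\s*#\s*define\s+" + name + r"\s+(\d+)", header_text, re.MULTILINE
        )
        if match is None:
            return None  # at least one macro is missing from this header
        values.append(int(match.group(1)))
    return tuple(values)


def find_cudnn_version(include_dirs=INCLUDE_DIRS):
    """Return (header_path, (major, minor, patch)) or None if nothing is found."""
    for directory in include_dirs:
        # Try the cuDNN 8 header first, then the pre-8 location.
        for filename in ("cudnn_version.h", "cudnn.h"):
            path = Path(directory) / filename
            if path.is_file():
                version = parse_cudnn_version(path.read_text(errors="replace"))
                if version is not None:
                    return str(path), version
    return None


if __name__ == "__main__":
    print(find_cudnn_version() or "cuDNN headers not found")
```

On a node without cuDNN this simply prints that no headers were found, which
matches the situation described above.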

On Fri, Dec 17, 2021, 9:11 AM Ifigeneia Apostolopoulou <
iapostol at andrew.cmu.edu> wrote:

> Hello Predrag,
>
> could you also please provide the cuDNN version? I couldn't find cudnn.h
> in /usr/include, /usr/local/cuda/include, /usr/local/cuda-11/include, or
> /usr/local/cuda-11.5/include.
>
> thanks!
>
>
> On Thu, Dec 16, 2021 at 1:27 PM Predrag Punosevac <predragp at andrew.cmu.edu>
> wrote:
>
>> Just to add to this info. The installed version of CUDA is
>>
>> cuda-11.5.1-1.x86_64
>>
>> We already have a bunch of servers using CUDA 11.1, but perhaps nothing
>> newer than 11.3. Rolling back to CUDA 10, which is EOL, is an option of
>> last resort.
>>
>> I installed /opt/miniconda-py39
>>
>> which is Python 3.9.5. Most older servers run the Python 3.8 branch or
>> even the 3.7 branch.
>>
>> I would like everyone to keep in mind that the OS packaging problem is
>> NP-hard, so rolling things back to some "sweet spot" might be a
>> prohibitively expensive approach.
>>
>> Predrag
>>
>> On Thu, Dec 16, 2021 at 12:40 PM Ifigeneia Apostolopoulou <
>> iapostol at andrew.cmu.edu> wrote:
>>
>>> Hi all,
>>>
>>> Has anyone tried to test the new servers?
>>>
>>> I have not managed to run either PyTorch or TensorFlow processes. I am
>>> getting the following errors:
>>>
>>> tensorflow: CUDA runtime implicit initialization on GPU:0 failed.
>>> Status: device kernel image is invalid
>>>
>>> pytorch: RuntimeError: CUDA error: no kernel image is available for
>>> execution on the device
>>>
>>> I am not sure whether this is a CUDA installation issue or an
>>> incompatibility (however, I am facing the problem with both PyTorch and
>>> TensorFlow processes that run fine on the rest of the servers).
>>>
>>> thanks!
>>>
>>>
>>>
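
Both errors quoted above mention a missing or invalid kernel image. One
possible cause, which this thread does not confirm, is a framework build that
was not compiled for the GPU's compute capability (the RTX A6000 reports 8.6).
The sketch below compares a device capability against the architectures a
PyTorch build was compiled for. The helper names are hypothetical, and the
compatibility rule is simplified: an sm_XY binary is treated as usable on a
device with the same major version and an equal or higher minor version, and
compute_XY PTX as usable on any equal or newer device.

```python
import re


def parse_arch(arch):
    """Split 'sm_86' or 'compute_80' into (kind, major, minor), else None."""
    match = re.fullmatch(r"(sm|compute)_(\d+)\w*", arch)
    if match is None:
        return None
    digits = match.group(2)
    if len(digits) < 2:
        return None
    # The last digit is the minor version, the rest is the major version.
    return match.group(1), int(digits[:-1]), int(digits[-1])


def arch_supported(capability, arch_list):
    """Return True if any entry in arch_list can run on the given capability."""
    major, minor = capability
    for arch in arch_list:
        parsed = parse_arch(arch)
        if parsed is None:
            continue
        kind, a_major, a_minor = parsed
        # Compiled binaries: same major version, minor not newer than the device.
        if kind == "sm" and a_major == major and a_minor <= minor:
            return True
        # PTX: can be JIT-compiled for any equal or newer device.
        if kind == "compute" and (a_major, a_minor) <= (major, minor):
            return True
    return False


if __name__ == "__main__":
    import torch  # assumption: PyTorch is installed in the active environment

    if torch.cuda.is_available():
        capability = torch.cuda.get_device_capability(0)
        archs = torch.cuda.get_arch_list()
        print("device capability:", capability)
        print("build architectures:", archs)
        print("supported:", arch_supported(capability, archs))
    else:
        print("CUDA is not available to this PyTorch build")
```

If the check reports no supported architecture, installing a framework build
compiled against a newer CUDA toolkit is the usual thing to try before
changing anything on the server itself.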
>>> On Wed, Dec 15, 2021 at 10:26 PM Predrag Punosevac <
>>> predragp at andrew.cmu.edu> wrote:
>>>
>>>> Dear Autonians,
>>>>
>>>> I just finished provisioning two new GPU nodes. The purchase was
>>>> approved by Dr. Schneider in July, but the order was not placed until late
>>>> August due to CMU internal issues, just in time to be affected by supply
>>>> chain disruptions. The servers were finally shipped on 11/24/2021
>>>> and received last Wednesday, 12/8/2021. To add insult to injury, the
>>>> nodes were not tagged until Monday afternoon. I literally had to
>>>> hunt down people to do the work.
>>>> I spent half a day yesterday getting power cables and other misc
>>>> supplies, so the nodes were only finished today. However, I think they
>>>> are definitely worth the trouble.
>>>>
>>>> Each server comes with 8 NVIDIA RTX A6000 GPUs connected by NVIDIA's
>>>> high-speed GPU interconnect (NVLink) in addition to PCIe. Each server has
>>>> 2 AMD EPYC 7502 32-core processors, for a total of 128 threads per server.
>>>> These CPUs are almost as fast as your desktop processors (3.5 GHz).
>>>> Each server has 512 GB of RAM and 2 TB of scratch. These servers have
>>>> 24 2.5" HDD bays, so they could potentially be used as storage space. I
>>>> don't have 2.5" HDDs in the lab right now to populate the bays.
>>>>
>>>> There is one thing that is, for now, done suboptimally. Namely, the
>>>> servers were shipped with a 1 Gb/s copper NIC and a 10 Gb/s fiber-optic
>>>> NIC. I could not locate long enough optical cables in our lab yesterday,
>>>> but I will try to address this issue soon. I have exactly 2 optical
>>>> connectors on the switch, so it is down to cabling.
>>>>
>>>> Have fun, and sorry for the long delay.
>>>>
>>>> Predrag
>>>>
>>>