gpu24 and gpu25 added to the cluster

Fri Dec 17 11:12:55 EST 2021

Hi Predrag,

This should be sitting in the scratch. Let me know if there are any issues.

Cheers,
Viraj

On Fri, Dec 17, 2021 at 9:38 AM Predrag Punosevac <predragp at andrew.cmu.edu>
wrote:

> Yes. Please get me 5 RPS for RHEL 8.1 and put them in your scratch on
> GPU24. Make sure they are for 64-bit AMD/Intel. They also have them for
> the ARM and Power architectures.
>
> On Fri, Dec 17, 2021, 10:31 AM Viraj Mehta <virajm at cs.cmu.edu> wrote:
>
>> Hey Predrag,
>>
>> I can get it for you. Out of the options listed in the attached image,
>> which one would make sense to install? I was thinking the RHEL x86 version
>> would be most appropriate.
>>
>> Best,
>> Viraj
>>
>>
>>
>> On Fri, Dec 17, 2021 at 9:10 AM Predrag Punosevac <
>> predragp at andrew.cmu.edu> wrote:
>>
>>> It is not installed right now. It is proprietary software and I have to
>>> locate my NVIDIA developer credentials to get RPS. If someone can download
>>> it quickly for me I will install it.
>>>
>>> On Fri, Dec 17, 2021, 9:11 AM Ifigeneia Apostolopoulou <
>>> iapostol at andrew.cmu.edu> wrote:
>>>
>>>> Hello Predrag,
>>>>
>>>> Could you also please provide the cuDNN version? I couldn't find
>>>> cudnn.h in /usr/include, /usr/local/cuda/include,
>>>> /usr/local/cuda-11/include, or /usr/local/cuda-11.5/include.
>>>>
>>>> thanks!
>>>>
>>>>
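One way to answer the cuDNN question above is to scan the include
directories for the version macros. The sketch below is only an
illustration and is not from this thread: find_cudnn_version is a
hypothetical helper, and it assumes the header defines CUDNN_MAJOR,
CUDNN_MINOR and CUDNN_PATCHLEVEL (cuDNN 8 keeps them in cudnn_version.h,
older releases in cudnn.h). The directory list is the one quoted above.

```python
import re
from pathlib import Path


def find_cudnn_version(include_dirs):
    """Return (header_path, (major, minor, patch)) for the first cuDNN
    header that defines all three version macros, or None if none does.

    Hypothetical helper for illustration only.
    """
    macros = ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL")
    for directory in include_dirs:
        # cuDNN 8 moved the macros to cudnn_version.h; older releases
        # define them directly in cudnn.h.
        for name in ("cudnn_version.h", "cudnn.h"):
            header = Path(directory) / name
            if not header.is_file():
                continue
            text = header.read_text(errors="ignore")
            parts = []
            for macro in macros:
                pattern = rf"^\s*#\s*define\s+{macro}\s+(\d+)"
                match = re.search(pattern, text, re.MULTILINE)
                if match is None:
                    break  # this header lacks a macro, try the next one
                parts.append(int(match.group(1)))
            else:
                return str(header), tuple(parts)
    return None


if __name__ == "__main__":
    # Directories taken from the message above.
    dirs = [
        "/usr/include",
        "/usr/local/cuda/include",
        "/usr/local/cuda-11/include",
        "/usr/local/cuda-11.5/include",
    ]
    found = find_cudnn_version(dirs)
    if found is None:
        print("No cuDNN header with version macros found")
    else:
        print("Found %s, version %d.%d.%d" % (found[0], *found[1]))
```

A None result matches what is reported in this thread: no cuDNN headers
are present in those directories.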
>>>> On Thu, Dec 16, 2021 at 1:27 PM Predrag Punosevac <
>>>> predragp at andrew.cmu.edu> wrote:
>>>>
>>>>> Just to add to this info. The installed version of CUDA is
>>>>>
>>>>> cuda-11.5.1-1.x86_64
>>>>>
>>>>> We already have a bunch of servers using CUDA 11.1 but perhaps nothing
>>>>> newer than 11.3. Rolling back to the EOL CUDA 10 is an option of
>>>>> last resort.
>>>>>
>>>>> I installed /opt/miniconda-py39
>>>>>
>>>>> which is Python 3.9.5. Most older servers run the Python 3.8 branch
>>>>> or even the 3.7 branch.
>>>>>
>>>>> I would like everyone to keep in mind that the OS packaging problem is
>>>>> NP-hard, so rolling things back to some "sweet spot" might be a
>>>>> prohibitively expensive approach.
>>>>>
>>>>> Predrag
>>>>>
>>>>> On Thu, Dec 16, 2021 at 12:40 PM Ifigeneia Apostolopoulou <
>>>>> iapostol at andrew.cmu.edu> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> Has anyone tried to test the new servers?
>>>>>>
>>>>>> I have not managed to run either PyTorch or TensorFlow processes. I
>>>>>> am getting the following errors:
>>>>>>
>>>>>> tensorflow: CUDA runtime implicit initialization on GPU:0 failed.
>>>>>> Status: device kernel image is invalid
>>>>>>
>>>>>> pytorch: RuntimeError: CUDA error: no kernel image is available for
>>>>>> execution on the device
>>>>>>
>>>>>> I am not sure whether this is a CUDA installation issue or an
>>>>>> incompatibility. However, I am seeing the problem with both PyTorch
>>>>>> and TensorFlow processes that run fine on the rest of the servers.
>>>>>>
>>>>>> thanks!
>>>>>>
>>>>>>
>>>>>>
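One common cause of the "no kernel image" errors quoted above is a
framework build that was not compiled for the GPU's compute capability.
The RTX A6000 is compute capability 8.6, so a build whose newest target
is sm_75 has nothing to run on it. The thread does not confirm that this
is the cause here, so treat the sketch below as a diagnostic aid under
that assumption. device_supported is a hypothetical helper; the only
PyTorch calls used are torch.cuda.is_available, get_device_capability
and get_arch_list.

```python
import re


def device_supported(capability, arch_list):
    """Return True if a build targeting arch_list can run on a device
    with the given (major, minor) compute capability.

    Hypothetical helper for illustration. Rules applied:
    - sm_XY binaries run on devices with the same major version and a
      minor version >= Y;
    - compute_XY PTX can be JIT-compiled on any device >= X.Y.
    """
    cap = tuple(capability)
    for arch in arch_list:
        match = re.fullmatch(r"(sm|compute)_(\d+)[a-z]?", arch)
        if match is None:
            continue  # skip entries this sketch does not understand
        major, minor = divmod(int(match.group(2)), 10)
        if match.group(1) == "sm":
            if major == cap[0] and minor <= cap[1]:
                return True
        elif (major, minor) <= cap:
            return True
    return False


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None
    if torch is not None and torch.cuda.is_available():
        cap = torch.cuda.get_device_capability(0)
        archs = torch.cuda.get_arch_list()
        print("device capability:", cap)
        print("build targets:", archs)
        print("supported:", device_supported(cap, archs))
    else:
        print("PyTorch with a visible CUDA device is not available")
```

If this reports an unsupported device, a framework build made against
CUDA 11.1 or newer is the usual remedy, since those include sm_86.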
>>>>>> On Wed, Dec 15, 2021 at 10:26 PM Predrag Punosevac <
>>>>>> predragp at andrew.cmu.edu> wrote:
>>>>>>
>>>>>>> Dear Autonians,
>>>>>>>
>>>>>>> I just finished provisioning two new GPU nodes. The purchase was
>>>>>>> approved by Dr. Schneider in July but the order was not placed until late
>>>>>>> August due to CMU internal issues just in time to be affected by supply
>>>>>>> chain disruption. The servers were finally shipped on 11/24/2021
>>>>>>> and received last Wednesday 12/8/2021. To add insult to injury,
>>>>>>> the nodes were not tagged until Monday afternoon. I literally had
>>>>>>> to hunt down people to do the work.
>>>>>>> I spent half a day yesterday getting power cables and other misc
>>>>>>> supplies. Thus they are only done today. However, I think they are
>>>>>>> definitely worth the trouble.
>>>>>>>
>>>>>>> Each server comes with 8 NVIDIA RTX A6000 GPUs connected by NVIDIA's
>>>>>>> high-speed GPU interconnect links in addition to PCIe. Each server has
>>>>>>> 2 AMD EPYC 7502 32-Core Processors for a total of 128 threads per
>>>>>>> server. These CPUs are almost as fast as your desktop processors (3.5 GHz).
>>>>>>> Each server has 512GB of RAM and 2TB of scratch. These servers have
>>>>>>> 24 2.5" HDD bays so they could potentially be used as storage space. I
>>>>>>> don't have 2.5" HDDs in the lab right now to populate the bays.
>>>>>>>
>>>>>>> One thing is suboptimal for now. The servers were shipped with a
>>>>>>> 1 Gb/s copper NIC and a 10 Gb/s fiber-optic NIC. I could not locate
>>>>>>> long enough optical cables in our lab yesterday but I will try to
>>>>>>> address this issue soon. I have exactly 2 optical connectors on the
>>>>>>> switch so it is down to cabling.
>>>>>>>
>>>>>>> Have fun and sorry for the long delay.
>>>>>>>
>>>>>>> Predrag
>>>>>>>
>>>>>>