gpu24 and gpu25 added to the cluster
Predrag Punosevac
predragp at andrew.cmu.edu
Fri Dec 17 13:58:14 EST 2021
We'll try to use cuDNN. Maybe it is OK.
On Fri, Dec 17, 2021, 1:53 PM Viraj Mehta <virajm at cs.cmu.edu> wrote:
> Hmm, I don't see those available on the NVIDIA site. That said, I tested
> my code, which runs with CUDA/cuDNN on the other GPU machines, and it
> doesn't find these GPUs. So perhaps I missed something; I'll look around.
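>
> For reference, a quick check along these lines (just a sketch, assuming a
> PyTorch build is already in the environment) shows whether the runtime
> sees the new cards at all:
>
>     import torch
>
>     # Does the CUDA runtime initialize and enumerate the A6000s?
>     print("cuda available:", torch.cuda.is_available())
>     print("device count:", torch.cuda.device_count())
>     for i in range(torch.cuda.device_count()):
>         name = torch.cuda.get_device_name(i)
>         cap = torch.cuda.get_device_capability(i)  # A6000 should be (8, 6)
>         print(f"GPU {i}: {name}, compute capability {cap}")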
>
> Viraj
>
> On Fri, Dec 17, 2021 at 12:41 PM Predrag Punosevac <
> predragp at andrew.cmu.edu> wrote:
>
>> I just installed the RPM you provided. However, I am not sure this is
>> the correct RPM. In the past I used to install something like
>>
>>
>>
>> libcudnn8-8.1.1.33-1.cuda11.2.x86_64.rpm
>> libcudnn8-devel-8.1.1.33-1.cuda11.2.x86_64.rpm
>> libcudnn8-samples-8.1.1.33-1.cuda11.2.x86_64.rpm
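>>
>> Once the right package is in, a quick way to confirm the library actually
>> landed on the loader path (a small sketch, assuming the RPM installs
>> libcudnn.so.8 into a standard location) is to call cudnnGetVersion()
>> directly:
>>
>>     import ctypes
>>
>>     # Load the shared library shipped by the libcudnn8 RPM and ask it
>>     # for its version number (e.g. 8101 for cuDNN 8.1.1).
>>     cudnn = ctypes.CDLL("libcudnn.so.8")
>>     print("cuDNN version:", cudnn.cudnnGetVersion())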
>>
>> On Fri, Dec 17, 2021 at 11:13 AM Viraj Mehta <virajm at cs.cmu.edu> wrote:
>>
>>> Hi Predrag,
>>>
>>> This should be sitting in the scratch. Let me know if there are any
>>> issues.
>>>
>>> Cheers,
>>> Viraj
>>>
>>> On Fri, Dec 17, 2021 at 9:38 AM Predrag Punosevac <
>>> predragp at andrew.cmu.edu> wrote:
>>>
>>>> Yes. Please get me the 5 RPMs for RHEL 8.1 and put them in your scratch
>>>> on GPU24. Make sure they are for 64-bit AMD/Intel; they also have them
>>>> for the ARM and Power architectures.
>>>>
>>>> On Fri, Dec 17, 2021, 10:31 AM Viraj Mehta <virajm at cs.cmu.edu> wrote:
>>>>
>>>>> Hey Predrag,
>>>>>
>>>>> I can get it for you. Out of the options listed in the attached image,
>>>>> which one would make sense to install? I was thinking the RHEL x86 version
>>>>> would be most appropriate.
>>>>>
>>>>> Best,
>>>>> Viraj
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Dec 17, 2021 at 9:10 AM Predrag Punosevac <
>>>>> predragp at andrew.cmu.edu> wrote:
>>>>>
>>>>>> It is not installed right now. It is proprietary software and I have
>>>>>> to locate my NVIDIA developer credentials to get the RPMs. If someone
>>>>>> can download them quickly for me, I will install them.
>>>>>>
>>>>>> On Fri, Dec 17, 2021, 9:11 AM Ifigeneia Apostolopoulou <
>>>>>> iapostol at andrew.cmu.edu> wrote:
>>>>>>
>>>>>>> Hello Predrag,
>>>>>>>
>>>>>>> Could you also please provide the cuDNN version? I couldn't find
>>>>>>> cudnn.h in /usr/include, /usr/local/cuda/include,
>>>>>>> /usr/local/cuda-11/include, or /usr/local/cuda-11.5/include.
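>>>>>>>
>>>>>>> For what it's worth, a sketch like the one below (assuming the usual
>>>>>>> install prefixes, and a PyTorch build in the environment for the
>>>>>>> second check) lists whatever cuDNN headers are present, or asks the
>>>>>>> framework which cuDNN it was built against:
>>>>>>>
>>>>>>>     import glob
>>>>>>>
>>>>>>>     # cuDNN 8 moved the version defines from cudnn.h into
>>>>>>>     # cudnn_version.h, so match both header names.
>>>>>>>     for prefix in ("/usr/include", "/usr/local/cuda/include",
>>>>>>>                    "/usr/local/cuda-11.5/include"):
>>>>>>>         for header in glob.glob(f"{prefix}/cudnn*.h"):
>>>>>>>             print(header)
>>>>>>>
>>>>>>>     import torch
>>>>>>>     print("cuDNN seen by torch:", torch.backends.cudnn.version())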
>>>>>>>
>>>>>>> thanks!
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Dec 16, 2021 at 1:27 PM Predrag Punosevac <
>>>>>>> predragp at andrew.cmu.edu> wrote:
>>>>>>>
>>>>>>>> Just to add to this info: the installed version of CUDA is
>>>>>>>>
>>>>>>>> cuda-11.5.1-1.x86_64
>>>>>>>>
>>>>>>>> We already have a bunch of servers using CUDA 11.1, but perhaps
>>>>>>>> nothing newer than 11.3. Rolling back to the EOL CUDA 10 is an
>>>>>>>> option of last resort.
>>>>>>>>
>>>>>>>> I installed /opt/miniconda-py39
>>>>>>>>
>>>>>>>> which is Python 3.9.5. Most older servers run the Python 3.8 branch
>>>>>>>> or even the 3.7 branch.
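>>>>>>>>
>>>>>>>> To check which of these versions a given job actually picks up, a
>>>>>>>> small sketch along these lines (assuming /opt/miniconda-py39 is the
>>>>>>>> active interpreter and PyTorch is installed in it) prints the
>>>>>>>> relevant version strings:
>>>>>>>>
>>>>>>>>     import sys
>>>>>>>>     import torch
>>>>>>>>
>>>>>>>>     print("python:", sys.version.split()[0])      # 3.9.5 expected
>>>>>>>>     print("torch:", torch.__version__)
>>>>>>>>     print("torch built with CUDA:", torch.version.cuda)
>>>>>>>>     print("torch built with cuDNN:", torch.backends.cudnn.version())
>>>>>>>>     # The wheel's CUDA version should be compatible with the
>>>>>>>>     # system's cuda-11.5.1 toolkit and driver.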
>>>>>>>>
>>>>>>>> I would like everyone to keep in mind that the OS packaging problem
>>>>>>>> is NP-hard, so rolling things back to some "sweet spot" might be a
>>>>>>>> prohibitively expensive approach.
>>>>>>>>
>>>>>>>> Predrag
>>>>>>>>
>>>>>>>> On Thu, Dec 16, 2021 at 12:40 PM Ifigeneia Apostolopoulou <
>>>>>>>> iapostol at andrew.cmu.edu> wrote:
>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> Has anyone tried to test the new servers?
>>>>>>>>>
>>>>>>>>> I have not managed to run either PyTorch or TensorFlow processes.
>>>>>>>>> I am getting the following errors:
>>>>>>>>>
>>>>>>>>> tensorflow: CUDA runtime implicit initialization on GPU:0 failed.
>>>>>>>>> Status: device kernel image is invalid
>>>>>>>>>
>>>>>>>>> pytorch: RuntimeError: CUDA error: no kernel image is available
>>>>>>>>> for execution on the device
>>>>>>>>>
>>>>>>>>> I am not sure whether this is a CUDA installation issue or an
>>>>>>>>> incompatibility (however, I am hitting the problem with both
>>>>>>>>> PyTorch and TensorFlow processes that run fine on the rest of the
>>>>>>>>> servers).
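>>>>>>>>>
>>>>>>>>> If it turns out to be a build/compatibility issue, these messages
>>>>>>>>> typically mean the installed wheels do not ship kernels for these
>>>>>>>>> GPUs' compute capability (the RTX A6000 is sm_86). A quick
>>>>>>>>> comparison, as a sketch assuming PyTorch:
>>>>>>>>>
>>>>>>>>>     import torch
>>>>>>>>>
>>>>>>>>>     # The A6000s report capability (8, 6); the wheel must include
>>>>>>>>>     # sm_86 (or a PTX fallback) or "no kernel image" is raised.
>>>>>>>>>     print("device capability:",
>>>>>>>>>           torch.cuda.get_device_capability(0))
>>>>>>>>>     print("wheel arch list:", torch.cuda.get_arch_list())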
>>>>>>>>>
>>>>>>>>> thanks!
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Dec 15, 2021 at 10:26 PM Predrag Punosevac <
>>>>>>>>> predragp at andrew.cmu.edu> wrote:
>>>>>>>>>
>>>>>>>>>> Dear Autonians,
>>>>>>>>>>
>>>>>>>>>> I just finished provisioning two new GPU nodes. The purchase was
>>>>>>>>>> approved by Dr. Schneider in July, but due to CMU internal issues
>>>>>>>>>> the order was not placed until late August, just in time to be
>>>>>>>>>> affected by the supply-chain disruption. The servers finally
>>>>>>>>>> shipped on 11/24/2021 and were received last Wednesday, 12/8/2021.
>>>>>>>>>> To add insult to injury, the nodes were not tagged until Monday
>>>>>>>>>> afternoon; I literally had to hunt down people to do the work.
>>>>>>>>>> I spent half a day yesterday getting power cables and other
>>>>>>>>>> miscellaneous supplies, so they are only done today. However, I
>>>>>>>>>> think they are definitely worth the trouble.
>>>>>>>>>>
>>>>>>>>>> Each server comes with 8 NVIDIA RTX A6000 GPUs connected by
>>>>>>>>>> NVIDIA's high-speed NVLink GPU interconnect in addition to PCIe.
>>>>>>>>>> Each server has 2 AMD EPYC 7502 32-core processors, for a total of
>>>>>>>>>> 128 threads per server. These CPUs are almost as fast as your
>>>>>>>>>> desktop processors, at about 3.5 GHz.
>>>>>>>>>> Each server has 512 GB of RAM and 2 TB of scratch. These servers
>>>>>>>>>> have 24 2.5" HDD bays, so they could potentially be used for
>>>>>>>>>> storage as well. I don't have 2.5" HDDs in the lab right now to
>>>>>>>>>> populate the bays.
>>>>>>>>>>
>>>>>>>>>> There is one thing which is done suboptimally for now. Namely, the
>>>>>>>>>> servers shipped with a 1 Gb/s copper NIC and a 10 Gb/s fiber-optic
>>>>>>>>>> NIC. I could not locate long enough optical cables in our lab
>>>>>>>>>> yesterday, but I will try to address this soon. I have exactly 2
>>>>>>>>>> optical connectors on the switch, so it comes down to cabling.
>>>>>>>>>>
>>>>>>>>>> Have fun, and sorry for the long delay.
>>>>>>>>>>
>>>>>>>>>> Predrag
>>>>>>>>>>
>>>>>>>>>