gpu24 and gpu25 added to the cluster
Ifigeneia Apostolopoulou
iapostol at andrew.cmu.edu
Wed Dec 22 08:13:33 EST 2021
Hi all,
I am still getting: tensorflow.python.framework.errors_impl.InternalError:
CUDA runtime implicit initialization on GPU:0 failed. Status: device kernel
image is invalid
(possibly due to the missing cuDNN?)
Here is the cuDNN version compatible with CUDA 11.5:
https://developer.nvidia.com/rdp/cudnn-archive
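In case it helps, here is a minimal sketch for checking whether any cuDNN
shared library is visible to the dynamic linker at all (assuming Python from
/opt/miniconda-py39; the soname libcudnn.so.8 is an assumption based on the
cuDNN 8.x series that matches CUDA 11.5):

    import ctypes

    try:
        # Ask the dynamic linker for the cuDNN 8.x shared library
        # (the soname is an assumption; adjust for another major version).
        ctypes.CDLL("libcudnn.so.8")
        print("found libcudnn.so.8 on the library path")
    except OSError as err:
        print("cuDNN not found on the library path:", err)

Note that pip/conda wheels often bundle their own cuDNN, so the frameworks may
behave differently from what this system-level check suggests.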
Another thing that worries me is that although CUDA 11.5 is installed,
TensorFlow is loading some libraries from the 10.x series. There seems to be
a hybrid cuda-11 / cuda-10 installation that may be clashing with the
TensorFlow/PyTorch libraries:
2021-12-22 08:01:29.540914: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-12-22 08:01:30.059368: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2021-12-22 08:01:30.731369: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2021-12-22 08:01:30.838834: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2021-12-22 08:01:31.162863: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2021-12-22 08:01:31.286939: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
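For reference, here is a small sketch (assuming TensorFlow 2.x and PyTorch are
installed in the /opt/miniconda-py39 environment; the exact keys and APIs may
vary by version) that prints which CUDA toolkit each framework was built
against, which should make the 10.x vs. 11.5 mismatch explicit:

    import tensorflow as tf
    import torch

    # CUDA/cuDNN versions the installed TensorFlow wheel was compiled against
    build = tf.sysconfig.get_build_info()
    print("TF built for CUDA", build.get("cuda_version"),
          "cuDNN", build.get("cudnn_version"))

    # CUDA version the installed PyTorch wheel was compiled against
    print("PyTorch built for CUDA", torch.version.cuda)

    # Compute capabilities the PyTorch build ships kernels for; a
    # "no kernel image is available" error usually means the GPU's
    # capability is missing from this list.
    print("PyTorch arch list:", torch.cuda.get_arch_list())
    if torch.cuda.is_available():
        print("GPU capability:", torch.cuda.get_device_capability(0))

If the frameworks report a 10.x build, the fix is probably to reinstall them
against CUDA 11.x rather than (or in addition to) adding cuDNN.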
On Fri, Dec 17, 2021 at 1:59 PM Predrag Punosevac <predragp at andrew.cmu.edu>
wrote:
> We'll try to use CUDNN. Maybe it is ok
>
> On Fri, Dec 17, 2021, 1:53 PM Viraj Mehta <virajm at cs.cmu.edu> wrote:
>
>> Hmm, I don't see those available on the NVIDIA site. That being said, I
>> tested my code that runs using CUDA/cuDNN on other GPU machines, and it
>> doesn't find these GPUs. So perhaps I missed something; I'll look around.
>>
>> Viraj
>>
>> On Fri, Dec 17, 2021 at 12:41 PM Predrag Punosevac <
>> predragp at andrew.cmu.edu> wrote:
>>
>>> I just installed the RPM you provided. However, I am not sure that
>>> this is the correct RPM. In the past, I used to install something like
>>>
>>>
>>>
>>> libcudnn8-8.1.1.33-1.cuda11.2.x86_64.rpm
>>> libcudnn8-devel-8.1.1.33-1.cuda11.2.x86_64.rpm
>>> libcudnn8-samples-8.1.1.33-1.cuda11.2.x86_64.rpm
>>>
>>> On Fri, Dec 17, 2021 at 11:13 AM Viraj Mehta <virajm at cs.cmu.edu> wrote:
>>>
>>>> Hi Predrag,
>>>>
>>>> This should be sitting in the scratch. Let me know if there are any
>>>> issues.
>>>>
>>>> Cheers,
>>>> Viraj
>>>>
>>>> On Fri, Dec 17, 2021 at 9:38 AM Predrag Punosevac <
>>>> predragp at andrew.cmu.edu> wrote:
>>>>
>>>>> Yes. Please get me 5 RPMs for RHEL 8.1 and put them in your scratch
>>>>> on GPU24. Make sure they are for 64-bit AMD/Intel. They also have them
>>>>> for ARM and Power architectures.
>>>>>
>>>>> On Fri, Dec 17, 2021, 10:31 AM Viraj Mehta <virajm at cs.cmu.edu> wrote:
>>>>>
>>>>>> Hey Predrag,
>>>>>>
>>>>>> I can get it for you. Out of the options listed in the attached
>>>>>> image, which one would make sense to install? I was thinking the RHEL x86
>>>>>> version would be most appropriate.
>>>>>>
>>>>>> Best,
>>>>>> Viraj
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Dec 17, 2021 at 9:10 AM Predrag Punosevac <
>>>>>> predragp at andrew.cmu.edu> wrote:
>>>>>>
>>>>>>> It is not installed right now. It is proprietary software, and I have
>>>>>>> to locate my NVIDIA developer credentials to get the RPMs. If someone
>>>>>>> can download it quickly for me, I will install it.
>>>>>>>
>>>>>>> On Fri, Dec 17, 2021, 9:11 AM Ifigeneia Apostolopoulou <
>>>>>>> iapostol at andrew.cmu.edu> wrote:
>>>>>>>
>>>>>>>> Hello Predrag,
>>>>>>>>
>>>>>>>> Could you also please provide the cuDNN version? I couldn't find
>>>>>>>> cudnn.h in /usr/include, /usr/local/cuda/include,
>>>>>>>> /usr/local/cuda-11/include, or /usr/local/cuda-11.5/include.
>>>>>>>>
>>>>>>>> thanks!
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Dec 16, 2021 at 1:27 PM Predrag Punosevac <
>>>>>>>> predragp at andrew.cmu.edu> wrote:
>>>>>>>>
>>>>>>>>> Just to add to this info. The installed version of CUDA is
>>>>>>>>>
>>>>>>>>> cuda-11.5.1-1.x86_64
>>>>>>>>>
>>>>>>>>> We already have a bunch of servers using CUDA 11.1, but perhaps
>>>>>>>>> nothing newer than 11.3. Rolling back to the EOL CUDA 10 is an
>>>>>>>>> option of last resort.
>>>>>>>>>
>>>>>>>>> I installed /opt/miniconda-py39
>>>>>>>>>
>>>>>>>>> which is Python 3.9.5. Most older servers run Python 3.8 branch or
>>>>>>>>> even 3.7 branch.
>>>>>>>>>
>>>>>>>>> I would like everyone to keep in mind that the OS packaging
>>>>>>>>> problem is NP-hard, so rolling things back to some "sweet spot"
>>>>>>>>> might be a prohibitively expensive approach.
>>>>>>>>>
>>>>>>>>> Predrag
>>>>>>>>>
>>>>>>>>> On Thu, Dec 16, 2021 at 12:40 PM Ifigeneia Apostolopoulou <
>>>>>>>>> iapostol at andrew.cmu.edu> wrote:
>>>>>>>>>
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>> Has anyone tried to test the new servers?
>>>>>>>>>>
>>>>>>>>>> I have not managed to run either PyTorch or TensorFlow
>>>>>>>>>> processes. I am getting the following errors:
>>>>>>>>>>
>>>>>>>>>> tensorflow: CUDA runtime implicit initialization on GPU:0 failed.
>>>>>>>>>> Status: device kernel image is invalid
>>>>>>>>>>
>>>>>>>>>> pytorch: RuntimeError: CUDA error: no kernel image is available
>>>>>>>>>> for execution on the device
>>>>>>>>>>
>>>>>>>>>> I am not sure whether this is a CUDA installation issue or an
>>>>>>>>>> incompatibility (however, I am seeing the problem with both PyTorch
>>>>>>>>>> and TensorFlow processes that run fine on the rest of the servers).
>>>>>>>>>>
>>>>>>>>>> thanks!
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Dec 15, 2021 at 10:26 PM Predrag Punosevac <
>>>>>>>>>> predragp at andrew.cmu.edu> wrote:
>>>>>>>>>>
>>>>>>>>>>> Dear Autonians,
>>>>>>>>>>>
>>>>>>>>>>> I just finished provisioning two new GPU nodes. The purchase was
>>>>>>>>>>> approved by Dr. Schneider in July, but due to CMU internal issues
>>>>>>>>>>> the order was not placed until late August, just in time to be
>>>>>>>>>>> affected by supply chain disruptions. The servers were finally
>>>>>>>>>>> shipped on 11/24/2021 and received last Wednesday, 12/8/2021. To
>>>>>>>>>>> add insult to injury, the nodes were not tagged until Monday
>>>>>>>>>>> afternoon; I literally had to hunt down people to do the work.
>>>>>>>>>>> I spent half a day yesterday getting power cables and other misc
>>>>>>>>>>> supplies. Thus they were only finished today. However, I think
>>>>>>>>>>> they are definitely worth the trouble.
>>>>>>>>>>>
>>>>>>>>>>> Each server comes with 8 NVIDIA RTX A6000 GPUs connected by a
>>>>>>>>>>> high-speed NVIDIA GPU interconnect in addition to PCIe. Each
>>>>>>>>>>> server has 2 AMD EPYC 7502 32-core processors, for a total of 128
>>>>>>>>>>> threads per server. These CPUs are almost as fast as your desktop
>>>>>>>>>>> processors, at 3.5 GHz. Each server has 512GB of RAM and 2TB of
>>>>>>>>>>> scratch. These servers have 24 2.5" HDD bays, so they could
>>>>>>>>>>> potentially be used as storage space. I don't have 2.5" HDDs in
>>>>>>>>>>> the lab right now to populate the bays.
>>>>>>>>>>>
>>>>>>>>>>> There is one thing which is, for now, done suboptimally. Namely,
>>>>>>>>>>> the servers were shipped with a 1 Gbps copper NIC and a 10 Gbps
>>>>>>>>>>> fiber-optic NIC. I could not locate long enough optical cables in
>>>>>>>>>>> our lab yesterday, but I will try to address this issue soon. I
>>>>>>>>>>> have exactly 2 optical connectors on the switch, so it is down to
>>>>>>>>>>> cabling.
>>>>>>>>>>>
>>>>>>>>>>> Have fun, and sorry for the long delay.
>>>>>>>>>>>
>>>>>>>>>>> Predrag
>>>>>>>>>>>
>>>>>>>>>>