No GPU drivers detected on any gpu machine?

Predrag Punosevac predragp at andrew.cmu.edu
Thu Aug 1 12:29:18 EDT 2019


Could you be a little bit more precise? Are you using pip from
/opt/rh/python-36 or /opt/minconda3/python37 or from the base?
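
One way to check is a minimal sketch like the one below, run from the same
shell where pip was invoked. The function name describe_python_env is made up
for illustration and is not part of any lab tooling. It only reports where the
running interpreter lives and which pip is first on PATH. Note that a conda
environment will not be flagged as a virtualenv here, although its path will
still show up under "prefix".

```python
import shutil
import sys


def describe_python_env():
    """Return where the running interpreter and the first `pip` on PATH live."""
    # venv sets sys.base_prefix to the parent install; older virtualenv
    # releases set sys.real_prefix instead. Either one indicates an
    # activated virtual environment.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    in_virtualenv = hasattr(sys, "real_prefix") or sys.prefix != base_prefix
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        "in_virtualenv": in_virtualenv,
        "pip_on_path": shutil.which("pip"),  # None if no pip is on PATH
    }


if __name__ == "__main__":
    for key, value in describe_python_env().items():
        print("{}: {}".format(key, value))
```

If "pip_on_path" and "executable" point into different directory trees, the
pip that ran the install probably belongs to a different Python than the one
being used for the import.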

Predrag
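
For the sanity check quoted below, a slightly fuller sketch is possible: first
confirm that nvidia-smi can reach the driver, then confirm that tensorflow
imports at all. The helper names driver_responds and tensorflow_version are
hypothetical, and the sketch assumes that nvidia-smi exits with a non-zero
status when it cannot communicate with the driver.

```python
import shutil
import subprocess


def driver_responds(timeout_s=30):
    """Return True if `nvidia-smi` is on PATH and exits with status 0."""
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        result = subprocess.run(
            ["nvidia-smi"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    # Assumption: a driver communication failure gives a non-zero exit status.
    return result.returncode == 0


def tensorflow_version():
    """Return the installed TensorFlow version, or None if import fails."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return tf.__version__


if __name__ == "__main__":
    print("driver responds:", driver_responds())
    print("tensorflow version:", tensorflow_version())
```

A True from the first check together with a version string from the second
only shows that both pieces are present. It does not show that this TensorFlow
build matches the installed CUDA version.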

On Thu, Aug 1, 2019 at 11:53 AM Sarveshwaran Jayaraman <
sarveshj at andrew.cmu.edu> wrote:

> Hi Yusha,
>
>
> I was able to install Tensorflow-gpu version 1.14.0 using the following
> command
>
>
> (note $: refers to the shell prompt)
>
> $: source <your_virtual_environment>
>
> $: pip install tensorflow-gpu
>
>
> # sanity check in python shell
>
> $: python
>
> >>> import tensorflow as tf
>
> >>> tf.__version__ # should give you the installed version
>
>
>
> Please let me know if these commands work for you. If not, please feel
> free to get in touch with me. Thanks!
>
>
>
>
>
> Sarvesh Jayaraman <https://www.linkedin.com/in/sarveshjayaraman/>
> Sr. Research Analyst, Auton Lab
> Carnegie Mellon University
> Mob: +1-240-893-4287
>
> ------------------------------
> *From:* Autonlab-users <autonlab-users-bounces at autonlab.org> on behalf of
> Yusha Liu <yushal at andrew.cmu.edu>
> *Sent:* Thursday, August 1, 2019 11:26:48 AM
> *To:* users at autonlab.org
> *Subject:* Re: No GPU drivers detected on any gpu machine?
>
> Hi all,
>
> Could anyone give me a guide on how to install tensorflow (<2.0 beta)
> compatible with CUDA 10.1 on gpus? I haven't succeeded at that. Thanks and
> sorry for the overhead.
>
> Yours,
> Yusha
>
>
>
>
>
> On Wed, Jul 24, 2019 at 10:16 PM Predrag Punosevac <
> predragp at andrew.cmu.edu> wrote:
>
>> Predrag Punosevac <predragp at andrew.cmu.edu> wrote:
>>
>> I apologize for top posting. Just a quick update. As of 5 minutes ago
>> machines gpu[2-10] appear to have no issues. After all the upgrades and
>> reboots it appears that we don't have any dead GPU cards on them and
>> that the drivers and CUDA 10.1 work as expected. I understand that this is
>> little comfort to people who need to regenerate tensorflow, PyTorch,
>> and all that "deep-learning" stuff but I have no control over the
>> upstream decisions.
>>
>> GPU1 appears to be broken at the moment. Without attaching a console to the
>> machine it is difficult for me to assess the complexity of the problem.
>>
>> Once more, sorry for the downtime.
>>
>> Cheers,
>> Predrag
>>
>>
>>
>>
>>
>>
>> > A quick update on this issue and a resolution. I took a clue from the
>> > fact that GPU10 was working as expected and narrowed down the issue to
>> > the CUDA 9.1 installation. It appears that upstream has purposely broken
>> > CUDA 9.1 via the dkms utility, which is used to recompile kernel modules
>> > to fit a specific kernel release. They probably want people to move to
>> > CUDA 10.1.
>> >
>> > Long story short: I upgraded the NVIDIA driver and CUDA to 10.1 on the
>> > GPU2 and GPU3 servers. They appear to be working flawlessly on my end, as
>> > tested with the nvidia-smi utility as well as MATLAB. I have recreated
>> > the GPU3 scratch directory, which was 100% used for almost half a year. I
>> > have also reinstalled the libcudnn library on both machines but I am
>> > unable to test it.
>> >
>> > This is all good but it also means that people will have to regenerate
>> > their tools from scratch to match the kernel, driver, and CUDA
>> > versions. If you have things on GPU10 you probably could just migrate
>> > them. This is very time consuming but we have no choice.
>> >
>> > The major bad news is that one of the GPU servers I tried to work on,
>> > GPU1 (commissioned almost five years ago), didn't survive the reboot. It
>> > also uses older Tesla K80 cards. I will have to attach a screen and
>> > troubleshoot this machine. That will not happen today or, for that
>> > matter, this week.
>> >
>> > My plan is now to move on and fix machines GPU[4-9], which will take the
>> > rest of the day. Note that GPU7 is designated for a special project and
>> > not generally accessible.
>> >
>> > Most Kind Regards,
>> > Predrag Punosevac
>> >
>> >
>> >
>> >
>> > On Wed, Jul 24, 2019 at 1:09 PM Predrag Punosevac
>> > <predragp at andrew.cmu.edu> wrote:
>> > >
>> > > Thank you so much for bringing this to my attention. GPU10 is not
>> > > broken but sure enough you are right about the other machines. It
>> > > appears that one of the recent updates has broken the driver. I will
>> > > reinstall the drivers shortly and reboot the machines. This is also a
>> > > notice for everyone else that GPU1-9 will have to be rebooted.
>> > >
>> > > Predrag
>> > >
>> > > On Wed, Jul 24, 2019 at 10:52 AM Chufan Gao <chufang at andrew.cmu.edu> wrote:
>> > > >
>> > > > Hi Predrag,
>> > > >
>> > > >
>> > > > I discovered today that when I run nvidia-smi, I get this error:
>> > > >
>> > > >
>> > > > NVIDIA-SMI has failed because it couldn't communicate with the
>> > > > NVIDIA driver. Make sure that the latest NVIDIA driver is installed
>> > > > and running.
>> > > >
>> > > > The same happens for all of the gpu machines that I tried. I am
>> > > > confused - was there an update that broke it?
>> > > >
>> > > > Sincerely,
>> > > > Andy Gao
>>
>
>
> --
> Yusha Liu, Master's Student
> Machine Learning Department
> Carnegie Mellon University
>