GPU3 is "configured"

Predrag Punosevac predragp at imap.srv.cs.cmu.edu
Thu Oct 13 10:44:16 EDT 2016


On 2016-10-12 23:26, Arne Suppe wrote:
> Hmm - I don’t use matlab for deep learning, but gpuDevice also hangs
> on my computer with R2016a.
> 

We would have to escalate this with MathWorks. I have seen workarounds 
on the Internet, but it looks like a bug in one of the MathWorks-provided 
MEX files.
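
One workaround I have seen suggested on the forums (untested here; the 
cache size below is a guess) is to enlarge the CUDA JIT cache before 
starting MATLAB. R2016a ships CUDA 7.5 libraries, so on Pascal cards the 
first gpuDevice call has to JIT-compile every PTX kernel, which can look 
exactly like a hang:

export CUDA_CACHE_MAXSIZE=1073741824   # ~1 GiB JIT cache; value is a guess
matlab -nodesktop                      # first gpuDevice call may still take many minutes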

> I was able to compile the matrixMul example in the CUDA samples and run
> it on gpu3, so I think the build environment is probably all set.
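> 
> (For anyone reproducing this: the sample lives under 0_Simple, so
> roughly
> 
>   cd ~/NVIDIA_CUDA-8.0_Samples/0_Simple/matrixMul
>   make && ./matrixMul
> 
> should be all it takes.)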
> 
> As for OpenGL, I think it's possibly a problem with their build
> script findgl.mk, which is not familiar with Springdale OS.  The
> demo_suite directory has a precompiled nbody binary you may try, but I
> suspect most users will not need graphics.
> 

That should not be too hard to fix; some header files have to be 
manually edited. The funny part is that until 7.2 the Princeton people 
didn't bother to remove the RHEL branding, which actually made things 
easier for us.
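
If editing headers turns out not to be needed: the samples' Makefiles 
are supposed to honor a GLPATH override, so something along these lines 
(untested on gpu3) may be enough to get past findgl.mk:

cd ~/NVIDIA_CUDA-8.0_Samples/5_Simulations/nbody
make GLPATH=/usr/lib64   # point the build at the 64-bit GL libs Springdale actually has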


Doug is trying right now to compile the latest Caffe, TensorFlow, and 
protobuf-3. We will try to create RPMs for those so that we don't have 
to go through this again. I also asked the Princeton and Rutgers guys 
if they have WIP RPMs to share.
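
The packaging itself would just be the standard rpmbuild workflow, 
roughly the sketch below (the spec file name and the protobuf tarball 
version are placeholders; nothing is written yet):

rpmdev-setuptree                              # from rpmdevtools
cp protobuf-3.1.0.tar.gz ~/rpmbuild/SOURCES/  # version hypothetical
rpmbuild -ba ~/rpmbuild/SPECS/protobuf3.spec  # spec still to be written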

Predrag

> Arne
> 
> 
> 
> 
>> On Oct 12, 2016, at 10:23 PM, Predrag Punosevac <predragp at cs.cmu.edu> wrote:
>> 
>> Arne Suppe <suppe at andrew.cmu.edu> wrote:
>> 
>>> Hi Predrag,
>>> Don't know if this applies to you, but I just built a machine with 
>>> a GTX 1080, which has the same Pascal architecture as the Titan.  After 
>>> installing CUDA 8, I still found I needed to install the latest 
>>> driver off of the NVIDIA web site to get the card recognized.  Right 
>>> now, I am running 367.44.
>>> 
>>> Arne
>> 
>> Arne,
>> 
>> Thank you so much for this e-mail. Yes, it is the damn Pascal
>> architecture; I see lots of people complaining about it on the forums.
>> I downloaded and installed the driver from
>> 
>> http://www.nvidia.com/content/DriverDownload-March2009/confirmation.php?url=/XFree86/Linux-x86_64/367.57/NVIDIA-Linux-x86_64-367.57.run&lang=us&type=GeForce
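>> 
>> The installer is the usual .run dance, roughly (--dkms is optional but
>> keeps the module rebuilt across kernel updates):
>> 
>>   sh NVIDIA-Linux-x86_64-367.57.run --dkms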
>> 
>> That seems to have made a real difference. Check out these beautiful outputs:
>> 
>> root at gpu3$ ls nvidia*
>> nvidia0  nvidia1  nvidia2  nvidia3  nvidiactl  nvidia-uvm  nvidia-uvm-tools
>> 
>> root at gpu3$ lspci | grep -i nvidia
>> 02:00.0 VGA compatible controller: NVIDIA Corporation Device 1b00 (rev a1)
>> 02:00.1 Audio device: NVIDIA Corporation Device 10ef (rev a1)
>> 03:00.0 VGA compatible controller: NVIDIA Corporation Device 1b00 (rev a1)
>> 03:00.1 Audio device: NVIDIA Corporation Device 10ef (rev a1)
>> 82:00.0 VGA compatible controller: NVIDIA Corporation Device 1b00 (rev a1)
>> 82:00.1 Audio device: NVIDIA Corporation Device 10ef (rev a1)
>> 83:00.0 VGA compatible controller: NVIDIA Corporation Device 1b00 (rev a1)
>> 83:00.1 Audio device: NVIDIA Corporation Device 10ef (rev a1)
>> 
>> 
>> root at gpu3$ ls /proc/driver
>> nvidia  nvidia-uvm  nvram  rtc
>> 
>> root at gpu3$ lsmod |grep nvidia
>> nvidia_uvm            738901  0
>> nvidia_drm             43405  0
>> nvidia_modeset        764432  1 nvidia_drm
>> nvidia              11492947  2 nvidia_modeset,nvidia_uvm
>> drm_kms_helper        125056  2 ast,nvidia_drm
>> drm                   349210  5 ast,ttm,drm_kms_helper,nvidia_drm
>> i2c_core               40582  7 ast,drm,igb,i2c_i801,drm_kms_helper,i2c_algo_bit,nvidia
>> 
>> root at gpu3$ nvidia-smi
>> Wed Oct 12 22:03:27 2016
>> +-----------------------------------------------------------------------------+
>> | NVIDIA-SMI 367.57                 Driver Version: 367.57                    |
>> |-------------------------------+----------------------+----------------------+
>> | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
>> | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
>> |===============================+======================+======================|
>> |   0  TITAN X (Pascal)    Off  | 0000:02:00.0     Off |                  N/A |
>> | 23%   32C    P0    56W / 250W |      0MiB / 12189MiB |      0%      Default |
>> +-------------------------------+----------------------+----------------------+
>> |   1  TITAN X (Pascal)    Off  | 0000:03:00.0     Off |                  N/A |
>> | 23%   36C    P0    57W / 250W |      0MiB / 12189MiB |      0%      Default |
>> +-------------------------------+----------------------+----------------------+
>> |   2  TITAN X (Pascal)    Off  | 0000:82:00.0     Off |                  N/A |
>> | 23%   35C    P0    57W / 250W |      0MiB / 12189MiB |      0%      Default |
>> +-------------------------------+----------------------+----------------------+
>> |   3  TITAN X (Pascal)    Off  | 0000:83:00.0     Off |                  N/A |
>> |  0%   35C    P0    56W / 250W |      0MiB / 12189MiB |      0%      Default |
>> +-------------------------------+----------------------+----------------------+
>> 
>> +-----------------------------------------------------------------------------+
>> | Processes:                                                       GPU Memory |
>> |  GPU       PID  Type  Process name                               Usage      |
>> |=============================================================================|
>> |  No running processes found                                                 |
>> +-----------------------------------------------------------------------------+
>> 
>> 
>> 
>> /usr/local/cuda/extras/demo_suite/deviceQuery
>> 
>>  Alignment requirement for Surfaces:            Yes
>>  Device has ECC support:                        Disabled
>>  Device supports Unified Addressing (UVA):      Yes
>>  Device PCI Domain ID / Bus ID / location ID:   0 / 131 / 0
>>  Compute Mode:
>>     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
>>> Peer access from TITAN X (Pascal) (GPU0) -> TITAN X (Pascal) (GPU1) : Yes
>>> Peer access from TITAN X (Pascal) (GPU0) -> TITAN X (Pascal) (GPU2) : No
>>> Peer access from TITAN X (Pascal) (GPU0) -> TITAN X (Pascal) (GPU3) : No
>>> Peer access from TITAN X (Pascal) (GPU1) -> TITAN X (Pascal) (GPU0) : Yes
>>> Peer access from TITAN X (Pascal) (GPU1) -> TITAN X (Pascal) (GPU2) : No
>>> Peer access from TITAN X (Pascal) (GPU1) -> TITAN X (Pascal) (GPU3) : No
>>> Peer access from TITAN X (Pascal) (GPU2) -> TITAN X (Pascal) (GPU0) : No
>>> Peer access from TITAN X (Pascal) (GPU2) -> TITAN X (Pascal) (GPU1) : No
>>> Peer access from TITAN X (Pascal) (GPU2) -> TITAN X (Pascal) (GPU3) : Yes
>>> Peer access from TITAN X (Pascal) (GPU3) -> TITAN X (Pascal) (GPU0) : No
>>> Peer access from TITAN X (Pascal) (GPU3) -> TITAN X (Pascal) (GPU1) : No
>>> Peer access from TITAN X (Pascal) (GPU3) -> TITAN X (Pascal) (GPU2) : Yes
>> 
>> deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA
>> Runtime Version = 8.0, NumDevs = 4, Device0 = TITAN X (Pascal),
>> Device1 = TITAN X (Pascal), Device2 = TITAN X (Pascal),
>> Device3 = TITAN X (Pascal)
>> Result = PASS
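>> 
>> The peer-access pattern above (GPU0<->GPU1 and GPU2<->GPU3 only) is
>> presumably just the two PCIe root complexes showing through; it should
>> be confirmable with
>> 
>>   nvidia-smi topo -m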
>> 
>> 
>> 
>> Now, not everything is rosy:
>> 
>> root at gpu3$ cd ~/NVIDIA_CUDA-8.0_Samples/5_Simulations/nbody
>> root at gpu3$ make
>>>>> WARNING - libGL.so not found, refer to CUDA Getting Started Guide for how to find and install them. <<<
>>>>> WARNING - libGLU.so not found, refer to CUDA Getting Started Guide for how to find and install them. <<<
>>>>> WARNING - libX11.so not found, refer to CUDA Getting Started Guide for how to find and install them. <<<
>> 
>> 
>> even though those are installed. For example:
>> 
>> root at gpu3$ yum whatprovides  */libX11.so
>> libX11-devel-1.6.3-2.el7.i686 : Development files for libX11
>> Repo        : core
>> Matched from:
>> Filename    : /usr/lib/libX11.so
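>> 
>> (Note the only hit is the 32-bit devel package in /usr/lib; the 64-bit
>> libraries presumably live in /usr/lib64, which findgl.mk never looks
>> at here. Untested, but
>> 
>>   ls -l /usr/lib64/libX11.so /usr/lib64/libGL.so /usr/lib64/libGLU.so
>> 
>> should show them.)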
>> 
>> also installed:
>> 
>> mesa-libGLU-devel
>> mesa-libGL-devel
>> xorg-x11-drv-nvidia-devel
>> 
>> but
>> 
>> root at gpu3$ yum -y install mesa-libGLU-devel mesa-libGL-devel xorg-x11-drv-nvidia-devel
>> Package mesa-libGLU-devel-9.0.0-4.el7.x86_64 already installed and latest version
>> Package mesa-libGL-devel-10.6.5-3.20150824.el7.x86_64 already installed and latest version
>> Package 1:xorg-x11-drv-nvidia-devel-367.48-1.el7.x86_64 already installed and latest version
>> 
>> Also, gpuDevice hangs from within MATLAB.
>> 
>> So we still don't have a working installation. Any help would be
>> appreciated.
>> 
>> Best,
>> Predrag
>> 
>> P.S. Once we have a working installation we can think about installing
>> Caffe and TensorFlow. For now we have to see why things are not
>> working.
>> 
>> 
>> 
>> 
>> 
>> 
>>> 
>>>> On Oct 12, 2016, at 6:26 PM, Predrag Punosevac <predragp at cs.cmu.edu> wrote:
>>>> 
>>>> Dear Autonians,
>>>> 
>>>> GPU3 is "configured". Namely, you can log into it and all the
>>>> packages are installed. However, I couldn't get the NVIDIA-provided
>>>> CUDA driver to recognize the GPU cards. They appear to be properly
>>>> installed from the hardware point of view, and you can list them with
>>>> 
>>>> lshw -class display
>>>> 
>>>> root at gpu3$ lshw -class display
>>>> *-display UNCLAIMED
>>>>      description: VGA compatible controller
>>>>      product: NVIDIA Corporation
>>>>      vendor: NVIDIA Corporation
>>>>      physical id: 0
>>>>      bus info: pci at 0000:02:00.0
>>>>      version: a1
>>>>      width: 64 bits
>>>>      clock: 33MHz
>>>>      capabilities: pm msi pciexpress vga_controller cap_list
>>>>      configuration: latency=0
>>>>      resources: iomemory:383f0-383ef iomemory:383f0-383ef
>>>> memory:cf000000-cfffffff memory:383fe0000000-383fefffffff
>>>> memory:383ff0000000-383ff1ffffff ioport:6000(size=128)
>>>> memory:d0000000-d007ffff
>>>> *-display UNCLAIMED
>>>>      description: VGA compatible controller
>>>>      product: NVIDIA Corporation
>>>>      vendor: NVIDIA Corporation
>>>>      physical id: 0
>>>>      bus info: pci at 0000:03:00.0
>>>>      version: a1
>>>>      width: 64 bits
>>>>      clock: 33MHz
>>>>      capabilities: pm msi pciexpress vga_controller cap_list
>>>>      configuration: latency=0
>>>>      resources: iomemory:383f0-383ef iomemory:383f0-383ef
>>>> memory:cd000000-cdffffff memory:383fc0000000-383fcfffffff
>>>> memory:383fd0000000-383fd1ffffff ioport:5000(size=128)
>>>> memory:ce000000-ce07ffff
>>>> *-display
>>>>      description: VGA compatible controller
>>>>      product: ASPEED Graphics Family
>>>>      vendor: ASPEED Technology, Inc.
>>>>      physical id: 0
>>>>      bus info: pci at 0000:06:00.0
>>>>      version: 30
>>>>      width: 32 bits
>>>>      clock: 33MHz
>>>>      capabilities: pm msi vga_controller bus_master cap_list rom
>>>>      configuration: driver=ast latency=0
>>>>      resources: irq:19 memory:cb000000-cbffffff
>>>> memory:cc000000-cc01ffff ioport:4000(size=128)
>>>> *-display UNCLAIMED
>>>>      description: VGA compatible controller
>>>>      product: NVIDIA Corporation
>>>>      vendor: NVIDIA Corporation
>>>>      physical id: 0
>>>>      bus info: pci at 0000:82:00.0
>>>>      version: a1
>>>>      width: 64 bits
>>>>      clock: 33MHz
>>>>      capabilities: pm msi pciexpress vga_controller cap_list
>>>>      configuration: latency=0
>>>>      resources: iomemory:387f0-387ef iomemory:387f0-387ef
>>>> memory:fa000000-faffffff memory:387fe0000000-387fefffffff
>>>> memory:387ff0000000-387ff1ffffff ioport:e000(size=128)
>>>> memory:fb000000-fb07ffff
>>>> *-display UNCLAIMED
>>>>      description: VGA compatible controller
>>>>      product: NVIDIA Corporation
>>>>      vendor: NVIDIA Corporation
>>>>      physical id: 0
>>>>      bus info: pci at 0000:83:00.0
>>>>      version: a1
>>>>      width: 64 bits
>>>>      clock: 33MHz
>>>>      capabilities: pm msi pciexpress vga_controller cap_list
>>>>      configuration: latency=0
>>>>      resources: iomemory:387f0-387ef iomemory:387f0-387ef
>>>> memory:f8000000-f8ffffff memory:387fc0000000-387fcfffffff
>>>> memory:387fd0000000-387fd1ffffff ioport:d000(size=128)
>>>> memory:f9000000-f907ffff
>>>> 
>>>> 
>>>> However, what scares the hell out of me is that I don't see the
>>>> NVIDIA driver loaded:
>>>> 
>>>> lsmod|grep nvidia
>>>> 
>>>> and the device nodes /dev/nvidia* are not created. I am guessing I
>>>> just missed some trivial step during the CUDA installation, which is
>>>> very involved. I am unfortunately too tired to debug this tonight.
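>>>> 
>>>> If it is just that the module never got loaded, the first things to
>>>> try tomorrow are probably (sketch, untested):
>>>> 
>>>>   modprobe nvidia
>>>>   nvidia-modprobe -u -c=0   # should create the /dev/nvidia* nodes if the driver is in place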
>>>> 
>>>> Predrag
>> 

