From yeehos at andrew.cmu.edu Tue Aug 4 16:14:24 2020 From: yeehos at andrew.cmu.edu (Yeeho Song) Date: Tue, 4 Aug 2020 16:14:24 -0400 Subject: LOV2 and GPU15 scratch memories almost / completely full Message-ID: Dear all, This is a gentle reminder that LOV2 and GPU15 scratch are almost / completely full. Please check and delete / move your files from the scratch directories if possible. Thank you! yeehos at lov2$ df -h /home/scratch Filesystem Size Used Avail Use% Mounted on /dev/mapper/sl-home 1.8T 1.8T 2.0M 100% /home yeehos at gpu15$ df -h /home/scratch Filesystem Size Used Avail Use% Mounted on /dev/mapper/sl-home 184G 184G 60K 100% /home Sincerely, Yeeho Song -------------- next part -------------- An HTML attachment was scrubbed... URL: From ngisolfi at cs.cmu.edu Thu Aug 6 10:53:55 2020 From: ngisolfi at cs.cmu.edu (Nicholas Gisolfi) Date: Thu, 6 Aug 2020 10:53:55 -0400 Subject: [Lunch] Today @noon over Zoom Message-ID: https://cmu.zoom.us/j/492870487 We hope to see you there! - Nick -------------- next part -------------- An HTML attachment was scrubbed... URL: From predragp at andrew.cmu.edu Fri Aug 7 15:43:23 2020 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Fri, 07 Aug 2020 15:43:23 -0400 Subject: gpu11 scratch Message-ID: <20200807194323.ZAu3x%predragp@andrew.cmu.edu> HDD holding scratch directory on GPU11 appears to be dead. I don't have a stomach to reboot the server. Please continue to use From predragp at andrew.cmu.edu Fri Aug 7 17:10:20 2020 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Fri, 7 Aug 2020 17:10:20 -0400 Subject: Important: Scratch access denied. In-Reply-To: References: Message-ID: This is the second report. I added an account earlier today. I have seen this happening with a runaway script I use to create scratch directories across computing nodes. I am looking into it. Predrag On Fri, Aug 7, 2020 at 5:07 PM Tanmay Agarwal wrote: > Hi Predrag, > > Hope you are doing well! 
> > I am unable to access the scratch directories across all GPUs. Can you > help fix it as I need some data out of it, specifically for the GPU3 node? > Please let us know if it can be fixed at the earliest and I look forward to > your reply. > > > Thanking you, > > Warm Regards, > > Tanmay Agarwal | MSR Graduate Student > Robotics Institute @ CMU > mailto: tanmaya at andrew.cmu.edu > -------------- next part -------------- An HTML attachment was scrubbed... URL: From predragp at andrew.cmu.edu Fri Aug 7 18:01:55 2020 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Fri, 7 Aug 2020 18:01:55 -0400 Subject: Important: Scratch access denied. In-Reply-To: References: Message-ID: Fixed! Thanks for reporting. I forgot to export a variable before running the script which creates a new scratch directory. Predrag On Fri, Aug 7, 2020 at 5:10 PM Predrag Punosevac wrote: > This is the second report. I added an account earlier today. I have seen > this happening with a runaway script I use to create scratch directories > across computing nodes. I am looking into it. > > Predrag > > On Fri, Aug 7, 2020 at 5:07 PM Tanmay Agarwal > wrote: > >> Hi Predrag, >> >> Hope you are doing well! >> >> I am unable to access the scratch directories across all GPUs. Can you >> help fix it as I need some data out of it, specifically for the GPU3 node? >> Please let us know if it can be fixed at the earliest and I look forward to >> your reply. >> >> >> Thanking you, >> >> Warm Regards, >> >> Tanmay Agarwal | MSR Graduate Student >> Robotics Institute @ CMU >> mailto: tanmaya at andrew.cmu.edu >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From tanmaya at andrew.cmu.edu Sat Aug 8 22:09:20 2020 From: tanmaya at andrew.cmu.edu (Tanmay Agarwal) Date: Sat, 8 Aug 2020 22:09:20 -0400 Subject: Unable to login into GPU21. In-Reply-To: References: Message-ID: Hi Predrag, Thanks for updating me on this. 
It seems that my runs are still running on the server, and I hope we can wait until Monday to see if the processes terminate or if the kernel crashes. Meanwhile, I will copy the results of my runs that I save on the zfsauton disk for backup. Thanking you, Warm Regards, Tanmay Agarwal | MSR Graduate Student Robotics Institute @ CMU mailto: tanmaya at andrew.cmu.edu On Sat, Aug 8, 2020 at 10:01 PM Predrag Punosevac wrote: > I have checked the server. You overloaded it. I can log in as root but not > as a regular user. I can reboot it, or we can wait it out to see if the process > self-terminates. If the kernel crashes we will have to wait until Monday to > be able to reboot it. gpu21 is not located in our rack and it is not > connected to our IPMI console. Somebody will actually have to walk to the > server and press the reset button. > > Cheers, > Predrag > > On Sat, Aug 8, 2020 at 7:32 PM Tanmay Agarwal > wrote: > >> Hi Predrag, >> >> Hope you are doing well! >> >> I wanted to bring to your attention that GPU21 seems to be >> inaccessible. Can you please check and get back to us, as I have a few >> important experiments running on it and need to access them? >> >> Looking forward to hearing from you. >> >> Thanking you, >> >> Warm Regards, >> >> Tanmay Agarwal | MSR Graduate Student >> Robotics Institute @ CMU >> mailto: tanmaya at andrew.cmu.edu >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From awd at cs.cmu.edu Mon Aug 10 14:23:11 2020 From: awd at cs.cmu.edu (Artur Dubrawski) Date: Mon, 10 Aug 2020 14:23:11 -0400 Subject: CMU Auton Lab Team continues to lead in DARPA AutoML evaluations Message-ID: Team, We have all gotten used to Saswati Ray and her team, with their automated machine learning pipeline construction algorithm, beating everyone else in periodic DARPA evaluation contests for the past 2 years or so.
So today's news may not feel so new anymore: she remains the reigning Queen of AutoML, according to the just-released DARPA D3M Summer evaluation leaderboard scores :) We got used to Saswati doing this, but to continue being #1 in such a tight contest for this long... It's like winning 7 or 8 Stanley Cups or Super Bowls in a row, completely off the charts! In addition, somewhat unexpectedly but very pleasantly, our ML component-building sub-team, led by Jarod Wang and Cristian Challu, also got the #1 rank in their respective category in the same DARPA evaluation :) This is nothing short of amazing; thanks to everyone involved for your efforts, congratulations! Cheers, Artur -------------- next part -------------- An HTML attachment was scrubbed... URL: From ben.boecking at googlemail.com Mon Aug 10 19:08:14 2020 From: ben.boecking at googlemail.com (Ben Boecking) Date: Mon, 10 Aug 2020 18:08:14 -0500 Subject: Please free compute resources where possible Message-ID: <5FB77ACD-9221-46E8-B9A2-EB1174C39B85@googlemail.com> Hi everyone, I see that many of our computing nodes are under heavy load from scripts that in some cases have been running for weeks. I would like to ask you to close/kill jobs that you no longer need to free up compute resources. Thanks very much in advance! Best, Ben From awd at cs.cmu.edu Tue Aug 11 11:39:38 2020 From: awd at cs.cmu.edu (Artur Dubrawski) Date: Tue, 11 Aug 2020 11:39:38 -0400 Subject: CMU Auton Lab Team continues to lead in DARPA AutoML evaluations In-Reply-To: References: Message-ID: Team, I feel stupid - blame it on my vacationing mindset if you can - but I completely overlooked the key contribution to the second part of the story below from Kin Gutierrez. He was fundamental in developing and coding the advanced time series forecasting tool that pushed our ML primitive development team to the #1 rank in the most recent evaluation.
Thank you Kin, very well earned congratulations, and please accept my apologies for not including you in the note below. Cheers Artur On Mon, Aug 10, 2020 at 2:23 PM Artur Dubrawski wrote: > Team, > > We all got used to Saswati Ray and Team, and their automated machine > learning pipeline construction algorithm beating everyone else in periodic > DARPA evaluation contests for the past 2 years or so. > > So the news of the day today may not feel so new anymore: she remains the > reigning Queen of AutoML, according to just-in DARPA D3M Summer evaluation > leaderboard scores :) > > We got used to Saswati doing this, but to continue being #1 in such a > tight contest for this long... It's like winning 7 or 8 Stanley Cups or > Superbowls in a row, completely out of charts! > > In addition, somewhat unexpectedly but very pleasantly, our ML > component-building sub-team, led by Jarod Wang and Cristian Challu, also > got #1 rank in their respective category in the same DARPA evaluation :) > > This is nothing but amazing, thanks to everyone involved for your efforts, > congratulations! > > Cheers, > Artur > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ngisolfi at cs.cmu.edu Thu Aug 13 10:23:13 2020 From: ngisolfi at cs.cmu.edu (Nicholas Gisolfi) Date: Thu, 13 Aug 2020 10:23:13 -0400 Subject: [Lunch] Today @noon over Zoom Message-ID: https://cmu.zoom.us/j/492870487 We hope to see you there! - Nick -------------- next part -------------- An HTML attachment was scrubbed... URL: From predragp at andrew.cmu.edu Sun Aug 16 14:18:38 2020 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Sun, 16 Aug 2020 14:18:38 -0400 Subject: OT: Xen1 is down WAS: Re: Is lop2 down? 
In-Reply-To: <20200816181309.AITHr%predragp@andrew.cmu.edu> References: <85b1ab2fbdc64a1ca5c6155f94255666@andrew.cmu.edu> <20200816181309.AITHr%predragp@andrew.cmu.edu> Message-ID: Chufan Gao wrote: > Hi Predrag, > > > I'm trying to ssh into lop2 but I can't seem to do it. Is it down? > Hi Gao, Yes, it is down, and the situation is even worse than that. Namely, lop2 is a virtual server running on the virtual host xen1.int.autonlab.org. The virtual host is down. It can't be reached via IPMI. That means that either it has lost power or the power supply is dead. I can't do much about it without physically accessing the host. The server room is not crewed over the weekend due to Covid-19. For now please use the other two ssh gateways bash.autonlab.org SHA256:Pf/uiR0Hzw9HpSNaf3/fRXon9gdXFes5KP7HEobNaW4 lion.auton.cs.cmu.edu SHA256:BL7KygrfP6PApBpf6BFHlphnc9f0KpsdhSvsguAhP4I I will try to bring up another shell gateway during the day. Note for everyone: the following virtual servers are gone and can't be restored until I repair Xen1. observium.int.autonlab.org cmds.int.autonlab.org cmds-demo.dmz.autonlab.org comp.dmz.autonlab.org edshat.dmz.autonlab.org lop2.int.autonlab.org Best, Predrag > > Sincerely, > > Chufan (Andy) Gao -------------- next part -------------- An HTML attachment was scrubbed... URL: From awd at cs.cmu.edu Mon Aug 17 15:07:51 2020 From: awd at cs.cmu.edu (Artur Dubrawski) Date: Mon, 17 Aug 2020 15:07:51 -0400 Subject: CMU Auton Lab Team continues to lead in DARPA AutoML evaluations In-Reply-To: References: Message-ID: Team, The story has made it to the SCS website; check it out here: https://www.scs.cmu.edu/news/scs-researchers-top-leaderboard-darpa-automl-evaluations Congrats again to Saswati, Jarod, Cristian, Kin and the whole Auton D3M Team!
Artur On Tue, Aug 11, 2020 at 11:39 AM Artur Dubrawski wrote: > Team, > > I feel stupid - blame it on my vacationinig mindset if you can - but I > completely overlooked the key contribution to the second part of the story > below from Kin Gutierrez. He was fundamental in developing and coding the > advanced time series forecasting tool that pushed our ML primitive > development team to #1 rank in the most recent evaluation. Thank you Kin, > very well earned congratulations, and please accept my > apologies for not including you in the note below. > > Cheers > Artur > On Mon, Aug 10, 2020 at 2:23 PM Artur Dubrawski wrote: > >> Team, >> >> We all got used to Saswati Ray and Team, and their automated machine >> learning pipeline construction algorithm beating everyone else in periodic >> DARPA evaluation contests for the past 2 years or so. >> >> So the news of the day today may not feel so new anymore: she remains the >> reigning Queen of AutoML, according to just-in DARPA D3M Summer evaluation >> leaderboard scores :) >> >> We got used to Saswati doing this, but to continue being #1 in such a >> tight contest for this long... It's like winning 7 or 8 Stanley Cups or >> Superbowls in a row, completely out of charts! >> >> In addition, somewhat unexpectedly but very pleasantly, our ML >> component-building sub-team, led by Jarod Wang and Cristian Challu, also >> got #1 rank in their respective category in the same DARPA evaluation :) >> >> This is nothing but amazing, thanks to everyone involved for your >> efforts, congratulations! >> >> Cheers, >> Artur >> > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From predragp at andrew.cmu.edu Tue Aug 18 00:36:28 2020 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Tue, 18 Aug 2020 00:36:28 -0400 Subject: Tesla V100 GPU nodes Message-ID: <20200818043628.nRw15%predragp@andrew.cmu.edu> Dear Autonians, As we gear up for a new school year, I would like to remind everyone that we are sharing finite resources. We still rely on our ladies'/gentlemen's agreement https://www.autonlab.org/autonlab_wiki/faq.html rather than on the Slurm scheduler. Given that we have started acquiring very expensive high-memory GPU servers (Tesla V100) suitable for training 3D neural networks, the notable addition to our don'ts is that those servers are not to be used when your jobs can run on lower-memory GPUs. We will be adding both high-memory and lower-memory GPUs as new rack space and electricity become available in the coming weeks. Most Kind Regards, Predrag Punosevac From predragp at andrew.cmu.edu Tue Aug 18 01:52:25 2020 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Tue, 18 Aug 2020 01:52:25 -0400 Subject: OT: Xen1 is down WAS: Re: Is lop2 down? Message-ID: <20200818055225.cnm77%predragp@andrew.cmu.edu> Predrag Punosevac wrote: > Chufan Gao wrote: > > > Hi Predrag, > > > > > > I'm trying to ssh into lop2 but I can't seem to do it. Is it down? > > > > Hi Gao, > > Yes, it is down and the situation is even worse than that. Namely, lop2 I was able to fix Xen1 remotely using a backup IPMI client. Contrary to my original assessment, the power supply was not damaged; it was a stupid IPMI client problem. lop2 works as expected. Best, Predrag > is a virtual server running on the virtual host xen1.int.autonlab.org. > The virtual host is down. It can't be reached via IPMI. That means > that either it has lost power or the power supply is dead. I > can't do much about it without physically accessing the host.
The server > room is not crewed over the weekend due to Covid-19. > > For now please use the other two ssh gateways > > bash.autonlab.org SHA256:Pf/uiR0Hzw9HpSNaf3/fRXon9gdXFes5KP7HEobNaW4 > lion.auton.cs.cmu.edu SHA256:BL7KygrfP6PApBpf6BFHlphnc9f0KpsdhSvsguAhP4I > > I will try to bring up another shell gateway during the day. > > Note for everyone: the following virtual servers are gone and can't be > restored until I repair Xen1. > > observium.int.autonlab.org > cmds.int.autonlab.org > cmds-demo.dmz.autonlab.org > comp.dmz.autonlab.org > edshat.dmz.autonlab.org > lop2.int.autonlab.org > > > > Best, > Predrag > > > > > Sincerely, > > > > Chufan (Andy) Gao From predragp at andrew.cmu.edu Tue Aug 18 02:13:05 2020 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Tue, 18 Aug 2020 02:13:05 -0400 Subject: cuda problem In-Reply-To: References: Message-ID: <20200818061305.VcGzT%predragp@andrew.cmu.edu> Ifigeneia Apostolopoulou wrote: > Hi Predrag, > > I hope that this (weird) summer is going well! > > I noticed a change in servers gpu1, gpu2, gpu13, gpu14. > Specifically, I no longer can find I have not touched those servers in a very long time. I am CC-ing the users mailing list. My brain is shutting down at this late hour. Maybe somebody could be of more help tomorrow morning. > > /usr/local/cuda/extras/CUPTI > I believe you. > which results in the error when I'm building my tensorflow models. > > Not found: ./bin/ptxas not found. Relying on driver to perform ptx > compilation. This message will be only logged once. > > Any ideas, how could I solve this issue? Would it be possible to restore > the cuda directory? > > Also, I currently do not have access to gpu21. It is fixed now. I just restarted the sssd daemon. Please don't use gpu20 and gpu21 unless you are training 3D neural networks for which you need a lot of GPU memory. Predrag > > Thanks a lot in advance!
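[Editor's note] The missing CUPTI/ptxas reports in this thread boil down to checking whether a node's CUDA root still contains the components TensorFlow expects. The sketch below is illustrative only: the component paths (bin/ptxas, extras/CUPTI) come from the error messages above, /usr/local/cuda is the conventional location mentioned later in the thread, and the helper name check_cuda_root is made up for this example.

```shell
#!/bin/sh
# Sketch: report which expected CUDA components are present under a
# given CUDA root. The two paths checked are the ones the TensorFlow
# errors in this thread point at; the helper name is hypothetical.

check_cuda_root() {
    root="$1"
    missing=0
    for piece in bin/ptxas extras/CUPTI; do
        if [ -e "$root/$piece" ]; then
            echo "OK      $root/$piece"
        else
            echo "MISSING $root/$piece"
            missing=1
        fi
    done
    return $missing
}

# Demo against a throwaway tree so the sketch runs anywhere; on a GPU
# node you would instead call: check_cuda_root /usr/local/cuda
demo=$(mktemp -d)
mkdir -p "$demo/bin"
: > "$demo/bin/ptxas"              # ptxas present, CUPTI deliberately absent
check_cuda_root "$demo" || echo "incomplete CUDA install at $demo"
rm -rf "$demo"
```

On a healthy node, `ls -l /usr/local/cuda` should also show the symlink resolving to a versioned directory (e.g. a cuda-10.x install); a dangling symlink produces exactly the "directory has disappeared" symptoms reported here.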
From mille856 at andrew.cmu.edu Tue Aug 18 09:32:41 2020 From: mille856 at andrew.cmu.edu (Kyle Miller) Date: Tue, 18 Aug 2020 09:32:41 -0400 Subject: cuda problem In-Reply-To: <20200818061305.VcGzT%predragp@andrew.cmu.edu> References: <20200818061305.VcGzT%predragp@andrew.cmu.edu> Message-ID: Ifi, I recently had difficulty on GPU13, having not used it in a long while. For me, the issue was that miniconda had moved. I added /opt/miniconda-py38/bin to my path and rebuilt my environment (not sure if that was necessary). Then it worked. -Kyle On Tue, Aug 18, 2020 at 2:14 AM Predrag Punosevac wrote: > Ifigeneia Apostolopoulou wrote: > > > Hi Predrag, > > > > I hope that this (weird) summer is going well! > > > > I noticed a change in servers gpu1, gpu2, gpu13, gpu14. > > Specifically, I no longer can find > > I have not touch those servers in a very long time. I am CC-ing users > mailing list. My brain is shutting down at this late hour. Maybe > somebody could be of more help tomorrow morning. > > > > > /usr/local/cuda/extras/CUPTI > > > > I believe you. > > > > which results in the error when I'm building my tensorflow models. > > > > Not found: ./bin/ptxas not found. Relying on driver to perform ptx > > compilation. This message will be only logged once. > > > > Any ideas, how could I solve this issue? Would it be possible to restore > > the cuda directory? > > > > Also, I currently do not have access to gpu21. > > It is fixed now. I just restarted sssd daemon. Please don't use gpu20 > and gpu21 unless you are training 3D neuronal networks for which you > need lot of GPU memory. > > Predrag > > > > > > Thanks a lot in advance! > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From iapostol at andrew.cmu.edu Tue Aug 18 10:00:35 2020 From: iapostol at andrew.cmu.edu (Ifigeneia Apostolopoulou) Date: Tue, 18 Aug 2020 10:00:35 -0400 Subject: cuda problem In-Reply-To: References: <20200818061305.VcGzT%predragp@andrew.cmu.edu> Message-ID: Hi Kyle, Thanks a lot for your reply! I also had this issue and I solved it as you did. However, this seems to be another issue: I currently can't see CUPTI in usr/local/cuda/extras/CUPTI (or anywhere in gpu1 to set it to my path) which causes the issue. I am also attaching the screenshot with the working (gpu3) and not-working (gpu1) case. In gpu1, gpu2, gpu13, it seems that the directory cuda (and all its content) has been moved (and I can't find it in any other directory). On Tue, Aug 18, 2020 at 9:32 AM Kyle Miller wrote: > Ifi, > I recently had difficulty on GPU13, having not used it in a long while. > For me, the issue was that miniconda had moved. I added > /opt/miniconda-py38/bin to my path and rebuilt my environment (not sure if > that was necessary). Then it worked. > -Kyle > > On Tue, Aug 18, 2020 at 2:14 AM Predrag Punosevac > wrote: > >> Ifigeneia Apostolopoulou wrote: >> >> > Hi Predrag, >> > >> > I hope that this (weird) summer is going well! >> > >> > I noticed a change in servers gpu1, gpu2, gpu13, gpu14. >> > Specifically, I no longer can find >> >> I have not touch those servers in a very long time. I am CC-ing users >> mailing list. My brain is shutting down at this late hour. Maybe >> somebody could be of more help tomorrow morning. >> >> > >> > /usr/local/cuda/extras/CUPTI >> > >> >> I believe you. >> >> >> > which results in the error when I'm building my tensorflow models. >> > >> > Not found: ./bin/ptxas not found. Relying on driver to perform ptx >> > compilation. This message will be only logged once. >> > >> > Any ideas, how could I solve this issue? Would it be possible to restore >> > the cuda directory? >> > >> > Also, I currently do not have access to gpu21. 
>> >> It is fixed now. I just restarted sssd daemon. Please don't use gpu20 >> and gpu21 unless you are training 3D neuronal networks for which you >> need lot of GPU memory. >> >> Predrag >> >> >> > >> > Thanks a lot in advance! >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: gpu1.png Type: image/png Size: 197088 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: gpu3.png Type: image/png Size: 157090 bytes Desc: not available URL: From mille856 at andrew.cmu.edu Tue Aug 18 11:28:51 2020 From: mille856 at andrew.cmu.edu (Kyle Miller) Date: Tue, 18 Aug 2020 11:28:51 -0400 Subject: cuda problem In-Reply-To: References: <20200818061305.VcGzT%predragp@andrew.cmu.edu> Message-ID: I see. I ran a few find commands on gpu13, I couldn't find a cuda folder or CUPTI. On Tue, Aug 18, 2020 at 10:00 AM Ifigeneia Apostolopoulou < iapostol at andrew.cmu.edu> wrote: > Hi Kyle, > Thanks a lot for your reply! > > I also had this issue and I solved it as you did. However, this seems to > be another issue: > I currently can't see CUPTI in usr/local/cuda/extras/CUPTI (or anywhere in > gpu1 to set it to my path) which causes the issue. > I am also attaching the screenshot with the working (gpu3) and not-working > (gpu1) case. In gpu1, gpu2, gpu13, it seems that the directory cuda (and > all its content) has been moved (and I can't find it in any other > directory). > > > > > > On Tue, Aug 18, 2020 at 9:32 AM Kyle Miller > wrote: > >> Ifi, >> I recently had difficulty on GPU13, having not used it in a long >> while. For me, the issue was that miniconda had moved. I added >> /opt/miniconda-py38/bin to my path and rebuilt my environment (not sure if >> that was necessary). Then it worked. 
>> -Kyle >> >> On Tue, Aug 18, 2020 at 2:14 AM Predrag Punosevac < >> predragp at andrew.cmu.edu> wrote: >> >>> [...] -------------- next part -------------- An HTML attachment was scrubbed... URL: From predragp at andrew.cmu.edu Tue Aug 18 11:41:26 2020 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Tue, 18 Aug 2020 11:41:26 -0400 Subject: cuda problem In-Reply-To: References: <20200818061305.VcGzT%predragp@andrew.cmu.edu> Message-ID: Because the cuda folder is the cuda-10.2 folder. The cuda folder is typically just a symbolic link to the current version of cuda. On Tue, Aug 18, 2020, 11:31 AM Kyle Miller wrote: > I see. I ran a few find commands on gpu13, I couldn't find a cuda folder > or CUPTI. > > [...] From iapostol at andrew.cmu.edu Tue Aug 18 11:46:17 2020 From: iapostol at andrew.cmu.edu (Ifigeneia Apostolopoulou) Date: Tue, 18 Aug 2020 11:46:17 -0400 Subject: cuda problem In-Reply-To: References: <20200818061305.VcGzT%predragp@andrew.cmu.edu> Message-ID: Yes, but there is still no bin/ptxas in cuda-10.2; actually, there is no bin directory. It seems that cuda-10.2 is corrupted? On Tue, Aug 18, 2020 at 11:41 AM Predrag Punosevac wrote: > Because the cuda folder is the cuda-10.2 folder. The cuda folder is typically just a > symbolic link to the current version of cuda. > > [...] From predragp at andrew.cmu.edu Tue Aug 18 17:23:48 2020 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Tue, 18 Aug 2020 17:23:48 -0400 Subject: cuda problem In-Reply-To: References: <20200818061305.VcGzT%predragp@andrew.cmu.edu> Message-ID: <20200818212348.C8xKV%predragp@andrew.cmu.edu> Ifigeneia Apostolopoulou wrote: > Yes, but there is still no bin/ptxas in cuda-10.2; actually, there is no bin > directory. It seems that cuda-10.2 is corrupted? > I took a clue from your message and did the fresh installation of CUDA to GPU1 only.
I upgraded the kernel and the driver to the latest one supporting branch 7.8 of RedHat. The driver works as expected in my limited testing. CUDA is upgraded to the newly released 11.0. I really hate that NVidia is intensionally breaking previous stable releases as soon as the new one is branched out. Could you please try building Tensor Flow in GPU1 and report the progress? We will eventually have to upgrade all GPU nodes to CUDA 11 even if they are fully working now. Best, Predrag > On Tue, Aug 18, 2020 at 11:41 AM Predrag Punosevac > wrote: > > > Because cuda folder is cuda 10.2 folder. Cuda folder is typically just a > > symbolic link to the curen version of cuda. > > > > On Tue, Aug 18, 2020, 11:31 AM Kyle Miller > > wrote: > > > >> I see. I ran a few find commands on gpu13, I couldn't find a cuda folder > >> or CUPTI. > >> > >> On Tue, Aug 18, 2020 at 10:00 AM Ifigeneia Apostolopoulou < > >> iapostol at andrew.cmu.edu> wrote: > >> > >>> Hi Kyle, > >>> Thanks a lot for your reply! > >>> > >>> I also had this issue and I solved it as you did. However, this seems to > >>> be another issue: > >>> I currently can't see CUPTI in usr/local/cuda/extras/CUPTI (or anywhere > >>> in gpu1 to set it to my path) which causes the issue. > >>> I am also attaching the screenshot with the working (gpu3) and > >>> not-working (gpu1) case. In gpu1, gpu2, gpu13, it seems that the directory > >>> cuda (and all its content) has been moved (and I can't find it in any other > >>> directory). > >>> > >>> > >>> > >>> > >>> > >>> On Tue, Aug 18, 2020 at 9:32 AM Kyle Miller > >>> wrote: > >>> > >>>> Ifi, > >>>> I recently had difficulty on GPU13, having not used it in a long > >>>> while. For me, the issue was that miniconda had moved. I added > >>>> /opt/miniconda-py38/bin to my path and rebuilt my environment (not sure if > >>>> that was necessary). Then it worked. 
> >>>> -Kyle > >>>> > >>>> On Tue, Aug 18, 2020 at 2:14 AM Predrag Punosevac < > >>>> predragp at andrew.cmu.edu> wrote: > >>>> > >>>>> Ifigeneia Apostolopoulou wrote: > >>>>> > >>>>> > Hi Predrag, > >>>>> > > >>>>> > I hope that this (weird) summer is going well! > >>>>> > > >>>>> > I noticed a change in servers gpu1, gpu2, gpu13, gpu14. > >>>>> > Specifically, I no longer can find > >>>>> > >>>>> I have not touch those servers in a very long time. I am CC-ing users > >>>>> mailing list. My brain is shutting down at this late hour. Maybe > >>>>> somebody could be of more help tomorrow morning. > >>>>> > >>>>> > > >>>>> > /usr/local/cuda/extras/CUPTI > >>>>> > > >>>>> > >>>>> I believe you. > >>>>> > >>>>> > >>>>> > which results in the error when I'm building my tensorflow models. > >>>>> > > >>>>> > Not found: ./bin/ptxas not found. Relying on driver to perform ptx > >>>>> > compilation. This message will be only logged once. > >>>>> > > >>>>> > Any ideas, how could I solve this issue? Would it be possible to > >>>>> restore > >>>>> > the cuda directory? > >>>>> > > >>>>> > Also, I currently do not have access to gpu21. > >>>>> > >>>>> It is fixed now. I just restarted sssd daemon. Please don't use gpu20 > >>>>> and gpu21 unless you are training 3D neuronal networks for which you > >>>>> need lot of GPU memory. > >>>>> > >>>>> Predrag > >>>>> > >>>>> > >>>>> > > >>>>> > Thanks a lot in advance! > >>>>> > >>>> From predragp at andrew.cmu.edu Tue Aug 18 17:38:00 2020 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Tue, 18 Aug 2020 17:38:00 -0400 Subject: ipython hangs on Auton cluster In-Reply-To: <8EA57DAE-B1FE-4998-B7FF-37245761F464@andrew.cmu.edu> References: <8EA57DAE-B1FE-4998-B7FF-37245761F464@andrew.cmu.edu> Message-ID: I just upgraded all /opt/conda-py37 and /opt/conda-py38 packages on both GPU9 and GPU11. Could you please try again? Could you also try with py38 which is now recommended and report back. 
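If the SQLite hypothesis pans out, one documented IPython knob worth trying is HistoryManager.hist_file, which can point the history store away from a possibly corrupted file on the NFS home. A hypothetical config sketch (the profile path is IPython's default; nothing like this is deployed on the cluster):

```python
# ~/.ipython/profile_default/ipython_config.py  (config fragment)
# Hypothetical workaround: keep IPython's command history off the NFS home,
# since a locked or corrupted history.sqlite can stall ipython at startup.
c = get_config()  # provided by IPython when it loads this config file

# Either keep history in memory only...
c.HistoryManager.hist_file = ":memory:"
# ...or put it on fast local disk instead, e.g.:
# c.HistoryManager.hist_file = "/tmp/ipython_hist.sqlite"
```

If ipython then starts cleanly, the history database on the NFS home is the likely culprit.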
If this works I will upgrade packages across all servers. This could be potentially remotely related to the fact that Ifigeneia could not build TensorFlow. Another thought is that the ipython SQLite database is corrupted. Best, Predrag On Tue, Aug 18, 2020 at 4:34 PM Viraj Mehta wrote: > Hi Predrag, > > Hope you're doing well. I've been running into an issue the last couple > days on the Auton cluster that is blocking my work on code that used to > work and was hoping to get your thoughts. I have tried to distill this down > to a small but replicable issue, as seen in the attachment, which I have > seen hang on the ipython call on GPU9 and GPU11 so far. Do you know why > this might be? Thanks. > > Best, > Viraj -------------- next part -------------- An HTML attachment was scrubbed... URL: From virajm at andrew.cmu.edu Tue Aug 18 17:39:00 2020 From: virajm at andrew.cmu.edu (Viraj Mehta) Date: Tue, 18 Aug 2020 16:39:00 -0500 Subject: ipython hangs on Auton cluster In-Reply-To: References: <8EA57DAE-B1FE-4998-B7FF-37245761F464@andrew.cmu.edu> Message-ID: <338E6BCA-A035-4E6C-9732-4DAF86E77FE5@andrew.cmu.edu> Yeah, I'll give it a shot. Thanks! > On Aug 18, 2020, at 4:38 PM, Predrag Punosevac wrote: > > I just upgraded all /opt/conda-py37 and /opt/conda-py38 packages on both GPU9 and GPU11. Could you please try again? Could you also try with py38 which is now recommended and report back. If this works I will upgrade packages across all servers. This could be potentially remotely related to the fact that Ifigeneia could not build TensorFlow. Another thought is that the ipython SQLite database is corrupted. > > Best, > Predrag > > On Tue, Aug 18, 2020 at 4:34 PM Viraj Mehta > wrote: > Hi Predrag, > > Hope you're doing well. I've been running into an issue the last couple days on the Auton cluster that is blocking my work on code that used to work and was hoping to get your thoughts.
I have tried to distill this down to a small but replicable issue, as seen in the attachment, which I have seen hang on the ipython call on GPU9 and GPU11 so far. Do you know why this might be? Thanks. Best, Viraj -------------- next part -------------- An HTML attachment was scrubbed... URL: From virajm at andrew.cmu.edu Tue Aug 18 17:44:59 2020 From: virajm at andrew.cmu.edu (Viraj Mehta) Date: Tue, 18 Aug 2020 16:44:59 -0500 Subject: ipython hangs on Auton cluster In-Reply-To: <338E6BCA-A035-4E6C-9732-4DAF86E77FE5@andrew.cmu.edu> References: <8EA57DAE-B1FE-4998-B7FF-37245761F464@andrew.cmu.edu> <338E6BCA-A035-4E6C-9732-4DAF86E77FE5@andrew.cmu.edu> Message-ID: <4F61E3DB-9F58-4879-9E18-E6B37711B2BB@andrew.cmu.edu> Tried this with 3.7 and 3.8 and it still hangs. Also if it's a good clue, it doesn't stop even if I send SIGINT or SIGQUIT. Not really sure what's going on here. > On Aug 18, 2020, at 4:39 PM, Viraj Mehta wrote: > > Yeah, I'll give it a shot. Thanks! > >> On Aug 18, 2020, at 4:38 PM, Predrag Punosevac > wrote: >> >> I just upgraded all /opt/conda-py37 and /opt/conda-py38 packages on both GPU9 and GPU11. Could you please try again? Could you also try with py38 which is now recommended and report back. If this works I will upgrade packages across all servers. This could be potentially remotely related to the fact that Ifigeneia could not build TensorFlow. Another thought is that the ipython SQLite database is corrupted. >> >> Best, >> Predrag >> >> On Tue, Aug 18, 2020 at 4:34 PM Viraj Mehta > wrote: >> Hi Predrag, >> >> Hope you're doing well. I've been running into an issue the last couple days on the Auton cluster that is blocking my work on code that used to work and was hoping to get your thoughts. I have tried to distill this down to a small but replicable issue, as seen in the attachment, which I have seen hang on the ipython call on GPU9 and GPU11 so far. Do you know why this might be? Thanks.
>> >> Best, >> Viraj -------------- next part -------------- An HTML attachment was scrubbed... URL: From predragp at andrew.cmu.edu Tue Aug 18 19:21:47 2020 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Tue, 18 Aug 2020 19:21:47 -0400 Subject: ipython hangs on Auton cluster In-Reply-To: <4F61E3DB-9F58-4879-9E18-E6B37711B2BB@andrew.cmu.edu> References: <8EA57DAE-B1FE-4998-B7FF-37245761F464@andrew.cmu.edu> <338E6BCA-A035-4E6C-9732-4DAF86E77FE5@andrew.cmu.edu> <4F61E3DB-9F58-4879-9E18-E6B37711B2BB@andrew.cmu.edu> Message-ID: I looked a bit more carefully. It could be an upstream bug. It wouldn't be the first time https://github.com/ipython/ipython/issues/11678 You don't need ipython to run Python code. You could work and debug your code on your local machine and just run production code on the server. A typical python code is just a script starting with a shebang followed by a path to the binaries. I fail to see how ipython could be useful for that. It is surely useful for interactive work. Predrag On Tue, Aug 18, 2020 at 5:45 PM Viraj Mehta wrote: > Tried this with 3.7 and 3.8 and it still hangs. Also if it's a good clue, > it doesn't stop even if I send SIGINT or SIGQUIT. Not really sure what's > going on here. > > On Aug 18, 2020, at 4:39 PM, Viraj Mehta wrote: > > Yeah, I'll give it a shot. Thanks! > > On Aug 18, 2020, at 4:38 PM, Predrag Punosevac > wrote: > > I just upgraded all /opt/conda-py37 and /opt/conda-py38 packages on both > GPU9 and GPU11. Could you please try again? Could you also try with py38 > which is now recommended and report back. If this works I will upgrade > packages across all servers. This could be potentially remotely related to > the fact that Ifigeneia could not build TensorFlow. Another thought is that > the ipython SQLite database is corrupted. > > Best, > Predrag > > On Tue, Aug 18, 2020 at 4:34 PM Viraj Mehta wrote: > >> Hi Predrag, >> >> Hope you're doing well.
I've been running into an issue the last couple >> days on the Auton cluster that is blocking my work on code that used to >> work and was hoping to get your thoughts. I have tried to distill this down >> to a small but replicable issue, as seen in the attachment, which I have >> seen hang on the ipython call on GPU9 and GPU11 so far. Do you know why >> this might be? Thanks. >> >> Best, >> Viraj > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From iapostol at andrew.cmu.edu Tue Aug 18 21:37:08 2020 From: iapostol at andrew.cmu.edu (Ifigeneia Apostolopoulou) Date: Tue, 18 Aug 2020 21:37:08 -0400 Subject: cuda problem In-Reply-To: <20200818212348.C8xKV%predragp@andrew.cmu.edu> References: <20200818061305.VcGzT%predragp@andrew.cmu.edu> <20200818212348.C8xKV%predragp@andrew.cmu.edu> Message-ID: Predrag, now it works fine. thanks a million! :-D gpu2,10,11,12,13,14,21 seem to have a similar issue. On Tue, Aug 18, 2020 at 5:23 PM Predrag Punosevac wrote: > Ifigeneia Apostolopoulou wrote: > > > yes, but there is still no bin/ptxas in cuda 10.2. actually there's no > bin > > directory. it seems that cuda-10.2 is corrupted? > > > > I took a clue from your message and did the fresh installation of CUDA > to GPU1 only. I upgraded the kernel and the driver to the latest one > supporting branch 7.8 of RedHat. The driver works as expected in my > limited testing. CUDA is upgraded to the newly released 11.0. I really > hate that NVidia is intensionally breaking previous stable releases as > soon as the new one is branched out. > > Could you please try building Tensor Flow in GPU1 and report the > progress? We will eventually have to upgrade all GPU nodes to CUDA 11 > even if they are fully working now. > > Best, > Predrag > > > > > On Tue, Aug 18, 2020 at 11:41 AM Predrag Punosevac < predragp at andrew.cmu.edu> > > wrote: > > > > > Because cuda folder is cuda 10.2 folder.
Cuda folder is typically just > a > > > symbolic link to the curen version of cuda. > > > > > > On Tue, Aug 18, 2020, 11:31 AM Kyle Miller > > > wrote: > > > > > >> I see. I ran a few find commands on gpu13, I couldn't find a cuda > folder > > >> or CUPTI. > > >> > > >> On Tue, Aug 18, 2020 at 10:00 AM Ifigeneia Apostolopoulou < > > >> iapostol at andrew.cmu.edu> wrote: > > >> > > >>> Hi Kyle, > > >>> Thanks a lot for your reply! > > >>> > > >>> I also had this issue and I solved it as you did. However, this > seems to > > >>> be another issue: > > >>> I currently can't see CUPTI in usr/local/cuda/extras/CUPTI (or > anywhere > > >>> in gpu1 to set it to my path) which causes the issue. > > >>> I am also attaching the screenshot with the working (gpu3) and > > >>> not-working (gpu1) case. In gpu1, gpu2, gpu13, it seems that the > directory > > >>> cuda (and all its content) has been moved (and I can't find it in > any other > > >>> directory). > > >>> > > >>> > > >>> > > >>> > > >>> > > >>> On Tue, Aug 18, 2020 at 9:32 AM Kyle Miller > > > >>> wrote: > > >>> > > >>>> Ifi, > > >>>> I recently had difficulty on GPU13, having not used it in a long > > >>>> while. For me, the issue was that miniconda had moved. I added > > >>>> /opt/miniconda-py38/bin to my path and rebuilt my environment (not > sure if > > >>>> that was necessary). Then it worked. > > >>>> -Kyle > > >>>> > > >>>> On Tue, Aug 18, 2020 at 2:14 AM Predrag Punosevac < > > >>>> predragp at andrew.cmu.edu> wrote: > > >>>> > > >>>>> Ifigeneia Apostolopoulou wrote: > > >>>>> > > >>>>> > Hi Predrag, > > >>>>> > > > >>>>> > I hope that this (weird) summer is going well! > > >>>>> > > > >>>>> > I noticed a change in servers gpu1, gpu2, gpu13, gpu14. > > >>>>> > Specifically, I no longer can find > > >>>>> > > >>>>> I have not touch those servers in a very long time. I am CC-ing > users > > >>>>> mailing list. My brain is shutting down at this late hour. 
Maybe > > >>>>> somebody could be of more help tomorrow morning. > > >>>>> > > >>>>> > > > >>>>> > /usr/local/cuda/extras/CUPTI > > >>>>> > > > >>>>> > > >>>>> I believe you. > > >>>>> > > >>>>> > > >>>>> > which results in the error when I'm building my tensorflow > models. > > >>>>> > > > >>>>> > Not found: ./bin/ptxas not found. Relying on driver to perform > ptx > > >>>>> > compilation. This message will be only logged once. > > >>>>> > > > >>>>> > Any ideas, how could I solve this issue? Would it be possible to > > >>>>> restore > > >>>>> > the cuda directory? > > >>>>> > > > >>>>> > Also, I currently do not have access to gpu21. > > >>>>> > > >>>>> It is fixed now. I just restarted sssd daemon. Please don't use > > gpu20 > > >>>>> and gpu21 unless you are training 3D neuronal networks for which > > you > > >>>>> need lot of GPU memory. > > >>>>> > > >>>>> Predrag > > >>>>> > > >>>>> > > > >>>>> > Thanks a lot in advance! > > >>>>> > > >>>> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From predragp at andrew.cmu.edu Tue Aug 18 21:44:29 2020 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Tue, 18 Aug 2020 21:44:29 -0400 Subject: cuda problem In-Reply-To: References: <20200818061305.VcGzT%predragp@andrew.cmu.edu> <20200818212348.C8xKV%predragp@andrew.cmu.edu> Message-ID: <20200819014429.fJrx8%predragp@andrew.cmu.edu> Ifigeneia Apostolopoulou wrote: > Predrag, now it works fine. thanks a million! :-D > > gpu2,10,11,12,13,14,21 seem to have a similar issue. I am going to sit on this info at least until Friday evening. You are not supposed to use more than 2-3 nodes at the same time anyway. If those servers work for other people who might not even use TensorFlow I would prefer not to reboot them. It takes about 1.5h to rebuild each machine. You just listed 7 machines. That is 10.5h of work if everything goes without a hitch.
Cheers, Predrag > > > > On Tue, Aug 18, 2020 at 5:23 PM Predrag Punosevac > wrote: > > > Ifigeneia Apostolopoulou wrote: > > > > > yes, but there is still no bin/ptxas in cuda 10.2. actually there's no > > bin > > > directory. it seems that cuda-10.2 is corrupted? > > > > > > > I took a clue from your message and did the fresh installation of CUDA > > to GPU1 only. I upgraded the kernel and the driver to the latest one > > supporting branch 7.8 of RedHat. The driver works as expected in my > > limited testing. CUDA is upgraded to the newly released 11.0. I really > > hate that NVidia is intensionally breaking previous stable releases as > > soon as the new one is branched out. > > > > Could you please try building Tensor Flow in GPU1 and report the > > progress? We will eventually have to upgrade all GPU nodes to CUDA 11 > > even if they are fully working now. > > > > Best, > > Predrag > > > > > > > > > On Tue, Aug 18, 2020 at 11:41 AM Predrag Punosevac < > > predragp at andrew.cmu.edu> > > > wrote: > > > > > > > Because cuda folder is cuda 10.2 folder. Cuda folder is typically just > > a > > > > symbolic link to the curen version of cuda. > > > > > > > > On Tue, Aug 18, 2020, 11:31 AM Kyle Miller > > > > wrote: > > > > > > > >> I see. I ran a few find commands on gpu13, I couldn't find a cuda > > folder > > > >> or CUPTI. > > > >> > > > >> On Tue, Aug 18, 2020 at 10:00 AM Ifigeneia Apostolopoulou < > > > >> iapostol at andrew.cmu.edu> wrote: > > > >> > > > >>> Hi Kyle, > > > >>> Thanks a lot for your reply! > > > >>> > > > >>> I also had this issue and I solved it as you did. However, this > > seems to > > > >>> be another issue: > > > >>> I currently can't see CUPTI in usr/local/cuda/extras/CUPTI (or > > anywhere > > > >>> in gpu1 to set it to my path) which causes the issue. > > > >>> I am also attaching the screenshot with the working (gpu3) and > > > >>> not-working (gpu1) case. 
In gpu1, gpu2, gpu13, it seems that the > > directory > > > >>> cuda (and all its content) has been moved (and I can't find it in > > any other > > > >>> directory). > > > >>> > > > >>> > > > >>> > > > >>> > > > >>> > > > >>> On Tue, Aug 18, 2020 at 9:32 AM Kyle Miller > > > > > >>> wrote: > > > >>> > > > >>>> Ifi, > > > >>>> I recently had difficulty on GPU13, having not used it in a long > > > >>>> while. For me, the issue was that miniconda had moved. I added > > > >>>> /opt/miniconda-py38/bin to my path and rebuilt my environment (not > > sure if > > > >>>> that was necessary). Then it worked. > > > >>>> -Kyle > > > >>>> > > > >>>> On Tue, Aug 18, 2020 at 2:14 AM Predrag Punosevac < > > > >>>> predragp at andrew.cmu.edu> wrote: > > > >>>> > > > >>>>> Ifigeneia Apostolopoulou wrote: > > > >>>>> > > > >>>>> > Hi Predrag, > > > >>>>> > > > > >>>>> > I hope that this (weird) summer is going well! > > > >>>>> > > > > >>>>> > I noticed a change in servers gpu1, gpu2, gpu13, gpu14. > > > >>>>> > Specifically, I no longer can find > > > >>>>> > > > >>>>> I have not touch those servers in a very long time. I am CC-ing > > users > > > >>>>> mailing list. My brain is shutting down at this late hour. Maybe > > > >>>>> somebody could be of more help tomorrow morning. > > > >>>>> > > > >>>>> > > > > >>>>> > /usr/local/cuda/extras/CUPTI > > > >>>>> > > > > >>>>> > > > >>>>> I believe you. > > > >>>>> > > > >>>>> > > > >>>>> > which results in the error when I'm building my tensorflow > > models. > > > >>>>> > > > > >>>>> > Not found: ./bin/ptxas not found. Relying on driver to perform > > ptx > > > >>>>> > compilation. This message will be only logged once. > > > >>>>> > > > > >>>>> > Any ideas, how could I solve this issue? Would it be possible to > > > >>>>> restore > > > >>>>> > the cuda directory? > > > >>>>> > > > > >>>>> > Also, I currently do not have access to gpu21. > > > >>>>> > > > >>>>> It is fixed now. I just restarted sssd daemon. 
Please don't use > > gpu20 > > > >>>>> and gpu21 unless you are training 3D neuronal networks for which > > you > > > >>>>> need lot of GPU memory. > > > >>>>> > > > >>>>> Predrag > > > >>>>> > > > >>>>> > > > > >>>>> > Thanks a lot in advance! > > > >>>>> > > > >>>> > > From iapostol at andrew.cmu.edu Tue Aug 18 22:03:23 2020 From: iapostol at andrew.cmu.edu (Ifigeneia Apostolopoulou) Date: Tue, 18 Aug 2020 22:03:23 -0400 Subject: cuda problem In-Reply-To: <20200819014429.fJrx8%predragp@andrew.cmu.edu> References: <20200818061305.VcGzT%predragp@andrew.cmu.edu> <20200818212348.C8xKV%predragp@andrew.cmu.edu> <20200819014429.fJrx8%predragp@andrew.cmu.edu> Message-ID: At least for me, Friday evening (or beyond) is fine. All these servers are currently very underutilized (or running very old processes with models probably compiled before the issue popped up). I am not sure if this is because other people have faced similar problems (with me being the first to 'complain'). In the meantime and for better job scheduling, it may be better if anyone who doesn't encounter a similar problem prefers one of those nodes (gpu2,10,11,12,13,14,21), though. thanks again and have a good night! On Tue, Aug 18, 2020 at 9:44 PM Predrag Punosevac wrote: > Ifigeneia Apostolopoulou wrote: > > > Predrag, now it works fine. thanks a million! :-D > > > > gpu2,10,11,12,13,14,21 seem to have a similar issue. > > I am going to sit on this info at least until Friday evening. You are > not supposed to use more than 2-3 nodes at the same time anyway. If > those servers work for other people who might not even use TensorFlow I > would prefer not to reboot them. It takes about 1.5h to rebuild each > machine. You just listed 7 machines. That is 10.5h of work if everything > goes without a hitch.
> > Cheers, > Predrag > > > > > > > > > On Tue, Aug 18, 2020 at 5:23 PM Predrag Punosevac < > predragp at andrew.cmu.edu> > > wrote: > > > > > Ifigeneia Apostolopoulou wrote: > > > > > > > yes, but there is still no bin/ptxas in cuda 10.2. actually there's > no > > > bin > > > > directory. it seems that cuda-10.2 is corrupted? > > > > > > > > > > I took a clue from your message and did the fresh installation of CUDA > > > to GPU1 only. I upgraded the kernel and the driver to the latest one > > > supporting branch 7.8 of RedHat. The driver works as expected in my > > > limited testing. CUDA is upgraded to the newly released 11.0. I really > > > hate that NVidia is intensionally breaking previous stable releases as > > > soon as the new one is branched out. > > > > > > Could you please try building Tensor Flow in GPU1 and report the > > > progress? We will eventually have to upgrade all GPU nodes to CUDA 11 > > > even if they are fully working now. > > > > > > Best, > > > Predrag > > > > > > > > > > > > > On Tue, Aug 18, 2020 at 11:41 AM Predrag Punosevac < > > > predragp at andrew.cmu.edu> > > > > wrote: > > > > > > > > > Because cuda folder is cuda 10.2 folder. Cuda folder is typically > just > > > a > > > > > symbolic link to the curen version of cuda. > > > > > > > > > > On Tue, Aug 18, 2020, 11:31 AM Kyle Miller < > mille856 at andrew.cmu.edu> > > > > > wrote: > > > > > > > > > >> I see. I ran a few find commands on gpu13, I couldn't find a cuda > > > folder > > > > >> or CUPTI. > > > > >> > > > > >> On Tue, Aug 18, 2020 at 10:00 AM Ifigeneia Apostolopoulou < > > > > >> iapostol at andrew.cmu.edu> wrote: > > > > >> > > > > >>> Hi Kyle, > > > > >>> Thanks a lot for your reply! > > > > >>> > > > > >>> I also had this issue and I solved it as you did. However, this > > > seems to > > > > >>> be another issue: > > > > >>> I currently can't see CUPTI in usr/local/cuda/extras/CUPTI (or > > > anywhere > > > > >>> in gpu1 to set it to my path) which causes the issue. 
> > > > >>> I am also attaching the screenshot with the working (gpu3) and > > > > >>> not-working (gpu1) case. In gpu1, gpu2, gpu13, it seems that the > > > directory > > > > >>> cuda (and all its content) has been moved (and I can't find it in > > > any other > > > > >>> directory). > > > > >>> > > > > >>> > > > > >>> > > > > >>> > > > > >>> > > > > >>> On Tue, Aug 18, 2020 at 9:32 AM Kyle Miller < > mille856 at andrew.cmu.edu > > > > > > > > >>> wrote: > > > > >>> > > > > >>>> Ifi, > > > > >>>> I recently had difficulty on GPU13, having not used it in a > long > > > > >>>> while. For me, the issue was that miniconda had moved. I added > > > > >>>> /opt/miniconda-py38/bin to my path and rebuilt my environment > (not > > > sure if > > > > >>>> that was necessary). Then it worked. > > > > >>>> -Kyle > > > > >>>> > > > > >>>> On Tue, Aug 18, 2020 at 2:14 AM Predrag Punosevac < > > > > >>>> predragp at andrew.cmu.edu> wrote: > > > > >>>> > > > > >>>>> Ifigeneia Apostolopoulou wrote: > > > > >>>>> > > > > >>>>> > Hi Predrag, > > > > >>>>> > > > > > >>>>> > I hope that this (weird) summer is going well! > > > > >>>>> > > > > > >>>>> > I noticed a change in servers gpu1, gpu2, gpu13, gpu14. > > > > >>>>> > Specifically, I no longer can find > > > > >>>>> > > > > >>>>> I have not touch those servers in a very long time. I am CC-ing > > > users > > > > >>>>> mailing list. My brain is shutting down at this late hour. > Maybe > > > > >>>>> somebody could be of more help tomorrow morning. > > > > >>>>> > > > > >>>>> > > > > > >>>>> > /usr/local/cuda/extras/CUPTI > > > > >>>>> > > > > > >>>>> > > > > >>>>> I believe you. > > > > >>>>> > > > > >>>>> > > > > >>>>> > which results in the error when I'm building my tensorflow > > > models. > > > > >>>>> > > > > > >>>>> > Not found: ./bin/ptxas not found. Relying on driver to > perform > > > ptx > > > > >>>>> > compilation. This message will be only logged once. 
> > > > >>>>> > > > > > >>>>> > Any ideas, how could I solve this issue? Would it be > possible to > > > > >>>>> restore > > > > >>>>> > the cuda directory? > > > > >>>>> > > > > > >>>>> > Also, I currently do not have access to gpu21. > > > > >>>>> > > > > >>>>> It is fixed now. I just restarted sssd daemon. Please don't use > > > gpu20 > > > > >>>>> and gpu21 unless you are training 3D neuronal networks for > which > > > you > > > > >>>>> need lot of GPU memory. > > > > >>>>> > > > > >>>>> Predrag > > > > >>>>> > > > > >>>>> > > > > >>>>> > > > > > >>>>> > Thanks a lot in advance! > > > > >>>>> > > > > >>>> > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From chiragn at cs.cmu.edu Tue Aug 18 22:19:07 2020 From: chiragn at cs.cmu.edu (Chirag Nagpal) Date: Tue, 18 Aug 2020 22:19:07 -0400 Subject: ipython hangs on Auton cluster In-Reply-To: References: <8EA57DAE-B1FE-4998-B7FF-37245761F464@andrew.cmu.edu> <338E6BCA-A035-4E6C-9732-4DAF86E77FE5@andrew.cmu.edu> <4F61E3DB-9F58-4879-9E18-E6B37711B2BB@andrew.cmu.edu> Message-ID: FWIW my recommendation is to set up your own conda environment and use the ipython version distributed with it. this way you can easily upgrade/modify your own python version without having to depend on the clusterwide distro On Tue, Aug 18, 2020 at 7:23 PM Predrag Punosevac wrote: > I looked a bit more carefully. It could be an upstream bug. It wouldn't be > the first time > > https://github.com/ipython/ipython/issues/11678 > > You don't need ipython to run Python code. You could work and debug your > code on your local machine and just run production code on the server. A > typical python code is just a script starting with a shebang following with > a path to the binaries. I fail to see how ipython could be useful for that. > It is surely useful for the interactive work. > > Predrag > > On Tue, Aug 18, 2020 at 5:45 PM Viraj Mehta wrote: > >> Tried this with 3.7 and 3.8 and it still hangs. 
Also if it's a good >>> clue, it doesn't stop even if I send SIGINT or SIGQUIT. Not really sure >>> what's going on here. >>> >>> On Aug 18, 2020, at 4:39 PM, Viraj Mehta wrote: >>> >>> Yeah, I'll give it a shot. Thanks! >>> >>> On Aug 18, 2020, at 4:38 PM, Predrag Punosevac >>> wrote: >>> >>> I just upgraded all /opt/conda-py37 and /opt/conda-py38 packages on both >>> GPU9 and GPU11. Could you please try again? Could you also try with py38 >>> which is now recommended and report back. If this works I will upgrade >>> packages across all servers. This could be potentially remotely related to >>> the fact that Ifigeneia could not build TensorFlow. Another thought is that >>> the ipython SQLite database is corrupted. >>> >>> Best, >>> Predrag >>> >>> On Tue, Aug 18, 2020 at 4:34 PM Viraj Mehta >>> wrote: >>> >>>> Hi Predrag, >>>> >>>> Hope you're doing well. I've been running into an issue the last couple >>>> days on the Auton cluster that is blocking my work on code that used to >>>> work and was hoping to get your thoughts. I have tried to distill this down >>>> to a small but replicable issue, as seen in the attachment, which I have >>>> seen hang on the ipython call on GPU9 and GPU11 so far. Do you know why >>>> this might be? Thanks. >>>> >>>> Best, >>>> Viraj >> >> >> >> -- *Chirag Nagpal* PhD Student, Auton Lab School of Computer Science Carnegie Mellon University cs.cmu.edu/~chiragn -------------- next part -------------- An HTML attachment was scrubbed...
URL: From virajm at andrew.cmu.edu Tue Aug 18 22:26:27 2020 From: virajm at andrew.cmu.edu (Viraj Mehta) Date: Tue, 18 Aug 2020 21:26:27 -0500 Subject: ipython hangs on Auton cluster In-Reply-To: References: <8EA57DAE-B1FE-4998-B7FF-37245761F464@andrew.cmu.edu> <338E6BCA-A035-4E6C-9732-4DAF86E77FE5@andrew.cmu.edu> <4F61E3DB-9F58-4879-9E18-E6B37711B2BB@andrew.cmu.edu> Message-ID: <4A943418-498F-494C-8B94-0CC61DE4E4EB@andrew.cmu.edu> I'm pretty sure it's not an upstream bug, as many environments (conda and virtualenv) which were working with ipython across several python versions before are now not working. I understand that ipython and ipdb aren't typically required for Python workflows but certain efforts, like stepping through code that requires a GPU and loads a model from the Auton cluster, are difficult to debug without ipdb. Is there anything else that has changed that might have broken it? Thanks, Viraj > On Aug 18, 2020, at 6:21 PM, Predrag Punosevac wrote: > > I looked a bit more carefully. It could be an upstream bug. It wouldn't be the first time > > https://github.com/ipython/ipython/issues/11678 > > You don't need ipython to run Python code. You could work and debug your code on your local machine and just run production code on the server. A typical python code is just a script starting with a shebang following with a path to the binaries. I fail to see how ipython could be useful for that. It is surely useful for the interactive work. > > Predrag > > On Tue, Aug 18, 2020 at 5:45 PM Viraj Mehta > wrote: > Tried this with 3.7 and 3.8 and it still hangs. Also if it's a good clue, it doesn't stop even if I send SIGINT or SIGQUIT. Not really sure what's going on here. > >> On Aug 18, 2020, at 4:39 PM, Viraj Mehta > wrote: >> >> Yeah, I'll give it a shot. Thanks! >> >>> On Aug 18, 2020, at 4:38 PM, Predrag Punosevac > wrote: >>> >>> I just upgraded all /opt/conda-py37 and /opt/conda-py38 packages on both GPU9 and GPU11. Could you please try again?
Could you also try with py38 which is now recommended and report back. If this works I will upgrade packages across all servers. This could be potentially remotely related to the fact that Ifigeneia could not build TensorFlow. Another thought is that the ipython SQLite database is corrupted. >>> >>> Best, >>> Predrag >>> >>> On Tue, Aug 18, 2020 at 4:34 PM Viraj Mehta > wrote: >>> Hi Predrag, >>> >>> Hope you're doing well. I've been running into an issue the last couple days on the Auton cluster that is blocking my work on code that used to work and was hoping to get your thoughts. I have tried to distill this down to a small but replicable issue, as seen in the attachment, which I have seen hang on the ipython call on GPU9 and GPU11 so far. Do you know why this might be? Thanks. >>> >>> Best, >>> Viraj >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From predragp at andrew.cmu.edu Tue Aug 18 22:35:11 2020 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Tue, 18 Aug 2020 22:35:11 -0400 Subject: ipython hangs on Auton cluster In-Reply-To: <4A943418-498F-494C-8B94-0CC61DE4E4EB@andrew.cmu.edu> References: <8EA57DAE-B1FE-4998-B7FF-37245761F464@andrew.cmu.edu> <338E6BCA-A035-4E6C-9732-4DAF86E77FE5@andrew.cmu.edu> <4F61E3DB-9F58-4879-9E18-E6B37711B2BB@andrew.cmu.edu> <4A943418-498F-494C-8B94-0CC61DE4E4EB@andrew.cmu.edu> Message-ID: <20200819023511.nIWfP%predragp@andrew.cmu.edu> Viraj Mehta wrote: > I'm pretty sure it's not an upstream bug, as many environments > (conda and virtualenv) which were working with ipython across several > python versions before are now not working. > > I understand that ipython and ipdb aren't typically required for > Python workflows but certain efforts, like stepping through code that > requires a GPU and loads a model from the Auton cluster, are difficult > to debug without ipdb. Is there anything else that has changed that > might have broken it? Nothing that I am aware of.
However, you do understand that the system is very complex and it is like a live organism constantly morphing. Best, Predrag > > Thanks, > Viraj > > > On Aug 18, 2020, at 6:21 PM, Predrag Punosevac wrote: > > > > I looked a bit more carefully. It could be an upstream bug. It wouldn't be the first time > > > > https://github.com/ipython/ipython/issues/11678 > > > > You don't need ipython to run Python code. You could work and debug your code on your local machine and just run production code on the server. A typical python code is just a script starting with a shebang following with a path to the binaries. I fail to see how ipython could be useful for that. It is surely useful for the interactive work. > > > > Predrag > > > > On Tue, Aug 18, 2020 at 5:45 PM Viraj Mehta > wrote: > > Tried this with 3.7 and 3.8 and it still hangs. Also if it's a good clue, it doesn't stop even if I send SIGINT or SIGQUIT. Not really sure what's going on here. > > > >> On Aug 18, 2020, at 4:39 PM, Viraj Mehta > wrote: > >> > >> Yeah, I'll give it a shot. Thanks! > >> > >>> On Aug 18, 2020, at 4:38 PM, Predrag Punosevac > wrote: > >>> > >>> I just upgraded all /opt/conda-py37 and /opt/conda-py38 packages on both GPU9 and GPU11. Could you please try again? Could you also try with py38 which is now recommended and report back. If this works I will upgrade packages across all servers. This could be potentially remotely related to the fact that Ifigeneia could not build TensorFlow. Another thought is that the ipython SQLite database is corrupted. > >>> > >>> Best, > >>> Predrag > >>> > >>> On Tue, Aug 18, 2020 at 4:34 PM Viraj Mehta > wrote: > >>> Hi Predrag, > >>> > >>> Hope you're doing well. I've been running into an issue the last couple days on the Auton cluster that is blocking my work on code that used to work and was hoping to get your thoughts.
I have tried to distill this down to a small but replicable issue, as seen in the attachment, which I have seen hang on the ipython call on GPU9 and GPU11 so far. Do you know why this might be? Thanks. > >>> > >>> Best, > >>> Viraj > >> > > > From chiragn at cs.cmu.edu Tue Aug 18 23:22:15 2020 From: chiragn at cs.cmu.edu (Chirag Nagpal) Date: Tue, 18 Aug 2020 23:22:15 -0400 Subject: ipython hangs on Auton cluster In-Reply-To: References: <8EA57DAE-B1FE-4998-B7FF-37245761F464@andrew.cmu.edu> <338E6BCA-A035-4E6C-9732-4DAF86E77FE5@andrew.cmu.edu> <4F61E3DB-9F58-4879-9E18-E6B37711B2BB@andrew.cmu.edu> Message-ID: Hi Viraj The easiest option is to install anaconda on your home directory mounted on the NFS, instead of separately installing conda on each cluster node. (Since each cluster node is x64 you can expect conda compiled on one of the machines to run on the other machines too. All the numpy math operations are taken care of by MKL/BLAS, and so as long as MKL and BLAS on each machine are correctly configured you will not experience a drop in performance.) Steps: $ wget https://repo.anaconda.com/archive/Anaconda3-2020.07-Linux-x86_64.sh and then run the downloaded installer: $ bash Anaconda3-2020.07-Linux-x86_64.sh Make sure after the installation is complete that the conda directory is exported onto your PATH. (The installer will do this automatically; in case it doesn't work you can export it in your bashrc.) Chirag On Tue, Aug 18, 2020 at 10:20 PM Viraj Mehta wrote: > Hi Chirag, > > Where do you install your own Conda environment? Scratch? Any other tips > on getting that done? > > Thanks, > Viraj > > On Aug 18, 2020, at 9:19 PM, Chirag Nagpal wrote: > > FWIW my recommendation is to set up your own conda environment and use > the ipython version distributed with it.
this way you can > easily upgrade/modify your own python version without having to depend on > the clusterwide distro > > On Tue, Aug 18, 2020 at 7:23 PM Predrag Punosevac > wrote: > >> I looked a bit more carefully. It could be an upstream bug. It wouldn't >> be the first time >> >> https://github.com/ipython/ipython/issues/11678 >> >> You don't need ipython to run Python code. You could work and debug your >> code on your local machine and just run production code on the server. A >> typical python code is just a script starting with a shebang followed by >> a path to the binaries. I fail to see how ipython could be useful for that. >> It is surely useful for the interactive work. >> >> Predrag >> >> On Tue, Aug 18, 2020 at 5:45 PM Viraj Mehta >> wrote: >> >>> Tried this with 3.7 and 3.8 and it still hangs. Also if it's a good >>> clue, it doesn't stop even if I send SIGINT or SIGQUIT. Not really sure >>> what's going on here. >>> >>> On Aug 18, 2020, at 4:39 PM, Viraj Mehta wrote: >>> >>> Yeah, I'll give it a shot. Thanks! >>> >>> On Aug 18, 2020, at 4:38 PM, Predrag Punosevac >>> wrote: >>> >>> I just upgraded all /opt/conda-py37 and /opt/conda-py38 packages on both >>> GPU9 and GPU11. Could you please try again? Could you also try with py38 >>> which is now recommended and report back. If this works I will upgrade >>> packages across all servers. This could be potentially remotely related to >>> the fact that Ifigeneia could not build TensorFlow. Another thought is that >>> the ipython SQLite database is corrupted. >>> >>> Best, >>> Predrag >>> >>> On Tue, Aug 18, 2020 at 4:34 PM Viraj Mehta >>> wrote: >>> >>>> Hi Predrag, >>>> >>>> Hope you're doing well. I've been running into an issue the last couple >>>> days on the Auton cluster that is blocking my work on code that used to >>>> work and was hoping to get your thoughts.
I have tried to distill this down >>>> to a small but replicable issue, as seen in the attachment, which I have >>>> seen hang on the ipython call on GPU9 and GPU11 so far. Do you know why >>>> this might be? Thanks. >>>> >>>> Best, >>>> Viraj >>> >>> >>> >>> > > -- > > *Chirag Nagpal* PhD Student, Auton Lab > School of Computer Science > Carnegie Mellon University > cs.cmu.edu/~chiragn > > > -- *Chirag Nagpal* PhD Student, Auton Lab School of Computer Science Carnegie Mellon University cs.cmu.edu/~chiragn -------------- next part -------------- An HTML attachment was scrubbed... URL: From chufang at andrew.cmu.edu Wed Aug 19 12:39:04 2020 From: chufang at andrew.cmu.edu (Chufan Gao) Date: Wed, 19 Aug 2020 12:39:04 -0400 Subject: cuda problem In-Reply-To: References: <20200818061305.VcGzT%predragp@andrew.cmu.edu> <20200818212348.C8xKV%predragp@andrew.cmu.edu> <20200819014429.fJrx8%predragp@andrew.cmu.edu> Message-ID: Hi All, I ran into a side issue where tensorflow does indeed detect all of the gpus, but pytorch now doesn't work. I did some fiddling, and I figured out that installing pytorch via conda doesn't link up with cuda correctly, but reinstalling it through pip does. So if anyone else is having this issue, try reinstalling through pip. On Tue, Aug 18, 2020 at 10:07 PM Ifigeneia Apostolopoulou < iapostol at andrew.cmu.edu> wrote: > At least for me, Friday evening (or beyond) is fine. All these servers are > currently very underutilized > (or running very old processes with models probably compiled before the > issue popped up). I am not sure if this is because > other people have faced similar problems (with me being the first to > 'complaint'). In the meantime and for better job scheduling, > it may be better if anyone who doesn't encounter a similar problem, > prefers one of those nodes (gpu2,10,11,12,13,14,21), though. > > thanks again and have a good night! 
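[Editor's note] The missing-CUDA reports in this thread (no bin directory, no ptxas, no CUPTI) all come down to whether the conventional /usr/local/cuda layout is intact on a given node. A minimal, hedged diagnostic sketch follows; the helper name is ours (not an existing lab tool), and the checked paths are simply the standard CUDA install layout the thread refers to:

```python
import os

def check_cuda_layout(prefix="/usr/local/cuda"):
    """Report which pieces of the conventional CUDA install exist under prefix.

    /usr/local/cuda is normally just a symlink to the current version
    (e.g. /usr/local/cuda-10.2), so a dangling symlink makes everything vanish.
    """
    targets = {
        "cuda root": prefix,
        "ptxas": os.path.join(prefix, "bin", "ptxas"),
        "CUPTI": os.path.join(prefix, "extras", "CUPTI"),
    }
    # os.path.exists follows symlinks, so a dangling /usr/local/cuda reports False.
    return {name: os.path.exists(path) for name, path in targets.items()}

if __name__ == "__main__":
    for name, present in check_cuda_layout().items():
        print(f"{name}: {'found' if present else 'MISSING'}")
```

Running this on a node that emits the "./bin/ptxas not found" warning shows which of the three paths disappeared, which narrows the fix to restoring the symlink versus reinstalling CUDA.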
> > On Tue, Aug 18, 2020 at 9:44 PM Predrag Punosevac > wrote: > >> Ifigeneia Apostolopoulou wrote: >> >> > Predrag, now it works fine. thanks a million! :-D >> > >> > gpu2,10,11,12,13,14,21 seem to have a similar issue. >> >> I am going to sit on this info at least until Friday evening. You are >> not supposed to use more than 2-3 nodes at the same time anyway. If >> those servers work for other people who might not even use TensorFlow I >> would prefer not to reboot them. It takes about 1.5h to rebuild each >> machine. You just listed 7 machines. That is 10.5h of work if everything >> goes without a hitch. >> >> Cheers, >> Predrag >> >> > >> > >> > >> > On Tue, Aug 18, 2020 at 5:23 PM Predrag Punosevac < >> predragp at andrew.cmu.edu> >> > wrote: >> > >> > > Ifigeneia Apostolopoulou wrote: >> > > >> > > > yes, but there is still no bin/ptxas in cuda 10.2. actually >> there's no >> > > bin >> > > > directory. it seems that cuda-10.2 is corrupted? >> > > > >> > > >> > > I took a clue from your message and did the fresh installation of CUDA >> > > to GPU1 only. I upgraded the kernel and the driver to the latest one >> > > supporting branch 7.8 of RedHat. The driver works as expected in my >> > > limited testing. CUDA is upgraded to the newly released 11.0. I really >> > > hate that NVidia is intentionally breaking previous stable releases as >> > > soon as the new one is branched out. >> > > >> > > Could you please try building TensorFlow on GPU1 and report the >> > > progress? We will eventually have to upgrade all GPU nodes to CUDA 11 >> > > even if they are fully working now. >> > > >> > > Best, >> > > Predrag >> > > >> > > >> > > >> > > > On Tue, Aug 18, 2020 at 11:41 AM Predrag Punosevac < >> > > predragp at andrew.cmu.edu> >> > > > wrote: >> > > > >> > > > > Because the cuda folder is the cuda 10.2 folder. The cuda folder is typically >> just >> > > a >> > > > > symbolic link to the current version of cuda.
>> > > > > >> > > > > On Tue, Aug 18, 2020, 11:31 AM Kyle Miller < >> mille856 at andrew.cmu.edu> >> > > > > wrote: >> > > > > >> > > > >> I see. I ran a few find commands on gpu13, I couldn't find a cuda >> > > folder >> > > > >> or CUPTI. >> > > > >> >> > > > >> On Tue, Aug 18, 2020 at 10:00 AM Ifigeneia Apostolopoulou < >> > > > >> iapostol at andrew.cmu.edu> wrote: >> > > > >> >> > > > >>> Hi Kyle, >> > > > >>> Thanks a lot for your reply! >> > > > >>> >> > > > >>> I also had this issue and I solved it as you did. However, this >> > > seems to >> > > > >>> be another issue: >> > > > >>> I currently can't see CUPTI in usr/local/cuda/extras/CUPTI (or >> > > anywhere >> > > > >>> in gpu1 to set it to my path) which causes the issue. >> > > > >>> I am also attaching the screenshot with the working (gpu3) and >> > > > >>> not-working (gpu1) case. In gpu1, gpu2, gpu13, it seems that the >> > > directory >> > > > >>> cuda (and all its content) has been moved (and I can't find it >> in >> > > any other >> > > > >>> directory). >> > > > >>> >> > > > >>> >> > > > >>> >> > > > >>> >> > > > >>> >> > > > >>> On Tue, Aug 18, 2020 at 9:32 AM Kyle Miller < >> mille856 at andrew.cmu.edu >> > > > >> > > > >>> wrote: >> > > > >>> >> > > > >>>> Ifi, >> > > > >>>> I recently had difficulty on GPU13, having not used it in a >> long >> > > > >>>> while. For me, the issue was that miniconda had moved. I added >> > > > >>>> /opt/miniconda-py38/bin to my path and rebuilt my environment >> (not >> > > sure if >> > > > >>>> that was necessary). Then it worked. >> > > > >>>> -Kyle >> > > > >>>> >> > > > >>>> On Tue, Aug 18, 2020 at 2:14 AM Predrag Punosevac < >> > > > >>>> predragp at andrew.cmu.edu> wrote: >> > > > >>>> >> > > > >>>>> Ifigeneia Apostolopoulou wrote: >> > > > >>>>> >> > > > >>>>> > Hi Predrag, >> > > > >>>>> > >> > > > >>>>> > I hope that this (weird) summer is going well! >> > > > >>>>> > >> > > > >>>>> > I noticed a change in servers gpu1, gpu2, gpu13, gpu14. 
>> > > > >>>>> > Specifically, I no longer can find >> > > > >>>>> >> > > > >>>>> I have not touched those servers in a very long time. I am >> CC-ing >> users >> > > > >>>>> mailing list. My brain is shutting down at this late hour. >> Maybe >> > > > >>>>> somebody could be of more help tomorrow morning. >> > > > >>>>> >> > > > >>>>> > >> > > > >>>>> > /usr/local/cuda/extras/CUPTI >> > > > >>>>> > >> > > > >>>>> >> > > > >>>>> I believe you. >> > > > >>>>> >> > > > >>>>> >> > > > >>>>> > which results in the error when I'm building my tensorflow >> > > models. >> > > > >>>>> > >> > > > >>>>> > Not found: ./bin/ptxas not found. Relying on driver to >> perform >> ptx >> > > > >>>>> > compilation. This message will be only logged once. >> > > > >>>>> > >> > > > >>>>> > Any ideas, how could I solve this issue? Would it be >> possible to >> > > > >>>>> restore >> > > > >>>>> > the cuda directory? >> > > > >>>>> > >> > > > >>>>> > Also, I currently do not have access to gpu21. >> > > > >>>>> >> > > > >>>>> It is fixed now. I just restarted the sssd daemon. Please don't >> use >> gpu20 >> > > > >>>>> and gpu21 unless you are training 3D neural networks for >> which >> you >> > > > >>>>> need a lot of GPU memory. >> > > > >>>>> >> > > > >>>>> Predrag >> > > > >>>>> >> > > > >>>>> >> > > > >>>>> > >> > > > >>>>> > Thanks a lot in advance! >> > > > >>>>> >> > > > >>>> >> > > >> > -------------- next part -------------- An HTML attachment was scrubbed...
URL: From virajm at andrew.cmu.edu Wed Aug 19 18:03:32 2020 From: virajm at andrew.cmu.edu (Viraj Mehta) Date: Wed, 19 Aug 2020 17:03:32 -0500 Subject: ipython hangs on Auton cluster In-Reply-To: <213c9d760fdd4a1888d9a20ebec816de@andrew.cmu.edu> References: <8EA57DAE-B1FE-4998-B7FF-37245761F464@andrew.cmu.edu> <338E6BCA-A035-4E6C-9732-4DAF86E77FE5@andrew.cmu.edu> <4F61E3DB-9F58-4879-9E18-E6B37711B2BB@andrew.cmu.edu> <4A943418-498F-494C-8B94-0CC61DE4E4EB@andrew.cmu.edu> <20200819023511.nIWfP%predragp@andrew.cmu.edu> <213c9d760fdd4a1888d9a20ebec816de@andrew.cmu.edu> Message-ID: Hi Predrag & Users, I have a clue as to what is wrong with our cluster. Had a few processes running which broke due to this sqlite error from ipython: I'd imagine this is what is wrong with all our ipython stuff. No idea how to debug this, but I hope it can be helpful as we try to fix this. Thanks, Viraj > On Aug 18, 2020, at 10:28 PM, Chufan Gao wrote: > > Hi All, > > Rachel and I are also facing a similar issue with our Jupyter notebooks. > We also both reinstalled jupyter with no effect. > For me, these notebooks are extremely helpful in fast code iteration and testing out concepts. > I also have the intuition that it is an upstream issue, as they were running fine (without any changes) before lov2 went down. > Would you please take another look? > > Worst case, I have to convert my notebooks into .py files, which will slow things down. > > Sincerely, > Chufan (Andy) Gao > From: Autonlab-users > on behalf of Predrag Punosevac > > Sent: Tuesday, August 18, 2020 10:35:11 PM > To: Viraj Mehta > Cc: users at autonlab.org > Subject: Re: ipython hangs on Auton cluster > > Viraj Mehta > wrote: > > > I'm pretty sure it's not an upstream bug, as many environments > > (conda and virtualenv) which were working with ipython across several > > python versions before are now not working.
> > > > I understand that ipython and ipdb aren't typically required for > > Python workflows but certain efforts, like stepping through code that > > requires a GPU and loads a model from the Auton cluster, are difficult > > to debug without ipdb. Is there anything else that has changed that > > might have broken it? > > Nothing that I am aware of. However, you do understand that the system > is very complex and it is like a live organism constantly morphing. > > Best, > Predrag > > > > > > > Thanks, > > Viraj > > > > > On Aug 18, 2020, at 6:21 PM, Predrag Punosevac > wrote: > > > > > > I looked a bit more carefully. It could be an upstream bug. It wouldn't be the first time > > > > > > https://github.com/ipython/ipython/issues/11678 > > > ipython won't start · Issue #11678 · ipython/ipython · GitHub > github.com > Now I'm facing that ipython won't start without any error messages. I tried to run it with DEBUG, then the command will be "uninterruptible sleep" after the logs. $ pyenv global system $ python --version Python 2.7.5 $ ipython --version ... > > > > > > > > You don't need ipython to run Python code. You could work and debug your code on your local machine and just run production code on the server. A typical python code is just a script starting with a shebang followed by a path to the binaries. I fail to see how ipython could be useful for that. It is surely useful for the interactive work. > > > > > > Predrag > > > > > > On Tue, Aug 18, 2020 at 5:45 PM Viraj Mehta >> wrote: > > > Tried this with 3.7 and 3.8 and it still hangs. Also if it's a good clue, it doesn't stop even if I send SIGINT or SIGQUIT.
Not really sure what's going on here. > > > > > >> On Aug 18, 2020, at 4:39 PM, Viraj Mehta >> wrote: > > >> > > >> Yeah, I'll give it a shot. Thanks! > > >> > > >>> On Aug 18, 2020, at 4:38 PM, Predrag Punosevac >> wrote: > > >>> > > >>> I just upgraded all /opt/conda-py37 and /opt/conda-py38 packages on both GPU9 and GPU11. Could you please try again? Could you also try with py38 which is now recommended and report back. If this works I will upgrade packages across all servers. This could be potentially remotely related to the fact that Ifigeneia could not build TensorFlow. Another thought is that the ipython SQLite database is corrupted. > > >>> > > >>> Best, > > >>> Predrag > > >>> > > >>> On Tue, Aug 18, 2020 at 4:34 PM Viraj Mehta >> wrote: > > >>> Hi Predrag, > > >>> > > >>> Hope you're doing well. I've been running into an issue the last couple days on the Auton cluster that is blocking my work on code that used to work and was hoping to get your thoughts. I have tried to distill this down to a small but replicable issue, as seen in the attachment, which I have seen hang on the ipython call on GPU9 and GPU11 so far. Do you know why this might be? Thanks. > > >>> > > >>> Best, > > >>> Viraj > > >> > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed...
Name: PastedGraphic-1.png Type: image/png Size: 534822 bytes Desc: not available URL: From predragp at andrew.cmu.edu Wed Aug 19 18:46:45 2020 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Wed, 19 Aug 2020 18:46:45 -0400 Subject: ipython hangs on Auton cluster In-Reply-To: References: <8EA57DAE-B1FE-4998-B7FF-37245761F464@andrew.cmu.edu> <338E6BCA-A035-4E6C-9732-4DAF86E77FE5@andrew.cmu.edu> <4F61E3DB-9F58-4879-9E18-E6B37711B2BB@andrew.cmu.edu> <4A943418-498F-494C-8B94-0CC61DE4E4EB@andrew.cmu.edu> <20200819023511.nIWfP%predragp@andrew.cmu.edu> <213c9d760fdd4a1888d9a20ebec816de@andrew.cmu.edu> Message-ID: Your report indicates that my gut feeling that the SQLite database is the culprit seems to be correct. Per our documentation https://www.autonlab.org/autonlab_wiki/aetiquette.html#don-ts *Use your scratch directory to store Jupyter sqlite database!* You placed your SQLite database onto the NFS share (zfsauton2) and you are surprised that it is incoherent. I hope you understand now better the lack of urgency in my responses. Predrag On Wed, Aug 19, 2020 at 6:03 PM Viraj Mehta wrote: > Hi Predrag & Users, > > I have a clue as to what is wrong with our cluster. Had a few processes > running which broke due to this sqlite error from ipython: > I'd imagine this is what is wrong with all our ipython stuff. No idea how > to debug this, but I hope it can be helpful as we try to fix this. > > Thanks, > Viraj > > On Aug 18, 2020, at 10:28 PM, Chufan Gao wrote: > > Hi All, > > Rachel and I are also facing a similar issue with our Jupyter notebooks. > We also both reinstalled jupyter with no effect. > > For me, these notebooks are extremely helpful in fast code iteration and > testing out concepts. > I also have the intuition that it is an upstream issue, as they were > running fine (without any changes) before lov2 went down. > Would you please take another look? > > Worst case, I have to convert my notebooks into .py files, which will slow > things down.
> > Sincerely, > Chufan (Andy) Gao > ------------------------------ > *From:* Autonlab-users on behalf of > Predrag Punosevac > *Sent:* Tuesday, August 18, 2020 10:35:11 PM > *To:* Viraj Mehta > *Cc:* users at autonlab.org > *Subject:* Re: ipython hangs on Auton cluster > > Viraj Mehta wrote: > > > I'm pretty sure it's not an upstream bug, as many environments > > (conda and virtualenv) which were working with ipython across several > > python versions before are now not working. > > > > I understand that ipython and ipdb aren't typically required for > > Python workflows but certain efforts, like stepping through code that > > requires a GPU and loads a model from the Auton cluster, are difficult > > to debug without ipdb. Is there anything else that has changed that > > might have broken it? > > Nothing that I am aware of. However, you do understand that the system > is very complex and it is like a live organism constantly morphing. > > Best, > Predrag > > > > > > > Thanks, > > Viraj > > > > > On Aug 18, 2020, at 6:21 PM, Predrag Punosevac < > predragp at andrew.cmu.edu> wrote: > > > > > > I looked a bit more carefully. It could be an upstream bug. It > wouldn't be the first time > > > > > > https://github.com/ipython/ipython/issues/11678 < > https://github.com/ipython/ipython/issues/11678> > > ipython won't start · Issue #11678 · ipython/ipython · GitHub > > github.com > Now I'm facing that ipython won't start without any error messages. I > tried to run it with DEBUG, then the command will be "uninterruptible > sleep" after the logs. $ pyenv global system $ python --version Python > 2.7.5 $ ipython --version ...
> > > > > > > > You don't need ipython to run Python code. You could work and debug > your code on your local machine and just run production code on the server. > A typical python code is just a script starting with a shebang followed > by a path to the binaries. I fail to see how ipython could be useful for > that. It is surely useful for the interactive work. > > > > > > Predrag > > > > > > On Tue, Aug 18, 2020 at 5:45 PM Viraj Mehta mailto:virajm at andrew.cmu.edu >> wrote: > > > Tried this with 3.7 and 3.8 and it still hangs. Also if it's a good > clue, it doesn't stop even if I send SIGINT or SIGQUIT. Not really sure > what's going on here. > > > > > >> On Aug 18, 2020, at 4:39 PM, Viraj Mehta mailto:virajm at andrew.cmu.edu >> wrote: > > >> > > >> Yeah, I'll give it a shot. Thanks! > > >> > > >>> On Aug 18, 2020, at 4:38 PM, Predrag Punosevac < predragp at andrew.cmu.edu >> wrote: > > >>> > > >>> I just upgraded all /opt/conda-py37 and /opt/conda-py38 packages on > both GPU9 and GPU11. Could you please try again? Could you also try with > py38 which is now recommended and report back. If this works I will upgrade > packages across all servers. This could be potentially remotely related to > the fact that Ifigeneia could not build TensorFlow. Another thought is that > the ipython SQLite database is corrupted. > > >>> > > >>> Best, > > >>> Predrag > > >>> > > >>> On Tue, Aug 18, 2020 at 4:34 PM Viraj Mehta mailto:virajm at andrew.cmu.edu >> wrote: > > >>> Hi Predrag, > > >>> > > >>> Hope you're doing well. I've been running into an issue the last > couple days on the Auton cluster that is blocking my work on code that used > to work and was hoping to get your thoughts. I have tried to distill this > down to a small but replicable issue, as seen in the attachment, which I > have seen hang on the ipython call on GPU9 and GPU11 so far. Do you know > why this might be? Thanks.
> > >>> > > >>> Best, > > >>> Viraj > > >> > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: PastedGraphic-1.png Type: image/png Size: 534822 bytes Desc: not available URL: From virajm at andrew.cmu.edu Wed Aug 19 19:08:58 2020 From: virajm at andrew.cmu.edu (Viraj Mehta) Date: Wed, 19 Aug 2020 18:08:58 -0500 Subject: ipython hangs on Auton cluster In-Reply-To: References: <8EA57DAE-B1FE-4998-B7FF-37245761F464@andrew.cmu.edu> <338E6BCA-A035-4E6C-9732-4DAF86E77FE5@andrew.cmu.edu> <4F61E3DB-9F58-4879-9E18-E6B37711B2BB@andrew.cmu.edu> <4A943418-498F-494C-8B94-0CC61DE4E4EB@andrew.cmu.edu> <20200819023511.nIWfP%predragp@andrew.cmu.edu> <213c9d760fdd4a1888d9a20ebec816de@andrew.cmu.edu> Message-ID: <59B1882C-1AFC-44C9-8429-ACEFB7744C5F@andrew.cmu.edu> That makes sense. Sorry for the trouble. Do you know how to make sure that goes on scratch? It is not obvious to me upon Googling. Might be helpful to add that to the wiki so that we all know how to avoid this in the future. Thanks a bunch, Viraj > On Aug 19, 2020, at 5:46 PM, Predrag Punosevac wrote: > > Your report indicates that my gut feeling that the SQLite database is the culprit seems to be correct. Per our documentation > > https://www.autonlab.org/autonlab_wiki/aetiquette.html#don-ts > > Use your scratch directory to store Jupyter sqlite database! > > You placed your SQLite database onto the NFS share (zfsauton2) and you are surprised that it is incoherent. I hope you understand now better the lack of urgency in my responses. > > Predrag > > On Wed, Aug 19, 2020 at 6:03 PM Viraj Mehta > wrote: > Hi Predrag & Users, > > I have a clue as to what is wrong with our cluster. Had a few processes running which broke due to this sqlite error from ipython: > I'd imagine this is what is wrong with all our ipython stuff. No idea how to debug this, but I hope it can be helpful as we try to fix this.
> > Thanks, > Viraj > >> On Aug 18, 2020, at 10:28 PM, Chufan Gao > wrote: >> >> Hi All, >> >> Rachel and I are also facing a similar issue with our Jupyter notebooks. >> We also both reinstalled jupyter with no effect. >> For me, these notebooks are extremely helpful in fast code iteration and testing out concepts. >> I also have the intuition that it is an upstream issue, as they were running fine (without any changes) before lov2 went down. >> Would you please take another look? >> >> Worst case, I have to convert my notebooks into .py files, which will slow things down. >> >> Sincerely, >> Chufan (Andy) Gao >> From: Autonlab-users > on behalf of Predrag Punosevac > >> Sent: Tuesday, August 18, 2020 10:35:11 PM >> To: Viraj Mehta >> Cc: users at autonlab.org >> Subject: Re: ipython hangs on Auton cluster >> >> Viraj Mehta > wrote: >> >> > I'm pretty sure it's not an upstream bug, as many environments >> > (conda and virtualenv) which were working with ipython across several >> > python versions before are now not working. >> > >> > I understand that ipython and ipdb aren't typically required for >> > Python workflows but certain efforts, like stepping through code that >> > requires a GPU and loads a model from the Auton cluster, are difficult >> > to debug without ipdb. Is there anything else that has changed that >> > might have broken it? >> >> Nothing that I am aware of. However, you do understand that the system >> is very complex and it is like a live organism constantly morphing. >> >> Best, >> Predrag >> >> >> >> > >> > Thanks, >> > Viraj >> > >> > > On Aug 18, 2020, at 6:21 PM, Predrag Punosevac > wrote: >> > > >> > > I looked a bit more carefully. It could be an upstream bug. It wouldn't be the first time >> > > >> > > https://github.com/ipython/ipython/issues/11678 > >> >> ipython won't start · Issue #11678 · ipython/ipython · GitHub >> github.com >> Now I'm facing that ipython won't start without any error messages. I tried to run it with DEBUG, then the command will be "uninterruptible sleep" after the logs. $ pyenv global system $ python --version Python 2.7.5 $ ipython --version ... >> >> >> > > >> > > You don't need ipython to run Python code. You could work and debug your code on your local machine and just run production code on the server. A typical python code is just a script starting with a shebang followed by a path to the binaries. I fail to see how ipython could be useful for that. It is surely useful for the interactive work. >> > > >> > > Predrag >> > > >> > > On Tue, Aug 18, 2020 at 5:45 PM Viraj Mehta >> wrote: >> > > Tried this with 3.7 and 3.8 and it still hangs. Also if it's a good clue, it doesn't stop even if I send SIGINT or SIGQUIT. Not really sure what's going on here. >> > > >> > >> On Aug 18, 2020, at 4:39 PM, Viraj Mehta >> wrote: >> > >> >> > >> Yeah, I'll give it a shot. Thanks! >> > >> >> > >>> On Aug 18, 2020, at 4:38 PM, Predrag Punosevac >> wrote: >> > >>> >> > >>> I just upgraded all /opt/conda-py37 and /opt/conda-py38 packages on both GPU9 and GPU11. Could you please try again? Could you also try with py38 which is now recommended and report back. If this works I will upgrade packages across all servers. This could be potentially remotely related to the fact that Ifigeneia could not build TensorFlow. Another thought is that the ipython SQLite database is corrupted. >> > >>> >> > >>> Best, >> > >>> Predrag >> > >>> >> > >>> On Tue, Aug 18, 2020 at 4:34 PM Viraj Mehta >> wrote: >> > >>> Hi Predrag, >> > >>> >> > >>> Hope you're doing well.
I've been running into an issue the last couple days on the Auton cluster that is blocking my work on code that used to work and was hoping to get your thoughts. I have tried to distill this down to a small but replicable issue, as seen in the attachment, which I have seen hang on the ipython call on GPU9 and GPU11 so far. Do you know why this might be? Thanks. >> > >>> >> > >>> Best, >> > >>> Viraj >> > >> >> > > >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From virajm at andrew.cmu.edu Wed Aug 19 19:35:07 2020 From: virajm at andrew.cmu.edu (Viraj Mehta) Date: Wed, 19 Aug 2020 18:35:07 -0500 Subject: ipython hangs on Auton cluster In-Reply-To: References: <8EA57DAE-B1FE-4998-B7FF-37245761F464@andrew.cmu.edu> <338E6BCA-A035-4E6C-9732-4DAF86E77FE5@andrew.cmu.edu> <4F61E3DB-9F58-4879-9E18-E6B37711B2BB@andrew.cmu.edu> <4A943418-498F-494C-8B94-0CC61DE4E4EB@andrew.cmu.edu> <20200819023511.nIWfP%predragp@andrew.cmu.edu> <213c9d760fdd4a1888d9a20ebec816de@andrew.cmu.edu> Message-ID: <645FFA0C-63F4-41A1-B718-37022C7473DE@andrew.cmu.edu> Actually, I figured this out: for everyone having trouble with iPython/jupyter, here's a solution: 1. Get into a python environment that has ipython installed 2. Run `ipython profile create` 3. Run `cd ~/.ipython/profile_default` 4. Edit the file in there called ipython_config.py by finding the option for c.HistoryAccessor.hist_file and setting it to ':memory:' (this will mean your command history isn't saved between ipython sessions; alternatively, you could point this at scratch). Hope this is helpful; not sure how to add this to the wiki but it might be good to do so. Viraj > On Aug 19, 2020, at 5:46 PM, Predrag Punosevac wrote: > > Your report indicates that my gut feeling that the SQLite database is the culprit seems to be correct. Per our documentation > > https://www.autonlab.org/autonlab_wiki/aetiquette.html#don-ts > > Use your scratch directory to store Jupyter sqlite database!
> > You placed your SQLite database onto the NFS share (zfsauton2) and you are surprised that it is incoherent. I hope you understand now better the lack of urgency in my responses. > > Predrag > > On Wed, Aug 19, 2020 at 6:03 PM Viraj Mehta > wrote: > Hi Predrag & Users, > > I have a clue as to what is wrong with our cluster. Had a few processes running which broke due to this sqlite error from ipython: > I'd imagine this is what is wrong with all our ipython stuff. No idea how to debug this, but I hope it can be helpful as we try to fix this. > > Thanks, > Viraj > >> On Aug 18, 2020, at 10:28 PM, Chufan Gao > wrote: >> >> Hi All, >> >> Rachel and I are also facing a similar issue with our Jupyter notebooks. >> We also both reinstalled jupyter with no effect. >> For me, these notebooks are extremely helpful in fast code iteration and testing out concepts. >> I also have the intuition that it is an upstream issue, as they were running fine (without any changes) before lov2 went down. >> Would you please take another look? >> >> Worst case, I have to convert my notebooks into .py files, which will slow things down. >> >> Sincerely, >> Chufan (Andy) Gao >> From: Autonlab-users > on behalf of Predrag Punosevac > >> Sent: Tuesday, August 18, 2020 10:35:11 PM >> To: Viraj Mehta >> Cc: users at autonlab.org >> Subject: Re: ipython hangs on Auton cluster >> >> Viraj Mehta > wrote: >> >> > I'm pretty sure it's not an upstream bug, as many environments >> > (conda and virtualenv) which were working with ipython across several >> > python versions before are now not working. >> > >> > I understand that ipython and ipdb aren't typically required for >> > Python workflows but certain efforts, like stepping through code that >> > requires a GPU and loads a model from the Auton cluster, are difficult >> > to debug without ipdb. Is there anything else that has changed that >> > might have broken it? >> >> Nothing that I am aware of.
However, you do understand that the system >> is very complex and it is like a live organism constantly morphing. >> >> Best, >> Predrag >> >> >> >> > >> > Thanks, >> > Viraj >> > >> > > On Aug 18, 2020, at 6:21 PM, Predrag Punosevac > wrote: >> > > >> > > I looked a bit more carefully. It could be an upstream bug. It wouldn't be the first time >> > > >> > > https://github.com/ipython/ipython/issues/11678 > >> >> ipython won't start · Issue #11678 · ipython/ipython · GitHub >> github.com >> Now I'm facing that ipython won't start without any error messages. I tried to run it with DEBUG, then the command will be "uninterruptible sleep" after the logs. $ pyenv global system $ python --version Python 2.7.5 $ ipython --version ... >> >> >> > > >> > > You don't need ipython to run Python code. You could work and debug your code on your local machine and just run production code on the server. A typical python code is just a script starting with a shebang followed by a path to the binaries. I fail to see how ipython could be useful for that. It is surely useful for the interactive work. >> > > >> > > Predrag >> > > >> > > On Tue, Aug 18, 2020 at 5:45 PM Viraj Mehta >> wrote: >> > > Tried this with 3.7 and 3.8 and it still hangs. Also if it's a good clue, it doesn't stop even if I send SIGINT or SIGQUIT. Not really sure what's going on here. >> > > >> > >> On Aug 18, 2020, at 4:39 PM, Viraj Mehta >> wrote: >> > >> >> > >> Yeah, I'll give it a shot. Thanks! >> > >> >> > >>> On Aug 18, 2020, at 4:38 PM, Predrag Punosevac >> wrote: >> > >>> >> > >>> I just upgraded all /opt/conda-py37 and /opt/conda-py38 packages on both GPU9 and GPU11.
Could you please try again? Could you also try with py38, which is now recommended, and report back. If this works I will upgrade packages across all servers. This could potentially be remotely related to the fact that Ifigeneia could not build TensorFlow. Another thought is that the ipython SQLite database is corrupted. >> > >>> >> > >>> Best, >> > >>> Predrag >> > >>> >> > >>> On Tue, Aug 18, 2020 at 4:34 PM Viraj Mehta >> wrote: >> > >>> Hi Predrag, >> > >>> >> > >>> Hope you're doing well. I've been running into an issue the last couple days on the Auton cluster that is blocking my work on code that used to work and was hoping to get your thoughts. I have tried to distill this down to a small but replicable issue, as seen in the attachment, which I have seen hang on the ipython call on GPU9 and GPU11 so far. Do you know why this might be? Thanks. >> > >>> >> > >>> Best, >> > >>> Viraj >> > >> >> > > >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sitongan at cmu.edu Fri Aug 21 01:17:32 2020 From: sitongan at cmu.edu (Sitong An) Date: Fri, 21 Aug 2020 13:17:32 +0800 Subject: ipython hangs on Auton cluster In-Reply-To: <645FFA0C-63F4-41A1-B718-37022C7473DE@andrew.cmu.edu> References: <8EA57DAE-B1FE-4998-B7FF-37245761F464@andrew.cmu.edu> <338E6BCA-A035-4E6C-9732-4DAF86E77FE5@andrew.cmu.edu> <4F61E3DB-9F58-4879-9E18-E6B37711B2BB@andrew.cmu.edu> <4A943418-498F-494C-8B94-0CC61DE4E4EB@andrew.cmu.edu> <20200819023511.nIWfP%predragp@andrew.cmu.edu> <213c9d760fdd4a1888d9a20ebec816de@andrew.cmu.edu> <645FFA0C-63F4-41A1-B718-37022C7473DE@andrew.cmu.edu> Message-ID: <3EE19283-E1F7-4407-B5E9-2A73713F6481@cmu.edu> Hi All, For your information, for those who are using jupyter notebook, you might have to do the following too: 1. jupyter notebook --generate-config 2. you will find the jupyter notebook config at ~/.jupyter/jupyter_notebook_config.py 3.
set c.NotebookNotary.db_file = ':memory:' I also updated my nbformat to the latest version. Issue related here: https://github.com/jupyter/nbformat/issues/52 Cheers, Sitong > On 20 Aug 2020, at 07:35, Viraj Mehta wrote: > > Actually, I figured this out: for everyone having trouble with iPython/jupyter, here's a solution: > > 1. Get into a python environment that has ipython installed > 2. Run `ipython profile create` > 3. Run `cd ~/.ipython/profile_default` > 4. Edit the file in there called ipython_config.py by finding the option for c.HistoryAccessor.hist_file and setting it to ':memory:' (this will mean your command history isn't saved between ipython sessions, but whatever; you could also point this at scratch). > > Hope this is helpful - not sure how to add this to the wiki but it might be good to do so. > > Viraj > >> On Aug 19, 2020, at 5:46 PM, Predrag Punosevac > wrote: >> >> Your report indicates that my gut feeling that the SQLite database is the culprit seems to be correct. Per our documentation >> >> https://www.autonlab.org/autonlab_wiki/aetiquette.html#don-ts >> >> Use your scratch directory to store the Jupyter sqlite database! >> >> You placed your SQLite database onto the NFS share (zfsauton2) and you are surprised that it is incoherent. I hope you understand now better the lack of urgency in my responses. >> >> Predrag >> >> On Wed, Aug 19, 2020 at 6:03 PM Viraj Mehta > wrote: >> Hi Predrag & Users, >> >> I have a clue as to what is wrong with our cluster. Had a few processes running which broke due to this sqlite error from ipython: >> I'd imagine this is what is wrong with all our ipython stuff. No idea how to debug this, but I hope it can be helpful as we try to fix this. >> >> Thanks, >> Viraj >> >>> On Aug 18, 2020, at 10:28 PM, Chufan Gao > wrote: >>> >>> Hi All, >>> >>> Rachel and I are also facing a similar issue with our Jupyter notebooks. >>> We also both reinstalled jupyter with no effect.
>>> For me, these notebooks are extremely helpful in fast code iteration and testing out concepts. >>> I also have the intuition that it is an upstream issue, as they were running fine (without any changes) before lop2 went down. >>> Would you please take another look? >>> >>> Worst case, I have to convert my notebooks into .py files, which will slow things down. >>> >>> Sincerely, >>> Chufan (Andy) Gao >>> From: Autonlab-users > on behalf of Predrag Punosevac > >>> Sent: Tuesday, August 18, 2020 10:35:11 PM >>> To: Viraj Mehta >>> Cc: users at autonlab.org >>> Subject: Re: ipython hangs on Auton cluster >>> >>> Viraj Mehta > wrote: >>> >>> > I'm pretty sure it's not an upstream bug, as many environments >>> > (conda and virtualenv) which were working with ipython across several >>> > python versions before are now not working. >>> > >>> > I understand that ipython and ipdb aren't typically required for >>> > Python workflows but certain efforts, like stepping through code that >>> > requires a GPU and loads a model from the Auton cluster, are difficult >>> > to debug without ipdb. Is there anything else that has changed that >>> > might have broken it? >>> >>> Nothing that I am aware of. However, you do understand that the system >>> is very complex and it is like a live organism constantly morphing. >>> >>> Best, >>> Predrag >>> >>> >>> >>> > >>> > Thanks, >>> > Viraj >>> > >>> > > On Aug 18, 2020, at 6:21 PM, Predrag Punosevac > wrote: >>> > > >>> > > I looked a bit more carefully. It could be an upstream bug. It wouldn't be the first time >>> > > >>> > > https://github.com/ipython/ipython/issues/11678 > >>> >>> ipython won't start · Issue #11678 · ipython/ipython · GitHub >>> github.com >>> Now I'm facing that ipython won't start without any error messages. I tried to run it with DEBUG, then the command will be "uninterruptible sleep" after the logs. $ pyenv global system $ python --version Python 2.7.5 $ ipython --version ...
>>> >>> >>> > > >>> > > You don't need ipython to run Python code. You could work and debug your code on your local machine and just run production code on the server. A typical python code is just a script starting with a shebang followed by a path to the binaries. I fail to see how ipython could be useful for that. It is surely useful for the interactive work. >>> > > >>> > > Predrag >>> > > >>> > > On Tue, Aug 18, 2020 at 5:45 PM Viraj Mehta >> wrote: >>> > > Tried this with 3.7 and 3.8 and it still hangs. Also if it's a good clue, it doesn't stop even if I send SIGINT or SIGQUIT. Not really sure what's going on here. >>> > > >>> > >> On Aug 18, 2020, at 4:39 PM, Viraj Mehta >> wrote: >>> > >> >>> > >> Yeah, I'll give it a shot. Thanks! >>> > >> >>> > >>> On Aug 18, 2020, at 4:38 PM, Predrag Punosevac >> wrote: >>> > >>> >>> > >>> I just upgraded all /opt/conda-py37 and /opt/conda-py38 packages on both GPU9 and GPU11. Could you please try again? Could you also try with py38, which is now recommended, and report back. If this works I will upgrade packages across all servers. This could potentially be remotely related to the fact that Ifigeneia could not build TensorFlow. Another thought is that the ipython SQLite database is corrupted. >>> > >>> >>> > >>> Best, >>> > >>> Predrag >>> > >>> >>> > >>> On Tue, Aug 18, 2020 at 4:34 PM Viraj Mehta >> wrote: >>> > >>> Hi Predrag, >>> > >>> >>> > >>> Hope you're doing well. I've been running into an issue the last couple days on the Auton cluster that is blocking my work on code that used to work and was hoping to get your thoughts.
I have tried to distill this down to a small but replicable issue, as seen in the attachment, which I have seen hang on the ipython call on GPU9 and GPU11 so far. Do you know why this might be? Thanks. >>> > >>> >>> > >>> Best, >>> > >>> Viraj >>> > >> >>> > > >>> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From predragp at andrew.cmu.edu Sat Aug 22 02:14:01 2020 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Sat, 22 Aug 2020 02:14:01 -0400 Subject: GPU, CUDA upgrade report Message-ID: <20200822061401.0TZiS%predragp@andrew.cmu.edu> Dear Autonians, This email is a summary of the current state of our GPU nodes and CUDA installation. I decided earlier today to take my chance and attempt the minor upgrade of all GPU nodes running RHEL 7.8, the NVidia binary blob drivers, as well as the CUDA Toolkits. This job was in part taken at the request of King Agamemnon's daughter :-) I am not sure if cuDNN is available for CUDA 11.0.

Server name     RHEL   CUDA version   cuDNN version
gpu1            7.8    11.0           not available
gpu2            7.8    11.0           not available
gpu3            7.8    11.0           not available
gpu4            7.8    11.0           not available
gpu5            7.8    11.0           not available
gpu6            7.8    11.0           not available
gpu7(reserved)  7.8    11.0           not available
gpu8            7.8    11.0           not available
gpu9            7.8    11.0           not available
gpu10           7.8    11.0           not available
gpu11           7.8    11.0           not available
gpu12           7.8    11.0           not available
gpu13           7.8    11.0           not available
gpu14           7.8    11.0           not available
gpu15           8.2    10.2           7.5
gpu16           8.2    10.2           7.5
gpu17           8.2    10.2           7.5
gpu18           8.2    10.2           7.5
gpu19           8.2    10.2           7.5
gpu20           8.2    10.2           7.5
gpu21           8.2    10.2           7.5

Best, Predrag From iapostol at andrew.cmu.edu Sat Aug 22 02:44:48 2020 From: iapostol at andrew.cmu.edu (Ifigeneia Apostolopoulou) Date: Sat, 22 Aug 2020 02:44:48 -0400 Subject: GPU, CUDA upgrade report In-Reply-To: <20200822061401.0TZiS%predragp@andrew.cmu.edu> References: <20200822061401.0TZiS%predragp@andrew.cmu.edu> Message-ID: Predrag, thanks soooo much.
Artemis, Agamemnon and his daughter are all appeased now :-D greetings from the land of the Tauri! On Sat, Aug 22, 2020 at 2:16 AM Predrag Punosevac wrote: > Dear Autonians, > > This email is a summary of the current state of our GPU nodes and CUDA > installation. > > I decided earlier today to take my chance and attempt the minor upgrade > of all GPU running RHEL 7.8, NVidia binary bloob drivers as well as CUDA > Toolkits. This job was in part taken at the request of King Agamemnon's > daughter :-) I am not sure if cuDNN is available for CUDA 11.0. > > Server name RHEL CUDA verson cuDNN version > gpu1 7.8 11.0 not available > gpu2 7.8 11.0 not available > gpu3 7.8 11.0 not available > gpu4 7.8 11.0 not available > gpu5 7.8 11.0 not available > gpu6 7.8 11.0 not available > gpu7(reserved) 7.8 11.0 not available > gpu8 7.8 11.0 not available > gpu9 7.8 11.0 not available > gpu10 7.8 11.0 not available > gpu11 7.8 11.0 not available > gpu12 7.8 11.0 not available > gpu13 7.8 11.0 not available > gpu14 7.8 11.0 not available > gpu15 8.2 10.2 7.5 > gpu16 8.2 10.2 7.5 > gpu17 8.2 10.2 7.5 > gpu18 8.2 10.2 7.5 > gpu19 8.2 10.2 7.5 > gpu20 8.2 10.2 7.5 > gpu21 8.2 10.2 7.5 > > > Best, > Predrag > -------------- next part -------------- An HTML attachment was scrubbed... URL: From iapostol at andrew.cmu.edu Sun Aug 23 11:40:55 2020 From: iapostol at andrew.cmu.edu (Ifigeneia Apostolopoulou) Date: Sun, 23 Aug 2020 11:40:55 -0400 Subject: Fwd: Neurocomputing Review Request NEUCOM-D-20-03222 In-Reply-To: References: Message-ID: Hi all, I hope that you are well and that you are enjoying the last days of the summer! Below you will find an invitation for review, that due to time constraints, I will not be able to accept. In case anyone is interested in reviewing this article && has time by Sept 19 (the abstract is at the end of the forwarded email), please let me know to add you as suggested alternative reviewer. 
Thanks and happy fall semester :) ---------- Forwarded message --------- From: Neurocomputing Date: Sat, Aug 22, 2020 at 9:32 PM Subject: Neurocomputing Review Request NEUCOM-D-20-03222 To: Ifigeneia Apostolopoulou Dear Ms. Apostolopoulou, As editor of Neurocomputing, I would hereby like to ask you the big favor of reviewing the manuscript "Deep Hebbian predictive coding accounts for emergence of complex neural response properties along the visual cortical hierarchy" The abstract is attached at the bottom of this message. If possible, I would welcome receiving your review by Sep 19, 2020 (mm/dd/yyyy). Please click on one of the following links to indicate whether you accept or decline the role of reviewing this paper. If you are not able to review this manuscript, We would appreciate receiving suggestions for alternative reviewers. ****** In addition to accessing our subscriber content, you can also use our Open Access content. Read more about Open Access here: http://www.elsevier.com/openaccess Your help as an expert on neural networks is highly appreciated! Kind regards, Professor Yang Tang Associate Editor Reviewer Guidelines are now available to help you with your review: http://www.elsevier.com/wps/find/reviewershome.reviewers/reviewersguidelines Predictive coding provides a computational paradigm for modelling perceptual processing as the construction of representations accounting for causes of sensory inputs. Here, we developed a scalable, deep network architecture for predictive coding that is trained using a Hebbian learning rule and mimics the feedforward and feedback connectivity of the cortex. After training on image datasets, the models formed latent representations in higher area that allowed reconstruction of the original images. We analyzed low- and high-level properties such as orientation selectivity, object selectivity and sparseness of neuronal populations in the model. 
As reported experimentally, image selectivity increased systematically across ascending areas in the model hierarchy. Depending on the strength of regularization factors, sparseness also increased from lower to higher areas. These results suggest a rationale as to why experimental results on sparseness across the cortical hierarchy have been inconsistent. Finally, representations for different object classes became more distinguishable from lower to higher areas. Thus, deep neural networks trained using a Hebbian formulation of predictive coding can reproduce several properties associated with neuronal responses along the visual cortical hierarchy. For further assistance, please visit our customer support site at http://help.elsevier.com/app/answers/list/p/7923 Here you can search for solutions on a range of topics, find answers to frequently asked questions and learn more about EM via interactive tutorials. You will also find our 24/7 support contact details should you need any further assistance from one of our customer support representatives. Please note: Reviews are subject to a confidentiality policy, http://service.elsevier.com/app/answers/detail/a_id/14156/supporthub/publishing/ __________________________________________________ In compliance with data protection regulations, you may request that we remove your personal registration details at any time. (Use the following URL: https://www.editorialmanager.com/neucom/login.asp?a=r). Please contact the publication office if you have any questions. -------------- next part -------------- An HTML attachment was scrubbed... URL: From yeehos at andrew.cmu.edu Sun Aug 23 19:31:04 2020 From: yeehos at andrew.cmu.edu (Yeeho Song) Date: Sun, 23 Aug 2020 19:31:04 -0400 Subject: LOV7 Full Message-ID: Dear All, This is a gentle reminder that LOV7 scratch is almost full. Please check and delete / move your files from the scratch directories if possible. Thank you! 
yeehos at lov7$ df -h /home/scratch Filesystem Size Used Avail Use% Mounted on /dev/mapper/sl-home 169G 169G 1.7M 100% /home Sincerely, Yeeho Song -------------- next part -------------- An HTML attachment was scrubbed... URL: From arundhat at andrew.cmu.edu Mon Aug 24 23:37:26 2020 From: arundhat at andrew.cmu.edu (Arundhati Banerjee) Date: Mon, 24 Aug 2020 23:37:26 -0400 Subject: LOV5 Full Message-ID: <0B6390EC-6E73-43E2-8EDB-FF647F5B1F0A@andrew.cmu.edu> Hi everyone, The scratch directory on lov5 is full. Please free up space if possible. Thank you! arundhat at lov5$ df -h /home/scratch Filesystem Size Used Avail Use% Mounted on /dev/mapper/sl-home 1.8T 1.8T 20K 100% /home Best regards, Arundhati -------------- next part -------------- An HTML attachment was scrubbed... URL: From predragp at andrew.cmu.edu Wed Aug 26 00:20:05 2020 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Wed, 26 Aug 2020 00:20:05 -0400 Subject: GPU3 In-Reply-To: <675d328721c24709a24c96b8c2c6930a@andrew.cmu.edu> References: <1dc0b74848724f9d89e491eed53ec5a8@andrew.cmu.edu> <675d328721c24709a24c96b8c2c6930a@andrew.cmu.edu> Message-ID: I had positive reports about TensorFlow. The very reason we went through this exercise of upgrading a bunch of servers that TensorFlow was previously broken for a few people. You are better off soliciting help from users at autonlab. These days I use only Julia for the stuff I do. Cheers, Predrag On Tue, Aug 25, 2020 at 10:48 PM Jielin Qiu wrote: > Thank you very much! The GPU devices are able to be loaded correctly now. > But after loading the devices, the PyTorch and TensorFlow scripts just hang > there, and they do not go forward to get the results. It seems that the > program gets stuck there. I tried to re-install the packages but it seems > that it doesn't work. Would there be any different or additional library I > need to install or update to fit the new update on our cluster? 
> > > Thanks, > > Jielin > > ------------------------------ > *From:* Predrag Punosevac > *Sent:* Tuesday, August 25, 2020 7:06:01 PM > *To:* Jielin Qiu > *Subject:* Re: GPU3 > > Fixed > > On Tue, Aug 25, 2020 at 5:36 PM Jielin Qiu wrote: > >> Hi Predrag, >> >> >> Hope you are doing well! >> >> >> I came across a problem with GPU3, as I could not load the GPU device >> correctly. We tried to run the scripts which were able to run on GPU3 >> before the driver update, now those scripts are not workable in the GPU >> mode, but they are able to run in the CPU mode. Would you mind helping me >> check if there is anything wrong with the driver or library on GPU3? >> >> >> Thanks, >> >> Jielin >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From chiragn at cs.cmu.edu Wed Aug 26 19:15:26 2020 From: chiragn at cs.cmu.edu (Chirag Nagpal) Date: Wed, 26 Aug 2020 19:15:26 -0400 Subject: Please free up compute nodes. Message-ID: Hello All *All the lov machines from 1-5 are completely choked with jobs. Please free up the machines. * *Remember if you find yourself using *all* the cores of *all* the machines, you are likely affecting productivity of other lab members. Please be reasonable about the resource usage. * Thank you Chirag -- *Chirag Nagpal* PhD Student, Auton Lab School of Computer Science Carnegie Mellon University cs.cmu.edu/~chiragn -------------- next part -------------- An HTML attachment was scrubbed... URL: From ngisolfi at cs.cmu.edu Thu Aug 27 09:23:49 2020 From: ngisolfi at cs.cmu.edu (Nick Gisolfi) Date: Thu, 27 Aug 2020 09:23:49 -0400 Subject: [Lunch] Today @noon over Zoom Message-ID: https://cmu.zoom.us/j/492870487 We hope to see you there! - Nick -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From predragp at andrew.cmu.edu Thu Aug 27 17:30:16 2020 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Thu, 27 Aug 2020 17:30:16 -0400 Subject: CuDNN In-Reply-To: <4A32C9F5-4AA6-47C7-A1C5-49BCAB2EADE6@andrew.cmu.edu> References: <4A32C9F5-4AA6-47C7-A1C5-49BCAB2EADE6@andrew.cmu.edu> Message-ID: <20200827213016.fr6Uv%predragp@andrew.cmu.edu> Viraj Mehta wrote: > Hi Predrag, > > As a follow-up to the previous emails about the CUDA version and such, > I noticed that there isn't a version of CuDNN installed on the > machines that have CUDA 11. I have some code I'm trying to build > that depends on CuDNN, and I actually just installed CUDA 11 and CuDNN > 8 on my machine at home and they work fine together. That is good to know. I just logged into my NVidia account. I am downloading CuDNN RPMs for RedHat: cuDNN v8.0.2 (July 24th, 2020), for CUDA 11.0. It appears that they have binaries for the 7.xxx and 8.xxx branches of RedHat. If the 7.xxx binaries work well I could probably go next week and upgrade CUDA on our RedHat 8.2 installations. Cheers, Predrag > > Do you think you'd be able to add CuDNN 8 to those machines? I tried the Conda version but it is an incompatible version of CuDNN (7.6.something) as well. > > Thanks, > Viraj From yeehos at andrew.cmu.edu Sat Aug 29 15:07:04 2020 From: yeehos at andrew.cmu.edu (Yeeho Song) Date: Sat, 29 Aug 2020 15:07:04 -0400 Subject: LOV7 Full In-Reply-To: References: Message-ID: Dear All, This is a gentle reminder that LOV7 scratch is almost full. Please check and delete / move your files from the scratch directories if possible. Thank you!
> > yeehos at lov7$ df -h /home/scratch > Filesystem Size Used Avail Use% Mounted on > /dev/mapper/sl-home 169G 169G 1.7M 100% /home > > Sincerely, > Yeeho Song > -------------- next part -------------- An HTML attachment was scrubbed... URL: From predragp at andrew.cmu.edu Sat Aug 29 21:00:05 2020 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Sat, 29 Aug 2020 21:00:05 -0400 Subject: cant ssh into gpu20 In-Reply-To: References: Message-ID: <20200830010005.Yz45U%predragp@andrew.cmu.edu> Michael Andrews wrote: > Hi Predrag, > > I can't seem to ssh into gpu20. Other gpu nodes (including gpu21) seem to > be ok. Are you noticing this as well? > It is not down, but it is overloaded and about to crash. It doesn't have enough memory even to accept an incoming ssh connection (128 MB needed). I can only ping the machine. There is nothing I can do about it until Monday morning. GPU20 and GPU21 are not connected to our IPMI console because they are currently located in somebody else's rack space. The machine room is not crewed 24/7 due to COVID-19. Tough luck... Best, Predrag > Regards, > Michael From chiragn at cs.cmu.edu Sun Aug 30 23:57:20 2020 From: chiragn at cs.cmu.edu (Chirag Nagpal) Date: Sun, 30 Aug 2020 23:57:20 -0400 Subject: Please free up compute nodes. In-Reply-To: References: Message-ID: Following up on this email, the scratch on lov8 and lov9 is completely full, making it impossible to even start the python interpreter on those machines. Please free up the scratch. Ideally, unless your job requires a lot of fast I/O, you should use the NFS and not the scratch. Once your job is completed, you should move the temporary files from the scratch to NFS or delete them. Thanks for bearing with me Chirag On Wed, Aug 26, 2020 at 7:15 PM Chirag Nagpal wrote: > Hello All > > *All the lov machines from 1-5 are completely choked with jobs. Please > free up the machines.
* > > *Remember if you find yourself using *all* the cores of *all* the > machines, you are likely affecting the productivity of other lab members. > Please be reasonable about the resource usage. * > > Thank you > > Chirag > > > -- > > *Chirag Nagpal* PhD Student, Auton Lab > School of Computer Science > Carnegie Mellon University > cs.cmu.edu/~chiragn > -- *Chirag Nagpal* PhD Student, Auton Lab School of Computer Science Carnegie Mellon University cs.cmu.edu/~chiragn -------------- next part -------------- An HTML attachment was scrubbed... URL: From predragp at andrew.cmu.edu Mon Aug 31 00:15:36 2020 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Mon, 31 Aug 2020 00:15:36 -0400 Subject: libcudnn8 Message-ID: <20200831041536.iuc8N%predragp@andrew.cmu.edu> Dear Autonians, I installed libcudnn8 on the following computing nodes: GPU1, GPU2, GPU3, GPU4, GPU5, which run RHEL 7.8 and CUDA 11.0. Please test if it works. I need feedback before I can add libcudnn8 to servers GPU[6-14]. If this goes as planned I will proceed to upgrade CUDA from 10.2 to 11 and CuDNN from 7.5 to 8.0 on the GPU nodes running RHEL 8.2. That should last us for at least the next 6 months. Best, Predrag From predragp at andrew.cmu.edu Mon Aug 31 13:34:00 2020 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Mon, 31 Aug 2020 13:34:00 -0400 Subject: quick question about our OS / building JAX In-Reply-To: References: Message-ID: Good! You should email users at autonlab if somebody else needs it On Mon, Aug 31, 2020 at 12:40 PM Viraj Mehta wrote: > Actually got it to build, thank you! > > On Mon, Aug 31, 2020 at 10:58 AM Viraj Mehta wrote: > >> Ah I didn't know we had the 8.xxx installed in the tools directory; I'll >> try and point at that. Thanks for bearing with me. >> >> On Mon, Aug 31, 2020 at 10:51 AM Predrag Punosevac < >> predragp at andrew.cmu.edu> wrote: >> >>> I have no idea. FYI the default version of gcc on RHEL 8.2 is 8.xxx.
On >>> RHEL the default version is 4.8.5 but you will find gcc 8.xxx in >>> /opt/rh/dev-tools8 >>> or something like that. What is old about it? >>> >>> On Mon, Aug 31, 2020, 11:46 AM Viraj Mehta wrote: >>> >>>> Hi Predrag, >>>> >>>> Hope you're well and sorry for bothering you so much lately. Since we >>>> talked about JAX a while back, I've been trying to figure out if there was >>>> a way to get it built on our cluster environment as there are some features >>>> of its autodiff system I like for some current work. I have two questions >>>> which I've bolded. >>>> >>>> I recently found this thread: https://github.com/google/jax/issues/2083 >>>> where somebody got it to build on CentOS, which afaik is another RHEL >>>> variant. I think the major differences between this guy's setup are the >>>> CUDA/cuDNN version (but we have versions which are supported) and the gcc >>>> version (in which ours is older). *Do you think in general that it >>>> would be plausibly feasible to build this on our cluster? * >>>> >>>> In my most recent effort I ran into something where our version of gcc >>>> is too old to understand the command line option '-std=c++14'. *Is >>>> there a way to get an alternate / newer version of gcc on our machines?* >>>> >>>> Thanks for all the help. >>>> >>>> Viraj >>>> >>>> >>>> >>>> >>>> >>> >>> >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From virajm at cs.cmu.edu Mon Aug 31 13:54:33 2020 From: virajm at cs.cmu.edu (Viraj Mehta) Date: Mon, 31 Aug 2020 12:54:33 -0500 Subject: quick question about our OS / building JAX In-Reply-To: References: Message-ID: Will do, thanks! On Mon, Aug 31, 2020 at 12:34 PM Predrag Punosevac wrote: > Good! You should email users at autonlab is somebody else needs it > > On Mon, Aug 31, 2020 at 12:40 PM Viraj Mehta wrote: > >> Actually got it to build, thank you! 
>> >> On Mon, Aug 31, 2020 at 10:58 AM Viraj Mehta wrote: >>> Ah I didn't know we had the 8.xxx installed in the tools directory; I'll >>> try and point at that. Thanks for bearing with me. >>> >>> On Mon, Aug 31, 2020 at 10:51 AM Predrag Punosevac < >>> predragp at andrew.cmu.edu> wrote: >>> >>>> I have no idea. FYI the default version of gcc on RHEL 8.2 is 8.xxx. On >>>> RHEL the default version is 4.8.5 but you will find gcc 8.xxx in >>>> /opt/rh/dev-tools8 >>>> or something like that. What is old about it? >>>> >>>> On Mon, Aug 31, 2020, 11:46 AM Viraj Mehta wrote: >>>> >>>>> Hi Predrag, >>>>> >>>>> Hope you're well and sorry for bothering you so much lately. Since we >>>>> talked about JAX a while back, I've been trying to figure out if there was >>>>> a way to get it built on our cluster environment as there are some features >>>>> of its autodiff system I like for some current work. I have two questions >>>>> which I've bolded. >>>>> >>>>> I recently found this thread: >>>>> https://github.com/google/jax/issues/2083 where somebody got it to >>>>> build on CentOS, which afaik is another RHEL variant. I think the major >>>>> differences between this guy's setup are the CUDA/cuDNN version (but we >>>>> have versions which are supported) and the gcc version (in which ours is >>>>> older). *Do you think in general that it would be plausibly >>>>> feasible to build this on our cluster? * >>>>> >>>>> In my most recent effort I ran into something where our version of gcc >>>>> is too old to understand the command line option '-std=c++14'. *Is >>>>> there a way to get an alternate / newer version of gcc on our machines?* >>>>> >>>>> Thanks for all the help. >>>>> >>>>> Viraj >>>>> >>>>> >>>>> >>>>> >>>>> >>>> >>>> >>> >>> >> >> > > -------------- next part -------------- An HTML attachment was scrubbed...
URL: From virajm at andrew.cmu.edu Mon Aug 31 14:22:37 2020 From: virajm at andrew.cmu.edu (Viraj Mehta) Date: Mon, 31 Aug 2020 13:22:37 -0500 Subject: Building JAX on the Auton cluster Message-ID: <3C2C294D-1221-4040-BF4A-49C2894B40EF@andrew.cmu.edu> Hi everyone, I had a bit of an adventure trying to build JAX on Auton and thought I'd document the right way to do it so that, if others want, they can as well. Here are the steps to do it on a machine that has CUDA 11 and cuDNN 8:

1. Go somewhere in your scratch
2. git clone https://github.com/google/jax
3. Make a python environment with a python >= 3.6 using Conda or virtualenv
4. Install numpy, scipy, cython, six
5. source /opt/rh/devtoolset-8/enable
6. Run python build/build.py --enable_cuda --cuda_path /usr/local/cuda --cudnn_path /usr/
7. pip install -e build
8. pip install -e .

I know this isn't that complicated, but I figured it would save some effort if anyone else would like to use it. Cheers, Viraj -------------- next part -------------- An HTML attachment was scrubbed... URL:
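As a footnote to the IPython/Jupyter thread earlier in this archive: the two ':memory:' settings can be applied with one short script instead of editing the files by hand. This is a minimal sketch under stated assumptions: it uses the default config paths named in the thread (~/.ipython/profile_default/ipython_config.py and ~/.jupyter/jupyter_notebook_config.py), and the patch_config helper is illustrative, not part of IPython or Jupyter.

```python
import os

def patch_config(path, line):
    """Append a config line to the file at `path` unless it is already there.

    Idempotent: running it twice leaves a single copy of the line.
    """
    os.makedirs(os.path.dirname(path), exist_ok=True)
    existing = ""
    if os.path.exists(path):
        with open(path) as f:
            existing = f.read()
    if line not in existing:
        with open(path, "a") as f:
            f.write(line + "\n")

home = os.path.expanduser("~")
# IPython: keep command history in memory instead of an SQLite file on NFS.
patch_config(os.path.join(home, ".ipython", "profile_default", "ipython_config.py"),
             "c.HistoryAccessor.hist_file = ':memory:'")
# Jupyter: keep the notebook signature (trust) database in memory as well.
patch_config(os.path.join(home, ".jupyter", "jupyter_notebook_config.py"),
             "c.NotebookNotary.db_file = ':memory:'")
```

Running `ipython profile create` and `jupyter notebook --generate-config` first, as the thread describes, gives the files their full default contents; the script only guarantees the two lines are present.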
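Several threads above end with "please test if it works" after a CUDA/cuDNN upgrade. A hedged sanity-check sketch for that step, assuming nothing about the node: if PyTorch is absent, or cannot see the GPUs, it reports that instead of failing.

```python
def gpu_status():
    """Summarize what the Python deep-learning stack can see on this node."""
    try:
        import torch  # may not be installed in every environment
    except ImportError:
        return "torch not installed"
    if not torch.cuda.is_available():
        return "CUDA not available to torch"
    count = torch.cuda.device_count()
    names = ", ".join(torch.cuda.get_device_name(i) for i in range(count))
    cudnn = torch.backends.cudnn.version()  # None when cuDNN is missing
    return "%d GPU(s): %s; cuDNN %s" % (count, names, cudnn)

if __name__ == "__main__":
    print(gpu_status())
```

Run once per upgraded node (e.g. inside the /opt/conda-py38 environment mentioned above); an equivalent check with TensorFlow would use its own device-listing API instead.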