From awd at cs.cmu.edu Fri Jan 6 17:04:56 2023 From: awd at cs.cmu.edu (Artur Dubrawski) Date: Fri, 6 Jan 2023 17:04:56 -0500 Subject: Auton Lab spinoff Marinus Analytics colonizes the United Kingdom Message-ID: Team, Check this out: https://www.bizjournals.com/pittsburgh/inno/stories/profiles/2023/01/05/pittsburgh-based-marinus-analytics-london-office.html Way to go, Cara and Marinus team! Cheers, Artur -------------- next part -------------- An HTML attachment was scrubbed... URL:
From predragp at andrew.cmu.edu Mon Jan 9 19:52:53 2023 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Mon, 9 Jan 2023 19:52:53 -0500 Subject: GPU5 is now a virtual KVM GPU computing host Message-ID: Dear Autonians, Happy New Year! I wish everyone a healthy, happy, and prosperous 2023. I hope everyone has recharged their batteries. I took advantage of the fact that GPU5 was idling and reprovisioned it as a KVM virtualization host. The plan is to migrate the various external services our lab provides to outside collaborators onto virtual machines running on GPU5. Once those services are migrated off GPU7, for example, GPU7 will be returned to our computing pool as a regular GPU computing node. Best, Dr. P^2 -------------- next part -------------- An HTML attachment was scrubbed... URL:
From awd at cs.cmu.edu Mon Jan 9 20:08:23 2023 From: awd at cs.cmu.edu (Artur Dubrawski) Date: Mon, 9 Jan 2023 20:08:23 -0500 Subject: GPU5 is now a virtual KVM GPU computing host In-Reply-To: References: Message-ID: Thanks Predrag! Artur On Mon, Jan 9, 2023 at 7:53 PM Predrag Punosevac wrote: > Dear Autonians, > > Happy New Year! I wish everyone a healthy, happy, and prosperous 2023. > I hope everyone has recharged their batteries. > > I took advantage of the fact that GPU5 was idling and reprovisioned > it as a KVM virtualization host. > The plan is to migrate the various external services our lab provides > to outside collaborators onto virtual machines running on GPU5. > Once those services are migrated off GPU7, for example, > GPU7 will be returned to our computing pool as a regular GPU computing > node. > > Best, > Dr. P^2 > -------------- next part -------------- An HTML attachment was scrubbed... URL:
From predragp at andrew.cmu.edu Wed Jan 11 10:41:15 2023 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Wed, 11 Jan 2023 10:41:15 -0500 Subject: GPU Machines Rebooting In-Reply-To: References: Message-ID: Hi Ian, Interesting. I had not noticed, but the monitoring switches are connected to the same PDUs. This makes me think it has something to do with a power outage. Namely, the GPU nodes, for obvious reasons, are not UPSed. Either the PDUs' capacity was exceeded or power to certain outlets was cut. I will look into it. The server room was too hot the other day when I was there doing the work. They are doing something, but I couldn't see what. Are you sure that the affected machines are in two different racks? GPU1-9 + Denver is one rack. GPU5 and GPU7 are not even part of the GPU pool. Predrag On Wed, Jan 11, 2023, 9:56 AM Ian Char wrote: > Hey Predrag, > > Hope you had a happy new year and are doing well! > > It seems that both today and yesterday morning many GPU machines were > rebooted at the exact same time (see screenshot below). As far as I can > tell this happened for GPUs 1-15. Do you have any insight into why this might > be happening? > > [image: image.png] > > Thank you, > Ian > -------------- next part -------------- An HTML attachment was scrubbed... 
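For anyone who wants to confirm when a node actually rebooted, here is a minimal sketch. It assumes a typical systemd-based Linux node (later messages mention RHEL-family systems); journalctl only lists earlier boots if the journal is persistent, and the host names in the loop are just examples.

    # Timestamp of the current boot and recent reboot records
    uptime -s
    who -b
    last -x reboot | head -n 5

    # On hosts with a persistent systemd journal, list the recorded boots
    journalctl --list-boots | tail -n 5

    # Compare last boot times across a few nodes (host names are examples)
    for h in gpu1 gpu2 gpu3; do
        printf '%s: ' "$h"; ssh "$h" uptime -s
    done

If many nodes report the same boot timestamp, that points to a shared cause such as a PDU or building power rather than per-machine crashes.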
URL: 
From ichar at andrew.cmu.edu Wed Jan 11 11:20:33 2023 From: ichar at andrew.cmu.edu (Ian Char) Date: Wed, 11 Jan 2023 11:20:33 -0500 Subject: GPU Machines Rebooting In-Reply-To: References: Message-ID: Hey Predrag, Thanks for your help on this. I am not sure about gpu5 or gpu7; they may have been unaffected. Besides those machines, I just confirmed that gpus 1-13 have the same output as the attached screenshot. Interestingly, I just looked at some of the other gpus, and they also logged some activity this morning only. However, they say "still running" (see screenshot for gpu23) and it seems like the jobs on them may have been unaffected. I am not familiar with what this means. Does this suggest some sort of power outage? [image: image.png] Thanks, Ian On Wed, Jan 11, 2023 at 10:41 AM Predrag Punosevac wrote: > Hi Ian, > > Interesting. I had not noticed, but the monitoring switches are connected > to the same PDUs. This makes me think it has something to do > with a power outage. Namely, the GPU nodes, for obvious reasons, are not UPSed. > Either the PDUs' capacity was exceeded or power to certain outlets was cut. I > will look into it. The server room was too hot the other day when I was there doing the > work. They are doing something, but I couldn't see what. Are you sure that > the affected machines are in two different racks? GPU1-9 + Denver is one rack. > GPU5 and GPU7 are not even part of the GPU pool. > > Predrag > > On Wed, Jan 11, 2023, 9:56 AM Ian Char wrote: > >> Hey Predrag, >> >> Hope you had a happy new year and are doing well! >> >> It seems that both today and yesterday morning many GPU machines were >> rebooted at the exact same time (see screenshot below). As far as I can >> tell this happened for GPUs 1-15. Do you have any insight into why this might >> be happening? >> >> [image: image.png] >> >> Thank you, >> Ian >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image.png Type: image/png Size: 16867 bytes Desc: not available URL:
From predragp at andrew.cmu.edu Wed Jan 11 14:51:34 2023 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Wed, 11 Jan 2023 14:51:34 -0500 Subject: Bash is down Message-ID: The Bash shell gateway (my desktop) is down. It is brand new, but the hardware is faulty. I am hitting either a CPU microcode bug (AMD Ryzen 5600G) or a UEFI bug. The motherboard had a really hard time booting with the 5600G (the G stands for built-in GPU/graphics capability). Buyers beware! For the record, the 5600X, which needs a separate video card, was pure gold. Predrag -------------- next part -------------- An HTML attachment was scrubbed... URL:
From predragp at andrew.cmu.edu Wed Jan 11 20:15:03 2023 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Wed, 11 Jan 2023 20:15:03 -0500 Subject: GPU Machines Rebooting In-Reply-To: References: Message-ID: It appears that all machines that are not UPSed (all GPU nodes) were rebooted 14 hours and 27 minutes ago. The only explanation is an electrical problem. I will talk tomorrow morning to the guys who are supposed to monitor Wean Hall 3611. Predrag On Wed, Jan 11, 2023 at 11:20 AM Ian Char wrote: > Hey Predrag, > > Thanks for your help on this. I am not sure about gpu5 or gpu7; they may > have been unaffected. Besides those machines, I just confirmed that gpus > 1-13 have the same output as the attached screenshot. Interestingly, I just > looked at some of the other gpus, and they also logged some activity this > morning only. 
However, they say "still running" (see screenshot for gpu23) > and it seems like the jobs on them may have been unaffected. I am not > familiar with what this means. Does this suggest some sort of power outage? > > [image: image.png] > > Thanks, > Ian > > On Wed, Jan 11, 2023 at 10:41 AM Predrag Punosevac < > predragp at andrew.cmu.edu> wrote: > >> Hi Ian, >> >> Interesting, I have not noticed but the monitoring switchs are connected >> to the same PDUs. This makes me thinking that this have something to do >> with power outage. Namely, GPU nodes for obvious reasons are not UPSed. >> Either PDUs capacity was exceeded or power on certain outlets was cut. I >> will look into it. The server room was too hot the other day I did the >> work. They are doing something but I couldn't see what. Are you sure that >> affected machines are from 2 different racks? GPU1-9 + Denver is one rack. >> GPU 5 and GPU 7 are not even part of the GPU pool. >> >> Predrag >> >> On Wed, Jan 11, 2023, 9:56 AM Ian Char wrote: >> >>> Hey Predrag, >>> >>> Hope you had a happy new year and are doing well! >>> >>> It seems that both today and yesterday morning many GPU machines were >>> rebooted at the exact same time (see screenshot below). As far as I can >>> tell this happened for GPUs 1-15. Do you have any insights why this might >>> be happening? >>> >>> [image: image.png] >>> >>> Thank you, >>> Ian >>> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image.png Type: image/png Size: 16867 bytes Desc: not available URL: From predragp at andrew.cmu.edu Wed Jan 11 20:27:11 2023 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Wed, 11 Jan 2023 20:27:11 -0500 Subject: GPU Machines Rebooting In-Reply-To: References: Message-ID: Dear Autonians, dead autofs and sssd deamons on GPU machines which are causing login troubles appear to be due to this electric instability. I owe a big apology to Ifi. Not that she shouldn't debug those scripts but that is another story :-) I am really taken aback with the electric grid problems. I haven't seen anything similar in 10 years. Predrag On Wed, Jan 11, 2023 at 8:15 PM Predrag Punosevac wrote: > It appears that all machines which are not UPSed (all GPU nodes) have been > rebooted 14h and 27 minutes ago. The only explanation is electricity. I > will talk tomorrow morning to the guys who are supposed to monitor Wean > Hall 3611. > > Predrag > > On Wed, Jan 11, 2023 at 11:20 AM Ian Char wrote: > >> Hey Predrag, >> >> Thanks for your help on this. I am not sure about gpu5 or gpu7; they may >> have been unaffected. Besides those machines, I just confirmed that gpus >> 1-13 have the same output as the attached screenshot. Interestingly, I just >> looked at some of the other gpus, and they also logged some activity this >> morning only. However, they say "still running" (see screenshot for gpu23) >> and it seems like the jobs on them may have been unaffected. I am not >> familiar with what this means. Does this suggest some sort of power outage? >> >> [image: image.png] >> >> Thanks, >> Ian >> >> On Wed, Jan 11, 2023 at 10:41 AM Predrag Punosevac < >> predragp at andrew.cmu.edu> wrote: >> >>> Hi Ian, >>> >>> Interesting, I have not noticed but the monitoring switchs are connected >>> to the same PDUs. This makes me thinking that this have something to do >>> with power outage. Namely, GPU nodes for obvious reasons are not UPSed. 
>>> Either PDUs capacity was exceeded or power on certain outlets was cut. I >>> will look into it. The server room was too hot the other day I did the >>> work. They are doing something but I couldn't see what. Are you sure that >>> affected machines are from 2 different racks? GPU1-9 + Denver is one rack. >>> GPU 5 and GPU 7 are not even part of the GPU pool. >>> >>> Predrag >>> >>> On Wed, Jan 11, 2023, 9:56 AM Ian Char wrote: >>> >>>> Hey Predrag, >>>> >>>> Hope you had a happy new year and are doing well! >>>> >>>> It seems that both today and yesterday morning many GPU machines were >>>> rebooted at the exact same time (see screenshot below). As far as I can >>>> tell this happened for GPUs 1-15. Do you have any insights why this might >>>> be happening? >>>> >>>> [image: image.png] >>>> >>>> Thank you, >>>> Ian >>>> >>> -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image.png Type: image/png Size: 16867 bytes Desc: not available URL: From predragp at andrew.cmu.edu Thu Jan 12 15:45:24 2023 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Thu, 12 Jan 2023 15:45:24 -0500 Subject: GPU Machines Rebooting In-Reply-To: References: Message-ID: I talked earlier today with Dave from CS CMU operations. Apparently this was a scheduled power outage. I am supposed to receive emails when those things happen but I didn't :-( Best, Predrag On Wed, Jan 11, 2023 at 8:27 PM Predrag Punosevac wrote: > Dear Autonians, > > dead autofs and sssd deamons on GPU machines which are causing login > troubles appear to be due to this electric instability. I owe a big apology > to Ifi. Not that she shouldn't debug those scripts but that is another > story :-) I am really taken aback with the electric grid problems. I > haven't seen anything similar in 10 years. > > Predrag > > On Wed, Jan 11, 2023 at 8:15 PM Predrag Punosevac > wrote: > >> It appears that all machines which are not UPSed (all GPU nodes) have >> been rebooted 14h and 27 minutes ago. The only explanation is electricity. >> I will talk tomorrow morning to the guys who are supposed to monitor Wean >> Hall 3611. >> >> Predrag >> >> On Wed, Jan 11, 2023 at 11:20 AM Ian Char wrote: >> >>> Hey Predrag, >>> >>> Thanks for your help on this. I am not sure about gpu5 or gpu7; they may >>> have been unaffected. Besides those machines, I just confirmed that gpus >>> 1-13 have the same output as the attached screenshot. Interestingly, I just >>> looked at some of the other gpus, and they also logged some activity this >>> morning only. However, they say "still running" (see screenshot for gpu23) >>> and it seems like the jobs on them may have been unaffected. I am not >>> familiar with what this means. Does this suggest some sort of power outage? >>> >>> [image: image.png] >>> >>> Thanks, >>> Ian >>> >>> On Wed, Jan 11, 2023 at 10:41 AM Predrag Punosevac < >>> predragp at andrew.cmu.edu> wrote: >>> >>>> Hi Ian, >>>> >>>> Interesting, I have not noticed but the monitoring switchs are >>>> connected to the same PDUs. This makes me thinking that this have >>>> something to do with power outage. Namely, GPU nodes for obvious reasons >>>> are not UPSed. Either PDUs capacity was exceeded or power on certain >>>> outlets was cut. I will look into it. The server room was too hot the >>>> other day I did the work. They are doing something but I couldn't see what. >>>> Are you sure that affected machines are from 2 different racks? 
GPU1-9 + >>>> Denver is one rack. GPU 5 and GPU 7 are not even part of the GPU pool. >>>> >>>> Predrag >>>> >>>> On Wed, Jan 11, 2023, 9:56 AM Ian Char wrote: >>>> >>>>> Hey Predrag, >>>>> >>>>> Hope you had a happy new year and are doing well! >>>>> >>>>> It seems that both today and yesterday morning many GPU machines were >>>>> rebooted at the exact same time (see screenshot below). As far as I can >>>>> tell this happened for GPUs 1-15. Do you have any insights why this might >>>>> be happening? >>>>> >>>>> [image: image.png] >>>>> >>>>> Thank you, >>>>> Ian >>>>> >>>> -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image.png Type: image/png Size: 16867 bytes Desc: not available URL: From ichar at andrew.cmu.edu Thu Jan 12 15:48:50 2023 From: ichar at andrew.cmu.edu (Ian Char) Date: Thu, 12 Jan 2023 15:48:50 -0500 Subject: GPU Machines Rebooting In-Reply-To: References: Message-ID: Good to know the root cause. Thanks for getting to the bottom of this Predrag! On Thu, Jan 12, 2023 at 3:45 PM Predrag Punosevac wrote: > I talked earlier today with Dave from CS CMU operations. Apparently this > was a scheduled power outage. I am supposed to receive emails when those > things happen but I didn't :-( > > Best, > Predrag > > On Wed, Jan 11, 2023 at 8:27 PM Predrag Punosevac > wrote: > >> Dear Autonians, >> >> dead autofs and sssd deamons on GPU machines which are causing login >> troubles appear to be due to this electric instability. I owe a big apology >> to Ifi. Not that she shouldn't debug those scripts but that is another >> story :-) I am really taken aback with the electric grid problems. I >> haven't seen anything similar in 10 years. >> >> Predrag >> >> On Wed, Jan 11, 2023 at 8:15 PM Predrag Punosevac < >> predragp at andrew.cmu.edu> wrote: >> >>> It appears that all machines which are not UPSed (all GPU nodes) have >>> been rebooted 14h and 27 minutes ago. The only explanation is electricity. >>> I will talk tomorrow morning to the guys who are supposed to monitor Wean >>> Hall 3611. >>> >>> Predrag >>> >>> On Wed, Jan 11, 2023 at 11:20 AM Ian Char wrote: >>> >>>> Hey Predrag, >>>> >>>> Thanks for your help on this. I am not sure about gpu5 or gpu7; they >>>> may have been unaffected. Besides those machines, I just confirmed that >>>> gpus 1-13 have the same output as the attached screenshot. Interestingly, I >>>> just looked at some of the other gpus, and they also logged some activity >>>> this morning only. However, they say "still running" (see screenshot for >>>> gpu23) and it seems like the jobs on them may have been unaffected. I am >>>> not familiar with what this means. Does this suggest some sort of power >>>> outage? >>>> >>>> [image: image.png] >>>> >>>> Thanks, >>>> Ian >>>> >>>> On Wed, Jan 11, 2023 at 10:41 AM Predrag Punosevac < >>>> predragp at andrew.cmu.edu> wrote: >>>> >>>>> Hi Ian, >>>>> >>>>> Interesting, I have not noticed but the monitoring switchs are >>>>> connected to the same PDUs. This makes me thinking that this have >>>>> something to do with power outage. Namely, GPU nodes for obvious reasons >>>>> are not UPSed. Either PDUs capacity was exceeded or power on certain >>>>> outlets was cut. I will look into it. The server room was too hot the >>>>> other day I did the work. They are doing something but I couldn't see what. >>>>> Are you sure that affected machines are from 2 different racks? GPU1-9 + >>>>> Denver is one rack. 
GPU 5 and GPU 7 are not even part of the GPU pool. >>>>> >>>>> Predrag >>>>> >>>>> On Wed, Jan 11, 2023, 9:56 AM Ian Char wrote: >>>>> >>>>>> Hey Predrag, >>>>>> >>>>>> Hope you had a happy new year and are doing well! >>>>>> >>>>>> It seems that both today and yesterday morning many GPU machines were >>>>>> rebooted at the exact same time (see screenshot below). As far as I can >>>>>> tell this happened for GPUs 1-15. Do you have any insights why this might >>>>>> be happening? >>>>>> >>>>>> [image: image.png] >>>>>> >>>>>> Thank you, >>>>>> Ian >>>>>> >>>>> -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image.png Type: image/png Size: 16867 bytes Desc: not available URL: From predragp at andrew.cmu.edu Wed Jan 18 10:35:54 2023 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Wed, 18 Jan 2023 10:35:54 -0500 Subject: Scratch on GPU15 is full In-Reply-To: References: Message-ID: Dear Autonians, Please clean the old unnecessary stuff from your scratch directories on GPU15 until 3:00pm. If the scratch is still full I will just zap it and recreate it. Predrag On Tue, Jan 17, 2023, 6:27 PM Viraj Mehta wrote: > Hi Predrag, > > Hope you are well. I noticed that scratch on GPU15 is full and the GPUs > are empty. I would like to use the machine for some pretty urgent > experiments I?m aiming to complete for the ICML deadline next week and need > some space to put a Conda env on the machine. Would you be willing to help > me free up these resources? > > Thanks so much, > Viraj -------------- next part -------------- An HTML attachment was scrubbed... URL: From predragp at andrew.cmu.edu Thu Jan 19 14:17:46 2023 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Thu, 19 Jan 2023 14:17:46 -0500 Subject: Unable to authenticate through upload In-Reply-To: References: <757862B0-6872-4712-8D68-E8EB1B5E313A@andrew.cmu.edu> Message-ID: Something is happening with the gateway machine. Your report was right on money. I am having a hard time logging in even with the root account. I am trying power cycle via IPMI. If that doesn't fix the issue I will have to look into it later today when I have the access to the hardware. Predrag On Thu, Jan 19, 2023 at 12:49 PM Benedikt Boecking wrote: > No rush, just thought I?d let you know. > > > On Jan 19, 2023, at 11:47 AM, Predrag Punosevac > wrote: > > Will look into it. > > On Thu, Jan 19, 2023, 12:37 PM Benedikt Boecking > wrote: > >> I can authenticate to bash.autonlab.org but not to upload.autonlab.org. >> >> Bash is very very slow for me, but I can get to the servers through it. >> >> On upload, the ssh log in gets stuck at public key authentication, see >> below. >> >> >> ssh -v benediktb at upload.autonlab.org >> OpenSSH_9.0p1, LibreSSL 3.3.6 >> debug1: Reading configuration data /Users/boecking/.ssh/config >> debug1: /Users/boecking/.ssh/config line 1: Applying options for * >> debug1: Reading configuration data /etc/ssh/ssh_config >> debug1: /etc/ssh/ssh_config line 21: include /etc/ssh/ssh_config.d/* >> matched no files >> debug1: /etc/ssh/ssh_config line 54: Applying options for * >> debug1: Authenticator provider $SSH_SK_PROVIDER did not resolve; disabling >> debug1: Connecting to upload.autonlab.org port 22. >> debug1: Connection established. 
>> debug1: identity file /Users/boecking/.ssh/id_rsa type 0 >> debug1: identity file /Users/boecking/.ssh/id_rsa-cert type -1 >> debug1: Local version string SSH-2.0-OpenSSH_9.0 >> debug1: Remote protocol version 2.0, remote software version OpenSSH_7.4 >> debug1: compat_banner: match: OpenSSH_7.4 pat OpenSSH_7.4* compat >> 0x04000006 >> debug1: Authenticating to upload.autonlab.org:22 as 'benediktb' >> debug1: load_hostkeys: fopen /Users/boecking/.ssh/known_hosts2: No such >> file or directory >> debug1: load_hostkeys: fopen /etc/ssh/ssh_known_hosts: No such file or >> directory >> debug1: load_hostkeys: fopen /etc/ssh/ssh_known_hosts2: No such file or >> directory >> debug1: SSH2_MSG_KEXINIT sent >> debug1: SSH2_MSG_KEXINIT received >> debug1: kex: algorithm: curve25519-sha256 >> debug1: kex: host key algorithm: ssh-ed25519 >> debug1: kex: server->client cipher: chacha20-poly1305 at openssh.com MAC: >> compression: none >> debug1: kex: client->server cipher: chacha20-poly1305 at openssh.com MAC: >> compression: none >> debug1: expecting SSH2_MSG_KEX_ECDH_REPLY >> debug1: SSH2_MSG_KEX_ECDH_REPLY received >> debug1: Server host key: ssh-ed25519 >> SHA256:IUQxesUiVl0JBF9f1ilsQOEK7bzrcA0sxPejAmmL0LI >> debug1: load_hostkeys: fopen /Users/boecking/.ssh/known_hosts2: No such >> file or directory >> debug1: load_hostkeys: fopen /etc/ssh/ssh_known_hosts: No such file or >> directory >> debug1: load_hostkeys: fopen /etc/ssh/ssh_known_hosts2: No such file or >> directory >> debug1: Host 'upload.autonlab.org' is known and matches the ED25519 host >> key. >> debug1: Found key in /Users/boecking/.ssh/known_hosts:1 >> debug1: rekey out after 134217728 blocks >> debug1: SSH2_MSG_NEWKEYS sent >> debug1: expecting SSH2_MSG_NEWKEYS >> debug1: SSH2_MSG_NEWKEYS received >> debug1: rekey in after 134217728 blocks >> debug1: get_agent_identities: bound agent to hostkey >> debug1: get_agent_identities: agent returned 1 keys >> debug1: Will attempt key: /Users/boecking/.ssh/id_rsa RSA >> SHA256:exq72a6QWMvMVpNOUObFzz0ivUQJbWbMn84bUmpEN2g explicit agent >> debug1: SSH2_MSG_EXT_INFO received >> debug1: kex_input_ext_info: server-sig-algs= >> debug1: SSH2_MSG_SERVICE_ACCEPT received >> debug1: Authentications that can continue: >> publickey,gssapi-keyex,gssapi-with-mic,password >> debug1: Next authentication method: publickey >> debug1: Offering public key: /Users/boecking/.ssh/id_rsa RSA >> SHA256:exq72a6QWMvMVpNOUObFzz0ivUQJbWbMn84bUmpEN2g explicit agent >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From predragp at andrew.cmu.edu Thu Jan 19 14:43:40 2023 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Thu, 19 Jan 2023 14:43:40 -0500 Subject: Unable to authenticate through upload In-Reply-To: References: <757862B0-6872-4712-8D68-E8EB1B5E313A@andrew.cmu.edu> Message-ID: It looks like reboot fixed the issue. Please let me know if it works for you. Predrag On Thu, Jan 19, 2023 at 2:17 PM Predrag Punosevac wrote: > Something is happening with the gateway machine. Your report was right on > money. I am having a hard time logging in even with the root account. I am > trying power cycle via IPMI. If that doesn't fix the issue I will have to > look into it later today when I have the access to the hardware. > > Predrag > > On Thu, Jan 19, 2023 at 12:49 PM Benedikt Boecking < > boecking at andrew.cmu.edu> wrote: > >> No rush, just thought I?d let you know. >> >> >> On Jan 19, 2023, at 11:47 AM, Predrag Punosevac >> wrote: >> >> Will look into it. 
>> >> On Thu, Jan 19, 2023, 12:37 PM Benedikt Boecking >> wrote: >> >>> I can authenticate to bash.autonlab.org but not to upload.autonlab.org. >>> >>> Bash is very very slow for me, but I can get to the servers through it. >>> >>> On upload, the ssh log in gets stuck at public key authentication, see >>> below. >>> >>> >>> ssh -v benediktb at upload.autonlab.org >>> OpenSSH_9.0p1, LibreSSL 3.3.6 >>> debug1: Reading configuration data /Users/boecking/.ssh/config >>> debug1: /Users/boecking/.ssh/config line 1: Applying options for * >>> debug1: Reading configuration data /etc/ssh/ssh_config >>> debug1: /etc/ssh/ssh_config line 21: include /etc/ssh/ssh_config.d/* >>> matched no files >>> debug1: /etc/ssh/ssh_config line 54: Applying options for * >>> debug1: Authenticator provider $SSH_SK_PROVIDER did not resolve; >>> disabling >>> debug1: Connecting to upload.autonlab.org port 22. >>> debug1: Connection established. >>> debug1: identity file /Users/boecking/.ssh/id_rsa type 0 >>> debug1: identity file /Users/boecking/.ssh/id_rsa-cert type -1 >>> debug1: Local version string SSH-2.0-OpenSSH_9.0 >>> debug1: Remote protocol version 2.0, remote software version OpenSSH_7.4 >>> debug1: compat_banner: match: OpenSSH_7.4 pat OpenSSH_7.4* compat >>> 0x04000006 >>> debug1: Authenticating to upload.autonlab.org:22 as 'benediktb' >>> debug1: load_hostkeys: fopen /Users/boecking/.ssh/known_hosts2: No such >>> file or directory >>> debug1: load_hostkeys: fopen /etc/ssh/ssh_known_hosts: No such file or >>> directory >>> debug1: load_hostkeys: fopen /etc/ssh/ssh_known_hosts2: No such file or >>> directory >>> debug1: SSH2_MSG_KEXINIT sent >>> debug1: SSH2_MSG_KEXINIT received >>> debug1: kex: algorithm: curve25519-sha256 >>> debug1: kex: host key algorithm: ssh-ed25519 >>> debug1: kex: server->client cipher: chacha20-poly1305 at openssh.com MAC: >>> compression: none >>> debug1: kex: client->server cipher: chacha20-poly1305 at openssh.com MAC: >>> compression: none >>> debug1: expecting SSH2_MSG_KEX_ECDH_REPLY >>> debug1: SSH2_MSG_KEX_ECDH_REPLY received >>> debug1: Server host key: ssh-ed25519 >>> SHA256:IUQxesUiVl0JBF9f1ilsQOEK7bzrcA0sxPejAmmL0LI >>> debug1: load_hostkeys: fopen /Users/boecking/.ssh/known_hosts2: No such >>> file or directory >>> debug1: load_hostkeys: fopen /etc/ssh/ssh_known_hosts: No such file or >>> directory >>> debug1: load_hostkeys: fopen /etc/ssh/ssh_known_hosts2: No such file or >>> directory >>> debug1: Host 'upload.autonlab.org' is known and matches the ED25519 >>> host key. >>> debug1: Found key in /Users/boecking/.ssh/known_hosts:1 >>> debug1: rekey out after 134217728 blocks >>> debug1: SSH2_MSG_NEWKEYS sent >>> debug1: expecting SSH2_MSG_NEWKEYS >>> debug1: SSH2_MSG_NEWKEYS received >>> debug1: rekey in after 134217728 blocks >>> debug1: get_agent_identities: bound agent to hostkey >>> debug1: get_agent_identities: agent returned 1 keys >>> debug1: Will attempt key: /Users/boecking/.ssh/id_rsa RSA >>> SHA256:exq72a6QWMvMVpNOUObFzz0ivUQJbWbMn84bUmpEN2g explicit agent >>> debug1: SSH2_MSG_EXT_INFO received >>> debug1: kex_input_ext_info: server-sig-algs= >>> debug1: SSH2_MSG_SERVICE_ACCEPT received >>> debug1: Authentications that can continue: >>> publickey,gssapi-keyex,gssapi-with-mic,password >>> debug1: Next authentication method: publickey >>> debug1: Offering public key: /Users/boecking/.ssh/id_rsa RSA >>> SHA256:exq72a6QWMvMVpNOUObFzz0ivUQJbWbMn84bUmpEN2g explicit agent >>> >>> >> -------------- next part -------------- An HTML attachment was scrubbed... 
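Predrag mentions above that he power cycled the wedged gateway via IPMI. For reference, a minimal ipmitool sketch of that kind of out-of-band check and reset; the BMC hostname, user name, and password file below are placeholders, not the lab's actual settings.

    # Ask the BMC for the current power state (lanplus transport)
    ipmitool -I lanplus -H gateway-bmc.example.org -U admin -f /root/.ipmi_pass chassis power status

    # Check the hardware event log for power or thermal events
    ipmitool -I lanplus -H gateway-bmc.example.org -U admin -f /root/.ipmi_pass sel list | tail -n 20

    # Power cycle the machine if it is wedged
    ipmitool -I lanplus -H gateway-bmc.example.org -U admin -f /root/.ipmi_pass chassis power cycle

A hard power cycle discards whatever state the machine was stuck in, so it is worth saving the SEL output first if you want to know what went wrong.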
URL: From predragp at andrew.cmu.edu Thu Jan 19 14:52:03 2023 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Thu, 19 Jan 2023 14:52:03 -0500 Subject: Somewhat Urgent: lots of GPU nodes not responding In-Reply-To: <10DA39EA-55B8-4F5A-AFCE-45E1F4A4BF5C@andrew.cmu.edu> References: <10DA39EA-55B8-4F5A-AFCE-45E1F4A4BF5C@andrew.cmu.edu> Message-ID: Hi Viraj, Sorry for a bit of delay. I was attending some NFS calls for proposals. I did a bit of poking. It looks like a tangled file system to me. I can't get df -h to produce the output. That is never a good sign. Autofs works as expected. I am surprised that more people didn't report this. Not sure what to do about it as the reboot is probably unwarranted. Somebody is really messing up with the file server. Predrag On Thu, Jan 19, 2023 at 9:42 AM Viraj Mehta wrote: > Hi Predrag, > > Hope you are well this morning. I was kinda shocked to notice that I can?t > access GPU nodes 2,3,4,11,12,17,20. I am not sure what caused this but I > was running jobs on some of these machines that all stopped producing > output around 8:23 last night. These are super critical for the ICML > deadline next Thursday and I would like to restart them ASAP. I am not > entirely sure what happened here as I don?t think they are terribly > write-heavy or anything like that. Please let me know if they are able to > be restored to normal function as I urgently need them. > > If I did anything that was responsible for them crashing, please let me > know as well so I don?t repeat it. I am under some time pressure so am > running a fairly large number of jobs right now. > > Thanks, > Viraj -------------- next part -------------- An HTML attachment was scrubbed... URL: From predragp at andrew.cmu.edu Thu Jan 19 14:59:36 2023 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Thu, 19 Jan 2023 14:59:36 -0500 Subject: Somewhat Urgent: lots of GPU nodes not responding In-Reply-To: References: <10DA39EA-55B8-4F5A-AFCE-45E1F4A4BF5C@andrew.cmu.edu> Message-ID: The good news is that it involves only the machines you are listing. It seems that other machines were not affected. How sure are you that those scripts Ian and you were running don't involve heavy read/write? Predrag On Thu, Jan 19, 2023 at 2:52 PM Predrag Punosevac wrote: > Hi Viraj, > > Sorry for a bit of delay. I was attending some NFS calls for proposals. I > did a bit of poking. It looks like a tangled file system to me. I can't get > > df -h > > to produce the output. That is never a good sign. Autofs works as > expected. I am surprised that more people didn't report this. Not sure > what to do about it as the reboot is probably unwarranted. Somebody is > really messing up with the file server. > > Predrag > > On Thu, Jan 19, 2023 at 9:42 AM Viraj Mehta wrote: > >> Hi Predrag, >> >> Hope you are well this morning. I was kinda shocked to notice that I >> can?t access GPU nodes 2,3,4,11,12,17,20. I am not sure what caused this >> but I was running jobs on some of these machines that all stopped producing >> output around 8:23 last night. These are super critical for the ICML >> deadline next Thursday and I would like to restart them ASAP. I am not >> entirely sure what happened here as I don?t think they are terribly >> write-heavy or anything like that. Please let me know if they are able to >> be restored to normal function as I urgently need them. >> >> If I did anything that was responsible for them crashing, please let me >> know as well so I don?t repeat it. 
I am under some time pressure so am >> running a fairly large number of jobs right now. >> >> Thanks, >> Viraj > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From virajm at andrew.cmu.edu Thu Jan 19 15:22:36 2023 From: virajm at andrew.cmu.edu (Viraj Mehta) Date: Thu, 19 Jan 2023 15:22:36 -0500 Subject: Somewhat Urgent: lots of GPU nodes not responding In-Reply-To: References: <10DA39EA-55B8-4F5A-AFCE-45E1F4A4BF5C@andrew.cmu.edu> Message-ID: <87A9CBEE-24FB-4E0B-A39E-A042A8133AF7@andrew.cmu.edu> Hi Predrag, I?m running a bunch of deep learning scripts that should dump logging info every ~10 minutes or so. However, they were built off of someone else?s codebase that does a bunch of stuff with multiprocessing. I don?t think it relies on read/write for this and have gone through the codebase in some detail without finding anything. I think perhaps an `strace` could tell us something. I will attempt this. Obviously I think there could be something wrong here as the problems have mostly affected servers I was using. Thanks, Viraj > On Jan 19, 2023, at 2:59 PM, Predrag Punosevac wrote: > > The good news is that it involves only the machines you are listing. It seems that other machines were not affected. How sure are you that those scripts Ian and you were running don't involve heavy read/write? > > Predrag > > On Thu, Jan 19, 2023 at 2:52 PM Predrag Punosevac > wrote: > Hi Viraj, > > Sorry for a bit of delay. I was attending some NFS calls for proposals. I did a bit of poking. It looks like a tangled file system to me. I can't get > > df -h > > to produce the output. That is never a good sign. Autofs works as expected. I am surprised that more people didn't report this. Not sure what to do about it as the reboot is probably unwarranted. Somebody is really messing up with the file server. > > Predrag > > On Thu, Jan 19, 2023 at 9:42 AM Viraj Mehta > wrote: > Hi Predrag, > > Hope you are well this morning. I was kinda shocked to notice that I can?t access GPU nodes 2,3,4,11,12,17,20. I am not sure what caused this but I was running jobs on some of these machines that all stopped producing output around 8:23 last night. These are super critical for the ICML deadline next Thursday and I would like to restart them ASAP. I am not entirely sure what happened here as I don?t think they are terribly write-heavy or anything like that. Please let me know if they are able to be restored to normal function as I urgently need them. > > If I did anything that was responsible for them crashing, please let me know as well so I don?t repeat it. I am under some time pressure so am running a fairly large number of jobs right now. > > Thanks, > Viraj -------------- next part -------------- An HTML attachment was scrubbed... URL: From predragp at andrew.cmu.edu Mon Jan 23 12:01:55 2023 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Mon, 23 Jan 2023 12:01:55 -0500 Subject: Scratch on GPU17 is full In-Reply-To: References: Message-ID: Dear Autonians, If you have anything you could clear from gpu17 I would really appreciate it. Otherwise, I will have to use my omnipotence and recreate the scratch space. Best, Predrag On Mon, Jan 23, 2023 at 11:05 AM Willa Potosnak wrote: > Hi Predrag, > > I noticed that scratch on GPU17 is full and there are no running > processes. Would it be possible to clear up some space on this GPU? If so, > it would be very much appreciated. > > Thank you, > Willa > -------------- next part -------------- An HTML attachment was scrubbed... 
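Since full scratch partitions come up repeatedly in these messages, here is a minimal sketch for finding what is eating the space before asking for an admin wipe. It assumes scratch is mounted at /home/scratch with one directory per user; adjust the path and the 30-day cutoff to the actual layout on the node.

    # How full is the scratch file system on this node?
    df -h /home/scratch

    # Largest top-level scratch directories (usually one per user)
    du -xsh /home/scratch/* 2>/dev/null | sort -rh | head -n 20

    # List your own scratch files older than 30 days, biggest first; review before deleting anything
    find /home/scratch/"$USER" -xdev -type f -mtime +30 -printf '%s\t%p\n' | sort -rn | head -n 20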
URL: 
From predragp at andrew.cmu.edu Sun Jan 29 16:58:39 2023 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Sun, 29 Jan 2023 16:58:39 -0500 Subject: lov5 and lov6 are down Message-ID: Dear Autonians, I tried rebooting those two machines remotely via IPMI but to no avail. If I had to guess what is happening, it is most likely an unclean file system on the secondary HDD used for scratch directories. This will have to wait until tomorrow at the very least. Predrag -------------- next part -------------- An HTML attachment was scrubbed... URL:
From predragp at andrew.cmu.edu Mon Jan 30 17:32:39 2023 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Mon, 30 Jan 2023 17:32:39 -0500 Subject: Scratch full on some nodes In-Reply-To: References: Message-ID: Hi Ben, Thanks for reporting. We will give rogue users 24 hours to get their act together before I use my omnipotent account to clear things. Predrag On Mon, Jan 30, 2023 at 5:22 PM Benjamin Freed wrote: > Hi Predrag, > > I am writing to let you know that it seems scratch is full on some of the > nodes; I think gpu20 and gpu22 are full (those are the ones I'm aware of). > > Thanks, > Ben > -------------- next part -------------- An HTML attachment was scrubbed... URL:
From predragp at andrew.cmu.edu Mon Jan 30 17:39:15 2023 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Mon, 30 Jan 2023 17:39:15 -0500 Subject: lov5 and lov6 are down In-Reply-To: References: Message-ID: Dear Autonians, I was correct about the unclean file system. Instead of fixing the 10-year-old HDDs, I put in new ones and installed the latest RHEL 9.1. The computing nodes are not quite finished yet. I hope to make some progress tonight. Adding scratch drives will have to wait until Wednesday at the earliest. Predrag On Sun, Jan 29, 2023 at 4:58 PM Predrag Punosevac wrote: > Dear Autonians, > > I tried rebooting those two machines remotely via IPMI but to no avail. If > I had to guess what is happening, it is most likely an unclean file system > on the secondary HDD used for scratch directories. > > This will have to wait until tomorrow at the very least. > > Predrag > -------------- next part -------------- An HTML attachment was scrubbed... URL:
From awd at cs.cmu.edu Tue Jan 31 12:14:57 2023 From: awd at cs.cmu.edu (Artur Dubrawski) Date: Tue, 31 Jan 2023 12:14:57 -0500 Subject: Autonians receive research grant from the International Institute of Forecasters Message-ID: Team, Please join me in congratulating our own Kin Gutierrez Olivares and Cristian Challu for receiving a small yet highly prestigious recognition! Their proposal on "Transferability of Neural Forecast Methods" was selected by the IIF for funding as one of only two this year: https://forecasters.org/programs/research-awards/iif-sas/ Way to go Kin and Cristian! Artur -------------- next part -------------- An HTML attachment was scrubbed... URL:
From aaupperlee at cmu.edu Tue Jan 31 12:38:25 2023 From: aaupperlee at cmu.edu (Aaron Aupperlee) Date: Tue, 31 Jan 2023 12:38:25 -0500 Subject: Autonians receive research grant from the International Institute of Forecasters In-Reply-To: References: Message-ID: Thanks for sharing, Artur. Congrats, all! On Tue, Jan 31, 2023 at 12:15 PM Artur Dubrawski wrote: > Team, > > Please join me in congratulating our own Kin Gutierrez Olivares and > Cristian Challu for receiving a small yet highly prestigious recognition! 
> Their proposal on "Transferability of Neural Forecast Methods" was selected > by the IIF for funding as one of only two this year: > https://forecasters.org/programs/research-awards/iif-sas/ > > Way to go Kin and Cristian! > > Artur > -------------- next part -------------- An HTML attachment was scrubbed... URL: