From awd at cs.cmu.edu Thu Nov 3 16:02:03 2022
From: awd at cs.cmu.edu (Artur Dubrawski)
Date: Thu, 3 Nov 2022 16:02:03 -0400
Subject: Enter Peggy Martin!
Message-ID:

Dear Autonians,

As you might have heard, Trish Spencer left our team a couple of days ago. We wish her great success in her next endeavors.

The RI has made an outstanding choice of Peggy Martin (cc-d here) as the interim Administrative Assistant for the part of the Auton Lab that works with me.

I am planning to be supernice to Peggy, and I encourage you to please do the same. Even though it is highly unlikely that any of us will ever reach the level of niceness Peggy oozes (for me that is clearly impossible), I sincerely hope that she will agree to stay with us for good.

Welcome onboard, Peggy!

Artur

PS Peggy's contact info:
Office: 3203 Newell-Simon Hall
Phone: (412) 268-7943

From boecking at andrew.cmu.edu Thu Nov 3 20:05:08 2022
From: boecking at andrew.cmu.edu (Benedikt Boecking)
Date: Thu, 3 Nov 2022 19:05:08 -0500
Subject: GPU scratch space
Message-ID: <7CC5DC86-6B14-4590-B51D-71F544326597@andrew.cmu.edu>

Hi all,

Someone filled up all the scratch space on gpu24, which led my experiments to crash. Please monitor your resource usage and don't push it to the limit, as this can negatively affect your peers.

You can use

$ df -h /home/scratch/

to check the available disk space, and

$ du -h --max-depth=1 /home/scratch/myusername/

to check how much space the files and folders in your directory take up.

From awd at cs.cmu.edu Thu Nov 3 20:23:44 2022
From: awd at cs.cmu.edu (Artur Dubrawski)
Date: Thu, 3 Nov 2022 20:23:44 -0400
Subject: GPU scratch space
In-Reply-To: <7CC5DC86-6B14-4590-B51D-71F544326597@andrew.cmu.edu>
References: <7CC5DC86-6B14-4590-B51D-71F544326597@andrew.cmu.edu>
Message-ID:

Team,

These inconsiderate events have been happening too often lately. We have been watching this patiently for a long while, but the time is coming when drastic measures may need to be implemented.

Can everyone please show respect for others and their hard work, and obey our rules of friendly conduct in our shared computing space.

Artur

From predragp at andrew.cmu.edu Thu Nov 3 20:52:47 2022
From: predragp at andrew.cmu.edu (Predrag Punosevac)
Date: Thu, 3 Nov 2022 20:52:47 -0400
Subject: GPU scratch space
In-Reply-To: <7CC5DC86-6B14-4590-B51D-71F544326597@andrew.cmu.edu>
References: <7CC5DC86-6B14-4590-B51D-71F544326597@andrew.cmu.edu>
Message-ID:

GPU24 has 2TB of /home/extra_scratch that I added just 10 days ago. If a few people can't share 2TB of regular plus 2TB of extra scratch space, then a user-dedicated block device (HDD) is needed. It is $75 for every 2TB 2.5" HDD I add, and I can add up to 32 HDDs. Please send me the Oracle string and we will resolve this issue easily.

Predrag
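Building on the du example above, a quick way to see which scratch directories are taking the most space is to combine it with sort (a sketch only; the path follows the thread and assumes GNU coreutils, as on the Linux compute nodes):

$ du -sh /home/scratch/*/ 2>/dev/null | sort -rh | head

This lists the ten largest per-user directories, largest first.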
From predragp at andrew.cmu.edu Sun Nov 6 20:23:53 2022
From: predragp at andrew.cmu.edu (Predrag Punosevac)
Date: Sun, 6 Nov 2022 20:23:53 -0500
Subject: GPU25 crashed three times under the load
Message-ID:

GPU25 crashed three times in the last four days under load:

root at gpu25$ ls -l -1 /var/crash
total 0
drwxr-xr-x. 2 root root 67 Sep  4 21:45 127.0.0.1-2022-09-04-21:45:45
drwxr-xr-x. 2 root root 67 Nov  3 14:24 127.0.0.1-2022-11-03-14:24:33
drwxr-xr-x. 2 root root 67 Nov  4 15:29 127.0.0.1-2022-11-04-15:29:36
drwxr-xr-x. 2 root root 67 Nov  5 10:19 127.0.0.1-2022-11-05-10:19:19

If your job wasn't completed the first time, there is no point restarting it until you debug your code. Einstein came up with a name for that, and it is now commonly taught in the first year of psychology.

Best,
Dr. P^2

From awd at cs.cmu.edu Mon Nov 7 16:35:49 2022
From: awd at cs.cmu.edu (Artur Dubrawski)
Date: Mon, 7 Nov 2022 16:35:49 -0500
Subject: Brief meetings with Artur in the week of 11/7 - please sign up
In-Reply-To:
References:
Message-ID:

The short meeting slots for this week have just been opened. Please book one (or more) while they last.

The spreadsheet link for signups and the zoom link have not changed; they are below for easy reference.

Cheers
Artur

> https://docs.google.com/spreadsheets/d/1OpY1DSxG7LLsMRroocMFTgqndSMiRTYUhYEwYdXH7Wc/edit?pli=1#gid=0
>
> We will be using the same zoom link as before:
>
> https://cmu.zoom.us/j/9672166543
>
> PS Let me know if the available times do not work for you and we will
> look for alternatives.

From awd at cs.cmu.edu Tue Nov 8 09:06:48 2022
From: awd at cs.cmu.edu (Artur Dubrawski)
Date: Tue, 8 Nov 2022 09:06:48 -0500
Subject: Reminder: Brief meetings with Artur in the week of 11/7 - please sign up
In-Reply-To:
References:
Message-ID:

On Mon, Nov 7, 2022 at 4:35 PM Artur Dubrawski wrote:

> The short meeting slots for this week have just been opened. Please book
> one (or more) while they last.
>
> The spreadsheet link for signups and the zoom link have not changed; they
> are below for easy reference.
>
> Cheers
> Artur
From predragp at andrew.cmu.edu Tue Nov 8 15:44:06 2022
From: predragp at andrew.cmu.edu (Predrag Punosevac)
Date: Tue, 8 Nov 2022 15:44:06 -0500
Subject: ssh login problems (NFS server killed due to overload)
In-Reply-To:
References: <70CF9D01-A418-47A2-A0DF-3EEED712A9BB@andrew.cmu.edu> <4674D02C-875E-4347-BD66-6A9082231E14@andrew.cmu.edu>
Message-ID:

Hi Gus,

The SQLite database used by Jupyter is corrupted. The corruption was caused by NFS (high I/O or a stale NFS file handle). The real mystery is why you don't instruct your Jupyter notebook to store its SQLite files in the scratch directory. Which server(s) are affected?

Predrag

On Tue, Nov 8, 2022 at 12:48 PM Gus Welter wrote:

> Hi Predrag,
>
> Multiple lab members have Jupyter hang when they go to create or open a file in Jupyter Lab. I tried to do a "df -h" on lov3, and it hangs. Maybe there are some lingering NFS issues?
>
> Best,
> Gus
>
> On Mon, Oct 24, 2022 at 9:28 PM Predrag Punosevac wrote:
>
>> That means that the processes that caused the crash are still alive. I need to think a bit about how to proceed in the most efficient way. Logging into 45 computing nodes and poking around doesn't scale well. If I end up doing that, the offending account will be suspended.
>>
>> Predrag
>>
>> On Mon, Oct 24, 2022, 1:21 PM Benedikt Boecking wrote:
>>
>>> Just to confirm, looks like things are down again.
>>>
>>> On Oct 24, 2022, at 11:12 AM, Predrag Punosevac wrote:
>>>
>>>> Please try to test bash.autonlab.org, upload.autonlab.org, and lop2.autonlab.org.
>>>>
>>>> It appears that NFS mounts work on these shell gateways. If you have an Auton Lab workstation, please mount -o remount your network home directory or reboot it.
>>>>
>>>> Predrag
>>>>
>>>> On Mon, Oct 24, 2022 at 12:01 PM Predrag Punosevac wrote:
>>>>
>>>>> I am trying really hard not to reboot anything. I manually restarted a bunch of daemons on the main file server Gaia (nfsd, mountd, rpcbind). I noticed that restarting the autofs daemon on computing nodes restored access, and I am using Ansible to propagate the autofs restart over all computing nodes. It appears that some of them hang. I am hoping to get away with rebooting only a machine or two and to avoid rebooting the main file server.
>>>>>
>>>>> For the curious: NFS is last-century (1980s Sun Microsystems) technology. It is a centralized, single-point-of-failure system. We mitigate this risk by having NFS exports distributed over several different physical file servers, each running its own NFS instance. That is why /zfsauton/data and /zfsauton/project, as well as /zfsauton/datasets, are not affected. Unfortunately, all of your home directories are located on Gaia. If I catch rogue users I could theoretically move their home directories to a different file server and avoid this mess. The other option I have been looking at is migrating NFS to GlusterFS (a distributed network file system). The migration would be non-trivial and the performance penalty with small files might be significant. This is not an exact science.
>>>>>
>>>>> Predrag
>>>>>
>>>>> On Mon, Oct 24, 2022 at 11:47 AM Benedikt Boecking wrote:
>>>>>
>>>>>> If there is any way to not reboot gpu24 and gpu27 you might save me 2 weeks of work. If they are rebooted I may be screwed for my ICLR rebuttal.
>>>>>>
>>>>>> But ultimately, do what you have to, of course. Thanks!
>>>>>>
>>>>>> On Oct 24, 2022, at 10:43 AM, Predrag Punosevac wrote:
>>>>>>
>>>>>>> Dear Autonians,
>>>>>>>
>>>>>>> I got several reports this morning from a few of you (Ifi, Abby, Ben, Vedant) of problems accessing the system. After a bit of investigation, I nailed down the culprit: the main file server. The server (NFS instance) appears to be dead or severely degraded due to overload.
>>>>>>>
>>>>>>> I am afraid that the only medicine will be to reboot the machine, perhaps followed by a reboot of all 45+ computing nodes. This will result in a significant loss of work and productivity. We went through this exercise less than two months ago.
>>>>>>>
>>>>>>> The Auton Lab cluster is not policed for rogue users. Its usability depends on the collegial behaviour of each of our 130 members. Use of scratch directories instead of taxing NFS is well described in the documentation, and as recently as last week I added extra scratch on at least four machines.
>>>>>>>
>>>>>>> Best,
>>>>>>> Predrag

From predragp at andrew.cmu.edu Wed Nov 9 00:36:59 2022
From: predragp at andrew.cmu.edu (Predrag Punosevac)
Date: Wed, 9 Nov 2022 00:36:59 -0500
Subject: GPU24 killed, GPU25 /zfsauton/datasets issues
Message-ID:

Dear Autonians,

I am noticing a pattern here. A few users (five or fewer) are fighting over the four most potent computing nodes in our cluster, GPU[24-27]. Those few users have managed to chase away everyone else and have got into a vicious cycle of running jobs too big even for those machines, killing all daemons and NFS mounts in the process. I don't know a thing about ML, but this is not the way to conduct "scientific research".

This will have to stop. I am currently logged into GPU[25-27]; GPU24 is not reachable even with my root ssh access, and the ssh daemon is usually one of the very last daemons to be killed by overuse of resources. I will remain logged in for a few days and monitor activity. Repeated offenders will be reported.

Cheers,
Predrag

From predragp at andrew.cmu.edu Wed Nov 9 00:55:16 2022
From: predragp at andrew.cmu.edu (Predrag Punosevac)
Date: Wed, 9 Nov 2022 00:55:16 -0500
Subject: GPU24 killed, GPU25 /zfsauton/datasets issues
In-Reply-To:
References:
Message-ID:

I used IPMI to power off/on GPU24. I am now logged into that node as well, monitoring use.

Predrag
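For context, power cycling a wedged node through its BMC, as described above, typically looks something like the following with ipmitool (a sketch only; the BMC address and credentials are placeholders, and this requires administrator access to the management network):

$ ipmitool -I lanplus -H <bmc-address> -U <user> -P <password> chassis power status
$ ipmitool -I lanplus -H <bmc-address> -U <user> -P <password> chassis power cycle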
From gwelter at andrew.cmu.edu Wed Nov 9 08:00:06 2022
From: gwelter at andrew.cmu.edu (Gus Welter)
Date: Wed, 9 Nov 2022 17:00:06 +0400
Subject: ssh login problems (NFS server killed due to overload)
In-Reply-To:
References: <70CF9D01-A418-47A2-A0DF-3EEED712A9BB@andrew.cmu.edu> <4674D02C-875E-4347-BD66-6A9082231E14@andrew.cmu.edu>
Message-ID:

Hi Predrag,

I've heard mixed results. For one lab member, the issue persists on all servers, including gpu. For another, it is on the lov servers but not gpu. Personally, I've only tested it on lov3.

Best,
Gus
From predragp at andrew.cmu.edu Wed Nov 9 10:03:55 2022
From: predragp at andrew.cmu.edu (Predrag Punosevac)
Date: Wed, 9 Nov 2022 10:03:55 -0500
Subject: ssh login problems (NFS server killed due to overload)
In-Reply-To:
References: <70CF9D01-A418-47A2-A0DF-3EEED712A9BB@andrew.cmu.edu> <4674D02C-875E-4347-BD66-6A9082231E14@andrew.cmu.edu>
Message-ID:

The only sure fix is rebooting the file server and all computing nodes, and deleting the corrupted SQLite database once everything is back. Things will be OK until the same person(s) who killed the file server restart their scripts.

Predrag
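For anyone who wants to follow Predrag's suggestion earlier in this thread and keep Jupyter's SQLite files off NFS, one option is to point Jupyter's data directory and IPython's profile directory (which holds the history database) at local scratch before starting the server. A sketch; the exact paths are only an example:

$ export JUPYTER_DATA_DIR=/home/scratch/$USER/jupyter
$ export IPYTHONDIR=/home/scratch/$USER/ipython
$ mkdir -p "$JUPYTER_DATA_DIR" "$IPYTHONDIR"
$ jupyter lab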
From predragp at andrew.cmu.edu Sat Nov 12 18:31:43 2022
From: predragp at andrew.cmu.edu (Predrag Punosevac)
Date: Sat, 12 Nov 2022 18:31:43 -0500
Subject: Auton Lab etiquette
Message-ID:

Dear Autonians,

This is a friendly reminder that every Auton Lab account holder is expected to adhere to the Auton Lab etiquette. Please refer to the existing documentation:

https://docs.google.com/document/d/1ah94jN6tMFeHMttyW9Vr9vjhjMQoT7Cb/edit

Cheers,
Predrag

From awd at cs.cmu.edu Mon Nov 21 20:46:21 2022
From: awd at cs.cmu.edu (Artur Dubrawski)
Date: Mon, 21 Nov 2022 20:46:21 -0500
Subject: No brief meetings with Artur this week
In-Reply-To:
References:
Message-ID:

Team,

I was hoping to release even a skimpy meeting schedule for this week, but earlier today I tested positive for covid. Combine that with the lowered immune response due to the chemotherapy I am also taking, and it got me scrambling around trying to figure out what to do. I now have a robust plan and should be fine overall, but I am unable to commit to many meetings these days.

However, if you have a pressing need to meet or a really cool idea to discuss, please do not hesitate to let me know via email, and we will figure it out.

Cheers,
Artur

From predragp at andrew.cmu.edu Tue Nov 22 21:20:48 2022
From: predragp at andrew.cmu.edu (Predrag Punosevac)
Date: Tue, 22 Nov 2022 21:20:48 -0500
Subject: Could not chdir to home directory
Message-ID:

Dear Autonians,

It came to my attention that the autofs daemon is not doing its job on numerous computing nodes. On one of the computing nodes I found this in /var/log/messages:

  Your configuration uses the autofs provider with schema set to rfc2307
  and default attribute mappings. The default map has changed in this
  release, please make sure the configuration matches the server attributes.

which points to an old bug in SSSD that was fixed years ago:

https://bugzilla.redhat.com/show_bug.cgi?id=1372814

I need to poke at this a bit more before I say anything else.

Best,
Predrag
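For reference, the attribute mappings that the warning refers to can be pinned explicitly in the [domain/...] section of sssd.conf instead of relying on the release defaults. A sketch; the values shown are the common automount schema names and are only an example, since the correct ones are whatever attributes our LDAP server actually publishes:

autofs_provider = ldap
ldap_schema = rfc2307
ldap_autofs_map_object_class = automountMap
ldap_autofs_map_name = automountMapName
ldap_autofs_entry_object_class = automount
ldap_autofs_entry_key = automountKey
ldap_autofs_entry_value = automountInformation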
From predragp at andrew.cmu.edu Wed Nov 23 12:12:19 2022
From: predragp at andrew.cmu.edu (Predrag Punosevac)
Date: Wed, 23 Nov 2022 12:12:19 -0500
Subject: Could not chdir to home directory
In-Reply-To:
References:
Message-ID:

The issues didn't go away overnight. I played with this a little last night and this morning. I see the same problem on three out of four file servers, and restarting nfsd, rpcbind, and a few other daemons didn't fix anything. At this point your home directories are not usable. Unless somebody gives a really good reason not to reboot these file servers, they will be rebooted today in an attempt to fix the system.

Best,
Predrag

From predragp at andrew.cmu.edu Fri Nov 25 18:37:12 2022
From: predragp at andrew.cmu.edu (Predrag Punosevac)
Date: Fri, 25 Nov 2022 18:37:12 -0500
Subject: Could not chdir to home directory
In-Reply-To:
References:
Message-ID:

Dear Autonians,

I hope everyone had a good Thanksgiving. I wish you safe travels.

I have fixed the main file server (home directories). Since the fix required a reboot, I upgraded the OS to the latest FreeBSD 13.1-p4. I verified autofs on several computing nodes (gpu24 was one of them) and everything works as expected. If somebody notices any stale NFS file handles, please report them immediately, as those computing nodes will have to be rebooted. Stale NFS file handles will not resolve on their own.

There are at least two other file servers (/zfsauton/data, /zfsauton/projects, /zfsauton/datasets) which need to be fixed. I don't want to jinx it by giving an ETA.

Please adhere to the Auton Lab etiquette to keep the system stable:

https://docs.google.com/document/d/1ah94jN6tMFeHMttyW9Vr9vjhjMQoT7Cb/edit

Best,
Predrag
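A stale NFS handle typically shows up as df or ls hanging on the affected path, or as "Stale file handle" errors. On a personal Auton Lab workstation, the remount Predrag described earlier in this thread looks roughly like the following (a sketch; substitute your own mount point, and on the shared compute nodes please leave this to Predrag):

$ stat /zfsauton/data
$ sudo mount -o remount /zfsauton/data
$ sudo systemctl restart autofs

The last command applies only to autofs-managed mounts on the Linux nodes.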
From predragp at andrew.cmu.edu Sat Nov 26 00:28:38 2022
From: predragp at andrew.cmu.edu (Predrag Punosevac)
Date: Sat, 26 Nov 2022 00:28:38 -0500
Subject: Could not chdir to home directory
In-Reply-To:
References:
Message-ID:

All file servers are now fixed and upgraded to FreeBSD 13.1-RELEASE-p4.

Predrag

From predragp at andrew.cmu.edu Sat Nov 26 12:53:46 2022
From: predragp at andrew.cmu.edu (Predrag Punosevac)
Date: Sat, 26 Nov 2022 12:53:46 -0500
Subject: Gogs down
Message-ID:

Dear Autonians,

The upgrade of the very last lab server (the jail host bhyve.int.autonlab.org, running FreeBSD 12.3) didn't go as planned. As a consequence, a few jail instances are currently not available. Perhaps the only jail instance lab members truly care about is

git.int.autonlab.org

I do have IPMI access to the jail host, but no KVM or text console access due to Java issues. The fix will have to wait until I have physical access to the machine. Unless both HDDs (a ZFS mirror) died at the same time during the reboot, I should be able to recover the server simply by using bectl (the utility for managing boot environments on ZFS).

I apologize for the inconvenience.

Predrag
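For context, recovering with a ZFS boot environment usually amounts to activating the last known-good environment and rebooting. With bectl, run as root on the affected host, that is roughly (a sketch; the environment name is a placeholder):

bectl list
bectl activate <previous-boot-environment>
shutdown -r now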
From awd at cs.cmu.edu Sun Nov 27 13:48:42 2022
From: awd at cs.cmu.edu (Artur Dubrawski)
Date: Sun, 27 Nov 2022 13:48:42 -0500
Subject: Brief meetings with Artur this week
In-Reply-To:
References:
Message-ID:

The new grid of meeting openings has just been published.

Cheers
Artur

From awd at cs.cmu.edu Mon Nov 28 12:28:30 2022
From: awd at cs.cmu.edu (Artur Dubrawski)
Date: Mon, 28 Nov 2022 12:28:30 -0500
Subject: Fwd: RI Ph.D. Thesis Defense: Benedikt Boecking
In-Reply-To:
References:
Message-ID:

Please join Ben on his Big Day and see a really cool talk he is going to give.

Cheers
Artur

---------- Forwarded message ---------
From: Suzanne Muth
Date: Mon, Nov 28, 2022 at 11:54 AM
Subject: RI Ph.D. Thesis Defense: Benedikt Boecking
To:

Date: 07 December 2022
Time: 1:00 p.m. (ET)
Location: NSH 4305
Zoom Link: https://cmu.zoom.us/j/96368686155?pwd=Zm9abDRRYWNJUkNqU2pIZmEvM0hpQT09
Type: Ph.D. Thesis Defense
Who: Benedikt Boecking
Title: Learning with Diverse Forms of Imperfect and Indirect Supervision

Abstract:
Powerful Machine Learning (ML) models trained on large, annotated datasets have driven impressive advances in fields including natural language processing and computer vision. In turn, such developments have led to impactful applications of ML in areas such as healthcare, e-commerce, and predictive maintenance. However, obtaining annotated datasets at the scale required for training high-capacity ML models is frequently a bottleneck for promising applications of ML.

In this thesis, I study alternative pathways for acquiring domain knowledge and develop methodologies to enable learning from weak supervision, i.e., imperfect and indirect forms of supervision. I cover three forms of weak supervision: pairwise linkage feedback, programmatic weak supervision, and paired multi-modal data. These forms of information are often easy to obtain at scale, and the methods I develop reduce--and in some cases eliminate--the need for pointillistic ground truth annotations.

I begin by studying the utility of pairwise supervision. I introduce a new constrained clustering method which uses small amounts of pairwise constraints to simultaneously learn a kernel and cluster data. The method outperforms related approaches on a large and diverse group of publicly available datasets. Next, I introduce imperfect pairwise supervision to programmatic weak supervision label models.
I show empirically that just one source of weak pairwise feedback can lead to significantly improved downstream performance.

I then further the study of programmatic data labeling methods by introducing approaches that model the distribution of inputs in concert with weak labels. I first introduce a framework for joint learning of a label and end model on the basis of observed weak labels, showing improvements over prior work in terms of end model performance on downstream test sets. Next, I introduce a method that fuses generative adversarial networks and programmatic weak supervision label models to the benefit of both, measured by label model performance and data generation quality.

In the last part of this thesis, I tackle a central challenge in programmatic weak supervision: the need for experts to provide labeling rules. First, I introduce an interactive learning framework that aids users in discovering weak supervision sources to capture subject matter experts' knowledge of the application domain in an efficient fashion. I then study the opportunity of dispensing with labeling functions altogether by learning from unstructured natural language descriptions directly. In particular, I study how biomedical text paired with images can be exploited for self-supervised vision--language processing, yielding data-efficient representations and enabling zero-shot classification, without requiring experts to define rules on the text or images.

Together, these works provide novel methodologies and frameworks to encode and use expert domain knowledge more efficiently in ML models, reducing the bottleneck created by the need for manual ground truth annotations.

Thesis Committee Members:
Artur Dubrawski, Chair
Jeff Schneider
Barnabas Poczos
Hoifung Poon, Microsoft Research

A draft of the thesis defense document is available at:
https://drive.google.com/file/d/17DB_6gkfH7LPVzkt0adS0-O58pg_RSmE/view?usp=sharing

_______________________________________________
ri-people mailing list
ri-people at lists.andrew.cmu.edu
https://lists.andrew.cmu.edu/mailman/listinfo/ri-people

From predragp at andrew.cmu.edu Mon Nov 28 18:54:20 2022
From: predragp at andrew.cmu.edu (Predrag Punosevac)
Date: Mon, 28 Nov 2022 18:54:20 -0500
Subject: Gogs down
In-Reply-To:
References:
Message-ID:

One of the HDDs has some S.M.A.R.T. errors, and the BIOS is refusing to allow an automatic reboot with a drive that might die; I was able to bypass the error report simply by pressing F1. I had only 15 minutes to spend in the server room today, so I could not work out how to disable the S.M.A.R.T. check in the firmware. I am running the S.M.A.R.T. daemon in the OS anyway.

In any case, I hope to restore all services shortly. However, this server is going to be a bit of a pain until I replace the faulty HDD and rebuild the ZFS mirror. Unfortunately, this is not a hot-swappable server chassis, so fixing it is going to be time consuming.

Cheers,
Predrag
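For reference, checking the suspect drive and rebuilding the mirror once a replacement is installed generally looks like the following, run as root on the affected host (a sketch; the pool and device names are placeholders for whatever the server actually uses):

smartctl -a /dev/ada1            # health, attributes, and error log of the suspect drive
zpool status                     # identify the degraded vdev
zpool replace <pool> ada1 ada2   # resilver onto the replacement drive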
From awd at cs.cmu.edu Tue Nov 29 13:16:59 2022
From: awd at cs.cmu.edu (Artur Dubrawski)
Date: Tue, 29 Nov 2022 13:16:59 -0500
Subject: Reminder: Brief meetings with Artur this week
In-Reply-To:
References:
Message-ID:

There are still a few slots left.

On Sun, Nov 27, 2022 at 1:48 PM Artur Dubrawski wrote:

> The new grid of meeting openings has just been published.
>
> Cheers
> Artur

From awd at cs.cmu.edu Wed Nov 30 10:08:43 2022
From: awd at cs.cmu.edu (Artur Dubrawski)
Date: Wed, 30 Nov 2022 10:08:43 -0500
Subject: Fwd: 2023 Probability Management Summit Follow-Up
In-Reply-To:
References:
Message-ID:

I think this is relevant and may be of interest to a few of us, and we should have at least one representative of the Lab participate in the event. Please let Jessie and me know if you'd be interested.

Cheers
Artur

---------- Forwarded message ---------
From: George Darakos
Date: Wed, Nov 30, 2022 at 9:52 AM
Subject: Fwd: 2023 Probability Management Summit Follow-Up
To: Zachary Lipton, Artur Dubrawski, Barnabas Poczos, Roni Rosenfeld, Aaditya Ramdas

Hi Gentlemen,

Would you (and your students) be interested in participating in this workshop on Probability Management in May '23? They would like to come to campus for a session on 5/25 where your participation would be requested. Please let me know by noon this Friday if you might be interested.

I'm focusing on SCS faculty, so if you know of any others in SCS who might be interested in this, please let me know.

Note: CBE will be working to identify others around campus (in statistics and in Tepper and Heinz) who may be interested.

Thank you,
George

---------- Forwarded message ---------
From: Schell, Justin (He/Him) (Highmark Health) <justin.schell at highmarkhealth.org>
Date: Wed, Oct 26, 2022 at 9:53 AM
Subject: 2023 Probability Management Summit Follow-Up
To: Anita Jesionowski, gdarakos at andrew.cmu.edu
Cc: Smetanka, Courtney L (Highmark Health) <Courtney.Smetanka at highmarkhealth.org>, Sam Savage <sam at probabilitymanagement.org>, Max Henrion

Anita and George,

Thank you for your time yesterday, and thank you to Courtney for creating the connection. I am including Sam Savage (Executive Director of Probability Management) and Max Henrion (CEO of Lumina Decisions and CMU Adjunct) to make sure their feedback and perspectives are included in our conversation. Here are the follow-up items to our conversation.
What is the mission and vision of Probability Management?

www.probabilitymanagement.org

Probability Management is the developer and promoter of an open-source standard for sharing stochastic information across organizations and business decisions, called the SIPMath Standard. Probability Management's mission is to help decision makers make decisions in the face of uncertainty through a better understanding of the chance of meeting their objectives or avoiding risk. Probability Management's vision is a world where the brightest minds create power plants of uncertainty that generate probability distributions, and accessing those probability distributions for decision-making is as simple as screwing in a light bulb.

What is the goal of the annual Probability Management Summit (May 24th & 25th in 2023)?

Each year, Probability Management holds an invitation-only summit of decision scientists, engineers, and decision makers who come together and collaborate on ways to apply the SIPMath Standard to decision-making, and on methods to change organizational culture to adopt probabilistic methods for making decisions. The summit also looks to connect with institutions of higher education where students and faculty are exploring problems that could benefit from using the SIPMath Standard to generate probability distributions, enhance decision-making, or optimize outcomes.

What is our ask of CMU?

1. Identify faculty, researchers, and partners at CMU who would find the SIPMath Standard relevant and would be interested in learning more about the standard and its application.
2. Facilitate some introductions in advance of the summit to some of the relevant faculty and researchers, to introduce them to the standard with the goal of applying it right away in their work.
3. Publicize a 1-day educational program at CMU to learn about the SIPMath Standard and its applications (hopefully with some examples from the advance introduction group).
4. Provide space for the program.

What are some areas that may be interested?

The SIPMath Standard has found consistent application in the energy industry and in military readiness. However, Highmark Health and Kaiser Permanente have begun using it in healthcare applications as well, and there are applications in government policy. Basically, anyone trying to determine outcomes under uncertain conditions would find the standard useful.

Doug Matty would be an excellent person to touch base with in the advance group. I looked at his LinkedIn profile: he has connections to some of the volunteers and sponsors at Probability Management, and he is a graduate of the OR program at the Naval Postgraduate School (featured at the 2022 summit).

Size range

We are expecting 30-40 people to attend the summit from around the country for the two days. It is reasonable to assume that the CMU day would add another 30-40 CMU representatives.

Thank you again for your time and help. Sam or Max, if you have any thoughts (especially on some advance-group contacts at CMU), please share them. Anita and George, please let me know what other questions or thoughts you have.
From,
Justin Schell (he/him)

Justin Schell | Director, Decision & Capital Analysis | Highmark Health |
120 Fifth Avenue, Suite FAPHM-193A | Pittsburgh, PA 15222 | 412-544-4680 |
justin.schell at highmarkhealth.org
Visit my LinkedIn profile at https://www.linkedin.com/in/justinschell

--
George Darakos | Chief Partnerships Officer
Carnegie Mellon University, School of Computer Science
gdarakos at andrew.cmu.edu | o: (412) 268-3805 | c: (412) 596-7836