From jmjoseph at andrew.cmu.edu  Wed Oct  6 16:56:48 2004
From: jmjoseph at andrew.cmu.edu (Jacob Joseph)
Date: Wed, 06 Oct 2004 16:56:48 -0400
Subject: [auton-users] Request for Downtime
Message-ID: <41645C10.70604@andrew.cmu.edu>

Hi. I need to take Lofty (the fileserver) down for some disk upgrades
and wanted to get an idea of when a day of downtime would be possible.
Obviously, since this machine is rather integral to the lab network,
this will have to occur on a weekend, preferably for an entire day, as I
have a significant amount of data to transfer. By no means does the
downtime *need* to occur immediately, but I would like to accomplish the
upgrade as soon as possible.

Long running jobs on the opterons which need access to BigPapa can be
temporarily suspended for the duration, but would clearly face a higher
risk of problems. Jobs which do not need continual disk access, such as
those reading/writing to /tmp, could continue to run. Desktop machines
using BigPapa would be unusable for the duration of the downtime. The
Auton website and email will not be affected.

So, at the risk of giving too little notice, would this Saturday (10/9)
through Sunday morning inconvenience anyone? I know we had very low
utilization last weekend. If I hear of no objections, I will plan for
this Saturday and send another email. Please do speak up if there will
be a better weekend for me to do this.

Thanks,
Jacob

From jmjoseph at andrew.cmu.edu  Wed Oct  6 19:32:19 2004
From: jmjoseph at andrew.cmu.edu (Jacob Joseph)
Date: Wed, 06 Oct 2004 19:32:19 -0400
Subject: [auton-users] Downtime Scheduling
Message-ID: <41648083.2080405@andrew.cmu.edu>

This weekend is not going to work for a number of people. As such, I am
going to tentatively schedule for next weekend (10/16), with 10/23 as an
alternate if it looks necessary by the end of next week.

Contact me if either 10/16 or 10/23 will be unacceptable.
-Jacob

From jmjoseph at andrew.cmu.edu  Thu Oct  7 10:57:55 2004
From: jmjoseph at andrew.cmu.edu (Jacob Joseph)
Date: Thu, 07 Oct 2004 10:57:55 -0400
Subject: [auton-users] Lofty Interruption Last Night
Message-ID: <41655973.90106@andrew.cmu.edu>

Late last night, between 9 and 12, a bug in the firmware of the primary
disk array (BigPapa) was encountered. This did result in the array going
down until I noticed around 1am. After waiting out some emergency
backups and an upgrade of the buggy firmware, I brought the array back
up for good around 4-5am.

While nothing on the disks was lost, there is a slim to nil chance that
writes in progress were interrupted at a point that could have resulted
in corruption. Due to the way we do NFS and the lack of caching in
memory during writes, this is extremely unlikely. All transfers resumed
when the array was brought back up. For the sake of clarity, I can
imagine such corruption presenting itself as a few bytes missing from
the middle of a written file.

Despite the low likelihood of any troubles, I think it's prudent to let
everyone know the two critical time points where Lofty was uncleanly
rebooted. If you were writing between 9-12 (I do not know the exact time
of the shutdown offhand) and at about 3am, and you see problems with
your data, it would be wise to keep this issue in mind. Please note that
reads would not have been affected in any case.

While I've tried to address everything, please let me know if you have
any questions.

-Jacob

From jmjoseph at andrew.cmu.edu  Sat Oct  9 20:02:29 2004
From: jmjoseph at andrew.cmu.edu (Jacob Joseph)
Date: Sat, 09 Oct 2004 20:02:29 -0400
Subject: [auton-users] Lab Downtime 10pm TONIGHT
Message-ID: <41687C15.7060205@andrew.cmu.edu>

After talking further with those affected by downtime this weekend,
we've agreed to not delay the downtime beyond this weekend. I am going
to start the previously discussed disk maintenance at 10pm tonight,
completing by 9am Sunday morning.
If you didn't speak up before, and this is going to be a problem for
you, contact me immediately:

AIM: jacobmj1
EMAIL: jmjoseph at andrew.cmu.edu
PHONE: 831-524-0666

Thanks,
-Jacob

From jmjoseph at andrew.cmu.edu  Sun Oct 10 08:46:59 2004
From: jmjoseph at andrew.cmu.edu (Jacob Joseph)
Date: Sun, 10 Oct 2004 08:46:59 -0400
Subject: [auton-users] Lab's back up
In-Reply-To: <41687C15.7060205@andrew.cmu.edu>
References: <41687C15.7060205@andrew.cmu.edu>
Message-ID: <41692F43.3060704@andrew.cmu.edu>

The lab is back up. Please note that most machines using BigPapa will
need to be rebooted to correct their NFS mounts.

Given some slow transfers, I was unable to complete one portion of the
maintenance within my 9am window and am currently running on a temporary
disk array. I will be able to perform the needed transfers in the
background with the lab up, but the transition back from the temporary
array will require a short bit of downtime with the correction of NFS
mounts on all clients, as is the case currently. I hope to accomplish
this tonight.

-Jacob

Jacob Joseph wrote:
> After talking further with those affected by downtime this weekend,
> we've agreed to not delay the downtime beyond this weekend. I am going
> to start the previously discussed disk maintenance at 10pm tonight,
> completing by 9am Sunday morning.
>
> If you didn't speak up before, and this is going to be a problem for
> you, contact me immediately:
>
> AIM: jacobmj1
> EMAIL: jmjoseph at andrew.cmu.edu
> PHONE: 831-524-0666
>
> Thanks,
> -Jacob

From awm at cs.cmu.edu  Thu Oct 21 14:19:43 2004
From: awm at cs.cmu.edu (Andrew W Moore)
Date: Thu, 21 Oct 2004 14:19:43 -0400
Subject: [auton-users] Cycle servers
Message-ID: <4177FDBF.3060707@cs.cmu.edu>

Over the next few weeks please avoid using the cycle servers for big
jobs. Several sponsored research projects need results in the short term
and we want to use the machines for those. We'll let you know when this
period ends: probably mid-November.
If a machine looks empty and has been unused for a while then feel free
to run your code provided it doesn't use a whole lot of memory, but in
"nice" mode.

Thanks,
Andrew

From jmjoseph at andrew.cmu.edu  Thu Oct 21 16:09:22 2004
From: jmjoseph at andrew.cmu.edu (Jacob Joseph)
Date: Thu, 21 Oct 2004 16:09:22 -0400
Subject: [auton-users] Saturday 10/23 Brief Lab outage
Message-ID: <41781772.8010308@andrew.cmu.edu>

Hi. In order to put the finishing touches on the disk array work of
last weekend, I am going to temporarily disallow access and suspend
running jobs for about 5 hours Saturday night, beginning around 10PM.
All running jobs will be restarted when the work is complete.

Please note that any desktop machines using NFS home directories will
not be usable during this period. I have agreed upon this time with the
current users of the cluster machines and thus assume that there will be
no others affected. However, if this time is somehow inappropriate for
you, contact me by email ASAP.

-Jacob

From jmjoseph at andrew.cmu.edu  Sat Oct 23 20:07:24 2004
From: jmjoseph at andrew.cmu.edu (Jacob Joseph)
Date: Sat, 23 Oct 2004 20:07:24 -0400
Subject: [auton-users] Saturday 10/23 Brief Lab outage
In-Reply-To: <41781772.8010308@andrew.cmu.edu>
References: <41781772.8010308@andrew.cmu.edu>
Message-ID: <417AF23C.4000308@andrew.cmu.edu>

The window for this outage has been pushed back a few hours, beginning
between 12 and 2AM.

-Jacob

Jacob Joseph wrote:
> Hi. In order to put the finishing touches on the disk array work of
> last weekend, I am going to temporarily disallow access and suspend
> running jobs for about 5 hours Saturday night, beginning around 10PM.
> All running jobs will be restarted when the work is complete.
>
> Please note that any desktop machines using NFS home directories will
> not be usable during this period. I have agreed upon this time with the
> current users of the cluster machines and thus assume that there will be
> no others affected.
> However, if this time is somehow inappropriate for
> you, contact me by email ASAP.
>
> -Jacob

From jmjoseph at andrew.cmu.edu  Sun Oct 24 08:00:00 2004
From: jmjoseph at andrew.cmu.edu (Jacob Joseph)
Date: Sun, 24 Oct 2004 08:00:00 -0400
Subject: [auton-users] Lab is back
In-Reply-To: <41781772.8010308@andrew.cmu.edu>
References: <41781772.8010308@andrew.cmu.edu>
Message-ID: <417B9940.3000206@andrew.cmu.edu>

Hi. The disk transfer was successfully completed and all jobs on the
lops resumed. That is, lofty and BigPapa are back up, with a bit more
space than before. You might notice the not-so-little "T" in df.

Thanks.
-Jacob