From predragp at cs.cmu.edu Wed Jul 1 15:12:07 2015
From: predragp at cs.cmu.edu (Predrag Punosevac)
Date: Wed, 01 Jul 2015 15:12:07 -0400
Subject: Athena is down
Message-ID: <20150701201207.NzBMJGCB%predragp@cs.cmu.edu>

Dear Autonians,

I just got a message from Monit that Athena is down. Trying manually to
ping it didn't work. This is a very serious issue as Athena is our main
KVM host running typically more than 10 KVM guests. All Traffic Jam
project KVM instances are down and can't be recovered until I see what
is wrong with Athena.

I am looking into this right now.

Predrag

From predragp at cs.cmu.edu Wed Jul 1 17:42:58 2015
From: predragp at cs.cmu.edu (Predrag Punosevac)
Date: Wed, 01 Jul 2015 17:42:58 -0400
Subject: Athena is down
In-Reply-To: <20150701201207.NzBMJGCB%predragp@cs.cmu.edu>
References: <20150701201207.NzBMJGCB%predragp@cs.cmu.edu>
Message-ID: <20150701224258.E5aUfvhI%predragp@cs.cmu.edu>

Predrag Punosevac wrote:

> Dear Autonians,
>
> I just got a message from Monit that Athena is down. Trying manually to
> ping it didn't work. This is a very serious issue as Athena is our main
> KVM host running typically more than 10 KVM guests. All Traffic Jam
> project KVM instances are down and can't be recovered until I see what
> is wrong with Athena.
>
> I am looking into this right now.
>
> Predrag

Dear Autonians,

Athena is up and running. I will proceed carefully with firing up
virtual machines. Most importantly, the RAID arrays look good at the moment.
However, the largest data pool/RAID is still resyncing. It will take
another couple of hours to finish.

I will send another update in about 2h.
Predrag

From predragp at cs.cmu.edu Wed Jul 1 20:14:26 2015
From: predragp at cs.cmu.edu (Predrag Punosevac)
Date: Wed, 01 Jul 2015 20:14:26 -0400
Subject: Athena is down
In-Reply-To: <20150701224258.E5aUfvhI%predragp@cs.cmu.edu>
References: <20150701201207.NzBMJGCB%predragp@cs.cmu.edu> <20150701224258.E5aUfvhI%predragp@cs.cmu.edu>
Message-ID: <20150702011426.gFw80Sfg%predragp@cs.cmu.edu>

Predrag Punosevac wrote:

> Predrag Punosevac wrote:
>
> > Dear Autonians,
> >
> > I just got a message from Monit that Athena is down. Trying manually to
> > ping it didn't work. This is a very serious issue as Athena is our main
> > KVM host running typically more than 10 KVM guests. All Traffic Jam
> > project KVM instances are down and can't be recovered until I see what
> > is wrong with Athena.
> >
> > I am looking into this right now.
> >
> > Predrag
>
> Dear Autonians,
>
> Athena is up and running. I will proceed carefully with firing up
> virtual machines. Most importantly, the RAID arrays look good at the moment.
> However, the largest data pool/RAID is still resyncing. It will take
> another couple of hours to finish.
>
> I will send another update in about 2h.
>
> Predrag

All virtual machines on Athena, with the exception of the John Deere
computing node, are up and running. The data pool resync is at 65%.

If you are in charge of one of those virtual servers, please take 10-15
minutes to make sure that your services are running as expected.

I am going through the log files right now, trying to understand what
happened.

Predrag

From predragp at imap.srv.cs.cmu.edu Thu Jul 2 08:14:00 2015
From: predragp at imap.srv.cs.cmu.edu (predragp)
Date: Thu, 02 Jul 2015 08:14:00 -0400
Subject: CS.CMU infrastructure problems
Message-ID:

Some of you must have noticed problems accessing network services. I
checked our infrastructure and it appears that something is happening
with the CS.CMU infrastructure.
For example, I can't log into my cs.cmu.edu e-mail from home, but I can
log in from the remote server in Florida. I also had trouble with the
www.autonlab.org web site from home, but not from several other remote
locations around the U.S. I haven't gotten any notice from them, so I
will check once I am on campus.

Predrag

From boecking at andrew.cmu.edu Mon Jul 20 23:06:55 2015
From: boecking at andrew.cmu.edu (Benedikt Boecking)
Date: Mon, 20 Jul 2015 23:06:55 -0400
Subject: disc space lov3
Message-ID:

Would everyone with large files in their /home/scratch directory on lov3 (and other nodes for that matter) please delete unnecessary files? The directory is 100% full right now so many users cannot run important jobs. If everyone makes this a habit then I won't have to send out a reminder every 4 weeks.

Thanks.

===============================
Benedikt Boecking
Research Programmer
Auton Lab, School of Computer Science
Carnegie Mellon University

From predragp at cs.cmu.edu Mon Jul 20 23:29:13 2015
From: predragp at cs.cmu.edu (Predrag Punosevac)
Date: Mon, 20 Jul 2015 23:29:13 -0400
Subject: disc space lov3
In-Reply-To:
References:
Message-ID: <20150721042913.VBe5mwU-%predragp@cs.cmu.edu>

Benedikt Boecking wrote:

> Would everyone with large files in their /home/scratch directory on lov3 (and other nodes for that matter) please delete unnecessary files? The directory is 100% full right now so many users cannot run important jobs. If everyone makes this a habit then I won't have to send out a reminder every 4 weeks.
>
> Thanks.
>

Dear Autonians,

I could add disk space on LOV3 and LOV4 so that we don't go through
this anymore. However, that requires that I reboot the machines, which is
unacceptable at this moment due to numerous deadlines. Could you please
cooperate on this matter until I have the green light from the project
leaders that it is safe to reboot?
Thank you,
Predrag

> ===============================
> Benedikt Boecking
> Research Programmer
> Auton Lab, School of Computer Science
> Carnegie Mellon University
>

From predragp at cs.cmu.edu Thu Jul 23 13:59:57 2015
From: predragp at cs.cmu.edu (Predrag Punosevac)
Date: Thu, 23 Jul 2015 13:59:57 -0400
Subject: Fwd: RE: Brief Power Outage in NSH and Smith Friday Morning
Message-ID: <20150723185957.U4HpTSer%predragp@cs.cmu.edu>

Dear Autonians,

Unfortunately this power outage will affect us since most of our UPSs
have been retired. Please shut down your desktop machine when you leave
work today unless it is plugged into a UPS. Otherwise I will
shut down all desktop computers tonight at 10:00 PM and have them back
tomorrow around 8:30 AM.

Best,
Predrag

-------- Original Message --------
From: "SCS Help Desk"
To:
Subject: RE: Brief Power Outage in NSH and Smith Friday Morning
Date: Thu, 23 Jul 2015 12:49:10 -0400

SCS Computing Facilities recommends that office occupants shut down all
desktop machines prior to these outages as a precaution.

There should be no interruption of SCS network connectivity, email
services, AFS or printing during these outages.

Please contact the SCS Help Desk at x8-4231 or send mail to
help+ at cs.cmu.edu or skees at cs.cmu.edu with any questions or concerns
regarding these scheduled power outages.

Thank you for your attention,

SCS Help Desk

-----Original Message-----

From: Jim Skees
Sent: Wednesday, July 22, 2015 2:43 PM
To: Jim Skees
Subject: Brief Power Outage in NSH and Smith Friday Morning

*Division of Campus Affairs*
*Facilities Management Services*
*TO: Shutdown Group, FMS, Newell Simon, Hamburg, Smith*
*FROM: Service Response Center*
*DATE: July 22, 2015*
*SUBJECT: Power Transfer Affecting FMS, Newell Simon, Hamburg and Smith on 7/24*

FMS electricians will be transferring power on Friday, 7/24, starting at
6:00AM, to support the new chiller being installed for the Scott Hall
Project.
During this time, the following buildings will experience a short
power outage:

FMS Building
Newell Simon
Hamburg Hall
Smith Hall

Power should be restored by 6:30AM to all affected buildings.

If you have questions about this work, please contact the Service Response
Center at 412-268-2910 or fixit at andrew.cmu.edu

From predragp at cs.cmu.edu Thu Jul 23 14:47:07 2015
From: predragp at cs.cmu.edu (Predrag Punosevac)
Date: Thu, 23 Jul 2015 14:47:07 -0400
Subject: Brief Power Outage in NSH and Smith Friday Morning
In-Reply-To:
References: <20150723185957.U4HpTSer%predragp@cs.cmu.edu>
Message-ID: <20150723194707.9qiTZLpn%predragp@cs.cmu.edu>

Anthony Wertz wrote:

> Predrag,
>
> Does this affect athena (or wherever jdeere is located)? Some things are
> running there I just need to know if I need to kill it somewhere.

No, it does not! Sorry folks, I should have been clearer about this.
The power outage will NOT affect our servers. We have two generators
plus batteries to gracefully shut down the servers. Unfortunately we no
longer have UPSs for the desktop machines. If we see more frequent power
outages, I am sure we will scrape together 2-3K and equip all desktops
with new UPSs.

Predrag

>
> 2015-07-23 13:59 GMT-04:00 Predrag Punosevac :
>
> > Dear Autonians,
> >
> > Unfortunately this power outage will affect us since most of our UPSs
> > have been retired. Please shut down your desktop machine when you leave
> > work today unless it is plugged into a UPS. Otherwise I will
> > shut down all desktop computers tonight at 10:00 PM and have them back
> > tomorrow around 8:30 AM.
> >
> > Best,
> > Predrag
> >
> >
> > -------- Original Message --------
> > From: "SCS Help Desk"
> > To:
> > Subject: RE: Brief Power Outage in NSH and Smith Friday Morning
> > Date: Thu, 23 Jul 2015 12:49:10 -0400
> >
> > SCS Computing Facilities recommends that office occupants shut down all
> > desktop machines prior to these outages as a precaution.
> > There should be no interruption of SCS network connectivity, email
> > services, AFS or printing during these outages.
> >
> > Please contact the SCS Help Desk at x8-4231 or send mail to
> > help+ at cs.cmu.edu or skees at cs.cmu.edu with any questions or concerns
> > regarding these scheduled power outages.
> >
> > Thank you for your attention,
> >
> > SCS Help Desk
> >
> > -----Original Message-----
> >
> > From: Jim Skees
> > Sent: Wednesday, July 22, 2015 2:43 PM
> > To: Jim Skees
> > Subject: Brief Power Outage in NSH and Smith Friday Morning
> >
> > *Division of Campus Affairs*
> > *Facilities Management Services*
> > *TO: Shutdown Group, FMS, Newell Simon, Hamburg, Smith*
> > *FROM: Service Response Center*
> > *DATE: July 22, 2015*
> > *SUBJECT: Power Transfer Affecting FMS, Newell Simon, Hamburg and Smith on 7/24*
> >
> > FMS electricians will be transferring power on Friday, 7/24, starting at
> > 6:00AM, to support the new chiller being installed for the Scott Hall
> > Project. During this time, the following buildings will experience a short
> > power outage:
> >
> > FMS Building
> > Newell Simon
> > Hamburg Hall
> > Smith Hall
> >
> > Power should be restored by 6:30AM to all affected buildings.
> >
> > If you have questions about this work, please contact the Service Response
> > Center at 412-268-2910 or fixit at andrew.cmu.edu
> >
> >
>
>
> --
> *Anthony Wertz*
> Research Programmer and Analyst
> Robotics Institute - Auton Lab
> Carnegie Mellon University
> awertz at cmu.edu

From predragp at cs.cmu.edu Fri Jul 24 10:42:33 2015
From: predragp at cs.cmu.edu (Predrag Punosevac)
Date: Fri, 24 Jul 2015 10:42:33 -0400
Subject: Desktops powered up
In-Reply-To: <0A0CF19A-A8AA-4A8E-8CAB-86A6906C810C@andrew.cmu.edu>
References: <20150723185957.U4HpTSer%predragp@cs.cmu.edu> <0A0CF19A-A8AA-4A8E-8CAB-86A6906C810C@andrew.cmu.edu>
Message-ID: <20150724154233.WCbaEngH%predragp@cs.cmu.edu>

Benedikt Boecking wrote:

> Predrag,
>
> could you do me a favour and restart my desktop PC?
>
> Thanks!

All desktops are now restarted. So far everything looks good, but I will
be checking things for another hour.

Predrag

>
> ===============================
> Benedikt Boecking
> Research Programmer
> Auton Lab, School of Computer Science
> Carnegie Mellon University
>
>
> On 23 Jul 2015, at 13:59, Predrag Punosevac wrote:
> >
> > tomorrow around 8:30 AM.
>

From predragp at cs.cmu.edu Fri Jul 24 11:27:11 2015
From: predragp at cs.cmu.edu (Predrag Punosevac)
Date: Fri, 24 Jul 2015 11:27:11 -0400
Subject: sftp(Upload/Mollify) is dead
Message-ID: <20150724162711.mLLQva0Y%predragp@cs.cmu.edu>

Dear Autonians,

Our secure SFTP server, a.k.a. upload, which was running on that old Sun
Blade in the storage room, did not survive the shutdown and power-up
cycle. If you were in the process of exchanging files with your
collaborators, please inform them that the service will be restored
within 24h.

Moving upload to one of the new Atom servers has been on my to-do list
for a while, and it has now moved to the top of my stack.
Since the accounts on Upload were granted on an ad hoc basis, I will
need people who use this service to resend me their project names
together with the names of their collaborators.

I apologize for any inconvenience.

Sincerely,
Predrag Punosevac

From predragp at cs.cmu.edu Fri Jul 24 11:57:04 2015
From: predragp at cs.cmu.edu (Predrag Punosevac)
Date: Fri, 24 Jul 2015 11:57:04 -0400
Subject: Neill3 revitalized
Message-ID: <20150724165704.axeoRHw9%predragp@cs.cmu.edu>

All services on neill3 have been restored after the crash caused by
overload.

Predrag

From predragp at cs.cmu.edu Fri Jul 24 17:29:28 2015
From: predragp at cs.cmu.edu (Predrag Punosevac)
Date: Fri, 24 Jul 2015 17:29:28 -0400
Subject: LOV4 recovered from the crash
Message-ID: <20150724222928.xHrkWlcf%predragp@cs.cmu.edu>

I hope this is the last incident for today. LOV4 suffered a very
serious crash because of the high load. I was able to recover the
machine by manually repairing the damaged file system with fsck.

Predrag

From punosevac72 at gmail.com Sat Jul 25 00:48:55 2015
From: punosevac72 at gmail.com (Predrag Punosevac)
Date: Sat, 25 Jul 2015 00:48:55 -0400
Subject: ARI about to crash
Message-ID: <20150725054855.WKNNDTDZ%punosevac72@gmail.com>

Dear Autonians,

I would not be surprised to see ARI crash. Currently the CPUs are at
100% and more than 92% of the 512GB of memory is in use.

Best,
Predrag
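
A memory figure like the one reported for ARI can be checked on a node with a few lines of Python. This is only a minimal sketch, not the monitoring actually used in the lab: it assumes a Linux host whose /proc/meminfo exposes the MemTotal and MemAvailable fields, and the function name memory_used_percent is hypothetical.

```python
from pathlib import Path


def memory_used_percent(meminfo_text: str) -> float:
    """Return the percentage of memory in use, given /proc/meminfo-style text."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])  # values are reported in kB
    total = fields["MemTotal"]
    available = fields["MemAvailable"]  # assumption: kernel reports this field
    return 100.0 * (total - available) / total


if __name__ == "__main__":
    meminfo = Path("/proc/meminfo")  # exists on Linux only
    if meminfo.exists():
        print(f"memory in use: {memory_used_percent(meminfo.read_text()):.1f}%")
```

Run on the node itself, the script prints a single percentage that can be compared against a threshold before starting another large job.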