From predragp at andrew.cmu.edu Mon Apr 2 20:08:07 2018 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Mon, 02 Apr 2018 20:08:07 -0400 Subject: MATLAB R2018a Message-ID: <20180403000807.r-X8kQvfn%predragp@andrew.cmu.edu> Dear Autonians, Some of you might have noticed that MATLAB R2018a has been released. Based on my past experience I don't see any compelling reason to rush to upgrade from R2017b across all machines. We will have to upgrade to R2018b in the Fall due to licensing issues. I did upgrade MATLAB on GPU1 and GPU2 to R2018a, partly as a preview and partly because I had to change the install location (MATLAB is now 17GB with all toolboxes). If you must have MATLAB 2018a please ping me, but be mindful of the fact that I have to rebuild the main file server and that is priority 1-5. MATLAB is fully functional on all GPU and CPU computing nodes. Best, Predrag From predragp at andrew.cmu.edu Tue Apr 3 14:19:15 2018 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Tue, 03 Apr 2018 14:19:15 -0400 Subject: Main File Server status Message-ID: <20180403181915.CWzH7TF-S%predragp@andrew.cmu.edu> Dear Autonians, As I indicated in my earlier e-mails, I can no longer postpone rebuilding the main file server. In my tests remote replications appear to be consistent. I have just mounted backup copies of the /zfsauton/project and /zfsauton/data folders from my backups on the gpu9 and lov1 computing nodes. If you have a few minutes to spare, please log into these two servers and check for yourself that everything passes the smell test. Please let me know ASAP if everything is OK. Unless we discover something unexpected I will stop snapshots and remote replication of those two datasets on the main file server at 4:00 PM and shortly afterward start unmounting those shares and mounting them from the backup copy. If you have important experiments which use data in those directories you need to speak up right now. All processes have to be stopped for me to unmount the existing shares and mount the new ones. I appreciate your cooperation in this matter. Best, Predrag From predragp at andrew.cmu.edu Tue Apr 3 19:49:43 2018 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Tue, 03 Apr 2018 19:49:43 -0400 Subject: Main File Server status In-Reply-To: <20180403181915.CWzH7TF-S%predragp@andrew.cmu.edu> References: <20180403181915.CWzH7TF-S%predragp@andrew.cmu.edu> Message-ID: <20180403234943.-G4uoDrWb%predragp@andrew.cmu.edu> Predrag Punosevac wrote: > Dear Autonians, > > As I indicated in my earlier e-mails, I can no longer postpone rebuilding > the main file server. In my tests remote replications appear to be > consistent. > > I have just mounted backup copies of the /zfsauton/project and > /zfsauton/data folders from my backups on the gpu9 and lov1 computing nodes. The backup copies of the /zfsauton/project and /zfsauton/data ZFS datasets, which are physically located on Uranus, are now mounted/live on all computing nodes but not on desktops. The snapshots are taken as usual but no remote replication is currently performed. If everything looks OK in the next 24h I will destroy the copies of these two datasets on the main file server in order to create enough space to snapshot and back up the /zfsauton/home dataset, which has not been possible for the past few weeks.
This is how the ZFS pool looks right now [root at gaia] ~# zpool list NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT zfsauton 27.2T 24.6T 2.61T 90% 1.00x ONLINE /mnt I have to drop the load to under 80% in order to be able to back up your home directories before rebuilding the server. Predrag > If you have a few minutes to spare, please log into these two servers and > check for yourself that everything passes the smell test. Please let me > know ASAP if everything is OK. > > Unless we discover something unexpected I will stop snapshots and remote > replication of those two datasets on the main file server at 4:00 PM and > shortly afterward start unmounting those shares and mounting them from > the backup copy. If you have important experiments which use data in > those directories you need to speak up right now. All processes have to be > stopped for me to unmount the existing shares and mount the new ones. I > appreciate your cooperation in this matter. > > Best, > Predrag From ngisolfi at cs.cmu.edu Wed Apr 4 15:18:22 2018 From: ngisolfi at cs.cmu.edu (Nick Gisolfi) Date: Wed, 4 Apr 2018 15:18:22 -0400 Subject: [hackAuton] Anyone have a large car? Message-ID: Hi Everyone, Our catering order for Saturday dinner is coming from Choolaah in Bakery Square. Sibi found out they don't offer delivery. We need to pick up the food ourselves. Does anyone have a large car? Can any of you help Sibi pick up the food around 4:30pm, Saturday April 7th? It needs to be delivered to the 3rd floor of NSH. Please reply-all so we know when someone has offered to help. Thank you! - Nick -------------- next part -------------- An HTML attachment was scrubbed... URL: From ngisolfi at cs.cmu.edu Thu Apr 5 08:51:08 2018 From: ngisolfi at cs.cmu.edu (Nick Gisolfi) Date: Thu, 5 Apr 2018 08:51:08 -0400 Subject: [hackAuton] Join Event Slack Team Message-ID: Hi Everyone, The hackAuton starts tomorrow at 5pm! I strongly encourage everyone to join the event Slack team. You can do so with your cmu.edu email. This Slack team will be used for communication throughout the event. Participants can post questions and receive answers. The more Autonians we have monitoring the channels, the better. This way, you can even help with the event remotely! hackauton.slack.com Details about the schedule/challenge problems/etc will be posted on hackauton.com before the start of the event. Thanks All! This is going to be a very exciting weekend for the Auton Lab! - Nick -------------- next part -------------- An HTML attachment was scrubbed... URL: From yichongx at cs.cmu.edu Fri Apr 6 19:26:09 2018 From: yichongx at cs.cmu.edu (Yichong Xu) Date: Fri, 6 Apr 2018 19:26:09 -0400 Subject: Compling source of PyTorch Message-ID: <1AF01AA2-1FBD-48C6-9DF6-2B13D138E900@cs.cmu.edu> Hi Autonians, Has anyone got experience on installing PyTorch from source? I was able to install it but it said it was complied without cuDNN. Does anyone know if cuDNN is installed and where it is? Also it seems it does not work quite well with cuda 9.1?? Thanks in advance! Thanks, Yichong -------------- next part -------------- An HTML attachment was scrubbed...
URL: From predragp at andrew.cmu.edu Fri Apr 6 20:44:59 2018 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Fri, 06 Apr 2018 20:44:59 -0400 Subject: Compling source of PyTorch In-Reply-To: <1AF01AA2-1FBD-48C6-9DF6-2B13D138E900@cs.cmu.edu> References: <1AF01AA2-1FBD-48C6-9DF6-2B13D138E900@cs.cmu.edu> Message-ID: <20180407004459.YDbiLK0Mi%predragp@andrew.cmu.edu> Yichong Xu wrote: > Hi Autonians, > Has anyone got experience on installing PyTorch from source? I was able to install it but it said it was complied without cuDNN. Does anyone know if cuDNN is installed and where it is? Also it seems it does not work quite well with cuda 9.1???? > It was broken with cuDNN 7.1 IIRC and CUDA 9.1. It can be compiled with cuDNN 7.0. I have to check which version of cuDNN we have installed. I would not be surprised 7.1. Predrag > Thanks in advance! > > Thanks, > Yichong > > > From yichongx at cs.cmu.edu Fri Apr 6 23:04:25 2018 From: yichongx at cs.cmu.edu (Yichong Xu) Date: Fri, 6 Apr 2018 23:04:25 -0400 Subject: Compling source of PyTorch In-Reply-To: <20180407004459.YDbiLK0Mi%predragp@andrew.cmu.edu> References: <1AF01AA2-1FBD-48C6-9DF6-2B13D138E900@cs.cmu.edu> <20180407004459.YDbiLK0Mi%predragp@andrew.cmu.edu> Message-ID: <267621BD-A370-4F64-BBCE-BF86204127E8@cs.cmu.edu> Thanks for telling me Predrag! It seems my current problem should be to find cuDNN7.0 - I actually complied the source with cuda9.0. Thanks, Yichong > On Apr 6, 2018, at 8:44 PM, Predrag Punosevac wrote: > > Yichong Xu wrote: > >> Hi Autonians, >> Has anyone got experience on installing PyTorch from source? I was able to install it but it said it was complied without cuDNN. Does anyone know if cuDNN is installed and where it is? Also it seems it does not work quite well with cuda 9.1???? >> > It was broken with cuDNN 7.1 IIRC and CUDA 9.1. It can be compiled with > cuDNN 7.0. I have to check which version of cuDNN we have installed. I > would not be surprised 7.1. > > Predrag > >> Thanks in advance! >> >> Thanks, >> Yichong >> >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From awd at andrew.cmu.edu Mon Apr 9 15:40:43 2018 From: awd at andrew.cmu.edu (Artur Dubrawski) Date: Mon, 9 Apr 2018 15:40:43 -0400 Subject: Fwd: Another hackathon In-Reply-To: References: Message-ID: Just in case you feel like keep pushing, Martial Hebert reminds us about this other hackathon. Cheers Artur ---------- Forwarded message ---------- From: Martial Hebert Date: Mon, Apr 9, 2018 at 3:33 PM Subject: Another hackathon To: awd at andrew.cmu.edu Artur: Thank you again for the opportunity to participate in this hackaton. I believe you were contacted before for this, but I remind that it would be great if some your students (or feel free to forward outside of your group) would be interested in this Siemens Hackaton this week ( https://www.cs.cmu.edu/calendar/fri-2018-04-13-1600/siemens -corporate-technology-futuremakers-challenge-2018): Siemens FutureMakers Challenge @CMU Event: April 13-14, 2018 - 4:00 pm ? 4:00 pm Information session: April 12 - 4:30 pm Location: ASA Conference Room, Gates Hillman 6115 Calling innovative minds with a passion for solving, building, and creating software-related projects! Siemens FutureMakers Challenge: Making Innovation to Society Real is a 24-hour Challenge with a 2-hour presentation/judging period following the event. 
Computer programmers, graphic designers, interface designers, project managers, and other tech-savvy individuals involved in software development will collaborate intensively on software projects. And, all first, second, and third place team members will receive prizes! At Carnegie Mellon University, Challenge participants will be working on projects related to the following theme: Design, Verification and Manufacturing using AI. This theme includes all aspects of AI to empower designs that are better than those crafted by humans and to build intelligent products and production systems that are more reliable than those engineered by humans. Teams of 2-5 students (must include at least 1 PhD student) from the university will build and demonstrate their project immediately following the event. Team projects will be judged based on the following criteria: Technology (25%), theme (45%), business (20%), presentation (5%), and popular vote (5%). Don't miss out! Register here today: -------------- next part -------------- An HTML attachment was scrubbed... URL: From predragp at andrew.cmu.edu Wed Apr 11 17:10:57 2018 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Wed, 11 Apr 2018 17:10:57 -0400 Subject: Main File server scheduled maintenance Message-ID: <20180411211057.l5ACo_wSQ%predragp@andrew.cmu.edu> Dear Autonians, I do have verified, good, up-to-date copies of your home directories. In order for me to rebuild our main file server I need to switch the NFS mounts of your home directories on all desktops and computing nodes from the current file server to our backup server. I am planning to pull the trigger on Thursday, April 19 at 3:00 PM EST. This happens to be the first day of CMU Carnival, so for many students the next day, 4/20/2018, is a school holiday. Switching from the main file server to the backup should not take more than a few hours. During that time the Auton Lab infrastructure will not be available. I figured that giving you 7 days' notice will give you enough time to plan for the downtime. In the case of impending deadlines please speak up now. Best, Predrag P.S. Members of the Neill group who have their own file server (12 in total) are not affected by this downtime. From predragp at andrew.cmu.edu Wed Apr 11 17:37:57 2018 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Wed, 11 Apr 2018 17:37:57 -0400 Subject: Compling source of PyTorch In-Reply-To: <1AF01AA2-1FBD-48C6-9DF6-2B13D138E900@cs.cmu.edu> References: <1AF01AA2-1FBD-48C6-9DF6-2B13D138E900@cs.cmu.edu> Message-ID: <20180411213757.m-0VU4eUH%predragp@andrew.cmu.edu> Yichong Xu wrote: > Hi Autonians, > Has anyone got experience on installing PyTorch from source? I was able to install it but it said it was complied without cuDNN. Does anyone know if cuDNN is installed and where it is? Also it seems it does not work quite well with cuda 9.1???? > OK, I looked at this more carefully. cuDNN is a proprietary library which requires an NVidia account to download and must be installed manually. That is not the problem. The bigger problem is that cuDNN has a dual version system: one number refers to the cuDNN library itself and the other refers to the CUDA version it is built for. I went carefully through the GPU servers and unfortunately /usr/local/cuda symlinks to /usr/local/cuda-9.0 on some servers and to /usr/local/cuda-9.1 on others. All servers have at least a partial installation of CUDA 9.1, but IIRC most TensorFlow, Caffe, and Theano users went back to the 9.0 version.
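For reference, one quick way to see what a given GPU node currently provides is to ask PyTorch itself; this is only a minimal sketch, assuming a PyTorch build is already importable on that node (the attributes below are the usual torch ones, but some may be missing on older builds):

    import torch

    print("torch version:", torch.__version__)
    print("built against CUDA:", torch.version.cuda)           # e.g. '9.0.176'
    print("cuDNN usable:", torch.backends.cudnn.is_available())
    print("cuDNN version:", torch.backends.cudnn.version())    # e.g. 7005 for cuDNN 7.0.5
    print("GPU visible to this process:", torch.cuda.is_available())

A mismatch between what this reports and what the build was expected to pick up is a hint that the node's /usr/local/cuda symlink points at a different toolkit than the one used at compile time.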
It is not possible to install the 9.0 version without at least a partial 9.1 installation (using YUM and the NVidia repos). Long story short, this is not a 5-minute job and we are risking major GPU downtime (just like the one from 2 weeks ago) if things are not done properly. Properly would mean that one of the PyTorch users who insists on using cuDNN actually tries compiling PyTorch with CUDA 9.1 and cuDNN 7.1, then repeats the exercise with CUDA 9.0 and cuDNN 7.1, then with CUDA 9.1 and cuDNN 7.0, and finally with CUDA 9.0 and cuDNN 7.0. Only once we are 100% sure that one of those 4 combinations works and doesn't break the machines for other users can we go ahead with this. Oh, and don't forget that I am not interested in whether this works on your favorite OS (mine is OpenBSD and this definitely doesn't work there). Testing needs to be done on Springdale 7.4, which we use in the lab. In the meantime my bandwidth is limited, in part due to the major upgrade of the main file server, and this high-risk (many users could be negatively affected), low-reward (few people will benefit) operation is not a high priority. Best, Predrag > Thanks in advance! > > Thanks, > Yichong > > > From awd at andrew.cmu.edu Wed Apr 11 18:03:48 2018 From: awd at andrew.cmu.edu (Artur Dubrawski) Date: Wed, 11 Apr 2018 18:03:48 -0400 Subject: just in case you've missed it: hackAuton was a success! Message-ID: Team, Huge thanks to everyone who helped organize the first-ever hackAuton. Nick, you were a machine! Incredible performance, thanks very much. But a big thank-you goes to all Autonians who contributed to making this event successful beyond expectations. We had 14 (or 15 depending on how you count) student teams attacking 7 important problems. They presented their results, some won prizes (more details to be provided shortly), and they will publish short papers to form online proceedings and be able to reference their work. The students had fun. They have learned things about data science, and the event helped strengthen the fabric of the CMU ML/AI community. It was definitely a good thing for the Auton Lab to do. Moreover, hackAuton has made its internal and external sponsors happy as well. See e.g. this article by our partners in Colombia who contributed one of the challenge data sets: https://www.connectas.org/labs/2018-hackauton-25-anos-de-innovacion-en-inteligencia-artificial/ However, a success is a double-edged sword. Now we are expected to repeat it next year :P I think we could do it even better. Do you? Cheers! Artur -------------- next part -------------- An HTML attachment was scrubbed... URL: From awd at andrew.cmu.edu Fri Apr 13 14:56:23 2018 From: awd at andrew.cmu.edu (Artur Dubrawski) Date: Fri, 13 Apr 2018 14:56:23 -0400 Subject: Fwd: Twoja byla studentka (Your former student) In-Reply-To: <71d6ec1c-d618-a9d6-ffc6-306e2bc5135a@sis.pitt.edu> References: <71d6ec1c-d618-a9d6-ffc6-306e2bc5135a@sis.pitt.edu> Message-ID: Ina Fiterau (for the younger of us: a former Autonian) is visiting town and giving a talk at Pitt on Wednesday next week. Please check it out, the talk looks interesting and quite relevant to some of our continuing interests. Cheers Artur -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed...
Name: 4-18_Fiterau.pdf Type: application/pdf Size: 307962 bytes Desc: not available URL: From predragp at andrew.cmu.edu Sun Apr 15 20:14:40 2018 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Sun, 15 Apr 2018 20:14:40 -0400 Subject: fsync() errors is unsafe and risks data loss at least on XFS Message-ID: <20180416001440.g4-K9qRa_%predragp@andrew.cmu.edu> Dear Autonians, I am not sure how many of you read PostgreSQL mailing lists but this is one of the most interesting exchanges I came across in a long while. https://www.postgresql.org/message-id/flat/CAEepm%3D0B9f0O7jLE3ipUTqC3V6NO2LNbwE9Hp%3D3BxGbZPqEyQg%40mail.gmail.com#CAEepm=0B9f0O7jLE3ipUTqC3V6NO2LNbwE9Hp=3BxGbZPqEyQg at mail.gmail.com Long story short if you have any serious amount of data in a PostgreSQL database you should be using ZFS found on IllumOS kernel (Open Solaris derivative) or FreeBSD. As you know we in the Auton Lab run ZFS of FreeBSD. Best, Predrag From awd at cs.cmu.edu Mon Apr 16 09:00:31 2018 From: awd at cs.cmu.edu (Artur Dubrawski) Date: Mon, 16 Apr 2018 09:00:31 -0400 Subject: Fwd: [AI Seminar] AI Seminar sponsored by Apple -- Yichong Xu -- April 17 In-Reply-To: References: Message-ID: An interesting seminar talk to be given tomorrow by one of the Autonians: Artur ---------- Forwarded message ---------- From: Adams Wei Yu Date: Mon, Apr 16, 2018 at 7:27 AM Subject: Re: [AI Seminar] AI Seminar sponsored by Apple -- Yichong Xu -- April 17 To: ai-seminar-announce at cs.cmu.edu A gentle reminder that the talk will be tomorrow (Tuesday) noon in *NSH 1507.* On Sat, Apr 14, 2018 at 3:17 AM, Adams Wei Yu wrote: > Dear faculty and students, > > We look forward to seeing you next Tuesday, April 17, at noon in *NSH > 1507* for AI Seminar sponsored by Apple. To learn more about the seminar > series, please visit the AI Seminar webpage > . > > On Tuesday, Yichong Xu will give > the following talk: > > Title: Interactive learning using Comparison Queries > > Abstract: > > In supervised learning, we leverage a labeled dataset to design methods > for function estimation. In many practical situations, we are able to > obtain alternative feedback, possibly at a low cost. A broad goal is to > understand the usefulness of, and to design algorithms to exploit, this > alternative feedback. We consider a interactive learning setting where we > obtain additional ordinal (or comparison) information for potentially > unlabeled samples. In this talk we show the usefulness of such ordinal > feedback for two tasks: Binary classification and nonparametric regression. > For binary classification, we show that comparison queries can help in > improving the label and total query complexity by reducing the learning > problem to that of learning a threshold function. We present an algorithm > that achieves near-optimal label and total query complexity. For > nonparametric regression, we show that it is possible to accurately > estimate an underlying function with a very small labeled set, effectively > escaping the curse of dimensionality. We develop an algorithm called > Ranking-Regression(R^2) and analyze its accuracy as a function of size of > the labeled and unlabeled datasets and various noise parameters. We also > derive lower bounds to show that R^2 is optimal in a variety of settings. > Experiments show that our algorithms outperforms label-only algorithms when > comparison information is available. 
> > Based on joint works with Sivaraman Balakrishnan, Artur Dubrawski, Kyle > Miller, Hariank Muthakana, Aarti Singh and Hongyang Zhang. > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From awd at cs.cmu.edu Thu Apr 19 00:38:25 2018 From: awd at cs.cmu.edu (Artur Dubrawski) Date: Thu, 19 Apr 2018 00:38:25 -0400 Subject: "Emily the Rockstar" Message-ID: But she really rocks! Check this out: https://www.youtube.com/watch?time_continue=77&v=haXS2F7ALlg Artur -------------- next part -------------- An HTML attachment was scrubbed... URL: From predragp at andrew.cmu.edu Thu Apr 19 14:30:39 2018 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Thu, 19 Apr 2018 14:30:39 -0400 Subject: Autonlab time sync In-Reply-To: <1DE71983-A8F3-49EF-88D1-B21267E7FB86@andrew.cmu.edu> References: <20180416001440.g4-K9qRa_%predragp@andrew.cmu.edu> <1DE71983-A8F3-49EF-88D1-B21267E7FB86@andrew.cmu.edu> Message-ID: <20180419183039.aYkvGCgTO%predragp@andrew.cmu.edu> Yang Zhang wrote: > Hi Predrag, > > Could you sync the time on the autonlab server with NTP? > It's about 5 minutes ahead. Which server? The synchronization is turned on. Having an inaccurate clock might mean that the server is dying. Also keep in mind that the CMU network guys firewall the NTP protocol, so our computing nodes are synchronized from my time server, which fetches time over the http protocol (for obvious reasons). Predrag > > Thanks! > Best, > Yang > > > On Apr 15, 2018, at 8:14 PM, Predrag Punosevac wrote: > > > > Dear Autonians, > > > > I am not sure how many of you read PostgreSQL mailing lists but this is > > one of the most interesting exchanges I came across in a long while. > > > > https://www.postgresql.org/message-id/flat/CAEepm%3D0B9f0O7jLE3ipUTqC3V6NO2LNbwE9Hp%3D3BxGbZPqEyQg%40mail.gmail.com#CAEepm=0B9f0O7jLE3ipUTqC3V6NO2LNbwE9Hp=3BxGbZPqEyQg at mail.gmail.com > > > > Long story short if you have any serious amount of data in a PostgreSQL > > database you should be using ZFS found on IllumOS kernel (Open Solaris > > derivative) or FreeBSD. As you know we in the Auton Lab run ZFS of > > FreeBSD. > > > > Best, > > Predrag From awd at cs.cmu.edu Thu Apr 19 16:25:38 2018 From: awd at cs.cmu.edu (Artur Dubrawski) Date: Thu, 19 Apr 2018 16:25:38 -0400 Subject: Data for Good Message-ID: Team, Bloomberg is looking for submissions to their annual event. Chirag went there last year so ping him if you'd like more insight. https://www.bloomberg.com/company/d4gx/?utm_source=Sailthru&utm_medium=email&utm_campaign=2018%20CFP%20Save%20The%20Date&utm_term=Master%20and%20Team Cheers Artur -------------- next part -------------- An HTML attachment was scrubbed... URL: From predragp at andrew.cmu.edu Sun Apr 22 00:22:47 2018 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Sun, 22 Apr 2018 00:22:47 -0400 Subject: matlab VideoReader error In-Reply-To: References: Message-ID: <20180422042247.EE5SKXDbS%predragp@andrew.cmu.edu> Yusha Liu wrote: > Hi Predrag, > > I'm using Matlab 2017b on a CPU machine; when using VideoReader to load an mp4 > video file, I got the following message: > > '' > Error using VideoReader/init (line 619) > The VideoReader plugin libmwgstreamerplugin failed to load properly. > > Error in VideoReader (line 172) > obj.init(fileName); > '' > > I installed gstreamer but the problem still exists. I'm wondering, do you > know what the issue is here? You don't have privileges to install any packages. How did you install gstreamer? I would suggest you log into the GPU1 or GPU2 machine.
Those two machines have the latest 2018a MATLAB release. They also have gstreamer-0.10.36-7.el7.x86_64 which I installed. If it doesn't work my guess would be that we are hitting some bug. A simple Google search reveals lot of problems with VideoReader plugin. For example https://www.mathworks.com/matlabcentral/answers/341978-error-using-videoreader-init-line-619 https://stackoverflow.com/questions/36150689/matlab-on-ubuntu-15-04-the-videoreader-plugin-libmwgstreamerplugin-failed-to-lo Predrag > > Thanks a lot! > > -- > Yusha Liu, Master's Student > Machine Learning Department > Carnegie Mellon University From siyu.cosmo at gmail.com Sun Apr 22 18:25:42 2018 From: siyu.cosmo at gmail.com (Siyu He) Date: Sun, 22 Apr 2018 15:25:42 -0700 Subject: RuntimeError: CUDNN_STATUS_INTERNAL_ERROR Message-ID: <11665C07-9962-4B21-BE44-A52C67D10E8E@gmail.com> Dear all, Sorry to bother you on this. I am running some jobs on autonlab. It works all fine but all of a sudden I receive the following error: RuntimeError: CUDNN_STATUS_INTERNAL_ERROR Then all my jobs cannot be run even for those which are working well before. Do you know why does this happen? Thanks!! Siyu -------------- next part -------------- An HTML attachment was scrubbed... URL: From predragp at andrew.cmu.edu Sun Apr 22 20:06:27 2018 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Sun, 22 Apr 2018 20:06:27 -0400 Subject: RuntimeError: CUDNN_STATUS_INTERNAL_ERROR In-Reply-To: <11665C07-9962-4B21-BE44-A52C67D10E8E@gmail.com> References: <11665C07-9962-4B21-BE44-A52C67D10E8E@gmail.com> Message-ID: <20180423000627.aZ0zNFqN_%predragp@andrew.cmu.edu> Siyu He wrote: > Dear all, > > Sorry to bother you on this. I am running some jobs on autonlab. It works all fine but all of a sudden I receive the following error: > RuntimeError: CUDNN_STATUS_INTERNAL_ERROR > Then all my jobs cannot be run even for those which are working well before. > Do you know why does this happen? This is a poor problem report. When was the last time program was working OK? When did you notice the problem. Any changes in your local path etc etc... What are you trying to do. Predrag > > Thanks!! > Siyu > > From chiragn at cs.cmu.edu Mon Apr 23 16:39:34 2018 From: chiragn at cs.cmu.edu (Chirag Nagpal) Date: Mon, 23 Apr 2018 16:39:34 -0400 Subject: RuntimeError: CUDNN_STATUS_INTERNAL_ERROR In-Reply-To: <20180423000627.aZ0zNFqN_%predragp@andrew.cmu.edu> References: <11665C07-9962-4B21-BE44-A52C67D10E8E@gmail.com> <20180423000627.aZ0zNFqN_%predragp@andrew.cmu.edu> Message-ID: This seems to be an issue with the way your code is utilizing the GPU RAM. It seems as the Card is running out of RAM. Please make sure you are setting the GPU card environment variable where you want to execute your CUDA code using $export CUDA_VISIBLE_DEVICES=#GPUNUMBER Also try to calculate how much RAM is your card using. If its more than the maximum RAM of the GPU, try to use smaller batch sizes. Chirag On Sun, Apr 22, 2018 at 8:06 PM, Predrag Punosevac wrote: > Siyu He wrote: > > > Dear all, > > > > Sorry to bother you on this. I am running some jobs on autonlab. It > works all fine but all of a sudden I receive the following error: > > RuntimeError: CUDNN_STATUS_INTERNAL_ERROR > > Then all my jobs cannot be run even for those which are working well > before. > > Do you know why does this happen? > > This is a poor problem report. When was the last time program was > working OK? When did you notice the problem. Any changes in your local > path etc etc... What are you trying to do. 
> > Predrag > > > > > Thanks!! > > Siyu > > > > > -- *Chirag Nagpal* Graduate Student, Language Technologies Institute School of Computer Science Carnegie Mellon University cs.cmu.edu/~chiragn -------------- next part -------------- An HTML attachment was scrubbed... URL: From predragp at andrew.cmu.edu Mon Apr 23 17:59:48 2018 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Mon, 23 Apr 2018 17:59:48 -0400 Subject: Hard disk drive failure Message-ID: <20180423215948.0cyATSWei%predragp@andrew.cmu.edu> Dear Autonians, Normally I try to keep quiet about occasional HDD failures which are fixable. However I feel that some of you might have noticed that /zfsauton/public and /zfsauton/data went down for about 15 minutes and some of your scripts crashed. Unfortunately one of the HDDs which constitute the ZFS pool hosting those two datasets was failing long S.M.A.R.T. tests and had to be replaced on short notice. Since I prefer a safe reboot over a hot swap, the server was down for about 15 minutes (I don't like to take the chance of accidentally offlining a second HDD in the same pool). The affected ZFS pool is being resilvered as I am typing this message. root at uranus:~ # zpool status backups pool: backups state: DEGRADED status: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state. action: Wait for the resilver to complete. scan: resilver in progress since Mon Apr 23 17:48:37 2018 143G scanned out of 5.72T at 302M/s, 5h23m to go 13.1G resilvered, 2.44% done config: NAME STATE READ WRITE CKSUM backups DEGRADED 0 0 0 raidz2-0 DEGRADED 0 0 0 da0 ONLINE 0 0 0 da1 ONLINE 0 0 0 da2 ONLINE 0 0 0 replacing-3 OFFLINE 0 0 0 17410010232298071688 OFFLINE 0 0 0 was /dev/da3/old da3 ONLINE 0 0 0 (resilvering) da4 ONLINE 0 0 0 da5 ONLINE 0 0 0 da6 ONLINE 0 0 0 da7 ONLINE 0 0 0 da8 ONLINE 0 0 0 da9 ONLINE 0 0 0 errors: No known data errors Best, Predrag From siyu.cosmo at gmail.com Mon Apr 23 19:16:36 2018 From: siyu.cosmo at gmail.com (Siyu He) Date: Mon, 23 Apr 2018 16:16:36 -0700 Subject: RuntimeError: CUDNN_STATUS_INTERNAL_ERROR In-Reply-To: References: <11665C07-9962-4B21-BE44-A52C67D10E8E@gmail.com> <20180423000627.aZ0zNFqN_%predragp@andrew.cmu.edu> Message-ID: <9A354E7A-D2DB-4C35-A051-B241A183E85E@gmail.com> Hi Chirag, Thanks for your reply! It seems it's now working. But I did use "export CUDA_VISIBLE_DEVICES=#GPUNUMBER" before to specify what GPU I want to use and check the available memory. I am still a little confused why I had the issue on Sunday. But anyway it's working. Yay! Thanks, Siyu > On Apr 23, 2018, at 1:39 PM, Chirag Nagpal wrote: > This seems to be an issue with the way your code is utilizing the GPU RAM. It seems as the Card is running out of RAM. > > Please make sure you are setting the GPU card environment variable where you want to execute your CUDA code using > > $export CUDA_VISIBLE_DEVICES=#GPUNUMBER > > Also try to calculate how much RAM is your card using. If its more than the maximum RAM of the GPU, try to use smaller batch sizes. > > Chirag > > > > > On Sun, Apr 22, 2018 at 8:06 PM, Predrag Punosevac > wrote: > Siyu He > wrote: > > > Dear all, > > > > Sorry to bother you on this. I am running some jobs on autonlab. It works all fine but all of a sudden I receive the following error: > > RuntimeError: CUDNN_STATUS_INTERNAL_ERROR > > Then all my jobs cannot be run even for those which are working well before. > > Do you know why does this happen? > > This is a poor problem report.
When was the last time program was > working OK? When did you notice the problem. Any changes in your local > path etc etc... What are you trying to do. > > Predrag > > > > > Thanks!! > > Siyu > > > > > > > > -- > Chirag Nagpal > Graduate Student, Language Technologies Institute > School of Computer Science > Carnegie Mellon University > cs.cmu.edu/~chiragn -------------- next part -------------- An HTML attachment was scrubbed... URL: From ngisolfi at cs.cmu.edu Wed Apr 25 10:36:46 2018 From: ngisolfi at cs.cmu.edu (Nick Gisolfi) Date: Wed, 25 Apr 2018 10:36:46 -0400 Subject: [hackAuton] Free Mugs! Message-ID: Hi Everyone, Stop by NSH 3111 and pick up a free Auton Lab mug! - Nick -------------- next part -------------- An HTML attachment was scrubbed... URL: From predragp at andrew.cmu.edu Wed Apr 25 22:48:18 2018 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Wed, 25 Apr 2018 22:48:18 -0400 Subject: Red Hat 7.5 release Message-ID: <20180426024818.h4UpO59iH%predragp@andrew.cmu.edu> Dear Autonians, On April 10, Red Hat Inc. has announced the release of Red Hat Enterprise Linux (RHEL) 7.5, the latest update of the company's enterprise-class Linux distribution. Thanks to the hard work of my friend Josko Plazonic and his team at Princeton University Springdale Linux a free, enterprise-class, community-supported computing platform functionally compatible with its upstream source, Red Hat Enterprise Linux (RHEL) has also been updated last night to the version 7.5. I am happy to announce that as of this moment all Auton Lab computing nodes have been updated to the version 7.5 with exception of few obsolete machines running Springdale 6.9. Note that I didn't update CUDA and NVidia drivers on GPU[1-9] as that would require reboots and perhaps would break deep learning software many of you are using. I also didn't reboot non GPU computing nodes in order to avoid disruption, thus nodes are still running the same kernels but the very latest userland. Virtual hosts running Springdale Linux are also upgraded as well as most desktops. I upgrading few remaining desktops right now. Please test if the things work for you and report any strange behavior. Also Ben and Jarod who have GPU cards in their desktops should be extra vigilant. Please let me know if your desktops look broken. I will be happy to upgrade your NVidia drivers and CUDA if the things appear broken. Best, Predrag From mbarnes1 at andrew.cmu.edu Thu Apr 26 08:14:04 2018 From: mbarnes1 at andrew.cmu.edu (Matthew Barnes) Date: Thu, 26 Apr 2018 12:14:04 +0000 Subject: Red Hat 7.5 release In-Reply-To: <20180426024818.h4UpO59iH%predragp@andrew.cmu.edu> References: <20180426024818.h4UpO59iH%predragp@andrew.cmu.edu> Message-ID: Things appear broken on at least some of the GPU machines. This worked before last night. mbarnes1 at gpu3$ python Python 2.7.5 (default, Apr 15 2018, 20:27:58) [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)] on linux2 Type "help", "copyright", "credits" or "license" for more information. 
>>> import tensorflow as tf RuntimeError: module compiled against API version 0xb but this version of numpy is 0x7 RuntimeError: module compiled against API version 0xb but this version of numpy is 0x7 Traceback (most recent call last): File "", line 1, in File "/usr/lib/python2.7/site-packages/tensorflow/__init__.py", line 24, in from tensorflow.python import * File "/usr/lib/python2.7/site-packages/tensorflow/python/__init__.py", line 63, in from tensorflow.python.framework.framework_lib import * File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/framework_lib.py", line 81, in from tensorflow.python.framework.sparse_tensor import SparseTensor File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/sparse_tensor.py", line 25, in from tensorflow.python.framework import tensor_util File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/tensor_util.py", line 33, in from tensorflow.python.framework import fast_tensor_util File "__init__.pxd", line 163, in init tensorflow.python.framework.fast_tensor_util ValueError: numpy.dtype has the wrong size, try recompiling. Expected 88, got 96 On Wed, Apr 25, 2018 at 10:49 PM Predrag Punosevac wrote: > Dear Autonians, > > On April 10, Red Hat Inc. has announced the release of Red Hat > Enterprise Linux (RHEL) 7.5, the latest update of the company's > enterprise-class Linux distribution. > > Thanks to the hard work of my friend Josko Plazonic and his team at > Princeton University Springdale Linux a free, enterprise-class, > community-supported computing platform functionally compatible with its > upstream source, Red Hat Enterprise Linux (RHEL) has also been updated > last night to the version 7.5. > > I am happy to announce that as of this moment all Auton Lab computing > nodes have been updated to the version 7.5 with exception of few > obsolete machines running Springdale 6.9. Note that I didn't update CUDA > and NVidia drivers on GPU[1-9] as that would require reboots and perhaps > would break deep learning software many of you are using. I also didn't > reboot non GPU computing nodes in order to avoid disruption, thus nodes > are still running the same kernels but the very latest userland. > > Virtual hosts running Springdale Linux are also upgraded as well as most > desktops. I upgrading few remaining desktops right now. > > Please test if the things work for you and report any strange behavior. > Also Ben and Jarod who have GPU cards in their desktops should be extra > vigilant. Please let me know if your desktops look broken. I will be > happy to upgrade your NVidia drivers and CUDA if the things appear > broken. > > Best, > Predrag > -------------- next part -------------- An HTML attachment was scrubbed... URL: From predragp at andrew.cmu.edu Thu Apr 26 08:18:58 2018 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Thu, 26 Apr 2018 08:18:58 -0400 Subject: Red Hat 7.5 release In-Reply-To: References: <20180426024818.h4UpO59iH%predragp@andrew.cmu.edu> Message-ID: <20180426121858.tbu8OsFOa%predragp@andrew.cmu.edu> Matthew Barnes wrote: > Things appear broken on at least some of the GPU machines. This worked > before last night. > You compiled TensorFlow against Python from the base. Please recompile and report. Predrag > mbarnes1 at gpu3$ python > Python 2.7.5 (default, Apr 15 2018, 20:27:58) > [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)] on linux2 > Type "help", "copyright", "credits" or "license" for more information. 
> >>> import tensorflow as tf > RuntimeError: module compiled against API version 0xb but this version of > numpy is 0x7 > RuntimeError: module compiled against API version 0xb but this version of > numpy is 0x7 > Traceback (most recent call last): > File "", line 1, in > File "/usr/lib/python2.7/site-packages/tensorflow/__init__.py", line 24, > in > from tensorflow.python import * > File "/usr/lib/python2.7/site-packages/tensorflow/python/__init__.py", > line 63, in > from tensorflow.python.framework.framework_lib import * > File > "/usr/lib/python2.7/site-packages/tensorflow/python/framework/framework_lib.py", > line 81, in > from tensorflow.python.framework.sparse_tensor import SparseTensor > File > "/usr/lib/python2.7/site-packages/tensorflow/python/framework/sparse_tensor.py", > line 25, in > from tensorflow.python.framework import tensor_util > File > "/usr/lib/python2.7/site-packages/tensorflow/python/framework/tensor_util.py", > line 33, in > from tensorflow.python.framework import fast_tensor_util > File "__init__.pxd", line 163, in init > tensorflow.python.framework.fast_tensor_util > ValueError: numpy.dtype has the wrong size, try recompiling. Expected 88, > got 96 > > On Wed, Apr 25, 2018 at 10:49 PM Predrag Punosevac > wrote: > > > Dear Autonians, > > > > On April 10, Red Hat Inc. has announced the release of Red Hat > > Enterprise Linux (RHEL) 7.5, the latest update of the company's > > enterprise-class Linux distribution. > > > > Thanks to the hard work of my friend Josko Plazonic and his team at > > Princeton University Springdale Linux a free, enterprise-class, > > community-supported computing platform functionally compatible with its > > upstream source, Red Hat Enterprise Linux (RHEL) has also been updated > > last night to the version 7.5. > > > > I am happy to announce that as of this moment all Auton Lab computing > > nodes have been updated to the version 7.5 with exception of few > > obsolete machines running Springdale 6.9. Note that I didn't update CUDA > > and NVidia drivers on GPU[1-9] as that would require reboots and perhaps > > would break deep learning software many of you are using. I also didn't > > reboot non GPU computing nodes in order to avoid disruption, thus nodes > > are still running the same kernels but the very latest userland. > > > > Virtual hosts running Springdale Linux are also upgraded as well as most > > desktops. I upgrading few remaining desktops right now. > > > > Please test if the things work for you and report any strange behavior. > > Also Ben and Jarod who have GPU cards in their desktops should be extra > > vigilant. Please let me know if your desktops look broken. I will be > > happy to upgrade your NVidia drivers and CUDA if the things appear > > broken. > > > > Best, > > Predrag > > From mbarnes1 at andrew.cmu.edu Thu Apr 26 08:20:50 2018 From: mbarnes1 at andrew.cmu.edu (Matthew Barnes) Date: Thu, 26 Apr 2018 12:20:50 +0000 Subject: Red Hat 7.5 release In-Reply-To: <20180426121858.tbu8OsFOa%predragp@andrew.cmu.edu> References: <20180426024818.h4UpO59iH%predragp@andrew.cmu.edu> <20180426121858.tbu8OsFOa%predragp@andrew.cmu.edu> Message-ID: That's the system install of Tensorflow, not me. An admin would have to do that. On Thu, Apr 26, 2018 at 8:19 AM Predrag Punosevac wrote: > Matthew Barnes wrote: > > > Things appear broken on at least some of the GPU machines. This worked > > before last night. > > > > You compiled TensorFlow against Python from the base. Please recompile > and report. 
> > Predrag > > > mbarnes1 at gpu3$ python > > Python 2.7.5 (default, Apr 15 2018, 20:27:58) > > [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)] on linux2 > > Type "help", "copyright", "credits" or "license" for more information. > > >>> import tensorflow as tf > > RuntimeError: module compiled against API version 0xb but this version of > > numpy is 0x7 > > RuntimeError: module compiled against API version 0xb but this version of > > numpy is 0x7 > > Traceback (most recent call last): > > File "", line 1, in > > File "/usr/lib/python2.7/site-packages/tensorflow/__init__.py", line > 24, > > in > > from tensorflow.python import * > > File "/usr/lib/python2.7/site-packages/tensorflow/python/__init__.py", > > line 63, in > > from tensorflow.python.framework.framework_lib import * > > File > > > "/usr/lib/python2.7/site-packages/tensorflow/python/framework/framework_lib.py", > > line 81, in > > from tensorflow.python.framework.sparse_tensor import SparseTensor > > File > > > "/usr/lib/python2.7/site-packages/tensorflow/python/framework/sparse_tensor.py", > > line 25, in > > from tensorflow.python.framework import tensor_util > > File > > > "/usr/lib/python2.7/site-packages/tensorflow/python/framework/tensor_util.py", > > line 33, in > > from tensorflow.python.framework import fast_tensor_util > > File "__init__.pxd", line 163, in init > > tensorflow.python.framework.fast_tensor_util > > ValueError: numpy.dtype has the wrong size, try recompiling. Expected 88, > > got 96 > > > > On Wed, Apr 25, 2018 at 10:49 PM Predrag Punosevac < > predragp at andrew.cmu.edu> > > wrote: > > > > > Dear Autonians, > > > > > > On April 10, Red Hat Inc. has announced the release of Red Hat > > > Enterprise Linux (RHEL) 7.5, the latest update of the company's > > > enterprise-class Linux distribution. > > > > > > Thanks to the hard work of my friend Josko Plazonic and his team at > > > Princeton University Springdale Linux a free, enterprise-class, > > > community-supported computing platform functionally compatible with its > > > upstream source, Red Hat Enterprise Linux (RHEL) has also been updated > > > last night to the version 7.5. > > > > > > I am happy to announce that as of this moment all Auton Lab computing > > > nodes have been updated to the version 7.5 with exception of few > > > obsolete machines running Springdale 6.9. Note that I didn't update > CUDA > > > and NVidia drivers on GPU[1-9] as that would require reboots and > perhaps > > > would break deep learning software many of you are using. I also didn't > > > reboot non GPU computing nodes in order to avoid disruption, thus nodes > > > are still running the same kernels but the very latest userland. > > > > > > Virtual hosts running Springdale Linux are also upgraded as well as > most > > > desktops. I upgrading few remaining desktops right now. > > > > > > Please test if the things work for you and report any strange behavior. > > > Also Ben and Jarod who have GPU cards in their desktops should be extra > > > vigilant. Please let me know if your desktops look broken. I will be > > > happy to upgrade your NVidia drivers and CUDA if the things appear > > > broken. > > > > > > Best, > > > Predrag > > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From sheath at andrew.cmu.edu Thu Apr 26 13:43:22 2018 From: sheath at andrew.cmu.edu (Simon Heath) Date: Thu, 26 Apr 2018 13:43:22 -0400 Subject: Red Hat 7.5 release In-Reply-To: References: <20180426024818.h4UpO59iH%predragp@andrew.cmu.edu> <20180426121858.tbu8OsFOa%predragp@andrew.cmu.edu> Message-ID: Reinstalled tensorflow on gpu8 and it appears to work fine now. Running reinstalls on all GPU's now but it might take 30 minutes or so to finish. Simon On Thu, Apr 26, 2018 at 8:20 AM, Matthew Barnes wrote: > That's the system install of Tensorflow, not me. An admin would have to do > that. > > On Thu, Apr 26, 2018 at 8:19 AM Predrag Punosevac > wrote: > >> Matthew Barnes wrote: >> >> > Things appear broken on at least some of the GPU machines. This worked >> > before last night. >> > >> >> You compiled TensorFlow against Python from the base. Please recompile >> and report. >> >> Predrag >> >> > mbarnes1 at gpu3$ python >> > Python 2.7.5 (default, Apr 15 2018, 20:27:58) >> > [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)] on linux2 >> > Type "help", "copyright", "credits" or "license" for more information. >> > >>> import tensorflow as tf >> > RuntimeError: module compiled against API version 0xb but this version >> of >> > numpy is 0x7 >> > RuntimeError: module compiled against API version 0xb but this version >> of >> > numpy is 0x7 >> > Traceback (most recent call last): >> > File "", line 1, in >> > File "/usr/lib/python2.7/site-packages/tensorflow/__init__.py", line >> 24, >> > in >> > from tensorflow.python import * >> > File "/usr/lib/python2.7/site-packages/tensorflow/python/__ >> init__.py", >> > line 63, in >> > from tensorflow.python.framework.framework_lib import * >> > File >> > "/usr/lib/python2.7/site-packages/tensorflow/python/ >> framework/framework_lib.py", >> > line 81, in >> > from tensorflow.python.framework.sparse_tensor import SparseTensor >> > File >> > "/usr/lib/python2.7/site-packages/tensorflow/python/ >> framework/sparse_tensor.py", >> > line 25, in >> > from tensorflow.python.framework import tensor_util >> > File >> > "/usr/lib/python2.7/site-packages/tensorflow/python/ >> framework/tensor_util.py", >> > line 33, in >> > from tensorflow.python.framework import fast_tensor_util >> > File "__init__.pxd", line 163, in init >> > tensorflow.python.framework.fast_tensor_util >> > ValueError: numpy.dtype has the wrong size, try recompiling. Expected >> 88, >> > got 96 >> > >> > On Wed, Apr 25, 2018 at 10:49 PM Predrag Punosevac < >> predragp at andrew.cmu.edu> >> > wrote: >> > >> > > Dear Autonians, >> > > >> > > On April 10, Red Hat Inc. has announced the release of Red Hat >> > > Enterprise Linux (RHEL) 7.5, the latest update of the company's >> > > enterprise-class Linux distribution. >> > > >> > > Thanks to the hard work of my friend Josko Plazonic and his team at >> > > Princeton University Springdale Linux a free, enterprise-class, >> > > community-supported computing platform functionally compatible with >> its >> > > upstream source, Red Hat Enterprise Linux (RHEL) has also been updated >> > > last night to the version 7.5. >> > > >> > > I am happy to announce that as of this moment all Auton Lab computing >> > > nodes have been updated to the version 7.5 with exception of few >> > > obsolete machines running Springdale 6.9. Note that I didn't update >> CUDA >> > > and NVidia drivers on GPU[1-9] as that would require reboots and >> perhaps >> > > would break deep learning software many of you are using. 
I also >> didn't >> > > reboot non GPU computing nodes in order to avoid disruption, thus >> nodes >> > > are still running the same kernels but the very latest userland. >> > > >> > > Virtual hosts running Springdale Linux are also upgraded as well as >> most >> > > desktops. I upgrading few remaining desktops right now. >> > > >> > > Please test if the things work for you and report any strange >> behavior. >> > > Also Ben and Jarod who have GPU cards in their desktops should be >> extra >> > > vigilant. Please let me know if your desktops look broken. I will be >> > > happy to upgrade your NVidia drivers and CUDA if the things appear >> > > broken. >> > > >> > > Best, >> > > Predrag >> > > >> > -- Simon Heath, Research Programmer and Analyst Robotics Institute - Auton Lab Carnegie Mellon University sheath at andrew.cmu.edu -------------- next part -------------- An HTML attachment was scrubbed... URL: From mbarnes1 at andrew.cmu.edu Thu Apr 26 14:35:55 2018 From: mbarnes1 at andrew.cmu.edu (Matthew Barnes) Date: Thu, 26 Apr 2018 18:35:55 +0000 Subject: Red Hat 7.5 release In-Reply-To: References: <20180426024818.h4UpO59iH%predragp@andrew.cmu.edu> <20180426121858.tbu8OsFOa%predragp@andrew.cmu.edu> Message-ID: Thanks Simon!! It looks good. On Thu, Apr 26, 2018 at 1:43 PM Simon Heath wrote: > Reinstalled tensorflow on gpu8 and it appears to work fine now. Running > reinstalls on all GPU's now but it might take 30 minutes or so to finish. > > Simon > > On Thu, Apr 26, 2018 at 8:20 AM, Matthew Barnes > wrote: > >> That's the system install of Tensorflow, not me. An admin would have to >> do that. >> >> On Thu, Apr 26, 2018 at 8:19 AM Predrag Punosevac < >> predragp at andrew.cmu.edu> wrote: >> >>> Matthew Barnes wrote: >>> >>> > Things appear broken on at least some of the GPU machines. This worked >>> > before last night. >>> > >>> >>> You compiled TensorFlow against Python from the base. Please recompile >>> and report. >>> >>> Predrag >>> >>> > mbarnes1 at gpu3$ python >>> > Python 2.7.5 (default, Apr 15 2018, 20:27:58) >>> > [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)] on linux2 >>> > Type "help", "copyright", "credits" or "license" for more information. >>> > >>> import tensorflow as tf >>> > RuntimeError: module compiled against API version 0xb but this version >>> of >>> > numpy is 0x7 >>> > RuntimeError: module compiled against API version 0xb but this version >>> of >>> > numpy is 0x7 >>> > Traceback (most recent call last): >>> > File "", line 1, in >>> > File "/usr/lib/python2.7/site-packages/tensorflow/__init__.py", line >>> 24, >>> > in >>> > from tensorflow.python import * >>> > File >>> "/usr/lib/python2.7/site-packages/tensorflow/python/__init__.py", >>> > line 63, in >>> > from tensorflow.python.framework.framework_lib import * >>> > File >>> > >>> "/usr/lib/python2.7/site-packages/tensorflow/python/framework/framework_lib.py", >>> > line 81, in >>> > from tensorflow.python.framework.sparse_tensor import SparseTensor >>> > File >>> > >>> "/usr/lib/python2.7/site-packages/tensorflow/python/framework/sparse_tensor.py", >>> > line 25, in >>> > from tensorflow.python.framework import tensor_util >>> > File >>> > >>> "/usr/lib/python2.7/site-packages/tensorflow/python/framework/tensor_util.py", >>> > line 33, in >>> > from tensorflow.python.framework import fast_tensor_util >>> > File "__init__.pxd", line 163, in init >>> > tensorflow.python.framework.fast_tensor_util >>> > ValueError: numpy.dtype has the wrong size, try recompiling. 
Expected >>> 88, >>> > got 96 >>> > >>> > On Wed, Apr 25, 2018 at 10:49 PM Predrag Punosevac < >>> predragp at andrew.cmu.edu> >>> > wrote: >>> > >>> > > Dear Autonians, >>> > > >>> > > On April 10, Red Hat Inc. has announced the release of Red Hat >>> > > Enterprise Linux (RHEL) 7.5, the latest update of the company's >>> > > enterprise-class Linux distribution. >>> > > >>> > > Thanks to the hard work of my friend Josko Plazonic and his team at >>> > > Princeton University Springdale Linux a free, enterprise-class, >>> > > community-supported computing platform functionally compatible with >>> its >>> > > upstream source, Red Hat Enterprise Linux (RHEL) has also been >>> updated >>> > > last night to the version 7.5. >>> > > >>> > > I am happy to announce that as of this moment all Auton Lab computing >>> > > nodes have been updated to the version 7.5 with exception of few >>> > > obsolete machines running Springdale 6.9. Note that I didn't update >>> CUDA >>> > > and NVidia drivers on GPU[1-9] as that would require reboots and >>> perhaps >>> > > would break deep learning software many of you are using. I also >>> didn't >>> > > reboot non GPU computing nodes in order to avoid disruption, thus >>> nodes >>> > > are still running the same kernels but the very latest userland. >>> > > >>> > > Virtual hosts running Springdale Linux are also upgraded as well as >>> most >>> > > desktops. I upgrading few remaining desktops right now. >>> > > >>> > > Please test if the things work for you and report any strange >>> behavior. >>> > > Also Ben and Jarod who have GPU cards in their desktops should be >>> extra >>> > > vigilant. Please let me know if your desktops look broken. I will be >>> > > happy to upgrade your NVidia drivers and CUDA if the things appear >>> > > broken. >>> > > >>> > > Best, >>> > > Predrag >>> > > >>> >> > > > -- > Simon Heath, Research Programmer and Analyst > Robotics Institute - Auton Lab > Carnegie Mellon University > sheath at andrew.cmu.edu > -------------- next part -------------- An HTML attachment was scrubbed... URL: From eyolcu at cs.cmu.edu Fri Apr 27 15:35:30 2018 From: eyolcu at cs.cmu.edu (Emre Yolcu) Date: Fri, 27 Apr 2018 15:35:30 -0400 Subject: bash down? Message-ID: Hi, Is bash.autonlab.org down for anybody? Emre -------------- next part -------------- An HTML attachment was scrubbed... URL: From predragp at andrew.cmu.edu Fri Apr 27 16:01:11 2018 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Fri, 27 Apr 2018 16:01:11 -0400 Subject: bash down? In-Reply-To: References: Message-ID: <20180427200111.P2P99xNS0%predragp@andrew.cmu.edu> Emre Yolcu wrote: > Hi, > > Is bash.autonlab.org down for anybody? > > Emre I was working on NREC infrastructure. When I open VPN connection with NREC bash (which is my desktop) losses connection with the rest of the lab computers as Cisco AnyTimeConnect breaks routing tables and pretty much takes over my computer DNS, firewall and such. Unless you are using X2Go default gateway should be lop1.autonlab.org which is up (I just verified). Predrag P.S. Cisco VPN should not be used period! However it is a favorite of lazy system admins who don't want to learn how to configure IPSec, OpenVPN, ssh and such. 
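For what it is worth, a quick way to tell from your own machine whether a particular login gateway is answering at all is to probe its SSH port; a minimal sketch (the two hostnames are simply the ones mentioned above, and the standard port 22 is assumed):

    import socket

    for host in ("bash.autonlab.org", "lop1.autonlab.org"):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.settimeout(3)  # seconds
        try:
            s.connect((host, 22))
            print(host, "is answering on the SSH port")
        except (socket.timeout, socket.error) as exc:
            print(host, "is not reachable:", exc)
        finally:
            s.close()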
From boecking at andrew.cmu.edu Mon Apr 30 15:52:36 2018 From: boecking at andrew.cmu.edu (Benedikt Boecking) Date: Mon, 30 Apr 2018 15:52:36 -0400 Subject: high system cpu usage with recent numpy updates Message-ID: All, With newer versions of numpy (and maybe scipy) it is possible that some operations use all available CPUs by default (thanks to David Bayani for pointing this out). This can also happen if you use packages that rely on numpy and scipy such as statsmodels. On our servers this appears to be caused by the use of the open MP API. While automatic multi processing can be a great feature, it can cause trouble if it is combined with additional multi processing (e.g. your own use of the multiprocessing or joblib libraries) or when multiple users unwittingly spawn too many threads at the same time. If you want to control the number of threads used through open MP, use the OMP_NUM_THREADS environment variable when you run your python code (with a reasonable number of threads): [user at server ~]$ OMP_NUM_THREADS=8 python yourscript.py Also, it is a great habit to run top or htop to monitor your resource consumption to make sure you aren?t inconveniencing other users of our lab?s resources. Best, Ben From mbarnes1 at andrew.cmu.edu Mon Apr 30 15:59:50 2018 From: mbarnes1 at andrew.cmu.edu (Matthew Barnes) Date: Mon, 30 Apr 2018 19:59:50 +0000 Subject: high system cpu usage with recent numpy updates In-Reply-To: References: Message-ID: This happens even with basic multiprocessing in Python, for example the multiprocessing.Pool.map operation. Don't be like me and accidentally start 2500 processes :) On Mon, Apr 30, 2018 at 3:53 PM Benedikt Boecking wrote: > All, > > With newer versions of numpy (and maybe scipy) it is possible that some > operations use all available CPUs by default (thanks to David Bayani for > pointing this out). This can also happen if you use packages that rely on > numpy and scipy such as statsmodels. On our servers this appears to be > caused by the use of the open MP API. > > While automatic multi processing can be a great feature, it can cause > trouble if it is combined with additional multi processing (e.g. your own > use of the multiprocessing or joblib libraries) or when multiple users > unwittingly spawn too many threads at the same time. > > If you want to control the number of threads used through open MP, use the > OMP_NUM_THREADS environment variable when you run your python code (with a > reasonable number of threads): > > [user at server ~]$ OMP_NUM_THREADS=8 python yourscript.py > > Also, it is a great habit to run top or htop to monitor your resource > consumption to make sure you aren?t inconveniencing other users of our > lab?s resources. > > Best, > Ben > -------------- next part -------------- An HTML attachment was scrubbed... URL: From dbayani at andrew.cmu.edu Mon Apr 30 17:32:24 2018 From: dbayani at andrew.cmu.edu (David Bayani) Date: Mon, 30 Apr 2018 17:32:24 -0400 Subject: high system cpu usage with recent numpy updates In-Reply-To: References: Message-ID: Thanks Ben. You beat me to the punch sending this out. I give credit to you for first suggesting updates to scipy were responsible. As Ben pointed out in person, this is expected to become more of an issue as the machines in the lab have their packages updated, motivating this group email so everyone knows now. 
Providing a bit more material below on the subject in case anyone is
interested:

The place this flag suggestion was initially found was:
https://stackoverflow.com/questions/22418634/numpy-openblas-set-maximum-number-of-threads
There are some conversations and related links there that might be
enlightening. Checking the release notes of the newer scipy and numpy
versions was also useful.

For me, this became an issue when multiple threads were being spawned even
though multithreading was not explicitly invoked in the code in question.
Specifically, using only the scipy sparse matrix libraries and looking at the
NLWP column in htop, some machines could be seen using one thread and others
using as many threads as there are hardware thread contexts.

Machines that showed a single thread for these processes were using:
>>> import scipy
>>> scipy.__version__
'0.12.1'
>>> import numpy
>>> numpy.__version__
'1.7.1'
and machines that spawned multiple threads were using:
>>> import scipy
>>> scipy.__version__
'1.0.0'
>>> import numpy
>>> numpy.__version__
'1.14.0'

For the runs in question, life went much better after forcing the number of
threads to one. As Ben said, using more threads does not mean that runs will
be any faster (they can even be slower).

On Mon, Apr 30, 2018 at 3:59 PM, Matthew Barnes wrote:

> This happens even with basic multiprocessing in Python, for example with
> the multiprocessing.Pool.map operation. Don't be like me and accidentally
> start 2500 processes :)
>
> [...]
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From sheath at andrew.cmu.edu Mon Apr 30 17:44:32 2018
From: sheath at andrew.cmu.edu (Simon Heath)
Date: Mon, 30 Apr 2018 17:44:32 -0400
Subject: high system cpu usage with recent numpy updates
In-Reply-To:
References:
Message-ID:

Hi all,

We've updated the /etc/profile default file on the compute nodes to set the
environment variable OMP_NUM_THREADS=1 by default. This *should* make things
behave the way they behaved before the update. If you want the OpenMP library
to do its threading for you, you can override the environment variable in
your own ~/.profile file, or set it when you start your script as Ben
demonstrated.

NOTE that you will have to log out and log back in before this change takes
effect. This includes any screen or tmux sessions you may have active!

Let Predrag or me know if this causes other problems.
Simon

On Mon, Apr 30, 2018 at 5:32 PM, David Bayani wrote:

> Thanks Ben. You beat me to the punch sending this out. [...]

--
Simon Heath, Research Programmer and Analyst
Robotics Institute - Auton Lab
Carnegie Mellon University
sheath at andrew.cmu.edu
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
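For reference, a minimal sketch of how an explicit worker pool interacts with
this default; the worker count, matrix size, and work function are
illustrative only, and the comment about the inherited environment assumes
you have logged in after the /etc/profile change so OMP_NUM_THREADS=1 is
actually set in your session.

import multiprocessing as mp
import os

import numpy as np


def work(seed):
    # Each worker inherits OMP_NUM_THREADS=1 from the login environment, so
    # total CPU use stays roughly bounded by the pool size chosen below.
    rng = np.random.RandomState(seed)
    x = rng.rand(500, 500)
    return float(np.linalg.norm(x.dot(x)))


if __name__ == "__main__":
    # Keep the pool well below the node's core count on shared machines;
    # 8 is an arbitrary illustrative cap, not a lab policy.
    n_workers = min(8, os.cpu_count() or 1)
    pool = mp.Pool(processes=n_workers)
    try:
        results = pool.map(work, range(32))
    finally:
        pool.close()
        pool.join()
    print(len(results), "tasks done")

Watching the job in top or htop, as Ben suggests, remains the quickest way to
confirm it behaves.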
From dbayani at andrew.cmu.edu Mon Apr 30 18:02:53 2018
From: dbayani at andrew.cmu.edu (David Bayani)
Date: Mon, 30 Apr 2018 18:02:53 -0400
Subject: high system cpu usage with recent numpy updates
In-Reply-To:
References:
Message-ID:

Good going Simon and Predrag

On Mon, Apr 30, 2018 at 5:44 PM, Simon Heath wrote:

> Hi all,
>
> We've updated the /etc/profile default file on the compute nodes to set
> the environment variable OMP_NUM_THREADS=1 by default. [...]
>
> Simon
-------------- next part --------------
An HTML attachment was scrubbed...
URL: