From ngisolfi at cs.cmu.edu Thu Apr 1 11:52:50 2021 From: ngisolfi at cs.cmu.edu (Nick Gisolfi) Date: Thu, 1 Apr 2021 11:52:50 -0400 Subject: [Lunch] Today @noon over zoom Message-ID: <8034B0DE-335C-4DC1-99BB-A724BE3321BF@cs.cmu.edu> I am sorry for the late notice! https://cmu.zoom.us/j/95972096730?pwd=ZG1Vb0JnSEJ4Y0FPYUk0NGkrdHFHQT09 We hope to see you there! - Nick -------------- next part -------------- An HTML attachment was scrubbed... URL: From predragp at andrew.cmu.edu Sun Apr 4 23:02:42 2021 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Sun, 4 Apr 2021 23:02:42 -0400 Subject: website is down Message-ID: Dear Autonians, First off, I would like to say Happy Easter to all of you who celebrate it today. About 30 min ago I started getting warnings that our main web proxy is down. As a result, our normal website, wiki, and a few other web servers appear to be down. After a bit of investigation, it appears that the 10+ year-old platter HDD gave up, since I can still ping the server. That is not a big deal. I have a new SSD drive. I also have hardware in reserve in case it is more serious. I do have a hot spare web proxy, but regenerating SSL certificates would probably take longer than driving to CMU and fixing the failed server. Best, Predrag -------------- next part -------------- An HTML attachment was scrubbed... URL: From predragp at andrew.cmu.edu Mon Apr 5 01:14:07 2021 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Mon, 5 Apr 2021 01:14:07 -0400 Subject: website is down In-Reply-To: References: Message-ID: After a bit more poking, it turned out that the issue was CMU's flaky network rather than our old HDD. I was able to remotely access the server and it never went down, but the external interface did lose its network connection for almost 2h. Things work as designed right now. 
Best, Predrag On Sun, Apr 4, 2021 at 11:02 PM Predrag Punosevac wrote: > Dear Autonians, > > First off, I would like to say Happy Easter to all of you who celebrate it > today. > > About 30 min ago I started getting warnings that our main web proxy is > down. As a result, our normal website, wiki, and few other web servers > appear to be down. After a bit of investigation, it appears that 10+ years > old platter HDD gave up as I can still ping the server. That is not a big > deal. I have a new SSD drive. I also have hardware in reserver in case it > is more serious. I do have a hot spare web proxy but regenerating SSL > certificates would probably take longer than driving to CMU and fixing the > failed server. > > Best, > Predrag > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From awd at cs.cmu.edu Mon Apr 5 05:11:28 2021 From: awd at cs.cmu.edu (Artur Dubrawski) Date: Mon, 5 Apr 2021 05:11:28 -0400 Subject: website is down In-Reply-To: References: Message-ID: Thank you Predrag. Artur On Mon, Apr 5, 2021, 1:15 AM Predrag Punosevac wrote: > After a bit more pocking it turned out that the issue was CMU flaky > network rather than our old HDD. I was able to remotely access the server > and it never went down but the external interface did lose network > connection for almost 2h. Things work as designed right now. > > Best, > Predrag > > On Sun, Apr 4, 2021 at 11:02 PM Predrag Punosevac > wrote: > >> Dear Autonians, >> >> First off, I would like to say Happy Easter to all of you who celebrate >> it today. >> >> About 30 min ago I started getting warnings that our main web proxy is >> down. As a result, our normal website, wiki, and few other web servers >> appear to be down. After a bit of investigation, it appears that 10+ years >> old platter HDD gave up as I can still ping the server. That is not a big >> deal. I have a new SSD drive. I also have hardware in reserver in case it >> is more serious. 
I do have a hot spare web proxy but regenerating SSL >> certificates would probably take longer than driving to CMU and fixing the >> failed server. >> >> Best, >> Predrag >> >> >> >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From predragp at andrew.cmu.edu Mon Apr 5 20:00:46 2021 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Mon, 5 Apr 2021 20:00:46 -0400 Subject: Multi Nodes Communication via MPI on Auton Cluster? In-Reply-To: References: Message-ID: Hi Zhe, I hope you don't mind me replying to the mailing list as you are asking a question that might be of concern to others. I hold a terminal degree in pure mathematics and work in math-physics when I have time. I am sure most of you know infinitely more about computing than I. Please take my answer with a grain of salt. There are two things that people involved in HPC (high-performance computing) had to understand in the pre-GPU era. One is OpenMP and the second is MPI. OpenMP is a way to program on shared memory devices. This means that parallelism occurs where every parallel thread has access to all of your data. You can think of it as: parallelism can happen during the execution of a specific for loop by splitting up the loop among the different threads. Our CPU computing nodes are built for multi-threaded computing. Unfortunately, most of you are using Python. Python doesn't support true multi-threading due to the GIL (global interpreter lock). Thus you have people spawning numerous scripts and crashing machines. I don't know enough about R, which is essentially a fancy wrapper on top of pure C, to tell you how efficient it is. I do know enough about Julia to tell you that multi-threading is built in. Julia uses the Threads.@threads macro to parallelize loops and Threads.@spawn to launch tasks on separate system threads. Use locks or atomic values to control the parallel execution. 
https://docs.julialang.org/en/v1/manual/parallel-computing/ MPI is a way to program on distributed memory devices. This means that parallelism occurs where every parallel process is working in its own memory space in isolation from the others. You can think of it as every bit of code you've written being executed independently by every process. The parallelism occurs because you tell each process exactly which part of the global problem it should be working on based entirely on its process ID. Historically, with the exception of the short Hadoop period when we ran the Rocks cluster, which comes pre-configured for distributed computing, we didn't utilize distributed computing. If you force me to speculate why that was the case, I think it is because the primary method of hardware acquisition in our lab was (and still is) accretion. Our infrastructure was too inhomogeneous, put together in an ad hoc fashion rather than by careful design. Blame it on the funding sources. We have never had the luxury of spending half a million dollars on a carefully designed cluster utilizing InfiniBand. Currently, our hardware is homogeneous enough, and 40 Gigabit InfiniBand gear is dirt cheap now that national labs have largely migrated to 100 Gigabit, so I could put together a few CPU or even GPU clusters if I got a few thousand dollars for used InfiniBand. IIRC Python uses the multiprocessing library for multiprocessing https://docs.python.org/3.8/library/multiprocessing.html and does support distributed computing https://wiki.python.org/moin/ParallelProcessing but I am not familiar with it. Julia, which I use, does have native support for distributed computing. Please see the above link. The way in which you write an OpenMP and an MPI program, of course, is also very different. MPI stands for Message Passing Interface. It is a set of API declarations for message passing (such as send, receive, broadcast, etc.), and for what behavior should be expected from implementations. 
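Since most of the lab uses Python, the shared-memory vs. message-passing distinction above can be sketched with nothing but the standard library. This is a toy, single-machine illustration (not anything pre-configured on the Auton cluster): multiprocessing sidesteps the GIL by running workers as separate processes with isolated memory, and a Pipe carries explicit messages between two processes, loosely analogous to MPI-style send/receive.

```python
# Toy illustration of the two parallelism models, using only Python's
# standard library. Each Pool worker is a separate OS process with its
# own memory space (the "distributed memory" model, locally), and the
# Pipe example passes explicit messages between processes, very loosely
# mimicking MPI send/receive on a single host.
from multiprocessing import Pool, Pipe, Process

def square(x):
    # Runs inside a worker process; no shared state with the parent.
    return x * x

def worker(conn):
    # Receive a message, do the local work, send the result back.
    data = conn.recv()
    conn.send(sum(data))
    conn.close()

if __name__ == "__main__":
    # Data parallelism: split the work among isolated worker processes.
    with Pool(processes=4) as pool:
        print(pool.map(square, range(8)))  # [0, 1, 4, 9, 16, 25, 36, 49]

    # Message passing: parent sends work, child replies with the answer.
    parent_conn, child_conn = Pipe()
    p = Process(target=worker, args=(child_conn,))
    p.start()
    parent_conn.send([1, 2, 3, 4])
    print(parent_conn.recv())  # 10
    p.join()
```

For true multi-node runs the same shape of program would need mpi4py or a similar library on top of a working MPI installation, which is the part discussed below.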
I have not done enough C and Fortran programming to know how to correctly use MPI. Also, for the record, I don't know C++. People who know me well are well aware of how irritated I get when C and C++ are used interchangeably in a single sentence. The idea of "message passing" is rather abstract. It could mean passing messages between local processes or between processes distributed across networked hosts, etc. Modern implementations try very hard to be versatile and abstract away the multiple underlying mechanisms (shared memory access, network IO, etc.). OpenMP is an API that is all about making it (presumably) easier to write shared-memory multi-processing programs. There is no notion of passing messages around. Instead, with a set of standard functions and compiler directives, you write programs that execute local threads in parallel, and you control the behavior of those threads (what resources they should have access to, how they are synchronized, etc.). OpenMP requires the support of the compiler, so you can also look at it as an extension of the supported languages. And it's not uncommon that an application can use both MPI and OpenMP. I am afraid that if you were hoping for a pre-configured distributed environment that would enable you to execute a single magic command like mpiexec, you will be disappointed. This is an instance where using the Pittsburgh Supercomputing Center is probably more appropriate. There are limitations to the one-man IT department model currently utilized by the Auton Lab. You just exposed the ugly truth. For the record, I would be far happier to spend more time on genuine HPC and never be bothered with trivialities, but budgetary constraints are the major obstacle. Most Kind Regards, Predrag P.S. Please don't get me started with HPC GPU computing :-) On Mon, Apr 5, 2021 at 5:00 PM Zhe Huang wrote: > Hi Predrag, > > Sorry to bother you. I have been trying to run my experiment across > multiple nodes (e.g. 
on both gpu16 and gpu17) in a distributed manner. I > saw there is MPI backend pre-installed on the Auton cluster. However, I > tested it and I felt like it didn't work at all (I was using this command > on gpu16 to run jobs on gpu17: mpiexec -n 8 -hosts gpu17.int.autonlab.org > echo "hello"). > > Actually, is there no cross-node communication on the cluster at all or > did I do it wrong? If the latter is the case, could you point me to a > one-liner working example? Thanks. > > Sincerely, > Zhe > -------------- next part -------------- An HTML attachment was scrubbed... URL: From zhehuang at cmu.edu Mon Apr 5 20:33:04 2021 From: zhehuang at cmu.edu (Zhe Huang) Date: Mon, 5 Apr 2021 20:33:04 -0400 Subject: Multi Nodes Communication via MPI on Auton Cluster? In-Reply-To: References: Message-ID: Hi Predrag, Thanks a lot for the clarification. In my original question, I should have stated that I had already implemented my code via mpi4py to make it run in a distributed fashion. I should have asked the question in another way, directly related to the cross-node InfiniBand communication (thankfully you provided this info in your response, so I am clear now). I had already checked a lot and tried to set up jobs running across nodes in different ways, but failed. At the time I sent you the email, I was 99% sure that the infrastructure doesn't support this, but I just wanted to confirm with you. I feel guilty that, due to my vagueness, you ended up writing such a long reply. Thank you for your detailed and informative explanation. Much appreciated. Most sincerely, Zhe On Mon, Apr 5, 2021 at 8:01 PM Predrag Punosevac wrote: > Hi Zhe, > > I hope you don't mind me replying to the mailing list as you are asking a > question that might be of concern to others. > > I hold a terminal degree in pure mathematics and work in math-physics when > I have time. I am sure most of you know infinitely more about computing > than I. Please take my answer with a grain of salt. 
There are two things in > pre GPU era that people who were involved in HPC > (high-performance computing) had to understand. One is OpenMPI and the > second one is MPI computing. > > OpenMP is a way to program on shared memory devices. This means that > parallelism occurs where every parallel thread has access to all of your > data. You can think of it as: parallelism can happen during the execution > of a specific for a loop by splitting up the loop among the different > threads. > Our CPU computing nodes are built for multi-threading computing. > Unfortunately, most of you are using Python. Python doesn't support > multi-threading due to GIL (global-interpreter-lock). Thus you have > people spawning numerous scripts and crushing machines. I don't know about > R which is essentially a fancy wrapper on the top of pure C to tell you how > efficient it is. I do know enough about Julia to tell you that > multi-threading is built in. Julia uses Threads. at threads macro to > parallelize loops and Threads. at spawn to launch tasks on separate system > threads. Use locks or atomic values to control the parallel execution. > > https://docs.julialang.org/en/v1/manual/parallel-computing/ > > MPI is a way to program on distributed memory devices. This means that > parallelism occurs where every parallel process is working in its > own memory space in isolation from the others. You can think of it as > every bit of code you've written is executed independently by every > process. The parallelism occurs because you tell each process exactly which > part of the global problem they should be working on based entirely on > their process ID. Historically, with the exception of the short Hadoop > period when we run the Rocks cluster which comes pre-configured for > distributed computing, we didn't utilize distributed computing. 
If you > force me to speculate why that was the case I think it is due to the fact > that the primary method of hardware acquisition in our lab was(still is) > accretion. Our infrastructure was too inhomogeneous put together in an ad > hoc fashion rather than the careful design. Blame it on the funding > sources. We have never had the luxury of spending half a million dollars on > the carefully designed cluster utilizing InfiniBand. Currently, our > hardware is homogenous enough and 40 Gigabit InfiniBand are dirt cheap due > to the fact that national labs have largely migrated to 100 Gigabit that I > could clamp a few CPU or even GPU clusters if I get few thousands for used > InfiniBand. IIRC Python uses a multiprocessing library for multiprocessing > > https://docs.python.org/3.8/library/multiprocessing.html > > and does support distributed computing > > https://wiki.python.org/moin/ParallelProcessing > > but I am not familiar with it. Julia which I use does have native support > for distributive computing. Please see the above link. > > > The way in which you write an OpenMP an MPI program, of course, is also > very different. > > MPI stands for Message Passing Interface. It is a set of API declarations > on message passing (such as to send, receive, broadcast, > etc.), and what behavior should be expected from the implementations. I > have not done enough of C and Fortran programming to know how to correctly > use MPI. Also for the record, I don't know C++. People who know me well are > well aware of how irritated I get when C and C++ are interchangeably used > in a single sentence. > > The idea of "message passing" is rather abstract. It could mean passing > the message between local processes or processes distributed across > networked hosts, etc. Modern implementations try very hard to be versatile > and abstract away the multiple underlying mechanisms (shared > memory access, network IO, etc.). 
> > OpenMP is an API that is all about making it (presumably) easier to write > shared-memory multi-processing programs. There is no notion of > passing messages around. Instead, with a set of standard functions and > compiler directives, you write programs that execute local threads in > parallel, and you control the behavior of those threads (what resource > they should have access to, how they are synchronized, etc.). OpenMP > requires the support of the compiler, so you can also look at it as an > extension of the supported languages. > > And it's not uncommon that an application can use both MPI and OpenMP. > > I am afraid if you were hoping for the pre-configured distributed > environment which will enable you to execute the single magic command like > mpiexec you will be disappointed. This is an instance where using > Pittsburg Supercomputing Center is probably more appropriate. There are > limitations to a one-man IT department model currently utilized by the > Auton Lab. You just exposed the ugly truth. > > For the record, I would be far happier to spend more time on genuine HPC > and never be bothered with trivialities but budgetary constraints are the > major obstacle. > > Most Kind Regards, > Predrag > > P.S. Please don't get me started with HPC GPU computing :-) > > > > On Mon, Apr 5, 2021 at 5:00 PM Zhe Huang wrote: > >> Hi Predrag, >> >> Sorry to bother you. I have been trying to run my experiment across >> multiple nodes (e.g. on both gpu16 and gpu17) in a distributed manner. I >> saw there is MPI backend pre-installed on the Auton cluster. However, I >> tested it and I felt like it didn't work at all (I was using this command >> on gpu16 to run jobs on gpu17: mpiexec -n 8 -hosts gpu17.int.autonlab.org >> echo "hello"). >> >> Actually, is there no cross-node communication on the cluster at all or >> did I do it wrong? If the latter is the case, could you point me to a >> one-liner working example? Thanks. 
>> >> Sincerely, >> Zhe >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From awd at cs.cmu.edu Tue Apr 6 14:34:41 2021 From: awd at cs.cmu.edu (Artur Dubrawski) Date: Tue, 6 Apr 2021 14:34:41 -0400 Subject: Fwd: [CMU AI Seminar] Apr 13 at 12pm (Zoom) -- Noah Smith (U of Washington) -- Language Models: Challenges and Progress -- AI Seminar sponsored by Fortive In-Reply-To: References: Message-ID: May be of interest to a few of us at the Auton Lab. Noah is an old collaborator. Artur ---------- Forwarded message --------- From: Shaojie Bai Date: Tue, Apr 6, 2021 at 2:14 PM Subject: [CMU AI Seminar] Apr 13 at 12pm (Zoom) -- Noah Smith (U of Washington) -- Language Models: Challenges and Progress -- AI Seminar sponsored by Fortive To: , , < ml-students at cs.cmu.edu>, , < ece-students at ece.cmu.edu> Dear all, We look forward to seeing you *next Tuesday (4/13)* from *12:00-1:00 PM (U.S. Eastern time)* for the next talk of our *CMU AI seminar*, sponsored by Fortive. To learn more about the seminar series or see the future schedule, please visit the seminar website. On 4/13, *Noah Smith* (University of Washington / AI2) will be giving a talk on "*Language Models: Challenges and Progress*". *Title*: Language Models: Challenges and Progress *Talk Abstract*: Probabilistic language models are once again foundational to many advances in natural language processing research, bringing the exciting opportunity to harness raw text to build language technologies. With the emergence of deep architectures and protocols for finetuning a pretrained language model, many NLP solutions are being cast as simple variations on language modeling. This talk is about challenges in language model-based NLP and some of our work toward solutions. First, we'll consider evaluation of generated language. I'll present some alarming findings about humans and models and make some recommendations. 
Second, I'll turn to a ubiquitous design limitation in language modeling -- the vocabulary -- and present a linguistically principled, sample-efficient solution that enables modifying the vocabulary during finetuning and/or deployment. Finally, I'll delve into today's most popular language modeling architecture, the transformer, and show how its attention layers' quadratic runtime can be made linear without affecting accuracy. Taken together, we hope these advances will broaden the population of people who can effectively use and contribute back to NLP. *Speaker Bio*: Noah Smith is a Professor in the Paul G. Allen School of Computer Science & Engineering at the University of Washington, as well as a Senior Research Manager at the Allen Institute for Artificial Intelligence. Previously, he was an Associate Professor of Language Technologies and Machine Learning in the School of Computer Science at Carnegie Mellon University. He received his Ph.D. in Computer Science from Johns Hopkins University in 2006 and his B.S. in Computer Science and B.A. in Linguistics from the University of Maryland in 2001. His research interests include statistical natural language processing, machine learning, and applications of natural language processing, especially to the social sciences. His book, Linguistic Structure Prediction, covers many of these topics. He has served on the editorial boards of the journals Computational Linguistics (2009-2011), Journal of Artificial Intelligence Research (2011-present), and Transactions of the Association for Computational Linguistics (2012-present), as the secretary-treasurer of SIGDAT (2012-2015 and 2018-present), and as program co-chair of ACL 2016. Alumni of his research group, Noah's ARK, are international leaders in NLP in academia and industry; in 2017 UW's Sounding Board team won the inaugural Amazon Alexa Prize. 
He was named an ACL Fellow in 2020, "for significant contributions to linguistic structure prediction, computational social sciences, and improving NLP research methodology." Smith's work has been recognized with a UW Innovation award (2016-2018), a Finmeccanica career development chair at CMU (2011-2014), an NSF CAREER award (2011-2016), a Hertz Foundation graduate fellowship (2001-2006), numerous best paper nominations and awards, and coverage by NPR, BBC, CBC, New York Times, Washington Post, and Time. *Zoom Link*: https://cmu.zoom.us/j/93338025712?pwd=dEZvTkc0bTVtTjNkRkQzeGo5KzVZUT09 Thanks, Shaojie Bai (MLD) -------------- next part -------------- An HTML attachment was scrubbed... URL: From awd at cs.cmu.edu Wed Apr 7 16:48:12 2021 From: awd at cs.cmu.edu (Artur Dubrawski) Date: Wed, 7 Apr 2021 16:48:12 -0400 Subject: Fwd: MACHINE LEARNING in MEDICINE - VIRTUAL SEMINAR - APRIL 14, 2021 - 3PM (EST) ---ZOOM In-Reply-To: <3aa4dfc5989d49ed82aa9eb03d27ba20@cs.cmu.edu> References: <3aa4dfc5989d49ed82aa9eb03d27ba20@cs.cmu.edu> Message-ID: some of us may be interested in this topic ---------- Forwarded message --------- From: Christy Melucci Date: Wed, Apr 7, 2021 at 4:12 PM Subject: RE: MACHINE LEARNING in MEDICINE - VIRTUAL SEMINAR - APRIL 14, 2021 - 3PM (EST) ---ZOOM To: ml-core-faculty at cs.cmu.edu , ml-seminar at cs.cmu.edu Cc: Visweswaran, Shyam , Bartolotta, Genine M < bartgm at pitt.edu>, Batmanghelich, Kayhan , Roni Rosenfeld < roni at cs.cmu.edu> *From:* Bartolotta, Genine M *Sent:* Wednesday, April 7, 2021 3:22 PM *Cc:* Batmanghelich, Kayhan ; Visweswaran, Shyam < shv3 at pitt.edu> *Subject:* MACHINE LEARNING in MEDICINE - VIRTUAL SEMINAR - APRIL 14, 2021 - 3PM (EST) ---ZOOM *Machine Learning in Medicine (MLxMed)* *A Virtual Seminar Series in Pittsburgh* *Hosted by the Department of Biomedical Informatics* *Wednesday, April 14, 2021* *3:00 PM - 
4:00 PM Eastern Time University of Pittsburgh, UPMC, and CMU* *Machine Learning in Medicine: Early Recognition of Sepsis* *Zoom **https://pitt.zoom.us/j/93487765055* *(details are listed at the end)* *Karsten Borgwardt, PhD* Full Professor of Data Mining, Biosystems, ETH Zürich *Abstract: *Sepsis is a major cause of mortality in intensive care units around the world. If recognized early, it can often be treated successfully, but early prediction of sepsis is an extremely difficult task in clinical practice. The data wealth from intensive care units that is increasingly becoming available for research now allows us to study this problem of predicting sepsis using machine learning and data mining approaches. In this talk, I will describe our efforts towards data-driven early recognition of sepsis. *About MLxMed Seminar Series* *(http://ml-in-medicine.org/)* Medicine is complex and data-driven, while discovery and decision making are increasingly enabled by machine learning. Machine learning has the potential to support, enable and improve medical discovery and clinical decision making in areas such as medical imaging, cancer diagnostics, precision medicine, clinical trials, and electronic health records. This seminar series focuses on new algorithms, real-world deployment, and future trends in machine learning in medicine. It will feature prominent investigators who are developing and applying machine learning to biomedical discovery and in clinical decision support. For more information, see the MLxMed website. 
*Zoom Information* *When: April 14, 2021 - 3:00 PM Eastern Time (US and Canada)* *Please click the link below to join the webinar:* *https://pitt.zoom.us/j/93487765055* Or One tap mobile : US: *+12678310333, 93487765055# or 8778535247, 93487765055#* (Toll Free) Or Telephone: Dial (for higher quality, dial a number based on your current location): *US: +1 267 831 0333 or 877 853 5247 (Toll Free)* *Webinar ID: 934 8776 5055* International numbers available: *https://pitt.zoom.us/u/abbaYni0lZ* Or an H.323/SIP room system: H.323: 162.255.37.11 (US West) 162.255.36.11 (US East) 115.114.131.7 (India Mumbai) 115.114.115.7 (India Hyderabad) 213.19.144.110 (Amsterdam Netherlands) 213.244.140.110 (Germany) 103.122.166.55 (Australia Sydney) 103.122.167.55 (Australia Melbourne) 149.137.40.110 (Singapore) 64.211.144.160 (Brazil) 69.174.57.160 (Canada Toronto) 65.39.152.160 (Canada Vancouver) 207.226.132.110 (Japan Tokyo) 149.137.24.110 (Japan Osaka) * Meeting ID: 934 8776 5055* * SIP: **93487765055 at zoomcrc.com* <93487765055 at zoomcrc.com> -------------- next part -------------- An HTML attachment was scrubbed... URL: From ngisolfi at cs.cmu.edu Thu Apr 8 11:30:16 2021 From: ngisolfi at cs.cmu.edu (Nick Gisolfi) Date: Thu, 8 Apr 2021 11:30:16 -0400 Subject: [Lunch] Today @noon over zoom Message-ID: <3478014A-3F65-4FE7-B546-E0BB3A9D5782@cs.cmu.edu> https://cmu.zoom.us/j/95972096730?pwd=ZG1Vb0JnSEJ4Y0FPYUk0NGkrdHFHQT09 We hope to see you there! - Nick -------------- next part -------------- An HTML attachment was scrubbed... URL: From boecking at andrew.cmu.edu Wed Apr 14 15:51:43 2021 From: boecking at andrew.cmu.edu (Benedikt Boecking) Date: Wed, 14 Apr 2021 14:51:43 -0500 Subject: Compute node usage Message-ID: <759A7322-8868-4AC1-B533-42B9A7A66D3F@andrew.cmu.edu> Hi everyone, This is just a reminder for everyone to please be mindful of others when running experiments on our cpu or gpu servers. 
Before running your experiments, please use $ htop and additionally on gpu nodes $ nvidia-smi to check that adequate resources are available. Also, while your jobs are running, please monitor memory, cpu, and gpu usage for unexpected behavior. From ngisolfi at cs.cmu.edu Thu Apr 15 11:34:56 2021 From: ngisolfi at cs.cmu.edu (Nick Gisolfi) Date: Thu, 15 Apr 2021 11:34:56 -0400 Subject: [Lunch] Today @noon over zoom Message-ID: https://cmu.zoom.us/j/95972096730?pwd=ZG1Vb0JnSEJ4Y0FPYUk0NGkrdHFHQT09 We hope to see you there! - Nick -------------- next part -------------- An HTML attachment was scrubbed... URL: From predragp at andrew.cmu.edu Sun Apr 18 23:20:00 2021 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Sun, 18 Apr 2021 23:20:00 -0400 Subject: incompatible cudnn.h and libcublas? In-Reply-To: References: Message-ID: Hi Ifigeneia, I am CC-ing the list as this might be of wider interest to the lab members. This seems to be a cuDNN issue. gpu1 runs CUDA 11.2 on RHEL 7.9 while gpu2 runs CUDA 11 on RHEL 7.9. The current CUDA release is 11.3 and all recently provisioned computing nodes run RHEL 8.3. In an ideal world I should first upgrade all computing nodes to 8.3 and CUDA installations to 11.3 before we talk about cuDNN libraries. cuDNN is proprietary software. I logged into my NVIDIA developer account and I am downloading RedHat 8.1 RPMs of cuDNN v8.1, released on February 26. That release supposedly should be compatible with all versions of CUDA branch 11, i.e. 11.0, 11.1, 11.2, and 11.3, but runs on RHEL 8.1 (so there is no guarantee that it will run on 8.3). I can download RPMs for RHEL 7.3, but obviously there is no guarantee that they will work on RHEL 7.9. Upgrading 7.9 to 8.3 on 30+ computing nodes is not realistic. The downtime would be significant. Updating CUDA and cuDNN across 23+ servers is also non-trivial as it requires a reboot. Upgrading CUDA on 5 GPU servers per week seems a more reasonable and less risky approach. 
Are there any impending deadlines that I should be aware of? If Ben, who is CC'd on this email, confirms, I would be happy to try to upgrade CUDA to 11.3 on GPU[1-5] and install cuDNN v8.1, but I will not upgrade the OS to 8.3. Best, Predrag On Sat, Apr 17, 2021 at 10:40 AM Ifigeneia Apostolopoulou < iapostol at andrew.cmu.edu> wrote: > Hi Predrag, > > on gpu1/gpu2, I'm getting the following error: > > RuntimeError: Mixed dnn version. The header is version 8002 while the > library is version 7605. > > It seems that there exists an updated cudnn.h in /usr/include/ but no in > > /usr/local/cuda-11/include > /usr/local/cuda-11/targets/include/ > > In gpu20, there seems to be no cudnn.h. > > would it be possible to sync cudnn.h?? > > thanks! > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From predragp at andrew.cmu.edu Sun Apr 18 23:33:34 2021 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Sun, 18 Apr 2021 23:33:34 -0400 Subject: apc00EDFC power overloading Message-ID: Dear Autonians, Just a quick heads up. About 10 minutes ago I started getting warnings that BANK1 on the PDU apc00EDFC is near overload. The following GPU machines are connected to that bank: gpu21, gpu22, and gpu23. I have never seen this before. These GPU servers are 50K machines and are probably drawing more electricity than the older GPU nodes. They must never have been hit as hard as they were today for me to see an imminent power outage. If you are running anything on those GPU nodes, plus gpu20, which is connected to BANK2 on the same PDU, please try to scale down a bit. Otherwise, it is probably safe to assume that the servers will not live long enough to produce results. Best, Predrag -------------- next part -------------- An HTML attachment was scrubbed... URL: From predragp at andrew.cmu.edu Mon Apr 19 11:58:53 2021 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Mon, 19 Apr 2021 11:58:53 -0400 Subject: incompatible cudnn.h and libcublas? 
In-Reply-To: <3EB9E619-6F6E-4E8F-B622-54D429002F27@andrew.cmu.edu> References: <3EB9E619-6F6E-4E8F-B622-54D429002F27@andrew.cmu.edu> Message-ID: Hi Ben, That is super useful info. That is exactly the feedback I was hoping to get by CC-ing users at autonlab. There are just too many moving parts, and 90% of the time changing nothing is the correct approach to system administration. Best, Predrag On Mon, Apr 19, 2021 at 10:30 AM Benedikt Boecking wrote: > Just fyi, as far as I am aware, pytorch only supports cuda up to 11.1 for > now. It would be great if we could wait with updating cuda to 11.3 since > many lab members rely on pytorch. > > > > On Apr 18, 2021, at 10:20 PM, Predrag Punosevac > wrote: > > Hi Ifigeneia, > > I am CC-ing as this might be of wider interest to the lab members. > > This seems to be a cuDNN issue. gpu1 runs cuda11.2 on RHEL 7.9 while gpu2 > runs cuda11 on RHEL7.9. Current CUDA release is 11.3 and all recently > provisioned computing nodes run RHEL 8.3. In an ideal world I should > firstly upgrade all computing nodes to 8.3 and CUDA installations to 11.3 > before we talk about cuDNN libraries. cuDNN is a proprietary software. I > logged into my NVidia developer account and I am downloading RedHat 8.1 > RPMs of cuDNN v8.1 released on February 26. That release supposedly should > be compatible with all versions of CUDA branch 11 i.e. 11.0, 11.1, 11.2, > and 11.3 but runs on RHEL 8.1 (so there is no guarantee that it will run on > 8.3). I can download RMPs for RHEL 7.3 but obviously there is no guarantee > that will work on RHEL 7.9. > > Upgrading 7.9 to 8.3 on 30+ computing nodes is not realistic. The down > time would be significant. Updating CUDA and cuDNN across 23+ servers is > also non trivial as it requires reboot. Upgrading cuda on 5 GPU servers per > week seems a more reasonable and less risky approach. Are there any > impending deadlines that I should be aware of? 
If Ben who is CC to this > email confirms that I would be happy to try to upgrade CUDA to 8.3 on > GPU[1-5] and install cuDNN v8.1 but I will not upgrade OS to 8.3. > > Best, > Predrag > > On Sat, Apr 17, 2021 at 10:40 AM Ifigeneia Apostolopoulou < > iapostol at andrew.cmu.edu> wrote: > >> Hi Predrag, >> >> on gpu1/gpu2, I'm getting the following error: >> >> RuntimeError: Mixed dnn version. The header is version 8002 while the >> library is version 7605. >> >> It seems that there exists an updated cudnn.h in /usr/include/ but no in >> >> /usr/local/cuda-11/include >> /usr/local/cuda-11/targets/include/ >> >> In gpu20, there seems to be no cudnn.h. >> >> would it be possible to sync cudnn.h?? >> >> thanks! >> >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ngisolfi at cs.cmu.edu Thu Apr 22 11:20:34 2021 From: ngisolfi at cs.cmu.edu (Nick Gisolfi) Date: Thu, 22 Apr 2021 11:20:34 -0400 Subject: [Lunch] Today @noon over zoom Message-ID: <92A3D484-9D87-4DDF-9FAA-673022E842A4@cs.cmu.edu> https://cmu.zoom.us/j/95972096730?pwd=ZG1Vb0JnSEJ4Y0FPYUk0NGkrdHFHQT09 We hope to see you there! - Nick -------------- next part -------------- An HTML attachment was scrubbed... URL: From ngisolfi at cs.cmu.edu Thu Apr 29 11:28:26 2021 From: ngisolfi at cs.cmu.edu (Nick Gisolfi) Date: Thu, 29 Apr 2021 11:28:26 -0400 Subject: [Lunch] Today @noon over zoom Message-ID: https://cmu.zoom.us/j/95972096730?pwd=ZG1Vb0JnSEJ4Y0FPYUk0NGkrdHFHQT09 We hope to see you there! - Nick -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From awd at cs.cmu.edu Fri Apr 30 16:01:11 2021 From: awd at cs.cmu.edu (Artur Dubrawski) Date: Fri, 30 Apr 2021 16:01:11 -0400 Subject: Ridiculously long streak of victories of the Auton Lab's AutoML tool continues Message-ID: Dear Autonians, You have heard many times from me about our DARPA D3M Team, and in particular Saswati Ray, who is leading the automated machine learning effort in it (which we have recently aptly named Auto^{n}ML), and their long uninterrupted streak of wins in periodic DARPA-run evaluations. I guess everyone in the program was bored with that, so they decided to expose the competing systems to an external repository of challenges. So they picked 651 predictive problems from the OpenML repository. Guess what, our CMU team has done it again. See the attached screenshots from the leaderboard provided by DARPA. What is of note this time around is that Saswati's algorithm was able to yield better results than the best solutions available so far on OpenML in more than 100 cases. All contesting tools were given a budget of 30 minutes of compute time per task to propose the best ML architecture to tackle a problem. Way to go, Saswati and the CMU Auton Lab DARPA D3M Team! Cheers, Artur [image: image.png] [image: image.png] -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image.png Type: image/png Size: 23720 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image.png Type: image/png Size: 32457 bytes Desc: not available URL: