From predragp at cs.cmu.edu Sun Oct 2 16:56:58 2016 From: predragp at cs.cmu.edu (Predrag Punosevac) Date: Sun, 02 Oct 2016 16:56:58 -0400 Subject: MATLAB upgrade completed Message-ID: <20161002205658.2FVmVxnCA%predragp@cs.cmu.edu> Dear Autonians, MATLAB is now upgraded on all computing nodes. Predrag From predragp at cs.cmu.edu Mon Oct 3 07:31:00 2016 From: predragp at cs.cmu.edu (predragp at cs.cmu.edu) Date: Mon, 3 Oct 2016 11:31:00 +0000 (UTC) Subject: Fwd: POSTPONED: [Power Outage] SCS Wean Hall machine room - October 4th 2016 In-Reply-To: <4dff3aaa-a607-1682-0f69-b604579731c3@cs.cmu.edu> References: <0aee22ff-fd00-56d4-4563-f912cfe06184@cs.cmu.edu> <4dff3aaa-a607-1682-0f69-b604579731c3@cs.cmu.edu> Message-ID: <7B1F10ED8F1045F6.d4c53dad-27b1-4003-9697-cf9e62464310@mail.outlook.com> Get Outlook for Android ---------- Forwarded message ---------- From: "Edward Walter" Date: Mon, Oct 3, 2016 at 7:28 AM -0400 Subject: POSTPONED: [Power Outage] SCS Wean Hall machine room - October 4th 2016 To: "Edward J Walter" The planned power outage for the SCS Wean Hall machine room has been postponed and will NOT take place on October 4th. We are coordinating with the electrical contractors to get the work re-scheduled. We will let you know as soon as we have a new date for this power outage. Thank you. SCS Help Desk On 09/12/2016 08:01 AM, Edward Walter wrote: > SCS Computing Facilities and FMS are planning a partial power outage in > the SCS Wean Hall machine room. We expect this work to begin on Oct > 4th, 2016 and to take less than 24 hours. The outage may run into 48 > hours in the event that the electrical contractor encounters something > unexpected. 
> > The following servers or computational clusters will be affected by this > power outage: > > Affected clusters: > ACTR.HPC1.CS.CMU.EDU > AUTON > COMA.HPC1.CS.CMU.EDU > CORTEX.ML.CMU.EDU > LATEDAYS.ANDREW.CMU.EDU > PSYCH-O.HPC1.CS.CMU.EDU > ROCKS.IS.CS.CMU.EDU > WORKHORSE.LTI.CS.CMU.EDU > YODA.GRAPHICS.CS.CMU.EDU > > > Affected servers: > OMEPSLID.COMPBIO > SLIF.COMPBIO > PACIFIC.DB > GPUSERVER.PERCEPTION > GPUSERVER2.PERCEPTION > GPUSERVER3.PERCEPTION > GPUSERVER5.PERCEPTION > GPUSERVER6.PERCEPTION > GPUSERVER7.PERCEPTION > DENVER.LTI > LOR.LTI > MIAMI.LTI > SASKIA.ML > MARTEN.ML > JAN.ML > ARNOUT.ML > LYSBET.ML > FLORIS.ML > > > Please contact the SCS Help Desk at x8-4231 or send mail to > help at cs.cmu.edu with any questions or concerns regarding this > maintenance period. > > Thank you for your attention, > > SCS Help Desk -------------- next part -------------- An HTML attachment was scrubbed... URL: From jieshic at andrew.cmu.edu Mon Oct 3 11:30:18 2016 From: jieshic at andrew.cmu.edu (jieshic at andrew.cmu.edu) Date: Mon, 3 Oct 2016 11:30:18 -0400 Subject: Annual Auton Lab Picnic: Saturday October 1st at Schenley Park In-Reply-To: References: <07b2034c-153b-d4e3-5e46-8e1b65f8a2ac@cs.cmu.edu> Message-ID: <942587fbd979ccea27143628ce8ee5a3.squirrel@webmail.andrew.cmu.edu> Hi Everyone, Thanks for attending the picnic last Saturday. Attached is the picture for the cake with winning slogan by Kyle. Since we received a lot of questions about the food vendors, here is the information.BBQ food from Double Wide Grill (2339 E Carson St, Pittsburgh, PA 15203).Taro cake from Pink Box Bakery Cafe (2104 Murray Ave, Pittsburgh, PA 15217). Thank you again for your support. Best, Auton Lab Entertainment Committee > > Hi Everyone, > Tomorrow's picnic will start at 11:30am. The shelter > is Vietnam Veterans Pavilion at the Schenley Park. We'll have lunch with > BBQ food, a large Taro cake, and beers & wines (pls bring your ID if > you'll drink :P ). 
> We have prepared some long games and you are > also welcome to bring your bikes, games, cameras, etc.. > BTW, here's > the weather forecast for your reference. "Showers in the morning, > then partly cloudy in the afternoon. Thunder possible. High 73F. Winds SSE > at 5 to 10 mph. Chance of rain 50%." (source: www.weather.com) > > Pls feel free to let me know if you have any question. > > Looking forward to seeing you tomorrow! > > > > Cheers, > Jessie >> Dear Autonians, >> >> We will be celebrating the 23rd > anniversary of the Lab this year at a >> nearby location. >> > We have reserved Vietnam Veterans Pavilion at the Schenley Park: >> > >> > https://www.google.com/maps/place/Vietnam+Veterans+Pavilion/@40.434036,-79.9453924,17z/data=!4m12!1m6!3m5!1s0x8834f18b186c4403:0xd24a0faef8f7e126!2sSchenley+Park!8m2!3d40.4318311!4d-79.9462078!3m4!1s0x0:0x4bcca3ccfd92c919!8m2!3d40.4338281!4d-79.9439894 >> >> We have it booked from 11am till 9pm, we will have lunch > food, and the >> Auton Lab Entertainment >> Committee led by > our CEO (Chief Entertainment Officer) Jessie is working >> on the - > you've guessed it - program >> of entertainment and activities. >> >> Would you please go to this google doc form to rsvp and > provide >> information useful for planning: >> >> > https://docs.google.com/forms/d/e/1FAIpQLSdf75XSHzcACDi9eiiUWjVSFuUzTT4mwTPNBWr1kIH7z0H69Q/viewform?usp=send_form >> >> Quick hint for the new members of our team: each year we > run a contest >> for the most fitting slogan to put >> on > the Auton Lab birthday cake. The bids are judged by a few wise men >> and the author of the winning slogan >> feels the pride of > seeing it on the cake and basks in glory forever. >> >> > Cheers! >> Artur >> > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed...
Name: 20161001_141039.jpeg Type: image/jpeg Size: 160849 bytes Desc: not available URL: From krw at andrew.cmu.edu Mon Oct 3 11:36:09 2016 From: krw at andrew.cmu.edu (Karen Widmaier) Date: Mon, 3 Oct 2016 11:36:09 -0400 Subject: Annual Auton Lab Picnic: Saturday October 1st at Schenley Park In-Reply-To: <942587fbd979ccea27143628ce8ee5a3.squirrel@webmail.andrew.cmu.edu> References: <07b2034c-153b-d4e3-5e46-8e1b65f8a2ac@cs.cmu.edu> <942587fbd979ccea27143628ce8ee5a3.squirrel@webmail.andrew.cmu.edu> Message-ID: <023e01d21d8b$dd6aa860$983ff920$@andrew.cmu.edu> Hi Jieshi, Thank you for all your hard work pulling it all together. A special thanks to you and Maria and Ben. Karen From: Autonlab-users [mailto:autonlab-users-bounces at autonlab.org] On Behalf Of jieshic at andrew.cmu.edu Sent: Monday, October 03, 2016 11:30 AM To: users at autonlab.org Cc: Chirag Nagpal; Luna Yang Subject: Re: Annual Auton Lab Picnic: Saturday October 1st at Schenley Park Hi Everyone, Thanks for attending the picnic last Saturday. Attached is the picture for the cake with winning slogan by Kyle. Since we received a lot of questions about the food vendors, here is the information. * BBQ food from Double Wide Grill (2339 E Carson St, Pittsburgh, PA 15203). * Taro cake from Pink Box Bakery Cafe (2104 Murray Ave, Pittsburgh, PA 15217). Thank you again for your support. Best, Auton Lab Entertainment Committee > > Hi Everyone, > Tomorrow's picnic will start at 11:30am. The shelter > is Vietnam Veterans Pavilion at the Schenley Park. We'll have lunch with > BBQ food, a large Taro cake, and beers & wines (pls bring your ID if > you'll drink :P ). > We have prepared some long games and you are > also welcome to bring your bikes, games, cameras, etc.. > BTW, here's > the weather forecast for your reference. "Showers in the morning, > then partly cloudy in the afternoon. Thunder possible. High 73F. Winds SSE > at 5 to 10 mph. Chance of rain 50%." 
(source: www.weather.com) > > Pls feel free to let me know if you have any question. > > Looking forward to seeing you tomorrow! > > > > Cheers, > Jessie >> Dear Autonians, >> >> We will be celebrating the 23rd > anniversary of the Lab this year at a >> nearby location. >> > We have reserved Vietnam Veterans Pavilion at the Schenley Park: >> > >> > https://www.google.com/maps/place/Vietnam+Veterans+Pavilion/@40.434036,-79.9 453924,17z/data=!4m12!1m6!3m5!1s0x8834f18b186c4403:0xd24a0faef8f7e126!2sSche nley+Park!8m2!3d40.4318311!4d-79.9462078!3m4!1s0x0:0x4bcca3ccfd92c919!8m2!3d 40.4338281!4d-79.9439894 >> >> We have it booked from 11am till 9pm, we will have lunch > food, and the >> Auton Lab Entertainment >> Committee led by > our CEO (Chief Entertainment Officer) Jessie is working >> on the - > you've guessed it - program >> of entertainment and activities. >> >> Would you please go to this google doc form to rsvp and > provide >> information useful for planning: >> >> > https://docs.google.com/forms/d/e/1FAIpQLSdf75XSHzcACDi9eiiUWjVSFuUzTT4mwTPN BWr1kIH7z0H69Q/viewform?usp=send_form >> >> Quick hint for the new members of our team: each year we > run a contest >> for the most fitting slogan to put >> on > the Auton Lab birthday cake. The bids are judged by a few wise men >> and the author of the winning slogan >> feels the pride of > seeing it on the cake and basks in glory forever. >> >> > Cheers! >> Artur >> > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mbarnes1 at andrew.cmu.edu Mon Oct 3 11:47:50 2016 From: mbarnes1 at andrew.cmu.edu (Matt Barnes) Date: Mon, 03 Oct 2016 15:47:50 +0000 Subject: Annual Auton Lab Picnic: Saturday October 1st at Schenley Park In-Reply-To: <023e01d21d8b$dd6aa860$983ff920$@andrew.cmu.edu> References: <07b2034c-153b-d4e3-5e46-8e1b65f8a2ac@cs.cmu.edu> <942587fbd979ccea27143628ce8ee5a3.squirrel@webmail.andrew.cmu.edu> <023e01d21d8b$dd6aa860$983ff920$@andrew.cmu.edu> Message-ID: Seconded. 
Thanks for organizing -- I've never eaten so many ribs. On Mon, Oct 3, 2016 at 11:37 AM Karen Widmaier wrote: > Hi Jieshi, > > Thank you for all your hard work pulling it all together. > > > > A special thanks to you and Maria and Ben. > > > > Karen > > > > > > *From:* Autonlab-users [mailto:autonlab-users-bounces at autonlab.org] *On > Behalf Of *jieshic at andrew.cmu.edu > *Sent:* Monday, October 03, 2016 11:30 AM > *To:* users at autonlab.org > *Cc:* Chirag Nagpal; Luna Yang > *Subject:* Re: Annual Auton Lab Picnic: Saturday October 1st at Schenley > Park > > > > Hi Everyone, > > Thanks for attending the picnic last Saturday. Attached is the picture for > the cake with winning slogan by Kyle. > > Since we received a lot of questions about the food vendors, here is the > information. > > - BBQ food from Double Wide Grill (2339 E Carson St, Pittsburgh, PA > 15203). > - Taro cake from Pink Box Bakery Cafe (2104 Murray Ave, Pittsburgh, PA > 15217). > > Thank you again for your support. > > Best, > Auton Lab Entertainment Committee > > > > > Hi Everyone, > > Tomorrow's picnic will start at 11:30am. The shelter > > is Vietnam Veterans Pavilion at the Schenley Park. We'll have lunch with > > BBQ food, a large Taro cake, and beers & wines (pls bring your ID if > > you'll drink :P ). > > We have prepared some long games and you are > > also welcome to bring your bikes, games, cameras, etc.. > > BTW, here's > > the weather forecast for your reference. "Showers in the morning, > > then partly cloudy in the afternoon. Thunder possible. High 73F. Winds > SSE > > at 5 to 10 mph. Chance of rain 50%." (source: www.weather.com) > > > > Pls feel free to let me know if you have any question. > > > > Looking forward to seeing you tomorrow! > > > > > > > > Cheers, > > Jessie > >> Dear Autonians, > >> > >> We will be celebrating the 23rd > > anniversary of the Lab this year at a > >> nearby location. 
> >> > > We have reserved Vietnam Veterans Pavilion at the Schenley Park: > >> > > > >> > > > https://www.google.com/maps/place/Vietnam+Veterans+Pavilion/@40.434036,-79.9453924,17z/data=!4m12!1m6!3m5!1s0x8834f18b186c4403:0xd24a0faef8f7e126!2sSchenley+Park!8m2!3d40.4318311!4d-79.9462078!3m4!1s0x0:0x4bcca3ccfd92c919!8m2!3d40.4338281!4d-79.9439894 > >> > >> We have it booked from 11am till 9pm, we will have lunch > > food, and the > >> Auton Lab Entertainment > >> Committee led by > > our CEO (Chief Entertainment Officer) Jessie is working > >> on the - > > you've guessed it - program > >> of entertainment and activities. > >> > >> Would you please go to this google doc form to rsvp and > > provide > >> information useful for planning: > >> > >> > > > https://docs.google.com/forms/d/e/1FAIpQLSdf75XSHzcACDi9eiiUWjVSFuUzTT4mwTPNBWr1kIH7z0H69Q/viewform?usp=send_form > >> > >> Quick hint for the new members of our team: each year we > > run a contest > >> for the most fitting slogan to put > >> on > > the Auton Lab birthday cake. The bids are judged by a few wise men > >> and the author of the winning slogan > >> feels the pride of > > seeing it on the cake and basks in glory forever. > >> > >> > > Cheers! > >> Artur > >> > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From awd at cs.cmu.edu Mon Oct 3 11:54:51 2016 From: awd at cs.cmu.edu (Artur Dubrawski) Date: Mon, 3 Oct 2016 11:54:51 -0400 Subject: Annual Auton Lab Picnic: Saturday October 1st at Schenley Park In-Reply-To: <942587fbd979ccea27143628ce8ee5a3.squirrel@webmail.andrew.cmu.edu> References: <07b2034c-153b-d4e3-5e46-8e1b65f8a2ac@cs.cmu.edu> <942587fbd979ccea27143628ce8ee5a3.squirrel@webmail.andrew.cmu.edu> Message-ID: <54c94770-e719-ac60-aef3-26ad110e7be8@cs.cmu.edu> Jessie and the Entertainment Committee, Thank you so much for organizing a meticulous and very enjoyable event! 
This might have been the most attended Lab Picnic in Auton history, so feeding and entertaining everyone was not a small feat. Way to go! Thanks Artur On 10/3/2016 11:30 AM, jieshic at andrew.cmu.edu wrote: > > Hi Everyone, > > Thanks for attending the picnic last Saturday. Attached is the picture > for the cake with winning slogan by Kyle. > > Since we received a lot of questions about the food vendors, here is > the information. > > * BBQ food from Double Wide Grill(2339 E Carson St, Pittsburgh, PA > 15203). > * Taro cake from Pink Box Bakery Cafe (2104 Murray Ave, Pittsburgh, > PA 15217). > > Thank you again for your support. > > Best, > Auton Lab Entertainment Committee > -------------- next part -------------- An HTML attachment was scrubbed... URL: From kandasamy at cmu.edu Mon Oct 3 11:56:27 2016 From: kandasamy at cmu.edu (Kirthevasan Kandasamy) Date: Mon, 3 Oct 2016 11:56:27 -0400 Subject: Annual Auton Lab Picnic: Saturday October 1st at Schenley Park In-Reply-To: References: <07b2034c-153b-d4e3-5e46-8e1b65f8a2ac@cs.cmu.edu> <942587fbd979ccea27143628ce8ee5a3.squirrel@webmail.andrew.cmu.edu> <023e01d21d8b$dd6aa860$983ff920$@andrew.cmu.edu> Message-ID: Thanks for organising this Jieshi, Ben, Maria and everyone else who chipped in. This was one of the best picnics we've had! On Mon, Oct 3, 2016 at 11:47 AM, Matt Barnes wrote: > Seconded. Thanks for organizing -- I've never eaten so many ribs. > > On Mon, Oct 3, 2016 at 11:37 AM Karen Widmaier wrote: > >> Hi Jieshi, >> >> Thank you for all your hard work pulling it all together. >> >> >> >> A special thanks to you and Maria and Ben. 
>> >> >> >> Karen >> >> >> >> >> >> *From:* Autonlab-users [mailto:autonlab-users-bounces at autonlab.org] *On >> Behalf Of *jieshic at andrew.cmu.edu >> *Sent:* Monday, October 03, 2016 11:30 AM >> *To:* users at autonlab.org >> *Cc:* Chirag Nagpal; Luna Yang >> *Subject:* Re: Annual Auton Lab Picnic: Saturday October 1st at Schenley >> Park >> >> >> >> Hi Everyone, >> >> Thanks for attending the picnic last Saturday. Attached is the picture >> for the cake with winning slogan by Kyle. >> >> Since we received a lot of questions about the food vendors, here is the >> information. >> >> - BBQ food from Double Wide Grill (2339 E Carson St, Pittsburgh, PA >> 15203). >> - Taro cake from Pink Box Bakery Cafe (2104 Murray Ave, Pittsburgh, >> PA 15217). >> >> Thank you again for your support. >> >> Best, >> Auton Lab Entertainment Committee >> >> > >> > Hi Everyone, >> > Tomorrow's picnic will start at 11:30am. The shelter >> > is Vietnam Veterans Pavilion at the Schenley Park. We'll have lunch with >> > BBQ food, a large Taro cake, and beers & wines (pls bring your ID if >> > you'll drink :P ). >> > We have prepared some long games and you are >> > also welcome to bring your bikes, games, cameras, etc.. >> > BTW, here's >> > the weather forecast for your reference. "Showers in the morning, >> > then partly cloudy in the afternoon. Thunder possible. High 73F. Winds >> SSE >> > at 5 to 10 mph. Chance of rain 50%." (source: www.weather.com) >> > >> > Pls feel free to let me know if you have any question. >> > >> > Looking forward to seeing you tomorrow! >> > >> > >> > >> > Cheers, >> > Jessie >> >> Dear Autonians, >> >> >> >> We will be celebrating the 23rd >> > anniversary of the Lab this year at a >> >> nearby location. >> >> >> > We have reserved Vietnam Veterans Pavilion at the Schenley Park: >> >> >> > >> >> >> > https://www.google.com/maps/place/Vietnam+Veterans+ >> Pavilion/@40.434036,-79.9453924,17z/data=!4m12!1m6! 
>> 3m5!1s0x8834f18b186c4403:0xd24a0faef8f7e126!2sSchenley+ >> Park!8m2!3d40.4318311!4d-79.9462078!3m4!1s0x0: >> 0x4bcca3ccfd92c919!8m2!3d40.4338281!4d-79.9439894 >> >> >> >> We have it booked from 11am till 9pm, we will have lunch >> > food, and the >> >> Auton Lab Entertainment >> >> Committee led by >> > our CEO (Chief Entertainment Officer) Jessie is working >> >> on the - >> > you've guessed it - program >> >> of entertainment and activities. >> >> >> >> Would you please go to this google doc form to rsvp and >> > provide >> >> information useful for planning: >> >> >> >> >> > https://docs.google.com/forms/d/e/1FAIpQLSdf75XSHzcACDi9eiiUWjVS >> FuUzTT4mwTPNBWr1kIH7z0H69Q/viewform?usp=send_form >> >> >> >> Quick hint for the new members of our team: each year we >> > run a contest >> >> for the most fitting slogan to put >> >> on >> > the Auton Lab birthday cake. The bids are judged by a few wise men >> >> and the author of the winning slogan >> >> feels the pride of >> > seeing it on the cake and basks in glory forever. >> >> >> >> >> > Cheers! >> >> Artur >> >> >> > >> > >> > >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sray at cs.cmu.edu Mon Oct 3 14:57:17 2016 From: sray at cs.cmu.edu (Saswati Ray) Date: Mon, 3 Oct 2016 14:57:17 -0400 Subject: Annual Auton Lab Picnic: Saturday October 1st at Schenley Park In-Reply-To: <942587fbd979ccea27143628ce8ee5a3.squirrel@webmail.andrew.cmu.edu> References: <07b2034c-153b-d4e3-5e46-8e1b65f8a2ac@cs.cmu.edu> <942587fbd979ccea27143628ce8ee5a3.squirrel@webmail.andrew.cmu.edu> Message-ID: <0447e89e-8649-e950-59a6-4c4951c99667@cs.cmu.edu> Wonderful picnic. Very delicious cake! Thank you all, Saswati On 10/03/2016 11:30 AM, jieshic at andrew.cmu.edu wrote: > > Hi Everyone, > > Thanks for attending the picnic last Saturday. Attached is the picture > for the cake with winning slogan by Kyle. > > Since we received a lot of questions about the food vendors, here is > the information. 
> > * BBQ food from Double Wide Grill(2339 E Carson St, Pittsburgh, PA > 15203). > * Taro cake from Pink Box Bakery Cafe (2104 Murray Ave, Pittsburgh, > PA 15217). > > Thank you again for your support. > > Best, > Auton Lab Entertainment Committee > > > > > Hi Everyone, > > Tomorrow's picnic will start at 11:30am. The shelter > > is Vietnam Veterans Pavilion at the Schenley Park. We'll have lunch with > > BBQ food, a large Taro cake, and beers & wines (pls bring your ID if > > you'll drink :P ). > > We have prepared some long games and you are > > also welcome to bring your bikes, games, cameras, etc.. > > BTW, here's > > the weather forecast for your reference. "Showers in the morning, > > then partly cloudy in the afternoon. Thunder possible. High 73F. > Winds SSE > > at 5 to 10 mph. Chance of rain 50%." (source: www.weather.com) > > > > Pls feel free to let me know if you have any question. > > > > Looking forward to seeing you tomorrow! > > > > > > > > Cheers, > > Jessie > >> Dear Autonians, > >> > >> We will be celebrating the 23rd > > anniversary of the Lab this year at a > >> nearby location. > >> > > We have reserved Vietnam Veterans Pavilion at the Schenley Park: > >> > > > >> > > > https://www.google.com/maps/place/Vietnam+Veterans+Pavilion/@40.434036,-79.9453924,17z/data=!4m12!1m6!3m5!1s0x8834f18b186c4403:0xd24a0faef8f7e126!2sSchenley+Park!8m2!3d40.4318311!4d-79.9462078!3m4!1s0x0:0x4bcca3ccfd92c919!8m2!3d40.4338281!4d-79.9439894 > >> > >> We have it booked from 11am till 9pm, we will have lunch > > food, and the > >> Auton Lab Entertainment > >> Committee led by > > our CEO (Chief Entertainment Officer) Jessie is working > >> on the - > > you've guessed it - program > >> of entertainment and activities. 
> >> > >> Would you please go to this google doc form to rsvp and > > provide > >> information useful for planning: > >> > >> > > > https://docs.google.com/forms/d/e/1FAIpQLSdf75XSHzcACDi9eiiUWjVSFuUzTT4mwTPNBWr1kIH7z0H69Q/viewform?usp=send_form > >> > >> Quick hint for the new members of our team: each year we > > run a contest > >> for the most fitting slogan to put > >> on > > the Auton Lab birthday cake. The bids are judged by a few wise men > >> and the author of the winning slogan > >> feels the pride of > > seeing it on the cake and basks in glory forever. > >> > >> > > Cheers! > >> Artur > >> > > > > > > > -- Saswati Ray Senior Research Programmer Carnegie Mellon University - Auton Lab Newell-Simon Hall Room 3115, Pittsburgh PA 15213 Phone: 412-268-1238 -------------- next part -------------- An HTML attachment was scrubbed... URL: From junieroliva at gmail.com Mon Oct 3 14:59:47 2016 From: junieroliva at gmail.com (Junier Oliva) Date: Mon, 3 Oct 2016 14:59:47 -0400 Subject: Annual Auton Lab Picnic: Saturday October 1st at Schenley Park In-Reply-To: <0447e89e-8649-e950-59a6-4c4951c99667@cs.cmu.edu> References: <07b2034c-153b-d4e3-5e46-8e1b65f8a2ac@cs.cmu.edu> <942587fbd979ccea27143628ce8ee5a3.squirrel@webmail.andrew.cmu.edu> <0447e89e-8649-e950-59a6-4c4951c99667@cs.cmu.edu> Message-ID: Everything was great (especially those ribs!) :D Thanks! Junier On Mon, Oct 3, 2016 at 2:57 PM, Saswati Ray wrote: > Wonderful picnic. > > Very delicious cake! > > Thank you all, > Saswati > > > On 10/03/2016 11:30 AM, jieshic at andrew.cmu.edu wrote: > > Hi Everyone, > > Thanks for attending the picnic last Saturday. Attached is the picture for > the cake with winning slogan by Kyle. > > Since we received a lot of questions about the food vendors, here is the > information. > > - BBQ food from Double Wide Grill (2339 E Carson St, Pittsburgh, PA > 15203). > - Taro cake from Pink Box Bakery Cafe (2104 Murray Ave, Pittsburgh, PA > 15217). > > Thank you again for your support. 
> > Best, > Auton Lab Entertainment Committee > > > > > Hi Everyone, > > Tomorrow's picnic will start at 11:30am. The shelter > > is Vietnam Veterans Pavilion at the Schenley Park. We'll have lunch with > > BBQ food, a large Taro cake, and beers & wines (pls bring your ID if > > you'll drink :P ). > > We have prepared some long games and you are > > also welcome to bring your bikes, games, cameras, etc.. > > BTW, here's > > the weather forecast for your reference. "Showers in the morning, > > then partly cloudy in the afternoon. Thunder possible. High 73F. Winds > SSE > > at 5 to 10 mph. Chance of rain 50%." (source: www.weather.com) > > > > Pls feel free to let me know if you have any question. > > > > Looking forward to seeing you tomorrow! > > > > > > > > Cheers, > > Jessie > >> Dear Autonians, > >> > >> We will be celebrating the 23rd > > anniversary of the Lab this year at a > >> nearby location. > >> > > We have reserved Vietnam Veterans Pavilion at the Schenley Park: > >> > > > >> > > https://www.google.com/maps/place/Vietnam+Veterans+ > Pavilion/@40.434036,-79.9453924,17z/data=!4m12!1m6! > 3m5!1s0x8834f18b186c4403:0xd24a0faef8f7e126!2sSchenley+ > Park!8m2!3d40.4318311!4d-79.9462078!3m4!1s0x0:0x4bcca3ccfd92c919!8m2!3d40. > 4338281!4d-79.9439894 > >> > >> We have it booked from 11am till 9pm, we will have lunch > > food, and the > >> Auton Lab Entertainment > >> Committee led by > > our CEO (Chief Entertainment Officer) Jessie is working > >> on the - > > you've guessed it - program > >> of entertainment and activities. > >> > >> Would you please go to this google doc form to rsvp and > > provide > >> information useful for planning: > >> > >> > > https://docs.google.com/forms/d/e/1FAIpQLSdf75XSHzcACDi9eiiUWjVS > FuUzTT4mwTPNBWr1kIH7z0H69Q/viewform?usp=send_form > >> > >> Quick hint for the new members of our team: each year we > > run a contest > >> for the most fitting slogan to put > >> on > > the Auton Lab birthday cake. 
The bids are judged by a few wise men > >> and the author of the winning slogan > >> feels the pride of > > seeing it on the cake and basks in glory forever. > >> > >> > > Cheers! > >> Artur > >> > > > > > > > > > > -- > Saswati Ray > Senior Research Programmer > Carnegie Mellon University - Auton Lab > Newell-Simon Hall Room 3115, Pittsburgh PA 15213 > Phone: 412-268-1238 > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From db78349 at gmail.com Mon Oct 3 17:33:33 2016 From: db78349 at gmail.com (dave booni) Date: Mon, 3 Oct 2016 17:33:33 -0400 Subject: Request for Submission of Website Content Message-ID: Hello Fellow Autonians- First, so that you don't worry, this is David Ba. , new guy at the lab, room 3119 NSH - if you have any questions about the authenticity of this email, feel free to pay me a visit. I am partially responsible for assembling the latest iteration of the Auton Lab website. In an effort to fill the website with up-to-date, relevant content, we are asking the members of the lab to provide materials relating to their work which they have the ability to publicly share. For instance, if you submitted any papers which are in the public domain, a link to said paper would be appreciated. If you presented at a conference, we would like to know - if there is a video of your presentation, a pointer (link) to said presentation would be all the better. Any additional content that you would like to share, including appropriate content from your personal website, is also welcome. Thank you all for your cooperation. We look forward to scrap-booking the souvenirs of your excellent output. Feel free to email me with any questions. -Sincerely David B. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From dbayani at andrew.cmu.edu Wed Oct 5 09:02:25 2016 From: dbayani at andrew.cmu.edu (David Bayani) Date: Wed, 5 Oct 2016 06:02:25 -0700 Subject: Request for Submission of Website Content (Update) Message-ID: Hello Fellow Autonians- First, so that you don't worry, this is David Ba. , new guy at the lab, room 3119 NSH - if you have any questions about the authenticity of this email, feel free to pay me a visit. I am partially responsible for assembling the latest iteration of the Auton Lab website. In an effort to fill the website with up-to-date, relevant content, we are asking the members of the lab to provide materials relating to their work which they have the ability to publicly share. For instance, if you submitted any papers which are in the public domain, a link to said paper would be appreciated. If you presented at a conference, we would like to know - if there is a video of your presentation, a pointer (link) to said presentation would be all the better. Any additional content that you would like to share, including appropriate content from your personal website, is also welcome. We ask that some material be provided within the next two weeks. Naturally, we will accept submissions later than this, but some representative content from each lab member would be appreciated within this time frame. This email was previously sent from my other account, db78349 at gmail.com, which was quite possibly caught in the spam filter of some lab members. As an addendum to the previous email, we ask that you respond indicating that this message was received. We understand that gathering the requested materials may take some time, which is a rationed resource given the busy schedules of our lab members; with that in mind, we need to distinguish between cases of "received message, but gathering materials" and "failed to receive message". Please direct responses to db78349 at gmail.com. 
While the methods of constructing a website are largely solved research questions, understand that autonlab.org is an important component of the Auton Lab's public face. It provides an overview of the laboratory's focus and achievements to the general public, interested students, and potential funding organizations. A lack of current and sufficient content on the website would fail to reflect the otherwise exceptional work produced by the members of this laboratory. Thank you all for your cooperation. We look forward to scrap-booking the souvenirs of your excellent output. Feel free to email me with any questions. -Sincerely David B. -------------- next part -------------- An HTML attachment was scrubbed... URL: From predragp at cs.cmu.edu Thu Oct 6 15:00:43 2016 From: predragp at cs.cmu.edu (Predrag Punosevac) Date: Thu, 06 Oct 2016 15:00:43 -0400 Subject: Auton Lab Intranet is now functional! Message-ID: <20161006190043.Fpozmq95S%predragp@cs.cmu.edu> Dear Autonians, You should have received, or will shortly receive, the initial password for the new Auton Lab Intranet. Make sure you check your spam mailbox before reporting a problem. You can use that password to log into the new Intranet: http://www.autonlab.org/start?do=login&sectok=f4aa30200a856d99d7410faad4a007db Feel free to change the password to whatever you fancy upon first login. Our new webpage is DokuWiki based, and anybody with a password will be able to edit internal content. Only admins can edit external content or create new users. The Login tab is located under the Tools tab in the far upper right corner. We are working on putting it in a more prominent place. We (Simon, David, and I) are working on resurrecting old content, but any help will be appreciated.
Predrag From predragp at cs.cmu.edu Fri Oct 7 20:52:00 2016 From: predragp at cs.cmu.edu (Predrag Punosevac) Date: Fri, 07 Oct 2016 20:52:00 -0400 Subject: GPU2 status update Message-ID: <20161008005200.0cb1MfHlv%predragp@cs.cmu.edu> Dear Autonians, I got out of the machine room 45 minutes ago, after spending an hour trying to boot GPU3 off the USB drive. For some reason it didn't work. No big deal, as the machine has a DVD drive, which is much easier to boot from. However, it will have to wait until Monday. I am sorry for that. Predrag From awd at cs.cmu.edu Mon Oct 10 08:35:02 2016 From: awd at cs.cmu.edu (Artur Dubrawski) Date: Mon, 10 Oct 2016 08:35:02 -0400 Subject: "The closest thing to getting to go to Hogwarts" In-Reply-To: <6d8ebc95-4f4c-21f7-ce99-ddf109af3d55@cs.cmu.edu> References: <6d8ebc95-4f4c-21f7-ce99-ddf109af3d55@cs.cmu.edu> Message-ID: <2b8a82d9-4d08-4eba-27c9-f33fbbb32ed3@cs.cmu.edu> A must see :) http://www.cbsnews.com/news/60-minutes-artificial-intelligence-real-life-applications/ From predragp at cs.cmu.edu Mon Oct 10 10:27:26 2016 From: predragp at cs.cmu.edu (predragp at cs.cmu.edu) Date: Mon, 10 Oct 2016 14:27:26 +0000 (UTC) Subject: Fwd: RESCHEDULED - October 25th 2016: [Power Outage] SCS Wean Hall machine room In-Reply-To: References: <0aee22ff-fd00-56d4-4563-f912cfe06184@cs.cmu.edu> <4dff3aaa-a607-1682-0f69-b604579731c3@cs.cmu.edu> Message-ID: <7B1F10ED8F1045F6.cc1488a9-74dc-4d22-92e2-c3ffc206a072@mail.outlook.com> Get Outlook for Android ---------- Forwarded message ---------- From: "Edward Walter" Date: Mon, Oct 10, 2016 at 9:27 AM -0400 Subject: RESCHEDULED - October 25th 2016: [Power Outage] SCS Wean Hall machine room To: "Edward J Walter" The partial power outage for the SCS Wean Hall machine room has been rescheduled for October 25th. We expect this work to take less than 24 hours. The outage may run into 48 hours if the electrical contractor encounters problems related to the planned maintenance tasks.
Please contact the SCS Help Desk at x8-4231 or send mail to help at cs.cmu.edu with any questions or concerns regarding this maintenance period. Thank you for your attention, SCS Help Desk On 10/03/2016 07:28 AM, Edward Walter wrote: > The planned power outage for the SCS Wean Hall machine room has been > postponed and will NOT take place on October 4th. We are coordinating > with the electrical contractors to get the work re-scheduled. We will > let you know as soon as we have a new date for this power outage. > > Thank you. > > SCS Help Desk > > On 09/12/2016 08:01 AM, Edward Walter wrote: >> SCS Computing Facilities and FMS are planning a partial power outage in >> the SCS Wean Hall machine room. We expect this work to begin on Oct >> 4th, 2016 and to take less than 24 hours. The outage may run into 48 >> hours in the event that the electrical contractor encounters something >> unexpected. >> >> The following servers or computational clusters will be affected by this >> power outage: >> >> Affected clusters: >> ACTR.HPC1.CS.CMU.EDU >> AUTON >> COMA.HPC1.CS.CMU.EDU >> CORTEX.ML.CMU.EDU >> LATEDAYS.ANDREW.CMU.EDU >> PSYCH-O.HPC1.CS.CMU.EDU >> ROCKS.IS.CS.CMU.EDU >> WORKHORSE.LTI.CS.CMU.EDU >> YODA.GRAPHICS.CS.CMU.EDU >> >> >> Affected servers: >> OMEPSLID.COMPBIO >> SLIF.COMPBIO >> PACIFIC.DB >> GPUSERVER.PERCEPTION >> GPUSERVER2.PERCEPTION >> GPUSERVER3.PERCEPTION >> GPUSERVER5.PERCEPTION >> GPUSERVER6.PERCEPTION >> GPUSERVER7.PERCEPTION >> DENVER.LTI >> LOR.LTI >> MIAMI.LTI >> SASKIA.ML >> MARTEN.ML >> JAN.ML >> ARNOUT.ML >> LYSBET.ML >> FLORIS.ML >> >> >> Please contact the SCS Help Desk at x8-4231 or send mail to >> help at cs.cmu.edu with any questions or concerns regarding this >> maintenance period. >> >> Thank you for your attention, >> >> SCS Help Desk -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From predragp at cs.cmu.edu Wed Oct 12 18:26:58 2016 From: predragp at cs.cmu.edu (Predrag Punosevac) Date: Wed, 12 Oct 2016 18:26:58 -0400 Subject: GPU3 is "configured" Message-ID: <20161012222658.bee4yVvYw%predragp@cs.cmu.edu> Dear Autonians, GPU3 is "configured". Namely, you can log into it and all packages are installed. However, I couldn't get the NVIDIA-provided CUDA driver to recognize the GPU cards. They appear to be properly installed from the hardware point of view, and you can list them with lshw -class display root at gpu3$ lshw -class display *-display UNCLAIMED description: VGA compatible controller product: NVIDIA Corporation vendor: NVIDIA Corporation physical id: 0 bus info: pci at 0000:02:00.0 version: a1 width: 64 bits clock: 33MHz capabilities: pm msi pciexpress vga_controller cap_list configuration: latency=0 resources: iomemory:383f0-383ef iomemory:383f0-383ef memory:cf000000-cfffffff memory:383fe0000000-383fefffffff memory:383ff0000000-383ff1ffffff ioport:6000(size=128) memory:d0000000-d007ffff *-display UNCLAIMED description: VGA compatible controller product: NVIDIA Corporation vendor: NVIDIA Corporation physical id: 0 bus info: pci at 0000:03:00.0 version: a1 width: 64 bits clock: 33MHz capabilities: pm msi pciexpress vga_controller cap_list configuration: latency=0 resources: iomemory:383f0-383ef iomemory:383f0-383ef memory:cd000000-cdffffff memory:383fc0000000-383fcfffffff memory:383fd0000000-383fd1ffffff ioport:5000(size=128) memory:ce000000-ce07ffff *-display description: VGA compatible controller product: ASPEED Graphics Family vendor: ASPEED Technology, Inc.
physical id: 0 bus info: pci at 0000:06:00.0 version: 30 width: 32 bits clock: 33MHz capabilities: pm msi vga_controller bus_master cap_list rom configuration: driver=ast latency=0 resources: irq:19 memory:cb000000-cbffffff memory:cc000000-cc01ffff ioport:4000(size=128) *-display UNCLAIMED description: VGA compatible controller product: NVIDIA Corporation vendor: NVIDIA Corporation physical id: 0 bus info: pci at 0000:82:00.0 version: a1 width: 64 bits clock: 33MHz capabilities: pm msi pciexpress vga_controller cap_list configuration: latency=0 resources: iomemory:387f0-387ef iomemory:387f0-387ef memory:fa000000-faffffff memory:387fe0000000-387fefffffff memory:387ff0000000-387ff1ffffff ioport:e000(size=128) memory:fb000000-fb07ffff *-display UNCLAIMED description: VGA compatible controller product: NVIDIA Corporation vendor: NVIDIA Corporation physical id: 0 bus info: pci at 0000:83:00.0 version: a1 width: 64 bits clock: 33MHz capabilities: pm msi pciexpress vga_controller cap_list configuration: latency=0 resources: iomemory:387f0-387ef iomemory:387f0-387ef memory:f8000000-f8ffffff memory:387fc0000000-387fcfffffff memory:387fd0000000-387fd1ffffff ioport:d000(size=128) memory:f9000000-f907ffff However, what scares the hell out of me is that I don't see the NVIDIA driver loaded (lsmod | grep nvidia) and the device nodes /dev/nvidia* are not created. I am guessing I just missed some trivial step during the CUDA installation, which is quite involved. I am unfortunately too tired to debug this tonight. Predrag From predragp at cs.cmu.edu Wed Oct 12 22:23:32 2016 From: predragp at cs.cmu.edu (Predrag Punosevac) Date: Wed, 12 Oct 2016 22:23:32 -0400 Subject: GPU3 is "configured" In-Reply-To: References: <20161012222658.bee4yVvYw%predragp@cs.cmu.edu> Message-ID: <20161013022332.Coy6tY6j7%predragp@cs.cmu.edu> Arne Suppe wrote: > Hi Predrag, > Don't know if this applies to you, but I just built a machine with a GTX1080 which has the same PASCAL architecture as the Titan.
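The symptoms described above (controllers marked UNCLAIMED in lshw, no nvidia module in lsmod, no /dev/nvidia* device nodes) all point the same way: the kernel driver never bound to the cards. As an illustrative aside, the first symptom can be checked mechanically from captured lshw output. This is only a sketch; the sample text is abridged from the listing above, with the "@" that the archive renders as " at " restored:

```python
import re

def unclaimed_displays(lshw_text):
    """Return PCI bus IDs of display controllers that lshw reports as
    UNCLAIMED, i.e. devices that no kernel driver has bound to."""
    results = []
    for chunk in re.split(r"\*-display", lshw_text)[1:]:
        if chunk.lstrip().startswith("UNCLAIMED"):
            m = re.search(r"bus info:\s*(pci@\S+)", chunk)
            results.append(m.group(1) if m else "?")
    return results

sample = """
  *-display UNCLAIMED
       description: VGA compatible controller
       product: NVIDIA Corporation
       bus info: pci@0000:02:00.0
  *-display
       description: VGA compatible controller
       product: ASPEED Graphics Family
       bus info: pci@0000:06:00.0
       configuration: driver=ast
"""
print(unclaimed_displays(sample))  # → ['pci@0000:02:00.0']
```

Note that the onboard ASPEED VGA is claimed (configuration shows driver=ast), so only the NVIDIA devices are flagged, matching the situation in the message.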
After installing CUDA 8, I still found I needed to install the latest driver off of the NVIDIA web site to get the card recognized. Right now, I am running 367.44. > > Arne Arne, Thank you so much for this e-mail. Yes, it is the damn PASCAL architecture; I see lots of people complaining about it on the forums. I downloaded and installed the driver from http://www.nvidia.com/content/DriverDownload-March2009/confirmation.php?url=/XFree86/Linux-x86_64/367.57/NVIDIA-Linux-x86_64-367.57.run&lang=us&type=GeForce That seems to have made a real difference. Check out these beautiful outputs: root at gpu3$ ls nvidia* nvidia0 nvidia1 nvidia2 nvidia3 nvidiactl nvidia-uvm nvidia-uvm-tools root at gpu3$ lspci | grep -i nvidia 02:00.0 VGA compatible controller: NVIDIA Corporation Device 1b00 (rev a1) 02:00.1 Audio device: NVIDIA Corporation Device 10ef (rev a1) 03:00.0 VGA compatible controller: NVIDIA Corporation Device 1b00 (rev a1) 03:00.1 Audio device: NVIDIA Corporation Device 10ef (rev a1) 82:00.0 VGA compatible controller: NVIDIA Corporation Device 1b00 (rev a1) 82:00.1 Audio device: NVIDIA Corporation Device 10ef (rev a1) 83:00.0 VGA compatible controller: NVIDIA Corporation Device 1b00 (rev a1) 83:00.1 Audio device: NVIDIA Corporation Device 10ef (rev a1) root at gpu3$ ls /proc/driver nvidia nvidia-uvm nvram rtc root at gpu3$ lsmod |grep nvidia nvidia_uvm 738901 0 nvidia_drm 43405 0 nvidia_modeset 764432 1 nvidia_drm nvidia 11492947 2 nvidia_modeset,nvidia_uvm drm_kms_helper 125056 2 ast,nvidia_drm drm 349210 5 ast,ttm,drm_kms_helper,nvidia_drm i2c_core 40582 7 ast,drm,igb,i2c_i801,drm_kms_helper,i2c_algo_bit,nvidia root at gpu3$ nvidia-smi Wed Oct 12 22:03:27 2016 +-----------------------------------------------------------------------------+ | NVIDIA-SMI 367.57 Driver Version: 367.57 | |-------------------------------+----------------------+----------------------+ | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr.
ECC | | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | |===============================+======================+======================| | 0 TITAN X (Pascal) Off | 0000:02:00.0 Off | N/A | | 23% 32C P0 56W / 250W | 0MiB / 12189MiB | 0% Default | +-------------------------------+----------------------+----------------------+ | 1 TITAN X (Pascal) Off | 0000:03:00.0 Off | N/A | | 23% 36C P0 57W / 250W | 0MiB / 12189MiB | 0% Default | +-------------------------------+----------------------+----------------------+ | 2 TITAN X (Pascal) Off | 0000:82:00.0 Off | N/A | | 23% 35C P0 57W / 250W | 0MiB / 12189MiB | 0% Default | +-------------------------------+----------------------+----------------------+ | 3 TITAN X (Pascal) Off | 0000:83:00.0 Off | N/A | | 0% 35C P0 56W / 250W | 0MiB / 12189MiB | 0% Default | +-------------------------------+----------------------+----------------------+ +-----------------------------------------------------------------------------+ | Processes: GPU Memory | | GPU PID Type Process name Usage | |=============================================================================| | No running processes found | +-----------------------------------------------------------------------------+ /usr/local/cuda/extras/demo_suite/deviceQuery Alignment requirement for Surfaces: Yes Device has ECC support: Disabled Device supports Unified Addressing (UVA): Yes Device PCI Domain ID / Bus ID / location ID: 0 / 131 / 0 Compute Mode: < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) > > Peer access from TITAN X (Pascal) (GPU0) -> TITAN X (Pascal) (GPU1) : Yes > Peer access from TITAN X (Pascal) (GPU0) -> TITAN X (Pascal) (GPU2) : No > Peer access from TITAN X (Pascal) (GPU0) -> TITAN X (Pascal) (GPU3) : No > Peer access from TITAN X (Pascal) (GPU1) -> TITAN X (Pascal) (GPU0) : Yes > Peer access from TITAN X (Pascal) (GPU1) -> TITAN X (Pascal) (GPU2) : No > Peer access from TITAN X (Pascal) (GPU1) -> TITAN X 
(Pascal) (GPU3) : No > Peer access from TITAN X (Pascal) (GPU2) -> TITAN X (Pascal) (GPU0) : No > Peer access from TITAN X (Pascal) (GPU2) -> TITAN X (Pascal) (GPU1) : No > Peer access from TITAN X (Pascal) (GPU2) -> TITAN X (Pascal) (GPU3) : Yes > Peer access from TITAN X (Pascal) (GPU3) -> TITAN X (Pascal) (GPU0) : No > Peer access from TITAN X (Pascal) (GPU3) -> TITAN X (Pascal) (GPU1) : No > Peer access from TITAN X (Pascal) (GPU3) -> TITAN X (Pascal) (GPU2) : Yes deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 4, Device0 = TITAN X (Pascal), Device1 = TITAN X (Pascal), Device2 = TITAN X (Pascal), Device3 = TITAN X (Pascal) Result = PASS Now not everything is rosy root at gpu3$ cd ~/NVIDIA_CUDA-8.0_Samples/5_Simulations/nbody root at gpu3$ make >>> WARNING - libGL.so not found, refer to CUDA Getting Started Guide for how to find and install them. <<< >>> WARNING - libGLU.so not found, refer to CUDA Getting Started Guide for how to find and install them. <<< >>> WARNING - libX11.so not found, refer to CUDA Getting Started Guide for how to find and install them. <<< even though those are installed. For example root at gpu3$ yum whatprovides */libX11.so libX11-devel-1.6.3-2.el7.i686 : Development files for libX11 Repo : core Matched from: Filename : /usr/lib/libX11.so also mesa-libGLU-devel mesa-libGL-devel xorg-x11-drv-nvidia-devel but root at gpu3$ yum -y install mesa-libGLU-devel mesa-libGL-devel xorg-x11-drv-nvidia-devel Package mesa-libGLU-devel-9.0.0-4.el7.x86_64 already installed and latest version Package mesa-libGL-devel-10.6.5-3.20150824.el7.x86_64 already installed and latest version Package 1:xorg-x11-drv-nvidia-devel-367.48-1.el7.x86_64 already installed and latest version Also from MATLAB gpuDevice hangs. So we still don't have a working installation. Any help would be appreciated. Best, Predrag P.S. Once we have a working installation we can think of installing Caffe and TensorFlow. 
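The peer-access pattern in the deviceQuery output above is regular: GPUs 0 and 1 (bus 02/03) can reach each other, GPUs 2 and 3 (bus 82/83) can reach each other, and nothing crosses the middle, most likely because each pair hangs off a different CPU's PCIe root complex on this dual-socket board. A small illustrative parser for those lines (not part of the CUDA tooling, just a sketch):

```python
import re

def parse_peer_access(lines):
    """Build {(src, dst): bool} from deviceQuery 'Peer access' lines."""
    pat = re.compile(r"\(GPU(\d)\) -> .*?\(GPU(\d)\) : (Yes|No)")
    access = {}
    for line in lines:
        m = pat.search(line)
        if m:
            access[(int(m.group(1)), int(m.group(2)))] = m.group(3) == "Yes"
    return access

sample = [
    "Peer access from TITAN X (Pascal) (GPU0) -> TITAN X (Pascal) (GPU1) : Yes",
    "Peer access from TITAN X (Pascal) (GPU0) -> TITAN X (Pascal) (GPU2) : No",
    "Peer access from TITAN X (Pascal) (GPU2) -> TITAN X (Pascal) (GPU3) : Yes",
]
access = parse_peer_access(sample)
print(access[(0, 1)], access[(0, 2)])  # → True False
```

The practical upshot is that a multi-GPU job wanting fast device-to-device copies should be pinned to the pair 0/1 or the pair 2/3.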
For now we have to see why the things are not working. > > > On Oct 12, 2016, at 6:26 PM, Predrag Punosevac wrote: > > > > [...]
> > Predrag From suppe at andrew.cmu.edu Wed Oct 12 23:26:48 2016 From: suppe at andrew.cmu.edu (Arne Suppe) Date: Wed, 12 Oct 2016 23:26:48 -0400 Subject: GPU3 is "configured" In-Reply-To: <20161013022332.Coy6tY6j7%predragp@cs.cmu.edu> References: <20161012222658.bee4yVvYw%predragp@cs.cmu.edu> <20161013022332.Coy6tY6j7%predragp@cs.cmu.edu> Message-ID: <22C1E88D-D614-4509-B0EC-3EB2357C0C3A@andrew.cmu.edu> Hmm - I don't use MATLAB for deep learning, but gpuDevice also hangs on my computer with R2016a. I was able to compile the matrixMul example in the CUDA samples and run it on gpu3, so I think the build environment is probably all set. As for the OpenGL issue, I think it's possibly a problem with their build script findgl.mk, which is not familiar with Springdale OS. The demo_suite directory has a precompiled nbody binary you may try, but I suspect most users will not need graphics. Arne > On Oct 12, 2016, at 10:23 PM, Predrag Punosevac wrote: > > Arne Suppe wrote: > >> Hi Predrag, >> Don't know if this applies to you, but I just built a machine with a GTX1080 which has the same PASCAL architecture as the Titan. After installing CUDA 8, I still found I needed to install the latest driver off of the NVIDIA web site to get the card recognized. Right now, I am running 367.44. >> >> Arne > > Arne, > > Thank you so much for this e-mail. Yes, it is the damn PASCAL architecture; I > see lots of people complaining about it on the forums. I downloaded and > installed the driver from > > http://www.nvidia.com/content/DriverDownload-March2009/confirmation.php?url=/XFree86/Linux-x86_64/367.57/NVIDIA-Linux-x86_64-367.57.run&lang=us&type=GeForce > > That seems to have made a real difference.
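Arne's findgl.mk theory fits the evidence, and it also fits Predrag's later remark that Springdale kept its RHEL branding before 7.2: the samples' library probe picks its search paths based on the distribution it detects, so an unrecognized distro ends up with no search paths and a spurious "not found", even with the devel packages installed. A loose sketch of that failure mode (the table below is illustrative; it is not the actual contents of findgl.mk):

```python
# Map detected distro -> GL library directories, the way a makefile-style
# probe might. An unrecognized distro falls through to an empty list, so
# the probe reports libGL.so "not found" although the library is installed.
GL_SEARCH_DIRS = {
    "ubuntu": ["/usr/lib/x86_64-linux-gnu"],
    "fedora": ["/usr/lib64"],
    "rhel":   ["/usr/lib64", "/usr/lib64/nvidia"],
    "centos": ["/usr/lib64", "/usr/lib64/nvidia"],
}

def gl_search_dirs(distro):
    """Unknown distros get an empty search list: the failure mode above."""
    return GL_SEARCH_DIRS.get(distro.lower(), [])

print(gl_search_dirs("CentOS"))     # → ['/usr/lib64', '/usr/lib64/nvidia']
print(gl_search_dirs("Springdale")) # → []
```

While Springdale still reported itself as RHEL, a probe like this would have matched; after the rebranding it falls through to the empty case.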
Check out these beautiful outputs. > [...] >> [...] >>> On Oct 12, 2016, at 6:26 PM, Predrag Punosevac wrote: >>> >>> [...]
>>> >>> Predrag > From predragp at imap.srv.cs.cmu.edu Thu Oct 13 10:44:16 2016 From: predragp at imap.srv.cs.cmu.edu (Predrag Punosevac) Date: Thu, 13 Oct 2016 10:44:16 -0400 Subject: GPU3 is "configured" In-Reply-To: <22C1E88D-D614-4509-B0EC-3EB2357C0C3A@andrew.cmu.edu> References: <20161012222658.bee4yVvYw%predragp@cs.cmu.edu> <20161013022332.Coy6tY6j7%predragp@cs.cmu.edu> <22C1E88D-D614-4509-B0EC-3EB2357C0C3A@andrew.cmu.edu> Message-ID: On 2016-10-12 23:26, Arne Suppe wrote: > Hmm - I don?t use matlab for deep learning, but gpuDevice also hangs > on my computer with R2016a. > We would have to escalate this with MathWorks. I have seen work around Internet but it looks like a bug in one of Mathworks provided MEX files. > I was able compile the matrixMul example in the CUDA samples and run > it on gpu3, so I think the build environment is probably all set. > > As for the openGL, I think its possibly a problem with their build > script findgl.mk which is not familiar with Springdale OS. The > demo_suite directory has a precompiled nbody binary you may try, but I > suspect most users will not need graphics. > That should not be too hard to fix. Some header files have to be manually edited. The funny part until 7.2 Princeton people didn't bother to remove RHEL branding which actually made things easier for us. Doug is trying right now to compile the latest Caffe, TensorFlow, and protobuf-3. We will try to create an RPM for that so that we don't have to go through this again. I also asked Princeton and Rutgers guys if they have WIP RPMs to share. Predrag > Arne > > > > >> On Oct 12, 2016, at 10:23 PM, Predrag Punosevac >> wrote: >> >> Arne Suppe wrote: >> >>> Hi Predrag, >>> Don???t know if this applies to you, but I just build a machines with >>> a GTX1080 which has the same PASCAL architecture as the Titan. After >>> installing CUDA 8, I still found I needed to install the latest >>> driver off of the NVIDIA web site to get the card recognized. 
Right >>> now, I am running 367.44. >>> >>> Arne >> >> Arne, >> >> Thank you so much for this e-mail. Yes it is damn PASCAL arhitecture I >> see lots of people complaining about it on the forums. I downloaded >> and >> installed driver from >> >> http://www.nvidia.com/content/DriverDownload-March2009/confirmation.php?url=/XFree86/Linux-x86_64/367.57/NVIDIA-Linux-x86_64-367.57.run&lang=us&type=GeForce >> >> That seems to made a real difference. Check out this beautiful outputs >> >> root at gpu3$ ls nvidia* >> nvidia0 nvidia1 nvidia2 nvidia3 nvidiactl nvidia-uvm >> nvidia-uvm-tools >> >> root at gpu3$ lspci | grep -i nvidia >> 02:00.0 VGA compatible controller: NVIDIA Corporation Device 1b00 (rev >> a1) >> 02:00.1 Audio device: NVIDIA Corporation Device 10ef (rev a1) >> 03:00.0 VGA compatible controller: NVIDIA Corporation Device 1b00 (rev >> a1) >> 03:00.1 Audio device: NVIDIA Corporation Device 10ef (rev a1) >> 82:00.0 VGA compatible controller: NVIDIA Corporation Device 1b00 (rev >> a1) >> 82:00.1 Audio device: NVIDIA Corporation Device 10ef (rev a1) >> 83:00.0 VGA compatible controller: NVIDIA Corporation Device 1b00 (rev >> a1) >> 83:00.1 Audio device: NVIDIA Corporation Device 10ef (rev a1) >> >> >> root at gpu3$ ls /proc/driver >> nvidia nvidia-uvm nvram rtc >> >> root at gpu3$ lsmod |grep nvidia >> nvidia_uvm 738901 0 >> nvidia_drm 43405 0 >> nvidia_modeset 764432 1 nvidia_drm >> nvidia 11492947 2 nvidia_modeset,nvidia_uvm >> drm_kms_helper 125056 2 ast,nvidia_drm >> drm 349210 5 ast,ttm,drm_kms_helper,nvidia_drm >> i2c_core 40582 7 >> ast,drm,igb,i2c_i801,drm_kms_helper,i2c_algo_bit,nvidia >> >> root at gpu3$ nvidia-smi >> Wed Oct 12 22:03:27 2016 >> +-----------------------------------------------------------------------------+ >> | NVIDIA-SMI 367.57 Driver Version: 367.57 >> | >> |-------------------------------+----------------------+----------------------+ >> | GPU Name Persistence-M| Bus-Id Disp.A | Volatile >> Uncorr. 
ECC | >> | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util >> Compute M. | >> |===============================+======================+======================| >> | 0 TITAN X (Pascal) Off | 0000:02:00.0 Off | >> N/A | >> | 23% 32C P0 56W / 250W | 0MiB / 12189MiB | 0% >> Default | >> +-------------------------------+----------------------+----------------------+ >> | 1 TITAN X (Pascal) Off | 0000:03:00.0 Off | >> N/A | >> | 23% 36C P0 57W / 250W | 0MiB / 12189MiB | 0% >> Default | >> +-------------------------------+----------------------+----------------------+ >> | 2 TITAN X (Pascal) Off | 0000:82:00.0 Off | >> N/A | >> | 23% 35C P0 57W / 250W | 0MiB / 12189MiB | 0% >> Default | >> +-------------------------------+----------------------+----------------------+ >> | 3 TITAN X (Pascal) Off | 0000:83:00.0 Off | >> N/A | >> | 0% 35C P0 56W / 250W | 0MiB / 12189MiB | 0% >> Default | >> +-------------------------------+----------------------+----------------------+ >> >> >> +-----------------------------------------------------------------------------+ >> | Processes: GPU >> Memory | >> | GPU PID Type Process name >> Usage >> | >> |=============================================================================| >> | No running processes found >> | >> +-----------------------------------------------------------------------------+ >> >> >> >> /usr/local/cuda/extras/demo_suite/deviceQuery >> >> Alignment requirement for Surfaces: Yes >> Device has ECC support: Disabled >> Device supports Unified Addressing (UVA): Yes >> Device PCI Domain ID / Bus ID / location ID: 0 / 131 / 0 >> Compute Mode: >> < Default (multiple host threads can use ::cudaSetDevice() with >> device simultaneously) > >>> Peer access from TITAN X (Pascal) (GPU0) -> TITAN X (Pascal) (GPU1) : >> Yes >>> Peer access from TITAN X (Pascal) (GPU0) -> TITAN X (Pascal) (GPU2) : >> No >>> Peer access from TITAN X (Pascal) (GPU0) -> TITAN X (Pascal) (GPU3) : >> No >>> Peer access from TITAN X (Pascal) (GPU1) -> 
TITAN X (Pascal) (GPU0) : >> Yes >>> Peer access from TITAN X (Pascal) (GPU1) -> TITAN X (Pascal) (GPU2) : >> No >>> Peer access from TITAN X (Pascal) (GPU1) -> TITAN X (Pascal) (GPU3) : >> No >>> Peer access from TITAN X (Pascal) (GPU2) -> TITAN X (Pascal) (GPU0) : >> No >>> Peer access from TITAN X (Pascal) (GPU2) -> TITAN X (Pascal) (GPU1) : >> No >>> Peer access from TITAN X (Pascal) (GPU2) -> TITAN X (Pascal) (GPU3) : >> Yes >>> Peer access from TITAN X (Pascal) (GPU3) -> TITAN X (Pascal) (GPU0) : >> No >>> Peer access from TITAN X (Pascal) (GPU3) -> TITAN X (Pascal) (GPU1) : >> No >>> Peer access from TITAN X (Pascal) (GPU3) -> TITAN X (Pascal) (GPU2) : >> Yes >> >> deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA >> Runtime Version = 8.0, NumDevs = 4, Device0 = TITAN X (Pascal), >> Device1 >> = TITAN X (Pascal), Device2 = TITAN X (Pascal), Device3 = TITAN X >> (Pascal) >> Result = PASS >> >> >> >> Now not everything is rosy >> >> root at gpu3$ cd ~/NVIDIA_CUDA-8.0_Samples/5_Simulations/nbody >> root at gpu3$ make >>>>> WARNING - libGL.so not found, refer to CUDA Getting Started Guide >> for how to find and install them. <<< >>>>> WARNING - libGLU.so not found, refer to CUDA Getting Started Guide >> for how to find and install them. <<< >>>>> WARNING - libX11.so not found, refer to CUDA Getting Started Guide >> for how to find and install them. <<< >> >> >> even though those are installed. 
For example >> >> root at gpu3$ yum whatprovides */libX11.so >> libX11-devel-1.6.3-2.el7.i686 : Development files for libX11 >> Repo : core >> Matched from: >> Filename : /usr/lib/libX11.so >> >> also >> >> mesa-libGLU-devel >> mesa-libGL-devel >> xorg-x11-drv-nvidia-devel >> >> but >> >> root at gpu3$ yum -y install mesa-libGLU-devel mesa-libGL-devel >> xorg-x11-drv-nvidia-devel >> Package mesa-libGLU-devel-9.0.0-4.el7.x86_64 already installed and >> latest version >> Package mesa-libGL-devel-10.6.5-3.20150824.el7.x86_64 already >> installed >> and latest version >> Package 1:xorg-x11-drv-nvidia-devel-367.48-1.el7.x86_64 already >> installed and latest version >> >> Also from MATLAB gpuDevice hangs. >> >> So we still don't have a working installation. Any help would be >> appreciated. >> >> Best, >> Predrag >> >> P.S. Once we have a working installation we can think of installing >> Caffe and TensorFlow. For now we have to see why the things are not >> working. >> >> >> >> >> >> >>> >>>> On Oct 12, 2016, at 6:26 PM, Predrag Punosevac >>>> wrote: >>>> >>>> Dear Autonians, >>>> >>>> GPU3 is "configured". Namely you can log into it and all packages >>>> are >>>> installed. However I couldn't get NVIDIA provided CUDA driver to >>>> recognize GPU cards. 
They appear to be properly installed from the >>>> hardware point of view and you can list them with >>>> >>>> lshw -class display >>>> >>>> root at gpu3$ lshw -class display >>>> *-display UNCLAIMED >>>> description: VGA compatible controller >>>> product: NVIDIA Corporation >>>> vendor: NVIDIA Corporation >>>> physical id: 0 >>>> bus info: pci at 0000:02:00.0 >>>> version: a1 >>>> width: 64 bits >>>> clock: 33MHz >>>> capabilities: pm msi pciexpress vga_controller cap_list >>>> configuration: latency=0 >>>> resources: iomemory:383f0-383ef iomemory:383f0-383ef >>>> memory:cf000000-cfffffff memory:383fe0000000-383fefffffff >>>> memory:383ff0000000-383ff1ffffff ioport:6000(size=128) >>>> memory:d0000000-d007ffff >>>> *-display UNCLAIMED >>>> description: VGA compatible controller >>>> product: NVIDIA Corporation >>>> vendor: NVIDIA Corporation >>>> physical id: 0 >>>> bus info: pci at 0000:03:00.0 >>>> version: a1 >>>> width: 64 bits >>>> clock: 33MHz >>>> capabilities: pm msi pciexpress vga_controller cap_list >>>> configuration: latency=0 >>>> resources: iomemory:383f0-383ef iomemory:383f0-383ef >>>> memory:cd000000-cdffffff memory:383fc0000000-383fcfffffff >>>> memory:383fd0000000-383fd1ffffff ioport:5000(size=128) >>>> memory:ce000000-ce07ffff >>>> *-display >>>> description: VGA compatible controller >>>> product: ASPEED Graphics Family >>>> vendor: ASPEED Technology, Inc. 
>>>> physical id: 0 >>>> bus info: pci at 0000:06:00.0 >>>> version: 30 >>>> width: 32 bits >>>> clock: 33MHz >>>> capabilities: pm msi vga_controller bus_master cap_list rom >>>> configuration: driver=ast latency=0 >>>> resources: irq:19 memory:cb000000-cbffffff >>>> memory:cc000000-cc01ffff ioport:4000(size=128) >>>> *-display UNCLAIMED >>>> description: VGA compatible controller >>>> product: NVIDIA Corporation >>>> vendor: NVIDIA Corporation >>>> physical id: 0 >>>> bus info: pci at 0000:82:00.0 >>>> version: a1 >>>> width: 64 bits >>>> clock: 33MHz >>>> capabilities: pm msi pciexpress vga_controller cap_list >>>> configuration: latency=0 >>>> resources: iomemory:387f0-387ef iomemory:387f0-387ef >>>> memory:fa000000-faffffff memory:387fe0000000-387fefffffff >>>> memory:387ff0000000-387ff1ffffff ioport:e000(size=128) >>>> memory:fb000000-fb07ffff >>>> *-display UNCLAIMED >>>> description: VGA compatible controller >>>> product: NVIDIA Corporation >>>> vendor: NVIDIA Corporation >>>> physical id: 0 >>>> bus info: pci at 0000:83:00.0 >>>> version: a1 >>>> width: 64 bits >>>> clock: 33MHz >>>> capabilities: pm msi pciexpress vga_controller cap_list >>>> configuration: latency=0 >>>> resources: iomemory:387f0-387ef iomemory:387f0-387ef >>>> memory:f8000000-f8ffffff memory:387fc0000000-387fcfffffff >>>> memory:387fd0000000-387fd1ffffff ioport:d000(size=128) >>>> memory:f9000000-f907ffff >>>> >>>> >>>> However what scares the hell out of me is that I don't see NVIDIA >>>> driver >>>> loaded >>>> >>>> lsmod|grep nvidia >>>> >>>> and the device nodes /dev/nvidia are not created. I am guessing I >>>> just >>>> missed some trivial step during the CUDA installation which is very >>>> involving. I am unfortunately too tired to debug this tonight. 
>>>> >>>> Predrag >> From predragp at imap.srv.cs.cmu.edu Thu Oct 13 13:39:19 2016 From: predragp at imap.srv.cs.cmu.edu (Predrag Punosevac) Date: Thu, 13 Oct 2016 13:39:19 -0400 Subject: GPU3 is "configured" In-Reply-To: References: <20161012222658.bee4yVvYw%predragp@cs.cmu.edu> <20161013022332.Coy6tY6j7%predragp@cs.cmu.edu> <22C1E88D-D614-4509-B0EC-3EB2357C0C3A@andrew.cmu.edu> <20161013153826.f4agzWkMb%predragp@cs.cmu.edu> <3ad25168d2dc7502872b0cde94950655@imap.srv.cs.cmu.edu> Message-ID: <576ceb12fa4fffb3b72b68a742a9b0b1@imap.srv.cs.cmu.edu> Dear Autonians, In case anybody is interested in what happens behind the scenes, Doug got Caffe and TensorFlow to work on GPU3. Please see the message below. I also got very useful feedback from the Princeton and Rutgers people. Please check it out if you care (you will have to log into Gmail to see the exchange). https://groups.google.com/forum/#!forum/springdale-users I need to think about how we move forward with this before we start pulling any triggers. If somebody is itchy and can't wait, please build Caffe and TensorFlow in your scratch directory following the howto below. Predrag On 2016-10-13 13:24, Dougal Sutherland wrote: > A note about cudnn: > > There are a bunch of versions of cudnn. They're not > backwards-compatible, and different versions of > caffe/tensorflow/whatever want different ones. > > I am currently using the setup in ~dsutherl/cudnn_files: > > * I have a bunch of versions of the installer there. > * The use-cudnn.sh script, intended to be used like "source > use-cudnn.sh 5.1", will untar the appropriate one into a scratch > directory (if it hasn't already been done) and set > CPATH/LIBRARY_PATH/LD_LIBRARY_PATH appropriately. LD_LIBRARY_PATH is > needed for caffe binaries, since they don't link to the absolute path; > the first two (not sure about the third) are needed for theano. > Dunno about tensorflow yet.
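The use-cudnn.sh behavior described above can be sketched roughly as follows. This is a hypothetical reconstruction, not the actual contents of ~dsutherl/cudnn_files/use-cudnn.sh; the tarball name and scratch layout are assumptions.

```shell
# Hypothetical sketch of a use-cudnn.sh-style helper (not the real
# script); intended to be sourced, e.g. "source use-cudnn.sh 5.1".
CUDNN_VER="${1:-5.1}"
CUDNN_DIR="/home/scratch/$USER/cudnn-$CUDNN_VER"

# Untar the matching installer into scratch only if not already done.
if [ ! -d "$CUDNN_DIR" ]; then
    mkdir -p "$CUDNN_DIR" 2>/dev/null
    # tar xzf ~dsutherl/cudnn_files/cudnn-$CUDNN_VER.tgz -C "$CUDNN_DIR"  # tarball name assumed
fi

# CPATH is searched by the compiler, LIBRARY_PATH by the link step, and
# LD_LIBRARY_PATH by the runtime loader (needed for caffe binaries).
export CPATH="$CUDNN_DIR/include${CPATH:+:$CPATH}"
export LIBRARY_PATH="$CUDNN_DIR/lib64${LIBRARY_PATH:+:$LIBRARY_PATH}"
export LD_LIBRARY_PATH="$CUDNN_DIR/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```

Sourcing (rather than executing) the script is what lets the exported variables persist in the calling shell, which is why the "remember to source use-cudnn" reminder below matters.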
> > So, here's the Caffe setup: > > cd /home/scratch/$USER > git clone https://github.com/BVLC/caffe > cd caffe > cp Makefile.config.example Makefile.config > > # tell it to use openblas; using atlas needs some changes to the > Makefile > sed -i 's/BLAS := atlas/BLAS := open/' Makefile.config > > # configure to use cudnn (optional) > source ~dsutherl/cudnn-files/use-cudnn.sh 5.1 > sed -i 's/# USE_CUDNN := 1/USE_CUDNN := 1/' Makefile.config > perl -i -pe 's|$| '$CUDNN_DIR'/include| if /INCLUDE_DIRS :=/' > Makefile.config > perl -i -pe 's|$| '$CUDNN_DIR'/lib64| if /LIBRARY_DIRS :=/' > Makefile.config > > # build the library > make -j23 > > # to do tests (takes ~10 minutes): > make -j23 test > make runtest > > # Now, to run caffe binaries you'll need to remember to source > use-cudnn if you used cudnn before. > > # To build the python library: > make py > > # Requirements for the python library: > # Some of the system packages are too old; this installs them in your > scratch directory. > # You'll have to set PYTHONUSERBASE again before running any python > processes that use these libs. > export PYTHONUSERBASE=$HOME/scratch/.local; > export PATH=$PYTHONUSERBASE/bin:"$PATH" # <- optional > pip install --user -r python/requirements.txt > > # Caffe is dumb and doesn't package its python library properly. The > easiest way to use it is: > export PYTHONPATH=/home/scratch/$USER/caffe/python:$PYTHONPATH > python -c 'import caffe' > > On Thu, Oct 13, 2016 at 6:01 PM Dougal Sutherland > wrote: > >> Java fix seemed to work. Now tensorflow wants python-wheel and >> swig. >> >> On Thu, Oct 13, 2016 at 5:08 PM Predrag Punosevac >> wrote: >> >>> On 2016-10-13 11:46, Dougal Sutherland wrote: >>> >>>> Having some trouble with tensorflow, because: >>> >>>> >>> >>>> * it requires Google's bazel build system >>> >>>> >>> >>>> * The bazel installer says >>> >>>> Java version is 1.7.0_111 while at least 1.8 is needed.
>>> >>>> * >>> >>>> >>> >>>> * $ java -version >>> >>>> openjdk version "1.8.0_102" >>> >>>> OpenJDK Runtime Environment (build 1.8.0_102-b14) >>> >>>> OpenJDK 64-Bit Server VM (build 25.102-b14, mixed mode) >>> >>>> $ javac -version >>> >>>> javac 1.7.0_111 >>> >>>> >>> >>> I just did yum -y install java-1.8.0* which installs openjdk 1.8. >>> Please >>> >>> change your java. Let me know if >>> >>> you want me to install Oracle JDK 1.8 >>> >>> Predrag >>> >>>> On Thu, Oct 13, 2016 at 4:38 PM Predrag Punosevac >>> >>>> wrote: >>> >>>> >>> >>>>> Dougal Sutherland wrote: >>> >>>>> >>> >>>>>> Also, this seemed to work for me so far for protobuf: >>> >>>>>> >>> >>>>>> cd /home/scratch/$USER >>> >>>>>> VER=3.1.0 >>> >>>>>> wget >>> >>>>>> >>> >>>>> >>> >>>> >>> >> > https://github.com/google/protobuf/releases/download/v$VER/protobuf-cpp-$VER.tar.gz >>> >>>>>> tar xf protobuf-cpp-$VER.tar.gz >>> >>>>>> cd protobuf-cpp-$VER >>> >>>>>> ./configure --prefix=/home/scratch/$USER >>> >>>>>> make -j12 >>> >>>>>> make -j12 check >>> >>>>>> make install >>> >>>>> >>> >>>>> That is a great help! >>> >>>>> >>> >>>>>> >>> >>>>>> You could change --prefix=/usr if making an RPM. >>> >>>>>> >>> >>>>>> On Thu, Oct 13, 2016 at 4:26 PM Dougal Sutherland >>> >>>>> wrote: >>> >>>>>> >>> >>>>>>> Some more packages for caffe: >>> >>>>>>> >>> >>>>>>> leveldb-devel snappy-devel opencv-devel boost-devel >>> >>>>>>> hdf5-devel gflags-devel glog-devel lmdb-devel >>> >>>>>>> >>> >>>>>>> (Some of those might be installed already, but at least >>> gflags >>> >>>>> is >>> >>>>>>> definitely missing.) >>> >>>>>>> >>> >>>>>>> On Thu, Oct 13, 2016 at 3:44 PM Predrag Punosevac < >>> >>>>>>> predragp at imap.srv.cs.cmu.edu> wrote: >>> >>>>>>> >>> >>>>>>> On 2016-10-12 23:26, Arne Suppe wrote: >>> >>>>>>>> Hmm - I don't use matlab for deep learning, but gpuDevice >>> >>>>> also hangs >>> >>>>>>>> on my computer with R2016a. >>> >>>>>>>> >>> >>>>>>> >>> >>>>>>> We would have to escalate this with MathWorks.
I have seen >>> workarounds >>> >>>>> on the >>> >>>>>>> Internet but it looks like a bug in one of the MathWorks-provided >>> >>>>> MEX files. >>> >>>>>>>> I was able to compile the matrixMul example in the CUDA >>> samples >>> >>>>> and run >>> >>>>>>>> it on gpu3, so I think the build environment is probably >>> all >>> >>>>> set. >>> >>>>>>>> >>> >>>>>>>> As for the openGL, I think it's possibly a problem with >>> their >>> >>>>> build >>> >>>>>>>> script findgl.mk [1] which is not familiar with >>> Springdale OS. >>> >>>>> The >>> >>>>>>>> demo_suite directory has a precompiled nbody binary you may >>> >>>>> try, but I >>> >>>>>>>> suspect most users will not need graphics. >>> >>>>>>>> >>> >>>>>>> >>> >>>>>>> That should not be too hard to fix. Some header files have to >>> be >>> >>>>>>> manually edited. The funny part is that until 7.2 the Princeton people >>> >>>>>>> didn't bother >>> >>>>>>> to remove RHEL branding, which actually made things easier for >>> >>>>> us. >>> >>>>>>> >>> >>>>>>> >>> >>>>>>> Doug is trying right now to compile the latest Caffe, >>> >>>>> TensorFlow, and >>> >>>>>>> protobuf-3. We will try to create an RPM for that so that we >>> >>>>> don't have >>> >>>>>>> to go through this again. I also asked the Princeton and Rutgers >>> >>>>> guys if >>> >>>>>>> they >>> >>>>>>> have WIP RPMs to share. >>> >>>>>>> >>> >>>>>>> Predrag >>> >>>>>>> >>> >>>>>>>> Arne >>> >>>>>>>> >>> >>>>>>>> >>> >>>>>>>> >>> >>>>>>>> >>> >>>>>>>>> On Oct 12, 2016, at 10:23 PM, Predrag Punosevac >>> >>>>> >>> >>>>>>>>> wrote: >>> >>>>>>>>> >>> >>>>>>>>> Arne Suppe wrote: >>> >>>>>>>>> >>> >>>>>>>>>> Hi Predrag, >>> >>>>>>>>>> Don't know if this applies to you, but I just built a >>> >>>>> machine with >>> >>>>>>>>>> a GTX1080 which has the same PASCAL architecture as the >>> >>>>> Titan. After >>> >>>>>>>>>> installing CUDA 8, I still found I needed to install the >>> >>>>> latest >>> >>>>>>>>>> driver off of the NVIDIA web site to get the card >>> >>>>> recognized.
Right >>> >>>>>>>>>>>> now, I am running 367.44. >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> Arne >>> >>>>>>>>>>> >>> >>>>>>>>>>> Arne, >>> >>>>>>>>>>> >>> >>>>>>>>>>> Thank you so much for this e-mail. Yes, it is the damn PASCAL >>> >>>>> architecture; I >>> >>>>>>>>> see lots of people complaining about it on the forums. I >>> >>>>> downloaded >>> >>>>>>>>> and >>> >>>>>>>>> installed the driver from >>> >>>>>>>>> >>> >>>>>>> >>> >>>>> >>> >>>> >>> >> > http://www.nvidia.com/content/DriverDownload-March2009/confirmation.php?url=/XFree86/Linux-x86_64/367.57/NVIDIA-Linux-x86_64-367.57.run&lang=us&type=GeForce >>> >>>>>>>>> >>> >>>>>>>>> That seems to have made a real difference. Check out these >>> >>>>> beautiful outputs >>> >>>>>>>>> >>> >>>>>>>>> root at gpu3$ ls nvidia* >>> >>>>>>>>> nvidia0 nvidia1 nvidia2 nvidia3 nvidiactl nvidia-uvm >>> >>>>>>>>> nvidia-uvm-tools >>> >>>>>>>>> >>> >>>>>>>>> root at gpu3$ lspci | grep -i nvidia >>> >>>>>>>>> 02:00.0 VGA compatible controller: NVIDIA Corporation >>> Device >>> >>>>> 1b00 (rev >>> >>>>>>>>> a1) >>> >>>>>>>>> 02:00.1 Audio device: NVIDIA Corporation Device 10ef (rev >>> a1) >>> >>>>>>>>> 03:00.0 VGA compatible controller: NVIDIA Corporation >>> Device >>> >>>>> 1b00 (rev >>> >>>>>>>>> a1) >>> >>>>>>>>> 03:00.1 Audio device: NVIDIA Corporation Device 10ef (rev >>> a1) >>> >>>>>>>>> 82:00.0 VGA compatible controller: NVIDIA Corporation >>> Device >>> >>>>> 1b00 (rev >>> >>>>>>>>> a1) >>> >>>>>>>>> 82:00.1 Audio device: NVIDIA Corporation Device 10ef (rev >>> a1) >>> >>>>>>>>> 83:00.0 VGA compatible controller: NVIDIA Corporation >>> Device >>> >>>>> 1b00 (rev >>> >>>>>>>>> a1) >>> >>>>>>>>> 83:00.1 Audio device: NVIDIA Corporation Device 10ef (rev >>> a1) >>> >>>>>>>>> >>> >>>>>>>>> >>> >>>>>>>>> root at gpu3$ ls /proc/driver >>> >>>>>>>>> nvidia nvidia-uvm nvram rtc >>> >>>>>>>>> >>> >>>>>>>>> root at gpu3$ lsmod |grep nvidia >>> >>>>>>>>> nvidia_uvm 738901 0 >>> >>>>>>>>> nvidia_drm 43405 0 >>> >>>>>>>>> nvidia_modeset 764432 1 nvidia_drm
>>> >>>>>>>>> nvidia 11492947 2 nvidia_modeset,nvidia_uvm >>> >>>>>>>>> drm_kms_helper 125056 2 ast,nvidia_drm >>> >>>>>>>>> drm 349210 5 >>> >>>>> ast,ttm,drm_kms_helper,nvidia_drm >>> >>>>>>>>> i2c_core 40582 7 >>> >>>>>>>>> ast,drm,igb,i2c_i801,drm_kms_helper,i2c_algo_bit,nvidia >>> >>>>>>>>> >>> >>>>>>>>> root at gpu3$ nvidia-smi >>> >>>>>>>>> Wed Oct 12 22:03:27 2016 >>> >>>>>>>>> >>> >>>>>>> >>> >>>>> >>> >>>> >>> >> > +-----------------------------------------------------------------------------+ >>> >>>>>>>>> | NVIDIA-SMI 367.57 Driver Version: 367.57 >>> >>>>>>>>> | >>> >>>>>>>>> >>> >>>>>>> >>> >>>>> >>> >>>> >>> >> > |-------------------------------+----------------------+----------------------+ >>> >>>>>>>>> | GPU Name Persistence-M| Bus-Id Disp.A | >>> >>>>> Volatile >>> >>>>>>>>> Uncorr. ECC | >>> >>>>>>>>> | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | >>> >>>>> GPU-Util >>> >>>>>>>>> Compute M. | >>> >>>>>>>>> >>> >>>>>>> >>> >>>>> >>> >>>> >>> >> > |===============================+======================+======================| >>> >>>>>>>>> | 0 TITAN X (Pascal) Off | 0000:02:00.0 Off | >>> >>>>>>>>> N/A | >>> >>>>>>>>> | 23% 32C P0 56W / 250W | 0MiB / 12189MiB | >>> >>>>> 0% >>> >>>>>>>>> Default | >>> >>>>>>>>> >>> >>>>>>> >>> >>>>> >>> >>>> >>> >> > +-------------------------------+----------------------+----------------------+ >>> >>>>>>>>> | 1 TITAN X (Pascal) Off | 0000:03:00.0 Off | >>> >>>>>>>>> N/A | >>> >>>>>>>>> | 23% 36C P0 57W / 250W | 0MiB / 12189MiB | >>> >>>>> 0% >>> >>>>>>>>> Default | >>> >>>>>>>>> >>> >>>>>>> >>> >>>>> >>> >>>> >>> >> > +-------------------------------+----------------------+----------------------+ >>> >>>>>>>>> | 2 TITAN X (Pascal) Off | 0000:82:00.0 Off | >>> >>>>>>>>> N/A | >>> >>>>>>>>> | 23% 35C P0 57W / 250W | 0MiB / 12189MiB | >>> >>>>> 0% >>> >>>>>>>>> Default | >>> >>>>>>>>> >>> >>>>>>> >>> >>>>> >>> >>>> >>> >> > +-------------------------------+----------------------+----------------------+ >>> >>>>>>>>> 
| 3 TITAN X (Pascal) Off | 0000:83:00.0 Off | >>> >>>>>>>>> N/A | >>> >>>>>>>>> | 0% 35C P0 56W / 250W | 0MiB / 12189MiB | >>> >>>>> 0% >>> >>>>>>>>> Default | >>> >>>>>>>>> >>> >>>>>>> >>> >>>>> >>> >>>> >>> >> > +-------------------------------+----------------------+----------------------+ >>> >>>>>>>>> >>> >>>>>>>>> >>> >>>>>>>>> >>> >>>>>>> >>> >>>>> >>> >>>> >>> >> > +-----------------------------------------------------------------------------+ >>> >>>>>>>>> | Processes: >>> >>>>> GPU >>> >>>>>>>>> Memory | >>> >>>>>>>>> | GPU PID Type Process name >>> >>>>>>>>> Usage >>> >>>>>>>>> | >>> >>>>>>>>> >>> >>>>>>> >>> >>>>> >>> >>>> >>> >> > |=============================================================================| >>> >>>>>>>>> | No running processes found >>> >>>>>>>>> | >>> >>>>>>>>> >>> >>>>>>> >>> >>>>> >>> >>>> >>> >> > +-----------------------------------------------------------------------------+ >>> >>>>>>>>> >>> >>>>>>>>> >>> >>>>>>>>> >>> >>>>>>>>> /usr/local/cuda/extras/demo_suite/deviceQuery >>> >>>>>>>>> >>> >>>>>>>>> Alignment requirement for Surfaces: Yes >>> >>>>>>>>> Device has ECC support: Disabled >>> >>>>>>>>> Device supports Unified Addressing (UVA): Yes >>> >>>>>>>>> Device PCI Domain ID / Bus ID / location ID: 0 / 131 / >>> 0 >>> >>>>>>>>> Compute Mode: >>> >>>>>>>>> < Default (multiple host threads can use >>> >>>>> ::cudaSetDevice() with >>> >>>>>>>>> device simultaneously) > >>> >>>>>>>>>> Peer access from TITAN X (Pascal) (GPU0) -> TITAN X >>> (Pascal) >>> >>>>> (GPU1) : >>> >>>>>>>>> Yes >>> >>>>>>>>>> Peer access from TITAN X (Pascal) (GPU0) -> TITAN X >>> (Pascal) >>> >>>>> (GPU2) : >>> >>>>>>>>> No >>> >>>>>>>>>> Peer access from TITAN X (Pascal) (GPU0) -> TITAN X >>> (Pascal) >>> >>>>> (GPU3) : >>> >>>>>>>>> No >>> >>>>>>>>>> Peer access from TITAN X (Pascal) (GPU1) -> TITAN X >>> (Pascal) >>> >>>>> (GPU0) : >>> >>>>>>>>> Yes >>> >>>>>>>>>> Peer access from TITAN X (Pascal) (GPU1) -> TITAN X >>> (Pascal) >>> >>>>> (GPU2) : 
>>> >>>>>>>>> No >>> >>>>>>>>>> Peer access from TITAN X (Pascal) (GPU1) -> TITAN X >>> (Pascal) >>> >>>>> (GPU3) : >>> >>>>>>>>> No >>> >>>>>>>>>> Peer access from TITAN X (Pascal) (GPU2) -> TITAN X >>> (Pascal) >>> >>>>> (GPU0) : >>> >>>>>>>>> No >>> >>>>>>>>>> Peer access from TITAN X (Pascal) (GPU2) -> TITAN X >>> (Pascal) >>> >>>>> (GPU1) : >>> >>>>>>>>> No >>> >>>>>>>>>> Peer access from TITAN X (Pascal) (GPU2) -> TITAN X >>> (Pascal) >>> >>>>> (GPU3) : >>> >>>>>>>>> Yes >>> >>>>>>>>>> Peer access from TITAN X (Pascal) (GPU3) -> TITAN X >>> (Pascal) >>> >>>>> (GPU0) : >>> >>>>>>>>> No >>> >>>>>>>>>> Peer access from TITAN X (Pascal) (GPU3) -> TITAN X >>> (Pascal) >>> >>>>> (GPU1) : >>> >>>>>>>>> No >>> >>>>>>>>>> Peer access from TITAN X (Pascal) (GPU3) -> TITAN X >>> (Pascal) >>> >>>>> (GPU2) : >>> >>>>>>>>> Yes >>> >>>>>>>>> >>> >>>>>>>>> deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = >>> 8.0, >>> >>>>> CUDA >>> >>>>>>>>> Runtime Version = 8.0, NumDevs = 4, Device0 = TITAN X >>> >>>>> (Pascal), >>> >>>>>>>>> Device1 >>> >>>>>>>>> = TITAN X (Pascal), Device2 = TITAN X (Pascal), Device3 = >>> >>>>> TITAN X >>> >>>>>>>>> (Pascal) >>> >>>>>>>>> Result = PASS >>> >>>>>>>>> >>> >>>>>>>>> >>> >>>>>>>>> >>> >>>>>>>>> Now not everything is rosy >>> >>>>>>>>> >>> >>>>>>>>> root at gpu3$ cd >>> ~/NVIDIA_CUDA-8.0_Samples/5_Simulations/nbody >>> >>>>>>>>> root at gpu3$ make >>> >>>>>>>>>>>> WARNING - libGL.so not found, refer to CUDA Getting >>> >>>>> Started Guide >>> >>>>>>>>> for how to find and install them. <<< >>> >>>>>>>>>>>> WARNING - libGLU.so not found, refer to CUDA Getting >>> >>>>> Started Guide >>> >>>>>>>>> for how to find and install them. <<< >>> >>>>>>>>>>>> WARNING - libX11.so not found, refer to CUDA Getting >>> >>>>> Started Guide >>> >>>>>>>>> for how to find and install them. <<< >>> >>>>>>>>> >>> >>>>>>>>> >>> >>>>>>>>> even though those are installed. 
For example >>> >>>>>>>>> >>> >>>>>>>>> root at gpu3$ yum whatprovides */libX11.so >>> >>>>>>>>> libX11-devel-1.6.3-2.el7.i686 : Development files for >>> libX11 >>> >>>>>>>>> Repo : core >>> >>>>>>>>> Matched from: >>> >>>>>>>>> Filename : /usr/lib/libX11.so >>> >>>>>>>>> >>> >>>>>>>>> also >>> >>>>>>>>> >>> >>>>>>>>> mesa-libGLU-devel >>> >>>>>>>>> mesa-libGL-devel >>> >>>>>>>>> xorg-x11-drv-nvidia-devel >>> >>>>>>>>> >>> >>>>>>>>> but >>> >>>>>>>>> >>> >>>>>>>>> root at gpu3$ yum -y install mesa-libGLU-devel >>> mesa-libGL-devel >>> >>>>>>>>> xorg-x11-drv-nvidia-devel >>> >>>>>>>>> Package mesa-libGLU-devel-9.0.0-4.el7.x86_64 already >>> >>>>> installed and >>> >>>>>>>>> latest version >>> >>>>>>>>> Package mesa-libGL-devel-10.6.5-3.20150824.el7.x86_64 >>> already >>> >>>>>>>>> installed >>> >>>>>>>>> and latest version >>> >>>>>>>>> Package 1:xorg-x11-drv-nvidia-devel-367.48-1.el7.x86_64 >>> >>>>> already >>> >>>>>>>>> installed and latest version >>> >>>>>>>>> >>> >>>>>>>>> Also from MATLAB gpuDevice hangs. >>> >>>>>>>>> >>> >>>>>>>>> So we still don't have a working installation. Any help >>> would >>> >>>>> be >>> >>>>>>>>> appreciated. >>> >>>>>>>>> >>> >>>>>>>>> Best, >>> >>>>>>>>> Predrag >>> >>>>>>>>> >>> >>>>>>>>> P.S. Once we have a working installation we can think of >>> >>>>> installing >>> >>>>>>>>> Caffe and TensorFlow. For now we have to see why the >>> things >>> >>>>> are not >>> >>>>>>>>> working. >>> >>>>>>>>> >>> >>>>>>>>> >>> >>>>>>>>> >>> >>>>>>>>> >>> >>>>>>>>> >>> >>>>>>>>> >>> >>>>>>>>>> >>> >>>>>>>>>>> On Oct 12, 2016, at 6:26 PM, Predrag Punosevac >>> >>>>> >>> >>>>>>>>>>> wrote: >>> >>>>>>>>>>> >>> >>>>>>>>>>> Dear Autonians, >>> >>>>>>>>>>> >>> >>>>>>>>>>> GPU3 is "configured". Namely you can log into it and all >>> >>>>> packages >>> >>>>>>>>>>> are >>> >>>>>>>>>>> installed. However I couldn't get NVIDIA provided CUDA >>> >>>>> driver to >>> >>>>>>>>>>> recognize GPU cards. 
They appear to be properly >>> installed >>> >>>>> from the >>> >>>>>>>>>>> hardware point of view and you can list them with >>> >>>>>>>>>>> >>> >>>>>>>>>>> lshw -class display >>> >>>>>>>>>>> >>> >>>>>>>>>>> root at gpu3$ lshw -class display >>> >>>>>>>>>>> *-display UNCLAIMED >>> >>>>>>>>>>> description: VGA compatible controller >>> >>>>>>>>>>> product: NVIDIA Corporation >>> >>>>>>>>>>> vendor: NVIDIA Corporation >>> >>>>>>>>>>> physical id: 0 >>> >>>>>>>>>>> bus info: pci at 0000:02:00.0 >>> >>>>>>>>>>> version: a1 >>> >>>>>>>>>>> width: 64 bits >>> >>>>>>>>>>> clock: 33MHz >>> >>>>>>>>>>> capabilities: pm msi pciexpress vga_controller >>> >>>>> cap_list >>> >>>>>>>>>>> configuration: latency=0 >>> >>>>>>>>>>> resources: iomemory:383f0-383ef >>> iomemory:383f0-383ef >>> >>>>>>>>>>> memory:cf000000-cfffffff >>> memory:383fe0000000-383fefffffff >>> >>>>>>>>>>> memory:383ff0000000-383ff1ffffff ioport:6000(size=128) >>> >>>>>>>>>>> memory:d0000000-d007ffff >>> >>>>>>>>>>> *-display UNCLAIMED >>> >>>>>>>>>>> description: VGA compatible controller >>> >>>>>>>>>>> product: NVIDIA Corporation >>> >>>>>>>>>>> vendor: NVIDIA Corporation >>> >>>>>>>>>>> physical id: 0 >>> >>>>>>>>>>> bus info: pci at 0000:03:00.0 >>> >>>>>>>>>>> version: a1 >>> >>>>>>>>>>> width: 64 bits >>> >>>>>>>>>>> clock: 33MHz >>> >>>>>>>>>>> capabilities: pm msi pciexpress vga_controller >>> >>>>> cap_list >>> >>>>>>>>>>> configuration: latency=0 >>> >>>>>>>>>>> resources: iomemory:383f0-383ef >>> iomemory:383f0-383ef >>> >>>>>>>>>>> memory:cd000000-cdffffff >>> memory:383fc0000000-383fcfffffff >>> >>>>>>>>>>> memory:383fd0000000-383fd1ffffff ioport:5000(size=128) >>> >>>>>>>>>>> memory:ce000000-ce07ffff >>> >>>>>>>>>>> *-display >>> >>>>>>>>>>> description: VGA compatible controller >>> >>>>>>>>>>> product: ASPEED Graphics Family >>> >>>>>>>>>>> vendor: ASPEED Technology, Inc. 
>>> >>>>>>>>>>> physical id: 0 >>> >>>>>>>>>>> bus info: pci at 0000:06:00.0 >>> >>>>>>>>>>> version: 30 >>> >>>>>>>>>>> width: 32 bits >>> >>>>>>>>>>> clock: 33MHz >>> >>>>>>>>>>> capabilities: pm msi vga_controller bus_master >>> >>>>> cap_list rom >>> >>>>>>>>>>> configuration: driver=ast latency=0 >>> >>>>>>>>>>> resources: irq:19 memory:cb000000-cbffffff >>> >>>>>>>>>>> memory:cc000000-cc01ffff ioport:4000(size=128) >>> >>>>>>>>>>> *-display UNCLAIMED >>> >>>>>>>>>>> description: VGA compatible controller >>> >>>>>>>>>>> product: NVIDIA Corporation >>> >>>>>>>>>>> vendor: NVIDIA Corporation >>> >>>>>>>>>>> physical id: 0 >>> >>>>>>>>>>> bus info: pci at 0000:82:00.0 >>> >>>>>>>>>>> version: a1 >>> >>>>>>>>>>> width: 64 bits >>> >>>>>>>>>>> clock: 33MHz >>> >>>>>>>>>>> capabilities: pm msi pciexpress vga_controller >>> >>>>> cap_list >>> >>>>>>>>>>> configuration: latency=0 >>> >>>>>>>>>>> resources: iomemory:387f0-387ef >>> iomemory:387f0-387ef >>> >>>>>>>>>>> memory:fa000000-faffffff >>> memory:387fe0000000-387fefffffff >>> >>>>>>>>>>> memory:387ff0000000-387ff1ffffff ioport:e000(size=128) >>> >>>>>>>>>>> memory:fb000000-fb07ffff >>> >>>>>>>>>>> *-display UNCLAIMED >>> >>>>>>>>>>> description: VGA compatible controller >>> >>>>>>>>>>> product: NVIDIA Corporation >>> >>>>>>>>>>> vendor: NVIDIA Corporation >>> >>>>>>>>>>> physical id: 0 >>> >>>>>>>>>>> bus info: pci at 0000:83:00.0 >>> >>>>>>>>>>> version: a1 >>> >>>>>>>>>>> width: 64 bits >>> >>>>>>>>>>> clock: 33MHz >>> >>>>>>>>>>> capabilities: pm msi pciexpress vga_controller >>> >>>>> cap_list >>> >>>>>>>>>>> configuration: latency=0 >>> >>>>>>>>>>> resources: iomemory:387f0-387ef >>> iomemory:387f0-387ef >>> >>>>>>>>>>> memory:f8000000-f8ffffff >>> memory:387fc0000000-387fcfffffff >>> >>>>>>>>>>> memory:387fd0000000-387fd1ffffff ioport:d000(size=128) >>> >>>>>>>>>>> memory:f9000000-f907ffff >>> >>>>>>>>>>> >>> >>>>>>>>>>> >>> >>>>>>>>>>> However what scares the hell out of me is that I don't >>> see >>> 
>>>>> NVIDIA >>> >>>>>>>>>>> driver >>> >>>>>>>>>>> loaded >>> >>>>>>>>>>> >>> >>>>>>>>>>> lsmod|grep nvidia >>> >>>>>>>>>>> >>> >>>>>>>>>>> and the device nodes /dev/nvidia are not created. I am >>> >>>>> guessing I >>> >>>>>>>>>>> just >>> >>>>>>>>>>> missed some trivial step during the CUDA installation >>> which >>> >>>>> is very >>> >>>>>>>>>>> involving. I am unfortunately too tired to debug this >>> >>>>> tonight. >>> >>>>>>>>>>> >>> >>>>>>>>>>> Predrag >>> >>>>>>>>> >>> >>>>>>> >>> >>>>>>> >>> >>>> >>> >>>> >>> >>>> Links: >>> >>>> ------ >>> >>>> [1] http://findgl.mk > > > Links: > ------ > [1] http://findgl.mk From predragp at imap.srv.cs.cmu.edu Thu Oct 13 13:55:34 2016 From: predragp at imap.srv.cs.cmu.edu (Predrag Punosevac) Date: Thu, 13 Oct 2016 13:55:34 -0400 Subject: GPU3 is "configured" In-Reply-To: References: <20161012222658.bee4yVvYw%predragp@cs.cmu.edu> <20161013022332.Coy6tY6j7%predragp@cs.cmu.edu> <22C1E88D-D614-4509-B0EC-3EB2357C0C3A@andrew.cmu.edu> <20161013153826.f4agzWkMb%predragp@cs.cmu.edu> <3ad25168d2dc7502872b0cde94950655@imap.srv.cs.cmu.edu> <576ceb12fa4fffb3b72b68a742a9b0b1@imap.srv.cs.cmu.edu> Message-ID: On 2016-10-13 13:51, Dougal Sutherland wrote: > I actually haven't gotten tensorflow working yet -- the bazel build > just hangs on me. I think it maybe has to do with home directories > being on NFS, but I can't figure out bazel at all. I'll try some more > tonight. > According to one of the Princeton guys, we could just use Python conda for TensorFlow. Please check it out, and use your scratch directory instead of NFS. Quote: Hello, Predrag. We have caffe 1.00rc3 if you are interested. ftp://ftp.cs.princeton.edu/pub/people/advorkin/SRPM/sd7/caffe-1.00rc3-3.sd7.src.rpm TensorFlow and protobuf-3 work great with conda (http://conda.pydata.org). I just tried and had no problems installing it for Python 2.7 and 3.5 > Caffe should be workable following the instructions Predrag forwarded.
> > - Dougal > > On Thu, Oct 13, 2016 at 6:39 PM Predrag Punosevac > wrote: > >> Dear Autonians, >> >> In the case anybody is interested what happens behind the scenes, >> Doug >> got Caffe and TensorFlow to work on >> GPU3. Please see message below. I also got the very useful feed >> back >> from Princeton and Rutgers people. Please check out if you care (you >> will have to log into Gmail to see the exchange). >> >> https://groups.google.com/forum/#!forum/springdale-users >> >> I need to think how we move forward with this before start pulling >> triggers. If somebody is itchy and can't wait please build Caffe and >> TensorFlow in your scratch directory following below howto. >> >> Predrag >> >> On 2016-10-13 13:24, Dougal Sutherland wrote: >>> A note about cudnn: >>> >>> There are a bunch of versions of cudnn. They're not >>> backwards-compatible, and different versions of >>> caffe/tensorflow/whatever want different ones. >>> >>> I currently am using the setup in ~dsutherl/cudnn_files: >>> >>> * I have a bunch of versions of the installer there. >>> * The use-cudnn.sh script, intended to be used like "source >>> use-cudnn.sh 5.1", will untar the appropriate one into a scratch >>> directory (if it hasn't already been done) and set >>> CPATH/LIBRARY_PATH/LD_LIBRARY_PATH appropriately. LD_LIBRARY_PATH >> is >>> needed for caffe binaries, since they don't link to the absolute >> path; >>> the first two (not sure about the the third) are needed for >> theano. >>> Dunno about tensorflow yet. 
>>> >>> So, here's the Caffe setup: >>> >>> cd /home/scratch/$USER >>> git clone https://github.com/BVLC/caffe >>> cd caffe >>> cp Makefile.config.example Makefile.config >>> >>> # tell it to use openblas; using atlas needs some changes to the >>> Makefile >>> sed -i 's/BLAS := atlas/BLAS := open/' Makefile.config >>> >>> # configure to use cudnn (optional) >>> source ~dsutherl/cudnn-files/use-cudnn.sh 5.1 >>> sed -i 's/# USE_CUDNN := 1/USE_CUDNN := 1/' Makefile.config >>> perl -i -pe 's|$| '$CUDNN_DIR'/include| if /INCLUDE_DIRS :=/' >>> Makefile.config >>> perl -i -pe 's|$| '$CUDNN_DIR'/lib64| if /LIBRARY_DIRS :=/' >>> Makefile.config >>> >>> # build the library >>> make -j23 >>> >>> # to do tests (takes ~10 minutes): >>> make -j23 test >>> make runtest >>> >>> # Now, to run caffe binaries you'll need to remember to source >>> use-cudnn if you used cudnn before. >>> >>> # To build the python libary: >>> make py >>> >>> # Requirements for the python library: >>> # Some of the system packages are too old; this installs them in >> your >>> scratch directory. >>> # You'll have to set PYTHONUSERBASE again before running any >> python >>> processes that use these libs. >>> export PYTHONUSERBASE=$HOME/scratch/.local; >>> export PATH=$PYTHONUSERBASE/bin:"$PATH" # <- optional >>> pip install --user -r python/requirements.txt >>> >>> # Caffe is dumb and doesn't package its python library properly. >> The >>> easiest way to use it is: >>> export PYTHONPATH=/home/scratch/$USER/caffe/python:$PYTHONPATH >>> python -c 'import caffe' >>> >>> On Thu, Oct 13, 2016 at 6:01 PM Dougal Sutherland >> >>> wrote: >>> >>>> Java fix seemed to work. Now tensorflow wants python-wheel and >>>> swig. 
>>>> >>>> On Thu, Oct 13, 2016 at 5:08 PM Predrag Punosevac >>>> wrote: >>>> >>>>> On 2016-10-13 11:46, Dougal Sutherland wrote: >>>>> >>>>>> Having some trouble with tensorflow, because: >>>>> >>>>>> >>>>> >>>>>> * it require's Google's bazel build system >>>>> >>>>>> >>>>> >>>>>> * The bazel installer says >>>>> >>>>>> Java version is 1.7.0_111 while at least 1.8 is needed. >>>>> >>>>>> * >>>>> >>>>>> >>>>> >>>>>> * $ java -version >>>>> >>>>>> openjdk version "1.8.0_102" >>>>> >>>>>> OpenJDK Runtime Environment (build 1.8.0_102-b14) >>>>> >>>>>> OpenJDK 64-Bit Server VM (build 25.102-b14, mixed mode) >>>>> >>>>>> $ javac -version >>>>> >>>>>> javac 1.7.0_111 >>>>> >>>>>> >>>>> >>>>> I just did yum -y install java-1.8.0* which installs openjdk >> 1.8. >>>>> Please >>>>> >>>>> change your java. Let me know if >>>>> >>>>> you want me to install Oracle JDK 1.8 >>>>> >>>>> Predrag >>>>> >>>>>> On Thu, Oct 13, 2016 at 4:38 PM Predrag Punosevac >>>>> >>>>>> wrote: >>>>> >>>>>> >>>>> >>>>>>> Dougal Sutherland wrote: >>>>> >>>>>>> >>>>> >>>>>>>> Also, this seemed to work for me so far for protobuf: >>>>> >>>>>>>> >>>>> >>>>>>>> cd /home/scratch/$USER >>>>> >>>>>>>> VER=3.1.0 >>>>> >>>>>>>> wget >>>>> >>>>>>>> >>>>> >>>>>>> >>>>> >>>>>> >>>>> >>>> >>> >> > https://github.com/google/protobuf/releases/download/v$VER/protobuf-cpp-$VER.tar.gz >>>>> >>>>>>>> tar xf protobuf-cpp-$VER.tar.gz >>>>> >>>>>>>> cd protobuf-cpp-$VER >>>>> >>>>>>>> ./configure --prefix=/home/scratch/$USER >>>>> >>>>>>>> make -j12 >>>>> >>>>>>>> make -j12 check >>>>> >>>>>>>> make install >>>>> >>>>>>> >>>>> >>>>>>> That is great help! >>>>> >>>>>>> >>>>> >>>>>>>> >>>>> >>>>>>>> You could change --prefix=/usr if making an RPM. 
>>>>> >>>>>>>> >>>>> >>>>>>>> On Thu, Oct 13, 2016 at 4:26 PM Dougal Sutherland >>>>> >>>>>>> wrote: >>>>> >>>>>>>> >>>>> >>>>>>>>> Some more packages for caffe: >>>>> >>>>>>>>> >>>>> >>>>>>>>> leveldb-devel snappy-devel opencv-devel boost-devel >>>>> >>>>>>>>> hdf5-devel gflags-devel glog-devel lmdb-devel >>>>> >>>>>>>>> >>>>> >>>>>>>>> (Some of those might be installed already, but at least >>>>> gflags >>>>> >>>>>>> is >>>>> >>>>>>>>> definitely missing.) >>>>> >>>>>>>>> >>>>> >>>>>>>>> On Thu, Oct 13, 2016 at 3:44 PM Predrag Punosevac < >>>>> >>>>>>>>> predragp at imap.srv.cs.cmu.edu> wrote: >>>>> >>>>>>>>> >>>>> >>>>>>>>> On 2016-10-12 23:26, Arne Suppe wrote: >>>>> >>>>>>>>>> Hmm - I don???t use matlab for deep learning, but gpuDevice >>>>> >>>>>>> also hangs >>>>> >>>>>>>>>> on my computer with R2016a. >>>>> >>>>>>>>>> >>>>> >>>>>>>>> >>>>> >>>>>>>>> We would have to escalate this with MathWorks. I have seen >>>>> work >>>>> >>>>>>> around >>>>> >>>>>>>>> Internet but it looks like a bug in one of Mathworks >> provided >>>>> >>>>>>> MEX files. >>>>> >>>>>>>>> >>>>> >>>>>>>>>> I was able compile the matrixMul example in the CUDA >>>>> samples >>>>> >>>>>>> and run >>>>> >>>>>>>>>> it on gpu3, so I think the build environment is probably >>>>> all >>>>> >>>>>>> set. >>>>> >>>>>>>>>> >>>>> >>>>>>>>>> As for the openGL, I think its possibly a problem with >>>>> their >>>>> >>>>>>> build >>>>> >>>>>>>>>> script findgl.mk [1] [1] [1] which is not familiar with >>>>> Springdale OS. >>>>> >>>>>>> The >>>>> >>>>>>>>>> demo_suite directory has a precompiled nbody binary you may >>>>> >>>>>>> try, but I >>>>> >>>>>>>>>> suspect most users will not need graphics. >>>>> >>>>>>>>>> >>>>> >>>>>>>>> >>>>> >>>>>>>>> That should not be too hard to fix. Some header files have >> to >>>>> be >>>>> >>>>>>>>> manually edited. 
The funny part is that until 7.2 the Princeton people >>>>> >>>>>>> didn't bother >>>>> >>>>>>>>> to remove the RHEL branding, which actually made things easier >> for >>>>> >>>>>>> us. >>>>> >>>>>>>>> >>>>> >>>>>>>>> >>>>> >>>>>>>>> Doug is trying right now to compile the latest Caffe, >>>>> >>>>>>> TensorFlow, and >>>>> >>>>>>>>> protobuf-3. We will try to create an RPM for that so that we >>>>> >>>>>>> don't have >>>>> >>>>>>>>> to go through this again. I also asked the Princeton and Rutgers >>>>> >>>>>>> guys if >>>>> >>>>>>>>> they >>>>> >>>>>>>>> have WIP RPMs to share. >>>>> >>>>>>>>> >>>>> >>>>>>>>> Predrag >>>>> >>>>>>>>> >>>>> >>>>>>>>>> Arne >>>>> >>>>>>>>>> >>>>> >>>>>>>>>> >>>>> >>>>>>>>>> >>>>> >>>>>>>>>> >>>>> >>>>>>>>>>> On Oct 12, 2016, at 10:23 PM, Predrag Punosevac >>>>> >>>>>>> >>>>> >>>>>>>>>>> wrote: >>>>> >>>>>>>>>>> >>>>> >>>>>>>>>>> Arne Suppe wrote: >>>>> >>>>>>>>>>> >>>>> >>>>>>>>>>>> Hi Predrag, >>>>> >>>>>>>>>>>> Don't know if this applies to you, but I just built a >>>>> >>>>>>> machine with >>>>> >>>>>>>>>>>> a GTX1080 which has the same PASCAL architecture as the >>>>> >>>>>>> Titan. After >>>>> >>>>>>>>>>>> installing CUDA 8, I still found I needed to install the >>>>> >>>>>>> latest >>>>> >>>>>>>>>>>> driver off of the NVIDIA web site to get the card >>>>> >>>>>>> recognized. Right >>>>> >>>>>>>>>>>> now, I am running 367.44. >>>>> >>>>>>>>>>>> >>>>> >>>>>>>>>>>> Arne >>>>> >>>>>>>>>>> >>>>> >>>>>>>>>>> Arne, >>>>> >>>>>>>>>>> >>>>> >>>>>>>>>>> Thank you so much for this e-mail. Yes, it is the damn PASCAL >>>>> >>>>>>> architecture; I >>>>> >>>>>>>>>>> see lots of people complaining about it on the forums. 
I >>>>> >>>>>>> downloaded >>>>> >>>>>>>>>>> and >>>>> >>>>>>>>>>> installed driver from >>>>> >>>>>>>>>>> >>>>> >>>>>>>>>>> >>>>> >>>>>>>>> >>>>> >>>>>>> >>>>> >>>>>> >>>>> >>>> >>> >> > http://www.nvidia.com/content/DriverDownload-March2009/confirmation.php?url=/XFree86/Linux-x86_64/367.57/NVIDIA-Linux-x86_64-367.57.run&lang=us&type=GeForce >>>>> >>>>>>>>>>> >>>>> >>>>>>>>>>> That seems to made a real difference. Check out this >>>>> >>>>>>> beautiful outputs >>>>> >>>>>>>>>>> >>>>> >>>>>>>>>>> root at gpu3$ ls nvidia* >>>>> >>>>>>>>>>> nvidia0 nvidia1 nvidia2 nvidia3 nvidiactl nvidia-uvm >>>>> >>>>>>>>>>> nvidia-uvm-tools >>>>> >>>>>>>>>>> >>>>> >>>>>>>>>>> root at gpu3$ lspci | grep -i nvidia >>>>> >>>>>>>>>>> 02:00.0 VGA compatible controller: NVIDIA Corporation >>>>> Device >>>>> >>>>>>> 1b00 (rev >>>>> >>>>>>>>>>> a1) >>>>> >>>>>>>>>>> 02:00.1 Audio device: NVIDIA Corporation Device 10ef (rev >>>>> a1) >>>>> >>>>>>>>>>> 03:00.0 VGA compatible controller: NVIDIA Corporation >>>>> Device >>>>> >>>>>>> 1b00 (rev >>>>> >>>>>>>>>>> a1) >>>>> >>>>>>>>>>> 03:00.1 Audio device: NVIDIA Corporation Device 10ef (rev >>>>> a1) >>>>> >>>>>>>>>>> 82:00.0 VGA compatible controller: NVIDIA Corporation >>>>> Device >>>>> >>>>>>> 1b00 (rev >>>>> >>>>>>>>>>> a1) >>>>> >>>>>>>>>>> 82:00.1 Audio device: NVIDIA Corporation Device 10ef (rev >>>>> a1) >>>>> >>>>>>>>>>> 83:00.0 VGA compatible controller: NVIDIA Corporation >>>>> Device >>>>> >>>>>>> 1b00 (rev >>>>> >>>>>>>>>>> a1) >>>>> >>>>>>>>>>> 83:00.1 Audio device: NVIDIA Corporation Device 10ef (rev >>>>> a1) >>>>> >>>>>>>>>>> >>>>> >>>>>>>>>>> >>>>> >>>>>>>>>>> root at gpu3$ ls /proc/driver >>>>> >>>>>>>>>>> nvidia nvidia-uvm nvram rtc >>>>> >>>>>>>>>>> >>>>> >>>>>>>>>>> root at gpu3$ lsmod |grep nvidia >>>>> >>>>>>>>>>> nvidia_uvm 738901 0 >>>>> >>>>>>>>>>> nvidia_drm 43405 0 >>>>> >>>>>>>>>>> nvidia_modeset 764432 1 nvidia_drm >>>>> >>>>>>>>>>> nvidia 11492947 2 nvidia_modeset,nvidia_uvm >>>>> >>>>>>>>>>> drm_kms_helper 
125056 2 ast,nvidia_drm >>>>> >>>>>>>>>>> drm 349210 5 >>>>> >>>>>>> ast,ttm,drm_kms_helper,nvidia_drm >>>>> >>>>>>>>>>> i2c_core 40582 7 >>>>> >>>>>>>>>>> ast,drm,igb,i2c_i801,drm_kms_helper,i2c_algo_bit,nvidia >>>>> >>>>>>>>>>> >>>>> >>>>>>>>>>> root at gpu3$ nvidia-smi >>>>> >>>>>>>>>>> Wed Oct 12 22:03:27 2016 >>>>> >>>>>>>>>>> >>>>> >>>>>>>>> >>>>> >>>>>>> >>>>> >>>>>> >>>>> >>>> >>> >> > +-----------------------------------------------------------------------------+ >>>>> >>>>>>>>>>> | NVIDIA-SMI 367.57 Driver Version: 367.57 >>>>> >>>>>>>>>>> | >>>>> >>>>>>>>>>> >>>>> >>>>>>>>> >>>>> >>>>>>> >>>>> >>>>>> >>>>> >>>> >>> >> > |-------------------------------+----------------------+----------------------+ >>>>> >>>>>>>>>>> | GPU Name Persistence-M| Bus-Id Disp.A | >>>>> >>>>>>> Volatile >>>>> >>>>>>>>>>> Uncorr. ECC | >>>>> >>>>>>>>>>> | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | >>>>> >>>>>>> GPU-Util >>>>> >>>>>>>>>>> Compute M. | >>>>> >>>>>>>>>>> >>>>> >>>>>>>>> >>>>> >>>>>>> >>>>> >>>>>> >>>>> >>>> >>> >> > |===============================+======================+======================| >>>>> >>>>>>>>>>> | 0 TITAN X (Pascal) Off | 0000:02:00.0 Off | >>>>> >>>>>>>>>>> N/A | >>>>> >>>>>>>>>>> | 23% 32C P0 56W / 250W | 0MiB / 12189MiB | >>>>> >>>>>>> 0% >>>>> >>>>>>>>>>> Default | >>>>> >>>>>>>>>>> >>>>> >>>>>>>>> >>>>> >>>>>>> >>>>> >>>>>> >>>>> >>>> >>> >> > +-------------------------------+----------------------+----------------------+ >>>>> >>>>>>>>>>> | 1 TITAN X (Pascal) Off | 0000:03:00.0 Off | >>>>> >>>>>>>>>>> N/A | >>>>> >>>>>>>>>>> | 23% 36C P0 57W / 250W | 0MiB / 12189MiB | >>>>> >>>>>>> 0% >>>>> >>>>>>>>>>> Default | >>>>> >>>>>>>>>>> >>>>> >>>>>>>>> >>>>> >>>>>>> >>>>> >>>>>> >>>>> >>>> >>> >> > +-------------------------------+----------------------+----------------------+ >>>>> >>>>>>>>>>> | 2 TITAN X (Pascal) Off | 0000:82:00.0 Off | >>>>> >>>>>>>>>>> N/A | >>>>> >>>>>>>>>>> | 23% 35C P0 57W / 250W | 0MiB / 12189MiB | >>>>> >>>>>>> 0% >>>>> 
>>>>>>>>>>> Default | >>>>> >>>>>>>>>>> >>>>> >>>>>>>>> >>>>> >>>>>>> >>>>> >>>>>> >>>>> >>>> >>> >> > +-------------------------------+----------------------+----------------------+ >>>>> >>>>>>>>>>> | 3 TITAN X (Pascal) Off | 0000:83:00.0 Off | >>>>> >>>>>>>>>>> N/A | >>>>> >>>>>>>>>>> | 0% 35C P0 56W / 250W | 0MiB / 12189MiB | >>>>> >>>>>>> 0% >>>>> >>>>>>>>>>> Default | >>>>> >>>>>>>>>>> >>>>> >>>>>>>>> >>>>> >>>>>>> >>>>> >>>>>> >>>>> >>>> >>> >> > +-------------------------------+----------------------+----------------------+ >>>>> >>>>>>>>>>> >>>>> >>>>>>>>>>> >>>>> >>>>>>>>>>> >>>>> >>>>>>>>> >>>>> >>>>>>> >>>>> >>>>>> >>>>> >>>> >>> >> > +-----------------------------------------------------------------------------+ >>>>> >>>>>>>>>>> | Processes: >>>>> >>>>>>> GPU >>>>> >>>>>>>>>>> Memory | >>>>> >>>>>>>>>>> | GPU PID Type Process name >>>>> >>>>>>>>>>> Usage >>>>> >>>>>>>>>>> | >>>>> >>>>>>>>>>> >>>>> >>>>>>>>> >>>>> >>>>>>> >>>>> >>>>>> >>>>> >>>> >>> >> > |=============================================================================| >>>>> >>>>>>>>>>> | No running processes found >>>>> >>>>>>>>>>> | >>>>> >>>>>>>>>>> >>>>> >>>>>>>>> >>>>> >>>>>>> >>>>> >>>>>> >>>>> >>>> >>> >> > +-----------------------------------------------------------------------------+ >>>>> >>>>>>>>>>> >>>>> >>>>>>>>>>> >>>>> >>>>>>>>>>> >>>>> >>>>>>>>>>> /usr/local/cuda/extras/demo_suite/deviceQuery >>>>> >>>>>>>>>>> >>>>> >>>>>>>>>>> Alignment requirement for Surfaces: Yes >>>>> >>>>>>>>>>> Device has ECC support: Disabled >>>>> >>>>>>>>>>> Device supports Unified Addressing (UVA): Yes >>>>> >>>>>>>>>>> Device PCI Domain ID / Bus ID / location ID: 0 / 131 / >>>>> 0 >>>>> >>>>>>>>>>> Compute Mode: >>>>> >>>>>>>>>>> < Default (multiple host threads can use >>>>> >>>>>>> ::cudaSetDevice() with >>>>> >>>>>>>>>>> device simultaneously) > >>>>> >>>>>>>>>>>> Peer access from TITAN X (Pascal) (GPU0) -> TITAN X >>>>> (Pascal) >>>>> >>>>>>> (GPU1) : >>>>> >>>>>>>>>>> Yes >>>>> 
>>>>>>>>>>>> Peer access from TITAN X (Pascal) (GPU0) -> TITAN X >>>>> (Pascal) >>>>> >>>>>>> (GPU2) : >>>>> >>>>>>>>>>> No >>>>> >>>>>>>>>>>> Peer access from TITAN X (Pascal) (GPU0) -> TITAN X >>>>> (Pascal) >>>>> >>>>>>> (GPU3) : >>>>> >>>>>>>>>>> No >>>>> >>>>>>>>>>>> Peer access from TITAN X (Pascal) (GPU1) -> TITAN X >>>>> (Pascal) >>>>> >>>>>>> (GPU0) : >>>>> >>>>>>>>>>> Yes >>>>> >>>>>>>>>>>> Peer access from TITAN X (Pascal) (GPU1) -> TITAN X >>>>> (Pascal) >>>>> >>>>>>> (GPU2) : >>>>> >>>>>>>>>>> No >>>>> >>>>>>>>>>>> Peer access from TITAN X (Pascal) (GPU1) -> TITAN X >>>>> (Pascal) >>>>> >>>>>>> (GPU3) : >>>>> >>>>>>>>>>> No >>>>> >>>>>>>>>>>> Peer access from TITAN X (Pascal) (GPU2) -> TITAN X >>>>> (Pascal) >>>>> >>>>>>> (GPU0) : >>>>> >>>>>>>>>>> No >>>>> >>>>>>>>>>>> Peer access from TITAN X (Pascal) (GPU2) -> TITAN X >>>>> (Pascal) >>>>> >>>>>>> (GPU1) : >>>>> >>>>>>>>>>> No >>>>> >>>>>>>>>>>> Peer access from TITAN X (Pascal) (GPU2) -> TITAN X >>>>> (Pascal) >>>>> >>>>>>> (GPU3) : >>>>> >>>>>>>>>>> Yes >>>>> >>>>>>>>>>>> Peer access from TITAN X (Pascal) (GPU3) -> TITAN X >>>>> (Pascal) >>>>> >>>>>>> (GPU0) : >>>>> >>>>>>>>>>> No >>>>> >>>>>>>>>>>> Peer access from TITAN X (Pascal) (GPU3) -> TITAN X >>>>> (Pascal) >>>>> >>>>>>> (GPU1) : >>>>> >>>>>>>>>>> No >>>>> >>>>>>>>>>>> Peer access from TITAN X (Pascal) (GPU3) -> TITAN X >>>>> (Pascal) >>>>> >>>>>>> (GPU2) : >>>>> >>>>>>>>>>> Yes >>>>> >>>>>>>>>>> >>>>> >>>>>>>>>>> deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = >>>>> 8.0, >>>>> >>>>>>> CUDA >>>>> >>>>>>>>>>> Runtime Version = 8.0, NumDevs = 4, Device0 = TITAN X >>>>> >>>>>>> (Pascal), >>>>> >>>>>>>>>>> Device1 >>>>> >>>>>>>>>>> = TITAN X (Pascal), Device2 = TITAN X (Pascal), Device3 = >>>>> >>>>>>> TITAN X >>>>> >>>>>>>>>>> (Pascal) >>>>> >>>>>>>>>>> Result = PASS >>>>> >>>>>>>>>>> >>>>> >>>>>>>>>>> >>>>> >>>>>>>>>>> >>>>> >>>>>>>>>>> Now not everything is rosy >>>>> >>>>>>>>>>> >>>>> >>>>>>>>>>> root at gpu3$ cd >>>>> 
~/NVIDIA_CUDA-8.0_Samples/5_Simulations/nbody >>>>> >>>>>>>>>>> root at gpu3$ make >>>>> >>>>>>>>>>>>>> WARNING - libGL.so not found, refer to CUDA Getting >>>>> >>>>>>> Started Guide >>>>> >>>>>>>>>>> for how to find and install them. <<< >>>>> >>>>>>>>>>>>>> WARNING - libGLU.so not found, refer to CUDA Getting >>>>> >>>>>>> Started Guide >>>>> >>>>>>>>>>> for how to find and install them. <<< >>>>> >>>>>>>>>>>>>> WARNING - libX11.so not found, refer to CUDA Getting >>>>> >>>>>>> Started Guide >>>>> >>>>>>>>>>> for how to find and install them. <<< >>>>> >>>>>>>>>>> >>>>> >>>>>>>>>>> >>>>> >>>>>>>>>>> even though those are installed. For example >>>>> >>>>>>>>>>> >>>>> >>>>>>>>>>> root at gpu3$ yum whatprovides */libX11.so >>>>> >>>>>>>>>>> libX11-devel-1.6.3-2.el7.i686 : Development files for >>>>> libX11 >>>>> >>>>>>>>>>> Repo : core >>>>> >>>>>>>>>>> Matched from: >>>>> >>>>>>>>>>> Filename : /usr/lib/libX11.so >>>>> >>>>>>>>>>> >>>>> >>>>>>>>>>> also >>>>> >>>>>>>>>>> >>>>> >>>>>>>>>>> mesa-libGLU-devel >>>>> >>>>>>>>>>> mesa-libGL-devel >>>>> >>>>>>>>>>> xorg-x11-drv-nvidia-devel >>>>> >>>>>>>>>>> >>>>> >>>>>>>>>>> but >>>>> >>>>>>>>>>> >>>>> >>>>>>>>>>> root at gpu3$ yum -y install mesa-libGLU-devel >>>>> mesa-libGL-devel >>>>> >>>>>>>>>>> xorg-x11-drv-nvidia-devel >>>>> >>>>>>>>>>> Package mesa-libGLU-devel-9.0.0-4.el7.x86_64 already >>>>> >>>>>>> installed and >>>>> >>>>>>>>>>> latest version >>>>> >>>>>>>>>>> Package mesa-libGL-devel-10.6.5-3.20150824.el7.x86_64 >>>>> already >>>>> >>>>>>>>>>> installed >>>>> >>>>>>>>>>> and latest version >>>>> >>>>>>>>>>> Package 1:xorg-x11-drv-nvidia-devel-367.48-1.el7.x86_64 >>>>> >>>>>>> already >>>>> >>>>>>>>>>> installed and latest version >>>>> >>>>>>>>>>> >>>>> >>>>>>>>>>> Also from MATLAB gpuDevice hangs. >>>>> >>>>>>>>>>> >>>>> >>>>>>>>>>> So we still don't have a working installation. Any help >>>>> would >>>>> >>>>>>> be >>>>> >>>>>>>>>>> appreciated. 
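The libGL/libGLU/libX11 warnings above come from the samples' findgl.mk not knowing where Springdale keeps its libraries. A possible workaround, assuming (unverified) that findgl.mk honors a GLPATH override like stock CUDA sample makefiles, is to point it at the right directory explicitly; find_lib_dir below is a helper invented for illustration.

```shell
# Hypothetical helper: print the first directory that contains the named
# library, which is roughly what findgl.mk is failing to do on Springdale.
find_lib_dir() {
  lib="$1"; shift
  for d in "$@"; do
    if [ -e "$d/$lib" ]; then
      echo "$d"
      return 0
    fi
  done
  return 1
}

# On the node, something like:
#   make GLPATH="$(find_lib_dir libGL.so /usr/lib64 /usr/lib /usr/lib64/nvidia)"
```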
>>>>> >>>>>>>>>>> >>>>> >>>>>>>>>>> Best, >>>>> >>>>>>>>>>> Predrag >>>>> >>>>>>>>>>> >>>>> >>>>>>>>>>> P.S. Once we have a working installation we can think of >>>>> >>>>>>> installing >>>>> >>>>>>>>>>> Caffe and TensorFlow. For now we have to see why the >>>>> things >>>>> >>>>>>> are not >>>>> >>>>>>>>>>> working. >>>>> >>>>>>>>>>> >>>>> >>>>>>>>>>> >>>>> >>>>>>>>>>> >>>>> >>>>>>>>>>> >>>>> >>>>>>>>>>> >>>>> >>>>>>>>>>> >>>>> >>>>>>>>>>>> >>>>> >>>>>>>>>>>>> On Oct 12, 2016, at 6:26 PM, Predrag Punosevac >>>>> >>>>>>> >>>>> >>>>>>>>>>>>> wrote: >>>>> >>>>>>>>>>>>> >>>>> >>>>>>>>>>>>> Dear Autonians, >>>>> >>>>>>>>>>>>> >>>>> >>>>>>>>>>>>> GPU3 is "configured". Namely you can log into it and all >>>>> >>>>>>> packages >>>>> >>>>>>>>>>>>> are >>>>> >>>>>>>>>>>>> installed. However I couldn't get NVIDIA provided CUDA >>>>> >>>>>>> driver to >>>>> >>>>>>>>>>>>> recognize GPU cards. They appear to be properly >>>>> installed >>>>> >>>>>>> from the >>>>> >>>>>>>>>>>>> hardware point of view and you can list them with >>>>> >>>>>>>>>>>>> >>>>> >>>>>>>>>>>>> lshw -class display >>>>> >>>>>>>>>>>>> >>>>> >>>>>>>>>>>>> root at gpu3$ lshw -class display >>>>> >>>>>>>>>>>>> *-display UNCLAIMED >>>>> >>>>>>>>>>>>> description: VGA compatible controller >>>>> >>>>>>>>>>>>> product: NVIDIA Corporation >>>>> >>>>>>>>>>>>> vendor: NVIDIA Corporation >>>>> >>>>>>>>>>>>> physical id: 0 >>>>> >>>>>>>>>>>>> bus info: pci at 0000:02:00.0 >>>>> >>>>>>>>>>>>> version: a1 >>>>> >>>>>>>>>>>>> width: 64 bits >>>>> >>>>>>>>>>>>> clock: 33MHz >>>>> >>>>>>>>>>>>> capabilities: pm msi pciexpress vga_controller >>>>> >>>>>>> cap_list >>>>> >>>>>>>>>>>>> configuration: latency=0 >>>>> >>>>>>>>>>>>> resources: iomemory:383f0-383ef >>>>> iomemory:383f0-383ef >>>>> >>>>>>>>>>>>> memory:cf000000-cfffffff >>>>> memory:383fe0000000-383fefffffff >>>>> >>>>>>>>>>>>> memory:383ff0000000-383ff1ffffff ioport:6000(size=128) >>>>> >>>>>>>>>>>>> memory:d0000000-d007ffff >>>>> >>>>>>>>>>>>> *-display 
UNCLAIMED >>>>> >>>>>>>>>>>>> description: VGA compatible controller >>>>> >>>>>>>>>>>>> product: NVIDIA Corporation >>>>> >>>>>>>>>>>>> vendor: NVIDIA Corporation >>>>> >>>>>>>>>>>>> physical id: 0 >>>>> >>>>>>>>>>>>> bus info: pci at 0000:03:00.0 >>>>> >>>>>>>>>>>>> version: a1 >>>>> >>>>>>>>>>>>> width: 64 bits >>>>> >>>>>>>>>>>>> clock: 33MHz >>>>> >>>>>>>>>>>>> capabilities: pm msi pciexpress vga_controller >>>>> >>>>>>> cap_list >>>>> >>>>>>>>>>>>> configuration: latency=0 >>>>> >>>>>>>>>>>>> resources: iomemory:383f0-383ef >>>>> iomemory:383f0-383ef >>>>> >>>>>>>>>>>>> memory:cd000000-cdffffff >>>>> memory:383fc0000000-383fcfffffff >>>>> >>>>>>>>>>>>> memory:383fd0000000-383fd1ffffff ioport:5000(size=128) >>>>> >>>>>>>>>>>>> memory:ce000000-ce07ffff >>>>> >>>>>>>>>>>>> *-display >>>>> >>>>>>>>>>>>> description: VGA compatible controller >>>>> >>>>>>>>>>>>> product: ASPEED Graphics Family >>>>> >>>>>>>>>>>>> vendor: ASPEED Technology, Inc. >>>>> >>>>>>>>>>>>> physical id: 0 >>>>> >>>>>>>>>>>>> bus info: pci at 0000:06:00.0 >>>>> >>>>>>>>>>>>> version: 30 >>>>> >>>>>>>>>>>>> width: 32 bits >>>>> >>>>>>>>>>>>> clock: 33MHz >>>>> >>>>>>>>>>>>> capabilities: pm msi vga_controller bus_master >>>>> >>>>>>> cap_list rom >>>>> >>>>>>>>>>>>> configuration: driver=ast latency=0 >>>>> >>>>>>>>>>>>> resources: irq:19 memory:cb000000-cbffffff >>>>> >>>>>>>>>>>>> memory:cc000000-cc01ffff ioport:4000(size=128) >>>>> >>>>>>>>>>>>> *-display UNCLAIMED >>>>> >>>>>>>>>>>>> description: VGA compatible controller >>>>> >>>>>>>>>>>>> product: NVIDIA Corporation >>>>> >>>>>>>>>>>>> vendor: NVIDIA Corporation >>>>> >>>>>>>>>>>>> physical id: 0 >>>>> >>>>>>>>>>>>> bus info: pci at 0000:82:00.0 >>>>> >>>>>>>>>>>>> version: a1 >>>>> >>>>>>>>>>>>> width: 64 bits >>>>> >>>>>>>>>>>>> clock: 33MHz >>>>> >>>>>>>>>>>>> capabilities: pm msi pciexpress vga_controller >>>>> >>>>>>> cap_list >>>>> >>>>>>>>>>>>> configuration: latency=0 >>>>> >>>>>>>>>>>>> resources: iomemory:387f0-387ef >>>>> 
iomemory:387f0-387ef >>>>> >>>>>>>>>>>>> memory:fa000000-faffffff >>>>> memory:387fe0000000-387fefffffff >>>>> >>>>>>>>>>>>> memory:387ff0000000-387ff1ffffff ioport:e000(size=128) >>>>> >>>>>>>>>>>>> memory:fb000000-fb07ffff >>>>> >>>>>>>>>>>>> *-display UNCLAIMED >>>>> >>>>>>>>>>>>> description: VGA compatible controller >>>>> >>>>>>>>>>>>> product: NVIDIA Corporation >>>>> >>>>>>>>>>>>> vendor: NVIDIA Corporation >>>>> >>>>>>>>>>>>> physical id: 0 >>>>> >>>>>>>>>>>>> bus info: pci at 0000:83:00.0 >>>>> >>>>>>>>>>>>> version: a1 >>>>> >>>>>>>>>>>>> width: 64 bits >>>>> >>>>>>>>>>>>> clock: 33MHz >>>>> >>>>>>>>>>>>> capabilities: pm msi pciexpress vga_controller >>>>> >>>>>>> cap_list >>>>> >>>>>>>>>>>>> configuration: latency=0 >>>>> >>>>>>>>>>>>> resources: iomemory:387f0-387ef >>>>> iomemory:387f0-387ef >>>>> >>>>>>>>>>>>> memory:f8000000-f8ffffff >>>>> memory:387fc0000000-387fcfffffff >>>>> >>>>>>>>>>>>> memory:387fd0000000-387fd1ffffff ioport:d000(size=128) >>>>> >>>>>>>>>>>>> memory:f9000000-f907ffff >>>>> >>>>>>>>>>>>> >>>>> >>>>>>>>>>>>> >>>>> >>>>>>>>>>>>> However what scares the hell out of me is that I don't >>>>> see >>>>> >>>>>>> NVIDIA >>>>> >>>>>>>>>>>>> driver >>>>> >>>>>>>>>>>>> loaded >>>>> >>>>>>>>>>>>> >>>>> >>>>>>>>>>>>> lsmod|grep nvidia >>>>> >>>>>>>>>>>>> >>>>> >>>>>>>>>>>>> and the device nodes /dev/nvidia are not created. I am >>>>> >>>>>>> guessing I >>>>> >>>>>>>>>>>>> just >>>>> >>>>>>>>>>>>> missed some trivial step during the CUDA installation >>>>> which >>>>> >>>>>>> is very >>>>> >>>>>>>>>>>>> involving. I am unfortunately too tired to debug this >>>>> >>>>>>> tonight. 
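The two symptoms described above (no nvidia module in lsmod, no /dev/nvidia* device nodes) are the first things to re-check after any driver reinstall. A minimal triage sketch, factored so the lsmod parsing can be exercised without GPU hardware; driver_loaded is a name invented here.

```shell
# Reads lsmod-style output on stdin and reports whether the nvidia kernel
# module is present (the '^nvidia ' anchor skips nvidia_uvm, nvidia_drm, etc.).
driver_loaded() {
  if grep -q '^nvidia '; then
    echo "nvidia module loaded"
  else
    echo "nvidia module missing"
  fi
}

# On the node: lsmod | driver_loaded   (and also check: ls /dev/nvidia*)
printf 'ast 12345 1\n' | driver_loaded
# -> nvidia module missing
```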
>>>>> >>>>>>>>>>>>> >>>>> >>>>>>>>>>>>> Predrag >>>>> >>>>>>>>>>> >>>>> >>>>>>>>> >>>>> >>>>>>>>> >>>>> >>>>>> >>>>> >>>>>> >>>>> >>>>>> Links: >>>>> >>>>>> ------ >>>>> >>>>>> [1] http://findgl.mk From dougal at gmail.com Thu Oct 13 13:51:23 2016 From: dougal at gmail.com (Dougal Sutherland) Date: Thu, 13 Oct 2016 17:51:23 +0000 Subject: GPU3 is "configured" In-Reply-To: <576ceb12fa4fffb3b72b68a742a9b0b1@imap.srv.cs.cmu.edu> References: <20161012222658.bee4yVvYw%predragp@cs.cmu.edu> <20161013022332.Coy6tY6j7%predragp@cs.cmu.edu> <22C1E88D-D614-4509-B0EC-3EB2357C0C3A@andrew.cmu.edu> <20161013153826.f4agzWkMb%predragp@cs.cmu.edu> <3ad25168d2dc7502872b0cde94950655@imap.srv.cs.cmu.edu> <576ceb12fa4fffb3b72b68a742a9b0b1@imap.srv.cs.cmu.edu> Message-ID: I actually haven't gotten tensorflow working yet -- the bazel build just hangs on me. I think it maybe has to do with home directories being on NFS, but I can't figure out bazel at all. I'll try some more tonight. Caffe should be workable following the instructions Predrag forwarded. - Dougal On Thu, Oct 13, 2016 at 6:39 PM Predrag Punosevac < predragp at imap.srv.cs.cmu.edu> wrote: > Dear Autonians, > > In case anybody is interested in what happens behind the scenes, Doug > got Caffe and TensorFlow to work on > GPU3. Please see the message below. I also got very useful feedback > from the Princeton and Rutgers people. Please check it out if you care (you > will have to log into Gmail to see the exchange). > > https://groups.google.com/forum/#!forum/springdale-users > > I need to think about how we move forward with this before I start pulling > any triggers. If somebody is itchy and can't wait, please build Caffe and > TensorFlow in your scratch directory following the howto below. > > Predrag > > On 2016-10-13 13:24, Dougal Sutherland wrote: > > A note about cudnn: > > > > There are a bunch of versions of cudnn. 
They're not > > backwards-compatible, and different versions of > > caffe/tensorflow/whatever want different ones. > > > > I am currently using the setup in ~dsutherl/cudnn_files: > > > > * I have a bunch of versions of the installer there. > > * The use-cudnn.sh script, intended to be used like "source > > use-cudnn.sh 5.1", will untar the appropriate one into a scratch > > directory (if it hasn't already been done) and set > > CPATH/LIBRARY_PATH/LD_LIBRARY_PATH appropriately. LD_LIBRARY_PATH is > > needed for caffe binaries, since they don't link to the absolute path; > > the first two (not sure about the third) are needed for theano. > > Dunno about tensorflow yet. > > > > So, here's the Caffe setup: > > > > cd /home/scratch/$USER > > git clone https://github.com/BVLC/caffe > > cd caffe > > cp Makefile.config.example Makefile.config > > > > # tell it to use openblas; using atlas needs some changes to the > > Makefile > > sed -i 's/BLAS := atlas/BLAS := open/' Makefile.config > > > > # configure to use cudnn (optional) > > source ~dsutherl/cudnn-files/use-cudnn.sh 5.1 > > sed -i 's/# USE_CUDNN := 1/USE_CUDNN := 1/' Makefile.config > > perl -i -pe 's|$| '$CUDNN_DIR'/include| if /INCLUDE_DIRS :=/' > > Makefile.config > > perl -i -pe 's|$| '$CUDNN_DIR'/lib64| if /LIBRARY_DIRS :=/' > > Makefile.config > > > > # build the library > > make -j23 > > > > # to do tests (takes ~10 minutes): > > make -j23 test > > make runtest > > > > # Now, to run caffe binaries you'll need to remember to source > > use-cudnn if you used cudnn before. > > > > # To build the python library: > > make py > > > > # Requirements for the python library: > > # Some of the system packages are too old; this installs them in your > > scratch directory. > > # You'll have to set PYTHONUSERBASE again before running any python > > processes that use these libs. 
> > export PYTHONUSERBASE=$HOME/scratch/.local; > > export PATH=$PYTHONUSERBASE/bin:"$PATH" # <- optional > > pip install --user -r python/requirements.txt > > > > # Caffe is dumb and doesn't package its python library properly. The > > easiest way to use it is: > > export PYTHONPATH=/home/scratch/$USER/caffe/python:$PYTHONPATH > > python -c 'import caffe' > > > > On Thu, Oct 13, 2016 at 6:01 PM Dougal Sutherland > > wrote: > > > >> Java fix seemed to work. Now tensorflow wants python-wheel and > >> swig. > >> > >> On Thu, Oct 13, 2016 at 5:08 PM Predrag Punosevac > >> wrote: > >> > >>> On 2016-10-13 11:46, Dougal Sutherland wrote: > >>> > >>>> Having some trouble with tensorflow, because: > >>> > >>>> > >>> > >>>> * it require's Google's bazel build system > >>> > >>>> > >>> > >>>> * The bazel installer says > >>> > >>>> Java version is 1.7.0_111 while at least 1.8 is needed. > >>> > >>>> * > >>> > >>>> > >>> > >>>> * $ java -version > >>> > >>>> openjdk version "1.8.0_102" > >>> > >>>> OpenJDK Runtime Environment (build 1.8.0_102-b14) > >>> > >>>> OpenJDK 64-Bit Server VM (build 25.102-b14, mixed mode) > >>> > >>>> $ javac -version > >>> > >>>> javac 1.7.0_111 > >>> > >>>> > >>> > >>> I just did yum -y install java-1.8.0* which installs openjdk 1.8. > >>> Please > >>> > >>> change your java. 
Let me know if > >>> > >>> you want me to install Oracle JDK 1.8 > >>> > >>> Predrag > >>> > >>>> On Thu, Oct 13, 2016 at 4:38 PM Predrag Punosevac > >>> > >>>> wrote: > >>> > >>>> > >>> > >>>>> Dougal Sutherland wrote: > >>> > >>>>> > >>> > >>>>>> Also, this seemed to work for me so far for protobuf: > >>> > >>>>>> > >>> > >>>>>> cd /home/scratch/$USER > >>> > >>>>>> VER=3.1.0 > >>> > >>>>>> wget > >>> > >>>>>> > >>> > >>>>> > >>> > >>>> > >>> > >> > > > https://github.com/google/protobuf/releases/download/v$VER/protobuf-cpp-$VER.tar.gz > >>> > >>>>>> tar xf protobuf-cpp-$VER.tar.gz > >>> > >>>>>> cd protobuf-cpp-$VER > >>> > >>>>>> ./configure --prefix=/home/scratch/$USER > >>> > >>>>>> make -j12 > >>> > >>>>>> make -j12 check > >>> > >>>>>> make install > >>> > >>>>> > >>> > >>>>> That is great help! > >>> > >>>>> > >>> > >>>>>> > >>> > >>>>>> You could change --prefix=/usr if making an RPM. > >>> > >>>>>> > >>> > >>>>>> On Thu, Oct 13, 2016 at 4:26 PM Dougal Sutherland > >>> > >>>>> wrote: > >>> > >>>>>> > >>> > >>>>>>> Some more packages for caffe: > >>> > >>>>>>> > >>> > >>>>>>> leveldb-devel snappy-devel opencv-devel boost-devel > >>> > >>>>>>> hdf5-devel gflags-devel glog-devel lmdb-devel > >>> > >>>>>>> > >>> > >>>>>>> (Some of those might be installed already, but at least > >>> gflags > >>> > >>>>> is > >>> > >>>>>>> definitely missing.) > >>> > >>>>>>> > >>> > >>>>>>> On Thu, Oct 13, 2016 at 3:44 PM Predrag Punosevac < > >>> > >>>>>>> predragp at imap.srv.cs.cmu.edu> wrote: > >>> > >>>>>>> > >>> > >>>>>>> On 2016-10-12 23:26, Arne Suppe wrote: > >>> > >>>>>>>> Hmm - I don't use matlab for deep learning, but gpuDevice > >>> > >>>>> also hangs > >>> > >>>>>>>> on my computer with R2016a. > >>> > >>>>>>>> > >>> > >>>>>>> > >>> > >>>>>>> We would have to escalate this with MathWorks. I have seen > >>> workarounds > >>> > >>>>> on the > >>> > >>>>>>> Internet but it looks like a bug in one of the MathWorks-provided > >>> > >>>>> MEX files. 
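The bazel complaint quoted earlier in this message ("Java version is 1.7.0_111 while at least 1.8 is needed", even though java -version reports 1.8) happens when java and javac resolve to different JDKs. Below is a sketch of the version check bazel is effectively doing; jdk_feature is a name invented here, and the "1.x.y_zz" string format is assumed from the outputs in the thread.

```shell
# Extract the JDK feature release from an old-style Java version string.
jdk_feature() {
  # "1.7.0_111" -> 7, "1.8.0_102" -> 8
  echo "$1" | cut -d. -f2
}

if [ "$(jdk_feature 1.7.0_111)" -lt 8 ]; then
  echo "javac too old for bazel"
fi
# -> javac too old for bazel
```

On RHEL-family systems the usual fix is `alternatives --config javac`, or installing java-1.8.0-openjdk-devel so that javac matches java.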
> >>> > >>>>>>> > >>> > >>>>>>>> I was able to compile the matrixMul example in the CUDA > >>> samples > >>> > >>>>> and run > >>> > >>>>>>>> it on gpu3, so I think the build environment is probably > >>> all > >>> > >>>>> set. > >>> > >>>>>>>> > >>> > >>>>>>>> As for the openGL, I think it's possibly a problem with > >>> their > >>> > >>>>> build > >>> > >>>>>>>> script findgl.mk [1] which is not familiar with > >>> Springdale OS. > >>> > >>>>> The > >>> > >>>>>>>> demo_suite directory has a precompiled nbody binary you may > >>> > >>>>> try, but I > >>> > >>>>>>>> suspect most users will not need graphics. > >>> > >>>>>>>> > >>> > >>>>>>> > >>> > >>>>>>> That should not be too hard to fix. Some header files have to > >>> be > >>> > >>>>>>> manually edited. The funny part is that until 7.2 the Princeton people > >>> > >>>>> didn't bother > >>> > >>>>>>> to remove the RHEL branding, which actually made things easier for > >>> > >>>>> us. > >>> > >>>>>>> > >>> > >>>>>>> > >>> > >>>>>>> Doug is trying right now to compile the latest Caffe, > >>> > >>>>> TensorFlow, and > >>> > >>>>>>> protobuf-3. We will try to create an RPM for that so that we > >>> > >>>>> don't have > >>> > >>>>>>> to go through this again. I also asked the Princeton and Rutgers > >>> > >>>>> guys if > >>> > >>>>>>> they > >>> > >>>>>>> have WIP RPMs to share. > >>> > >>>>>>> > >>> > >>>>>>> Predrag > >>> > >>>>>>> > >>> > >>>>>>>> Arne > >>> > >>>>>>>> > >>> > >>>>>>>> > >>> > >>>>>>>> > >>> > >>>>>>>> > >>> > >>>>>>>>> On Oct 12, 2016, at 10:23 PM, Predrag Punosevac > >>> > >>>>> > >>> > >>>>>>>>> wrote: > >>> > >>>>>>>>> > >>> > >>>>>>>>> Arne Suppe wrote: > >>> > >>>>>>>>> > >>> > >>>>>>>>>> Hi Predrag, > >>> > >>>>>>>>>> Don't know if this applies to you, but I just built a > >>> > >>>>> machine with > >>> > >>>>>>>>>> a GTX1080 which has the same PASCAL architecture as the > >>> > >>>>> Titan. 
After > >>> > >>>>>>>>>> installing CUDA 8, I still found I needed to install the > >>> > >>>>> latest > >>> > >>>>>>>>>> driver off of the NVIDIA web site to get the card > >>> > >>>>> recognized. Right > >>> > >>>>>>>>>> now, I am running 367.44. > >>> > >>>>>>>>>> > >>> > >>>>>>>>>> Arne > >>> > >>>>>>>>> > >>> > >>>>>>>>> Arne, > >>> > >>>>>>>>> > >>> > >>>>>>>>> Thank you so much for this e-mail. Yes, it is the damn PASCAL > >>> > >>>>> architecture; I > >>> > >>>>>>>>> see lots of people complaining about it on the forums. I > >>> > >>>>> downloaded > >>> > >>>>>>>>> and > >>> > >>>>>>>>> installed the driver from > >>> > >>>>>>>>> > >>> > >>>>>>>>> > >>> > >>>>>>> > >>> > >>>>> > >>> > >>>> > >>> > >> > > > http://www.nvidia.com/content/DriverDownload-March2009/confirmation.php?url=/XFree86/Linux-x86_64/367.57/NVIDIA-Linux-x86_64-367.57.run&lang=us&type=GeForce > >>> > >>>>>>>>> > >>> > >>>>>>>>> That seems to have made a real difference. Check out these > >>> > >>>>> beautiful outputs > >>> > >>>>>>>>> > >>> > >>>>>>>>> root at gpu3$ ls nvidia* > >>> > >>>>>>>>> nvidia0 nvidia1 nvidia2 nvidia3 nvidiactl nvidia-uvm > >>> > >>>>>>>>> nvidia-uvm-tools > >>> > >>>>>>>>> > >>> > >>>>>>>>> root at gpu3$ lspci | grep -i nvidia > >>> > >>>>>>>>> 02:00.0 VGA compatible controller: NVIDIA Corporation > >>> Device > >>> > >>>>> 1b00 (rev > >>> > >>>>>>>>> a1) > >>> > >>>>>>>>> 02:00.1 Audio device: NVIDIA Corporation Device 10ef (rev > >>> a1) > >>> > >>>>>>>>> 03:00.0 VGA compatible controller: NVIDIA Corporation > >>> Device > >>> > >>>>> 1b00 (rev > >>> > >>>>>>>>> a1) > >>> > >>>>>>>>> 03:00.1 Audio device: NVIDIA Corporation Device 10ef (rev > >>> a1) > >>> > >>>>>>>>> 82:00.0 VGA compatible controller: NVIDIA Corporation > >>> Device > >>> > >>>>> 1b00 (rev > >>> >
>>>>>>>>> a1) > >>> > >>>>>>>>> 83:00.1 Audio device: NVIDIA Corporation Device 10ef (rev > >>> a1) > >>> > >>>>>>>>> > >>> > >>>>>>>>> > >>> > >>>>>>>>> root at gpu3$ ls /proc/driver > >>> > >>>>>>>>> nvidia nvidia-uvm nvram rtc > >>> > >>>>>>>>> > >>> > >>>>>>>>> root at gpu3$ lsmod |grep nvidia > >>> > >>>>>>>>> nvidia_uvm 738901 0 > >>> > >>>>>>>>> nvidia_drm 43405 0 > >>> > >>>>>>>>> nvidia_modeset 764432 1 nvidia_drm > >>> > >>>>>>>>> nvidia 11492947 2 nvidia_modeset,nvidia_uvm > >>> > >>>>>>>>> drm_kms_helper 125056 2 ast,nvidia_drm > >>> > >>>>>>>>> drm 349210 5 > >>> > >>>>> ast,ttm,drm_kms_helper,nvidia_drm > >>> > >>>>>>>>> i2c_core 40582 7 > >>> > >>>>>>>>> ast,drm,igb,i2c_i801,drm_kms_helper,i2c_algo_bit,nvidia > >>> > >>>>>>>>> > >>> > >>>>>>>>> root at gpu3$ nvidia-smi > >>> > >>>>>>>>> Wed Oct 12 22:03:27 2016 > >>> > >>>>>>>>> > >>> > >>>>>>> > >>> > >>>>> > >>> > >>>> > >>> > >> > > > +-----------------------------------------------------------------------------+ > >>> > >>>>>>>>> | NVIDIA-SMI 367.57 Driver Version: 367.57 > >>> > >>>>>>>>> | > >>> > >>>>>>>>> > >>> > >>>>>>> > >>> > >>>>> > >>> > >>>> > >>> > >> > > > |-------------------------------+----------------------+----------------------+ > >>> > >>>>>>>>> | GPU Name Persistence-M| Bus-Id Disp.A | > >>> > >>>>> Volatile > >>> > >>>>>>>>> Uncorr. ECC | > >>> > >>>>>>>>> | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | > >>> > >>>>> GPU-Util > >>> > >>>>>>>>> Compute M. 
> |===============================+======================+======================|
> |   0  TITAN X (Pascal)    Off  | 0000:02:00.0     Off |                  N/A |
> | 23%   32C    P0    56W / 250W |      0MiB / 12189MiB |      0%      Default |
> +-------------------------------+----------------------+----------------------+
> |   1  TITAN X (Pascal)    Off  | 0000:03:00.0     Off |                  N/A |
> | 23%   36C    P0    57W / 250W |      0MiB / 12189MiB |      0%      Default |
> +-------------------------------+----------------------+----------------------+
> |   2  TITAN X (Pascal)    Off  | 0000:82:00.0     Off |                  N/A |
> | 23%   35C    P0    57W / 250W |      0MiB / 12189MiB |      0%      Default |
> +-------------------------------+----------------------+----------------------+
> |   3  TITAN X (Pascal)    Off  | 0000:83:00.0     Off |                  N/A |
> |  0%   35C    P0    56W / 250W |      0MiB / 12189MiB |      0%      Default |
> +-------------------------------+----------------------+----------------------+
>
> +-----------------------------------------------------------------------------+
> | Processes:                                                       GPU Memory |
> |  GPU       PID  Type  Process name                               Usage      |
> |=============================================================================|
> |  No running processes found                                                 |
> +-----------------------------------------------------------------------------+
>
> /usr/local/cuda/extras/demo_suite/deviceQuery
>
>   Alignment requirement for Surfaces:            Yes
>   Device has ECC support:                        Disabled
>   Device supports Unified Addressing (UVA):      Yes
>   Device PCI Domain ID / Bus ID / location ID:   0 / 131 / 0
>   Compute Mode:
>      < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> > Peer access from TITAN X (Pascal) (GPU0) -> TITAN X (Pascal) (GPU1) : Yes
> > Peer access from TITAN X (Pascal) (GPU0) -> TITAN X (Pascal) (GPU2) : No
> > Peer access from TITAN X (Pascal) (GPU0) -> TITAN X (Pascal) (GPU3) : No
> > Peer access from TITAN X (Pascal) (GPU1) -> TITAN X (Pascal) (GPU0) : Yes
> > Peer access from TITAN X (Pascal) (GPU1) -> TITAN X (Pascal) (GPU2) : No
> > Peer access from TITAN X (Pascal) (GPU1) -> TITAN X (Pascal) (GPU3) : No
> > Peer access from TITAN X (Pascal) (GPU2) -> TITAN X (Pascal) (GPU0) : No
> > Peer access from TITAN X (Pascal) (GPU2) -> TITAN X (Pascal) (GPU1) : No
> > Peer access from TITAN X (Pascal) (GPU2) -> TITAN X (Pascal) (GPU3) : Yes
> > Peer access from TITAN X (Pascal) (GPU3) -> TITAN X (Pascal) (GPU0) : No
> > Peer access from TITAN X (Pascal) (GPU3) -> TITAN X (Pascal) (GPU1) : No
> > Peer access from TITAN X (Pascal) (GPU3) -> TITAN X (Pascal) (GPU2) : Yes
>
> deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime
> Version = 8.0, NumDevs = 4, Device0 = TITAN X (Pascal), Device1 = TITAN X
> (Pascal), Device2 = TITAN X (Pascal), Device3 = TITAN X (Pascal)
> Result = PASS
>
> Now not everything is rosy:
>
> root@gpu3$ cd ~/NVIDIA_CUDA-8.0_Samples/5_Simulations/nbody
> root@gpu3$ make
> >>> WARNING - libGL.so not found, refer to CUDA Getting Started Guide for how to find and install them. <<<
> >>> WARNING - libGLU.so not found, refer to CUDA Getting Started Guide for how to find and install them. <<<
> >>> WARNING - libX11.so not found, refer to CUDA Getting Started Guide for how to find and install them. <<<
>
> even though those are installed. For example:
>
> root@gpu3$ yum whatprovides */libX11.so
> libX11-devel-1.6.3-2.el7.i686 : Development files for libX11
> Repo        : core
> Matched from:
> Filename    : /usr/lib/libX11.so
>
> also
>
> mesa-libGLU-devel
> mesa-libGL-devel
> xorg-x11-drv-nvidia-devel
>
> but
>
> root@gpu3$ yum -y install mesa-libGLU-devel mesa-libGL-devel xorg-x11-drv-nvidia-devel
> Package mesa-libGLU-devel-9.0.0-4.el7.x86_64 already installed and latest version
> Package mesa-libGL-devel-10.6.5-3.20150824.el7.x86_64 already installed and latest version
> Package 1:xorg-x11-drv-nvidia-devel-367.48-1.el7.x86_64 already installed and latest version
>
> Also, gpuDevice hangs from MATLAB.
>
> So we still don't have a working installation. Any help would be appreciated.
>
> Best,
> Predrag
>
> P.S. Once we have a working installation we can think of installing Caffe
> and TensorFlow. For now we have to see why things are not working.
>
>> On Oct 12, 2016, at 6:26 PM, Predrag Punosevac wrote:
>>
>> Dear Autonians,
>>
>> GPU3 is "configured". Namely, you can log into it and all packages are
>> installed. However, I couldn't get the NVIDIA-provided CUDA driver to
>> recognize the GPU cards. They appear to be properly installed from the
>> hardware point of view and you can list them with
>>
>> lshw -class display
>>
>> root@gpu3$ lshw -class display
>>   *-display UNCLAIMED
>>        description: VGA compatible controller
>>        product: NVIDIA Corporation
>>        vendor: NVIDIA Corporation
>>        physical id: 0
>>        bus info: pci@0000:02:00.0
>>        version: a1
>>        width: 64 bits
>>        clock: 33MHz
>>        capabilities: pm msi pciexpress vga_controller cap_list
>>        configuration: latency=0
>>        resources: iomemory:383f0-383ef iomemory:383f0-383ef
>>          memory:cf000000-cfffffff memory:383fe0000000-383fefffffff
>>          memory:383ff0000000-383ff1ffffff ioport:6000(size=128)
>>          memory:d0000000-d007ffff
>>   *-display UNCLAIMED
>>        description: VGA compatible controller
>>        product: NVIDIA Corporation
>>        vendor: NVIDIA Corporation
>>        physical id: 0
>>        bus info: pci@0000:03:00.0
>>        version: a1
>>        width: 64 bits
>>        clock: 33MHz
>>        capabilities: pm msi pciexpress vga_controller cap_list
>>        configuration: latency=0
>>        resources: iomemory:383f0-383ef iomemory:383f0-383ef
>>          memory:cd000000-cdffffff memory:383fc0000000-383fcfffffff
>>          memory:383fd0000000-383fd1ffffff ioport:5000(size=128)
>>          memory:ce000000-ce07ffff
>>   *-display
>>        description: VGA compatible controller
>>        product: ASPEED Graphics Family
>>        vendor: ASPEED Technology, Inc.
>>        physical id: 0
>>        bus info: pci@0000:06:00.0
>>        version: 30
>>        width: 32 bits
>>        clock: 33MHz
>>        capabilities: pm msi vga_controller bus_master cap_list rom
>>        configuration: driver=ast latency=0
>>        resources: irq:19 memory:cb000000-cbffffff
>>          memory:cc000000-cc01ffff ioport:4000(size=128)
>>   *-display UNCLAIMED
>>        description: VGA compatible controller
>>        product: NVIDIA Corporation
>>        vendor: NVIDIA Corporation
>>        physical id: 0
>>        bus info: pci@0000:82:00.0
>>        version: a1
>>        width: 64 bits
>>        clock: 33MHz
>>        capabilities: pm msi pciexpress vga_controller cap_list
>>        configuration: latency=0
>>        resources: iomemory:387f0-387ef iomemory:387f0-387ef
>>          memory:fa000000-faffffff memory:387fe0000000-387fefffffff
>>          memory:387ff0000000-387ff1ffffff ioport:e000(size=128)
>>          memory:fb000000-fb07ffff
>>   *-display UNCLAIMED
>>        description: VGA compatible controller
>>        product: NVIDIA Corporation
>>        vendor: NVIDIA Corporation
>>        physical id: 0
>>        bus info: pci@0000:83:00.0
>>        version: a1
>>        width: 64 bits
>>        clock: 33MHz
>>        capabilities: pm msi pciexpress vga_controller cap_list
>>        configuration: latency=0
>>        resources: iomemory:387f0-387ef iomemory:387f0-387ef
>>          memory:f8000000-f8ffffff memory:387fc0000000-387fcfffffff
>>          memory:387fd0000000-387fd1ffffff ioport:d000(size=128)
>>          memory:f9000000-f907ffff
>>
>> However, what scares the hell out of me is that I don't see the NVIDIA
>> driver loaded
>>
>> lsmod | grep nvidia
>>
>> and the device nodes /dev/nvidia* are not created. I am guessing I just
>> missed some trivial step during the CUDA installation, which is very
>> involved. I am unfortunately too tired to debug this tonight.
>>
>> Predrag

Links:
------
[1] http://findgl.mk

From dougal at gmail.com Thu Oct 13 13:58:58 2016
From: dougal at gmail.com (Dougal Sutherland)
Date: Thu, 13 Oct 2016 17:58:58 +0000
Subject: GPU3 is "configured"
In-Reply-To: 
References: <20161012222658.bee4yVvYw%predragp@cs.cmu.edu>
 <20161013022332.Coy6tY6j7%predragp@cs.cmu.edu>
 <22C1E88D-D614-4509-B0EC-3EB2357C0C3A@andrew.cmu.edu>
 <20161013153826.f4agzWkMb%predragp@cs.cmu.edu>
 <3ad25168d2dc7502872b0cde94950655@imap.srv.cs.cmu.edu>
 <576ceb12fa4fffb3b72b68a742a9b0b1@imap.srv.cs.cmu.edu>
Message-ID: 

According to the tensorflow site, the conda package doesn't support GPUs.
On Thu, Oct 13, 2016, 6:55 PM Predrag Punosevac
<predragp at imap.srv.cs.cmu.edu> wrote:

> On 2016-10-13 13:51, Dougal Sutherland wrote:
>> I actually haven't gotten tensorflow working yet -- the bazel build
>> just hangs on me. I think it may have to do with home directories
>> being on NFS, but I can't figure out bazel at all. I'll try some more
>> tonight.
>
> According to one of the Princeton guys we could just use conda for
> TensorFlow. Please check it out, and use your scratch directory instead
> of NFS.
>
> Quote:
>
> Hello, Predrag.
>
> We have caffe 1.00rc3 if you are interested.
>
> ftp://ftp.cs.princeton.edu/pub/people/advorkin/SRPM/sd7/caffe-1.00rc3-3.sd7.src.rpm
>
> TensorFlow and protobuf-3 work great with conda
> (http://conda.pydata.org). I just tried and had no problems installing
> it for Python 2.7 and 3.5.
>
>> Caffe should be workable following the instructions Predrag forwarded.
>>
>> - Dougal
>>
>> On Thu, Oct 13, 2016 at 6:39 PM Predrag Punosevac wrote:
>>
>>> Dear Autonians,
>>>
>>> In case anybody is interested in what happens behind the scenes, Doug
>>> got Caffe and TensorFlow to work on GPU3. Please see the message
>>> below. I also got very useful feedback from the Princeton and Rutgers
>>> people. Please check it out if you care (you will have to log into
>>> Gmail to see the exchange):
>>>
>>> https://groups.google.com/forum/#!forum/springdale-users
>>>
>>> I need to think about how we move forward with this before we start
>>> pulling triggers. If somebody is itchy and can't wait, please build
>>> Caffe and TensorFlow in your scratch directory following the howto
>>> below.
>>>
>>> Predrag
>>>
>>> On 2016-10-13 13:24, Dougal Sutherland wrote:
>>>> A note about cudnn:
>>>>
>>>> There are a bunch of versions of cudnn. They're not
>>>> backwards-compatible, and different versions of
>>>> caffe/tensorflow/whatever want different ones.
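[Editor's note: Predrag's suggestion above (use conda in your scratch directory instead of the NFS home) can be sketched roughly as below. The environment prefix, package names, and `conda` invocations are assumptions for illustration, not a tested recipe for these machines; note Dougal's point that the conda tensorflow package at the time was CPU-only.]

```shell
# Hedged sketch: put a conda environment on local scratch rather than the
# NFS-mounted home directory. The prefix path mirrors the convention used
# elsewhere in this thread; package availability is an assumption.
SCRATCH="/home/scratch/${USER:-demo}"
ENV_PREFIX="$SCRATCH/conda-envs/tf"
echo "environment prefix: $ENV_PREFIX"

# Actual install steps (commented out; they require conda on PATH):
# conda create  --prefix "$ENV_PREFIX" python=2.7 -y
# conda install --prefix "$ENV_PREFIX" tensorflow protobuf -y  # CPU-only build
# source activate "$ENV_PREFIX"
```

Keeping the whole environment under `/home/scratch` avoids the NFS issues suspected in the bazel hang above.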
>>>> I currently am using the setup in ~dsutherl/cudnn_files:
>>>>
>>>> * I have a bunch of versions of the installer there.
>>>> * The use-cudnn.sh script, intended to be used like "source
>>>>   use-cudnn.sh 5.1", will untar the appropriate one into a scratch
>>>>   directory (if it hasn't already been done) and set
>>>>   CPATH/LIBRARY_PATH/LD_LIBRARY_PATH appropriately. LD_LIBRARY_PATH
>>>>   is needed for caffe binaries, since they don't link to the absolute
>>>>   path; the first two (not sure about the third) are needed for
>>>>   theano. Dunno about tensorflow yet.
>>>>
>>>> So, here's the Caffe setup:
>>>>
>>>> cd /home/scratch/$USER
>>>> git clone https://github.com/BVLC/caffe
>>>> cd caffe
>>>> cp Makefile.config.example Makefile.config
>>>>
>>>> # tell it to use openblas; using atlas needs some changes to the Makefile
>>>> sed -i 's/BLAS := atlas/BLAS := open/' Makefile.config
>>>>
>>>> # configure to use cudnn (optional)
>>>> source ~dsutherl/cudnn-files/use-cudnn.sh 5.1
>>>> sed -i 's/# USE_CUDNN := 1/USE_CUDNN := 1/' Makefile.config
>>>> perl -i -pe 's|$| '$CUDNN_DIR'/include| if /INCLUDE_DIRS :=/' Makefile.config
>>>> perl -i -pe 's|$| '$CUDNN_DIR'/lib64| if /LIBRARY_DIRS :=/' Makefile.config
>>>>
>>>> # build the library
>>>> make -j23
>>>>
>>>> # to do tests (takes ~10 minutes):
>>>> make -j23 test
>>>> make runtest
>>>>
>>>> # Now, to run caffe binaries you'll need to remember to source
>>>> # use-cudnn if you used cudnn before.
>>>>
>>>> # To build the python library:
>>>> make py
>>>>
>>>> # Requirements for the python library:
>>>> # Some of the system packages are too old; this installs them in your
>>>> # scratch directory.
>>>> # You'll have to set PYTHONUSERBASE again before running any python
>>>> # processes that use these libs.
>>>> export PYTHONUSERBASE=$HOME/scratch/.local
>>>> export PATH=$PYTHONUSERBASE/bin:"$PATH"  # <- optional
>>>> pip install --user -r python/requirements.txt
>>>>
>>>> # Caffe is dumb and doesn't package its python library properly. The
>>>> # easiest way to use it is:
>>>> export PYTHONPATH=/home/scratch/$USER/caffe/python:$PYTHONPATH
>>>> python -c 'import caffe'
>>>>
>>>> On Thu, Oct 13, 2016 at 6:01 PM Dougal Sutherland wrote:
>>>>
>>>>> Java fix seemed to work. Now tensorflow wants python-wheel and swig.
>>>>>
>>>>> On Thu, Oct 13, 2016 at 5:08 PM Predrag Punosevac wrote:
>>>>>
>>>>>> On 2016-10-13 11:46, Dougal Sutherland wrote:
>>>>>>
>>>>>>> Having some trouble with tensorflow, because:
>>>>>>>
>>>>>>> * it requires Google's bazel build system
>>>>>>> * The bazel installer says
>>>>>>>   "Java version is 1.7.0_111 while at least 1.8 is needed."
>>>>>>>
>>>>>>> * $ java -version
>>>>>>>   openjdk version "1.8.0_102"
>>>>>>>   OpenJDK Runtime Environment (build 1.8.0_102-b14)
>>>>>>>   OpenJDK 64-Bit Server VM (build 25.102-b14, mixed mode)
>>>>>>>   $ javac -version
>>>>>>>   javac 1.7.0_111
>>>>>>
>>>>>> I just did yum -y install java-1.8.0* which installs openjdk 1.8.
>>>>>> Please change your java.
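[Editor's note: a minimal sketch of a "use-cudnn" helper like the one Dougal describes above: untar a versioned cuDNN tarball into scratch once, then export the compiler and linker search paths. The tarball name and layout, and the scratch location, are assumptions; only the three exported variables come from the thread.]

```shell
# Sketch of use-cudnn.sh; intended to be sourced, e.g.:  source use-cudnn.sh 5.1
VER="${1:-5.1}"
SCRATCH="${SCRATCH:-/home/scratch/${USER:-demo}}"
CUDNN_DIR="$SCRATCH/cudnn-$VER"

# Untar once into scratch (tarball name/layout is a guess):
if [ ! -d "$CUDNN_DIR/include" ]; then
  mkdir -p "$CUDNN_DIR" 2>/dev/null || true
  # tar xzf ~dsutherl/cudnn_files/cudnn-$VER-linux-x64.tgz \
  #     -C "$CUDNN_DIR" --strip-components=1
fi

export CPATH="$CUDNN_DIR/include${CPATH:+:$CPATH}"                        # headers at compile time
export LIBRARY_PATH="$CUDNN_DIR/lib64${LIBRARY_PATH:+:$LIBRARY_PATH}"     # libs at link time
export LD_LIBRARY_PATH="$CUDNN_DIR/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"  # libs at run time (caffe binaries)
echo "using cuDNN $VER from $CUDNN_DIR"
```

Because the exports must land in the calling shell, the script has to be sourced rather than executed, which matches the "source use-cudnn.sh 5.1" usage above.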
>>>>>> Let me know if you want me to install Oracle JDK 1.8.
>>>>>>
>>>>>> Predrag
>>>>>>
>>>>>>> On Thu, Oct 13, 2016 at 4:38 PM Predrag Punosevac wrote:
>>>>>>>
>>>>>>> Dougal Sutherland wrote:
>>>>>>>
>>>>>>>> Also, this seemed to work for me so far for protobuf:
>>>>>>>>
>>>>>>>> cd /home/scratch/$USER
>>>>>>>> VER=3.1.0
>>>>>>>> wget https://github.com/google/protobuf/releases/download/v$VER/protobuf-cpp-$VER.tar.gz
>>>>>>>> tar xf protobuf-cpp-$VER.tar.gz
>>>>>>>> cd protobuf-cpp-$VER
>>>>>>>> ./configure --prefix=/home/scratch/$USER
>>>>>>>> make -j12
>>>>>>>> make -j12 check
>>>>>>>> make install
>>>>>>>
>>>>>>> That is great help!
>>>>>>>
>>>>>>>> You could change --prefix=/usr if making an RPM.
>>>>>>>>
>>>>>>>> On Thu, Oct 13, 2016 at 4:26 PM Dougal Sutherland wrote:
>>>>>>>>
>>>>>>>>> Some more packages for caffe:
>>>>>>>>>
>>>>>>>>> leveldb-devel snappy-devel opencv-devel boost-devel
>>>>>>>>> hdf5-devel gflags-devel glog-devel lmdb-devel
>>>>>>>>>
>>>>>>>>> (Some of those might be installed already, but at least gflags
>>>>>>>>> is definitely missing.)
>>>>>>>>>
>>>>>>>>> On Thu, Oct 13, 2016 at 3:44 PM Predrag Punosevac
>>>>>>>>> <predragp at imap.srv.cs.cmu.edu> wrote:
>>>>>>>>>
>>>>>>>>> On 2016-10-12 23:26, Arne Suppe wrote:
>>>>>>>>>> Hmm - I don't use matlab for deep learning, but gpuDevice
>>>>>>>>>> also hangs on my computer with R2016a.
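[Editor's note: a `--prefix=/home/scratch/$USER` install like the protobuf recipe above is only visible to later builds if the usual search paths point at it. A hedged sketch of the exports one would typically add; the variable names are standard POSIX/pkg-config conventions, while the exact subdirectories are assumptions about what `make install` produced.]

```shell
# Make a --prefix=/home/scratch/$USER install usable by later builds.
PREFIX="/home/scratch/${USER:-demo}"
export PATH="$PREFIX/bin${PATH:+:$PATH}"                                       # e.g. protoc
export PKG_CONFIG_PATH="$PREFIX/lib/pkgconfig${PKG_CONFIG_PATH:+:$PKG_CONFIG_PATH}"  # e.g. protobuf.pc
export LD_LIBRARY_PATH="$PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"      # e.g. libprotobuf.so at run time
echo "prefix: $PREFIX"
# protoc --version   # should now resolve to $PREFIX/bin/protoc
```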
>>>>>>>>> We would have to escalate this with MathWorks. I have seen
>>>>>>>>> workarounds on the Internet, but it looks like a bug in one of
>>>>>>>>> the MathWorks-provided MEX files.
>>>>>>>>>
>>>>>>>>>> I was able to compile the matrixMul example in the CUDA samples
>>>>>>>>>> and run it on gpu3, so I think the build environment is
>>>>>>>>>> probably all set.
>>>>>>>>>>
>>>>>>>>>> As for the openGL, I think it's possibly a problem with their
>>>>>>>>>> build script findgl.mk [1], which is not familiar with
>>>>>>>>>> Springdale OS. The demo_suite directory has a precompiled nbody
>>>>>>>>>> binary you may try, but I suspect most users will not need
>>>>>>>>>> graphics.
>>>>>>>>>
>>>>>>>>> That should not be too hard to fix. Some header files have to be
>>>>>>>>> manually edited. The funny part is that until 7.2 the Princeton
>>>>>>>>> people didn't bother to remove the RHEL branding, which actually
>>>>>>>>> made things easier for us.
>>>>>>>>>
>>>>>>>>> Doug is trying right now to compile the latest Caffe,
>>>>>>>>> TensorFlow, and protobuf-3. We will try to create an RPM for
>>>>>>>>> that so that we don't have to go through this again. I also
>>>>>>>>> asked the Princeton and Rutgers guys if they have WIP RPMs to
>>>>>>>>> share.
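[Editor's note: Arne's findgl.mk point suggests a workaround worth trying before editing headers. The CUDA samples' find scripts search a short list of known distro library paths, and overriding that search with a GLPATH variable is a common fix on unrecognized distros; treat both the GLPATH behavior and the library directory below as assumptions to verify against the findgl.mk shipped with the samples.]

```shell
# Hypothetical workaround for the libGL/libGLU/libX11 "not found" warnings:
# point the samples' OpenGL search at the directory that holds the libraries.
GLPATH=/usr/lib64   # where mesa-libGL-devel installs libGL.so on x86_64 (assumption)
export GLPATH
echo "GLPATH=$GLPATH"
# cd ~/NVIDIA_CUDA-8.0_Samples/5_Simulations/nbody
# make
```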
>>>>>>>>> Predrag
>>>>>>>>>
>>>>>>>>>> Arne
>>>>>>>>>>
>>>>>>>>>>> On Oct 12, 2016, at 10:23 PM, Predrag Punosevac wrote:
>>>>>>>>>>>
>>>>>>>>>>> Arne Suppe wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Predrag,
>>>>>>>>>>>> Don't know if this applies to you, but I just built a machine
>>>>>>>>>>>> with a GTX 1080, which has the same Pascal architecture as
>>>>>>>>>>>> the Titan. After installing CUDA 8, I still found I needed to
>>>>>>>>>>>> install the latest driver off of the NVIDIA web site to get
>>>>>>>>>>>> the card recognized. Right now, I am running 367.44.
>>>>>>>>>>>>
>>>>>>>>>>>> Arne
>>>>>>>>>>>
>>>>>>>>>>> Arne,
>>>>>>>>>>>
>>>>>>>>>>> Thank you so much for this e-mail. Yes, it is the damn Pascal
>>>>>>>>>>> architecture; I see lots of people complaining about it on the
>>>>>>>>>>> forums. I downloaded and installed the driver from
>>>>>>>>>>>
>>>>>>>>>>> http://www.nvidia.com/content/DriverDownload-March2009/confirmation.php?url=/XFree86/Linux-x86_64/367.57/NVIDIA-Linux-x86_64-367.57.run&lang=us&type=GeForce
>>>>>>>>>>>
>>>>>>>>>>> That seems to have made a real difference. Check out these
>>>>>>>>>>> beautiful outputs:
>>>>>>>>>>>
>>>>>>>>>>> root@gpu3$ ls nvidia*
>>>>>>>>>>> nvidia0  nvidia1  nvidia2  nvidia3  nvidiactl  nvidia-uvm
>>>>>>>>>>> nvidia-uvm-tools
>>>>>>>>>>>
>>>>>>>>>>> root@gpu3$ lspci | grep -i nvidia
>>>>>>>>>>> 02:00.0 VGA compatible controller: NVIDIA Corporation Device 1b00 (rev a1)
>>>>>>>>>>> 02:00.1 Audio device: NVIDIA Corporation Device 10ef (rev a1)
>>>>>>>>>>> 03:00.0 VGA compatible controller: NVIDIA Corporation Device 1b00 (rev a1)
>>>>>>>>>>> 03:00.1 Audio device: NVIDIA Corporation Device 10ef (rev a1)
>>>>>>>>>>> 82:00.0 VGA compatible controller: NVIDIA Corporation Device 1b00 (rev a1)
>>>>>>>>>>> 82:00.1 Audio device: NVIDIA Corporation Device 10ef (rev a1)
>>>>>>>>>>> 83:00.0 VGA compatible controller: NVIDIA Corporation Device 1b00 (rev a1)
>>>>>>>>>>> 83:00.1 Audio device: NVIDIA Corporation Device 10ef (rev a1)
>>>>>>>>>>>
>>>>>>>>>>> root@gpu3$ ls /proc/driver
>>>>>>>>>>> nvidia  nvidia-uvm  nvram  rtc
>>>>>>>>>>>
>>>>>>>>>>> root@gpu3$ lsmod | grep nvidia
>>>>>>>>>>> nvidia_uvm            738901  0
>>>>>>>>>>> nvidia_drm             43405  0
>>>>>>>>>>> nvidia_modeset        764432  1 nvidia_drm
>>>>>>>>>>> nvidia              11492947  2 nvidia_modeset,nvidia_uvm
>>>>>>>>>>> drm_kms_helper        125056  2 ast,nvidia_drm
>>>>>>>>>>> drm                   349210  5 ast,ttm,drm_kms_helper,nvidia_drm
>>>>>>>>>>> i2c_core               40582  7 ast,drm,igb,i2c_i801,drm_kms_helper,i2c_algo_bit,nvidia
>>>>>>>>>>> root at gpu3$ nvidia-smi > >>>>> > >>>>>>>>>>> Wed Oct 12 22:03:27 2016 > >>>>> > >>>>>>>>>>> > >>>>> > >>>>>>>>> > >>>>> > >>>>>>> > >>>>> > >>>>>> > >>>>> > >>>> > >>> > >> > > > +-----------------------------------------------------------------------------+ > >>>>> > >>>>>>>>>>> | NVIDIA-SMI 367.57 Driver Version: 367.57 > >>>>> > >>>>>>>>>>> | > >>>>> > >>>>>>>>>>> > >>>>> > >>>>>>>>> > >>>>> > >>>>>>> > >>>>> > >>>>>> > >>>>> > >>>> > >>> > >> > > > |-------------------------------+----------------------+----------------------+ > >>>>> > >>>>>>>>>>> | GPU Name Persistence-M| Bus-Id Disp.A | > >>>>> > >>>>>>> Volatile > >>>>> > >>>>>>>>>>> Uncorr. ECC | > >>>>> > >>>>>>>>>>> | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | > >>>>> > >>>>>>> GPU-Util > >>>>> > >>>>>>>>>>> Compute M. | > >>>>> > >>>>>>>>>>> > >>>>> > >>>>>>>>> > >>>>> > >>>>>>> > >>>>> > >>>>>> > >>>>> > >>>> > >>> > >> > > > |===============================+======================+======================| > >>>>> > >>>>>>>>>>> | 0 TITAN X (Pascal) Off | 0000:02:00.0 Off | > >>>>> > >>>>>>>>>>> N/A | > >>>>> > >>>>>>>>>>> | 23% 32C P0 56W / 250W | 0MiB / 12189MiB | > >>>>> > >>>>>>> 0% > >>>>> > >>>>>>>>>>> Default | > >>>>> > >>>>>>>>>>> > >>>>> > >>>>>>>>> > >>>>> > >>>>>>> > >>>>> > >>>>>> > >>>>> > >>>> > >>> > >> > > > +-------------------------------+----------------------+----------------------+ > >>>>> > >>>>>>>>>>> | 1 TITAN X (Pascal) Off | 0000:03:00.0 Off | > >>>>> > >>>>>>>>>>> N/A | > >>>>> > >>>>>>>>>>> | 23% 36C P0 57W / 250W | 0MiB / 12189MiB | > >>>>> > >>>>>>> 0% > >>>>> > >>>>>>>>>>> Default | > >>>>> > >>>>>>>>>>> > >>>>> > >>>>>>>>> > >>>>> > >>>>>>> > >>>>> > >>>>>> > >>>>> > >>>> > >>> > >> > > > +-------------------------------+----------------------+----------------------+ > >>>>> > >>>>>>>>>>> | 2 TITAN X (Pascal) Off | 0000:82:00.0 Off | > >>>>> > >>>>>>>>>>> N/A | > >>>>> > >>>>>>>>>>> | 23% 35C P0 57W / 250W | 0MiB / 12189MiB | > >>>>> > >>>>>>> 0% > >>>>> > 
>>>>>>>>>>> Default | > >>>>> > >>>>>>>>>>> > >>>>> > >>>>>>>>> > >>>>> > >>>>>>> > >>>>> > >>>>>> > >>>>> > >>>> > >>> > >> > > > +-------------------------------+----------------------+----------------------+ > >>>>> > >>>>>>>>>>> | 3 TITAN X (Pascal) Off | 0000:83:00.0 Off | > >>>>> > >>>>>>>>>>> N/A | > >>>>> > >>>>>>>>>>> | 0% 35C P0 56W / 250W | 0MiB / 12189MiB | > >>>>> > >>>>>>> 0% > >>>>> > >>>>>>>>>>> Default | > >>>>> > >>>>>>>>>>> > >>>>> > >>>>>>>>> > >>>>> > >>>>>>> > >>>>> > >>>>>> > >>>>> > >>>> > >>> > >> > > > +-------------------------------+----------------------+----------------------+ > >>>>> > >>>>>>>>>>> > >>>>> > >>>>>>>>>>> > >>>>> > >>>>>>>>>>> > >>>>> > >>>>>>>>> > >>>>> > >>>>>>> > >>>>> > >>>>>> > >>>>> > >>>> > >>> > >> > > > +-----------------------------------------------------------------------------+ > >>>>> > >>>>>>>>>>> | Processes: > >>>>> > >>>>>>> GPU > >>>>> > >>>>>>>>>>> Memory | > >>>>> > >>>>>>>>>>> | GPU PID Type Process name > >>>>> > >>>>>>>>>>> Usage > >>>>> > >>>>>>>>>>> | > >>>>> > >>>>>>>>>>> > >>>>> > >>>>>>>>> > >>>>> > >>>>>>> > >>>>> > >>>>>> > >>>>> > >>>> > >>> > >> > > > |=============================================================================| > >>>>> > >>>>>>>>>>> | No running processes found > >>>>> > >>>>>>>>>>> | > >>>>> > >>>>>>>>>>> > >>>>> > >>>>>>>>> > >>>>> > >>>>>>> > >>>>> > >>>>>> > >>>>> > >>>> > >>> > >> > > > +-----------------------------------------------------------------------------+ > >>>>> > >>>>>>>>>>> > >>>>> > >>>>>>>>>>> > >>>>> > >>>>>>>>>>> > >>>>> > >>>>>>>>>>> /usr/local/cuda/extras/demo_suite/deviceQuery > >>>>> > >>>>>>>>>>> > >>>>> > >>>>>>>>>>> Alignment requirement for Surfaces: Yes > >>>>> > >>>>>>>>>>> Device has ECC support: Disabled > >>>>> > >>>>>>>>>>> Device supports Unified Addressing (UVA): Yes > >>>>> > >>>>>>>>>>> Device PCI Domain ID / Bus ID / location ID: 0 / 131 / > >>>>> 0 > >>>>> > >>>>>>>>>>> Compute Mode: > >>>>> > >>>>>>>>>>> < Default (multiple 
host threads can use > >>>>> > >>>>>>> ::cudaSetDevice() with > >>>>> > >>>>>>>>>>> device simultaneously) > > >>>>> > >>>>>>>>>>>> Peer access from TITAN X (Pascal) (GPU0) -> TITAN X > >>>>> (Pascal) > >>>>> > >>>>>>> (GPU1) : > >>>>> > >>>>>>>>>>> Yes > >>>>> > >>>>>>>>>>>> Peer access from TITAN X (Pascal) (GPU0) -> TITAN X > >>>>> (Pascal) > >>>>> > >>>>>>> (GPU2) : > >>>>> > >>>>>>>>>>> No > >>>>> > >>>>>>>>>>>> Peer access from TITAN X (Pascal) (GPU0) -> TITAN X > >>>>> (Pascal) > >>>>> > >>>>>>> (GPU3) : > >>>>> > >>>>>>>>>>> No > >>>>> > >>>>>>>>>>>> Peer access from TITAN X (Pascal) (GPU1) -> TITAN X > >>>>> (Pascal) > >>>>> > >>>>>>> (GPU0) : > >>>>> > >>>>>>>>>>> Yes > >>>>> > >>>>>>>>>>>> Peer access from TITAN X (Pascal) (GPU1) -> TITAN X > >>>>> (Pascal) > >>>>> > >>>>>>> (GPU2) : > >>>>> > >>>>>>>>>>> No > >>>>> > >>>>>>>>>>>> Peer access from TITAN X (Pascal) (GPU1) -> TITAN X > >>>>> (Pascal) > >>>>> > >>>>>>> (GPU3) : > >>>>> > >>>>>>>>>>> No > >>>>> > >>>>>>>>>>>> Peer access from TITAN X (Pascal) (GPU2) -> TITAN X > >>>>> (Pascal) > >>>>> > >>>>>>> (GPU0) : > >>>>> > >>>>>>>>>>> No > >>>>> > >>>>>>>>>>>> Peer access from TITAN X (Pascal) (GPU2) -> TITAN X > >>>>> (Pascal) > >>>>> > >>>>>>> (GPU1) : > >>>>> > >>>>>>>>>>> No > >>>>> > >>>>>>>>>>>> Peer access from TITAN X (Pascal) (GPU2) -> TITAN X > >>>>> (Pascal) > >>>>> > >>>>>>> (GPU3) : > >>>>> > >>>>>>>>>>> Yes > >>>>> > >>>>>>>>>>>> Peer access from TITAN X (Pascal) (GPU3) -> TITAN X > >>>>> (Pascal) > >>>>> > >>>>>>> (GPU0) : > >>>>> > >>>>>>>>>>> No > >>>>> > >>>>>>>>>>>> Peer access from TITAN X (Pascal) (GPU3) -> TITAN X > >>>>> (Pascal) > >>>>> > >>>>>>> (GPU1) : > >>>>> > >>>>>>>>>>> No > >>>>> > >>>>>>>>>>>> Peer access from TITAN X (Pascal) (GPU3) -> TITAN X > >>>>> (Pascal) > >>>>> > >>>>>>> (GPU2) : > >>>>> > >>>>>>>>>>> Yes > >>>>> > >>>>>>>>>>> > >>>>> > >>>>>>>>>>> deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = > >>>>> 8.0, > >>>>> > >>>>>>> CUDA > >>>>> > >>>>>>>>>>> 
Runtime Version = 8.0, NumDevs = 4, Device0 = TITAN X > >>>>> > >>>>>>> (Pascal), > >>>>> > >>>>>>>>>>> Device1 > >>>>> > >>>>>>>>>>> = TITAN X (Pascal), Device2 = TITAN X (Pascal), Device3 = > >>>>> > >>>>>>> TITAN X > >>>>> > >>>>>>>>>>> (Pascal) > >>>>> > >>>>>>>>>>> Result = PASS > >>>>> > >>>>>>>>>>> > >>>>> > >>>>>>>>>>> > >>>>> > >>>>>>>>>>> > >>>>> > >>>>>>>>>>> Now not everything is rosy > >>>>> > >>>>>>>>>>> > >>>>> > >>>>>>>>>>> root at gpu3$ cd > >>>>> ~/NVIDIA_CUDA-8.0_Samples/5_Simulations/nbody > >>>>> > >>>>>>>>>>> root at gpu3$ make > >>>>> > >>>>>>>>>>>>>> WARNING - libGL.so not found, refer to CUDA Getting > >>>>> > >>>>>>> Started Guide > >>>>> > >>>>>>>>>>> for how to find and install them. <<< > >>>>> > >>>>>>>>>>>>>> WARNING - libGLU.so not found, refer to CUDA Getting > >>>>> > >>>>>>> Started Guide > >>>>> > >>>>>>>>>>> for how to find and install them. <<< > >>>>> > >>>>>>>>>>>>>> WARNING - libX11.so not found, refer to CUDA Getting > >>>>> > >>>>>>> Started Guide > >>>>> > >>>>>>>>>>> for how to find and install them. <<< > >>>>> > >>>>>>>>>>> > >>>>> > >>>>>>>>>>> > >>>>> > >>>>>>>>>>> even though those are installed. 
> For example
>
> root@gpu3$ yum whatprovides */libX11.so
> libX11-devel-1.6.3-2.el7.i686 : Development files for libX11
> Repo        : core
> Matched from:
> Filename    : /usr/lib/libX11.so
>
> also
>
> mesa-libGLU-devel
> mesa-libGL-devel
> xorg-x11-drv-nvidia-devel
>
> but
>
> root@gpu3$ yum -y install mesa-libGLU-devel mesa-libGL-devel xorg-x11-drv-nvidia-devel
> Package mesa-libGLU-devel-9.0.0-4.el7.x86_64 already installed and latest version
> Package mesa-libGL-devel-10.6.5-3.20150824.el7.x86_64 already installed and latest version
> Package 1:xorg-x11-drv-nvidia-devel-367.48-1.el7.x86_64 already installed and latest version
>
> Also, gpuDevice hangs from within MATLAB.
>
> So we still don't have a working installation. Any help would be appreciated.
>
> Best,
> Predrag
>
> P.S. Once we have a working installation we can think of installing
> Caffe and TensorFlow. For now we have to see why things are not working.
>
> > On Oct 12, 2016, at 6:26 PM, Predrag Punosevac wrote:
> >
> > Dear Autonians,
> >
> > GPU3 is "configured". Namely, you can log into it and all packages are
> > installed. However, I couldn't get the NVIDIA-provided CUDA driver to
> > recognize the GPU cards. They appear to be properly installed from the
> > hardware point of view, and you can list them with
> >
> > lshw -class display
> >
> > root@gpu3$ lshw -class display
> >   *-display UNCLAIMED
> >        description: VGA compatible controller
> >        product: NVIDIA Corporation
> >        vendor: NVIDIA Corporation
> >        physical id: 0
> >        bus info: pci@0000:02:00.0
> >        version: a1
> >        width: 64 bits
> >        clock: 33MHz
> >        capabilities: pm msi pciexpress vga_controller cap_list
> >        configuration: latency=0
> >        resources: iomemory:383f0-383ef iomemory:383f0-383ef
> >          memory:cf000000-cfffffff memory:383fe0000000-383fefffffff
> >          memory:383ff0000000-383ff1ffffff ioport:6000(size=128)
> >          memory:d0000000-d007ffff
> >   *-display UNCLAIMED
> >        description: VGA compatible controller
> >        product: NVIDIA Corporation
> >        vendor: NVIDIA Corporation
> >        physical id: 0
> >        bus info: pci@0000:03:00.0
> >        version: a1
> >        width: 64 bits
> >        clock: 33MHz
> >        capabilities: pm msi pciexpress vga_controller cap_list
> >        configuration: latency=0
> >        resources: iomemory:383f0-383ef iomemory:383f0-383ef
> >          memory:cd000000-cdffffff memory:383fc0000000-383fcfffffff
> >          memory:383fd0000000-383fd1ffffff ioport:5000(size=128)
> >          memory:ce000000-ce07ffff
> >   *-display
> >        description: VGA compatible controller
> >        product: ASPEED Graphics Family
> >        vendor: ASPEED Technology, Inc.
> >        physical id: 0
> >        bus info: pci@0000:06:00.0
> >        version: 30
> >        width: 32 bits
> >        clock: 33MHz
> >        capabilities: pm msi vga_controller bus_master cap_list rom
> >        configuration: driver=ast latency=0
> >        resources: irq:19 memory:cb000000-cbffffff
> >          memory:cc000000-cc01ffff ioport:4000(size=128)
> >   *-display UNCLAIMED
> >        description: VGA compatible controller
> >        product: NVIDIA Corporation
> >        vendor: NVIDIA Corporation
> >        physical id: 0
> >        bus info: pci@0000:82:00.0
> >        version: a1
> >        width: 64 bits
> >        clock: 33MHz
> >        capabilities: pm msi pciexpress vga_controller cap_list
> >        configuration: latency=0
> >        resources: iomemory:387f0-387ef iomemory:387f0-387ef
> >          memory:fa000000-faffffff memory:387fe0000000-387fefffffff
> >          memory:387ff0000000-387ff1ffffff ioport:e000(size=128)
> >          memory:fb000000-fb07ffff
> >   *-display UNCLAIMED
> >        description: VGA compatible controller
> >        product: NVIDIA Corporation
> >        vendor: NVIDIA Corporation
> >        physical id: 0
> >        bus info: pci@0000:83:00.0
> >        version: a1
> >        width: 64 bits
> >        clock: 33MHz
> >        capabilities: pm msi pciexpress vga_controller cap_list
> >        configuration: latency=0
> >        resources: iomemory:387f0-387ef iomemory:387f0-387ef
> >          memory:f8000000-f8ffffff memory:387fc0000000-387fcfffffff
> >          memory:387fd0000000-387fd1ffffff ioport:d000(size=128)
> >          memory:f9000000-f907ffff
> >
> > However, what scares the hell out of me is that I don't see the NVIDIA
> > driver loaded
> >
> > lsmod | grep nvidia
> >
> > and the device nodes /dev/nvidia* are not created. I am guessing I just
> > missed some trivial step during the CUDA installation, which is very
> > involved. I am unfortunately too tired to debug this tonight.
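The check described here — grepping lsmod for the NVIDIA kernel modules — is easy to script. A minimal sketch (a hypothetical helper, not something installed on gpu3; the module names are taken from the lsmod output quoted later in this thread):

```python
def loaded_nvidia_modules(lsmod_output):
    """Return names of NVIDIA kernel modules found in `lsmod` output.

    An empty result matches the symptom described above: the driver is
    not loaded and the /dev/nvidia* device nodes were never created.
    """
    modules = []
    for line in lsmod_output.splitlines():
        fields = line.split()
        # lsmod lines start with the module name; NVIDIA's are nvidia,
        # nvidia_uvm, nvidia_drm, nvidia_modeset, ...
        if fields and fields[0].startswith("nvidia"):
            modules.append(fields[0])
    return modules
```

In practice you would feed it the output of `subprocess.check_output(["lsmod"], text=True)` and treat an empty list as "driver not loaded".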
> > Predrag

Links:
------
[1] http://findgl.mk
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From dbayani at andrew.cmu.edu Mon Oct 17 02:13:05 2016
From: dbayani at andrew.cmu.edu (David Bayani)
Date: Sun, 16 Oct 2016 23:13:05 -0700
Subject: Auton Lab Website Personnel List
Message-ID:

Dear Autonians-

We will be updating the website's personnel list in the near future. Beyond
what was described in the previous website-related email (sent October
5th), no action is needed from any lab members. However, if you would
prefer that we not list you online for whatever reason (the list carries no
more information than what is currently standard for our website), feel
free to contact me so that we can work something out. We consider it
important to respect personal wishes regarding content release.

Sincerely,
David B.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From kandasamy at cmu.edu Mon Oct 17 18:23:47 2016
From: kandasamy at cmu.edu (Kirthevasan Kandasamy)
Date: Mon, 17 Oct 2016 18:23:47 -0400
Subject: GPU3 is "configured"
In-Reply-To:
References: <20161012222658.bee4yVvYw%predragp@cs.cmu.edu> <20161013022332.Coy6tY6j7%predragp@cs.cmu.edu> <22C1E88D-D614-4509-B0EC-3EB2357C0C3A@andrew.cmu.edu> <20161013153826.f4agzWkMb%predragp@cs.cmu.edu> <3ad25168d2dc7502872b0cde94950655@imap.srv.cs.cmu.edu> <576ceb12fa4fffb3b72b68a742a9b0b1@imap.srv.cs.cmu.edu>
Message-ID:

Hi,

Just following up. Has anyone managed to resolve this yet? I still can't
run TensorFlow on gpu3.

samy

On Thu, Oct 13, 2016 at 1:58 PM, Dougal Sutherland wrote:

> According to the tensorflow site, the conda package doesn't support GPUs.
> > On Thu, Oct 13, 2016, 6:55 PM Predrag Punosevac wrote:
> >
> > > On 2016-10-13 13:51, Dougal Sutherland wrote:
> > > > I actually haven't gotten tensorflow working yet -- the bazel build
> > > > just hangs on me. I think it may have to do with home directories
> > > > being on NFS, but I can't figure out bazel at all. I'll try some
> > > > more tonight.
> > >
> > > According to one of the Princeton guys, we could just use conda for
> > > TensorFlow. Please check it out, and use your scratch directory
> > > instead of NFS.
> > >
> > > Quote:
> > >
> > > Hello, Predrag.
> > >
> > > We have caffe 1.00rc3 if you are interested.
> > >
> > > ftp://ftp.cs.princeton.edu/pub/people/advorkin/SRPM/sd7/caffe-1.00rc3-3.sd7.src.rpm
> > >
> > > TensorFlow and protobuf-3 work great with conda
> > > (http://conda.pydata.org). I just tried and had no problems
> > > installing it for Python 2.7 and 3.5.
> > >
> > > > Caffe should be workable following the instructions Predrag
> > > > forwarded.
> > > >
> > > > - Dougal
> > > >
> > > > On Thu, Oct 13, 2016 at 6:39 PM Predrag Punosevac wrote:
> > > >
> > > > > Dear Autonians,
> > > > >
> > > > > In case anybody is interested in what happens behind the scenes,
> > > > > Doug got Caffe and TensorFlow to work on GPU3. Please see the
> > > > > message below. I also got very useful feedback from the Princeton
> > > > > and Rutgers people. Please check it out if you care (you will
> > > > > have to log into Gmail to see the exchange).
> > > > >
> > > > > https://groups.google.com/forum/#!forum/springdale-users
> > > > >
> > > > > I need to think about how we move forward with this before we
> > > > > start pulling triggers. If somebody is itchy and can't wait,
> > > > > please build Caffe and TensorFlow in your scratch directory
> > > > > following the howto below.
> > > > >
> > > > > Predrag
> > > > >
> > > > > On 2016-10-13 13:24, Dougal Sutherland wrote:
> > > > > > A note about cudnn:
> > > > > >
> > > > > > There are a bunch of versions of cudnn. They're not
> > > > > > backwards-compatible, and different versions of
> > > > > > caffe/tensorflow/whatever want different ones.
> > > > > >
> > > > > > I am currently using the setup in ~dsutherl/cudnn_files:
> > > > > >
> > > > > > * I have a bunch of versions of the installer there.
> > > > > > * The use-cudnn.sh script, intended to be used like "source
> > > > > >   use-cudnn.sh 5.1", will untar the appropriate one into a
> > > > > >   scratch directory (if it hasn't already been done) and set
> > > > > >   CPATH/LIBRARY_PATH/LD_LIBRARY_PATH appropriately.
> > > > > >   LD_LIBRARY_PATH is needed for caffe binaries, since they
> > > > > >   don't link to the absolute path; the first two (not sure
> > > > > >   about the third) are needed for theano. Dunno about
> > > > > >   tensorflow yet.
> > > > > >
> > > > > > So, here's the Caffe setup:
> > > > > >
> > > > > > cd /home/scratch/$USER
> > > > > > git clone https://github.com/BVLC/caffe
> > > > > > cd caffe
> > > > > > cp Makefile.config.example Makefile.config
> > > > > >
> > > > > > # tell it to use openblas; using atlas needs some changes to
> > > > > > # the Makefile
> > > > > > sed -i 's/BLAS := atlas/BLAS := open/' Makefile.config
> > > > > >
> > > > > > # configure to use cudnn (optional)
> > > > > > source ~dsutherl/cudnn-files/use-cudnn.sh 5.1
> > > > > > sed -i 's/# USE_CUDNN := 1/USE_CUDNN := 1/' Makefile.config
> > > > > > perl -i -pe 's|$| '$CUDNN_DIR'/include| if /INCLUDE_DIRS :=/' Makefile.config
> > > > > > perl -i -pe 's|$| '$CUDNN_DIR'/lib64| if /LIBRARY_DIRS :=/' Makefile.config
> > > > > >
> > > > > > # build the library
> > > > > > make -j23
> > > > > >
> > > > > > # to do tests (takes ~10 minutes):
> > > > > > make -j23 test
> > > > > > make runtest
> > > > > >
> > > > > > # Now, to run caffe binaries you'll need to remember to source
> > > > > > # use-cudnn if you used cudnn before.
> > > > > >
> > > > > > # To build the python library:
> > > > > > make py
> > > > > >
> > > > > > # Requirements for the python library:
> > > > > > # Some of the system packages are too old; this installs them
> > > > > > # in your scratch directory.
> > > > > > # You'll have to set PYTHONUSERBASE again before running any
> > > > > > # python processes that use these libs.
> > > > > > export PYTHONUSERBASE=$HOME/scratch/.local
> > > > > > export PATH=$PYTHONUSERBASE/bin:"$PATH"  # <- optional
> > > > > > pip install --user -r python/requirements.txt
> > > > > >
> > > > > > # Caffe is dumb and doesn't package its python library
> > > > > > # properly. The easiest way to use it is:
> > > > > > export PYTHONPATH=/home/scratch/$USER/caffe/python:$PYTHONPATH
> > > > > > python -c 'import caffe'
> > > > > >
> > > > > > On Thu, Oct 13, 2016 at 6:01 PM Dougal Sutherland wrote:
> > > > > >
> > > > > > > Java fix seemed to work. Now tensorflow wants python-wheel
> > > > > > > and swig.
> > > > > > >
> > > > > > > On Thu, Oct 13, 2016 at 5:08 PM Predrag Punosevac wrote:
> > > > > > >
> > > > > > > > On 2016-10-13 11:46, Dougal Sutherland wrote:
> > > > > > > > > Having some trouble with tensorflow, because:
> > > > > > > > >
> > > > > > > > > * it requires Google's bazel build system
> > > > > > > > > * the bazel installer says
> > > > > > > > >   Java version is 1.7.0_111 while at least 1.8 is needed.
> > > > > > > > >
> > > > > > > > > $ java -version
> > > > > > > > > openjdk version "1.8.0_102"
> > > > > > > > > OpenJDK Runtime Environment (build 1.8.0_102-b14)
> > > > > > > > > OpenJDK 64-Bit Server VM (build 25.102-b14, mixed mode)
> > > > > > > > > $ javac -version
> > > > > > > > > javac 1.7.0_111
> > > > > > > >
> > > > > > > > I just did yum -y install java-1.8.0*, which installs
> > > > > > > > OpenJDK 1.8. Please change your java. Let me know if you
> > > > > > > > want me to install Oracle JDK 1.8.
> > > > > > > >
> > > > > > > > Predrag
> > > > > > > >
> > > > > > > > On Thu, Oct 13, 2016 at 4:38 PM Predrag Punosevac wrote:
> > > > > > > >
> > > > > > > > > Dougal Sutherland wrote:
> > > > > > > > >
> > > > > > > > > > Also, this seemed to work for me so far for protobuf:
> > > > > > > > > >
> > > > > > > > > > cd /home/scratch/$USER
> > > > > > > > > > VER=3.1.0
> > > > > > > > > > wget https://github.com/google/protobuf/releases/download/v$VER/protobuf-cpp-$VER.tar.gz
> > > > > > > > > > tar xf protobuf-cpp-$VER.tar.gz
> > > > > > > > > > cd protobuf-cpp-$VER
> > > > > > > > > > ./configure --prefix=/home/scratch/$USER
> > > > > > > > > > make -j12
> > > > > > > > > > make -j12 check
> > > > > > > > > > make install
> > > > > > > > > >
> > > > > > > > > > You could change --prefix=/usr if making an RPM.
> > > > > > > > >
> > > > > > > > > That is great help!
> > > > > > > > >
> > > > > > > > > > On Thu, Oct 13, 2016 at 4:26 PM Dougal Sutherland wrote:
> > > > > > > > > >
> > > > > > > > > > > Some more packages for caffe:
> > > > > > > > > > >
> > > > > > > > > > > leveldb-devel snappy-devel opencv-devel boost-devel
> > > > > > > > > > > hdf5-devel gflags-devel glog-devel lmdb-devel
> > > > > > > > > > >
> > > > > > > > > > > (Some of those might be installed already, but at
> > > > > > > > > > > least gflags is definitely missing.)
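The environment juggling in these build notes (PYTHONUSERBASE and PYTHONPATH for the Caffe python library, plus the CPATH/LIBRARY_PATH/LD_LIBRARY_PATH that use-cudnn.sh exports) can be collected in one place. A sketch only — the helper name is invented, and the paths follow the /home/scratch/$USER convention described in this thread, not anything verified on gpu3:

```python
import os


def scratch_env(user, cudnn_dir=None, scratch_root="/home/scratch"):
    """Build the environment variables the build notes export by hand.

    Returns a dict suitable for os.environ.update(). If cudnn_dir is
    given (the directory use-cudnn.sh untars into), the cuDNN include
    and lib64 paths are added as well.
    """
    scratch = os.path.join(scratch_root, user)
    env = {
        # for `pip install --user` of Caffe's python requirements
        "PYTHONUSERBASE": os.path.join(scratch, ".local"),
        # Caffe doesn't package its python library; point at the tree
        "PYTHONPATH": os.path.join(scratch, "caffe", "python"),
    }
    if cudnn_dir:
        env["CPATH"] = os.path.join(cudnn_dir, "include")
        env["LIBRARY_PATH"] = os.path.join(cudnn_dir, "lib64")
        env["LD_LIBRARY_PATH"] = os.path.join(cudnn_dir, "lib64")
    return env
```

Calling `os.environ.update(scratch_env(...))` before launching a python process mirrors the shell exports above; it does not replace sourcing use-cudnn.sh for compiled caffe binaries.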
> > > > > > > > > > > On Thu, Oct 13, 2016 at 3:44 PM Predrag Punosevac wrote:
> > > > > > > > > > >
> > > > > > > > > > > > On 2016-10-12 23:26, Arne Suppe wrote:
> > > > > > > > > > > > > Hmm - I don't use matlab for deep learning, but
> > > > > > > > > > > > > gpuDevice also hangs on my computer with R2016a.
> > > > > > > > > > > >
> > > > > > > > > > > > We would have to escalate this with MathWorks. I
> > > > > > > > > > > > have seen workarounds on the Internet, but it looks
> > > > > > > > > > > > like a bug in one of the MathWorks-provided MEX
> > > > > > > > > > > > files.
> > > > > > > > > > > >
> > > > > > > > > > > > > I was able to compile the matrixMul example in the
> > > > > > > > > > > > > CUDA samples and run it on gpu3, so I think the
> > > > > > > > > > > > > build environment is probably all set.
> > > > > > > > > > > > >
> > > > > > > > > > > > > As for the OpenGL warnings, I think it's possibly
> > > > > > > > > > > > > a problem with their build script findgl.mk [1],
> > > > > > > > > > > > > which is not familiar with Springdale OS. The
> > > > > > > > > > > > > demo_suite directory has a precompiled nbody
> > > > > > > > > > > > > binary you may try, but I suspect most users will
> > > > > > > > > > > > > not need graphics.
> > > > > > > > > > > >
> > > > > > > > > > > > That should not be too hard to fix. Some header
> > > > > > > > > > > > files have to be manually edited. The funny part is
> > > > > > > > > > > > that until 7.2 the Princeton people didn't bother to
> > > > > > > > > > > > remove the RHEL branding, which actually made things
> > > > > > > > > > > > easier for us.
> > > > > > > > > > > >
> > > > > > > > > > > > Doug is trying right now to compile the latest
> > > > > > > > > > > > Caffe, TensorFlow, and protobuf-3. We will try to
> > > > > > > > > > > > create an RPM for that so that we don't have to go
> > > > > > > > > > > > through this again. I also asked the Princeton and
> > > > > > > > > > > > Rutgers guys if they have WIP RPMs to share.
> > > > > > > > > > > >
> > > > > > > > > > > > Predrag
> > > > > > > > > > > >
> > > > > > > > > > > > > Arne
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Oct 12, 2016, at 10:23 PM, Predrag Punosevac wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Arne Suppe wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi Predrag,
> > > > > > > > > > > > > > Don't know if this applies to you, but I just
> > > > > > > > > > > > > > built a machine with a GTX 1080, which has the
> > > > > > > > > > > > > > same Pascal architecture as the Titan. After
> > > > > > > > > > > > > > installing CUDA 8, I still found I needed to
> > > > > > > > > > > > > > install the latest driver off of the NVIDIA web
> > > > > > > > > > > > > > site to get the card recognized. Right now, I am
> > > > > > > > > > > > > > running 367.44.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Arne
> > > > > > > > > > > > >
> > > > > > > > > > > > > Arne,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thank you so much for this e-mail. Yes, it is the
> > > > > > > > > > > > > damn Pascal architecture; I see lots of people
> > > > > > > > > > > > > complaining about it on the forums. I downloaded
> > > > > > > > > > > > > and installed the driver from
> > > > > > > > > > > > >
> > > > > > > > > > > > > http://www.nvidia.com/content/DriverDownload-March2009/confirmation.php?url=/XFree86/Linux-x86_64/367.57/NVIDIA-Linux-x86_64-367.57.run&lang=us&type=GeForce
> > > > > > > > > > > > >
> > > > > > > > > > > > > That seems to have made a real difference. Check
> > > > > > > > > > > > > out this beautiful output:
> > > > > > > > > > > > >
> > > > > > > > > > > > > root@gpu3$ ls nvidia*
> > > > > > > > > > > > > nvidia0  nvidia1  nvidia2  nvidia3  nvidiactl
> > > > > > > > > > > > > nvidia-uvm  nvidia-uvm-tools
> > > > > > > > > > > > >
> > > > > > > > > > > > > root@gpu3$ lspci | grep -i nvidia
> > > > > > > > > > > > > 02:00.0 VGA compatible controller: NVIDIA Corporation Device 1b00 (rev a1)
> > > > > > > > > > > > > 02:00.1 Audio device: NVIDIA Corporation Device 10ef (rev a1)
> > > > > > > > > > > > > 03:00.0 VGA compatible controller: NVIDIA Corporation Device 1b00 (rev a1)
> > > > > > > > > > > > > 03:00.1 Audio device: NVIDIA Corporation Device 10ef (rev a1)
> > > > > > > > > > > > > 82:00.0 VGA compatible controller: NVIDIA Corporation Device 1b00 (rev a1)
> > > > > > > > > > > > > 82:00.1 Audio device: NVIDIA Corporation Device 10ef (rev a1)
> > > > > > > > > > > > > 83:00.0 VGA compatible controller: NVIDIA Corporation Device 1b00 (rev a1)
> > > > > > > > > > > > > 83:00.1 Audio device: NVIDIA Corporation Device 10ef (rev a1)
> > > > > > > > > > > > >
> > > > > > > > > > > > > root@gpu3$ ls /proc/driver
> > > > > > > > > > > > > nvidia  nvidia-uvm  nvram  rtc
> > > > > > > > > > > > >
> > > > > > > > > > > > > root@gpu3$ lsmod | grep nvidia
> > > > > > > > > > > > > nvidia_uvm            738901  0
> > > > > > > > > > > > > nvidia_drm             43405  0
> > > > > > > > > > > > > nvidia_modeset        764432  1 nvidia_drm
> > > > > > > > > > > > > nvidia              11492947  2 nvidia_modeset,nvidia_uvm
> > > > > > > > > > > > > drm_kms_helper        125056  2 ast,nvidia_drm
> > > > > > > > > > > > > drm                   349210  5 ast,ttm,drm_kms_helper,nvidia_drm
> > > > > > > > > > > > > i2c_core               40582  7 ast,drm,igb,i2c_i801,drm_kms_helper,i2c_algo_bit,nvidia
> > > > > > > > > > > > >
> > > > > > > > > > > > > root@gpu3$ nvidia-smi
> > > > > > > > > > > > > Wed Oct 12 22:03:27 2016
> > > > > > > > > > > > > +------------------------------------------------------------------------------+
> > > > > > > > > > > > > | NVIDIA-SMI 367.57                  Driver Version: 367.57                    |
> > > > > > > > > > > > > |-------------------------------+----------------------+-----------------------+
> > > > > > > > > > > > > | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC  |
> > > > > > > > > > > > > | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M.  |
> > > > > > > > > > > > > |===============================+======================+=======================|
> > > > > > > > > > > > > |   0  TITAN X (Pascal)    Off  | 0000:02:00.0     Off |                  N/A  |
> > > > > > > > > > > > > | 23%   32C    P0    56W / 250W |      0MiB / 12189MiB |      0%       Default |
> > > > > > > > > > > > > +-------------------------------+----------------------+-----------------------+
> > > > > > > > > > > > > |   1  TITAN X (Pascal)    Off  | 0000:03:00.0     Off |                  N/A  |
> > > > > > > > > > > > > | 23%   36C    P0    57W / 250W |      0MiB / 12189MiB |      0%       Default |
> > > > > > > > > > > > > +-------------------------------+----------------------+-----------------------+
> > > > > > > > > > > > > |   2  TITAN X (Pascal)    Off  | 0000:82:00.0     Off |                  N/A  |
> > > > > > > > > > > > > | 23%   35C    P0    57W / 250W |      0MiB / 12189MiB |      0%       Default |
> > > > > > > > > > > > > +-------------------------------+----------------------+-----------------------+
> > > > > > > > > > > > > |   3  TITAN X (Pascal)    Off  | 0000:83:00.0     Off |                  N/A  |
> > > > > > > > > > > > > |  0%   35C    P0    56W / 250W |      0MiB / 12189MiB |      0%       Default |
> > > > > > > > > > > > > +-------------------------------+----------------------+-----------------------+
> > > > > > > > > > > > > |  Processes:                                                       GPU Memory |
> > > > > > > > > > > > > |  GPU       PID  Type  Process name                                Usage      |
> > > > > > > > > > > > > |==============================================================================|
> > > > > > > > > > > > > |  No running processes found                                                  |
> > > > > > > > > > > > > +------------------------------------------------------------------------------+
> > > > > > > > > > > > >
> > > > > > > > > > > > > /usr/local/cuda/extras/demo_suite/deviceQuery
> > > > > > > > > > > > >
> > > > > > > > > > > > > Alignment requirement for Surfaces: Yes
> > > > > > > > > > > > > Device has ECC support: Disabled
> > > > > > > > > > > > > Device supports Unified Addressing (UVA): Yes
> > > > > > > > > > > > > Device PCI Domain ID / Bus ID / location ID: 0 / 131 / 0
> > > > > > > > > > > > > Compute Mode:
> > > > > > > > > > > > >   < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> > > > > > > > > > > > > Peer access from TITAN X (Pascal) (GPU0) -> TITAN X (Pascal) (GPU1) : Yes
> > > > > > > > > > > > > Peer access from TITAN X (Pascal) (GPU0) -> TITAN X (Pascal) (GPU2) : No
> > > > > > > > > > > > > Peer access from TITAN X (Pascal) (GPU0) -> TITAN X (Pascal) (GPU3) : No
> > > > > > > > > > > > > Peer access from TITAN X (Pascal) (GPU1) -> TITAN X (Pascal) (GPU0) : Yes
> > > > > > > > > > > > > Peer access from TITAN X (Pascal) (GPU1) -> TITAN X (Pascal) (GPU2) : No
> > > > > > > > > > > > > Peer access from TITAN X (Pascal) (GPU1) -> TITAN X (Pascal) (GPU3) : No
> > > > > > > > > > > > > Peer access from TITAN X (Pascal) (GPU2) -> TITAN X (Pascal) (GPU0) : No
> > > > > > > > > > > > > Peer access from TITAN X (Pascal) (GPU2) -> TITAN X (Pascal) (GPU1) : No
> > > > > > > > > > > > > Peer access from TITAN X (Pascal) (GPU2) -> TITAN X (Pascal) (GPU3) : Yes
> > > > > > > > > > > > > Peer access from TITAN X (Pascal) (GPU3) -> TITAN X (Pascal) (GPU0) : No
> > > > > > > > > > > > > Peer access from TITAN X (Pascal) (GPU3) -> TITAN X (Pascal) (GPU1) : No
> > > > > > > > > > > > > Peer access from TITAN X (Pascal) (GPU3) -> TITAN X (Pascal) (GPU2) : Yes
> > > > > > > > > > > > >
> > > > > > > > > > > > > deviceQuery, CUDA Driver = CUDART, CUDA Driver
> > > > > > > > > > > > > Version = 8.0, CUDA Runtime Version = 8.0,
> > > > > > > > > > > > > NumDevs = 4, Device0 = TITAN X (Pascal), Device1 =
> > > > > > > > > > > > > TITAN X (Pascal), Device2 = TITAN X (Pascal),
> > > > > > > > > > > > > Device3 = TITAN X (Pascal)
> > > > > > > > > > > > > Result = PASS
> > > > > > > > > > > > >
> > > > > > > > > > > > > Now not everything is rosy:
> > > > > > > > > > > > >
> > > > > > > > > > > > > root@gpu3$ cd ~/NVIDIA_CUDA-8.0_Samples/5_Simulations/nbody
> > > > > > > > > > > > > root@gpu3$ make
> > > > > > > > > > > > > >>> WARNING - libGL.so not found, refer to CUDA Getting Started Guide for how to find and install them. <<<
> > > > > > > > > > > > > >>> WARNING - libGLU.so not found, refer to CUDA Getting Started Guide for how to find and install them. <<<
> > > > > > > > > > > > > >>> WARNING - libX11.so not found, refer to CUDA Getting Started Guide for how to find and install them. <<<
> > > > > > > > > > > > >
> > > > > > > > > > > > > even though those are installed. For example
> > > > > > > > > > > > >
> > > > > > > > > > > > > root@gpu3$ yum whatprovides */libX11.so
> > > > > > > > > > > > > libX11-devel-1.6.3-2.el7.i686 : Development files for libX11
> > > > > > > > > > > > > Repo        : core
> > > > > > > > > > > > > Matched from:
> > > > > > > > > > > > > Filename    : /usr/lib/libX11.so
> > > > > > > > > > > > >
> > > > > > > > > > > > > also mesa-libGLU-devel, mesa-libGL-devel, and
> > > > > > > > > > > > > xorg-x11-drv-nvidia-devel, but
> > > > > > > > > > > > >
> > > > > > > > > > > > > root@gpu3$ yum -y install mesa-libGLU-devel mesa-libGL-devel xorg-x11-drv-nvidia-devel
> > > > > > > > > > > > > Package mesa-libGLU-devel-9.0.0-4.el7.x86_64 already installed and latest version
> > > > > > > > > > > > > Package mesa-libGL-devel-10.6.5-3.20150824.el7.x86_64 already installed and latest version
> > > > > > > > > > > > > Package 1:xorg-x11-drv-nvidia-devel-367.48-1.el7.x86_64 already installed and latest version
> > > > > > > > > > > > >
> > > > > > > > > > > > > Also, gpuDevice hangs from within MATLAB.
> > > > > > > > > > > > >
> > > > > > > > > > > > > So we still don't have a working installation.
> > > > > > > > > > > > > Any help would be appreciated.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Best,
> > > > > > > > > > > > > Predrag
> > > > > > > > > > > > >
> > > > > > > > > > > > > P.S. Once we have a working installation we can
> > > > > > > > > > > > > think of installing Caffe and TensorFlow. For now
> > > > > > > > > > > > > we have to see why things are not working.

Links:
------
[1] http://findgl.mk
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From predragp at cs.cmu.edu  Mon Oct 17 20:37:31 2016
From: predragp at cs.cmu.edu (Predrag Punosevac)
Date: Mon, 17 Oct 2016 20:37:31 -0400
Subject: GPU3 is "configured"
In-Reply-To: 
References: <20161012222658.bee4yVvYw%predragp@cs.cmu.edu>
 <20161013022332.Coy6tY6j7%predragp@cs.cmu.edu>
 <22C1E88D-D614-4509-B0EC-3EB2357C0C3A@andrew.cmu.edu>
 <20161013153826.f4agzWkMb%predragp@cs.cmu.edu>
 <3ad25168d2dc7502872b0cde94950655@imap.srv.cs.cmu.edu>
 <576ceb12fa4fffb3b72b68a742a9b0b1@imap.srv.cs.cmu.edu>
Message-ID: <20161018003731.k_VYAK8Xs%predragp@cs.cmu.edu>

Kirthevasan Kandasamy wrote:

> Hi,
>
> Just following up. Has anyone managed to resolve this yet?
> I still can't run tensorflow on gpu3.
>
> samy

I will not have time to look into this again before Wednesday.

Predrag

> On Thu, Oct 13, 2016 at 1:58 PM, Dougal Sutherland wrote:
>
> > According to the tensorflow site, the conda package doesn't support
> > GPUs.
> >
> > On Thu, Oct 13, 2016, 6:55 PM Predrag Punosevac <
> > predragp at imap.srv.cs.cmu.edu> wrote:
> >
> >> On 2016-10-13 13:51, Dougal Sutherland wrote:
> >> > I actually haven't gotten tensorflow working yet -- the bazel build
> >> > just hangs on me. I think it may have to do with home directories
> >> > being on NFS, but I can't figure out bazel at all. I'll try some
> >> > more tonight.
> >>
> >> According to one of the Princeton guys we could just use conda for
> >> TensorFlow. Please check it out, and use your scratch directory
> >> instead of NFS.
> >>
> >> Quote:
> >>
> >> Hello, Predrag.
> >>
> >> We have caffe 1.00rc3 if you are interested.
> >>
> >> ftp://ftp.cs.princeton.edu/pub/people/advorkin/SRPM/sd7/caffe-1.00rc3-3.sd7.src.rpm
> >>
> >> TensorFlow and protobuf-3 work great with conda
> >> (http://conda.pydata.org). I just tried and had no problems
> >> installing it for Python 2.7 and 3.5
> >>
> >> > Caffe should be workable following the instructions Predrag forwarded.
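Since the bazel hang above is suspected to involve home directories living on NFS, a quick way to check which filesystem a directory is actually backed by (and so whether a build should be moved to local scratch) is `stat -f`. A minimal sketch; the directory argument is just an example:

```shell
#!/bin/sh
# Print the filesystem type backing a directory. On a machine where $HOME
# is NFS-mounted this would print "nfs"; a local scratch directory would
# report the local filesystem type instead.
DIR="${1:-$HOME}"
FSTYPE="$(stat -f -c %T "$DIR")"
echo "$FSTYPE"
```

Running it against `/home/scratch/$USER` versus `$HOME` makes the NFS-vs-local distinction obvious before kicking off a large build.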
> >> >
> >> > - Dougal
> >> >
> >> > On Thu, Oct 13, 2016 at 6:39 PM Predrag Punosevac wrote:
> >> >
> >> >> Dear Autonians,
> >> >>
> >> >> In case anybody is interested in what happens behind the scenes,
> >> >> Doug got Caffe and TensorFlow to work on GPU3. Please see the
> >> >> message below. I also got very useful feedback from the Princeton
> >> >> and Rutgers people. Please check it out if you care (you will have
> >> >> to log into Gmail to see the exchange).
> >> >>
> >> >> https://groups.google.com/forum/#!forum/springdale-users
> >> >>
> >> >> I need to think about how we move forward with this before we
> >> >> start pulling triggers. If somebody is itchy and can't wait,
> >> >> please build Caffe and TensorFlow in your scratch directory
> >> >> following the howto below.
> >> >>
> >> >> Predrag
> >> >>
> >> >> On 2016-10-13 13:24, Dougal Sutherland wrote:
> >> >>> A note about cudnn:
> >> >>>
> >> >>> There are a bunch of versions of cudnn. They're not
> >> >>> backwards-compatible, and different versions of
> >> >>> caffe/tensorflow/whatever want different ones.
> >> >>>
> >> >>> I currently am using the setup in ~dsutherl/cudnn_files:
> >> >>>
> >> >>> * I have a bunch of versions of the installer there.
> >> >>> * The use-cudnn.sh script, intended to be used like "source
> >> >>>   use-cudnn.sh 5.1", will untar the appropriate one into a
> >> >>>   scratch directory (if it hasn't already been done) and set
> >> >>>   CPATH/LIBRARY_PATH/LD_LIBRARY_PATH appropriately.
> >> >>>   LD_LIBRARY_PATH is needed for caffe binaries, since they don't
> >> >>>   link to the absolute path; the first two (not sure about the
> >> >>>   third) are needed for theano. Dunno about tensorflow yet.
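The use-cudnn.sh behavior described above can be sketched roughly as follows. This is a hypothetical reconstruction, not the actual script in ~dsutherl/cudnn_files: the directory layout and tarball name are placeholders, and a temp directory stands in for /home/scratch/$USER so the sketch is self-contained:

```shell
#!/bin/sh
# Hedged sketch of a "source use-cudnn.sh VERSION"-style helper: unpack the
# requested cuDNN version into scratch (once) and point the compile- and
# run-time search paths at it.

CUDNN_VER="${1:-5.1}"          # requested version, e.g. 5.1
SCRATCH="$(mktemp -d)"         # stand-in for /home/scratch/$USER
CUDNN_DIR="$SCRATCH/cudnn-$CUDNN_VER"

if [ ! -d "$CUDNN_DIR" ]; then
    mkdir -p "$CUDNN_DIR/include" "$CUDNN_DIR/lib64"
    # The real script would untar an installer here, something like:
    # tar xf "cudnn-$CUDNN_VER-linux-x64.tgz" -C "$CUDNN_DIR"
fi

# CPATH/LIBRARY_PATH cover compilation (theano, the caffe build);
# LD_LIBRARY_PATH lets prebuilt caffe binaries find the .so at run time.
export CPATH="$CUDNN_DIR/include${CPATH:+:$CPATH}"
export LIBRARY_PATH="$CUDNN_DIR/lib64${LIBRARY_PATH:+:$LIBRARY_PATH}"
export LD_LIBRARY_PATH="$CUDNN_DIR/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
export CUDNN_DIR
```

Because the exports only take effect in the current shell, such a helper has to be sourced (as the message says) rather than executed as a child process.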
> >> >>>
> >> >>> So, here's the Caffe setup:
> >> >>>
> >> >>> cd /home/scratch/$USER
> >> >>> git clone https://github.com/BVLC/caffe
> >> >>> cd caffe
> >> >>> cp Makefile.config.example Makefile.config
> >> >>>
> >> >>> # tell it to use openblas; using atlas needs some changes to the
> >> >>> # Makefile
> >> >>> sed -i 's/BLAS := atlas/BLAS := open/' Makefile.config
> >> >>>
> >> >>> # configure to use cudnn (optional)
> >> >>> source ~dsutherl/cudnn-files/use-cudnn.sh 5.1
> >> >>> sed -i 's/# USE_CUDNN := 1/USE_CUDNN := 1/' Makefile.config
> >> >>> perl -i -pe 's|$| '$CUDNN_DIR'/include| if /INCLUDE_DIRS :=/' Makefile.config
> >> >>> perl -i -pe 's|$| '$CUDNN_DIR'/lib64| if /LIBRARY_DIRS :=/' Makefile.config
> >> >>>
> >> >>> # build the library
> >> >>> make -j23
> >> >>>
> >> >>> # to do tests (takes ~10 minutes):
> >> >>> make -j23 test
> >> >>> make runtest
> >> >>>
> >> >>> # Now, to run caffe binaries you'll need to remember to source
> >> >>> # use-cudnn if you used cudnn before.
> >> >>>
> >> >>> # To build the python library:
> >> >>> make py
> >> >>>
> >> >>> # Requirements for the python library:
> >> >>> # Some of the system packages are too old; this installs them in
> >> >>> # your scratch directory.
> >> >>> # You'll have to set PYTHONUSERBASE again before running any
> >> >>> # python processes that use these libs.
> >> >>> export PYTHONUSERBASE=$HOME/scratch/.local
> >> >>> export PATH=$PYTHONUSERBASE/bin:"$PATH"  # <- optional
> >> >>> pip install --user -r python/requirements.txt
> >> >>>
> >> >>> # Caffe is dumb and doesn't package its python library properly.
> >> >>> # The easiest way to use it is:
> >> >>> export PYTHONPATH=/home/scratch/$USER/caffe/python:$PYTHONPATH
> >> >>> python -c 'import caffe'
> >> >>>
> >> >>> On Thu, Oct 13, 2016 at 6:01 PM Dougal Sutherland wrote:
> >> >>>
> >> >>>> Java fix seemed to work. Now tensorflow wants python-wheel and
> >> >>>> swig.
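The PYTHONUSERBASE trick in the setup above can be verified quickly: both `pip install --user` and Python's `site` module honor the variable, so user packages land in scratch instead of the NFS home directory. A minimal self-contained check (a temp directory stands in for the scratch path):

```shell
#!/bin/sh
# Point per-user Python installs at a scratch directory instead of $HOME.
# A temp dir stands in for /home/scratch/$USER here.
export PYTHONUSERBASE="$(mktemp -d)"
export PATH="$PYTHONUSERBASE/bin:$PATH"

# Python's site module reports where `pip install --user` will now go:
USER_BASE="$(python3 -m site --user-base)"
echo "$USER_BASE"
```

If the printed path matches the exported PYTHONUSERBASE, any later `pip install --user` in that shell installs under scratch, which is why the variable must be re-exported before running Python processes that use those libs.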
> >> >>>>
> >> >>>> On Thu, Oct 13, 2016 at 5:08 PM Predrag Punosevac wrote:
> >> >>>>
> >> >>>>> On 2016-10-13 11:46, Dougal Sutherland wrote:
> >> >>>>>
> >> >>>>>> Having some trouble with tensorflow, because:
> >> >>>>>>
> >> >>>>>> * it requires Google's bazel build system
> >> >>>>>> * The bazel installer says
> >> >>>>>>   Java version is 1.7.0_111 while at least 1.8 is needed.
> >> >>>>>>
> >> >>>>>> * $ java -version
> >> >>>>>>   openjdk version "1.8.0_102"
> >> >>>>>>   OpenJDK Runtime Environment (build 1.8.0_102-b14)
> >> >>>>>>   OpenJDK 64-Bit Server VM (build 25.102-b14, mixed mode)
> >> >>>>>>   $ javac -version
> >> >>>>>>   javac 1.7.0_111
> >> >>>>>
> >> >>>>> I just did yum -y install java-1.8.0* which installs openjdk
> >> >>>>> 1.8. Please change your java. Let me know if you want me to
> >> >>>>> install Oracle JDK 1.8.
> >> >>>>>
> >> >>>>> Predrag
> >> >>>>>
> >> >>>>>> On Thu, Oct 13, 2016 at 4:38 PM Predrag Punosevac wrote:
> >> >>>>>>
> >> >>>>>>> Dougal Sutherland wrote:
> >> >>>>>>>
> >> >>>>>>>> Also, this seemed to work for me so far for protobuf:
> >> >>>>>>>>
> >> >>>>>>>> cd /home/scratch/$USER
> >> >>>>>>>> VER=3.1.0
> >> >>>>>>>> wget https://github.com/google/protobuf/releases/download/v$VER/protobuf-cpp-$VER.tar.gz
> >> >>>>>>>> tar xf protobuf-cpp-$VER.tar.gz
> >> >>>>>>>> cd protobuf-cpp-$VER
> >> >>>>>>>> ./configure --prefix=/home/scratch/$USER
> >> >>>>>>>> make -j12
> >> >>>>>>>> make -j12 check
> >> >>>>>>>> make install
> >> >>>>>>>
> >> >>>>>>> That is a great help!
> >> >>>>>>>
> >> >>>>>>>> You could change --prefix=/usr if making an RPM.
> >> >>>>>>>>
> >> >>>>>>>> On Thu, Oct 13, 2016 at 4:26 PM Dougal Sutherland wrote:
> >> >>>>>>>>
> >> >>>>>>>>> Some more packages for caffe:
> >> >>>>>>>>>
> >> >>>>>>>>> leveldb-devel snappy-devel opencv-devel boost-devel
> >> >>>>>>>>> hdf5-devel gflags-devel glog-devel lmdb-devel
> >> >>>>>>>>>
> >> >>>>>>>>> (Some of those might be installed already, but at least
> >> >>>>>>>>> gflags is definitely missing.)
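After a `./configure --prefix=...` install into scratch like the protobuf steps above, later builds only find the result if the shell's search paths are pointed at the prefix. A sketch of the usual exports; the directory is a temp-dir placeholder for /home/scratch/$USER, and which variables are actually needed depends on the consuming build system:

```shell
#!/bin/sh
# After `./configure --prefix=$PREFIX && make install`, tell subsequent
# builds where the scratch-installed protobuf (or anything else) lives.
PREFIX="$(mktemp -d)"          # stand-in for /home/scratch/$USER
mkdir -p "$PREFIX/bin" "$PREFIX/lib" "$PREFIX/include" "$PREFIX/lib/pkgconfig"

export PATH="$PREFIX/bin:$PATH"                                    # protoc
export CPATH="$PREFIX/include${CPATH:+:$CPATH}"                    # headers
export LIBRARY_PATH="$PREFIX/lib${LIBRARY_PATH:+:$LIBRARY_PATH}"   # link time
export LD_LIBRARY_PATH="$PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
export PKG_CONFIG_PATH="$PREFIX/lib/pkgconfig${PKG_CONFIG_PATH:+:$PKG_CONFIG_PATH}"
```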
> >> >>>>>>>>>
> >> >>>>>>>>> On Thu, Oct 13, 2016 at 3:44 PM Predrag Punosevac <
> >> >>>>>>>>> predragp at imap.srv.cs.cmu.edu> wrote:
> >> >>>>>>>>>
> >> >>>>>>>>> On 2016-10-12 23:26, Arne Suppe wrote:
> >> >>>>>>>>>> Hmm - I don't use matlab for deep learning, but gpuDevice
> >> >>>>>>>>>> also hangs on my computer with R2016a.
> >> >>>>>>>>>
> >> >>>>>>>>> We would have to escalate this with MathWorks. I have seen
> >> >>>>>>>>> workarounds on the Internet, but it looks like a bug in one
> >> >>>>>>>>> of the MathWorks-provided MEX files.
> >> >>>>>>>>>
> >> >>>>>>>>>> I was able to compile the matrixMul example in the CUDA
> >> >>>>>>>>>> samples and run it on gpu3, so I think the build
> >> >>>>>>>>>> environment is probably all set.
> >> >>>>>>>>>>
> >> >>>>>>>>>> As for the OpenGL, I think it's possibly a problem with
> >> >>>>>>>>>> their build script findgl.mk [1] which is not familiar
> >> >>>>>>>>>> with Springdale OS. The demo_suite directory has a
> >> >>>>>>>>>> precompiled nbody binary you may try, but I suspect most
> >> >>>>>>>>>> users will not need graphics.
> >> >>>>>>>>>
> >> >>>>>>>>> That should not be too hard to fix. Some header files have
> >> >>>>>>>>> to be manually edited. The funny part: until 7.2 the
> >> >>>>>>>>> Princeton people didn't bother to remove the RHEL branding,
> >> >>>>>>>>> which actually made things easier for us.
> >> >>>>>>>>>
> >> >>>>>>>>> Doug is trying right now to compile the latest Caffe,
> >> >>>>>>>>> TensorFlow, and protobuf-3. We will try to create an RPM
> >> >>>>>>>>> for that so that we don't have to go through this again. I
> >> >>>>>>>>> also asked the Princeton and Rutgers guys if they have WIP
> >> >>>>>>>>> RPMs to share.
> >> >>>>>>>>>
> >> >>>>>>>>> Predrag
> >> >>>>>>>>>
> >> >>>>>>>>>> Arne
> >> >>>>>>>>>>
> >> >>>>>>>>>>> On Oct 12, 2016, at 10:23 PM, Predrag Punosevac wrote:
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> Arne Suppe wrote:
> >> >>>>>>>>>>>
> >> >>>>>>>>>>>> Hi Predrag,
> >> >>>>>>>>>>>> Don't know if this applies to you, but I just built a
> >> >>>>>>>>>>>> machine with a GTX1080 which has the same PASCAL
> >> >>>>>>>>>>>> architecture as the Titan. After installing CUDA 8, I
> >> >>>>>>>>>>>> still found I needed to install the latest driver off of
> >> >>>>>>>>>>>> the NVIDIA web site to get the card recognized. Right
> >> >>>>>>>>>>>> now, I am running 367.44.
> >> >>>>>>>>>>>>
> >> >>>>>>>>>>>> Arne
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> Arne,
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> Thank you so much for this e-mail. Yes, it is the damn
> >> >>>>>>>>>>> PASCAL architecture; I see lots of people complaining
> >> >>>>>>>>>>> about it on the forums. I downloaded and installed the
> >> >>>>>>>>>>> driver from
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> http://www.nvidia.com/content/DriverDownload-March2009/confirmation.php?url=/XFree86/Linux-x86_64/367.57/NVIDIA-Linux-x86_64-367.57.run&lang=us&type=GeForce
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> That seems to have made a real difference. Check out
> >> >>>>>>>>>>> these beautiful outputs:
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> root@gpu3$ ls nvidia*
> >> >>>>>>>>>>> nvidia0  nvidia1  nvidia2  nvidia3  nvidiactl  nvidia-uvm
> >> >>>>>>>>>>> nvidia-uvm-tools
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> root@gpu3$ lspci | grep -i nvidia
> >> >>>>>>>>>>> 02:00.0 VGA compatible controller: NVIDIA Corporation Device 1b00 (rev a1)
> >> >>>>>>>>>>> 02:00.1 Audio device: NVIDIA Corporation Device 10ef (rev a1)
> >> >>>>>>>>>>> 03:00.0 VGA compatible controller: NVIDIA Corporation Device 1b00 (rev a1)
> >> >>>>>>>>>>> 03:00.1 Audio device: NVIDIA Corporation Device 10ef (rev a1)
> >> >>>>>>>>>>> 82:00.0 VGA compatible controller: NVIDIA Corporation Device 1b00 (rev a1)
> >> >>>>>>>>>>> 82:00.1 Audio device: NVIDIA Corporation Device 10ef (rev a1)
> >> >>>>>>>>>>> 83:00.0 VGA compatible controller: NVIDIA Corporation Device 1b00 (rev a1)
> >> >>>>>>>>>>> 83:00.1 Audio device: NVIDIA Corporation Device 10ef (rev a1)
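As a quick sanity check, the lspci output above can be counted mechanically: four VGA-class NVIDIA controllers should appear, one per Titan X. The block below feeds the exact lines quoted above through grep via a here-document, so it runs anywhere:

```shell
#!/bin/sh
# Count NVIDIA VGA controllers in saved lspci output (pasted from above;
# on the live machine you would pipe `lspci` itself into grep instead).
COUNT=$(grep -c 'VGA compatible controller: NVIDIA' <<'EOF'
02:00.0 VGA compatible controller: NVIDIA Corporation Device 1b00 (rev a1)
02:00.1 Audio device: NVIDIA Corporation Device 10ef (rev a1)
03:00.0 VGA compatible controller: NVIDIA Corporation Device 1b00 (rev a1)
03:00.1 Audio device: NVIDIA Corporation Device 10ef (rev a1)
82:00.0 VGA compatible controller: NVIDIA Corporation Device 1b00 (rev a1)
82:00.1 Audio device: NVIDIA Corporation Device 10ef (rev a1)
83:00.0 VGA compatible controller: NVIDIA Corporation Device 1b00 (rev a1)
83:00.1 Audio device: NVIDIA Corporation Device 10ef (rev a1)
EOF
)
echo "$COUNT"   # 4
```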
> >> >>>>>>>>>>> root@gpu3$ ls /proc/driver
> >> >>>>>>>>>>> nvidia  nvidia-uvm  nvram  rtc
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> root@gpu3$ lsmod | grep nvidia
> >> >>>>>>>>>>> nvidia_uvm            738901  0
> >> >>>>>>>>>>> nvidia_drm             43405  0
> >> >>>>>>>>>>> nvidia_modeset        764432  1 nvidia_drm
> >> >>>>>>>>>>> nvidia              11492947  2 nvidia_modeset,nvidia_uvm
> >> >>>>>>>>>>> drm_kms_helper        125056  2 ast,nvidia_drm
> >> >>>>>>>>>>> drm                   349210  5 ast,ttm,drm_kms_helper,nvidia_drm
> >> >>>>>>>>>>> i2c_core               40582  7 ast,drm,igb,i2c_i801,drm_kms_helper,i2c_algo_bit,nvidia
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> root@gpu3$ nvidia-smi
> >> >>>>>>>>>>> Wed Oct 12 22:03:27 2016
> >> >>>>>>>>>>> +-----------------------------------------------------------------------------+
> >> >>>>>>>>>>> | NVIDIA-SMI 367.57                 Driver Version: 367.57                    |
> >> >>>>>>>>>>> |-------------------------------+----------------------+----------------------+
> >> >>>>>>>>>>> | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
> >> >>>>>>>>>>> | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
> >> >>>>>>>>>>> |===============================+======================+======================|
> >> >>>>>>>>>>> |   0  TITAN X (Pascal)    Off  | 0000:02:00.0     Off |                  N/A |
> >> >>>>>>>>>>> | 23%   32C    P0    56W / 250W |      0MiB / 12189MiB |      0%      Default |
> >> >>>>>>>>>>> +-------------------------------+----------------------+----------------------+
> >> >>>>>>>>>>> |   1  TITAN X (Pascal)    Off  | 0000:03:00.0     Off |                  N/A |
> >> >>>>>>>>>>> | 23%   36C    P0    57W / 250W |      0MiB / 12189MiB |      0%      Default |
> >> >>>>>>>>>>> +-------------------------------+----------------------+----------------------+
> >> >>>>>>>>>>> |   2  TITAN X (Pascal)    Off  | 0000:82:00.0     Off |                  N/A |
> >> >>>>>>>>>>> | 23%   35C    P0    57W / 250W |      0MiB / 12189MiB |      0%      Default |
> >> >>>>>>>>>>> +-------------------------------+----------------------+----------------------+
> >> >>>>>>>>>>> |   3  TITAN X (Pascal)    Off  | 0000:83:00.0     Off |                  N/A |
> >> >>>>>>>>>>> |  0%   35C    P0    56W / 250W |      0MiB / 12189MiB |      0%      Default |
> >> >>>>>>>>>>> +-------------------------------+----------------------+----------------------+
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> +-----------------------------------------------------------------------------+
> >> >>>>>>>>>>> | Processes:                                                       GPU Memory |
> >> >>>>>>>>>>> |  GPU       PID  Type  Process name                                    Usage |
> >> >>>>>>>>>>> |=============================================================================|
> >> >>>>>>>>>>> |  No running processes found                                                 |
> >> >>>>>>>>>>> +-----------------------------------------------------------------------------+
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> /usr/local/cuda/extras/demo_suite/deviceQuery
> >> >>>>>>>>>>>
> >> >>>>>>>>>>>   Alignment requirement for Surfaces:            Yes
> >> >>>>>>>>>>>   Device has ECC support:                        Disabled
> >> >>>>>>>>>>>   Device supports Unified Addressing (UVA):      Yes
> >> >>>>>>>>>>>   Device PCI Domain ID / Bus ID / location ID:   0 / 131 / 0
> >> >>>>>>>>>>>   Compute Mode:
> >> >>>>>>>>>>>      < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> Peer access from TITAN X (Pascal) (GPU0) -> TITAN X (Pascal) (GPU1) : Yes
> >> >>>>>>>>>>> Peer access from TITAN X (Pascal) (GPU0) -> TITAN X (Pascal) (GPU2) : No
> >> >>>>>>>>>>> Peer access from TITAN X (Pascal) (GPU0) -> TITAN X (Pascal) (GPU3) : No
> >> >>>>>>>>>>> Peer access from TITAN X (Pascal) (GPU1) -> TITAN X (Pascal) (GPU0) : Yes
> >> >>>>>>>>>>> Peer access from TITAN X (Pascal) (GPU1) -> TITAN X (Pascal) (GPU2) : No
> >> >>>>>>>>>>> Peer access from TITAN X (Pascal) (GPU1) -> TITAN X (Pascal) (GPU3) : No
> >> >>>>>>>>>>> Peer access from TITAN X (Pascal) (GPU2) -> TITAN X (Pascal) (GPU0) : No
> >> >>>>>>>>>>> Peer access from TITAN X (Pascal) (GPU2) -> TITAN X (Pascal) (GPU1) : No
> >> >>>>>>>>>>> Peer access from TITAN X (Pascal) (GPU2) -> TITAN X (Pascal) (GPU3) : Yes
> >> >>>>>>>>>>> Peer access from TITAN X (Pascal) (GPU3) -> TITAN X (Pascal) (GPU0) : No
> >> >>>>>>>>>>> Peer access from TITAN X (Pascal) (GPU3) -> TITAN X (Pascal) (GPU1) : No
> >> >>>>>>>>>>> Peer access from TITAN X (Pascal) (GPU3) -> TITAN X (Pascal) (GPU2) : Yes
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0,
> >> >>>>>>>>>>> CUDA Runtime Version = 8.0, NumDevs = 4, Device0 = TITAN X
> >> >>>>>>>>>>> (Pascal), Device1 = TITAN X (Pascal), Device2 = TITAN X
> >> >>>>>>>>>>> (Pascal), Device3 = TITAN X (Pascal)
> >> >>>>>>>>>>> Result = PASS
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> Now not everything is rosy:
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> root@gpu3$ cd ~/NVIDIA_CUDA-8.0_Samples/5_Simulations/nbody
> >> >>>>>>>>>>> root@gpu3$ make
> >> >>>>>>>>>>> >>> WARNING - libGL.so not found, refer to CUDA Getting
> >> >>>>>>>>>>>     Started Guide for how to find and install them. <<<
> >> >>>>>>>>>>> >>> WARNING - libGLU.so not found, refer to CUDA Getting
> >> >>>>>>>>>>>     Started Guide for how to find and install them. <<<
> >> >>>>>>>>>>> >>> WARNING - libX11.so not found, refer to CUDA Getting
> >> >>>>>>>>>>>     Started Guide for how to find and install them. <<<
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> even though those are installed.
For example > >> >>>>> > >> >>>>>>>>>>> > >> >>>>> > >> >>>>>>>>>>> root at gpu3$ yum whatprovides */libX11.so > >> >>>>> > >> >>>>>>>>>>> libX11-devel-1.6.3-2.el7.i686 : Development files for > >> >>>>> libX11 > >> >>>>> > >> >>>>>>>>>>> Repo : core > >> >>>>> > >> >>>>>>>>>>> Matched from: > >> >>>>> > >> >>>>>>>>>>> Filename : /usr/lib/libX11.so > >> >>>>> > >> >>>>>>>>>>> > >> >>>>> > >> >>>>>>>>>>> also > >> >>>>> > >> >>>>>>>>>>> > >> >>>>> > >> >>>>>>>>>>> mesa-libGLU-devel > >> >>>>> > >> >>>>>>>>>>> mesa-libGL-devel > >> >>>>> > >> >>>>>>>>>>> xorg-x11-drv-nvidia-devel > >> >>>>> > >> >>>>>>>>>>> > >> >>>>> > >> >>>>>>>>>>> but > >> >>>>> > >> >>>>>>>>>>> > >> >>>>> > >> >>>>>>>>>>> root at gpu3$ yum -y install mesa-libGLU-devel > >> >>>>> mesa-libGL-devel > >> >>>>> > >> >>>>>>>>>>> xorg-x11-drv-nvidia-devel > >> >>>>> > >> >>>>>>>>>>> Package mesa-libGLU-devel-9.0.0-4.el7.x86_64 already > >> >>>>> > >> >>>>>>> installed and > >> >>>>> > >> >>>>>>>>>>> latest version > >> >>>>> > >> >>>>>>>>>>> Package mesa-libGL-devel-10.6.5-3.20150824.el7.x86_64 > >> >>>>> already > >> >>>>> > >> >>>>>>>>>>> installed > >> >>>>> > >> >>>>>>>>>>> and latest version > >> >>>>> > >> >>>>>>>>>>> Package 1:xorg-x11-drv-nvidia-devel-367.48-1.el7.x86_64 > >> >>>>> > >> >>>>>>> already > >> >>>>> > >> >>>>>>>>>>> installed and latest version > >> >>>>> > >> >>>>>>>>>>> > >> >>>>> > >> >>>>>>>>>>> Also from MATLAB gpuDevice hangs. > >> >>>>> > >> >>>>>>>>>>> > >> >>>>> > >> >>>>>>>>>>> So we still don't have a working installation. Any help > >> >>>>> would > >> >>>>> > >> >>>>>>> be > >> >>>>> > >> >>>>>>>>>>> appreciated. > >> >>>>> > >> >>>>>>>>>>> > >> >>>>> > >> >>>>>>>>>>> Best, > >> >>>>> > >> >>>>>>>>>>> Predrag > >> >>>>> > >> >>>>>>>>>>> > >> >>>>> > >> >>>>>>>>>>> P.S. Once we have a working installation we can think of > >> >>>>> > >> >>>>>>> installing > >> >>>>> > >> >>>>>>>>>>> Caffe and TensorFlow. 
For now we have to see why the > >> >>>>> things > >> >>>>> > >> >>>>>>> are not > >> >>>>> > >> >>>>>>>>>>> working. > >> >>>>> > >> >>>>>>>>>>> > >> >>>>> > >> >>>>>>>>>>> > >> >>>>> > >> >>>>>>>>>>> > >> >>>>> > >> >>>>>>>>>>> > >> >>>>> > >> >>>>>>>>>>> > >> >>>>> > >> >>>>>>>>>>> > >> >>>>> > >> >>>>>>>>>>>> > >> >>>>> > >> >>>>>>>>>>>>> On Oct 12, 2016, at 6:26 PM, Predrag Punosevac > >> >>>>> > >> >>>>>>> > >> > >> >>>>> > >> >>>>>>>>>>>>> wrote: > >> >>>>> > >> >>>>>>>>>>>>> > >> >>>>> > >> >>>>>>>>>>>>> Dear Autonians, > >> >>>>> > >> >>>>>>>>>>>>> > >> >>>>> > >> >>>>>>>>>>>>> GPU3 is "configured". Namely you can log into it and all > >> >>>>> > >> >>>>>>> packages > >> >>>>> > >> >>>>>>>>>>>>> are > >> >>>>> > >> >>>>>>>>>>>>> installed. However I couldn't get NVIDIA provided CUDA > >> >>>>> > >> >>>>>>> driver to > >> >>>>> > >> >>>>>>>>>>>>> recognize GPU cards. They appear to be properly > >> >>>>> installed > >> >>>>> > >> >>>>>>> from the > >> >>>>> > >> >>>>>>>>>>>>> hardware point of view and you can list them with > >> >>>>> > >> >>>>>>>>>>>>> > >> >>>>> > >> >>>>>>>>>>>>> lshw -class display > >> >>>>> > >> >>>>>>>>>>>>> > >> >>>>> > >> >>>>>>>>>>>>> root at gpu3$ lshw -class display > >> >>>>> > >> >>>>>>>>>>>>> *-display UNCLAIMED > >> >>>>> > >> >>>>>>>>>>>>> description: VGA compatible controller > >> >>>>> > >> >>>>>>>>>>>>> product: NVIDIA Corporation > >> >>>>> > >> >>>>>>>>>>>>> vendor: NVIDIA Corporation > >> >>>>> > >> >>>>>>>>>>>>> physical id: 0 > >> >>>>> > >> >>>>>>>>>>>>> bus info: pci at 0000:02:00.0 > >> >>>>> > >> >>>>>>>>>>>>> version: a1 > >> >>>>> > >> >>>>>>>>>>>>> width: 64 bits > >> >>>>> > >> >>>>>>>>>>>>> clock: 33MHz > >> >>>>> > >> >>>>>>>>>>>>> capabilities: pm msi pciexpress vga_controller > >> >>>>> > >> >>>>>>> cap_list > >> >>>>> > >> >>>>>>>>>>>>> configuration: latency=0 > >> >>>>> > >> >>>>>>>>>>>>> resources: iomemory:383f0-383ef > >> >>>>> iomemory:383f0-383ef > >> >>>>> > >> >>>>>>>>>>>>> 
> memory:cf000000-cfffffff memory:383fe0000000-383fefffffff
> memory:383ff0000000-383ff1ffffff ioport:6000(size=128)
> memory:d0000000-d007ffff
> *-display UNCLAIMED
>      description: VGA compatible controller
>      product: NVIDIA Corporation
>      vendor: NVIDIA Corporation
>      physical id: 0
>      bus info: pci@0000:03:00.0
>      version: a1
>      width: 64 bits
>      clock: 33MHz
>      capabilities: pm msi pciexpress vga_controller cap_list
>      configuration: latency=0
>      resources: iomemory:383f0-383ef iomemory:383f0-383ef
>        memory:cd000000-cdffffff memory:383fc0000000-383fcfffffff
>        memory:383fd0000000-383fd1ffffff ioport:5000(size=128)
>        memory:ce000000-ce07ffff
> *-display
>      description: VGA compatible controller
>      product: ASPEED Graphics Family
>      vendor: ASPEED Technology, Inc.
>      physical id: 0
>      bus info: pci@0000:06:00.0
>      version: 30
>      width: 32 bits
>      clock: 33MHz
>      capabilities: pm msi vga_controller bus_master cap_list rom
>      configuration: driver=ast latency=0
>      resources: irq:19 memory:cb000000-cbffffff
>        memory:cc000000-cc01ffff ioport:4000(size=128)
> *-display UNCLAIMED
>      description: VGA compatible controller
>      product: NVIDIA Corporation
>      vendor: NVIDIA Corporation
>      physical id: 0
>      bus info: pci@0000:82:00.0
>      version: a1
>      width: 64 bits
>      clock: 33MHz
>      capabilities: pm msi pciexpress vga_controller cap_list
>      configuration: latency=0
>      resources: iomemory:387f0-387ef iomemory:387f0-387ef
>        memory:fa000000-faffffff memory:387fe0000000-387fefffffff
>        memory:387ff0000000-387ff1ffffff ioport:e000(size=128)
>        memory:fb000000-fb07ffff
> *-display UNCLAIMED
>      description: VGA compatible controller
>      product: NVIDIA Corporation
>      vendor: NVIDIA Corporation
>      physical id: 0
>      bus info: pci@0000:83:00.0
>      version: a1
>      width: 64 bits
>      clock: 33MHz
>      capabilities: pm msi pciexpress vga_controller cap_list
>      configuration: latency=0
>      resources: iomemory:387f0-387ef iomemory:387f0-387ef
>        memory:f8000000-f8ffffff memory:387fc0000000-387fcfffffff
>        memory:387fd0000000-387fd1ffffff ioport:d000(size=128)
>        memory:f9000000-f907ffff
>
> However, what scares the hell out of me is that I don't see the NVIDIA
> driver loaded:
>
>     lsmod | grep nvidia
>
> returns nothing, and the /dev/nvidia device nodes are not created. I am
> guessing I just missed some trivial step during the CUDA installation,
> which is very involved. I am unfortunately too tired to debug this
> tonight.
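The two checks described above can be bundled into a quick sanity script (a sketch; `lsmod` and the `/dev/nvidia*` device paths are the standard Linux/NVIDIA locations, not anything specific to this machine):

```shell
#!/bin/sh
# Sanity-check an NVIDIA/CUDA install: is the kernel module loaded,
# and did the driver create its device nodes?
if lsmod 2>/dev/null | grep -q '^nvidia'; then
    echo "nvidia module: loaded"
else
    echo "nvidia module: missing"
fi

if ls /dev/nvidia* >/dev/null 2>&1; then
    echo "device nodes: present"
else
    echo "device nodes: absent"
fi
```

If the module is missing, loading it by hand (`modprobe nvidia` as root) or re-running the driver installer is the usual next step.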
> Predrag

> Links:
> ------
> [1] http://findgl.mk

From predragp at cs.cmu.edu Tue Oct 18 11:36:01 2016
From: predragp at cs.cmu.edu (Predrag Punosevac)
Date: Tue, 18 Oct 2016 11:36:01 -0400
Subject: Critical ssh patches
Message-ID: <20161018153601.uLUtv9wCy%predragp@cs.cmu.edu>

Dear Autonians,

I had to apply critical ssh patches to our infrastructure servers, which caused a short interruption of a few minutes on the ssh gateways; they had to be rebooted for the patches to take effect. No further interruptions are anticipated, and both lop1 and bash are now available and fully functional. If an Auton Lab desktop behaves strangely (it should not, but just in case it does), please reboot it to restart OpenVPN and remount the NFS shares.

Best,
Predrag

P.S. These ssh problems have so far only been observed on OpenBSD and fixed in the non-portable OpenSSH version. The Linux fix is probably a few days away.

From predragp at cs.cmu.edu Tue Oct 18 13:54:27 2016
From: predragp at cs.cmu.edu (Predrag Punosevac)
Date: Tue, 18 Oct 2016 13:54:27 -0400
Subject: GPU3 CUDA downgrade
Message-ID: <20161018175427.WWY0b754S%predragp@cs.cmu.edu>

Dear Autonians,

I would like to schedule a CUDA downgrade from 8.0 to 7.5 on GPU3 for tomorrow at 11:00 AM. If you need the machine up to finish a job, please speak up now. We have some reason to believe that our problems compiling TensorFlow are due to the newest version of CUDA. I will try downgrading to 7.5, which we use on GPU1 and GPU2, to see if we can make progress. We also hope it might fix the MATLAB problem.
I just received extra RAM for GPU1, GPU2, and GPU3, so once GPU3 is back online it should have 256 GB of RAM.

Predrag

From predragp at cs.cmu.edu Tue Oct 18 15:45:48 2016
From: predragp at cs.cmu.edu (Predrag Punosevac)
Date: Tue, 18 Oct 2016 15:45:48 -0400
Subject: CMU clock drifting
Message-ID: <20161018194548.mKFDPsvcY%predragp@cs.cmu.edu>

Dear Autonians,

I have been trying to figure out what was wrong with AFS and Kerberos on Jeff's computer for over a week now. What I found is such a subtle problem that I would like to share it with you, as it affects everyone on campus.

Namely, the clock on Jeff's desktop was drifting, so the Kerberos server would not give him a ticket. I wouldn't have expected that to happen, since I run the ntpd daemon there just as I do on all our servers and virtual machines. Well, I learned the hard way that Carnegie Mellon University blocks external clock synchronization at their firewalls, expecting people to run isc-dhcp clients, which can alter the clock synchronization pool. I now have the list of their ntpd servers, and I will set up an Auton Lab ntpd server which polls their machines and passes the correct time to ours.

So the moral of the story: if something behaves very strangely, please check the clock first.

Best,
Predrag

From predragp at cs.cmu.edu Wed Oct 19 13:22:09 2016
From: predragp at cs.cmu.edu (Predrag Punosevac)
Date: Wed, 19 Oct 2016 13:22:09 -0400
Subject: GPU3 back in business
Message-ID: <20161019172209.esq71ATV8%predragp@cs.cmu.edu>

Dear Autonians,

I have added an additional 128 GB of RAM to GPU3 and downgraded CUDA to 7.5. The good news is that the CUDA downgrade has fixed the MATLAB problem; you can use MATLAB on GPU3 now. I am looking at TensorFlow right now.
Predrag

From predragp at cs.cmu.edu Wed Oct 19 16:10:53 2016
From: predragp at cs.cmu.edu (Predrag Punosevac)
Date: Wed, 19 Oct 2016 16:10:53 -0400
Subject: GPU3 back in business
In-Reply-To: 
References: <20161019172209.esq71ATV8%predragp@cs.cmu.edu> <20161019183651.3Xy0murhr%predragp@cs.cmu.edu> <7B1F10ED8F1045F6.b801b0a4-484f-4c6e-9558-4a83a0cb33ec@mail.outlook.com>
Message-ID: <20161019201053.KHd8koxJl%predragp@cs.cmu.edu>

Dougal Sutherland wrote:

> I tried for a while. I failed.

Damn, this doesn't look good. I guess back to the drawing board. Thanks for the quick feedback.

Predrag

> Version 0.10.0 fails immediately on build: "The specified --crosstool_top
> '@local_config_cuda//crosstool:crosstool' is not a valid cc_toolchain_suite
> rule." Apparently this is because 0.10 required an older version of bazel
> (https://github.com/tensorflow/tensorflow/issues/4368), and I don't have the
> energy to install an old version of bazel.
>
> Version 0.11.0rc0 gets almost done and then complains about no such file or
> directory for libcudart.so.7.5 (which is there, where I told tensorflow it
> was...).
>
> Non-release versions from git fail immediately because they call git -C to
> get version info, which is only in git 1.9 (we have 1.8).
>
> Some other notes:
> - I made a symlink from ~/.cache/bazel to /home/scratch/$USER/.cache/bazel,
> because bazel is the worst. (It complains about doing things on NFS, and
> hung for me [clock-related?], and I can't find a global config file or
> anything to change that in; it seems like there might be one, but their
> documentation is terrible.)
>
> - I wasn't able to use the actual Titan X compute capability of 6.1,
> because that requires cuda 8; I used 5.2 instead. Probably not a huge deal,
> but I don't know.
>
> - I tried explicitly including /usr/local/cuda/lib64 in LD_LIBRARY_PATH and
> set CUDA_HOME to /usr/local/cuda before building, hoping that would help
> with the 0.11.0rc0 problem, but it didn't.
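The "no such file or directory for libcudart.so.7.5" failure above is usually an environment problem, so it is worth checking what the build will actually see before invoking bazel. A minimal sketch (the /usr/local/cuda path is the one mentioned in the thread; adjust for the real install):

```shell
#!/bin/sh
# Point the build and the dynamic loader at one specific CUDA install,
# then confirm the runtime library is actually visible there.
export CUDA_HOME=/usr/local/cuda
export LD_LIBRARY_PATH="$CUDA_HOME/lib64:${LD_LIBRARY_PATH:-}"

# The first loader search entry should now be the CUDA lib directory:
echo "$LD_LIBRARY_PATH" | tr ':' '\n' | head -n 1

# Warn early if the runtime the build will look for is not there:
[ -e "$CUDA_HOME/lib64/libcudart.so.7.5" ] || \
    echo "libcudart.so.7.5 not found under $CUDA_HOME/lib64"
```

Note that `LD_LIBRARY_PATH` set in an interactive shell does not automatically survive into bazel's build sandbox, which may be why exporting it did not help here.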
From kandasamy at cmu.edu Fri Oct 21 13:14:11 2016
From: kandasamy at cmu.edu (Kirthevasan Kandasamy)
Date: Fri, 21 Oct 2016 13:14:11 -0400
Subject: GPU3 back in business
In-Reply-To: 
References: <20161019172209.esq71ATV8%predragp@cs.cmu.edu> <20161019183651.3Xy0murhr%predragp@cs.cmu.edu> <7B1F10ED8F1045F6.b801b0a4-484f-4c6e-9558-4a83a0cb33ec@mail.outlook.com> <20161019201053.KHd8koxJl%predragp@cs.cmu.edu>
Message-ID: 

Predrag,

Any updates on gpu3? I have tried both tensorflow and chainer, and in both cases the problem seems to be with cuda.

On Wed, Oct 19, 2016 at 4:10 PM, Predrag Punosevac wrote:

> Dougal Sutherland wrote:
>
> > I tried for a while. I failed.
>
> Damn, this doesn't look good. I guess back to the drawing board. Thanks
> for the quick feedback.
>
> Predrag
>
> > Version 0.10.0 fails immediately on build: "The specified --crosstool_top
> > '@local_config_cuda//crosstool:crosstool' is not a valid cc_toolchain_suite
> > rule." Apparently this is because 0.10 required an older version of bazel
> > (https://github.com/tensorflow/tensorflow/issues/4368), and I don't have the
> > energy to install an old version of bazel.
> >
> > Version 0.11.0rc0 gets almost done and then complains about no such file or
> > directory for libcudart.so.7.5 (which is there, where I told tensorflow it
> > was...).
> >
> > Non-release versions from git fail immediately because they call git -C to
> > get version info, which is only in git 1.9 (we have 1.8).
> >
> > Some other notes:
> > - I made a symlink from ~/.cache/bazel to /home/scratch/$USER/.cache/bazel,
> > because bazel is the worst. (It complains about doing things on NFS, and
> > hung for me [clock-related?], and I can't find a global config file or
> > anything to change that in; it seems like there might be one, but their
> > documentation is terrible.)
> > - I wasn't able to use the actual Titan X compute capability of 6.1,
> > because that requires cuda 8; I used 5.2 instead. Probably not a huge deal,
> > but I don't know.
> >
> > - I tried explicitly including /usr/local/cuda/lib64 in LD_LIBRARY_PATH and
> > set CUDA_HOME to /usr/local/cuda before building, hoping that would help
> > with the 0.11.0rc0 problem, but it didn't.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From dougal at gmail.com Fri Oct 21 14:03:21 2016
From: dougal at gmail.com (Dougal Sutherland)
Date: Fri, 21 Oct 2016 18:03:21 +0000
Subject: GPU3 back in business
In-Reply-To: 
References: <20161019172209.esq71ATV8%predragp@cs.cmu.edu> <20161019183651.3Xy0murhr%predragp@cs.cmu.edu> <7B1F10ED8F1045F6.b801b0a4-484f-4c6e-9558-4a83a0cb33ec@mail.outlook.com> <20161019201053.KHd8koxJl%predragp@cs.cmu.edu>
Message-ID: 

I was just able to build tensorflow 0.11.0rc0 on gpu3! I used the cuda 8.0 install, and it built fine. So additionally installing 7.5 was probably not necessary; in fact, cuda 7.5 doesn't know about the 6.1 compute architecture that the Titan Xs use, so Theano at least needs to be manually told to use an older architecture.

A pip package is in ~dsutherl/tensorflow-0.11.0rc0-py2-none-any.whl. I think it should work fine with the cudnn in my scratch directory.

You should probably install it to scratch, either by running this first to put libraries in your scratch directory or by using a virtualenv or something:

    export PYTHONUSERBASE=/home/scratch/$USER/.local

You'll need this to use the library and probably to install it:

    export LD_LIBRARY_PATH=/home/scratch/dsutherl/cudnn-8.0-5.1/cuda/lib64:"$LD_LIBRARY_PATH"

To install:

    pip install --user ~dsutherl/tensorflow-0.11.0rc0-py2-none-any.whl

(remove --user if you're using a virtualenv)

(A request: I'm submitting to ICLR in two weeks, and for some of the models I'm running gpu3's cards are 4x the speed of gpu1 or 2's. So please don't run a ton of stuff on gpu3 unless you're working on a deadline too.)

Steps to install it, for the future:

- Install bazel in your home directory:
  - wget https://github.com/bazelbuild/bazel/releases/download/0.3.2/bazel-0.3.2-installer-linux-x86_64.sh
  - bash bazel-0.3.2-installer-linux-x86_64.sh --prefix=/home/scratch/$USER --base=/home/scratch/$USER/.bazel
- Configure bazel to build in scratch. There's probably a better way to do this, but this works:
  - mkdir /home/scratch/$USER/.cache
  - ln -s /home/scratch/$USER/.cache/bazel ~/.cache/bazel
- Build tensorflow. Note that builds from git checkouts don't work, because they assume a newer version of git than is on gpu3:
  - cd /home/scratch/$USER
  - wget 
  - tar xf 
  - cd tensorflow-0.11.0rc0
  - ./configure
    - This is an interactive script that doesn't seem to let you pass arguments or anything. It's obnoxious.
    - Use the default python
    - don't use cloud platform or hadoop file system
    - use the default site-packages path if it asks
    - build with GPU support
    - default gcc
    - default Cuda SDK version
    - specify /usr/local/cuda-8.0
    - default cudnn version
    - specify $CUDNN_DIR from use-cudnn.sh, e.g. /home/scratch/dsutherl/cudnn-8.0-5.1/cuda
    - Pascal Titan Xs have compute capability 6.1
  - bazel build -c opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
  - bazel-bin/tensorflow/tools/pip_package/build_pip_package ./
  - A .whl file, e.g. tensorflow-0.11.0rc0-py2-none-any.whl, is put in the directory you specified above.

- Dougal

On Fri, Oct 21, 2016 at 6:14 PM Kirthevasan Kandasamy wrote:

Predrag,

Any updates on gpu3? I have tried both tensorflow and chainer and in both cases the problem seems to be with cuda

On Wed, Oct 19, 2016 at 4:10 PM, Predrag Punosevac wrote:

Dougal Sutherland wrote:

> I tried for a while. I failed.

Damn, this doesn't look good. I guess back to the drawing board. Thanks for the quick feedback.
Predrag > Version 0.10.0 fails immediately on build: "The specified --crosstool_top > '@local_config_cuda//crosstool:crosstool' is not a valid cc_toolchain_suite > rule." Apparently this is because 0.10 required an older version of bazel ( > https://github.com/tensorflow/tensorflow/issues/4368), and I don't have the > energy to install an old version of bazel. > > Version 0.11.0rc0 gets almost done and then complains about no such file or > directory for libcudart.so.7.5 (which is there, where I told tensorflow it > was...). > > Non-release versions from git fail immediately because they call git -C to > get version info, which is only in git 1.9 (we have 1.8). > > > Some other notes: > - I made a symlink from ~/.cache/bazel to /home/scratch/$USER/.cache/bazel, > because bazel is the worst. (It complains about doing things on NFS, and > hung for me [clock-related?], and I can't find a global config file or > anything to change that in; it seems like there might be one, but their > documentation is terrible.) > > - I wasn't able to use the actual Titan X compute capability of 6.1, > because that requires cuda 8; I used 5.2 instead. Probably not a huge deal, > but I don't know. > > - I tried explicitly including /usr/local/cuda/lib64 in LD_LIBRARY_PATH and > set CUDA_HOME to /usr/local/cuda before building, hoping that would help > with the 0.11.0rc0 problem, but it didn't. -------------- next part -------------- An HTML attachment was scrubbed... URL: From dougal at gmail.com Fri Oct 21 15:08:09 2016 From: dougal at gmail.com (Dougal Sutherland) Date: Fri, 21 Oct 2016 19:08:09 +0000 Subject: GPU3 back in business In-Reply-To: References: <20161019172209.esq71ATV8%predragp@cs.cmu.edu> <20161019183651.3Xy0murhr%predragp@cs.cmu.edu> <7B1F10ED8F1045F6.b801b0a4-484f-4c6e-9558-4a83a0cb33ec@mail.outlook.com> <20161019201053.KHd8koxJl%predragp@cs.cmu.edu> Message-ID: I installed it in my scratch directory (not sure if there's a global install?). 
The main thing was to put its cache on scratch; it got really upset when the cache directory was on NFS. (Instructions at the bottom of my previous email.) On Fri, Oct 21, 2016, 8:04 PM Barnabas Poczos wrote: > That's great! Thanks Dougal. > > As I remember bazel was not installed correctly previously on GPU3. Do > you know what went wrong with it before and why it is good now? > > Thanks, > Barnabas > ====================== > Barnabas Poczos, PhD > Assistant Professor > Machine Learning Department > Carnegie Mellon University > > > On Fri, Oct 21, 2016 at 2:03 PM, Dougal Sutherland > wrote: > > I was just able to build tensorflow 0.11.0rc0 on gpu3! I used the cuda > 8.0 > > install, and it built fine. So additionally installing 7.5 was probably > not > > necessary; in fact, cuda 7.5 doesn't know about the 6.1 compute > architecture > > that the Titan Xs use, so Theano at least needs to be manually told to > use > > an older architecture. > > > > A pip package is in ~dsutherl/tensorflow-0.11.0rc0-py2-none-any.whl. I > think > > it should work fine with the cudnn in my scratch directory. > > > > You should probably install it to scratch, either running this first to > put > > libraries your scratch directory or using a virtualenv or something: > > export PYTHONUSERBASE=/home/scratch/$USER/.local > > > > You'll need this to use the library and probably to install it: > > export > > > LD_LIBRARY_PATH=/home/scratch/dsutherl/cudnn-8.0-5.1/cuda/lib64:"$LD_LIBRARY_PATH" > > > > To install: > > pip install --user ~dsutherl/tensorflow-0.11.0rc0-py2-none-any.whl > > (remove --user if you're using a virtualenv) > > > > (A request: I'm submitting to ICLR in two weeks, and for some of the > models > > I'm running gpu3's cards are 4x the speed of gpu1 or 2's. So please don't > > run a ton of stuff on gpu3 unless you're working on a deadline too. 
> > > > > > > > Steps to install it, for the future: > > > > Install bazel in your home directory: > > > > wget > > > https://github.com/bazelbuild/bazel/releases/download/0.3.2/bazel-0.3.2-installer-linux-x86_64.sh > > bash bazel-0.3.2-installer-linux-x86_64.sh --prefix=/home/scratch/$USER > > --base=/home/scratch/$USER/.bazel > > > > Configure bazel to build in scratch. There's probably a better way to do > > this, but this works: > > > > mkdir /home/scratch/$USER/.cache > > ln -s /home/scratch/$USER/.cache/bazel ~/.cache/bazel > > > > Build tensorflow. Note that builds from git checkouts don't work, because > > they assume a newer version of git than is on gpu3: > > > > cd /home/scratch/$USER > > wget > > tar xf > > cd tensorflow-0.11.0rc0 > > ./configure > > > > This is an interactive script that doesn't seem to let you pass > arguments or > > anything. It's obnoxious. > > Use the default python > > don't use cloud platform or hadoop file system > > use the default site-packages path if it asks > > build with GPU support > > default gcc > > default Cuda SDK version > > specify /usr/local/cuda-8.0 > > default cudnn version > > specify $CUDNN_DIR from use-cudnn.sh, e.g. > > /home/scratch/dsutherl/cudnn-8.0-5.1/cuda > > Pascal Titan Xs have compute capability 6.1 > > > > bazel build -c opt --config=cuda > > //tensorflow/tools/pip_package:build_pip_package > > bazel-bin/tensorflow/tools/pip_package/build_pip_package ./ > > A .whl file, e.g. tensorflow-0.11.0rc0-py2-none-any.whl, is put in the > > directory you specified above. > > > > > > - Dougal > > > > > > On Fri, Oct 21, 2016 at 6:14 PM Kirthevasan Kandasamy > > > wrote: > >> > >> Predrag, > >> > >> Any updates on gpu3? > >> I have tried both tensorflow and chainer and in both cases the problem > >> seems to be with cuda > >> > >> On Wed, Oct 19, 2016 at 4:10 PM, Predrag Punosevac > > >> wrote: > >>> > >>> Dougal Sutherland wrote: > >>> > >>> > I tried for a while. I failed. 
> >>> > > >>> > >>> Damn this doesn't look good. I guess back to the drawing board. Thanks > >>> for the quick feed back. > >>> > >>> Predrag > >>> > >>> > Version 0.10.0 fails immediately on build: "The specified > >>> > --crosstool_top > >>> > '@local_config_cuda//crosstool:crosstool' is not a valid > >>> > cc_toolchain_suite > >>> > rule." Apparently this is because 0.10 required an older version of > >>> > bazel ( > >>> > https://github.com/tensorflow/tensorflow/issues/4368), and I don't > have > >>> > the > >>> > energy to install an old version of bazel. > >>> > > >>> > Version 0.11.0rc0 gets almost done and then complains about no such > >>> > file or > >>> > directory for libcudart.so.7.5 (which is there, where I told > tensorflow > >>> > it > >>> > was...). > >>> > > >>> > Non-release versions from git fail immediately because they call git > -C > >>> > to > >>> > get version info, which is only in git 1.9 (we have 1.8). > >>> > > >>> > > >>> > Some other notes: > >>> > - I made a symlink from ~/.cache/bazel to > >>> > /home/scratch/$USER/.cache/bazel, > >>> > because bazel is the worst. (It complains about doing things on NFS, > >>> > and > >>> > hung for me [clock-related?], and I can't find a global config file > or > >>> > anything to change that in; it seems like there might be one, but > their > >>> > documentation is terrible.) > >>> > > >>> > - I wasn't able to use the actual Titan X compute capability of 6.1, > >>> > because that requires cuda 8; I used 5.2 instead. Probably not a huge > >>> > deal, > >>> > but I don't know. > >>> > > >>> > - I tried explicitly including /usr/local/cuda/lib64 in > LD_LIBRARY_PATH > >>> > and > >>> > set CUDA_HOME to /usr/local/cuda before building, hoping that would > >>> > help > >>> > with the 0.11.0rc0 problem, but it didn't. > >> > >> > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: 

From kandasamy at cmu.edu Fri Oct 21 15:10:50 2016
From: kandasamy at cmu.edu (Kirthevasan Kandasamy)
Date: Fri, 21 Oct 2016 15:10:50 -0400
Subject: GPU3 back in business
In-Reply-To: 
References: <20161019172209.esq71ATV8%predragp@cs.cmu.edu> <20161019183651.3Xy0murhr%predragp@cs.cmu.edu> <7B1F10ED8F1045F6.b801b0a4-484f-4c6e-9558-4a83a0cb33ec@mail.outlook.com> <20161019201053.KHd8koxJl%predragp@cs.cmu.edu>
Message-ID: 

Thanks Dougal. I'll take a look at this and get back to you. So are you suggesting that this is an issue with the Titan Xs not being compatible with 7.5?

On Fri, Oct 21, 2016 at 3:08 PM, Dougal Sutherland wrote:

> I installed it in my scratch directory (not sure if there's a global
> install?). The main thing was to put its cache on scratch; it got really
> upset when the cache directory was on NFS. (Instructions at the bottom of
> my previous email.)
>
> On Fri, Oct 21, 2016, 8:04 PM Barnabas Poczos wrote:
>
>> That's great! Thanks Dougal.
>>
>> As I remember bazel was not installed correctly previously on GPU3. Do
>> you know what went wrong with it before and why it is good now?
>>
>> Thanks,
>> Barnabas
>> ======================
>> Barnabas Poczos, PhD
>> Assistant Professor
>> Machine Learning Department
>> Carnegie Mellon University
>>
>> On Fri, Oct 21, 2016 at 2:03 PM, Dougal Sutherland wrote:
>> > I was just able to build tensorflow 0.11.0rc0 on gpu3! I used the cuda 8.0
>> > install, and it built fine. So additionally installing 7.5 was probably not
>> > necessary; in fact, cuda 7.5 doesn't know about the 6.1 compute architecture
>> > that the Titan Xs use, so Theano at least needs to be manually told to use
>> > an older architecture.
>> >
>> > A pip package is in ~dsutherl/tensorflow-0.11.0rc0-py2-none-any.whl. I think
>> > it should work fine with the cudnn in my scratch directory.
>> > >> > You should probably install it to scratch, either running this first to >> put >> > libraries your scratch directory or using a virtualenv or something: >> > export PYTHONUSERBASE=/home/scratch/$USER/.local >> > >> > You'll need this to use the library and probably to install it: >> > export >> > LD_LIBRARY_PATH=/home/scratch/dsutherl/cudnn-8.0-5.1/cuda/ >> lib64:"$LD_LIBRARY_PATH" >> > >> > To install: >> > pip install --user ~dsutherl/tensorflow-0.11.0rc0-py2-none-any.whl >> > (remove --user if you're using a virtualenv) >> > >> > (A request: I'm submitting to ICLR in two weeks, and for some of the >> models >> > I'm running gpu3's cards are 4x the speed of gpu1 or 2's. So please >> don't >> > run a ton of stuff on gpu3 unless you're working on a deadline too. >> > >> > >> > >> > Steps to install it, for the future: >> > >> > Install bazel in your home directory: >> > >> > wget >> > https://github.com/bazelbuild/bazel/releases/download/0.3.2/ >> bazel-0.3.2-installer-linux-x86_64.sh >> > bash bazel-0.3.2-installer-linux-x86_64.sh --prefix=/home/scratch/$USER >> > --base=/home/scratch/$USER/.bazel >> > >> > Configure bazel to build in scratch. There's probably a better way to do >> > this, but this works: >> > >> > mkdir /home/scratch/$USER/.cache >> > ln -s /home/scratch/$USER/.cache/bazel ~/.cache/bazel >> > >> > Build tensorflow. Note that builds from git checkouts don't work, >> because >> > they assume a newer version of git than is on gpu3: >> > >> > cd /home/scratch/$USER >> > wget >> > tar xf >> > cd tensorflow-0.11.0rc0 >> > ./configure >> > >> > This is an interactive script that doesn't seem to let you pass >> arguments or >> > anything. It's obnoxious. 
>> > Use the default python >> > don't use cloud platform or hadoop file system >> > use the default site-packages path if it asks >> > build with GPU support >> > default gcc >> > default Cuda SDK version >> > specify /usr/local/cuda-8.0 >> > default cudnn version >> > specify $CUDNN_DIR from use-cudnn.sh, e.g. >> > /home/scratch/dsutherl/cudnn-8.0-5.1/cuda >> > Pascal Titan Xs have compute capability 6.1 >> > >> > bazel build -c opt --config=cuda >> > //tensorflow/tools/pip_package:build_pip_package >> > bazel-bin/tensorflow/tools/pip_package/build_pip_package ./ >> > A .whl file, e.g. tensorflow-0.11.0rc0-py2-none-any.whl, is put in the >> > directory you specified above. >> > >> > >> > - Dougal >> > >> > >> > On Fri, Oct 21, 2016 at 6:14 PM Kirthevasan Kandasamy < >> kandasamy at cmu.edu> >> > wrote: >> >> >> >> Predrag, >> >> >> >> Any updates on gpu3? >> >> I have tried both tensorflow and chainer and in both cases the problem >> >> seems to be with cuda >> >> >> >> On Wed, Oct 19, 2016 at 4:10 PM, Predrag Punosevac < >> predragp at cs.cmu.edu> >> >> wrote: >> >>> >> >>> Dougal Sutherland wrote: >> >>> >> >>> > I tried for a while. I failed. >> >>> > >> >>> >> >>> Damn this doesn't look good. I guess back to the drawing board. Thanks >> >>> for the quick feed back. >> >>> >> >>> Predrag >> >>> >> >>> > Version 0.10.0 fails immediately on build: "The specified >> >>> > --crosstool_top >> >>> > '@local_config_cuda//crosstool:crosstool' is not a valid >> >>> > cc_toolchain_suite >> >>> > rule." Apparently this is because 0.10 required an older version of >> >>> > bazel ( >> >>> > https://github.com/tensorflow/tensorflow/issues/4368), and I don't >> have >> >>> > the >> >>> > energy to install an old version of bazel. >> >>> > >> >>> > Version 0.11.0rc0 gets almost done and then complains about no such >> >>> > file or >> >>> > directory for libcudart.so.7.5 (which is there, where I told >> tensorflow >> >>> > it >> >>> > was...). 
>> >>> > >> >>> > Non-release versions from git fail immediately because they call >> git -C >> >>> > to >> >>> > get version info, which is only in git 1.9 (we have 1.8). >> >>> > >> >>> > >> >>> > Some other notes: >> >>> > - I made a symlink from ~/.cache/bazel to >> >>> > /home/scratch/$USER/.cache/bazel, >> >>> > because bazel is the worst. (It complains about doing things on NFS, >> >>> > and >> >>> > hung for me [clock-related?], and I can't find a global config file >> or >> >>> > anything to change that in; it seems like there might be one, but >> their >> >>> > documentation is terrible.) >> >>> > >> >>> > - I wasn't able to use the actual Titan X compute capability of 6.1, >> >>> > because that requires cuda 8; I used 5.2 instead. Probably not a >> huge >> >>> > deal, >> >>> > but I don't know. >> >>> > >> >>> > - I tried explicitly including /usr/local/cuda/lib64 in >> LD_LIBRARY_PATH >> >>> > and >> >>> > set CUDA_HOME to /usr/local/cuda before building, hoping that would >> >>> > help >> >>> > with the 0.11.0rc0 problem, but it didn't. >> >> >> >> >> > >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From bapoczos at cs.cmu.edu Fri Oct 21 15:04:08 2016 From: bapoczos at cs.cmu.edu (Barnabas Poczos) Date: Fri, 21 Oct 2016 15:04:08 -0400 Subject: GPU3 back in business In-Reply-To: References: <20161019172209.esq71ATV8%predragp@cs.cmu.edu> <20161019183651.3Xy0murhr%predragp@cs.cmu.edu> <7B1F10ED8F1045F6.b801b0a4-484f-4c6e-9558-4a83a0cb33ec@mail.outlook.com> <20161019201053.KHd8koxJl%predragp@cs.cmu.edu> Message-ID: That's great! Thanks Dougal. As I remember bazel was not installed correctly previously on GPU3. Do you know what went wrong with it before and why it is good now? 
Thanks, Barnabas ====================== Barnabas Poczos, PhD Assistant Professor Machine Learning Department Carnegie Mellon University On Fri, Oct 21, 2016 at 2:03 PM, Dougal Sutherland wrote: > I was just able to build tensorflow 0.11.0rc0 on gpu3! I used the cuda 8.0 > install, and it built fine. So additionally installing 7.5 was probably not > necessary; in fact, cuda 7.5 doesn't know about the 6.1 compute architecture > that the Titan Xs use, so Theano at least needs to be manually told to use > an older architecture. > > A pip package is in ~dsutherl/tensorflow-0.11.0rc0-py2-none-any.whl. I think > it should work fine with the cudnn in my scratch directory. > > You should probably install it to scratch, either running this first to put > libraries your scratch directory or using a virtualenv or something: > export PYTHONUSERBASE=/home/scratch/$USER/.local > > You'll need this to use the library and probably to install it: > export > LD_LIBRARY_PATH=/home/scratch/dsutherl/cudnn-8.0-5.1/cuda/lib64:"$LD_LIBRARY_PATH" > > To install: > pip install --user ~dsutherl/tensorflow-0.11.0rc0-py2-none-any.whl > (remove --user if you're using a virtualenv) > > (A request: I'm submitting to ICLR in two weeks, and for some of the models > I'm running gpu3's cards are 4x the speed of gpu1 or 2's. So please don't > run a ton of stuff on gpu3 unless you're working on a deadline too. > > > > Steps to install it, for the future: > > Install bazel in your home directory: > > wget > https://github.com/bazelbuild/bazel/releases/download/0.3.2/bazel-0.3.2-installer-linux-x86_64.sh > bash bazel-0.3.2-installer-linux-x86_64.sh --prefix=/home/scratch/$USER > --base=/home/scratch/$USER/.bazel > > Configure bazel to build in scratch. There's probably a better way to do > this, but this works: > > mkdir /home/scratch/$USER/.cache > ln -s /home/scratch/$USER/.cache/bazel ~/.cache/bazel > > Build tensorflow. 
Note that builds from git checkouts don't work, because > they assume a newer version of git than is on gpu3: > > cd /home/scratch/$USER > wget > tar xf > cd tensorflow-0.11.0rc0 > ./configure > > This is an interactive script that doesn't seem to let you pass arguments or > anything. It's obnoxious. > Use the default python > don't use cloud platform or hadoop file system > use the default site-packages path if it asks > build with GPU support > default gcc > default Cuda SDK version > specify /usr/local/cuda-8.0 > default cudnn version > specify $CUDNN_DIR from use-cudnn.sh, e.g. > /home/scratch/dsutherl/cudnn-8.0-5.1/cuda > Pascal Titan Xs have compute capability 6.1 > > bazel build -c opt --config=cuda > //tensorflow/tools/pip_package:build_pip_package > bazel-bin/tensorflow/tools/pip_package/build_pip_package ./ > A .whl file, e.g. tensorflow-0.11.0rc0-py2-none-any.whl, is put in the > directory you specified above. > > > - Dougal > > > On Fri, Oct 21, 2016 at 6:14 PM Kirthevasan Kandasamy > wrote: >> >> Predrag, >> >> Any updates on gpu3? >> I have tried both tensorflow and chainer and in both cases the problem >> seems to be with cuda >> >> On Wed, Oct 19, 2016 at 4:10 PM, Predrag Punosevac >> wrote: >>> >>> Dougal Sutherland wrote: >>> >>> > I tried for a while. I failed. >>> > >>> >>> Damn this doesn't look good. I guess back to the drawing board. Thanks >>> for the quick feed back. >>> >>> Predrag >>> >>> > Version 0.10.0 fails immediately on build: "The specified >>> > --crosstool_top >>> > '@local_config_cuda//crosstool:crosstool' is not a valid >>> > cc_toolchain_suite >>> > rule." Apparently this is because 0.10 required an older version of >>> > bazel ( >>> > https://github.com/tensorflow/tensorflow/issues/4368), and I don't have >>> > the >>> > energy to install an old version of bazel. 
>>> > Version 0.11.0rc0 gets almost done and then complains about no such file or
>>> > directory for libcudart.so.7.5 (which is there, where I told tensorflow it
>>> > was...).
>>> >
>>> > Non-release versions from git fail immediately because they call git -C to
>>> > get version info, which is only in git 1.9 (we have 1.8).
>>> >
>>> > Some other notes:
>>> > - I made a symlink from ~/.cache/bazel to /home/scratch/$USER/.cache/bazel,
>>> > because bazel is the worst. (It complains about doing things on NFS, and
>>> > hung for me [clock-related?], and I can't find a global config file or
>>> > anything to change that in; it seems like there might be one, but their
>>> > documentation is terrible.)
>>> >
>>> > - I wasn't able to use the actual Titan X compute capability of 6.1,
>>> > because that requires cuda 8; I used 5.2 instead. Probably not a huge deal,
>>> > but I don't know.
>>> >
>>> > - I tried explicitly including /usr/local/cuda/lib64 in LD_LIBRARY_PATH and
>>> > set CUDA_HOME to /usr/local/cuda before building, hoping that would help
>>> > with the 0.11.0rc0 problem, but it didn't.

From dougal at gmail.com Fri Oct 21 15:17:13 2016
From: dougal at gmail.com (Dougal Sutherland)
Date: Fri, 21 Oct 2016 19:17:13 +0000
Subject: GPU3 back in business
In-Reply-To: 
References: <20161019172209.esq71ATV8%predragp@cs.cmu.edu> <20161019183651.3Xy0murhr%predragp@cs.cmu.edu> <7B1F10ED8F1045F6.b801b0a4-484f-4c6e-9558-4a83a0cb33ec@mail.outlook.com> <20161019201053.KHd8koxJl%predragp@cs.cmu.edu>
Message-ID: 

They do work with 7.5 if you specify an older compute architecture; it's just that their actual compute capability of 6.1 isn't supported by cuda 7.5. Theano is thrown off by this, for example, but it can be fixed by telling it to pass compute capability 5.2 (for example) to nvcc. I don't think that this was my problem with building tensorflow on 7.5; I'm not sure what that was.
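The fallback described above comes down to passing nvcc `-gencode` flags that CUDA 7.5 understands. A hedged sketch of what that looks like (the flag syntax is standard nvcc; the choice of 5.2 follows the thread, and the kernel file name is hypothetical):

```shell
#!/bin/sh
# CUDA 7.5's nvcc predates sm_61 (Pascal Titan X), so target sm_52 and
# also embed compute_52 PTX so the driver can JIT it for newer cards.
GENCODE="-gencode arch=compute_52,code=sm_52 -gencode arch=compute_52,code=compute_52"
echo "$GENCODE"

# Usage with a hypothetical kernel file:
#   nvcc $GENCODE -O2 -o kernel kernel.cu
# The Theano equivalent is roughly:
#   THEANO_FLAGS='nvcc.flags=-arch=sm_52' python train.py
```

Because the binary only contains sm_52 SASS plus PTX, it runs on the Titan X via JIT compilation, just without any 6.1-specific optimizations.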
On Fri, Oct 21, 2016, 8:11 PM Kirthevasan Kandasamy wrote: > Thanks Dougal. I'll take a look at this and get back to you. > So are you suggesting that this is an issue with Titan Xs not being > compatible with 7.5? > > On Fri, Oct 21, 2016 at 3:08 PM, Dougal Sutherland > wrote: > > I installed it in my scratch directory (not sure if there's a global > install?). The main thing was to put its cache on scratch; it got really > upset when the cache directory was on NFS. (Instructions at the bottom of > my previous email.) > > On Fri, Oct 21, 2016, 8:04 PM Barnabas Poczos wrote: > > That's great! Thanks Dougal. > > As I remember, bazel was not installed correctly previously on GPU3. Do > you know what went wrong with it before and why it is good now? > > Thanks, > Barnabas > ====================== > Barnabas Poczos, PhD > Assistant Professor > Machine Learning Department > Carnegie Mellon University > > > On Fri, Oct 21, 2016 at 2:03 PM, Dougal Sutherland > wrote: > > I was just able to build tensorflow 0.11.0rc0 on gpu3! I used the cuda > 8.0 > > install, and it built fine. So additionally installing 7.5 was probably > not > > necessary; in fact, cuda 7.5 doesn't know about the 6.1 compute > architecture > > that the Titan Xs use, so Theano at least needs to be manually told to > use > > an older architecture. > > > > A pip package is in ~dsutherl/tensorflow-0.11.0rc0-py2-none-any.whl. I > think > > it should work fine with the cudnn in my scratch directory.
> > > > You should probably install it to scratch, either running this first to > put > > libraries in your scratch directory or using a virtualenv or something: > > export PYTHONUSERBASE=/home/scratch/$USER/.local > > > > You'll need this to use the library and probably to install it: > > export > > > LD_LIBRARY_PATH=/home/scratch/dsutherl/cudnn-8.0-5.1/cuda/lib64:"$LD_LIBRARY_PATH" > > > > To install: > > pip install --user ~dsutherl/tensorflow-0.11.0rc0-py2-none-any.whl > > (remove --user if you're using a virtualenv) > > > > (A request: I'm submitting to ICLR in two weeks, and for some of the > models > > I'm running, gpu3's cards are 4x the speed of gpu1 or 2's. So please don't > > run a ton of stuff on gpu3 unless you're working on a deadline too.) > > > > > > > > Steps to install it, for the future: > > > > Install bazel in your home directory: > > > > wget > > > https://github.com/bazelbuild/bazel/releases/download/0.3.2/bazel-0.3.2-installer-linux-x86_64.sh > > bash bazel-0.3.2-installer-linux-x86_64.sh --prefix=/home/scratch/$USER > > --base=/home/scratch/$USER/.bazel > > > > Configure bazel to build in scratch. There's probably a better way to do > > this, but this works: > > > > mkdir /home/scratch/$USER/.cache > > ln -s /home/scratch/$USER/.cache/bazel ~/.cache/bazel > > > > Build tensorflow. Note that builds from git checkouts don't work, because > > they assume a newer version of git than is on gpu3: > > > > cd /home/scratch/$USER > > wget > > tar xf > > cd tensorflow-0.11.0rc0 > > ./configure > > > > This is an interactive script that doesn't seem to let you pass > arguments or > > anything. It's obnoxious. > > Use the default python > > don't use cloud platform or hadoop file system > > use the default site-packages path if it asks > > build with GPU support > > default gcc > > default Cuda SDK version > > specify /usr/local/cuda-8.0 > > default cudnn version > > specify $CUDNN_DIR from use-cudnn.sh, e.g.
> > /home/scratch/dsutherl/cudnn-8.0-5.1/cuda > > Pascal Titan Xs have compute capability 6.1 > > > > bazel build -c opt --config=cuda > > //tensorflow/tools/pip_package:build_pip_package > > bazel-bin/tensorflow/tools/pip_package/build_pip_package ./ > > A .whl file, e.g. tensorflow-0.11.0rc0-py2-none-any.whl, is put in the > > directory you specified above. > > > > > > - Dougal > > > > > > On Fri, Oct 21, 2016 at 6:14 PM Kirthevasan Kandasamy > > > wrote: > >> > >> Predrag, > >> > >> Any updates on gpu3? > >> I have tried both tensorflow and chainer and in both cases the problem > >> seems to be with cuda > >> > >> On Wed, Oct 19, 2016 at 4:10 PM, Predrag Punosevac > > >> wrote: > >>> > >>> Dougal Sutherland wrote: > >>> > >>> > I tried for a while. I failed. > >>> > > >>> > >>> Damn this doesn't look good. I guess back to the drawing board. Thanks > >>> for the quick feed back. > >>> > >>> Predrag > >>> > >>> > Version 0.10.0 fails immediately on build: "The specified > >>> > --crosstool_top > >>> > '@local_config_cuda//crosstool:crosstool' is not a valid > >>> > cc_toolchain_suite > >>> > rule." Apparently this is because 0.10 required an older version of > >>> > bazel ( > >>> > https://github.com/tensorflow/tensorflow/issues/4368), and I don't > have > >>> > the > >>> > energy to install an old version of bazel. > >>> > > >>> > Version 0.11.0rc0 gets almost done and then complains about no such > >>> > file or > >>> > directory for libcudart.so.7.5 (which is there, where I told > tensorflow > >>> > it > >>> > was...). > >>> > > >>> > Non-release versions from git fail immediately because they call git > -C > >>> > to > >>> > get version info, which is only in git 1.9 (we have 1.8). > >>> > > >>> > > >>> > Some other notes: > >>> > - I made a symlink from ~/.cache/bazel to > >>> > /home/scratch/$USER/.cache/bazel, > >>> > because bazel is the worst. 
(It complains about doing things on NFS, > >>> > and > >>> > hung for me [clock-related?], and I can't find a global config file > or > >>> > anything to change that in; it seems like there might be one, but > their > >>> > documentation is terrible.) > >>> > > >>> > - I wasn't able to use the actual Titan X compute capability of 6.1, > >>> > because that requires cuda 8; I used 5.2 instead. Probably not a huge > >>> > deal, > >>> > but I don't know. > >>> > > >>> > - I tried explicitly including /usr/local/cuda/lib64 in > LD_LIBRARY_PATH > >>> > and > >>> > set CUDA_HOME to /usr/local/cuda before building, hoping that would > >>> > help > >>> > with the 0.11.0rc0 problem, but it didn't. > >> > >> > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From kandasamy at cmu.edu Fri Oct 21 15:20:13 2016 From: kandasamy at cmu.edu (Kirthevasan Kandasamy) Date: Fri, 21 Oct 2016 15:20:13 -0400 Subject: GPU3 back in business In-Reply-To: References: <20161019172209.esq71ATV8%predragp@cs.cmu.edu> <20161019183651.3Xy0murhr%predragp@cs.cmu.edu> <7B1F10ED8F1045F6.b801b0a4-484f-4c6e-9558-4a83a0cb33ec@mail.outlook.com> <20161019201053.KHd8koxJl%predragp@cs.cmu.edu> Message-ID: I didn't understand half of what you said :P But I'll give this a shot and get back to you if I run into any issues. On Fri, Oct 21, 2016 at 3:17 PM, Dougal Sutherland wrote: > They do work with 7.5 if you specify an older compute architecture; it's > just that their actual compute capability of 6.1 isn't supported by cuda > 7.5. Theano is thrown off by this, for example, but it can be fixed by > telling it to pass compute capability 5.2 (for example) to nvcc. I don't > think that this was my problem with building tensorflow on 7.5; I'm not > sure what that was. > > On Fri, Oct 21, 2016, 8:11 PM Kirthevasan Kandasamy > wrote: > >> Thanks Dougal. I'll take a look at this and get back to you.
>> So are you suggesting that this is an issue with TitanX's not being >> compatible with 7.5? >> >> On Fri, Oct 21, 2016 at 3:08 PM, Dougal Sutherland >> wrote: >> >> I installed it in my scratch directory (not sure if there's a global >> install?). The main thing was to put its cache on scratch; it got really >> upset when the cache directory was on NFS. (Instructions at the bottom of >> my previous email.) >> >> On Fri, Oct 21, 2016, 8:04 PM Barnabas Poczos >> wrote: >> >> That's great! Thanks Dougal. >> >> As I remember bazel was not installed correctly previously on GPU3. Do >> you know what went wrong with it before and why it is good now? >> >> Thanks, >> Barnabas >> ====================== >> Barnabas Poczos, PhD >> Assistant Professor >> Machine Learning Department >> Carnegie Mellon University >> >> >> On Fri, Oct 21, 2016 at 2:03 PM, Dougal Sutherland >> wrote: >> > I was just able to build tensorflow 0.11.0rc0 on gpu3! I used the cuda >> 8.0 >> > install, and it built fine. So additionally installing 7.5 was probably >> not >> > necessary; in fact, cuda 7.5 doesn't know about the 6.1 compute >> architecture >> > that the Titan Xs use, so Theano at least needs to be manually told to >> use >> > an older architecture. >> > >> > A pip package is in ~dsutherl/tensorflow-0.11.0rc0-py2-none-any.whl. I >> think >> > it should work fine with the cudnn in my scratch directory. 
>> > >> > You should probably install it to scratch, either running this first to >> put >> > libraries your scratch directory or using a virtualenv or something: >> > export PYTHONUSERBASE=/home/scratch/$USER/.local >> > >> > You'll need this to use the library and probably to install it: >> > export >> > LD_LIBRARY_PATH=/home/scratch/dsutherl/cudnn-8.0-5.1/cuda/ >> lib64:"$LD_LIBRARY_PATH" >> > >> > To install: >> > pip install --user ~dsutherl/tensorflow-0.11.0rc0-py2-none-any.whl >> > (remove --user if you're using a virtualenv) >> > >> > (A request: I'm submitting to ICLR in two weeks, and for some of the >> models >> > I'm running gpu3's cards are 4x the speed of gpu1 or 2's. So please >> don't >> > run a ton of stuff on gpu3 unless you're working on a deadline too. >> > >> > >> > >> > Steps to install it, for the future: >> > >> > Install bazel in your home directory: >> > >> > wget >> > https://github.com/bazelbuild/bazel/releases/download/0.3.2/ >> bazel-0.3.2-installer-linux-x86_64.sh >> > bash bazel-0.3.2-installer-linux-x86_64.sh --prefix=/home/scratch/$USER >> > --base=/home/scratch/$USER/.bazel >> > >> > Configure bazel to build in scratch. There's probably a better way to do >> > this, but this works: >> > >> > mkdir /home/scratch/$USER/.cache >> > ln -s /home/scratch/$USER/.cache/bazel ~/.cache/bazel >> > >> > Build tensorflow. Note that builds from git checkouts don't work, >> because >> > they assume a newer version of git than is on gpu3: >> > >> > cd /home/scratch/$USER >> > wget >> > tar xf >> > cd tensorflow-0.11.0rc0 >> > ./configure >> > >> > This is an interactive script that doesn't seem to let you pass >> arguments or >> > anything. It's obnoxious. 
>> > Use the default python >> > don't use cloud platform or hadoop file system >> > use the default site-packages path if it asks >> > build with GPU support >> > default gcc >> > default Cuda SDK version >> > specify /usr/local/cuda-8.0 >> > default cudnn version >> > specify $CUDNN_DIR from use-cudnn.sh, e.g. >> > /home/scratch/dsutherl/cudnn-8.0-5.1/cuda >> > Pascal Titan Xs have compute capability 6.1 >> > >> > bazel build -c opt --config=cuda >> > //tensorflow/tools/pip_package:build_pip_package >> > bazel-bin/tensorflow/tools/pip_package/build_pip_package ./ >> > A .whl file, e.g. tensorflow-0.11.0rc0-py2-none-any.whl, is put in the >> > directory you specified above. >> > >> > >> > - Dougal >> > >> > >> > On Fri, Oct 21, 2016 at 6:14 PM Kirthevasan Kandasamy < >> kandasamy at cmu.edu> >> > wrote: >> >> >> >> Predrag, >> >> >> >> Any updates on gpu3? >> >> I have tried both tensorflow and chainer and in both cases the problem >> >> seems to be with cuda >> >> >> >> On Wed, Oct 19, 2016 at 4:10 PM, Predrag Punosevac < >> predragp at cs.cmu.edu> >> >> wrote: >> >>> >> >>> Dougal Sutherland wrote: >> >>> >> >>> > I tried for a while. I failed. >> >>> > >> >>> >> >>> Damn this doesn't look good. I guess back to the drawing board. Thanks >> >>> for the quick feed back. >> >>> >> >>> Predrag >> >>> >> >>> > Version 0.10.0 fails immediately on build: "The specified >> >>> > --crosstool_top >> >>> > '@local_config_cuda//crosstool:crosstool' is not a valid >> >>> > cc_toolchain_suite >> >>> > rule." Apparently this is because 0.10 required an older version of >> >>> > bazel ( >> >>> > https://github.com/tensorflow/tensorflow/issues/4368), and I don't >> have >> >>> > the >> >>> > energy to install an old version of bazel. >> >>> > >> >>> > Version 0.11.0rc0 gets almost done and then complains about no such >> >>> > file or >> >>> > directory for libcudart.so.7.5 (which is there, where I told >> tensorflow >> >>> > it >> >>> > was...). 
>> >>> > >> >>> > Non-release versions from git fail immediately because they call >> git -C >> >>> > to >> >>> > get version info, which is only in git 1.9 (we have 1.8). >> >>> > >> >>> > >> >>> > Some other notes: >> >>> > - I made a symlink from ~/.cache/bazel to >> >>> > /home/scratch/$USER/.cache/bazel, >> >>> > because bazel is the worst. (It complains about doing things on NFS, >> >>> > and >> >>> > hung for me [clock-related?], and I can't find a global config file >> or >> >>> > anything to change that in; it seems like there might be one, but >> their >> >>> > documentation is terrible.) >> >>> > >> >>> > - I wasn't able to use the actual Titan X compute capability of 6.1, >> >>> > because that requires cuda 8; I used 5.2 instead. Probably not a >> huge >> >>> > deal, >> >>> > but I don't know. >> >>> > >> >>> > - I tried explicitly including /usr/local/cuda/lib64 in >> LD_LIBRARY_PATH >> >>> > and >> >>> > set CUDA_HOME to /usr/local/cuda before building, hoping that would >> >>> > help >> >>> > with the 0.11.0rc0 problem, but it didn't. >> >> >> >> >> > >> >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From dougal at gmail.com Fri Oct 21 15:27:32 2016 From: dougal at gmail.com (Dougal Sutherland) Date: Fri, 21 Oct 2016 19:27:32 +0000 Subject: GPU3 back in business In-Reply-To: References: <20161019172209.esq71ATV8%predragp@cs.cmu.edu> <20161019183651.3Xy0murhr%predragp@cs.cmu.edu> <7B1F10ED8F1045F6.b801b0a4-484f-4c6e-9558-4a83a0cb33ec@mail.outlook.com> <20161019201053.KHd8koxJl%predragp@cs.cmu.edu> Message-ID: Heh. :) An explanation: - Different nvidia gpu architectures are called "compute capabilities". This is a number that describes the behavior of the card: the maximum size of various things, which API functions it supports, etc. There's a reference here , but it shouldn't really matter. - When CUDA compiles code, it targets a certain architecture, since it needs to know what features to use and whatnot. 
I *think* that if you compile for compute capability x, it will work on a card with compute capability y approximately iff x <= y. - Pascal Titan Xs, like gpu3 has, have compute capability 6.1. - CUDA 7.5 doesn't know about compute capability 6.1, so if you ask to compile for 6.1 it crashes. - Theano by default tries to compile for the capability of the card, but can be configured to compile for a different capability. - Tensorflow asks for a list of capabilities to compile for when you build it in the first place. On Fri, Oct 21, 2016 at 8:17 PM Dougal Sutherland wrote: > They do work with 7.5 if you specify an older compute architecture; it's > just that their actual compute capability of 6.1 isn't supported by cuda > 7.5. Thank is thrown off by this, for example, but it can be fixed by > telling it to pass compute capability 5.2 (for example) to nvcc. I don't > think that this was my problem with building tensorflow on 7.5; I'm not > sure what that was. > > On Fri, Oct 21, 2016, 8:11 PM Kirthevasan Kandasamy > wrote: > > Thanks Dougal. I'll take a look atthis and get back to you. > So are you suggesting that this is an issue with TitanX's not being > compatible with 7.5? > > On Fri, Oct 21, 2016 at 3:08 PM, Dougal Sutherland > wrote: > > I installed it in my scratch directory (not sure if there's a global > install?). The main thing was to put its cache on scratch; it got really > upset when the cache directory was on NFS. (Instructions at the bottom of > my previous email.) > > On Fri, Oct 21, 2016, 8:04 PM Barnabas Poczos wrote: > > That's great! Thanks Dougal. > > As I remember bazel was not installed correctly previously on GPU3. Do > you know what went wrong with it before and why it is good now? 
> > Thanks, > Barnabas > ====================== > Barnabas Poczos, PhD > Assistant Professor > Machine Learning Department > Carnegie Mellon University > > > On Fri, Oct 21, 2016 at 2:03 PM, Dougal Sutherland > wrote: > > I was just able to build tensorflow 0.11.0rc0 on gpu3! I used the cuda > 8.0 > > install, and it built fine. So additionally installing 7.5 was probably > not > > necessary; in fact, cuda 7.5 doesn't know about the 6.1 compute > architecture > > that the Titan Xs use, so Theano at least needs to be manually told to > use > > an older architecture. > > > > A pip package is in ~dsutherl/tensorflow-0.11.0rc0-py2-none-any.whl. I > think > > it should work fine with the cudnn in my scratch directory. > > > > You should probably install it to scratch, either running this first to > put > > libraries your scratch directory or using a virtualenv or something: > > export PYTHONUSERBASE=/home/scratch/$USER/.local > > > > You'll need this to use the library and probably to install it: > > export > > > LD_LIBRARY_PATH=/home/scratch/dsutherl/cudnn-8.0-5.1/cuda/lib64:"$LD_LIBRARY_PATH" > > > > To install: > > pip install --user ~dsutherl/tensorflow-0.11.0rc0-py2-none-any.whl > > (remove --user if you're using a virtualenv) > > > > (A request: I'm submitting to ICLR in two weeks, and for some of the > models > > I'm running gpu3's cards are 4x the speed of gpu1 or 2's. So please don't > > run a ton of stuff on gpu3 unless you're working on a deadline too. > > > > > > > > Steps to install it, for the future: > > > > Install bazel in your home directory: > > > > wget > > > https://github.com/bazelbuild/bazel/releases/download/0.3.2/bazel-0.3.2-installer-linux-x86_64.sh > > bash bazel-0.3.2-installer-linux-x86_64.sh --prefix=/home/scratch/$USER > > --base=/home/scratch/$USER/.bazel > > > > Configure bazel to build in scratch. 
There's probably a better way to do > > this, but this works: > > > > mkdir /home/scratch/$USER/.cache > > ln -s /home/scratch/$USER/.cache/bazel ~/.cache/bazel > > > > Build tensorflow. Note that builds from git checkouts don't work, because > > they assume a newer version of git than is on gpu3: > > > > cd /home/scratch/$USER > > wget > > tar xf > > cd tensorflow-0.11.0rc0 > > ./configure > > > > This is an interactive script that doesn't seem to let you pass > arguments or > > anything. It's obnoxious. > > Use the default python > > don't use cloud platform or hadoop file system > > use the default site-packages path if it asks > > build with GPU support > > default gcc > > default Cuda SDK version > > specify /usr/local/cuda-8.0 > > default cudnn version > > specify $CUDNN_DIR from use-cudnn.sh, e.g. > > /home/scratch/dsutherl/cudnn-8.0-5.1/cuda > > Pascal Titan Xs have compute capability 6.1 > > > > bazel build -c opt --config=cuda > > //tensorflow/tools/pip_package:build_pip_package > > bazel-bin/tensorflow/tools/pip_package/build_pip_package ./ > > A .whl file, e.g. tensorflow-0.11.0rc0-py2-none-any.whl, is put in the > > directory you specified above. > > > > > > - Dougal > > > > > > On Fri, Oct 21, 2016 at 6:14 PM Kirthevasan Kandasamy > > > wrote: > >> > >> Predrag, > >> > >> Any updates on gpu3? > >> I have tried both tensorflow and chainer and in both cases the problem > >> seems to be with cuda > >> > >> On Wed, Oct 19, 2016 at 4:10 PM, Predrag Punosevac > > >> wrote: > >>> > >>> Dougal Sutherland wrote: > >>> > >>> > I tried for a while. I failed. > >>> > > >>> > >>> Damn this doesn't look good. I guess back to the drawing board. Thanks > >>> for the quick feed back. > >>> > >>> Predrag > >>> > >>> > Version 0.10.0 fails immediately on build: "The specified > >>> > --crosstool_top > >>> > '@local_config_cuda//crosstool:crosstool' is not a valid > >>> > cc_toolchain_suite > >>> > rule." 
Apparently this is because 0.10 required an older version of > >>> > bazel ( > >>> > https://github.com/tensorflow/tensorflow/issues/4368), and I don't > have > >>> > the > >>> > energy to install an old version of bazel. > >>> > > >>> > Version 0.11.0rc0 gets almost done and then complains about no such > >>> > file or > >>> > directory for libcudart.so.7.5 (which is there, where I told > tensorflow > >>> > it > >>> > was...). > >>> > > >>> > Non-release versions from git fail immediately because they call git > -C > >>> > to > >>> > get version info, which is only in git 1.9 (we have 1.8). > >>> > > >>> > > >>> > Some other notes: > >>> > - I made a symlink from ~/.cache/bazel to > >>> > /home/scratch/$USER/.cache/bazel, > >>> > because bazel is the worst. (It complains about doing things on NFS, > >>> > and > >>> > hung for me [clock-related?], and I can't find a global config file > or > >>> > anything to change that in; it seems like there might be one, but > their > >>> > documentation is terrible.) > >>> > > >>> > - I wasn't able to use the actual Titan X compute capability of 6.1, > >>> > because that requires cuda 8; I used 5.2 instead. Probably not a huge > >>> > deal, > >>> > but I don't know. > >>> > > >>> > - I tried explicitly including /usr/local/cuda/lib64 in > LD_LIBRARY_PATH > >>> > and > >>> > set CUDA_HOME to /usr/local/cuda before building, hoping that would > >>> > help > >>> > with the 0.11.0rc0 problem, but it didn't. > >> > >> > > > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From predragp at cs.cmu.edu Fri Oct 21 15:37:27 2016 From: predragp at cs.cmu.edu (Predrag Punosevac) Date: Fri, 21 Oct 2016 15:37:27 -0400 Subject: GPU3 back in business In-Reply-To: References: <20161019172209.esq71ATV8%predragp@cs.cmu.edu> <20161019183651.3Xy0murhr%predragp@cs.cmu.edu> <7B1F10ED8F1045F6.b801b0a4-484f-4c6e-9558-4a83a0cb33ec@mail.outlook.com> <20161019201053.KHd8koxJl%predragp@cs.cmu.edu> Message-ID: <20161021193727.9hJzyE8s5%predragp@cs.cmu.edu> Dougal Sutherland wrote: Sorry that I am late for the party. This is my interpretation of what we should do. 1. I will go back to CUDA 8.0, which will break MATLAB. We have to live with it. Barnabas, please OK this. I will work with MathWorks for this to be fixed for the 2017a release. 2. Then I could install TensorFlow compiled by Dougal system wide. Please, Dougal, after I upgrade back to 8.0, recompile it again using CUDA 8.0. I could give you the root password so that you can compile and install directly. 3. If everyone is OK with the above, I will pull the trigger on GPU3 at 4:30PM and upgrade to 8.0. 4. MATLAB will be broken on GPU2 as well after I put in the Titan cards during the October 25 power outage. Predrag > Heh. :) > > An explanation: > > - Different nvidia gpu architectures are called "compute capabilities". > This is a number that describes the behavior of the card: the maximum size > of various things, which API functions it supports, etc. There's a > reference here > , > but it shouldn't really matter. > - When CUDA compiles code, it targets a certain architecture, since it > needs to know what features to use and whatnot. I *think* that if you > compile for compute capability x, it will work on a card with compute > capability y approximately iff x <= y. > - Pascal Titan Xs, like gpu3 has, have compute capability 6.1. > - CUDA 7.5 doesn't know about compute capability 6.1, so if you ask to > compile for 6.1 it crashes.
> - Theano by default tries to compile for the capability of the card, but > can be configured to compile for a different capability. > - Tensorflow asks for a list of capabilities to compile for when you > build it in the first place. > > > On Fri, Oct 21, 2016 at 8:17 PM Dougal Sutherland wrote: > > > They do work with 7.5 if you specify an older compute architecture; it's > > just that their actual compute capability of 6.1 isn't supported by cuda > > 7.5. Thank is thrown off by this, for example, but it can be fixed by > > telling it to pass compute capability 5.2 (for example) to nvcc. I don't > > think that this was my problem with building tensorflow on 7.5; I'm not > > sure what that was. > > > > On Fri, Oct 21, 2016, 8:11 PM Kirthevasan Kandasamy > > wrote: > > > > Thanks Dougal. I'll take a look atthis and get back to you. > > So are you suggesting that this is an issue with TitanX's not being > > compatible with 7.5? > > > > On Fri, Oct 21, 2016 at 3:08 PM, Dougal Sutherland > > wrote: > > > > I installed it in my scratch directory (not sure if there's a global > > install?). The main thing was to put its cache on scratch; it got really > > upset when the cache directory was on NFS. (Instructions at the bottom of > > my previous email.) > > > > On Fri, Oct 21, 2016, 8:04 PM Barnabas Poczos wrote: > > > > That's great! Thanks Dougal. > > > > As I remember bazel was not installed correctly previously on GPU3. Do > > you know what went wrong with it before and why it is good now? > > > > Thanks, > > Barnabas > > ====================== > > Barnabas Poczos, PhD > > Assistant Professor > > Machine Learning Department > > Carnegie Mellon University > > > > > > On Fri, Oct 21, 2016 at 2:03 PM, Dougal Sutherland > > wrote: > > > I was just able to build tensorflow 0.11.0rc0 on gpu3! I used the cuda > > 8.0 > > > install, and it built fine. 
So additionally installing 7.5 was probably > > not > > > necessary; in fact, cuda 7.5 doesn't know about the 6.1 compute > > architecture > > > that the Titan Xs use, so Theano at least needs to be manually told to > > use > > > an older architecture. > > > > > > A pip package is in ~dsutherl/tensorflow-0.11.0rc0-py2-none-any.whl. I > > think > > > it should work fine with the cudnn in my scratch directory. > > > > > > You should probably install it to scratch, either running this first to > > put > > > libraries your scratch directory or using a virtualenv or something: > > > export PYTHONUSERBASE=/home/scratch/$USER/.local > > > > > > You'll need this to use the library and probably to install it: > > > export > > > > > LD_LIBRARY_PATH=/home/scratch/dsutherl/cudnn-8.0-5.1/cuda/lib64:"$LD_LIBRARY_PATH" > > > > > > To install: > > > pip install --user ~dsutherl/tensorflow-0.11.0rc0-py2-none-any.whl > > > (remove --user if you're using a virtualenv) > > > > > > (A request: I'm submitting to ICLR in two weeks, and for some of the > > models > > > I'm running gpu3's cards are 4x the speed of gpu1 or 2's. So please don't > > > run a ton of stuff on gpu3 unless you're working on a deadline too. > > > > > > > > > > > > Steps to install it, for the future: > > > > > > Install bazel in your home directory: > > > > > > wget > > > > > https://github.com/bazelbuild/bazel/releases/download/0.3.2/bazel-0.3.2-installer-linux-x86_64.sh > > > bash bazel-0.3.2-installer-linux-x86_64.sh --prefix=/home/scratch/$USER > > > --base=/home/scratch/$USER/.bazel > > > > > > Configure bazel to build in scratch. There's probably a better way to do > > > this, but this works: > > > > > > mkdir /home/scratch/$USER/.cache > > > ln -s /home/scratch/$USER/.cache/bazel ~/.cache/bazel > > > > > > Build tensorflow. 
Note that builds from git checkouts don't work, because > > > they assume a newer version of git than is on gpu3: > > > > > > cd /home/scratch/$USER > > > wget > > > tar xf > > > cd tensorflow-0.11.0rc0 > > > ./configure > > > > > > This is an interactive script that doesn't seem to let you pass > > arguments or > > > anything. It's obnoxious. > > > Use the default python > > > don't use cloud platform or hadoop file system > > > use the default site-packages path if it asks > > > build with GPU support > > > default gcc > > > default Cuda SDK version > > > specify /usr/local/cuda-8.0 > > > default cudnn version > > > specify $CUDNN_DIR from use-cudnn.sh, e.g. > > > /home/scratch/dsutherl/cudnn-8.0-5.1/cuda > > > Pascal Titan Xs have compute capability 6.1 > > > > > > bazel build -c opt --config=cuda > > > //tensorflow/tools/pip_package:build_pip_package > > > bazel-bin/tensorflow/tools/pip_package/build_pip_package ./ > > > A .whl file, e.g. tensorflow-0.11.0rc0-py2-none-any.whl, is put in the > > > directory you specified above. > > > > > > > > > - Dougal > > > > > > > > > On Fri, Oct 21, 2016 at 6:14 PM Kirthevasan Kandasamy > > > > > wrote: > > >> > > >> Predrag, > > >> > > >> Any updates on gpu3? > > >> I have tried both tensorflow and chainer and in both cases the problem > > >> seems to be with cuda > > >> > > >> On Wed, Oct 19, 2016 at 4:10 PM, Predrag Punosevac > > > > >> wrote: > > >>> > > >>> Dougal Sutherland wrote: > > >>> > > >>> > I tried for a while. I failed. > > >>> > > > >>> > > >>> Damn this doesn't look good. I guess back to the drawing board. Thanks > > >>> for the quick feed back. > > >>> > > >>> Predrag > > >>> > > >>> > Version 0.10.0 fails immediately on build: "The specified > > >>> > --crosstool_top > > >>> > '@local_config_cuda//crosstool:crosstool' is not a valid > > >>> > cc_toolchain_suite > > >>> > rule." 
Apparently this is because 0.10 required an older version of > > >>> > bazel ( > > >>> > https://github.com/tensorflow/tensorflow/issues/4368), and I don't > > have > > >>> > the > > >>> > energy to install an old version of bazel. > > >>> > > > >>> > Version 0.11.0rc0 gets almost done and then complains about no such > > >>> > file or > > >>> > directory for libcudart.so.7.5 (which is there, where I told > > tensorflow > > >>> > it > > >>> > was...). > > >>> > > > >>> > Non-release versions from git fail immediately because they call git > > -C > > >>> > to > > >>> > get version info, which is only in git 1.9 (we have 1.8). > > >>> > > > >>> > > > >>> > Some other notes: > > >>> > - I made a symlink from ~/.cache/bazel to > > >>> > /home/scratch/$USER/.cache/bazel, > > >>> > because bazel is the worst. (It complains about doing things on NFS, > > >>> > and > > >>> > hung for me [clock-related?], and I can't find a global config file > > or > > >>> > anything to change that in; it seems like there might be one, but > > their > > >>> > documentation is terrible.) > > >>> > > > >>> > - I wasn't able to use the actual Titan X compute capability of 6.1, > > >>> > because that requires cuda 8; I used 5.2 instead. Probably not a huge > > >>> > deal, > > >>> > but I don't know. > > >>> > > > >>> > - I tried explicitly including /usr/local/cuda/lib64 in > > LD_LIBRARY_PATH > > >>> > and > > >>> > set CUDA_HOME to /usr/local/cuda before building, hoping that would > > >>> > help > > >>> > with the 0.11.0rc0 problem, but it didn't. 
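[Editorial note] Before rebuilding against CUDA 8.0 as planned above, a quick sanity check of which toolkit actually lives at the expected path can save a failed build. This is a hypothetical helper, not a command from the thread; the path is the one the messages assume.

```shell
# Hypothetical check: report the CUDA toolkit at the path used in the
# thread; prints a note instead of failing if the path does not exist.
NVCC=/usr/local/cuda-8.0/bin/nvcc
if [ -x "$NVCC" ]; then
    "$NVCC" --version
else
    echo "nvcc not found at $NVCC"
fi
```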
> > >> > > >> > > > > > > > > > From bapoczos at cs.cmu.edu Fri Oct 21 15:44:02 2016 From: bapoczos at cs.cmu.edu (Barnabas Poczos) Date: Fri, 21 Oct 2016 15:44:02 -0400 Subject: GPU3 back in business In-Reply-To: <20161021193727.9hJzyE8s5%predragp@cs.cmu.edu> References: <20161019172209.esq71ATV8%predragp@cs.cmu.edu> <20161019183651.3Xy0murhr%predragp@cs.cmu.edu> <7B1F10ED8F1045F6.b801b0a4-484f-4c6e-9558-4a83a0cb33ec@mail.outlook.com> <20161019201053.KHd8koxJl%predragp@cs.cmu.edu> <20161021193727.9hJzyE8s5%predragp@cs.cmu.edu> Message-ID: Hi Predrag, If there is no other solution, then I think it is OK not to have Matlab on GPU2 and GPU3. Tensorflow has higher priority on these nodes. Best, Barnabas ====================== Barnabas Poczos, PhD Assistant Professor Machine Learning Department Carnegie Mellon University On Fri, Oct 21, 2016 at 3:37 PM, Predrag Punosevac wrote: > Dougal Sutherland wrote: > > > Sorry that I am late for the party. This is my interpretation of what we > should do. > > 1. I will go back to CUDA 8.0 which will brake MATLAB. We have to live > with it. Barnabas please OK this. I will work with MathWorks for this to > be fixed for 2017a release. > > 2. Then I could install TensorFlow compiled by Dougal system wide. > Please Dugal after I upgrade back to 8.0 recompile it again using CUDA > 8.0. I could give you the root password so that you can compile and > install directly. > > 3. If everyone is OK with above I will pull the trigger on GPU3 at > 4:30PM and upgrade to 8.0 > > 4. MATLAB will be broken on GPU2 as well after I put Titan cards during > the October 25 power outrage. > > Predrag > > > > > > >> Heh. :) >> >> An explanation: >> >> - Different nvidia gpu architectures are called "compute capabilities". >> This is a number that describes the behavior of the card: the maximum size >> of various things, which API functions it supports, etc. There's a >> reference here >> , >> but it shouldn't really matter. 
>> - When CUDA compiles code, it targets a certain architecture, since it >> needs to know what features to use and whatnot. I *think* that if you >> compile for compute capability x, it will work on a card with compute >> capability y approximately iff x <= y. >> - Pascal Titan Xs, like gpu3 has, have compute capability 6.1. >> - CUDA 7.5 doesn't know about compute capability 6.1, so if you ask to >> compile for 6.1 it crashes. >> - Theano by default tries to compile for the capability of the card, but >> can be configured to compile for a different capability. >> - Tensorflow asks for a list of capabilities to compile for when you >> build it in the first place. >> >> >> On Fri, Oct 21, 2016 at 8:17 PM Dougal Sutherland wrote: >> >> > They do work with 7.5 if you specify an older compute architecture; it's >> > just that their actual compute capability of 6.1 isn't supported by cuda >> > 7.5. Theano is thrown off by this, for example, but it can be fixed by >> > telling it to pass compute capability 5.2 (for example) to nvcc. I don't >> > think that this was my problem with building tensorflow on 7.5; I'm not >> > sure what that was. >> > >> > On Fri, Oct 21, 2016, 8:11 PM Kirthevasan Kandasamy >> > wrote: >> > >> > Thanks Dougal. I'll take a look at this and get back to you. >> > So are you suggesting that this is an issue with the Titan Xs not being >> > compatible with 7.5? >> > >> > On Fri, Oct 21, 2016 at 3:08 PM, Dougal Sutherland >> > wrote: >> > >> > I installed it in my scratch directory (not sure if there's a global >> > install?). The main thing was to put its cache on scratch; it got really >> > upset when the cache directory was on NFS. (Instructions at the bottom of >> > my previous email.) >> > >> > On Fri, Oct 21, 2016, 8:04 PM Barnabas Poczos wrote: >> > >> > That's great! Thanks Dougal. >> > >> > As I remember, bazel was not installed correctly previously on GPU3. Do >> > you know what went wrong with it before and why it is good now?
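The compatibility rule in the explanation above, that code built for compute capability x runs on a card of capability y roughly when x <= y, can be sketched as a one-line version comparison. `runs_on` is a hypothetical helper for illustration, not part of CUDA or nvcc, and the rule is approximate, as the original says.

```shell
# Sketch of the rule above: a binary compiled for compute capability X is
# expected to run on a card of capability Y roughly iff X <= Y.
# runs_on is a hypothetical helper, not a CUDA tool.
runs_on() {
    compiled=$1
    card=$2
    # sort -V orders dotted version strings numerically (5.2 < 6.1).
    [ "$(printf '%s\n%s\n' "$compiled" "$card" | sort -V | head -n 1)" = "$compiled" ]
}

runs_on 5.2 6.1 && echo "a 5.2 build should run on a Pascal Titan X (6.1)"
runs_on 6.1 5.2 || echo "a 6.1 build will not run on a 5.2 card"
```

This is why Theano under cuda 7.5 can be told to target 5.2 and still drive the 6.1 Titan Xs, while asking 7.5 to target 6.1 directly simply fails.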
>> > >> > Thanks, >> > Barnabas >> > ====================== >> > Barnabas Poczos, PhD >> > Assistant Professor >> > Machine Learning Department >> > Carnegie Mellon University >> > >> > >> > On Fri, Oct 21, 2016 at 2:03 PM, Dougal Sutherland >> > wrote: >> > > I was just able to build tensorflow 0.11.0rc0 on gpu3! I used the cuda >> > 8.0 >> > > install, and it built fine. So additionally installing 7.5 was probably >> > not >> > > necessary; in fact, cuda 7.5 doesn't know about the 6.1 compute >> > architecture >> > > that the Titan Xs use, so Theano at least needs to be manually told to >> > use >> > > an older architecture. >> > > >> > > A pip package is in ~dsutherl/tensorflow-0.11.0rc0-py2-none-any.whl. I >> > think >> > > it should work fine with the cudnn in my scratch directory. >> > > >> > > You should probably install it to scratch, either running this first to >> > put >> > > libraries in your scratch directory or using a virtualenv or something: >> > > export PYTHONUSERBASE=/home/scratch/$USER/.local >> > > >> > > You'll need this to use the library and probably to install it: >> > > export >> > > >> > LD_LIBRARY_PATH=/home/scratch/dsutherl/cudnn-8.0-5.1/cuda/lib64:"$LD_LIBRARY_PATH" >> > > >> > > To install: >> > > pip install --user ~dsutherl/tensorflow-0.11.0rc0-py2-none-any.whl >> > > (remove --user if you're using a virtualenv) >> > > >> > > (A request: I'm submitting to ICLR in two weeks, and for some of the >> > models >> > > I'm running, gpu3's cards are 4x the speed of gpu1's or 2's. So please don't >> > > run a ton of stuff on gpu3 unless you're working on a deadline too.)
>> > > >> > > >> > > >> > > Steps to install it, for the future: >> > > >> > > Install bazel in your home directory: >> > > >> > > wget >> > > >> > https://github.com/bazelbuild/bazel/releases/download/0.3.2/bazel-0.3.2-installer-linux-x86_64.sh >> > > bash bazel-0.3.2-installer-linux-x86_64.sh --prefix=/home/scratch/$USER >> > > --base=/home/scratch/$USER/.bazel >> > > >> > > Configure bazel to build in scratch. There's probably a better way to do >> > > this, but this works: >> > > >> > > mkdir /home/scratch/$USER/.cache >> > > ln -s /home/scratch/$USER/.cache/bazel ~/.cache/bazel >> > > >> > > Build tensorflow. Note that builds from git checkouts don't work, because >> > > they assume a newer version of git than is on gpu3: >> > > >> > > cd /home/scratch/$USER >> > > wget >> > > tar xf >> > > cd tensorflow-0.11.0rc0 >> > > ./configure >> > > >> > > This is an interactive script that doesn't seem to let you pass >> > arguments or >> > > anything. It's obnoxious. >> > > Use the default python >> > > don't use cloud platform or hadoop file system >> > > use the default site-packages path if it asks >> > > build with GPU support >> > > default gcc >> > > default Cuda SDK version >> > > specify /usr/local/cuda-8.0 >> > > default cudnn version >> > > specify $CUDNN_DIR from use-cudnn.sh, e.g. >> > > /home/scratch/dsutherl/cudnn-8.0-5.1/cuda >> > > Pascal Titan Xs have compute capability 6.1 >> > > >> > > bazel build -c opt --config=cuda >> > > //tensorflow/tools/pip_package:build_pip_package >> > > bazel-bin/tensorflow/tools/pip_package/build_pip_package ./ >> > > A .whl file, e.g. tensorflow-0.11.0rc0-py2-none-any.whl, is put in the >> > > directory you specified above. >> > > >> > > >> > > - Dougal >> > > >> > > >> > > On Fri, Oct 21, 2016 at 6:14 PM Kirthevasan Kandasamy > > > >> > > wrote: >> > >> >> > >> Predrag, >> > >> >> > >> Any updates on gpu3? 
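Collected in one place, the install recipe quoted above comes down to two environment settings plus pip. The concrete paths (Dougal's scratch cudnn directory, the wheel under ~dsutherl) are the thread's own examples and specific to the Auton GPU nodes; this is a sketch, not a general recipe.

```shell
# Sketch of the wheel-install steps quoted above; the dsutherl paths are
# the thread's examples and are specific to those machines.

# 1. Keep pip's --user installs on local scratch rather than the NFS home.
export PYTHONUSERBASE="/home/scratch/$USER/.local"

# 2. Make the unpacked cuDNN visible at install time and at import time.
export LD_LIBRARY_PATH="/home/scratch/dsutherl/cudnn-8.0-5.1/cuda/lib64:$LD_LIBRARY_PATH"

# 3. Install the prebuilt wheel (drop --user inside a virtualenv).
wheel=~dsutherl/tensorflow-0.11.0rc0-py2-none-any.whl
if [ -f "$wheel" ]; then
    pip install --user "$wheel"
fi
```

The LD_LIBRARY_PATH export has to be repeated (or put in a shell profile) in every session that imports tensorflow, since the wheel links against that cuDNN at run time.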
>> > >> I have tried both tensorflow and chainer and in both cases the problem >> > >> seems to be with cuda >> > >> >> > >> On Wed, Oct 19, 2016 at 4:10 PM, Predrag Punosevac > > > >> > >> wrote: >> > >>> >> > >>> Dougal Sutherland wrote: >> > >>> >> > >>> > I tried for a while. I failed. >> > >>> > >> > >>> >> > >>> Damn this doesn't look good. I guess back to the drawing board. Thanks >> > >>> for the quick feed back. >> > >>> >> > >>> Predrag >> > >>> >> > >>> > Version 0.10.0 fails immediately on build: "The specified >> > >>> > --crosstool_top >> > >>> > '@local_config_cuda//crosstool:crosstool' is not a valid >> > >>> > cc_toolchain_suite >> > >>> > rule." Apparently this is because 0.10 required an older version of >> > >>> > bazel ( >> > >>> > https://github.com/tensorflow/tensorflow/issues/4368), and I don't >> > have >> > >>> > the >> > >>> > energy to install an old version of bazel. >> > >>> > >> > >>> > Version 0.11.0rc0 gets almost done and then complains about no such >> > >>> > file or >> > >>> > directory for libcudart.so.7.5 (which is there, where I told >> > tensorflow >> > >>> > it >> > >>> > was...). >> > >>> > >> > >>> > Non-release versions from git fail immediately because they call git >> > -C >> > >>> > to >> > >>> > get version info, which is only in git 1.9 (we have 1.8). >> > >>> > >> > >>> > >> > >>> > Some other notes: >> > >>> > - I made a symlink from ~/.cache/bazel to >> > >>> > /home/scratch/$USER/.cache/bazel, >> > >>> > because bazel is the worst. (It complains about doing things on NFS, >> > >>> > and >> > >>> > hung for me [clock-related?], and I can't find a global config file >> > or >> > >>> > anything to change that in; it seems like there might be one, but >> > their >> > >>> > documentation is terrible.) >> > >>> > >> > >>> > - I wasn't able to use the actual Titan X compute capability of 6.1, >> > >>> > because that requires cuda 8; I used 5.2 instead. Probably not a huge >> > >>> > deal, >> > >>> > but I don't know. 
>> > >>> > - I tried explicitly including /usr/local/cuda/lib64 in >> > LD_LIBRARY_PATH >> > >>> > and >> > >>> > set CUDA_HOME to /usr/local/cuda before building, hoping that would >> > >>> > help >> > >>> > with the 0.11.0rc0 problem, but it didn't. >> > >> >> > >> >> > > >> > >> > >> > From predragp at cs.cmu.edu Fri Oct 21 15:50:32 2016 From: predragp at cs.cmu.edu (Predrag Punosevac) Date: Fri, 21 Oct 2016 15:50:32 -0400 Subject: GPU3 back in business In-Reply-To: References: <20161019172209.esq71ATV8%predragp@cs.cmu.edu> <20161019183651.3Xy0murhr%predragp@cs.cmu.edu> <7B1F10ED8F1045F6.b801b0a4-484f-4c6e-9558-4a83a0cb33ec@mail.outlook.com> <20161019201053.KHd8koxJl%predragp@cs.cmu.edu> <20161021193727.9hJzyE8s5%predragp@cs.cmu.edu> Message-ID: <20161021195032.EWGQfn13b%predragp@cs.cmu.edu> Barnabas Poczos wrote: > Hi Predrag, > > If there is no other solution, then I think it is OK not to have > Matlab on GPU2 and GPU3. > Tensorflow has higher priority on these nodes. We could possibly have multiple CUDA libraries for different versions, but that is going to bite us in the rear quickly. People who want to use MATLAB with GPUs will have to live with GPU1, probably until the spring release of MATLAB. Predrag > > Best, > Barnabas > > > > > ====================== > Barnabas Poczos, PhD > Assistant Professor > Machine Learning Department > Carnegie Mellon University > > > On Fri, Oct 21, 2016 at 3:37 PM, Predrag Punosevac wrote: > > Dougal Sutherland wrote: > > > > > > Sorry that I am late for the party. This is my interpretation of what we > > should do. > > > > 1. I will go back to CUDA 8.0, which will break MATLAB. We have to live > > with it. Barnabas, please OK this. I will work with MathWorks for this to > > be fixed for the 2017a release. > > > > 2. Then I could install TensorFlow compiled by Dougal system wide. > > Please, Dougal, after I upgrade back to 8.0, recompile it again using CUDA > > 8.0.
> >> > >> > >> > >> > >> > > > >> > > >> > > >> > From bapoczos at cs.cmu.edu Fri Oct 21 15:54:08 2016 From: bapoczos at cs.cmu.edu (Barnabas Poczos) Date: Fri, 21 Oct 2016 15:54:08 -0400 Subject: GPU3 back in business In-Reply-To: <20161021195032.EWGQfn13b%predragp@cs.cmu.edu> References: <20161019172209.esq71ATV8%predragp@cs.cmu.edu> <20161019183651.3Xy0murhr%predragp@cs.cmu.edu> <7B1F10ED8F1045F6.b801b0a4-484f-4c6e-9558-4a83a0cb33ec@mail.outlook.com> <20161019201053.KHd8koxJl%predragp@cs.cmu.edu> <20161021193727.9hJzyE8s5%predragp@cs.cmu.edu> <20161021195032.EWGQfn13b%predragp@cs.cmu.edu> Message-ID: Sounds good. Let us have tensorflow system wide on all GPU nodes. We can worry about Matlab later. Best, B ====================== Barnabas Poczos, PhD Assistant Professor Machine Learning Department Carnegie Mellon University On Fri, Oct 21, 2016 at 3:50 PM, Predrag Punosevac wrote: > Barnabas Poczos wrote: > >> Hi Predrag, >> >> If there is no other solution, then I think it is OK not to have >> Matlab on GPU2 and GPU3. >> Tensorflow has higher priority on these nodes. > > We could possibly have multiple CUDA libraries for different versions > but that is going to bite us for the rear end quickly. People who want > to use MATLAB with GPUs will have to live with GPU1 probably until > Spring release of MATLAB. > > Predrag > >> >> Best, >> Barnabas >> >> >> >> >> ====================== >> Barnabas Poczos, PhD >> Assistant Professor >> Machine Learning Department >> Carnegie Mellon University >> >> >> On Fri, Oct 21, 2016 at 3:37 PM, Predrag Punosevac wrote: >> > Dougal Sutherland wrote: >> > >> > >> > Sorry that I am late for the party. This is my interpretation of what we >> > should do. >> > >> > 1. I will go back to CUDA 8.0 which will brake MATLAB. We have to live >> > with it. Barnabas please OK this. I will work with MathWorks for this to >> > be fixed for 2017a release. >> > >> > 2. Then I could install TensorFlow compiled by Dougal system wide. 
>> >> > >> >> >> > >> >> >> > > >> >> > >> >> > >> >> > From kandasamy at cmu.edu Fri Oct 21 16:21:33 2016 From: kandasamy at cmu.edu (Kirthevasan Kandasamy) Date: Fri, 21 Oct 2016 16:21:33 -0400 Subject: GPU3 back in business In-Reply-To: References: <20161019172209.esq71ATV8%predragp@cs.cmu.edu> <20161019183651.3Xy0murhr%predragp@cs.cmu.edu> <7B1F10ED8F1045F6.b801b0a4-484f-4c6e-9558-4a83a0cb33ec@mail.outlook.com> <20161019201053.KHd8koxJl%predragp@cs.cmu.edu> <20161021193727.9hJzyE8s5%predragp@cs.cmu.edu> <20161021195032.EWGQfn13b%predragp@cs.cmu.edu> Message-ID: Hi all, I was planning on using Matlab with GPUs for one of my projects. Can we please keep gpu2 as it is for now? samy On Fri, Oct 21, 2016 at 3:54 PM, Barnabas Poczos wrote: > Sounds good. Let us have tensorflow system wide on all GPU nodes. We > can worry about Matlab later. > > Best, > B > ====================== > Barnabas Poczos, PhD > Assistant Professor > Machine Learning Department > Carnegie Mellon University > > > On Fri, Oct 21, 2016 at 3:50 PM, Predrag Punosevac > wrote: > > Barnabas Poczos wrote: > > > >> Hi Predrag, > >> > >> If there is no other solution, then I think it is OK not to have > >> Matlab on GPU2 and GPU3. > >> Tensorflow has higher priority on these nodes. > > > > We could possibly have multiple CUDA libraries for different versions > > but that is going to bite us for the rear end quickly. People who want > > to use MATLAB with GPUs will have to live with GPU1 probably until > > Spring release of MATLAB. > > > > Predrag > > > >> > >> Best, > >> Barnabas > >> > >> > >> > >> > >> ====================== > >> Barnabas Poczos, PhD > >> Assistant Professor > >> Machine Learning Department > >> Carnegie Mellon University > >> > >> > >> On Fri, Oct 21, 2016 at 3:37 PM, Predrag Punosevac > wrote: > >> > Dougal Sutherland wrote: > >> > > >> > > >> > Sorry that I am late for the party. This is my interpretation of what > we > >> > should do. > >> > > >> > 1. 
I will go back to CUDA 8.0 which will brake MATLAB. We have to live > >> > with it. Barnabas please OK this. I will work with MathWorks for this > to > >> > be fixed for 2017a release. > >> > > >> > 2. Then I could install TensorFlow compiled by Dougal system wide. > >> > Please Dugal after I upgrade back to 8.0 recompile it again using CUDA > >> > 8.0. I could give you the root password so that you can compile and > >> > install directly. > >> > > >> > 3. If everyone is OK with above I will pull the trigger on GPU3 at > >> > 4:30PM and upgrade to 8.0 > >> > > >> > 4. MATLAB will be broken on GPU2 as well after I put Titan cards > during > >> > the October 25 power outrage. > >> > > >> > Predrag > >> > > >> > > >> > > >> > > >> > > >> > > >> >> Heh. :) > >> >> > >> >> An explanation: > >> >> > >> >> - Different nvidia gpu architectures are called "compute > capabilities". > >> >> This is a number that describes the behavior of the card: the > maximum size > >> >> of various things, which API functions it supports, etc. There's a > >> >> reference here > >> >> and_specifications>, > >> >> but it shouldn't really matter. > >> >> - When CUDA compiles code, it targets a certain architecture, > since it > >> >> needs to know what features to use and whatnot. I *think* that if > you > >> >> compile for compute capability x, it will work on a card with > compute > >> >> capability y approximately iff x <= y. > >> >> - Pascal Titan Xs, like gpu3 has, have compute capability 6.1. > >> >> - CUDA 7.5 doesn't know about compute capability 6.1, so if you > ask to > >> >> compile for 6.1 it crashes. > >> >> - Theano by default tries to compile for the capability of the > card, but > >> >> can be configured to compile for a different capability. > >> >> - Tensorflow asks for a list of capabilities to compile for when > you > >> >> build it in the first place. 
> >> >> > >> >> On Fri, Oct 21, 2016 at 8:17 PM Dougal Sutherland > wrote: > >> >> > >> >> > They do work with 7.5 if you specify an older compute > architecture; it's > >> >> > just that their actual compute capability of 6.1 isn't supported > by cuda > >> >> > 7.5. Theano is thrown off by this, for example, but it can be fixed > by > >> >> > telling it to pass compute capability 5.2 (for example) to nvcc. I > don't > >> >> > think that this was my problem with building tensorflow on 7.5; > I'm not > >> >> > sure what that was. > >> >> > > >> >> > On Fri, Oct 21, 2016, 8:11 PM Kirthevasan Kandasamy < kandasamy at cmu.edu> > >> >> > wrote: > >> >> > > >> >> > Thanks Dougal. I'll take a look at this and get back to you. > >> >> > So are you suggesting that this is an issue with Titan Xs not being > >> >> > compatible with 7.5? > >> >> > > >> >> > On Fri, Oct 21, 2016 at 3:08 PM, Dougal Sutherland < dougal at gmail.com> > >> >> > wrote: > >> >> > > >> >> > I installed it in my scratch directory (not sure if there's a > global > >> >> > install?). The main thing was to put its cache on scratch; it got > really > >> >> > upset when the cache directory was on NFS. (Instructions at the > bottom of > >> >> > my previous email.) > >> >> > > >> >> > On Fri, Oct 21, 2016, 8:04 PM Barnabas Poczos > wrote: > >> >> > > >> >> > That's great! Thanks Dougal. > >> >> > > >> >> > As I remember bazel was not installed correctly previously on > GPU3. Do > >> >> > you know what went wrong with it before and why it is good now? > >> >> > > >> >> > Thanks, > >> >> > Barnabas > >> >> > ====================== > >> >> > Barnabas Poczos, PhD > >> >> > Assistant Professor > >> >> > Machine Learning Department > >> >> > Carnegie Mellon University > >> >> > > >> >> > > >> >> > On Fri, Oct 21, 2016 at 2:03 PM, Dougal Sutherland < dougal at gmail.com> > >> >> > wrote: > >> >> > > I was just able to build tensorflow 0.11.0rc0 on gpu3! 
I used > the cuda > >> >> > 8.0 > >> >> > > install, and it built fine. So additionally installing 7.5 was > probably > >> >> > not > >> >> > > necessary; in fact, cuda 7.5 doesn't know about the 6.1 compute > >> >> > architecture > >> >> > > that the Titan Xs use, so Theano at least needs to be manually > told to > >> >> > use > >> >> > > an older architecture. > >> >> > > > >> >> > > A pip package is in ~dsutherl/tensorflow-0.11.0rc0-py2-none-any.whl. > I > >> >> > think > >> >> > > it should work fine with the cudnn in my scratch directory. > >> >> > > > >> >> > > You should probably install it to scratch, either running this > first to > >> >> > put > >> >> > > libraries in your scratch directory or using a virtualenv or > something: > >> >> > > export PYTHONUSERBASE=/home/scratch/$USER/.local > >> >> > > > >> >> > > You'll need this to use the library and probably to install it: > >> >> > > export > >> >> > > > >> >> > LD_LIBRARY_PATH=/home/scratch/dsutherl/cudnn-8.0-5.1/cuda/lib64:"$LD_LIBRARY_PATH" > >> >> > > > >> >> > > To install: > >> >> > > pip install --user ~dsutherl/tensorflow-0.11.0rc0-py2-none-any.whl > >> >> > > (remove --user if you're using a virtualenv) > >> >> > > > >> >> > > (A request: I'm submitting to ICLR in two weeks, and for some of > the > >> >> > models > >> >> > > I'm running gpu3's cards are 4x the speed of gpu1 or 2's. So > please don't > >> >> > > run a ton of stuff on gpu3 unless you're working on a deadline > too.) > >> >> > > > >> >> > > > >> >> > > > >> >> > > Steps to install it, for the future: > >> >> > > > >> >> > > Install bazel in your home directory: > >> >> > > > >> >> > > wget > >> >> > > > >> >> > https://github.com/bazelbuild/bazel/releases/download/0.3.2/bazel-0.3.2-installer-linux-x86_64.sh > >> >> > > bash bazel-0.3.2-installer-linux-x86_64.sh > --prefix=/home/scratch/$USER > >> >> > > --base=/home/scratch/$USER/.bazel > >> >> > > > >> >> > > Configure bazel to build in scratch. 
There's probably a better > way to do > >> >> > > this, but this works: > >> >> > > > >> >> > > mkdir /home/scratch/$USER/.cache > >> >> > > ln -s /home/scratch/$USER/.cache/bazel ~/.cache/bazel > >> >> > > > >> >> > > Build tensorflow. Note that builds from git checkouts don't > work, because > >> >> > > they assume a newer version of git than is on gpu3: > >> >> > > > >> >> > > cd /home/scratch/$USER > >> >> > > wget > >> >> > > tar xf > >> >> > > cd tensorflow-0.11.0rc0 > >> >> > > ./configure > >> >> > > > >> >> > > This is an interactive script that doesn't seem to let you pass > >> >> > arguments or > >> >> > > anything. It's obnoxious. > >> >> > > Use the default python > >> >> > > don't use cloud platform or hadoop file system > >> >> > > use the default site-packages path if it asks > >> >> > > build with GPU support > >> >> > > default gcc > >> >> > > default Cuda SDK version > >> >> > > specify /usr/local/cuda-8.0 > >> >> > > default cudnn version > >> >> > > specify $CUDNN_DIR from use-cudnn.sh, e.g. > >> >> > > /home/scratch/dsutherl/cudnn-8.0-5.1/cuda > >> >> > > Pascal Titan Xs have compute capability 6.1 > >> >> > > > >> >> > > bazel build -c opt --config=cuda > >> >> > > //tensorflow/tools/pip_package:build_pip_package > >> >> > > bazel-bin/tensorflow/tools/pip_package/build_pip_package ./ > >> >> > > A .whl file, e.g. tensorflow-0.11.0rc0-py2-none-any.whl, is put > in the > >> >> > > directory you specified above. > >> >> > > > >> >> > > > >> >> > > - Dougal > >> >> > > > >> >> > > > >> >> > > On Fri, Oct 21, 2016 at 6:14 PM Kirthevasan Kandasamy < > kandasamy at cmu.edu > >> >> > > > >> >> > > wrote: > >> >> > >> > >> >> > >> Predrag, > >> >> > >> > >> >> > >> Any updates on gpu3? 
> >> >> > >> I have tried both tensorflow and chainer and in both cases the > problem > >> >> > >> seems to be with cuda > >> >> > >> > >> >> > >> On Wed, Oct 19, 2016 at 4:10 PM, Predrag Punosevac < predragp at cs.cmu.edu > >> >> > > > >> >> > >> wrote: > >> >> > >>> > >> >> > >>> Dougal Sutherland wrote: > >> >> > >>> > >> >> > >>> > I tried for a while. I failed. > >> >> > >>> > > >> >> > >>> > >> >> > >>> Damn this doesn't look good. I guess back to the drawing > board. Thanks > >> >> > >>> for the quick feedback. > >> >> > >>> > >> >> > >>> Predrag > >> >> > >>> > >> >> > >>> > Version 0.10.0 fails immediately on build: "The specified > >> >> > >>> > --crosstool_top > >> >> > >>> > '@local_config_cuda//crosstool:crosstool' is not a valid > >> >> > >>> > cc_toolchain_suite > >> >> > >>> > rule." Apparently this is because 0.10 required an older > version of > >> >> > >>> > bazel ( > >> >> > >>> > https://github.com/tensorflow/tensorflow/issues/4368), and > I don't > >> >> > have > >> >> > >>> > the > >> >> > >>> > energy to install an old version of bazel. > >> >> > >>> > > >> >> > >>> > Version 0.11.0rc0 gets almost done and then complains about > no such > >> >> > >>> > file or > >> >> > >>> > directory for libcudart.so.7.5 (which is there, where I told > >> >> > tensorflow > >> >> > >>> > it > >> >> > >>> > was...). > >> >> > >>> > > >> >> > >>> > Non-release versions from git fail immediately because they > call git > >> >> > -C > >> >> > >>> > to > >> >> > >>> > get version info, which is only in git 1.9 (we have 1.8). > >> >> > >>> > > >> >> > >>> > > >> >> > >>> > Some other notes: > >> >> > >>> > - I made a symlink from ~/.cache/bazel to > >> >> > >>> > /home/scratch/$USER/.cache/bazel, > >> >> > >>> > because bazel is the worst. 
(It complains about doing things > on NFS, > >> >> > >>> > and > >> >> > >>> > hung for me [clock-related?], and I can't find a global > config file > >> >> > or > >> >> > >>> > anything to change that in; it seems like there might be > one, but > >> >> > their > >> >> > >>> > documentation is terrible.) > >> >> > >>> > > >> >> > >>> > - I wasn't able to use the actual Titan X compute capability > of 6.1, > >> >> > >>> > because that requires cuda 8; I used 5.2 instead. Probably > not a huge > >> >> > >>> > deal, > >> >> > >>> > but I don't know. > >> >> > >>> > > >> >> > >>> > - I tried explicitly including /usr/local/cuda/lib64 in > >> >> > LD_LIBRARY_PATH > >> >> > >>> > and > >> >> > >>> > set CUDA_HOME to /usr/local/cuda before building, hoping > that would > >> >> > >>> > help > >> >> > >>> > with the 0.11.0rc0 problem, but it didn't. > >> >> > >> > >> >> > >> > >> >> > > > >> >> > > >> >> > > >> >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From bapoczos at cs.cmu.edu Fri Oct 21 16:44:14 2016 From: bapoczos at cs.cmu.edu (Barnabas Poczos) Date: Fri, 21 Oct 2016 16:44:14 -0400 Subject: GPU3 back in business In-Reply-To: References: <20161019172209.esq71ATV8%predragp@cs.cmu.edu> <20161019183651.3Xy0murhr%predragp@cs.cmu.edu> <7B1F10ED8F1045F6.b801b0a4-484f-4c6e-9558-4a83a0cb33ec@mail.outlook.com> <20161019201053.KHd8koxJl%predragp@cs.cmu.edu> <20161021193727.9hJzyE8s5%predragp@cs.cmu.edu> <20161021195032.EWGQfn13b%predragp@cs.cmu.edu> Message-ID: Hi Samy, Gpu1 will still have Matlab and 4 K80 GPus (which is technically 8 GPUs). Won't that be enough for now? Best, B ====================== Barnabas Poczos, PhD Assistant Professor Machine Learning Department Carnegie Mellon University On Fri, Oct 21, 2016 at 4:21 PM, Kirthevasan Kandasamy wrote: > Hi all, > > I was planning on using Matlab with GPUs for one of my projects. > Can we please keep gpu2 as it is for now? 
> > samy > > On Fri, Oct 21, 2016 at 3:54 PM, Barnabas Poczos > wrote: >> >> Sounds good. Let us have tensorflow system wide on all GPU nodes. We >> can worry about Matlab later. >> >> Best, >> B >> ====================== >> Barnabas Poczos, PhD >> Assistant Professor >> Machine Learning Department >> Carnegie Mellon University >> >> >> On Fri, Oct 21, 2016 at 3:50 PM, Predrag Punosevac >> wrote: >> > Barnabas Poczos wrote: >> > >> >> Hi Predrag, >> >> >> >> If there is no other solution, then I think it is OK not to have >> >> Matlab on GPU2 and GPU3. >> >> Tensorflow has higher priority on these nodes. >> > >> > We could possibly have multiple CUDA libraries for different versions >> > but that is going to bite us for the rear end quickly. People who want >> > to use MATLAB with GPUs will have to live with GPU1 probably until >> > Spring release of MATLAB. >> > >> > Predrag >> > >> >> >> >> Best, >> >> Barnabas >> >> >> >> >> >> >> >> >> >> ====================== >> >> Barnabas Poczos, PhD >> >> Assistant Professor >> >> Machine Learning Department >> >> Carnegie Mellon University >> >> >> >> >> >> On Fri, Oct 21, 2016 at 3:37 PM, Predrag Punosevac >> >> wrote: >> >> > Dougal Sutherland wrote: >> >> > >> >> > >> >> > Sorry that I am late for the party. This is my interpretation of what >> >> > we >> >> > should do. >> >> > >> >> > 1. I will go back to CUDA 8.0 which will brake MATLAB. We have to >> >> > live >> >> > with it. Barnabas please OK this. I will work with MathWorks for this >> >> > to >> >> > be fixed for 2017a release. >> >> > >> >> > 2. Then I could install TensorFlow compiled by Dougal system wide. >> >> > Please Dugal after I upgrade back to 8.0 recompile it again using >> >> > CUDA >> >> > 8.0. I could give you the root password so that you can compile and >> >> > install directly. >> >> > >> >> > 3. If everyone is OK with above I will pull the trigger on GPU3 at >> >> > 4:30PM and upgrade to 8.0 >> >> > >> >> > 4. 
MATLAB will be broken on GPU2 as well after I put Titan cards >> >> > during >> >> > the October 25 power outrage. >> >> > >> >> > Predrag >> >> > >> >> > >> >> > >> >> > >> >> > >> >> > >> >> >> Heh. :) >> >> >> >> >> >> An explanation: >> >> >> >> >> >> - Different nvidia gpu architectures are called "compute >> >> >> capabilities". >> >> >> This is a number that describes the behavior of the card: the >> >> >> maximum size >> >> >> of various things, which API functions it supports, etc. There's >> >> >> a >> >> >> reference here >> >> >> >> >> >> , >> >> >> but it shouldn't really matter. >> >> >> - When CUDA compiles code, it targets a certain architecture, >> >> >> since it >> >> >> needs to know what features to use and whatnot. I *think* that if >> >> >> you >> >> >> compile for compute capability x, it will work on a card with >> >> >> compute >> >> >> capability y approximately iff x <= y. >> >> >> - Pascal Titan Xs, like gpu3 has, have compute capability 6.1. >> >> >> - CUDA 7.5 doesn't know about compute capability 6.1, so if you >> >> >> ask to >> >> >> compile for 6.1 it crashes. >> >> >> - Theano by default tries to compile for the capability of the >> >> >> card, but >> >> >> can be configured to compile for a different capability. >> >> >> - Tensorflow asks for a list of capabilities to compile for when >> >> >> you >> >> >> build it in the first place. >> >> >> >> >> >> >> >> >> On Fri, Oct 21, 2016 at 8:17 PM Dougal Sutherland >> >> >> wrote: >> >> >> >> >> >> > They do work with 7.5 if you specify an older compute >> >> >> > architecture; it's >> >> >> > just that their actual compute capability of 6.1 isn't supported >> >> >> > by cuda >> >> >> > 7.5. Thank is thrown off by this, for example, but it can be fixed >> >> >> > by >> >> >> > telling it to pass compute capability 5.2 (for example) to nvcc. I >> >> >> > don't >> >> >> > think that this was my problem with building tensorflow on 7.5; >> >> >> > I'm not >> >> >> > sure what that was. 
>> >> >> > >> >> >> > On Fri, Oct 21, 2016, 8:11 PM Kirthevasan Kandasamy >> >> >> > >> >> >> > wrote: >> >> >> > >> >> >> > Thanks Dougal. I'll take a look atthis and get back to you. >> >> >> > So are you suggesting that this is an issue with TitanX's not >> >> >> > being >> >> >> > compatible with 7.5? >> >> >> > >> >> >> > On Fri, Oct 21, 2016 at 3:08 PM, Dougal Sutherland >> >> >> > >> >> >> > wrote: >> >> >> > >> >> >> > I installed it in my scratch directory (not sure if there's a >> >> >> > global >> >> >> > install?). The main thing was to put its cache on scratch; it got >> >> >> > really >> >> >> > upset when the cache directory was on NFS. (Instructions at the >> >> >> > bottom of >> >> >> > my previous email.) >> >> >> > >> >> >> > On Fri, Oct 21, 2016, 8:04 PM Barnabas Poczos >> >> >> > wrote: >> >> >> > >> >> >> > That's great! Thanks Dougal. >> >> >> > >> >> >> > As I remember bazel was not installed correctly previously on >> >> >> > GPU3. Do >> >> >> > you know what went wrong with it before and why it is good now? >> >> >> > >> >> >> > Thanks, >> >> >> > Barnabas >> >> >> > ====================== >> >> >> > Barnabas Poczos, PhD >> >> >> > Assistant Professor >> >> >> > Machine Learning Department >> >> >> > Carnegie Mellon University >> >> >> > >> >> >> > >> >> >> > On Fri, Oct 21, 2016 at 2:03 PM, Dougal Sutherland >> >> >> > >> >> >> > wrote: >> >> >> > > I was just able to build tensorflow 0.11.0rc0 on gpu3! I used >> >> >> > > the cuda >> >> >> > 8.0 >> >> >> > > install, and it built fine. So additionally installing 7.5 was >> >> >> > > probably >> >> >> > not >> >> >> > > necessary; in fact, cuda 7.5 doesn't know about the 6.1 compute >> >> >> > architecture >> >> >> > > that the Titan Xs use, so Theano at least needs to be manually >> >> >> > > told to >> >> >> > use >> >> >> > > an older architecture. >> >> >> > > >> >> >> > > A pip package is in >> >> >> > > ~dsutherl/tensorflow-0.11.0rc0-py2-none-any.whl. 
I >> >> >> > think >> >> >> > > it should work fine with the cudnn in my scratch directory. >> >> >> > > >> >> >> > > You should probably install it to scratch, either running this >> >> >> > > first to >> >> >> > put >> >> >> > > libraries your scratch directory or using a virtualenv or >> >> >> > > something: >> >> >> > > export PYTHONUSERBASE=/home/scratch/$USER/.local >> >> >> > > >> >> >> > > You'll need this to use the library and probably to install it: >> >> >> > > export >> >> >> > > >> >> >> > >> >> >> > LD_LIBRARY_PATH=/home/scratch/dsutherl/cudnn-8.0-5.1/cuda/lib64:"$LD_LIBRARY_PATH" >> >> >> > > >> >> >> > > To install: >> >> >> > > pip install --user >> >> >> > > ~dsutherl/tensorflow-0.11.0rc0-py2-none-any.whl >> >> >> > > (remove --user if you're using a virtualenv) >> >> >> > > >> >> >> > > (A request: I'm submitting to ICLR in two weeks, and for some of >> >> >> > > the >> >> >> > models >> >> >> > > I'm running gpu3's cards are 4x the speed of gpu1 or 2's. So >> >> >> > > please don't >> >> >> > > run a ton of stuff on gpu3 unless you're working on a deadline >> >> >> > > too. >> >> >> > > >> >> >> > > >> >> >> > > >> >> >> > > Steps to install it, for the future: >> >> >> > > >> >> >> > > Install bazel in your home directory: >> >> >> > > >> >> >> > > wget >> >> >> > > >> >> >> > >> >> >> > https://github.com/bazelbuild/bazel/releases/download/0.3.2/bazel-0.3.2-installer-linux-x86_64.sh >> >> >> > > bash bazel-0.3.2-installer-linux-x86_64.sh >> >> >> > > --prefix=/home/scratch/$USER >> >> >> > > --base=/home/scratch/$USER/.bazel >> >> >> > > >> >> >> > > Configure bazel to build in scratch. There's probably a better >> >> >> > > way to do >> >> >> > > this, but this works: >> >> >> > > >> >> >> > > mkdir /home/scratch/$USER/.cache >> >> >> > > ln -s /home/scratch/$USER/.cache/bazel ~/.cache/bazel >> >> >> > > >> >> >> > > Build tensorflow. 
Note that builds from git checkouts don't >> >> >> > > work, because >> >> >> > > they assume a newer version of git than is on gpu3: >> >> >> > > >> >> >> > > cd /home/scratch/$USER >> >> >> > > wget >> >> >> > > tar xf >> >> >> > > cd tensorflow-0.11.0rc0 >> >> >> > > ./configure >> >> >> > > >> >> >> > > This is an interactive script that doesn't seem to let you pass >> >> >> > arguments or >> >> >> > > anything. It's obnoxious. >> >> >> > > Use the default python >> >> >> > > don't use cloud platform or hadoop file system >> >> >> > > use the default site-packages path if it asks >> >> >> > > build with GPU support >> >> >> > > default gcc >> >> >> > > default Cuda SDK version >> >> >> > > specify /usr/local/cuda-8.0 >> >> >> > > default cudnn version >> >> >> > > specify $CUDNN_DIR from use-cudnn.sh, e.g. >> >> >> > > /home/scratch/dsutherl/cudnn-8.0-5.1/cuda >> >> >> > > Pascal Titan Xs have compute capability 6.1 >> >> >> > > >> >> >> > > bazel build -c opt --config=cuda >> >> >> > > //tensorflow/tools/pip_package:build_pip_package >> >> >> > > bazel-bin/tensorflow/tools/pip_package/build_pip_package ./ >> >> >> > > A .whl file, e.g. tensorflow-0.11.0rc0-py2-none-any.whl, is put >> >> >> > > in the >> >> >> > > directory you specified above. >> >> >> > > >> >> >> > > >> >> >> > > - Dougal >> >> >> > > >> >> >> > > >> >> >> > > On Fri, Oct 21, 2016 at 6:14 PM Kirthevasan Kandasamy >> >> >> > > > >> >> > > >> >> >> > > wrote: >> >> >> > >> >> >> >> > >> Predrag, >> >> >> > >> >> >> >> > >> Any updates on gpu3? >> >> >> > >> I have tried both tensorflow and chainer and in both cases the >> >> >> > >> problem >> >> >> > >> seems to be with cuda >> >> >> > >> >> >> >> > >> On Wed, Oct 19, 2016 at 4:10 PM, Predrag Punosevac >> >> >> > >> > >> >> > > >> >> >> > >> wrote: >> >> >> > >>> >> >> >> > >>> Dougal Sutherland wrote: >> >> >> > >>> >> >> >> > >>> > I tried for a while. I failed. >> >> >> > >>> > >> >> >> > >>> >> >> >> > >>> Damn this doesn't look good. 
I guess back to the drawing >> >> >> > >>> board. Thanks >> >> >> > >>> for the quick feed back. >> >> >> > >>> >> >> >> > >>> Predrag >> >> >> > >>> >> >> >> > >>> > Version 0.10.0 fails immediately on build: "The specified >> >> >> > >>> > --crosstool_top >> >> >> > >>> > '@local_config_cuda//crosstool:crosstool' is not a valid >> >> >> > >>> > cc_toolchain_suite >> >> >> > >>> > rule." Apparently this is because 0.10 required an older >> >> >> > >>> > version of >> >> >> > >>> > bazel ( >> >> >> > >>> > https://github.com/tensorflow/tensorflow/issues/4368), and I >> >> >> > >>> > don't >> >> >> > have >> >> >> > >>> > the >> >> >> > >>> > energy to install an old version of bazel. >> >> >> > >>> > >> >> >> > >>> > Version 0.11.0rc0 gets almost done and then complains about >> >> >> > >>> > no such >> >> >> > >>> > file or >> >> >> > >>> > directory for libcudart.so.7.5 (which is there, where I told >> >> >> > tensorflow >> >> >> > >>> > it >> >> >> > >>> > was...). >> >> >> > >>> > >> >> >> > >>> > Non-release versions from git fail immediately because they >> >> >> > >>> > call git >> >> >> > -C >> >> >> > >>> > to >> >> >> > >>> > get version info, which is only in git 1.9 (we have 1.8). >> >> >> > >>> > >> >> >> > >>> > >> >> >> > >>> > Some other notes: >> >> >> > >>> > - I made a symlink from ~/.cache/bazel to >> >> >> > >>> > /home/scratch/$USER/.cache/bazel, >> >> >> > >>> > because bazel is the worst. (It complains about doing things >> >> >> > >>> > on NFS, >> >> >> > >>> > and >> >> >> > >>> > hung for me [clock-related?], and I can't find a global >> >> >> > >>> > config file >> >> >> > or >> >> >> > >>> > anything to change that in; it seems like there might be >> >> >> > >>> > one, but >> >> >> > their >> >> >> > >>> > documentation is terrible.) >> >> >> > >>> > >> >> >> > >>> > - I wasn't able to use the actual Titan X compute capability >> >> >> > >>> > of 6.1, >> >> >> > >>> > because that requires cuda 8; I used 5.2 instead. 
Probably >> >> >> > >>> > not a huge >> >> >> > >>> > deal, >> >> >> > >>> > but I don't know. >> >> >> > >>> > >> >> >> > >>> > - I tried explicitly including /usr/local/cuda/lib64 in >> >> >> > LD_LIBRARY_PATH >> >> >> > >>> > and >> >> >> > >>> > set CUDA_HOME to /usr/local/cuda before building, hoping >> >> >> > >>> > that would >> >> >> > >>> > help >> >> >> > >>> > with the 0.11.0rc0 problem, but it didn't. >> >> >> > >> >> >> >> > >> >> >> >> > > >> >> >> > >> >> >> > >> >> >> > > > From kandasamy at cmu.edu Fri Oct 21 16:46:59 2016 From: kandasamy at cmu.edu (Kirthevasan Kandasamy) Date: Fri, 21 Oct 2016 16:46:59 -0400 Subject: GPU3 back in business In-Reply-To: References: <20161019172209.esq71ATV8%predragp@cs.cmu.edu> <20161019183651.3Xy0murhr%predragp@cs.cmu.edu> <7B1F10ED8F1045F6.b801b0a4-484f-4c6e-9558-4a83a0cb33ec@mail.outlook.com> <20161019201053.KHd8koxJl%predragp@cs.cmu.edu> <20161021193727.9hJzyE8s5%predragp@cs.cmu.edu> <20161021195032.EWGQfn13b%predragp@cs.cmu.edu> Message-ID: Ok, that should be enough. samy On Fri, Oct 21, 2016 at 4:44 PM, Barnabas Poczos wrote: > Hi Samy, > > Gpu1 will still have Matlab and 4 K80 GPus (which is technically 8 > GPUs). Won't that be enough for now? > > Best, > B > ====================== > Barnabas Poczos, PhD > Assistant Professor > Machine Learning Department > Carnegie Mellon University > > > On Fri, Oct 21, 2016 at 4:21 PM, Kirthevasan Kandasamy > wrote: > > Hi all, > > > > I was planning on using Matlab with GPUs for one of my projects. > > Can we please keep gpu2 as it is for now? > > > > samy > > > > On Fri, Oct 21, 2016 at 3:54 PM, Barnabas Poczos > > wrote: > >> > >> Sounds good. Let us have tensorflow system wide on all GPU nodes. We > >> can worry about Matlab later. 
> >> > >> Best, > >> B > >> ====================== > >> Barnabas Poczos, PhD > >> Assistant Professor > >> Machine Learning Department > >> Carnegie Mellon University > >> > >> > >> On Fri, Oct 21, 2016 at 3:50 PM, Predrag Punosevac > > >> wrote: > >> > Barnabas Poczos wrote: > >> > > >> >> Hi Predrag, > >> >> > >> >> If there is no other solution, then I think it is OK not to have > >> >> Matlab on GPU2 and GPU3. > >> >> Tensorflow has higher priority on these nodes. > >> > > >> > We could possibly have multiple CUDA libraries for different versions > >> > but that is going to bite us for the rear end quickly. People who want > >> > to use MATLAB with GPUs will have to live with GPU1 probably until > >> > Spring release of MATLAB. > >> > > >> > Predrag > >> > > >> >> > >> >> Best, > >> >> Barnabas > >> >> > >> >> > >> >> > >> >> > >> >> ====================== > >> >> Barnabas Poczos, PhD > >> >> Assistant Professor > >> >> Machine Learning Department > >> >> Carnegie Mellon University > >> >> > >> >> > >> >> On Fri, Oct 21, 2016 at 3:37 PM, Predrag Punosevac > >> >> wrote: > >> >> > Dougal Sutherland wrote: > >> >> > > >> >> > > >> >> > Sorry that I am late for the party. This is my interpretation of > what > >> >> > we > >> >> > should do. > >> >> > > >> >> > 1. I will go back to CUDA 8.0 which will brake MATLAB. We have to > >> >> > live > >> >> > with it. Barnabas please OK this. I will work with MathWorks for > this > >> >> > to > >> >> > be fixed for 2017a release. > >> >> > > >> >> > 2. Then I could install TensorFlow compiled by Dougal system wide. > >> >> > Please Dugal after I upgrade back to 8.0 recompile it again using > >> >> > CUDA > >> >> > 8.0. I could give you the root password so that you can compile and > >> >> > install directly. > >> >> > > >> >> > 3. If everyone is OK with above I will pull the trigger on GPU3 at > >> >> > 4:30PM and upgrade to 8.0 > >> >> > > >> >> > 4. 
MATLAB will be broken on GPU2 as well after I put Titan cards > >> >> > during > >> >> > the October 25 power outrage. > >> >> > > >> >> > Predrag > >> >> > > >> >> > > >> >> > > >> >> > > >> >> > > >> >> > > >> >> >> Heh. :) > >> >> >> > >> >> >> An explanation: > >> >> >> > >> >> >> - Different nvidia gpu architectures are called "compute > >> >> >> capabilities". > >> >> >> This is a number that describes the behavior of the card: the > >> >> >> maximum size > >> >> >> of various things, which API functions it supports, etc. > There's > >> >> >> a > >> >> >> reference here > >> >> >> > >> >> >> and_specifications>, > >> >> >> but it shouldn't really matter. > >> >> >> - When CUDA compiles code, it targets a certain architecture, > >> >> >> since it > >> >> >> needs to know what features to use and whatnot. I *think* that > if > >> >> >> you > >> >> >> compile for compute capability x, it will work on a card with > >> >> >> compute > >> >> >> capability y approximately iff x <= y. > >> >> >> - Pascal Titan Xs, like gpu3 has, have compute capability 6.1. > >> >> >> - CUDA 7.5 doesn't know about compute capability 6.1, so if you > >> >> >> ask to > >> >> >> compile for 6.1 it crashes. > >> >> >> - Theano by default tries to compile for the capability of the > >> >> >> card, but > >> >> >> can be configured to compile for a different capability. > >> >> >> - Tensorflow asks for a list of capabilities to compile for > when > >> >> >> you > >> >> >> build it in the first place. > >> >> >> > >> >> >> > >> >> >> On Fri, Oct 21, 2016 at 8:17 PM Dougal Sutherland < > dougal at gmail.com> > >> >> >> wrote: > >> >> >> > >> >> >> > They do work with 7.5 if you specify an older compute > >> >> >> > architecture; it's > >> >> >> > just that their actual compute capability of 6.1 isn't supported > >> >> >> > by cuda > >> >> >> > 7.5. 
Thank is thrown off by this, for example, but it can be > fixed > >> >> >> > by > >> >> >> > telling it to pass compute capability 5.2 (for example) to > nvcc. I > >> >> >> > don't > >> >> >> > think that this was my problem with building tensorflow on 7.5; > >> >> >> > I'm not > >> >> >> > sure what that was. > >> >> >> > > >> >> >> > On Fri, Oct 21, 2016, 8:11 PM Kirthevasan Kandasamy > >> >> >> > > >> >> >> > wrote: > >> >> >> > > >> >> >> > Thanks Dougal. I'll take a look atthis and get back to you. > >> >> >> > So are you suggesting that this is an issue with TitanX's not > >> >> >> > being > >> >> >> > compatible with 7.5? > >> >> >> > > >> >> >> > On Fri, Oct 21, 2016 at 3:08 PM, Dougal Sutherland > >> >> >> > > >> >> >> > wrote: > >> >> >> > > >> >> >> > I installed it in my scratch directory (not sure if there's a > >> >> >> > global > >> >> >> > install?). The main thing was to put its cache on scratch; it > got > >> >> >> > really > >> >> >> > upset when the cache directory was on NFS. (Instructions at the > >> >> >> > bottom of > >> >> >> > my previous email.) > >> >> >> > > >> >> >> > On Fri, Oct 21, 2016, 8:04 PM Barnabas Poczos > >> >> >> > wrote: > >> >> >> > > >> >> >> > That's great! Thanks Dougal. > >> >> >> > > >> >> >> > As I remember bazel was not installed correctly previously on > >> >> >> > GPU3. Do > >> >> >> > you know what went wrong with it before and why it is good now? > >> >> >> > > >> >> >> > Thanks, > >> >> >> > Barnabas > >> >> >> > ====================== > >> >> >> > Barnabas Poczos, PhD > >> >> >> > Assistant Professor > >> >> >> > Machine Learning Department > >> >> >> > Carnegie Mellon University > >> >> >> > > >> >> >> > > >> >> >> > On Fri, Oct 21, 2016 at 2:03 PM, Dougal Sutherland > >> >> >> > > >> >> >> > wrote: > >> >> >> > > I was just able to build tensorflow 0.11.0rc0 on gpu3! I used > >> >> >> > > the cuda > >> >> >> > 8.0 > >> >> >> > > install, and it built fine. 
So additionally installing 7.5 was > >> >> >> > > probably > >> >> >> > not > >> >> >> > > necessary; in fact, cuda 7.5 doesn't know about the 6.1 > compute > >> >> >> > architecture > >> >> >> > > that the Titan Xs use, so Theano at least needs to be manually > >> >> >> > > told to > >> >> >> > use > >> >> >> > > an older architecture. > >> >> >> > > > >> >> >> > > A pip package is in > >> >> >> > > ~dsutherl/tensorflow-0.11.0rc0-py2-none-any.whl. I > >> >> >> > think > >> >> >> > > it should work fine with the cudnn in my scratch directory. > >> >> >> > > > >> >> >> > > You should probably install it to scratch, either running this > >> >> >> > > first to > >> >> >> > put > >> >> >> > > libraries your scratch directory or using a virtualenv or > >> >> >> > > something: > >> >> >> > > export PYTHONUSERBASE=/home/scratch/$USER/.local > >> >> >> > > > >> >> >> > > You'll need this to use the library and probably to install > it: > >> >> >> > > export > >> >> >> > > > >> >> >> > > >> >> >> > LD_LIBRARY_PATH=/home/scratch/dsutherl/cudnn-8.0-5.1/cuda/ > lib64:"$LD_LIBRARY_PATH" > >> >> >> > > > >> >> >> > > To install: > >> >> >> > > pip install --user > >> >> >> > > ~dsutherl/tensorflow-0.11.0rc0-py2-none-any.whl > >> >> >> > > (remove --user if you're using a virtualenv) > >> >> >> > > > >> >> >> > > (A request: I'm submitting to ICLR in two weeks, and for some > of > >> >> >> > > the > >> >> >> > models > >> >> >> > > I'm running gpu3's cards are 4x the speed of gpu1 or 2's. So > >> >> >> > > please don't > >> >> >> > > run a ton of stuff on gpu3 unless you're working on a deadline > >> >> >> > > too. 
-------------- next part -------------- An HTML attachment was scrubbed...
URL: From dougal at gmail.com Fri Oct 21 17:07:35 2016 From: dougal at gmail.com (Dougal Sutherland) Date: Fri, 21 Oct 2016 21:07:35 +0000 Subject: GPU3 back in business In-Reply-To: References: <20161019172209.esq71ATV8%predragp@cs.cmu.edu> <20161019183651.3Xy0murhr%predragp@cs.cmu.edu> <7B1F10ED8F1045F6.b801b0a4-484f-4c6e-9558-4a83a0cb33ec@mail.outlook.com> <20161019201053.KHd8koxJl%predragp@cs.cmu.edu> <20161021193727.9hJzyE8s5%predragp@cs.cmu.edu> <20161021195032.EWGQfn13b%predragp@cs.cmu.edu> Message-ID: I don't think it would be a bad thing to have both versions of cuda installed and default to 8.0. To use 7.5 for matlab you probably just have to write a wrapper script to set LD_LIBRARY_PATH appropriately. On Fri, Oct 21, 2016 at 9:21 PM Kirthevasan Kandasamy wrote: > Hi all, > > I was planning on using Matlab with GPUs for one of my projects. > Can we please keep gpu2 as it is for now? > > samy > > On Fri, Oct 21, 2016 at 3:54 PM, Barnabas Poczos > wrote: > > Sounds good. Let us have tensorflow system wide on all GPU nodes. We > can worry about Matlab later. > > Best, > B > ====================== > Barnabas Poczos, PhD > Assistant Professor > Machine Learning Department > Carnegie Mellon University > > > On Fri, Oct 21, 2016 at 3:50 PM, Predrag Punosevac > wrote: > > Barnabas Poczos wrote: > > > >> Hi Predrag, > >> > >> If there is no other solution, then I think it is OK not to have > >> Matlab on GPU2 and GPU3. > >> Tensorflow has higher priority on these nodes. > > > > We could possibly have multiple CUDA libraries for different versions > > but that is going to bite us in the rear quickly. People who want > > to use MATLAB with GPUs will have to live with GPU1, probably until the > > Spring release of MATLAB.
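A wrapper along the lines Dougal suggests might look like the sketch below; the CUDA 7.5 install prefix and the assumption that `matlab` is on the PATH are mine, not confirmed by the thread.

```shell
#!/bin/sh
# Hypothetical wrapper: launch MATLAB against the CUDA 7.5 libraries
# instead of the default 8.0 ones. The install prefix is an assumption.
CUDA75_LIB=/usr/local/cuda-7.5/lib64
# Prepend so the 7.5 libcudart and friends are found before any 8.0 copies.
export LD_LIBRARY_PATH="$CUDA75_LIB${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
# exec matlab "$@"   # uncomment to actually launch MATLAB
```

Saving this as e.g. `matlab75` would let MATLAB users keep 7.5 while the system-wide default stays at 8.0.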
> > > > Predrag > > > >> > >> Best, > >> Barnabas > >> > >> > >> > >> ====================== > >> Barnabas Poczos, PhD > >> Assistant Professor > >> Machine Learning Department > >> Carnegie Mellon University > >> > >> > >> On Fri, Oct 21, 2016 at 3:37 PM, Predrag Punosevac > wrote: > >> > Dougal Sutherland wrote: > >> > > >> > > >> > Sorry that I am late to the party. This is my interpretation of what we > >> > should do. > >> > > >> > 1. I will go back to CUDA 8.0, which will break MATLAB. We have to live > >> > with it. Barnabas, please OK this. I will work with MathWorks for this to > >> > be fixed for the 2017a release. > >> > > >> > 2. Then I could install TensorFlow compiled by Dougal system wide. > >> > Dougal, please recompile it again using CUDA 8.0 after I upgrade back to > >> > 8.0. I could give you the root password so that you can compile and > >> > install directly. > >> > > >> > 3. If everyone is OK with the above I will pull the trigger on GPU3 at > >> > 4:30PM and upgrade to 8.0. > >> > > >> > 4. MATLAB will be broken on GPU2 as well after I put the Titan cards in > >> > during the October 25 power outage. > >> > > >> > Predrag > >> > > >> > > >> > > >> > > >> > > >> >> Heh. :) > >> >> > >> >> An explanation: > >> >> > >> >> - Different nvidia gpu architectures are called "compute capabilities". > >> >> This is a number that describes the behavior of the card: the maximum size > >> >> of various things, which API functions it supports, etc. There's a > >> >> reference here > >> >> < https://en.wikipedia.org/wiki/CUDA#Version_features_and_specifications>, > >> >> but it shouldn't really matter. > >> >> - When CUDA compiles code, it targets a certain architecture, since it > >> >> needs to know what features to use and whatnot. I *think* that if you > >> >> compile for compute capability x, it will work on a card with compute > >> >> capability y approximately iff x <= y.
> >> >> - Pascal Titan Xs, like gpu3 has, have compute capability 6.1. > >> >> - CUDA 7.5 doesn't know about compute capability 6.1, so if you ask to > >> >> compile for 6.1 it crashes. > >> >> - Theano by default tries to compile for the capability of the card, but > >> >> can be configured to compile for a different capability. > >> >> - Tensorflow asks for a list of capabilities to compile for when you > >> >> build it in the first place. > >> >> > >> >> > >> >> On Fri, Oct 21, 2016 at 8:17 PM Dougal Sutherland > wrote: > >> >> > >> >> > They do work with 7.5 if you specify an older compute architecture; it's > >> >> > just that their actual compute capability of 6.1 isn't supported by cuda > >> >> > 7.5. Theano is thrown off by this, for example, but it can be fixed by > >> >> > telling it to pass compute capability 5.2 (for example) to nvcc. I don't > >> >> > think that this was my problem with building tensorflow on 7.5; I'm not > >> >> > sure what that was. > >> >> > > >> >> > On Fri, Oct 21, 2016, 8:11 PM Kirthevasan Kandasamy < kandasamy at cmu.edu> > >> >> > wrote: > >> >> > > >> >> > Thanks Dougal. I'll take a look at this and get back to you. > >> >> > So are you suggesting that this is an issue with Titan Xs not being > >> >> > compatible with 7.5? > >> >> > > >> >> > On Fri, Oct 21, 2016 at 3:08 PM, Dougal Sutherland < dougal at gmail.com> > >> >> > wrote: > >> >> > > >> >> > I installed it in my scratch directory (not sure if there's a global > >> >> > install?). The main thing was to put its cache on scratch; it got really > >> >> > upset when the cache directory was on NFS. (Instructions at the bottom of > >> >> > my previous email.) > >> >> > > >> >> > On Fri, Oct 21, 2016, 8:04 PM Barnabas Poczos > wrote: > >> >> > > >> >> > That's great! Thanks Dougal. > >> >> > > >> >> > As I remember, bazel was not installed correctly previously on GPU3. Do > >> >> > you know what went wrong with it before and why it is good now?
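Dougal's rule of thumb (code compiled for capability x runs on a card of capability y roughly iff x <= y) can be sketched as a tiny helper; 6.1 for the Pascal Titan X and the 5.2 fallback target come from the thread, while 3.7 for the Tesla K80 is my assumption.

```shell
# can_run X Y: succeeds iff code compiled for compute capability X should
# run on a card of compute capability Y (the approximate x <= y rule).
can_run() {
    awk -v x="$1" -v y="$2" 'BEGIN { exit !(x <= y) }'
}

# A 5.2-targeted build still runs on a 6.1 (Pascal Titan X) card:
can_run 5.2 6.1 && echo "5.2 build runs on Pascal Titan X"
# A 6.1-targeted build won't run on a 3.7 (Tesla K80, assumed value) card:
can_run 6.1 3.7 || echo "6.1 build does not run on Tesla K80"
```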
> >> >> > > >> >> > Thanks, > >> >> > Barnabas > >> >> > ====================== > >> >> > Barnabas Poczos, PhD > >> >> > Assistant Professor > >> >> > Machine Learning Department > >> >> > Carnegie Mellon University > >> >> > > >> >> > > >> >> > On Fri, Oct 21, 2016 at 2:03 PM, Dougal Sutherland < dougal at gmail.com> > >> >> > wrote: > >> >> > > I was just able to build tensorflow 0.11.0rc0 on gpu3! I used the cuda > >> >> > 8.0 > >> >> > > install, and it built fine. So additionally installing 7.5 was probably > >> >> > not > >> >> > > necessary; in fact, cuda 7.5 doesn't know about the 6.1 compute > >> >> > architecture > >> >> > > that the Titan Xs use, so Theano at least needs to be manually told to > >> >> > use > >> >> > > an older architecture. > >> >> > > > >> >> > > A pip package is in ~dsutherl/tensorflow-0.11.0rc0-py2-none-any.whl. I > >> >> > think > >> >> > > it should work fine with the cudnn in my scratch directory. > >> >> > > > >> >> > > You should probably install it to scratch, either running this first to > >> >> > put > >> >> > > libraries in your scratch directory or using a virtualenv or something: > >> >> > > export PYTHONUSERBASE=/home/scratch/$USER/.local > >> >> > > > >> >> > > You'll need this to use the library and probably to install it: > >> >> > > export > >> >> > > > >> >> > > LD_LIBRARY_PATH=/home/scratch/dsutherl/cudnn-8.0-5.1/cuda/lib64:"$LD_LIBRARY_PATH" > >> >> > > > >> >> > > To install: > >> >> > > pip install --user > ~dsutherl/tensorflow-0.11.0rc0-py2-none-any.whl > >> >> > > (remove --user if you're using a virtualenv) > >> >> > > > >> >> > > (A request: I'm submitting to ICLR in two weeks, and for some of the > >> >> > models > >> >> > > I'm running gpu3's cards are 4x the speed of gpu1 or 2's. So please don't > >> >> > > run a ton of stuff on gpu3 unless you're working on a deadline too.
> >> >> > > > >> >> > > > >> >> > > > >> >> > > Steps to install it, for the future: > >> >> > > > >> >> > > Install bazel in your home directory: > >> >> > > > >> >> > > wget > >> >> > > > >> >> > > https://github.com/bazelbuild/bazel/releases/download/0.3.2/bazel-0.3.2-installer-linux-x86_64.sh > >> >> > > bash bazel-0.3.2-installer-linux-x86_64.sh > --prefix=/home/scratch/$USER > >> >> > > --base=/home/scratch/$USER/.bazel > >> >> > > > >> >> > > Configure bazel to build in scratch. There's probably a better > way to do > >> >> > > this, but this works: > >> >> > > > >> >> > > mkdir /home/scratch/$USER/.cache > >> >> > > ln -s /home/scratch/$USER/.cache/bazel ~/.cache/bazel > >> >> > > > >> >> > > Build tensorflow. Note that builds from git checkouts don't > work, because > >> >> > > they assume a newer version of git than is on gpu3: > >> >> > > > >> >> > > cd /home/scratch/$USER > >> >> > > wget > >> >> > > tar xf > >> >> > > cd tensorflow-0.11.0rc0 > >> >> > > ./configure > >> >> > > > >> >> > > This is an interactive script that doesn't seem to let you pass > >> >> > arguments or > >> >> > > anything. It's obnoxious. > >> >> > > Use the default python > >> >> > > don't use cloud platform or hadoop file system > >> >> > > use the default site-packages path if it asks > >> >> > > build with GPU support > >> >> > > default gcc > >> >> > > default Cuda SDK version > >> >> > > specify /usr/local/cuda-8.0 > >> >> > > default cudnn version > >> >> > > specify $CUDNN_DIR from use-cudnn.sh, e.g. > >> >> > > /home/scratch/dsutherl/cudnn-8.0-5.1/cuda > >> >> > > Pascal Titan Xs have compute capability 6.1 > >> >> > > > >> >> > > bazel build -c opt --config=cuda > >> >> > > //tensorflow/tools/pip_package:build_pip_package > >> >> > > bazel-bin/tensorflow/tools/pip_package/build_pip_package ./ > >> >> > > A .whl file, e.g. tensorflow-0.11.0rc0-py2-none-any.whl, is put > in the > >> >> > > directory you specified above. 
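For quick reference, the user-side part of the recipe above (everything after the wheel exists) collapses to a few lines; the wheel and cudnn paths are the site-specific ones quoted in this thread, so the `pip install` line is left commented out.

```shell
# Keep user-installed Python packages and the cudnn libraries on local
# scratch, then install the prebuilt wheel (paths are from this thread).
export PYTHONUSERBASE=/home/scratch/$USER/.local
export LD_LIBRARY_PATH=/home/scratch/dsutherl/cudnn-8.0-5.1/cuda/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
# pip install --user ~dsutherl/tensorflow-0.11.0rc0-py2-none-any.whl
```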
> >> >> > > > >> >> > > > >> >> > > - Dougal > >> >> > > > >> >> > > > >> >> > > On Fri, Oct 21, 2016 at 6:14 PM Kirthevasan Kandasamy < > kandasamy at cmu.edu > >> >> > > > >> >> > > wrote: > >> >> > >> > >> >> > >> Predrag, > >> >> > >> > >> >> > >> Any updates on gpu3? > >> >> > >> I have tried both tensorflow and chainer and in both cases the > problem > >> >> > >> seems to be with cuda > >> >> > >> > >> >> > >> On Wed, Oct 19, 2016 at 4:10 PM, Predrag Punosevac < > predragp at cs.cmu.edu > >> >> > > > >> >> > >> wrote: > >> >> > >>> > >> >> > >>> Dougal Sutherland wrote: > >> >> > >>> > >> >> > >>> > I tried for a while. I failed. > >> >> > >>> > > >> >> > >>> > >> >> > >>> Damn this doesn't look good. I guess back to the drawing > board. Thanks > >> >> > >>> for the quick feed back. > >> >> > >>> > >> >> > >>> Predrag > >> >> > >>> > >> >> > >>> > Version 0.10.0 fails immediately on build: "The specified > >> >> > >>> > --crosstool_top > >> >> > >>> > '@local_config_cuda//crosstool:crosstool' is not a valid > >> >> > >>> > cc_toolchain_suite > >> >> > >>> > rule." Apparently this is because 0.10 required an older > version of > >> >> > >>> > bazel ( > >> >> > >>> > https://github.com/tensorflow/tensorflow/issues/4368), and > I don't > >> >> > have > >> >> > >>> > the > >> >> > >>> > energy to install an old version of bazel. > >> >> > >>> > > >> >> > >>> > Version 0.11.0rc0 gets almost done and then complains about > no such > >> >> > >>> > file or > >> >> > >>> > directory for libcudart.so.7.5 (which is there, where I told > >> >> > tensorflow > >> >> > >>> > it > >> >> > >>> > was...). > >> >> > >>> > > >> >> > >>> > Non-release versions from git fail immediately because they > call git > >> >> > -C > >> >> > >>> > to > >> >> > >>> > get version info, which is only in git 1.9 (we have 1.8). 
> >> >> > >>> > > >> >> > >>> > > >> >> > >>> > Some other notes: > >> >> > >>> > - I made a symlink from ~/.cache/bazel to > >> >> > >>> > /home/scratch/$USER/.cache/bazel, > >> >> > >>> > because bazel is the worst. (It complains about doing things > on NFS, > >> >> > >>> > and > >> >> > >>> > hung for me [clock-related?], and I can't find a global > config file > >> >> > or > >> >> > >>> > anything to change that in; it seems like there might be > one, but > >> >> > their > >> >> > >>> > documentation is terrible.) > >> >> > >>> > > >> >> > >>> > - I wasn't able to use the actual Titan X compute capability > of 6.1, > >> >> > >>> > because that requires cuda 8; I used 5.2 instead. Probably > not a huge > >> >> > >>> > deal, > >> >> > >>> > but I don't know. > >> >> > >>> > > >> >> > >>> > - I tried explicitly including /usr/local/cuda/lib64 in > >> >> > LD_LIBRARY_PATH > >> >> > >>> > and > >> >> > >>> > set CUDA_HOME to /usr/local/cuda before building, hoping > that would > >> >> > >>> > help > >> >> > >>> > with the 0.11.0rc0 problem, but it didn't. > >> >> > >> > >> >> > >> > >> >> > > > >> >> > > >> >> > > >> >> > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From dougal at gmail.com Fri Oct 21 18:07:03 2016 From: dougal at gmail.com (Dougal Sutherland) Date: Fri, 21 Oct 2016 22:07:03 +0000 Subject: TensorFlow 0.11.0rc0 now globally installed on gpu3 Message-ID: I just installed TensorFlow 0.11.0rc0 system-wide on gpu3. "import tensorflow" should now work without you having to do anything else; no messing with LD_LIBRARY_PATH, installing anything to your local python site, or anything like that anymore. Let me+Predrag know if anything seems broken. We'll also update to a later RC/full release when one is available. - Dougal PS: To do this, I did a global install of cudnn 5.1. 
If you need to use a different version of cudnn for some software, it *should* still work like it did before; just make sure that your cudnn.h takes priority over the one in /usr/local/cuda/include. But Theano, Caffe, etc. all work with cudnn 5.1 now, I think. -------------- next part -------------- An HTML attachment was scrubbed... URL: From predragp at cs.cmu.edu Sat Oct 22 11:22:22 2016 From: predragp at cs.cmu.edu (Predrag Punosevac) Date: Sat, 22 Oct 2016 11:22:22 -0400 Subject: Fwd: POSTPONED - October 25th 2016: [Power Outage] SCS Wean Hall machine room Message-ID: <20161022152222.dWNM09cuM%predragp@cs.cmu.edu> Dear Autonians, SCS is again postponing the power outage. Since I have received extra RAM and new GPU cards for GPU1 and GPU2, I am planning to offline them on Tuesday starting at 11:00 AM for hopefully no more than three hours, unless there is a serious community objection. If you think that my plan is dumb, please speak now. Predrag -------- Original Message -------- Subject: POSTPONED - October 25th 2016: [Power Outage] SCS Wean Hall machine room To: Edward J Walter From: Edward Walter Date: Fri, 21 Oct 2016 22:37:54 -0400 The planned power outage for the SCS Wean Hall machine room has been postponed again and will NOT take place on October 25th. We will let you know as soon as we have a new date for this power outage. We anticipate that it will happen in mid-late November. Thank you for your attention and patience, SCS Help Desk On 10/10/2016 09:26 AM, Edward Walter wrote: > The partial power outage for the SCS Wean Hall machine room has been > rescheduled for October 25th. We expect this work to take less than > 24 hours. The outage may run into 48 hours if the electrical > contractor encounters problems related to the planned maintenance > tasks. > > Please contact the SCS Help Desk at x8-4231 or send mail to > help at cs.cmu.edu with any questions or concerns regarding this > maintenance period.
> > Thank you for your attention, > > SCS Help Desk > > On 10/03/2016 07:28 AM, Edward Walter wrote: >> The planned power outage for the SCS Wean Hall machine room has >> been postponed and will NOT take place on October 4th. We are >> coordinating with the electrical contractors to get the work >> re-scheduled. We will let you know as soon as we have a new date >> for this power outage. >> >> Thank you. >> >> SCS Help Desk >> >> On 09/12/2016 08:01 AM, Edward Walter wrote: >>> SCS Computing Facilities and FMS are planning a partial power >>> outage in the SCS Wean Hall machine room. We expect this work >>> to begin on Oct 4th, 2016 and to take less than 24 hours. The >>> outage may run into 48 hours in the event that the electrical >>> contractor encounters something unexpected. >>> >>> The following servers or computational clusters will be affected >>> by this power outage: >>> >>> Affected clusters: ACTR.HPC1.CS.CMU.EDU AUTON >>> COMA.HPC1.CS.CMU.EDU CORTEX.ML.CMU.EDU LATEDAYS.ANDREW.CMU.EDU >>> PSYCH-O.HPC1.CS.CMU.EDU ROCKS.IS.CS.CMU.EDU >>> WORKHORSE.LTI.CS.CMU.EDU YODA.GRAPHICS.CS.CMU.EDU >>> >>> >>> Affected servers: OMEPSLID.COMPBIO SLIF.COMPBIO PACIFIC.DB >>> GPUSERVER.PERCEPTION GPUSERVER2.PERCEPTION GPUSERVER3.PERCEPTION >>> GPUSERVER5.PERCEPTION GPUSERVER6.PERCEPTION >>> GPUSERVER7.PERCEPTION DENVER.LTI LOR.LTI MIAMI.LTI SASKIA.ML >>> MARTEN.ML JAN.ML ARNOUT.ML LYSBET.ML FLORIS.ML >>> >>> >>> Please contact the SCS Help Desk at x8-4231 or send mail to >>> help at cs.cmu.edu with any questions or concerns regarding this >>> maintenance period. >>> >>> Thank you for your attention, >>> >>> SCS Help Desk From predragp at imap.srv.cs.cmu.edu Tue Oct 25 13:08:54 2016 From: predragp at imap.srv.cs.cmu.edu (Predrag Punosevac) Date: Tue, 25 Oct 2016 13:08:54 -0400 Subject: GPU2 upgraded Message-ID: <11f6d11f7b5ef1519ec010d5562d1d71@imap.srv.cs.cmu.edu> Dear Autonians, As scheduled I upgraded GPU2 to 256 GB of RAM and 4 Titan (Pascal) cards per Barnabas. 
Dougal is currently upgrading TensorFlow and Caffe to the latest and the greatest. Please wait for his e-mail before you hit the machine. MATLAB is broken just like on GPU3 due to cuda-8.0. I am working now on GPU1. Predrag P.S. I will escalate the MATLAB issue with MathWorks but I don't expect it to be fixed before R2017a. From dougal at gmail.com Tue Oct 25 14:14:18 2016 From: dougal at gmail.com (Dougal Sutherland) Date: Tue, 25 Oct 2016 18:14:18 +0000 Subject: GPU2 upgraded In-Reply-To: <11f6d11f7b5ef1519ec010d5562d1d71@imap.srv.cs.cmu.edu> References: <11f6d11f7b5ef1519ec010d5562d1d71@imap.srv.cs.cmu.edu> Message-ID: The same version of TensorFlow as on gpu3 is now installed on gpu2, along with cudnn; let me know if there are issues. I didn't do a global install of Caffe on either machine, because Caffe is kind of dumb and doesn't really do global installs. If anyone wants this, talk to me and we can figure out what makes sense. - Dougal On Tue, Oct 25, 2016 at 6:09 PM Predrag Punosevac < predragp at imap.srv.cs.cmu.edu> wrote: > Dear Autonians, > > As scheduled I upgraded GPU2 to 256 GB of RAM and 4 Titan (Pascal) cards > per Barnabas. Dougal is currently upgrading TensorFlow and Caffe to the > latest and the greatest. Please wait for his e-mail before you hit the > machine. MATLAB is broken just like on GPU3 due to cuda-8.0. I am > working now on GPU1. > > > Predrag > > P.S. I will escalate the MATLAB issue with MathWorks but I don't expect it > to be fixed before R2017a. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From predragp at imap.srv.cs.cmu.edu Tue Oct 25 16:18:18 2016 From: predragp at imap.srv.cs.cmu.edu (Predrag Punosevac) Date: Tue, 25 Oct 2016 16:18:18 -0400 Subject: GPU1 upgraded with caveat Message-ID: <456e00342968479a32953d9b109df5ae@imap.srv.cs.cmu.edu> Dear Autonians, GPU1 is upgraded with a caveat. The RAM is boosted to a total of 256 GB. The machine now has 4 Tesla K80 cards.
However, SuperMicro failed to send me two power dongles https://www.pinterest.com/pin/446208275556474885/ (top right corner). So only 2 K80s are usable now. I just got off the phone with the Silicon Mechanics people and they are furious about that, just like me. Mind you, the cards were over $4K each and a power dongle is $3 apiece, but it has to be shipped from SuperMicro. Predrag P.S. MATLAB should work as before on GPU1. GPU1 has cuda 7.5. From junieroliva at gmail.com Wed Oct 26 13:03:04 2016 From: junieroliva at gmail.com (Junier Oliva) Date: Wed, 26 Oct 2016 13:03:04 -0400 Subject: GPU2 upgraded In-Reply-To: References: <11f6d11f7b5ef1519ec010d5562d1d71@imap.srv.cs.cmu.edu> Message-ID: Not sure if this is at all related, but is ipython broken for anyone else? It seems to just hang upon launching it on several auton machines (GPU1, GPU2, LOV4, LOW1). Thanks, Junier On Tue, Oct 25, 2016 at 2:14 PM, Dougal Sutherland wrote: > The same version of TensorFlow as on gpu3 is now installed on gpu2, along > with cudnn; let me know if there are issues. > > I didn't do a global install of Caffe on either machine, because Caffe is > kind of dumb and doesn't really do global installs. If anyone wants this, > talk to me and we can figure out what makes sense. > > - Dougal > > On Tue, Oct 25, 2016 at 6:09 PM Predrag Punosevac < > predragp at imap.srv.cs.cmu.edu> wrote: > >> Dear Autonians, >> >> As scheduled I upgraded GPU2 to 256 GB of RAM and 4 Titan (Pascal) cards >> per Barnabas. Dugal is currently upgrading TensorFlow and Caffe to the >> latest and the greatest. Please wait until his e-mail until you hit the >> machine. MATLAB is broken just like on GPU3 due to cuda-8.0. I am >> working now on GPU1. >> >> >> Predrag >> >> P.S. I will escalate MATLAB issue with MathWorks but I don't expect to >> fixed before R2017a. >> >> -------------- next part -------------- An HTML attachment was scrubbed...
URL: From dougal at gmail.com Wed Oct 26 13:25:37 2016 From: dougal at gmail.com (Dougal Sutherland) Date: Wed, 26 Oct 2016 17:25:37 +0000 Subject: GPU2 upgraded In-Reply-To: References: <11f6d11f7b5ef1519ec010d5562d1d71@imap.srv.cs.cmu.edu> Message-ID: Yes, that's happening for me too. I assumed it was just me because a) it doesn't happen for the auton-local account b) it also happens for my anaconda-installed ipython but this started happening for me today. Something very weird happening there. On Wed, Oct 26, 2016 at 6:03 PM Junier Oliva wrote: > Not sure if this at all related, but is ipython broken for anyone else? It > seems to just hang upon launching it on several auton machines (GPU1, GPU2, > LOV4, LOW1). > > Thanks, > Junier > > On Tue, Oct 25, 2016 at 2:14 PM, Dougal Sutherland > wrote: > > The same version of TensorFlow as on gpu3 is now installed on gpu2, along > with cudnn; let me know if there are issues. > > I didn't do a global install of Caffe on either machine, because Caffe is > kind of dumb and doesn't really do global installs. If anyone wants this, > talk to me and we can figure out what makes sense. > > - Dougal > > On Tue, Oct 25, 2016 at 6:09 PM Predrag Punosevac < > predragp at imap.srv.cs.cmu.edu> wrote: > > Dear Autonians, > > As scheduled I upgraded GPU2 to 256 GB of RAM and 4 Titan (Pascal) cards > per Barnabas. Dugal is currently upgrading TensorFlow and Caffe to the > latest and the greatest. Please wait until his e-mail until you hit the > machine. MATLAB is broken just like on GPU3 due to cuda-8.0. I am > working now on GPU1. > > > Predrag > > P.S. I will escalate MATLAB issue with MathWorks but I don't expect to > fixed before R2017a. > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From predragp at imap.srv.cs.cmu.edu Wed Oct 26 13:37:31 2016 From: predragp at imap.srv.cs.cmu.edu (Predrag Punosevac) Date: Wed, 26 Oct 2016 13:37:31 -0400 Subject: GPU2 upgraded In-Reply-To: References: <11f6d11f7b5ef1519ec010d5562d1d71@imap.srv.cs.cmu.edu> Message-ID: <3192147ce48f50f2f9a131c1950039e2@imap.srv.cs.cmu.edu> On 2016-10-26 13:25, Dougal Sutherland wrote: > Yes, that's happening for me too. I assumed it was just me because > > a) it doesn't happen for the auton-local account > b) it also happens for my anaconda-installed ipython > > but this started happening for me today. Something very weird > happening there. > I just checked and it hangs for me as well with the regular account, but it doesn't hang with the auton-local account. The difference is that auton-local stores info on the local drive while my regular account is storing data on NFS shares. Since it is happening across the computing nodes, it is not an NFS client issue but an NFS server issue. I am clueless at this point about why this is happening. I don't want to reboot the file server until the power outage. I am going to think about this. I am guessing it would be possible to install ipython in your local scratch directory, and that one should start. Predrag > On Wed, Oct 26, 2016 at 6:03 PM Junier Oliva > wrote: > >> Not sure if this at all related, but is ipython broken for anyone >> else? It seems to just hang upon launching it on several auton >> machines (GPU1, GPU2, LOV4, LOW1). >> >> Thanks, >> Junier >> >> On Tue, Oct 25, 2016 at 2:14 PM, Dougal Sutherland >> wrote: >> >> The same version of TensorFlow as on gpu3 is now installed on gpu2, >> along with cudnn; let me know if there are issues. >> >> I didn't do a global install of Caffe on either machine, because >> Caffe is kind of dumb and doesn't really do global installs. If >> anyone wants this, talk to me and we can figure out what makes >> sense.
>> >> - Dougal >> >> On Tue, Oct 25, 2016 at 6:09 PM Predrag Punosevac >> wrote: >> Dear Autonians, >> >> As scheduled I upgraded GPU2 to 256 GB of RAM and 4 Titan (Pascal) >> cards >> per Barnabas. Dugal is currently upgrading TensorFlow and Caffe to >> the >> latest and the greatest. Please wait until his e-mail until you hit >> the >> machine. MATLAB is broken just like on GPU3 due to cuda-8.0. I am >> working now on GPU1. >> >> Predrag >> >> P.S. I will escalate MATLAB issue with MathWorks but I don't expect >> to >> fixed before R2017a. From mayifei1012 at gmail.com Wed Oct 26 13:55:45 2016 From: mayifei1012 at gmail.com (yifei ma) Date: Wed, 26 Oct 2016 13:55:45 -0400 Subject: GPU2 upgraded In-Reply-To: References: <11f6d11f7b5ef1519ec010d5562d1d71@imap.srv.cs.cmu.edu> Message-ID: <01c9e61e-ce42-bc8e-a9ca-c21d3895a301@gmail.com> Second that on foxconn. It launches but the ipython client won't start. Thanks, Yifei On 10/26/2016 01:03 PM, Junier Oliva wrote: > Not sure if this at all related, but is ipython broken for anyone > else? It seems to just hang upon launching it on several auton > machines (GPU1, GPU2, LOV4, LOW1). > > Thanks, > Junier > > On Tue, Oct 25, 2016 at 2:14 PM, Dougal Sutherland > wrote: > > The same version of TensorFlow as on gpu3 is now installed on > gpu2, along with cudnn; let me know if there are issues. > > I didn't do a global install of Caffe on either machine, because > Caffe is kind of dumb and doesn't really do global installs. If > anyone wants this, talk to me and we can figure out what makes sense. > > - Dougal > > On Tue, Oct 25, 2016 at 6:09 PM Predrag Punosevac > > wrote: > > Dear Autonians, > > As scheduled I upgraded GPU2 to 256 GB of RAM and 4 Titan > (Pascal) cards > per Barnabas. Dugal is currently upgrading TensorFlow and > Caffe to the > latest and the greatest. Please wait until his e-mail until > you hit the > machine. MATLAB is broken just like on GPU3 due to cuda-8.0. I am > working now on GPU1. 
> > Predrag > > P.S. I will escalate MATLAB issue with MathWorks but I don't > expect to > fixed before R2017a. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From dougal at gmail.com Wed Oct 26 14:08:34 2016 From: dougal at gmail.com (Dougal Sutherland) Date: Wed, 26 Oct 2016 18:08:34 +0000 Subject: GPU2 upgraded In-Reply-To: <01c9e61e-ce42-bc8e-a9ca-c21d3895a301@gmail.com> References: <11f6d11f7b5ef1519ec010d5562d1d71@imap.srv.cs.cmu.edu> <01c9e61e-ce42-bc8e-a9ca-c21d3895a301@gmail.com> Message-ID: Here's a workaround that avoids ipython having its config files / etc. on nfs; it seems to work for me: export IPYTHONDIR=/home/scratch/$USER/.ipython You can do this in a terminal or put it in your .bash_profile / similar to make it permanent. (I guess this means something changed about the nfs server yesterday/today that broke this.) On Wed, Oct 26, 2016 at 7:03 PM yifei ma wrote: > Second that on foxconn. It launches but the ipython client won't start. > > Thanks, > Yifei > > > On 10/26/2016 01:03 PM, Junier Oliva wrote: > > Not sure if this at all related, but is ipython broken for anyone else? It > seems to just hang upon launching it on several auton machines (GPU1, GPU2, > LOV4, LOW1). > > Thanks, > Junier > > On Tue, Oct 25, 2016 at 2:14 PM, Dougal Sutherland > wrote: > > The same version of TensorFlow as on gpu3 is now installed on gpu2, along > with cudnn; let me know if there are issues. > > I didn't do a global install of Caffe on either machine, because Caffe is > kind of dumb and doesn't really do global installs. If anyone wants this, > talk to me and we can figure out what makes sense. > > - Dougal > > On Tue, Oct 25, 2016 at 6:09 PM Predrag Punosevac < > predragp at imap.srv.cs.cmu.edu> wrote: > > Dear Autonians, > > As scheduled I upgraded GPU2 to 256 GB of RAM and 4 Titan (Pascal) cards > per Barnabas. Dugal is currently upgrading TensorFlow and Caffe to the > latest and the greatest.
Please wait until his e-mail until you hit the > machine. MATLAB is broken just like on GPU3 due to cuda-8.0. I am > working now on GPU1. > > > Predrag > > P.S. I will escalate MATLAB issue with MathWorks but I don't expect to > fixed before R2017a. > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From junieroliva at gmail.com Wed Oct 26 14:20:32 2016 From: junieroliva at gmail.com (Junier Oliva) Date: Wed, 26 Oct 2016 14:20:32 -0400 Subject: GPU2 upgraded In-Reply-To: References: <11f6d11f7b5ef1519ec010d5562d1d71@imap.srv.cs.cmu.edu> <01c9e61e-ce42-bc8e-a9ca-c21d3895a301@gmail.com> Message-ID: Awesome, seems to work for me too. Thanks!! -Junier On Wed, Oct 26, 2016 at 2:08 PM, Dougal Sutherland wrote: > Here's a workaround that avoids ipython having its config files / etc on > nfs, it seems to work for me: > > export IPYTHONDIR=/home/scratch/$USER/.ipython > > You can do this in a terminal or put it your .bash_profile / similar to > make it permanent. > > (I guess this means something changed about the nfs server yesterday/today > that broke this.) > > > On Wed, Oct 26, 2016 at 7:03 PM yifei ma wrote: > >> Second that on foxconn. It launches but the ipython client won't start. >> >> Thanks, >> Yifei >> >> >> On 10/26/2016 01:03 PM, Junier Oliva wrote: >> >> Not sure if this at all related, but is ipython broken for anyone else? >> It seems to just hang upon launching it on several auton machines (GPU1, >> GPU2, LOV4, LOW1). >> >> Thanks, >> Junier >> >> On Tue, Oct 25, 2016 at 2:14 PM, Dougal Sutherland >> wrote: >> >> The same version of TensorFlow as on gpu3 is now installed on gpu2, along >> with cudnn; let me know if there are issues. >> >> I didn't do a global install of Caffe on either machine, because Caffe is >> kind of dumb and doesn't really do global installs. If anyone wants this, >> talk to me and we can figure out what makes sense. 
>> >> - Dougal >> >> On Tue, Oct 25, 2016 at 6:09 PM Predrag Punosevac < >> predragp at imap.srv.cs.cmu.edu> wrote: >> >> Dear Autonians, >> >> As scheduled I upgraded GPU2 to 256 GB of RAM and 4 Titan (Pascal) cards >> per Barnabas. Dugal is currently upgrading TensorFlow and Caffe to the >> latest and the greatest. Please wait until his e-mail until you hit the >> machine. MATLAB is broken just like on GPU3 due to cuda-8.0. I am >> working now on GPU1. >> >> >> Predrag >> >> P.S. I will escalate MATLAB issue with MathWorks but I don't expect to >> fixed before R2017a. >> >> >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From dougal at gmail.com Wed Oct 26 15:11:01 2016 From: dougal at gmail.com (Dougal Sutherland) Date: Wed, 26 Oct 2016 19:11:01 +0000 Subject: GPU2 upgraded In-Reply-To: References: <11f6d11f7b5ef1519ec010d5562d1d71@imap.srv.cs.cmu.edu> <01c9e61e-ce42-bc8e-a9ca-c21d3895a301@gmail.com> Message-ID: a) You might also want to do export JUPYTER_CONFIG_DIR=/home/scratch/$USER/.jupyter if you use jupyter notebooks and whatnot. b) I'm also not able to import theano when using GPUs on gpu2/gpu3 as of this afternoon. It hangs in a similar way, though I've set the compiledir to be in scratch in my theanorc. I've tracked it down in the debugger to this line or this one (both hang, which one is called depends on theano settings), which call this function or this one . No idea why this started happening or if it's related -- it doesn't seem like it should be hitting nfs at all -- but it seemed to start at the same time. On Wed, Oct 26, 2016 at 7:20 PM Junier Oliva wrote: > Awesome, seems to work for me too. Thanks!! 
> > -Junier > > On Wed, Oct 26, 2016 at 2:08 PM, Dougal Sutherland > wrote: > > Here's a workaround that avoids ipython having its config files / etc on > nfs, it seems to work for me: > > export IPYTHONDIR=/home/scratch/$USER/.ipython > > You can do this in a terminal or put it your .bash_profile / similar to > make it permanent. > > (I guess this means something changed about the nfs server yesterday/today > that broke this.) > > > On Wed, Oct 26, 2016 at 7:03 PM yifei ma wrote: > > Second that on foxconn. It launches but the ipython client won't start. > > Thanks, > Yifei > > > On 10/26/2016 01:03 PM, Junier Oliva wrote: > > Not sure if this at all related, but is ipython broken for anyone else? It > seems to just hang upon launching it on several auton machines (GPU1, GPU2, > LOV4, LOW1). > > Thanks, > Junier > > On Tue, Oct 25, 2016 at 2:14 PM, Dougal Sutherland > wrote: > > The same version of TensorFlow as on gpu3 is now installed on gpu2, along > with cudnn; let me know if there are issues. > > I didn't do a global install of Caffe on either machine, because Caffe is > kind of dumb and doesn't really do global installs. If anyone wants this, > talk to me and we can figure out what makes sense. > > - Dougal > > On Tue, Oct 25, 2016 at 6:09 PM Predrag Punosevac < > predragp at imap.srv.cs.cmu.edu> wrote: > > Dear Autonians, > > As scheduled I upgraded GPU2 to 256 GB of RAM and 4 Titan (Pascal) cards > per Barnabas. Dugal is currently upgrading TensorFlow and Caffe to the > latest and the greatest. Please wait until his e-mail until you hit the > machine. MATLAB is broken just like on GPU3 due to cuda-8.0. I am > working now on GPU1. > > > Predrag > > P.S. I will escalate MATLAB issue with MathWorks but I don't expect to > fixed before R2017a. > > > > > -------------- next part -------------- An HTML attachment was scrubbed... 
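The scratch-directory exports from this thread can be collected into one ~/.bash_profile fragment; a sketch, where the /home/scratch layout and the SCRATCH override are assumptions:

```shell
# Keep IPython/Jupyter state off NFS by pointing both tools at local scratch.
SCRATCH="${SCRATCH:-/home/scratch/$USER}"
export IPYTHONDIR="$SCRATCH/.ipython"
export JUPYTER_CONFIG_DIR="$SCRATCH/.jupyter"
# Create the directories up front so the tools don't fall back to $HOME.
mkdir -p "$IPYTHONDIR" "$JUPYTER_CONFIG_DIR" 2>/dev/null || true
```

Sourcing this once per login makes the workaround permanent instead of per-terminal.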
URL: From hanqis at andrew.cmu.edu Wed Oct 26 15:24:22 2016 From: hanqis at andrew.cmu.edu (Hanqi Sun) Date: Wed, 26 Oct 2016 15:24:22 -0400 Subject: GPU2 upgraded In-Reply-To: References: <11f6d11f7b5ef1519ec010d5562d1d71@imap.srv.cs.cmu.edu> <01c9e61e-ce42-bc8e-a9ca-c21d3895a301@gmail.com> Message-ID: Hi, I am having similar issues with tensorflow as well. When I run a tensorflow program (like python XX.py), it hangs after loading the graph (after executing session.run()). I cannot quit it by hitting Ctrl-Z/Ctrl-C and nothing is printed to the stdout/stderr. Strangely, my program will terminate and do the job (like writing results to files). And after it terminates I am able to see all the strings I print to stdout/stderr. But before its termination I can do nothing about it. I have tried both my local tensorflow and the global one on all three GPU machines. They all had the same problem. Best, Hanqi On Wed, Oct 26, 2016 at 3:11 PM, Dougal Sutherland wrote: > a) You might also want to do > > export JUPYTER_CONFIG_DIR=/home/scratch/$USER/.jupyter > > if you use jupyter notebooks and whatnot. > > > b) I'm also not able to import theano when using GPUs on gpu2/gpu3 as of > this afternoon. It hangs in a similar way, though I've set the compiledir > to be in scratch in my theanorc. I've tracked it down in the debugger to this > line > > or this one > (both > hang, which one is called depends on theano settings), which call this > function > > or this one > . > No idea why this started happening or if it's related -- it doesn't seem > like it should be hitting nfs at all -- but it seemed to start at the same > time. > > > > On Wed, Oct 26, 2016 at 7:20 PM Junier Oliva > wrote: > >> Awesome, seems to work for me too. Thanks!! 
>> >> -Junier >> >> On Wed, Oct 26, 2016 at 2:08 PM, Dougal Sutherland >> wrote: >> >> Here's a workaround that avoids ipython having its config files / etc on >> nfs, it seems to work for me: >> >> export IPYTHONDIR=/home/scratch/$USER/.ipython >> >> You can do this in a terminal or put it your .bash_profile / similar to >> make it permanent. >> >> (I guess this means something changed about the nfs server >> yesterday/today that broke this.) >> >> >> On Wed, Oct 26, 2016 at 7:03 PM yifei ma wrote: >> >> Second that on foxconn. It launches but the ipython client won't start. >> >> Thanks, >> Yifei >> >> >> On 10/26/2016 01:03 PM, Junier Oliva wrote: >> >> Not sure if this at all related, but is ipython broken for anyone else? >> It seems to just hang upon launching it on several auton machines (GPU1, >> GPU2, LOV4, LOW1). >> >> Thanks, >> Junier >> >> On Tue, Oct 25, 2016 at 2:14 PM, Dougal Sutherland >> wrote: >> >> The same version of TensorFlow as on gpu3 is now installed on gpu2, along >> with cudnn; let me know if there are issues. >> >> I didn't do a global install of Caffe on either machine, because Caffe is >> kind of dumb and doesn't really do global installs. If anyone wants this, >> talk to me and we can figure out what makes sense. >> >> - Dougal >> >> On Tue, Oct 25, 2016 at 6:09 PM Predrag Punosevac < >> predragp at imap.srv.cs.cmu.edu> wrote: >> >> Dear Autonians, >> >> As scheduled I upgraded GPU2 to 256 GB of RAM and 4 Titan (Pascal) cards >> per Barnabas. Dugal is currently upgrading TensorFlow and Caffe to the >> latest and the greatest. Please wait until his e-mail until you hit the >> machine. MATLAB is broken just like on GPU3 due to cuda-8.0. I am >> working now on GPU1. >> >> >> Predrag >> >> P.S. I will escalate MATLAB issue with MathWorks but I don't expect to >> fixed before R2017a. >> >> >> >> >> -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From dougal at gmail.com Wed Oct 26 15:38:09 2016 From: dougal at gmail.com (Dougal Sutherland) Date: Wed, 26 Oct 2016 19:38:09 +0000 Subject: GPU2 upgraded In-Reply-To: References: <11f6d11f7b5ef1519ec010d5562d1d71@imap.srv.cs.cmu.edu> <01c9e61e-ce42-bc8e-a9ca-c21d3895a301@gmail.com> Message-ID: Hmm - I tested that tensorflow worked on the mnist example when I built it, but when I run python -m tensorflow.models.image.mnist.convolutional on gpu3 now, it gets to "Initialized!" and then just hangs, not responding to ^C or ^Z and also not leading to any GPU utiliziation according to nvidia-smi; it takes a kill -9 to stop it. On Wed, Oct 26, 2016 at 8:25 PM Hanqi Sun wrote: > Hi, > > I am having similar issues with tensorflow as well. > > When I run a tensorflow program (like python XX.py), it hangs after > loading the graph (after executing session.run()). I cannot quit it by > hitting Ctrl-Z/Ctrl-C and nothing is printed to the stdout/stderr. > > Strangely, my program will terminate and do the job (like writing results > to files). And after it terminates I am able to see all the strings I print > to stdout/stderr. But before its termination I can do nothing about it. > > I have tried both my local tensorflow and the global one on all three GPU > machines. They all had the same problem. > > Best, > Hanqi > > On Wed, Oct 26, 2016 at 3:11 PM, Dougal Sutherland > wrote: > > a) You might also want to do > > export JUPYTER_CONFIG_DIR=/home/scratch/$USER/.jupyter > > if you use jupyter notebooks and whatnot. > > > b) I'm also not able to import theano when using GPUs on gpu2/gpu3 as of > this afternoon. It hangs in a similar way, though I've set the compiledir > to be in scratch in my theanorc. I've tracked it down in the debugger to this > line > > or this one > (both > hang, which one is called depends on theano settings), which call this > function > > or this one > . 
> No idea why this started happening or if it's related -- it doesn't seem > like it should be hitting nfs at all -- but it seemed to start at the same > time. > > > > On Wed, Oct 26, 2016 at 7:20 PM Junier Oliva > wrote: > > Awesome, seems to work for me too. Thanks!! > > -Junier > > On Wed, Oct 26, 2016 at 2:08 PM, Dougal Sutherland > wrote: > > Here's a workaround that avoids ipython having its config files / etc on > nfs, it seems to work for me: > > export IPYTHONDIR=/home/scratch/$USER/.ipython > > You can do this in a terminal or put it your .bash_profile / similar to > make it permanent. > > (I guess this means something changed about the nfs server yesterday/today > that broke this.) > > > On Wed, Oct 26, 2016 at 7:03 PM yifei ma wrote: > > Second that on foxconn. It launches but the ipython client won't start. > > Thanks, > Yifei > > > On 10/26/2016 01:03 PM, Junier Oliva wrote: > > Not sure if this at all related, but is ipython broken for anyone else? It > seems to just hang upon launching it on several auton machines (GPU1, GPU2, > LOV4, LOW1). > > Thanks, > Junier > > On Tue, Oct 25, 2016 at 2:14 PM, Dougal Sutherland > wrote: > > The same version of TensorFlow as on gpu3 is now installed on gpu2, along > with cudnn; let me know if there are issues. > > I didn't do a global install of Caffe on either machine, because Caffe is > kind of dumb and doesn't really do global installs. If anyone wants this, > talk to me and we can figure out what makes sense. > > - Dougal > > On Tue, Oct 25, 2016 at 6:09 PM Predrag Punosevac < > predragp at imap.srv.cs.cmu.edu> wrote: > > Dear Autonians, > > As scheduled I upgraded GPU2 to 256 GB of RAM and 4 Titan (Pascal) cards > per Barnabas. Dugal is currently upgrading TensorFlow and Caffe to the > latest and the greatest. Please wait until his e-mail until you hit the > machine. MATLAB is broken just like on GPU3 due to cuda-8.0. I am > working now on GPU1. > > > Predrag > > P.S. 
I will escalate MATLAB issue with MathWorks but I don't expect to > fixed before R2017a. > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From dougal at gmail.com Thu Oct 27 09:04:02 2016 From: dougal at gmail.com (Dougal Sutherland) Date: Thu, 27 Oct 2016 13:04:02 +0000 Subject: GPU2 upgraded In-Reply-To: References: <11f6d11f7b5ef1519ec010d5562d1d71@imap.srv.cs.cmu.edu> <01c9e61e-ce42-bc8e-a9ca-c21d3895a301@gmail.com> Message-ID: I'm not sure if anyone changed anything, but ipython (without the IPYTHONDIR workaround above), theano with GPUs, and tensorflow -m tensorflow.models.image.mnist.convolutional are all working for me now. On Wed, Oct 26, 2016 at 8:38 PM Dougal Sutherland wrote: > Hmm - I tested that tensorflow worked on the mnist example when I built > it, but when I run > > python -m tensorflow.models.image.mnist.convolutional > > on gpu3 now, it gets to "Initialized!" and then just hangs, not responding > to ^C or ^Z and also not leading to any GPU utiliziation according to > nvidia-smi; it takes a kill -9 to stop it. > > > On Wed, Oct 26, 2016 at 8:25 PM Hanqi Sun wrote: > > Hi, > > I am having similar issues with tensorflow as well. > > When I run a tensorflow program (like python XX.py), it hangs after > loading the graph (after executing session.run()). I cannot quit it by > hitting Ctrl-Z/Ctrl-C and nothing is printed to the stdout/stderr. > > Strangely, my program will terminate and do the job (like writing results > to files). And after it terminates I am able to see all the strings I print > to stdout/stderr. But before its termination I can do nothing about it. > > I have tried both my local tensorflow and the global one on all three GPU > machines. They all had the same problem. > > Best, > Hanqi > > On Wed, Oct 26, 2016 at 3:11 PM, Dougal Sutherland > wrote: > > a) You might also want to do > > export JUPYTER_CONFIG_DIR=/home/scratch/$USER/.jupyter > > if you use jupyter notebooks and whatnot. 
> > > b) I'm also not able to import theano when using GPUs on gpu2/gpu3 as of > this afternoon. It hangs in a similar way, though I've set the compiledir > to be in scratch in my theanorc. I've tracked it down in the debugger to this > line > > or this one > (both > hang, which one is called depends on theano settings), which call this > function > > or this one > . > No idea why this started happening or if it's related -- it doesn't seem > like it should be hitting nfs at all -- but it seemed to start at the same > time. > > > > On Wed, Oct 26, 2016 at 7:20 PM Junier Oliva > wrote: > > Awesome, seems to work for me too. Thanks!! > > -Junier > > On Wed, Oct 26, 2016 at 2:08 PM, Dougal Sutherland > wrote: > > Here's a workaround that avoids ipython having its config files / etc on > nfs, it seems to work for me: > > export IPYTHONDIR=/home/scratch/$USER/.ipython > > You can do this in a terminal or put it your .bash_profile / similar to > make it permanent. > > (I guess this means something changed about the nfs server yesterday/today > that broke this.) > > > On Wed, Oct 26, 2016 at 7:03 PM yifei ma wrote: > > Second that on foxconn. It launches but the ipython client won't start. > > Thanks, > Yifei > > > On 10/26/2016 01:03 PM, Junier Oliva wrote: > > Not sure if this at all related, but is ipython broken for anyone else? It > seems to just hang upon launching it on several auton machines (GPU1, GPU2, > LOV4, LOW1). > > Thanks, > Junier > > On Tue, Oct 25, 2016 at 2:14 PM, Dougal Sutherland > wrote: > > The same version of TensorFlow as on gpu3 is now installed on gpu2, along > with cudnn; let me know if there are issues. > > I didn't do a global install of Caffe on either machine, because Caffe is > kind of dumb and doesn't really do global installs. If anyone wants this, > talk to me and we can figure out what makes sense. 
> > - Dougal > > On Tue, Oct 25, 2016 at 6:09 PM Predrag Punosevac < > > predragp at imap.srv.cs.cmu.edu> wrote: > > Dear Autonians, > > As scheduled I upgraded GPU2 to 256 GB of RAM and 4 Titan (Pascal) cards > > per Barnabas. Dugal is currently upgrading TensorFlow and Caffe to the > > latest and the greatest. Please wait until his e-mail until you hit the > > machine. MATLAB is broken just like on GPU3 due to cuda-8.0. I am > > working now on GPU1. > > > > > > Predrag > > > > P.S. I will escalate MATLAB issue with MathWorks but I don't expect to > > fixed before R2017a. > > > > > > > > > > From predragp at cs.cmu.edu Thu Oct 27 09:12:41 2016 From: predragp at cs.cmu.edu (Predrag Punosevac) Date: Thu, 27 Oct 2016 09:12:41 -0400 Subject: GPU2 upgraded In-Reply-To: References: <11f6d11f7b5ef1519ec010d5562d1d71@imap.srv.cs.cmu.edu> <01c9e61e-ce42-bc8e-a9ca-c21d3895a301@gmail.com> Message-ID: <20161027131241.50BoK9FGQ%predragp@cs.cmu.edu> Dougal Sutherland wrote: > I'm not sure if anyone changed anything, but ipython (without the > IPYTHONDIR workaround above), theano with GPUs, and tensorflow > -m tensorflow.models.image.mnist.convolutional are all working for me now. > My guess is that somebody had some ipython notebooks open (I forgot how it works, but ipython is notorious for opening sockets and freezing the system) which probably killed NFS for a while. Waiting a bit for those problems to resolve on their own before poking the system is a good strategy. Predrag > On Wed, Oct 26, 2016 at 8:38 PM Dougal Sutherland wrote: > > > Hmm - I tested that tensorflow worked on the mnist example when I built > > it, but when I run > > > > python -m tensorflow.models.image.mnist.convolutional > > > > on gpu3 now, it gets to "Initialized!" and then just hangs, not responding > > to ^C or ^Z and also not leading to any GPU utiliziation according to > > nvidia-smi; it takes a kill -9 to stop it.
> > > > > > On Wed, Oct 26, 2016 at 8:25 PM Hanqi Sun wrote: > > > > Hi, > > > > I am having similar issues with tensorflow as well. > > > > When I run a tensorflow program (like python XX.py), it hangs after > > loading the graph (after executing session.run()). I cannot quit it by > > hitting Ctrl-Z/Ctrl-C and nothing is printed to the stdout/stderr. > > > > Strangely, my program will terminate and do the job (like writing results > > to files). And after it terminates I am able to see all the strings I print > > to stdout/stderr. But before its termination I can do nothing about it. > > > > I have tried both my local tensorflow and the global one on all three GPU > > machines. They all had the same problem. > > > > Best, > > Hanqi > > > > On Wed, Oct 26, 2016 at 3:11 PM, Dougal Sutherland > > wrote: > > > > a) You might also want to do > > > > export JUPYTER_CONFIG_DIR=/home/scratch/$USER/.jupyter > > > > if you use jupyter notebooks and whatnot. > > > > > > b) I'm also not able to import theano when using GPUs on gpu2/gpu3 as of > > this afternoon. It hangs in a similar way, though I've set the compiledir > > to be in scratch in my theanorc. I've tracked it down in the debugger to this > > line > > > > or this one > > (both > > hang, which one is called depends on theano settings), which call this > > function > > > > or this one > > . > > No idea why this started happening or if it's related -- it doesn't seem > > like it should be hitting nfs at all -- but it seemed to start at the same > > time. > > > > > > > > On Wed, Oct 26, 2016 at 7:20 PM Junier Oliva > > wrote: > > > > Awesome, seems to work for me too. Thanks!! 
> > > > -Junier > > > > On Wed, Oct 26, 2016 at 2:08 PM, Dougal Sutherland > > wrote: > > > > Here's a workaround that avoids ipython having its config files / etc on > > nfs, it seems to work for me: > > > > export IPYTHONDIR=/home/scratch/$USER/.ipython > > > > You can do this in a terminal or put it your .bash_profile / similar to > > make it permanent. > > > > (I guess this means something changed about the nfs server yesterday/today > > that broke this.) > > > > > > On Wed, Oct 26, 2016 at 7:03 PM yifei ma wrote: > > > > Second that on foxconn. It launches but the ipython client won't start. > > > > Thanks, > > Yifei > > > > > > On 10/26/2016 01:03 PM, Junier Oliva wrote: > > > > Not sure if this at all related, but is ipython broken for anyone else? It > > seems to just hang upon launching it on several auton machines (GPU1, GPU2, > > LOV4, LOW1). > > > > Thanks, > > Junier > > > > On Tue, Oct 25, 2016 at 2:14 PM, Dougal Sutherland > > wrote: > > > > The same version of TensorFlow as on gpu3 is now installed on gpu2, along > > with cudnn; let me know if there are issues. > > > > I didn't do a global install of Caffe on either machine, because Caffe is > > kind of dumb and doesn't really do global installs. If anyone wants this, > > talk to me and we can figure out what makes sense. > > > > - Dougal > > > > On Tue, Oct 25, 2016 at 6:09 PM Predrag Punosevac < > > predragp at imap.srv.cs.cmu.edu> wrote: > > > > Dear Autonians, > > > > As scheduled I upgraded GPU2 to 256 GB of RAM and 4 Titan (Pascal) cards > > per Barnabas. Dugal is currently upgrading TensorFlow and Caffe to the > > latest and the greatest. Please wait until his e-mail until you hit the > > machine. MATLAB is broken just like on GPU3 due to cuda-8.0. I am > > working now on GPU1. > > > > > > Predrag > > > > P.S. I will escalate MATLAB issue with MathWorks but I don't expect to > > fixed before R2017a. 
> > > > > > > > > > From predragp at cs.cmu.edu Thu Oct 27 12:57:48 2016 From: predragp at cs.cmu.edu (Predrag Punosevac) Date: Thu, 27 Oct 2016 12:57:48 -0400 Subject: GPU1 to be powered down at 3:00 PM Message-ID: <20161027165748.FpSTtwU-k%predragp@cs.cmu.edu> I just got the power cables for those 2 non-functional Tesla cards in GPU1. I would like to install them ASAP, which means that I need to power down the server, hopefully for no more than 45 minutes. Predrag From predragp at cs.cmu.edu Thu Oct 27 15:36:59 2016 From: predragp at cs.cmu.edu (Predrag Punosevac) Date: Thu, 27 Oct 2016 15:36:59 -0400 Subject: GPU1 to be powered down at 3:00 PM In-Reply-To: <20161027165748.FpSTtwU-k%predragp@cs.cmu.edu> References: <20161027165748.FpSTtwU-k%predragp@cs.cmu.edu> Message-ID: <20161027193659._06VWF9R6%predragp@cs.cmu.edu> Predrag Punosevac wrote: > I just got the power cables for those 2 non-functional Tesla cards in > GPU1. I would like to install them ASAP which means that I need to power > down the server hopefully for no more than 45 minutes. > > Predrag Folks, This is done! All 4 Tesla K80 GPU cards appear to be fully functional. Running nvidia-smi will show that you actually have 8 GPU devices, as each Tesla K80 consists of 2 K40 cards. If you are doing anything PDE- or ODE-related, this should be your go-to machine due to the properties of the Tesla cards. Deep learning and image processing work should stick to the GPU2 and GPU3 servers. MATLAB is also fully functional on GPU1. Predrag From dougal at gmail.com Fri Oct 28 13:04:24 2016 From: dougal at gmail.com (Dougal Sutherland) Date: Fri, 28 Oct 2016 17:04:24 +0000 Subject: a note about tensorflow and gpu memory usage Message-ID: Hi all, Something that's not necessarily obvious to everyone about tensorflow: if you just run something with tensorflow, it will by default allocate all of the memory on *all* GPUs on the machine. It's pretty unlikely that whatever model you're running is going to need all 48 GB in all 4 cards on gpu{2,3}.
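A quick sketch of the device pinning that the rest of this note explains (the GPU ids and the train.py job below are placeholders, not lab tooling):

```shell
# Limit which cards the CUDA runtime exposes to a job, so tensorflow
# cannot grab memory on every GPU. Ids follow the nvidia-smi ordering.
GPUS="0"                               # e.g. GPUS="1,3" to expose two cards
export CUDA_VISIBLE_DEVICES="$GPUS"
# python train.py   # <- hypothetical job; it would now see only gpu0
```

Putting the assignment on the command line itself, as in CUDA_VISIBLE_DEVICES=0 python train.py, scopes it to a single job instead of the whole shell session.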
:) To stop this behavior, set the environment variable CUDA_VISIBLE_DEVICES to only show tensorflow the relevant devices. For example, "CUDA_VISIBLE_DEVICES=0 python" will then have that tensorflow session use only gpu0. You can check which devices are free with nvidia-smi. Theano will pick a single gpu to use by default; to choose a specific one, use THEANO_FLAGS=device=gpu0. If you're running small models and want to run more than one on a single gpu, you can tell tensorflow to avoid allocating all of a GPU's memory with the methods discussed here. Setting per_process_gpu_memory_fraction lets it allocate a certain portion of the GPU's memory; setting allow_growth=True makes it only claim memory as it needs it. Theano's default behavior is similar to allow_growth=True; you can make it preallocate memory (and often get substantial speedups) with THEANO_FLAGS=device=gpu0,lib.cnmem=1. (lib.cnmem=.5 will allocate half the GPU's memory; lib.cnmem=1024 will allocate 1GB.) - Dougal -------------- next part -------------- An HTML attachment was scrubbed... URL: From predragp at cs.cmu.edu Sat Oct 29 10:35:11 2016 From: predragp at cs.cmu.edu (Predrag Punosevac) Date: Sat, 29 Oct 2016 10:35:11 -0400 Subject: Gogs (Go Git Service) upgraded fully functional Message-ID: <20161029143511.-HDG-I7aA%predragp@cs.cmu.edu> Dear Autonians, Some of you were aware (or even using it) that for several months now we have been running our own Auton Lab Gogs (Go Git Service), a self-hosted Git service, in preparation for a complete migration from CVS and Subversion software versioning and revision control systems to Git. Gogs, as you know, is an open source alternative to GitHub (another one being GitLab, which we also tested). I just upgraded git and Gogs to [git at git ~/gogs/templates]$ git --version git version 2.9.2 [git at git ~/gogs/templates]$ more .VERSION 0.9.99.0915 Gogs is available at http://git.int.autonlab.org from the Auton Lab managed desktops or via x2goclient anywhere in the world.
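For people coming from CVS/Subversion, the basic Git cycle differs mainly in that commits are local until pushed; a self-contained sketch (runs in a throwaway directory, no Gogs account needed):

```shell
# Minimal local git cycle; nothing here touches the network.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "you@example.org"   # placeholder identity
git config user.name "Your Name"
echo "hello" > notes.txt
git add notes.txt                  # stage the file (roughly: cvs add)
git commit -q -m "first commit"    # recorded locally, unlike svn commit
```

With a Gogs account, publishing is then a matter of git remote add origin (using the clone URL shown in the Gogs web UI) followed by git push.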
I can now reproducibly and safely (using ZFS clone features) build and upgrade Gogs. We have a very robust installation running out of a FreeBSD jail on top of a ZFS data set. We take snapshots 4 times a day and have two remote replicas which, with a little work, can be made secondary mirrors. We have also just acquired a new server which will run in NREC and be used as a third remote replication target. Our Gogs server is integrated with the Jenkins continuous integration service, which we also run. Finally, my understanding is that the migration from gmake-magic to cmake is 100% under control. At this point I would like to ask everyone to wind down use of the Auton Lab CVS and Subversion services and prepare for a migration to Git. If you don't already have an account, please e-mail Simon (sheath at andrew.cmu.edu) or me (predragp at cs.cmu.edu) to get one. On behalf of the Git/Jenkins/cmake migration team (Anthony Wertz, Simon Heath, formerly Terence Wong, and yours truly). Predrag Punosevac P.S. We are in the process of creating a user's guide at our DokuWiki. People are welcome to stop by NSH 3119 and get a quick intro. For people who are not familiar with Git we recommend the Pro Git book https://git-scm.com/book/en/v2 From predragp at cs.cmu.edu Sat Oct 29 10:53:59 2016 From: predragp at cs.cmu.edu (Predrag Punosevac) Date: Sat, 29 Oct 2016 10:53:59 -0400 Subject: Gogs (Go Git Service) upgraded fully functional In-Reply-To: <20161029143511.-HDG-I7aA%predragp@cs.cmu.edu> References: <20161029143511.-HDG-I7aA%predragp@cs.cmu.edu> Message-ID: <20161029145359.YltJuDS5-%predragp@cs.cmu.edu> Predrag Punosevac wrote: > Dear Autonians, > > Some of you were aware (or even using) that for several months now we Yes, e-mail notification now works as well :) Predrag > were running our own Auton Lab Gogs (Go Git Service), a self-hosted Git > service in preparation for a complete migration from CVS and Subversion > software versioning and revision control systems to Git.
Gogs as you > know is an open source alternative to GitHub (another one being GitLab > which we also tested). > > I just upgraded git and Gogs to > > [git at git ~/gogs/templates]$ git --version > git version 2.9.2 > [git at git ~/gogs/templates]$ more .VERSION > 0.9.99.0915 > > Gogs is available at > > http://git.int.autonlab.org > > from the Auton Lab managed desktops or via x2goclient anywhere on the > world. > > I can reproducibly and safely (using ZFS clone features) build and > upgrade Gogs now. We have a very robust installation running out of > FreeBSD jail on the top of a ZFS data set. We are talking snapshots 4 > times a day and have two remote replicates which with little work can be > made secondary mirrors. We have also just acquired a new server which > will run in NREC and be used as a third remote replication target. > > Our Gogs server is integrated with Jenkins continuous integration > service which we also run. Finally, my understanding is that we have > 100% under control migration from gmake-magic to cmake. > > At this point I would like to ask everyone to wind use of the Auton Lab > CVS and Subversion services and prepare for a migration to Git. If you > don't already have an account please e-mail Simon > (sheath at andrew.cmu.edu) or I (prdragp at cs.cmu.edu) to get one. > > In behalf of Git/Jenkins/cmake migration team (Anthony Wertz, Simon > Heath, formerly Terence Wong, and yours truly). > > > Predrag Punosevac > > P.S. We are in the process of creating user's guide at our DokuWiki. > People are welcome to stop by NSH 3119 and get quick intro. 
For people > who are not familiar with Git we recommend Pro Git book > > https://git-scm.com/book/en/v2 From kandasamy at cmu.edu Sat Oct 29 21:02:34 2016 From: kandasamy at cmu.edu (Kirthevasan Kandasamy) Date: Sat, 29 Oct 2016 21:02:34 -0400 Subject: GPU3 back in business In-Reply-To: References: <20161019172209.esq71ATV8%predragp@cs.cmu.edu> <20161019183651.3Xy0murhr%predragp@cs.cmu.edu> <7B1F10ED8F1045F6.b801b0a4-484f-4c6e-9558-4a83a0cb33ec@mail.outlook.com> <20161019201053.KHd8koxJl%predragp@cs.cmu.edu> <20161021193727.9hJzyE8s5%predragp@cs.cmu.edu> <20161021195032.EWGQfn13b%predragp@cs.cmu.edu> Message-ID: Predrag, Following up on matlab and cuda 7.5. - It seems as if Matlab is not installed on gpu2 and gpu3. Can we have it installed even without GPU support. In most cases, when I use matlab for my gpu experiments, I will be executing an external python command which can call the GPU. - Also, on gpu2/3 as Dougal suggested can we also install cuda 7.5, but default to 8.0? thanks, samy On Fri, Oct 21, 2016 at 5:07 PM, Dougal Sutherland wrote: > I don't think it would be a bad thing to have both versions of cuda > installed and default to 8.0. To use 7.5 for matlab you probably just have > to write a wrapper script to set LD_LIBRARY_FLAGS appropriately. > > On Fri, Oct 21, 2016 at 9:21 PM Kirthevasan Kandasamy > wrote: > >> Hi all, >> >> I was planning on using Matlab with GPUs for one of my projects. >> Can we please keep gpu2 as it is for now? >> >> samy >> >> On Fri, Oct 21, 2016 at 3:54 PM, Barnabas Poczos >> wrote: >> >> Sounds good. Let us have tensorflow system wide on all GPU nodes. We >> can worry about Matlab later. 
>> >> Best, >> B >> ====================== >> Barnabas Poczos, PhD >> Assistant Professor >> Machine Learning Department >> Carnegie Mellon University >> >> >> On Fri, Oct 21, 2016 at 3:50 PM, Predrag Punosevac >> wrote: >> > Barnabas Poczos wrote: >> > >> >> Hi Predrag, >> >> >> >> If there is no other solution, then I think it is OK not to have >> >> Matlab on GPU2 and GPU3. >> >> Tensorflow has higher priority on these nodes. >> > >> > We could possibly have multiple CUDA libraries for different versions >> > but that is going to bite us in the rear end quickly. People who want >> > to use MATLAB with GPUs will have to live with GPU1 probably until >> > the Spring release of MATLAB. >> > >> > Predrag >> > >> >> >> >> Best, >> >> Barnabas >> >> >> >> >> >> >> >> >> >> ====================== >> >> Barnabas Poczos, PhD >> >> Assistant Professor >> >> Machine Learning Department >> >> Carnegie Mellon University >> >> >> >> >> >> On Fri, Oct 21, 2016 at 3:37 PM, Predrag Punosevac < >> predragp at cs.cmu.edu> wrote: >> >> > Dougal Sutherland wrote: >> >> > >> >> > >> >> > Sorry that I am late for the party. This is my interpretation of >> what we >> >> > should do. >> >> > >> >> > 1. I will go back to CUDA 8.0, which will break MATLAB. We have to >> live >> >> > with it. Barnabas please OK this. I will work with MathWorks for >> this to >> >> > be fixed for the 2017a release. >> >> > >> >> > 2. Then I could install TensorFlow compiled by Dougal system wide. >> >> > Please Dougal after I upgrade back to 8.0 recompile it again using >> CUDA >> >> > 8.0. I could give you the root password so that you can compile and >> >> > install directly. >> >> > >> >> > 3. If everyone is OK with the above I will pull the trigger on GPU3 at >> >> > 4:30PM and upgrade to 8.0 >> >> > >> >> > 4. MATLAB will be broken on GPU2 as well after I put Titan cards in >> during >> >> > the October 25 power outage. >> >> > >> >> > Predrag >> >> > >> >> > >> >> > >> >> > >> >> > >> >> > >> >> >> Heh.
:) >> >> >> >> >> >> An explanation: >> >> >> >> >> >> - Different nvidia gpu architectures are called "compute >> capabilities". >> >> >> This is a number that describes the behavior of the card: the >> maximum size >> >> >> of various things, which API functions it supports, etc. There's >> a >> >> >> reference here >> >> >> > and_specifications>, >> >> >> but it shouldn't really matter. >> >> >> - When CUDA compiles code, it targets a certain architecture, >> since it >> >> >> needs to know what features to use and whatnot. I *think* that >> if you >> >> >> compile for compute capability x, it will work on a card with >> compute >> >> >> capability y approximately iff x <= y. >> >> >> - Pascal Titan Xs, like gpu3 has, have compute capability 6.1. >> >> >> - CUDA 7.5 doesn't know about compute capability 6.1, so if you >> ask to >> >> >> compile for 6.1 it crashes. >> >> >> - Theano by default tries to compile for the capability of the >> card, but >> >> >> can be configured to compile for a different capability. >> >> >> - Tensorflow asks for a list of capabilities to compile for when >> you >> >> >> build it in the first place. >> >> >> >> >> >> >> >> >> On Fri, Oct 21, 2016 at 8:17 PM Dougal Sutherland >> wrote: >> >> >> >> >> >> > They do work with 7.5 if you specify an older compute >> architecture; it's >> >> >> > just that their actual compute capability of 6.1 isn't supported >> by cuda >> >> >> > 7.5. Theano is thrown off by this, for example, but it can be >> fixed by >> >> >> > telling it to pass compute capability 5.2 (for example) to nvcc. >> I don't >> >> >> > think that this was my problem with building tensorflow on 7.5; >> I'm not >> >> >> > sure what that was. >> >> >> > >> >> >> > On Fri, Oct 21, 2016, 8:11 PM Kirthevasan Kandasamy < >> kandasamy at cmu.edu> >> >> >> > wrote: >> >> >> > >> >> >> > Thanks Dougal. I'll take a look at this and get back to you.
>> >> >> > So are you suggesting that this is an issue with Titan Xs not >> being >> >> >> > compatible with 7.5? >> >> >> > >> >> >> > On Fri, Oct 21, 2016 at 3:08 PM, Dougal Sutherland < >> dougal at gmail.com> >> >> >> > wrote: >> >> >> > >> >> >> > I installed it in my scratch directory (not sure if there's a >> global >> >> >> > install?). The main thing was to put its cache on scratch; it got >> really >> >> >> > upset when the cache directory was on NFS. (Instructions at the >> bottom of >> >> >> > my previous email.) >> >> >> > >> >> >> > On Fri, Oct 21, 2016, 8:04 PM Barnabas Poczos < >> bapoczos at cs.cmu.edu> wrote: >> >> >> > >> >> >> > That's great! Thanks Dougal. >> >> >> > >> >> >> > As I remember, bazel was not installed correctly previously on >> GPU3. Do >> >> >> > you know what went wrong with it before and why it works now? >> >> >> > >> >> >> > Thanks, >> >> >> > Barnabas >> >> >> > ====================== >> >> >> > Barnabas Poczos, PhD >> >> >> > Assistant Professor >> >> >> > Machine Learning Department >> >> >> > Carnegie Mellon University >> >> >> > >> >> >> > >> >> >> > On Fri, Oct 21, 2016 at 2:03 PM, Dougal Sutherland < >> dougal at gmail.com> >> >> >> > wrote: >> >> >> > > I was just able to build tensorflow 0.11.0rc0 on gpu3! I used >> the cuda >> >> >> > 8.0 >> >> >> > > install, and it built fine. So additionally installing 7.5 was >> probably >> >> >> > not >> >> >> > > necessary; in fact, cuda 7.5 doesn't know about the 6.1 compute >> >> >> > architecture >> >> >> > > that the Titan Xs use, so Theano at least needs to be manually >> told to >> >> >> > use >> >> >> > > an older architecture. >> >> >> > > >> >> >> > > A pip package is in ~dsutherl/tensorflow-0.11.0rc0-py2-none-any.whl. >> I >> >> >> > think >> >> >> > > it should work fine with the cudnn in my scratch directory.
>> >> >> > > >> >> >> > > You should probably install it to scratch, either running this >> first to >> >> >> > put >> >> >> > > libraries in your scratch directory or using a virtualenv or >> something: >> >> >> > > export PYTHONUSERBASE=/home/scratch/$USER/.local >> >> >> > > >> >> >> > > You'll need this to use the library and probably to install it: >> >> >> > > export >> >> >> > > >> >> >> > LD_LIBRARY_PATH=/home/scratch/dsutherl/cudnn-8.0-5.1/cuda/lib64:"$LD_LIBRARY_PATH" >> >> >> > > >> >> >> > > To install: >> >> >> > > pip install --user ~dsutherl/tensorflow-0.11.0rc0-py2-none-any.whl >> >> >> > > (remove --user if you're using a virtualenv) >> >> >> > > >> >> >> > > (A request: I'm submitting to ICLR in two weeks, and for some >> of the >> >> >> > models >> >> >> > > I'm running gpu3's cards are 4x the speed of gpu1 or 2's. So >> please don't >> >> >> > > run a ton of stuff on gpu3 unless you're working on a deadline >> too.) >> >> >> > > >> >> >> > > >> >> >> > > >> >> >> > > Steps to install it, for the future: >> >> >> > > >> >> >> > > Install bazel in your home directory: >> >> >> > > >> >> >> > > wget >> >> >> > > >> >> >> > https://github.com/bazelbuild/bazel/releases/download/0.3.2/bazel-0.3.2-installer-linux-x86_64.sh >> >> >> > > bash bazel-0.3.2-installer-linux-x86_64.sh >> --prefix=/home/scratch/$USER >> >> >> > > --base=/home/scratch/$USER/.bazel >> >> >> > > >> >> >> > > Configure bazel to build in scratch. There's probably a better >> way to do >> >> >> > > this, but this works: >> >> >> > > >> >> >> > > mkdir /home/scratch/$USER/.cache >> >> >> > > ln -s /home/scratch/$USER/.cache/bazel ~/.cache/bazel >> >> >> > > >> >> >> > > Build tensorflow.
Note that builds from git checkouts don't >> work, because >> >> >> > > they assume a newer version of git than is on gpu3: >> >> >> > > >> >> >> > > cd /home/scratch/$USER >> >> >> > > wget >> >> >> > > tar xf >> >> >> > > cd tensorflow-0.11.0rc0 >> >> >> > > ./configure >> >> >> > > >> >> >> > > This is an interactive script that doesn't seem to let you pass >> >> >> > arguments or >> >> >> > > anything. It's obnoxious. >> >> >> > > Use the default python >> >> >> > > don't use cloud platform or hadoop file system >> >> >> > > use the default site-packages path if it asks >> >> >> > > build with GPU support >> >> >> > > default gcc >> >> >> > > default Cuda SDK version >> >> >> > > specify /usr/local/cuda-8.0 >> >> >> > > default cudnn version >> >> >> > > specify $CUDNN_DIR from use-cudnn.sh, e.g. >> >> >> > > /home/scratch/dsutherl/cudnn-8.0-5.1/cuda >> >> >> > > Pascal Titan Xs have compute capability 6.1 >> >> >> > > >> >> >> > > bazel build -c opt --config=cuda >> >> >> > > //tensorflow/tools/pip_package:build_pip_package >> >> >> > > bazel-bin/tensorflow/tools/pip_package/build_pip_package ./ >> >> >> > > A .whl file, e.g. tensorflow-0.11.0rc0-py2-none-any.whl, is >> put in the >> >> >> > > directory you specified above. >> >> >> > > >> >> >> > > >> >> >> > > - Dougal >> >> >> > > >> >> >> > > >> >> >> > > On Fri, Oct 21, 2016 at 6:14 PM Kirthevasan Kandasamy < >> kandasamy at cmu.edu >> >> >> > > >> >> >> > > wrote: >> >> >> > >> >> >> >> > >> Predrag, >> >> >> > >> >> >> >> > >> Any updates on gpu3? >> >> >> > >> I have tried both tensorflow and chainer and in both cases the >> problem >> >> >> > >> seems to be with cuda >> >> >> > >> >> >> >> > >> On Wed, Oct 19, 2016 at 4:10 PM, Predrag Punosevac < >> predragp at cs.cmu.edu >> >> >> > > >> >> >> > >> wrote: >> >> >> > >>> >> >> >> > >>> Dougal Sutherland wrote: >> >> >> > >>> >> >> >> > >>> > I tried for a while. I failed. >> >> >> > >>> > >> >> >> > >>> >> >> >> > >>> Damn this doesn't look good. 
I guess back to the drawing >> board. Thanks >> >> >>> for the quick feedback. >> >> >>> >> >> >>> Predrag >> >> >>> >> >> >>> > Version 0.10.0 fails immediately on build: "The specified >> >> >>> > --crosstool_top >> >> >>> > '@local_config_cuda//crosstool:crosstool' is not a valid >> >> >>> > cc_toolchain_suite >> >> >>> > rule." Apparently this is because 0.10 required an older >> version of >> >> >>> > bazel ( >> >> >>> > https://github.com/tensorflow/tensorflow/issues/4368), and >> I don't >> >> > have >> >> >>> > the >> >> >>> > energy to install an old version of bazel. >> >> >>> > >> >> >>> > Version 0.11.0rc0 gets almost done and then complains about >> no such >> >> >>> > file or >> >> >>> > directory for libcudart.so.7.5 (which is there, where I told >> >> > tensorflow >> >> >>> > it >> >> >>> > was...). >> >> >>> > >> >> >>> > Non-release versions from git fail immediately because they >> call git >> >> > -C >> >> >>> > to >> >> >>> > get version info, which is only in git 1.9 (we have 1.8). >> >> >>> > >> >> >>> > >> >> >>> > Some other notes: >> >> >>> > - I made a symlink from ~/.cache/bazel to >> >> >>> > /home/scratch/$USER/.cache/bazel, >> >> >>> > because bazel is the worst. (It complains about doing >> things on NFS, >> >> >>> > and >> >> >>> > hung for me [clock-related?], and I can't find a global >> config file >> >> > or >> >> >>> > anything to change that in; it seems like there might be >> one, but >> >> > their >> >> >>> > documentation is terrible.) >> >> >>> > >> >> >>> > - I wasn't able to use the actual Titan X compute >> capability of 6.1, >> >> >>> > because that requires cuda 8; I used 5.2 instead. Probably >> not a huge >> >> >>> > deal, >> >> >>> > but I don't know.
>> >> >> > >>> > >> >> >> > >>> > - I tried explicitly including /usr/local/cuda/lib64 in >> >> > LD_LIBRARY_PATH >> >> > >>> > and >> >> > >>> > set CUDA_HOME to /usr/local/cuda before building, hoping >> that would >> >> > >>> > help >> >> > >>> > with the 0.11.0rc0 problem, but it didn't. >> >> > >> >> >> > >> >> >> > > >> >> > >> >> >> > >> >> >> > >> >> >> From predragp at cs.cmu.edu Sat Oct 29 21:52:49 2016 From: predragp at cs.cmu.edu (Predrag Punosevac) Date: Sat, 29 Oct 2016 21:52:49 -0400 Subject: GPU3 back in business In-Reply-To: References: <20161019183651.3Xy0murhr%predragp@cs.cmu.edu> <7B1F10ED8F1045F6.b801b0a4-484f-4c6e-9558-4a83a0cb33ec@mail.outlook.com> <20161019201053.KHd8koxJl%predragp@cs.cmu.edu> <20161021193727.9hJzyE8s5%predragp@cs.cmu.edu> <20161021195032.EWGQfn13b%predragp@cs.cmu.edu> Message-ID: <20161030015249.2Y1mt3H_h%predragp@cs.cmu.edu> Kirthevasan Kandasamy wrote: > Predrag, > > Following up on matlab and cuda 7.5. > - It seems as if Matlab is not installed on gpu2 and gpu3. Can we have it > installed even without GPU support? In most cases, when I use matlab for my > gpu experiments, I will be executing an external python command which can > call the GPU. > - Also, on gpu2/3 as Dougal suggested can we also install cuda 7.5, but > default to 8.0? > > thanks, > samy MATLAB is installed. The licensing manager was not started, per Barnabas. I started it now because you insist on it. I really don't appreciate this back and forth discussion on Saturday evening over something I believed we agreed on. My understanding was that you guys wanted the latest and greatest TensorFlow on Titan X (Pascal) cards. That requires CUDA 8.0. MATLAB is broken on CUDA 8.0 and it will freeze if GPUs are involved no matter how calls are made. MATLAB is only tested on CUDA 7.5 (not by me but by people at MathWorks).
MATLAB is fully functional on GPU1, which has 8 Tesla K80 (Kepler) GPU cards. We all agreed that we can live with MATLAB only on GPU1. Those were the last marching orders of Dr. Poczos. Predrag P.S. If we freeze servers tonight, a bunch of image processing guys will be screaming at me until Monday. From kandasamy at cmu.edu Sat Oct 29 22:04:55 2016 From: kandasamy at cmu.edu (Kirthevasan Kandasamy) Date: Sat, 29 Oct 2016 22:04:55 -0400 Subject: GPU3 back in business In-Reply-To: <20161030015249.2Y1mt3H_h%predragp@cs.cmu.edu> References: <20161019183651.3Xy0murhr%predragp@cs.cmu.edu> <7B1F10ED8F1045F6.b801b0a4-484f-4c6e-9558-4a83a0cb33ec@mail.outlook.com> <20161019201053.KHd8koxJl%predragp@cs.cmu.edu> <20161021193727.9hJzyE8s5%predragp@cs.cmu.edu> <20161021195032.EWGQfn13b%predragp@cs.cmu.edu> <20161030015249.2Y1mt3H_h%predragp@cs.cmu.edu> Message-ID: Hi Predrag, Thanks for the prompt response. The MATLAB issue wasn't really urgent. I sent the email now, lest I forget, but it would have worked well had you looked into it on Monday. samy On Sat, Oct 29, 2016 at 9:52 PM, Predrag Punosevac wrote: > Kirthevasan Kandasamy wrote: > > > Predrag, > > > > Following up on matlab and cuda 7.5. > > - It seems as if Matlab is not installed on gpu2 and gpu3. Can we have it > > installed even without GPU support? In most cases, when I use matlab for > my > > gpu experiments, I will be executing an external python command which can > > call the GPU. > > - Also, on gpu2/3 as Dougal suggested can we also install cuda 7.5, but > > default to 8.0? > > > > thanks, > > samy > > > MATLAB is installed. The licensing manager was not started, per Barnabas. > I started it now because you insist on it. I really don't appreciate > this back and forth discussion on Saturday evening over something I > believed we agreed on. > > My understanding was that you guys wanted the latest and greatest TensorFlow > on Titan X (Pascal) cards. That requires CUDA 8.0.
MATLAB is broken on CUDA > 8.0 and it will freeze if GPUs are involved no matter how calls are > made. MATLAB is only tested on CUDA 7.5 (not by me but by people at > MathWorks). MATLAB is fully functional on GPU1, which has 8 Tesla K80 > (Kepler) GPU cards. We all agreed that we can live with MATLAB only on > GPU1. Those were the last marching orders of Dr. Poczos. > > Predrag > > P.S. If we freeze servers tonight, a bunch of image processing guys will be > screaming at me until Monday. >
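For future reference, the environment setup that the install instructions in this thread rely on can be collected in one place. This is a sketch only: the cuDNN path and scratch layout are the ones quoted by Dougal, specific to those machines, and may not exist elsewhere.

```python
# Collected sketch of the scratch-install environment from the thread.
# The cudnn path and /home/scratch layout are machine-specific
# assumptions taken from Dougal's messages above.
import os

scratch = "/home/scratch/" + os.environ.get("USER", "demo")

# Keep pip --user installs (and their caches) off the NFS-mounted home:
os.environ["PYTHONUSERBASE"] = scratch + "/.local"

# Make the privately unpacked cuDNN visible to TensorFlow at runtime:
cudnn_lib = "/home/scratch/dsutherl/cudnn-8.0-5.1/cuda/lib64"
old_path = os.environ.get("LD_LIBRARY_PATH", "")
os.environ["LD_LIBRARY_PATH"] = cudnn_lib + ((":" + old_path) if old_path else "")

print(os.environ["PYTHONUSERBASE"].endswith("/.local"))  # -> True
```

With these variables exported (the shell equivalent of the two assignments), the `pip install --user` command quoted in the thread lands under the scratch directory instead of the NFS home.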