The Auton Lab cluster important info

Wed Mar 11 00:44:26 EDT 2020

Dear Autonians,

This is the first follow-up to the feedback received in response to the
email sent earlier by our director, Dr. Dubrawski, regarding pandemic
preparedness. Nick has raised a valid concern which I can address
immediately. Please read on.

Auton Lab users who don't have Auton Lab supported desktops have at
their disposal three shell gateways (listed with their SSH host key
fingerprints):

bash.autonlab.org SHA256:Pf/uiR0Hzw9HpSNaf3/fRXon9gdXFes5KP7HEobNaW4
lop2.autonlab.org SHA256:LiG0+LN6Tf5EQZjZatD/WDYF2iV046y+Lnz1EXC+EXY
lop1.autonlab.org SHA256:pvrXGlYOcrBOtI5b7xt4sItRkIbqRMhJ+qLRlTrgIts
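The fingerprints above are what ssh prints when you first connect to a
gateway, so comparing by eye is usually enough. As a minimal sketch of how
such a fingerprint is derived, assuming you already have the host's public
key as a line in the usual "<key-type> <base64-blob> [comment]" form (the
function names below are hypothetical, not an Auton Lab tool):

```python
import base64
import hashlib


def ssh_sha256_fingerprint(public_key_line: str) -> str:
    """Return the OpenSSH-style SHA256 fingerprint of one public key line."""
    # Expected form: "<key-type> <base64-blob> [comment]".
    # For ssh-keyscan output, drop the leading host name field first.
    fields = public_key_line.split()
    if len(fields) < 2:
        raise ValueError("expected '<key-type> <base64-blob> [comment]'")
    blob = base64.b64decode(fields[1])
    digest = hashlib.sha256(blob).digest()
    # OpenSSH prints the digest as base64 with the '=' padding removed.
    return "SHA256:" + base64.b64encode(digest).decode("ascii").rstrip("=")


def matches(public_key_line: str, published: str) -> bool:
    """True if the key's fingerprint equals the published one."""
    return ssh_sha256_fingerprint(public_key_line) == published
```

If matches() returns False for a gateway, do not accept the host key;
report it instead.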

lop1 is only to be used by legacy account holders who have their home
directories on /zfsauton. Auton Lab members who have Auton Lab
supported Linux desktops should use them as their shell gateways. That
would relieve the shell gateways for the members who have no other choice.

At the moment, the specs of the most important Auton Lab computing nodes
are as follows:

                CPU cores       RAM (GB)        GPU
CPU nodes:
ari             32              520
athena          32              520
foxconn         32              384
lov1            88              764
lov2            88              764
lov3            64              256
lov4            64              256
lov5            88              764
lov6            88              384
low1            48              520

GPU nodes:
gpu1            24              128             4xTesla K80
gpu2            24              256             4xTitan X
gpu3            32              256             4xTitan X
gpu4            32              256             4xTitan X
gpu5            32              256             4xTitan X
gpu6            32              256             4xTITAN Xp
gpu7            32              256             4xTITAN Xp (on reserve)
gpu8            32              256             4xTITAN Xp
gpu9            32              256             4xTITAN Xp
gpu10           32              192             4xGeForce GTX 1080Ti
gpu11           40              96              4xGeForce GTX 1080Ti
gpu12           40              96              4xGeForce GTX 1080Ti
gpu13           40              96              4xGeForce GTX 1080Ti
gpu14           40              96              4xGeForce RTX 2080Ti
gpu15           40              192             4xGeForce RTX 2080Ti
gpu16           40              192             4xGeForce RTX 2080Ti
gpu17           40              192             4xGeForce RTX 2080Ti
gpu18           40              192             4xGeForce RTX 2080Ti
gpu19           40              192             4xGeForce RTX 2080Ti
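A plain-text table like the one above is easy to filter with a few lines of
Python. The following is only a sketch: the Node type, parse_nodes() and
select() are hypothetical names, and SAMPLE is a short excerpt of the table
rather than the full list.

```python
from typing import List, NamedTuple


class Node(NamedTuple):
    name: str
    cores: int
    ram_gb: int
    gpus: str  # empty string for CPU-only nodes


def parse_nodes(table: str) -> List[Node]:
    """Parse rows of the form '<name> <cores> <ram> [gpu description]'."""
    nodes = []
    for line in table.splitlines():
        fields = line.split(None, 3)
        # Skip headers, section labels, blank lines and malformed rows.
        if len(fields) < 3 or not (fields[1].isdigit() and fields[2].isdigit()):
            continue
        gpus = fields[3].strip() if len(fields) > 3 else ""
        nodes.append(Node(fields[0], int(fields[1]), int(fields[2]), gpus))
    return nodes


def select(nodes, min_cores=0, min_ram_gb=0, need_gpu=False):
    """Return the names of nodes that meet the given minimum specs."""
    return [
        n.name
        for n in nodes
        if n.cores >= min_cores
        and n.ram_gb >= min_ram_gb
        and (bool(n.gpus) or not need_gpu)
    ]


# Excerpt of the table, for illustration only.
SAMPLE = """\
                CPU cores       RAM             GPU
CPU nodes:
ari             32              520
lov1            88              764
lov3            64              256
GPU nodes:
gpu1            24              128             4xTesla K80
gpu15           40              192             4xGeForce RTX 2080Ti
"""
```

For example, select(parse_nodes(SAMPLE), min_cores=64) picks out lov1 and
lov3 from the excerpt.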

Nick Gisolfi <ngisolfi at cs.cmu.edu> wrote:

> I'm fine working remotely, so there isn't anything that needs to
> change in order for me to continue working in the case of a pandemic.
> 
> One thing that may be useful -- is there a CLI command to get a list of

I have edited the message of the day (MOTD) on bash and lop2, which will
now display the names of the available computing nodes and their specs. A
few older machines are left off the list, which will be regularly updated.
I am planning to add more resources from the machines in GHC previously
used by Dr. Barnabas. We also have 110K worth of hardware on the way and a
plan to spend another 30-40K.

> the servers I am allowed to use?  I often find the lov* machines are
> full of users.  I usually work in ari / foxconn in those cases.  I am

The Auton Lab users are encouraged to use

http://monit.autonlab.org 

Read access:

username: auton
password: Dr.Who

to get a quick, rough up/down view of lab resources along with CPU/RAM
utilization. Unfortunately, monit doesn't support displaying GPU loads.
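Since monit does not show GPU loads, one way to check them is to log in to
a GPU node and ask nvidia-smi directly. The sketch below assumes nvidia-smi
is installed on the node; parse_gpu_csv() and query_local_gpus() are
hypothetical helper names, and the sample numbers used to exercise the
parser are made up.

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi, in this order.
QUERY = "index,name,utilization.gpu,memory.used,memory.total"


def _to_int(value):
    """Convert a field to int; return None for values like '[N/A]'."""
    try:
        return int(value)
    except ValueError:
        return None


def parse_gpu_csv(text):
    """Parse 'nvidia-smi --format=csv,noheader,nounits' output into dicts."""
    rows = []
    for rec in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(rec) != 5:
            continue  # blank or unexpected line
        index, name, util, used, total = (field.strip() for field in rec)
        rows.append({
            "index": _to_int(index),
            "name": name,
            "util_pct": _to_int(util),
            "mem_used_mib": _to_int(used),
            "mem_total_mib": _to_int(total),
        })
    return rows


def query_local_gpus():
    """Run nvidia-smi on the current machine and return the parsed rows."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    )
    return parse_gpu_csv(result.stdout)
```

Calling query_local_gpus() on a GPU node returns one dictionary per card; a
card whose util_pct is near 0 and whose mem_used_mib is small is probably
free.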

> not intimately familiar with how the servers operate, but I've
> noticed that the following machines have full swap memory -- it seems
> like this even makes something as simple as changing directories hang.
> 
> Low1: Full swap memory

41 GB of RAM used out of 520 GB

> Lov1: Full swap memory

Just checked: "only" 321 GB of RAM is used out of 764 GB. Changing
directories was not slow for me.

> Lov2: Full swap memory

32 GB of RAM used out of 764 GB of RAM

> Lov4: Full swap memory
> 

97 GB of RAM used out of 256 GB

Our machines have very long uptimes, often over a year. Software,
including operating systems, has bugs that lead to memory leaks. Please
report such problems; I have no problem rebooting machines. Stale NFS file
handles are another source of the problem.
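RAM usage and swap usage are separate counters, so a node can have plenty
of free RAM while its swap is full. When reporting a problem, it helps to
include both. A minimal sketch, assuming a Linux node where /proc/meminfo
provides MemAvailable (kernel 3.14 or later); parse_meminfo() and
usage_gb() are hypothetical names:

```python
def parse_meminfo(text):
    """Parse /proc/meminfo text into {field: value}; sizes are in kB."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def usage_gb(info):
    """Summarize RAM and swap usage in GB (1 GB = 1024 * 1024 kB here)."""
    kb_per_gb = 1024 * 1024
    ram_used = info["MemTotal"] - info["MemAvailable"]
    swap_used = info["SwapTotal"] - info["SwapFree"]
    return {
        "ram_used_gb": round(ram_used / kb_per_gb, 1),
        "ram_total_gb": round(info["MemTotal"] / kb_per_gb, 1),
        "swap_used_gb": round(swap_used / kb_per_gb, 1),
        "swap_total_gb": round(info["SwapTotal"] / kb_per_gb, 1),
    }


# Usage on a node:
#   with open("/proc/meminfo") as f:
#       print(usage_gb(parse_meminfo(f.read())))
```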

I am planning to move /zfsauton/data and /zfsauton/project to the new file
server ourea.int.autonlab.org, bought a few months ago. The server is up
and running, but I was reluctant to replicate the ZFS datasets in the
middle of the semester. The new file server even has a dedicated NVMe
device for the SLOG:

https://jrs-s.net/2019/05/02/zfs-sync-async-zil-slog/

Unfortunately, the Auton Lab currently doesn't have a written user
agreement, data use agreement, or software developer non-disclosure
agreement. These might be put together on an emergency basis in response
to misuse of resources. At this point at least half a dozen GPU servers
are hogged by people who are running CPU-intensive jobs. This is
absolutely unacceptable, and those members have to stop doing that now.

> I don't know if this is something worth rebooting over.
> 

I just rebooted gpu5 for a similar issue.

> I would also recommend encouraging everyone in the lab to use the
> Auton Slack team.  I think Predrag and Kyle are admins on the account
> (?) so they may need to manually add other lab members.  This seems
> like a great way to stay in touch (in real-time) in the event of
> prolonged off-campus work.

I will be logged in to Slack regularly.

Predrag

> 
> If I think of anything else, I'll let you know.
> 
> - Nick
> 
> C: 484.553.2708