The Auton Lab cluster important info
Predrag Punosevac
predragp at andrew.cmu.edu
Wed Mar 11 00:44:26 EDT 2020
Dear Autonians,
This is the first follow-up to the feedback received in response to the
email sent earlier by our director, Dr. Dubrawski, regarding pandemic
preparedness. Nick has raised a valid concern which I can address
immediately. Please read on.
Auton Lab users who don't have Auton Lab supported desktops have at
their disposal three shell gateways:
bash.autonlab.org SHA256:Pf/uiR0Hzw9HpSNaf3/fRXon9gdXFes5KP7HEobNaW4
lop2.autonlab.org SHA256:LiG0+LN6Tf5EQZjZatD/WDYF2iV046y+Lnz1EXC+EXY
lop1.autonlab.org SHA256:pvrXGlYOcrBOtI5b7xt4sItRkIbqRMhJ+qLRlTrgIts
lop1 is only to be used by legacy account holders who have their home
directories on /zfsauton. Auton Lab members who have Auton Lab
supported Linux desktops should use them as their shell gateways. This
would relieve the shell gateways for members who have no other choice.
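
The fingerprints listed above can be compared against what a gateway
presents before trusting it. What follows is only a minimal sketch, not
an official lab tool: it assumes you captured one line of `ssh-keyscan`
output (format "host keytype base64key") and that the published values
are OpenSSH SHA256 fingerprints of whichever host key type the server
offers. The function names are hypothetical.

```python
import base64
import hashlib

# Fingerprints copied from the gateway list in this message.
PUBLISHED = {
    "bash.autonlab.org": "SHA256:Pf/uiR0Hzw9HpSNaf3/fRXon9gdXFes5KP7HEobNaW4",
    "lop2.autonlab.org": "SHA256:LiG0+LN6Tf5EQZjZatD/WDYF2iV046y+Lnz1EXC+EXY",
    "lop1.autonlab.org": "SHA256:pvrXGlYOcrBOtI5b7xt4sItRkIbqRMhJ+qLRlTrgIts",
}


def fingerprint_from_keyscan_line(line: str) -> str:
    """Compute the OpenSSH-style SHA256 fingerprint of one ssh-keyscan line.

    The fingerprint is the unpadded base64 of the SHA-256 digest of the
    raw (base64-decoded) public key blob.
    """
    fields = line.split()
    if len(fields) < 3:
        raise ValueError("expected 'host keytype base64key'")
    blob = base64.b64decode(fields[2])
    digest = hashlib.sha256(blob).digest()
    return "SHA256:" + base64.b64encode(digest).decode("ascii").rstrip("=")


def matches_published(host: str, line: str) -> bool:
    """Return True if the scanned key matches the published fingerprint."""
    return PUBLISHED.get(host) == fingerprint_from_keyscan_line(line)
```

A line to feed in could come from something like
`ssh-keyscan -t ed25519 bash.autonlab.org` (the key type here is a
guess). `ssh-keygen -lf` on a saved key file prints the same style of
fingerprint without any Python.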
At the moment, the specs of the most important Auton Lab computing
nodes are as follows:
Node               CPU cores  RAM (GB)  GPU
CPU nodes:
ari                       32       520
athena                    32       520
foxconn                   32       384
lov1                      88       764
lov2                      88       764
lov3                      64       256
lov4                      64       256
lov5                      88       764
lov6                      88       384
low1                      48       520
GPU nodes:
gpu1                      24       128  4xTesla K80
gpu2                      24       256  4xTitan X
gpu3                      32       256  4xTitan X
gpu4                      32       256  4xTitan X
gpu5                      32       256  4xTitan X
gpu6                      32       256  4xTITAN Xp
gpu7 (on reserve)         32       256  4xTITAN Xp
gpu8                      32       256  4xTITAN Xp
gpu9                      32       256  4xTITAN Xp
gpu10                     32       192  4xGeForce GTX 1080Ti
gpu11                     40        96  4xGeForce GTX 1080Ti
gpu12                     40        96  4xGeForce GTX 1080Ti
gpu13                     40        96  4xGeForce GTX 1080Ti
gpu14                     40        96  4xGeForce RTX 2080Ti
gpu15                     40       192  4xGeForce RTX 2080Ti
gpu16                     40       192  4xGeForce RTX 2080Ti
gpu17                     40       192  4xGeForce RTX 2080Ti
gpu18                     40       192  4xGeForce RTX 2080Ti
gpu19                     40       192  4xGeForce RTX 2080Ti
Nick Gisolfi <ngisolfi at cs.cmu.edu> wrote:
> I'm fine working remotely, so there isn't anything that needs to
> change in order for me to continue working in the case of a pandemic.
>
> One thing that may be useful: is there a CLI command to get a list of
I have edited the message of the day (MOTD) on bash and lop2, which will
now display the names of the available computing nodes and their specs.
A few older machines have been left off the list, which will be updated
regularly. I am planning to add more resources from the machines in GHC
previously used by Dr. Barnabas. We also have 110K worth of hardware on
the way and a plan to spend another 30-40K.
> the servers I am allowed to use? I often find the lov* machines are
> full of users. I usually work in ari / foxconn in those cases. I am
The Auton Lab users are encouraged to use
http://monit.autonlab.org
Read access:
username: auton
password: Dr.Who
to get a quick, rough overview of which lab resources are up or down,
along with their CPU/RAM utilization. Unfortunately, monit doesn't
support displaying GPU loads.
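
Since monit does not show GPU loads, one way to check them yourself on
a GPU node is to query `nvidia-smi`. The snippet below is a rough
sketch under assumptions: it presumes `nvidia-smi` is installed on the
node you are logged in to, and the helper names and the 5% idle
threshold are made up for illustration.

```python
import subprocess

# nvidia-smi's CSV query mode; "nounits" drops the "%" and "MiB" suffixes.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(text):
    """Turn nvidia-smi CSV output into a list of per-GPU dicts."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        index, name, util, used, total = [f.strip() for f in line.split(",")]
        rows.append({
            "index": int(index),
            "name": name,
            "util_pct": int(util),
            "mem_used_mib": int(used),
            "mem_total_mib": int(total),
        })
    return rows


def idle_gpus(rows, max_util_pct=5):
    """Return indices of GPUs at or below the (arbitrary) idle threshold."""
    return [r["index"] for r in rows if r["util_pct"] <= max_util_pct]


def query_local_gpus():
    """Run nvidia-smi on the current machine and parse its output."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)
```

On a GPU node, `idle_gpus(query_local_gpus())` would list the GPUs that
look free at that instant. The parser does not handle cards that report
a non-numeric utilization.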
> not intimately familiar with how the servers operate, but I've
> noticed that the following machines have full swap memory; it seems
> like this even makes something as simple as changing directories hang.
>
> Low1: Full swap memory
41 GB of RAM used out of 520GB
> Lov1: Full swap memory
Just checked: "only" 321 GB of RAM is used out of 764 GB. Changing
directories was not slow for me.
> Lov2: Full swap memory
32 GB of RAM used out of 764 GB of RAM
> Lov4: Full swap memory
>
97 GB of RAM used out of 256 GB
Our machines have very long uptimes, often over a year. Software,
including operating systems, has bugs which lead to memory leaks. Please
report problems and I will have no problem rebooting machines. Stale NFS
file handles are another source of problems.
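
RAM use and swap use are separate numbers, which is why a node can show
full swap while most of its RAM is free. As an illustrative sketch only
(assuming a Linux node, where the kernel exposes these counters in
/proc/meminfo; the helper names are hypothetical), the two can be read
side by side before reporting a machine:

```python
def parse_meminfo(text):
    """Parse /proc/meminfo text into a dict of integer values (mostly kB)."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            values[key.strip()] = int(parts[0])
    return values


def summarize(values):
    """Report used RAM and used swap in GiB, plus the fraction of swap in use."""
    kib_per_gib = 1024 * 1024
    ram_used = values["MemTotal"] - values["MemAvailable"]
    swap_total = values["SwapTotal"]
    swap_used = swap_total - values["SwapFree"]
    return {
        "ram_used_gib": ram_used / kib_per_gib,
        "ram_total_gib": values["MemTotal"] / kib_per_gib,
        "swap_used_gib": swap_used / kib_per_gib,
        # A node without swap configured reports 0.0 rather than dividing by zero.
        "swap_used_fraction": swap_used / swap_total if swap_total else 0.0,
    }


def local_summary(path="/proc/meminfo"):
    """Read the counters for the machine this runs on."""
    with open(path) as handle:
        return summarize(parse_meminfo(handle.read()))
```

Calling `local_summary()` on a node gives numbers worth including when
reporting a machine that seems to hang.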
I am planning to move /zfsauton/data and /zfsauton/project to the new
file server ourea.int.autonlab.org, bought a few months ago. The server
is up and running, but I was reluctant to replicate the ZFS datasets in
the middle of the semester. The new file server even has a dedicated
NVMe device for the SLOG:
https://jrs-s.net/2019/05/02/zfs-sync-async-zil-slog/
Unfortunately, the Auton Lab currently doesn't have a written user
agreement, data user agreement, or software developer non-disclosure
agreement. These might be put together on an emergency basis in response
to misuse of resources. At this point at least half a dozen GPU servers
are hogged by people who are running CPU-intensive jobs. This is
absolutely unacceptable, and those members have to stop doing that now.
> I don't know if this is something worth rebooting over.
>
I just rebooted gpu5 for a similar issue.
> I would also recommend encouraging everyone in the lab to use the
> Auton Slack team. I think Predrag and Kyle are admins on the account
> (?) so they may need to manually add other lab members. This seems
> like a great way to stay in touch (in real-time) in the event of
> prolonged off-campus work.
I will be logged in to Slack regularly.
Predrag
>
> If I think of anything else, I'll let you know.
>
> - Nick
>
> C: 484.553.2708