Main File server issues
Predrag Punosevac
predragp at andrew.cmu.edu
Wed Oct 31 20:36:05 EDT 2018
Dear Autonians,
Our group has grown to 129 active accounts, which is one shy of the
National Robotics Center. Unfortunately, some of the computer
infrastructure practices we adopted over the past 25 years no longer
scale well. One of those policies is unrestricted-size home directories,
which share the same ZFS data set on one of our NFS file servers.
Currently there are 96 historical accounts which share the 36TB data set
on a ZFS pool (zfsauton) of the same size; 21 recently created accounts
with home directories restricted to 250 GB per user, each one being a
separate ZFS data set which is part of a 44TB ZFS pool (zfsauton2);
and 12 accounts of the Neill group, which has its own file server with
ample space on its own ZFS pool (zfsneill). The Neill accounts also
share a single data set, but they are irrelevant for the purpose of this
e-mail.
Besides letting us implement per-account storage restrictions, having
each home directory as a separate data set enables more fine-grained ZFS
snapshot and retention policies. However, migrating old accounts to
separate data sets (I have several 44 TB ZFS pools available for those
accounts) is a time-consuming, manual process (essentially I have to
rsync the old home directory to the new one). All file servers in
question have 10 Gigabit network cards, so it is completely irrelevant
which server hosts your home directory.
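For the curious, the per-account migration described above can be
sketched roughly as follows. This is only an illustration, not the
script that will actually be run: the data set layout
(zfsauton2/home/<user>), the mount points, and the account name "jdoe"
are assumptions, and the function only builds the commands without
executing anything. It also assumes the parent data set
(zfsauton2/home) already exists.

```python
from typing import List

QUOTA = "250G"  # per-account limit stated in this e-mail


def migration_commands(user: str,
                       old_home_root: str = "/zfsauton/home",
                       new_pool: str = "zfsauton2") -> List[List[str]]:
    """Return the commands for migrating one account, without running them."""
    # Assumed layout: one data set per user under <pool>/home, mounted at
    # /<pool>/home/<user>. The real data set names may differ.
    dataset = f"{new_pool}/home/{user}"
    src = f"{old_home_root}/{user}/"  # trailing slash: copy the contents
    dst = f"/{dataset}/"
    return [
        # Create the per-user data set with its quota.
        ["zfs", "create", "-o", f"quota={QUOTA}", dataset],
        # Copy the home directory, preserving permissions, timestamps,
        # hard links and numeric uid/gid.
        ["rsync", "-aH", "--numeric-ids", src, dst],
    ]


if __name__ == "__main__":
    for cmd in migration_commands("jdoe"):  # "jdoe" is a made-up account
        print(" ".join(cmd))
```

Keeping the commands as data makes it easy to review them for a whole
batch of accounts before anything touches the file servers.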
I was dragging my feet with it but I can't put it off any longer. The
36TB ZFS pool which hosts the /zfsauton/home data set is over 90% full,
which seriously impacts the speed of NFS (in spite of the 10 Gigabit
network card) and makes the pool very expensive to rebuild (ZFS
resilvering in the case of a dead HDD will take a month instead of a
day). I have no choice but to do the following.
I am looking for 10-15 volunteers who don't mind removing old junk from
their home directories and having some down time while I rsync those
directories to the new file server (ZFS pool). Once I migrate 10-15
accounts I will stop snapshots for the remaining people and clear stale
snapshots to free up space on the 36TB ZFS pool. This will be
repeated multiple times until all accounts are separate ZFS data sets
with a limit of 250 GB per account (additional storage space will be
granted at the sponsoring faculty member's request).
In the case that there are no volunteers I will compute the size of the
5 largest home directories and those will be migrated to the new file
server after being reduced to the proper size.
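As a rough idea of how the largest home directories could be picked,
here is a minimal sketch. It is an assumption about the method, not the
actual procedure: it sums apparent file sizes (st_size), which can
differ from the space ZFS really charges because of compression and
snapshots, and the helper names are made up for illustration.

```python
import heapq
import os
from typing import Dict, List, Tuple


def directory_size(path: str) -> int:
    """Total apparent size in bytes of regular files under path."""
    total = 0
    # os.walk does not descend into symlinked directories by default.
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            try:
                if not os.path.islink(full):  # skip symlinked files
                    total += os.lstat(full).st_size
            except OSError:
                pass  # file vanished or is unreadable; skip it
    return total


def largest_homes(sizes: Dict[str, int], n: int = 5) -> List[Tuple[str, int]]:
    """Return the n (user, bytes) pairs with the largest sizes, largest first."""
    return heapq.nlargest(n, sizes.items(), key=lambda item: item[1])
```

One would call directory_size() on each directory under /zfsauton/home,
collect the results in a dict keyed by user name, and pass that dict to
largest_homes().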
I appreciate your cooperation in this matter.
Sincerely,
Predrag Punosevac
P.S. We will also have to implement the Slurm workload manager on all
computing nodes no later than the 1st of January next year
https://slurm.schedmd.com/
This will essentially convert the Auton Lab to the same modus
operandi (apart from the Lustre file system http://lustre.org/) as the
Pittsburgh Supercomputing Center.