Rocks cluster
- Management of computer clusters
- Installation
- Installing Software
- Uninstalling Software
- Upgrade
- Backup
- Administration
- Adding a user
- Implementing disk quota
- Exporting a new directory from the frontend to all the compute nodes
- Keeping files up to date on all nodes using the 411 Secure Information Service
- Removing old log files to prevent /var filling up
- Cleaning up temporary directories on compute nodes
- Managing firewall
- Changing the public IP address on the frontend
Management of computer clusters
Intelligent Platform Management Interface (IPMI) is usually available for server machines. It can use the dedicated IPMI Ethernet port or share the first LAN port (so make sure the first port is connected to the internal network switch) for remote monitoring and control.
A KVM switch can be used for non-server workstations or older machines.
Please refer to the User Manuals page for details on how to use IPMI or KVM. SuperMicro has a suite of Server Management Utilities to perform health monitoring, power management and firmware maintenance (BIOS and IPMI/BMC firmware upgrade). Rocks also bundles the OpenIPMI console interface.
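For example, ipmitool can query and control a node's BMC over the network; the BMC address and credentials below are placeholders:
ipmitool -I lanplus -H 10.1.1.1 -U ADMIN -P PASSWORD sdr list              # sensor/health readings
ipmitool -I lanplus -H 10.1.1.1 -U ADMIN -P PASSWORD chassis power status
ipmitool -I lanplus -H 10.1.1.1 -U ADMIN -P PASSWORD chassis power cycle   # hard-reset a hung node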
Installation
Follow the Users Guide in the Support and Docs section of the Rocks cluster website.
- Reserve a certain amount of disk space on the compute nodes that will not be overwritten when they are reinstalled. 20 GB seems enough for the operating system and software. Remember: the gateway should be 128.101.162.54!
- Update the kernel to the latest version, and update again when newer versions come out.
yum --enablerepo base upgrade kernel
yum --enablerepo base upgrade kernel-devel
yum --enablerepo base upgrade kernel-headers
cp /var/cache/yum/base/packages/kernel*.rpm /export/rocks/install/contrib/6.1.1/x86_64/RPMS/
cd /export/rocks/install; rocks create distro
reboot
Check that you indeed have the desired version, then kickstart the nodes.
uname -r
while read cn; do rocks run host $cn '/boot/kickstart/cluster-kickstart'; sleep 3; done < <(rocks list host compute|cut -d ':' -f 1)
- Create user accounts (see Adding a user) before installing anything else, so that the desired UIDs/GIDs are less likely to conflict with software-generated accounts, and set disk quotas (see Implementing disk quota) to prevent a user who inadvertently generates a huge amount of data from affecting the entire system.
- Install ZFS on Linux (see Using the ZFS file system)
- Install the most recent Torque roll
rocks add roll /path/to/torque/roll.iso
rocks enable roll torque
cd /export/rocks/install; rocks create distro
rocks run roll torque | sh
reboot
- Install screen and cmake (likely from source code; see Adding a software application distributed as source code).
Configuring Environment Modules package
It is recommended that modulefiles be stored in a directory shared among all nodes. For example, create the directory under /share/apps, and add it to /usr/share/Modules/init/.modulespath:
mkdir /share/apps/modulefiles
echo "/share/apps/modulefiles" >> /usr/share/Modules/init/.modulespath
Finally, make sure the .modulespath file is broadcast to all nodes (see Keeping files up to date on all nodes using the 411 Secure Information Service).
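As a minimal sketch, a modulefile for a hypothetical package installed under /share/apps/some_software could be saved as /share/apps/modulefiles/some_software/1.0:
#%Module1.0
module-whatis "some_software 1.0"
prepend-path PATH            /share/apps/some_software/bin
prepend-path LD_LIBRARY_PATH /share/apps/some_software/lib
prepend-path MANPATH         /share/apps/some_software/share/man
Users then load it with "module load some_software/1.0" (or simply "module load some_software").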
Using the ZFS file system
Due to software licensing, ZFS on Linux is supplied as source code only, even if you selected the zfs-linux roll when installing Rocks. Please refer to the zfs-linux Roll: Users Guide for how to build the binaries.
- Create a zpool for each additional hard drive that is not used as the system disk, and create a ZFS file system for each active user with compression, NFS sharing, and quota turned on. Compression with ZFS carries very little overhead and, because of the reduced file size, it sometimes even improves I/O.
zpool create space raidz2 /dev/sda /dev/sdb ... raidz2 /dev/sdp /dev/sdq ... raidz2 sdx sdy ... spare sdz ...
zfs set atime=off space
zfs set compression=gzip space
for u in active_user1 active_user2 ...; do
zfs create space/$u
zfs set compression=lz4 space/$u
zfs set sharenfs=on space/$u
zfs set quota=100G space/$u
chown -R $u:$u /space/$u
done
To make these file systems available as /share/$USER/spaceX, add the following line to the end of /etc/auto.share
* -fstype=autofs,-Dusername=& file:/etc/auto.zfsfs
Then create /etc/auto.zfsfs with the following contents, and propagate it using 411:
* -nfsvers=3 cluster.local:/&/${username}
You need to enable the share points on every boot by adding the following line to /etc/rc.d/rc.local:
zfs share -a
For how to enable them automatically, see ZFS Administration, Part XV- iSCSI, NFS and Samba.
NOTE: Sometimes “zfs share -a” does not populate “/var/lib/nfs/etab” or make /share/$USER/space available on other nodes. A work-around is simply to execute “zfs set sharenfs=on space/SOME_USER” for any user before calling “zfs share -a”.
Automatic backup
ZFS uses copy-on-write and, as a result, snapshots can be created very quickly and cheaply. Create the following script as /etc/cron.daily/zfs-snapshot to keep the last 7 daily, 5 weekly, 12 monthly, and 7 yearly backups.
#!/bin/bash
snapshot() {
local root=$1
local prefix=$2
local keep=$3
zfs list -t filesystem -o name -H -r "$root" | while read fs; do
[ "$fs" == "$root" ] && continue
# echo "zfs snapshot $fs@$prefix-$(date '+%Y%m%d')"
zfs snapshot "$fs@$prefix-$(date '+%Y%m%d')"
zfs list -t snapshot -o name -s creation -H -r "$fs" | grep "$prefix" | head -n "-$keep" | while read ss; do
# echo "zfs destroy $ss"
zfs destroy $ss
done
done
}
snapshot "space" "daily" 7
[ $(date +%w) -eq 0 ] && snapshot "space" "weekly" 5
[ $(date +%-d) -eq 1 ] && snapshot "space" "monthly" 12
[ $(date +%-j) -eq 1 ] && snapshot "space" "yearly" 7
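To recover files, the snapshots created by this script can be listed and browsed per user; the user name and snapshot date below are only examples:
zfs list -t snapshot -r space/active_user1                 # list available snapshots
ls /space/active_user1/.zfs/snapshot/daily-20180101/       # browse a snapshot read-only
cp -a /space/active_user1/.zfs/snapshot/daily-20180101/lost_file /space/active_user1/   # restore a single file
zfs rollback space/active_user1@daily-20180101             # revert the whole file system (add -r to discard newer snapshots)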
Periodic error checking
Hard drives can have silent data corruption. ZFS can detect and correct these errors on a live system. Create the following script as /etc/cron.monthly/zfs-scrub (or in /etc/cron.weekly if using cheap commodity disks):
#!/bin/sh
zpool scrub space
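Scrub progress and any errors found can be checked afterwards with:
zpool status -v space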
Slurm
Add new queues to /etc/slurm/partitions:
PartitionName=E5_2650v4 DEFAULT=YES STATE=UP TRESBillingWeights="CPU=1.0,Mem=0.25G,GRES/gpu=2.0" DefaultTime=60 DefMemPerCPU=512 nodes=compute-0-[0-139]
PartitionName=4170HE DEFAULT=YES STATE=UP TRESBillingWeights="CPU=1.0,Mem=0.25G,GRES/gpu=2.0" DefaultTime=60 DefMemPerCPU=512 nodes=compute-2-[0-31]
And make the following changes in /etc/slurm/slurm.conf:
AccountingStorageTRES=gres/gpu
AccountingStorageEnforce=all
FairShareDampeningFactor=5
GresTypes=gpu
PriorityType=priority/multifactor
PriorityFlags=FAIR_TREE
PriorityDecayHalfLife=14-0
PriorityUsageResetPeriod=NONE
PriorityFavorSmall=NO
PriorityMaxAge=1-0
PriorityWeightAge=10
PriorityWeightFairshare=10000
PriorityWeightJobSize=0
PriorityWeightPartition=10000
PriorityWeightQOS=0
PriorityWeightTRES=cpu=0,mem=0,gres/gpu=0
SelectType=select/cons_res
SelectTypeParameters=CR_Core
TmpFs=/state/partition1
Finally, update compute node attributes, sync the configuration to all nodes, and set a maximum walltime:
rocks report slurm_hwinfo | sh
rocks sync slurm
sacctmgr modify cluster where cluster=cluster set maxwall=96:00:00
Slurm by default forbids logging in to compute nodes unless the user has jobs running on that node. If this behavior is not desired, disable it by:
rocks set host attr attr=slurm_pam_enable value=false
rocks sync slurm
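After syncing, the partition, priority, and fair-share settings can be inspected with standard Slurm commands:
sinfo                              # partition and node states
scontrol show partition E5_2650v4  # verify partition parameters
sprio -l                           # per-job priority factors
sshare -a                          # fair-share usage for all accounts and users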
Reservation
You can use reservations to drain the cluster for maintenance.
scontrol create reservation starttime=2018-07-06T09:00:00 duration=600 user=root flags=maint,ignore_jobs nodes=ALL
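Slurm prints the reservation name when it is created; the reservation can later be inspected, extended, or removed (the name below is just an example):
scontrol show reservation
scontrol update ReservationName=root_1 duration=1200
scontrol delete ReservationName=root_1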
Configuring Torque compute node settings
Edit /var/spool/torque/server_priv/nodes to include node specifications, such as:
compute-0-0 np=8 ntE5-2609 ps2400 E5-26xx
compute-1-0 np=8 ntE5430 ps2660 E54xx
compute-2-0 np=8 ntE5420 ps2500 E54xx
compute-3-0 np=8 ntE5410 ps2330 E54xx
compute-4-0 np=8 ntE5405 ps2000 E54xx
cluster.dept.univ.edu np=4 ntE5405 ps2000 E54xx
then restart pbs_server by executing “service pbs_server restart”. In this example, the prefixes “nt” and “ps” (configured in maui.cfg) are used to denote node type and processor speed information.
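The node attributes can be verified after the restart, for example:
pbsnodes compute-0-0     # shows np, properties (ntE5-2609 ps2400 E5-26xx), and state
pbsnodes :E54xx          # list all nodes that carry the E54xx property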
Making your frontend run queued jobs for PBS (Torque/Maui)
If you have installed the Torque roll, issue the following commands as root on the frontend.
The first line, which sets $frontend, simply ensures that the name matches the one returned by /bin/hostname (which is generally the FQDN). They must match, or pbs_mom will refuse to start.
The next two lines set the number of cores to be used for running jobs. You should probably reserve a few cores for the Rocks overhead processes and for interactive logins, compiling, etc. In this example, we save 4 cores for the overhead and assign the rest to jobs. This is accomplished by setting the “np = $N” (np means number of processors) value.
export frontend=`/bin/hostname`
export N=`cat /proc/cpuinfo | grep processor | wc -l`
export N=`expr $N - 4` # reserve 4 cores
#
qmgr -c "create node $frontend"
qmgr -c "set node $frontend np = $N"
qmgr -c "set node $frontend ntype=cluster"
service pbs_server restart
Alternatively, you can edit /opt/torque/server_priv/nodes by hand, and do “service pbs_server restart” to make it re-read the file. Next, make sure pbs_mom is started on the frontend:
scp compute-0-0:/etc/pbs.conf /etc
chkconfig --add pbs_mom
service pbs_mom start
If you have no compute nodes, you can create /etc/pbs.conf by hand. It should look like this:
pbs_home=/opt/torque
pbs_exec=/opt/torque
start_mom=1
start_sched=0
start_server=0
You should now be able to see the frontend listed in the output of “pbsnodes -a”, and any jobs submitted to the queue will run there.
Creating additional queues in Torque
Run the following commands as root to create two queues, E5-26xx and E54xx, which include only nodes with the corresponding features as defined in /var/spool/torque/server_priv/nodes (see Configuring Torque compute node settings).
qmgr -c "create queue E5-26xx queue_type=execution,started=true,enabled=true,resources_max.walltime=360:00:00,resources_default.walltime=24:00:00,resources_default.neednodes=E5-26xx"
qmgr -c "create queue E54xx queue_type=execution,started=true,enabled=true,resources_max.walltime=360:00:00,resources_default.walltime=24:00:00,resources_default.neednodes=E54xx"
NOTE: Separate queues are not necessary for requesting that jobs run on certain machines. A similar effect can be achieved by specifying node features in the submission script, for example:
#PBS -l nodes=1:E5-26xx:ppn=1
Configuring Maui scheduler behavior
Change the settings in /opt/maui/maui.cfg to the following, and add the parameters if not already present. Restart maui to incorporate the changes: service maui restart
# Job Prioritization: http://www.adaptivecomputing.com/resources/docs/maui/5.1jobprioritization.php
QUEUETIMEWEIGHT 1
XFACTORWEIGHT 86400
XFMINWCLIMIT 00:15:00
FSWEIGHT 86400
FSUSERWEIGHT 1
# Fairshare: http://www.adaptivecomputing.com/resources/docs/maui/6.3fairshare.php
FSPOLICY DEDICATEDPS
FSDEPTH 7
FSINTERVAL 1:00:00:00
FSDECAY 0.80
# Backfill: http://www.adaptivecomputing.com/resources/docs/maui/8.2backfill.php
BACKFILLPOLICY BESTFIT
BACKFILLMETRIC PROCSECONDS
RESERVATIONPOLICY CURRENTHIGHEST
# Node Allocation: http://www.adaptivecomputing.com/resources/docs/maui/5.2nodeallocation.php
NODEALLOCATIONPOLICY PRIORITY
NODECFG[DEFAULT] PRIORITYF='-LOAD - 5*USAGE'
# Creds: http://www.adaptivecomputing.com/resources/docs/maui/6.1fairnessoverview.php
USERCFG[DEFAULT] FSTARGET=25.0
# Node Set: http://www.adaptivecomputing.com/resources/docs/maui/8.3nodesetoverview.php
NODESETDELAY 0:00:00
NODESETPRIORITYTYPE MINLOSS
NODESETATTRIBUTE FEATURE
NODESETPOLICY ONEOF
NODESETLIST E5-26xx E54xx
NODESETTOLERANCE 0.0
# Node Attributes: http://www.adaptivecomputing.com/resources/docs/maui/12.2nodeattributes.php
FEATURENODETYPEHEADER nt
FEATUREPROCSPEEDHEADER ps$
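After restarting maui, the effect of these settings can be checked with Maui's diagnostic commands:
showq          # queued and running jobs with priorities
diagnose -p    # per-job priority breakdown (queue time, expansion factor, fairshare)
diagnose -f    # fairshare usage relative to the FSTARGET values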
HTCondor
Basic settings
To implement a wall time limit (specify “+WallTime = SECONDS” in the job submission file), set the default file system behavior, and ignore console activity, create /opt/condor/etc/config.d/98Rocks.conf with the following contents and propagate it using 411:
DefaultWallTime = 12 * $(HOUR)
EXECUTE = /state/partition1/condor_jobs
FILESYSTEM_DOMAIN = cluster.group.dept.univ.edu
MaxWallTime = 96 * $(HOUR)
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOTS_CONNECTED_TO_CONSOLE = 0
SLOTS_CONNECTED_TO_KEYBOARD = 0
SLOT_TYPE_1 = 100%
SLOT_TYPE_1_PARTITIONABLE = TRUE
START = ifThenElse(isUndefined(WallTime), $(DefaultWallTime), WallTime) <= $(MaxWallTime)
SYSTEM_PERIODIC_REMOVE = RemoteUserCpu + RemoteSysCpu > CpusProvisioned * ifThenElse(isUndefined(WallTime), $(DefaultWallTime), WallTime) || \
RemoteWallClockTime > ifThenElse(isUndefined(WallTime), $(DefaultWallTime), WallTime)
TRUST_UID_DOMAIN = True
UID_DOMAIN = group.dept.univ.edu
Then create the job directory on all compute nodes:
rocks run host command='mkdir -p /state/partition1/condor_jobs; chmod 755 /state/partition1/condor_jobs'
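With this configuration in place, a serial job can request a custom wall time in its submission file; a minimal sketch (the executable name and limits are placeholders):
universe     = vanilla
executable   = my_program
request_cpus = 1
+WallTime    = 86400    # 24 hours, in seconds
output       = job.out
error        = job.err
log          = job.log
queue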
MPI jobs
Enable MPI:
rocks set attr Condor_EnableMPI true
rocks sync host condor frontend compute
Put the following two files (named condor_openmpi.sh and condor_parallel_hosts.sh) in the $MPI_HOME/bin directory:
#!/bin/bash
##**************************************************************
## This is a script to run openmpi jobs under the Condor parallel universe.
## Collects the host and job information into $_CONDOR_PARALLEL_HOSTS_FILE
## and executes
## $MPIRUN --prefix $MPI_HOME --hostfile $_CONDOR_PARALLEL_HOSTS_FILE $@
## command
## The default value of _CONDOR_PARALLEL_HOSTS_FILE is 'parallel_hosts'
##
## The script assumes:
## On the head node (_CONDOR_PROCNO == 0) :
## * $MPIRUN points to the mpirun command
## * condor_parallel_hosts.sh is in $PATH.
## On all nodes:
## * openmpi is installed into the $MPI_HOME directory
##**************************************************************
#----------------------------
MPIRUN=mpirun
MPI_HOME=$(which $MPIRUN)
MPI_HOME=${MPI_HOME%/bin/$MPIRUN}
_CONDOR_PARALLEL_HOSTS_FILE=parallel_hosts
_CONDOR_TEMP_DIR=/state/partition1
#----------------------------
_CONDOR_LIBEXEC=`condor_config_val libexec`
_CONDOR_PARALLEL_HOSTS=$MPI_HOME/bin/condor_parallel_hosts.sh
_CONDOR_SSH_TO_JOB_WRAPPER=$MPI_HOME/bin/condor_ssh_to_job_wraper.sh
# Creates parallel_hosts file containing contact info for hosts
# Returns on head node only
$_CONDOR_PARALLEL_HOSTS
ret=$?
if [ $ret -ne 0 ]; then
echo Error: $ret creating $_CONDOR_PARALLEL_HOSTS_FILE
exit $ret
fi
# Starting mpirun cmd
#exec $MPIRUN --prefix $MPI_HOME --mca orte_rsh_agent $_CONDOR_SSH_TO_JOB_WRAPPER --hostfile $_CONDOR_PARALLEL_HOSTS_FILE $@
exec $MPIRUN --prefix $MPI_HOME --hostfile $_CONDOR_PARALLEL_HOSTS_FILE -map-by core -bind-to core --tmpdir $_CONDOR_TEMP_DIR $@
rm -f $_CONDOR_PARALLEL_HOSTS_FILE
#!/bin/bash
##**************************************************************
## This script collects host and job information about the running parallel job,
## and creates a hostfile including contact info for remote hosts
##**************************************************************
## Helper fn for getting specific machine attributes from $_CONDOR_MACHINE_AD
function CONDOR_GET_MACHINE_ATTR() {
local attr="$1"
awk '/^'"$attr"'[[:space:]]+=[[:space:]]+/ \
{ ret=sub(/^'"$attr"'[[:space:]]+=[[:space:]]+/,""); print; } \
END { exit 1-ret; }' $_CONDOR_MACHINE_AD
return $?
}
## Helper fn for getting specific job attributes from $_CONDOR_JOB_AD
function CONDOR_GET_JOB_ATTR() {
local attr="$1"
awk '/^'"$attr"'[[:space:]]+=[[:space:]]+/ \
{ ret=sub(/^'"$attr"'[[:space:]]+=[[:space:]]+/,""); print; } \
END { exit 1-ret; }' $_CONDOR_JOB_AD
return $?
}
## Helper fn for printing the host info
function CONDOR_PRINT_HOSTS() {
local clusterid=$1
local procid=$2
local reqcpu=$3
local rhosts=$4
# tr ',"' '\n' <<< $rhosts | /bin/grep -v $hostname | \
tr ',"' '\n' <<< $rhosts | \
awk '{ sub(/slot.*@/,""); if ($1 != "") { slots[$1]+='$reqcpu'; subproc[$1]=id++; } } \
END { for (i in slots) print i" slots="slots[i]" max_slots="slots[i]; }'
#END { for (i in slots) print i"-CONDOR-"'$clusterid'".1."subproc[i]" slots="slots[i]" max_slots="slots[i]; }'
}
# Defaults for error testing
: ${_CONDOR_PROCNO:=0}
: ${_CONDOR_NPROCS:=1}
: ${_CONDOR_MACHINE_AD:="None"}
: ${_CONDOR_JOB_AD:="None"}
##**************************************************************
## Usage: CONDOR_GET_PARALLEL_HOSTS_INFO [hostfile]
## If hostfile omitted 'parallel_hosts' is used.
## Return:
## The function returns with error status on main process (_CONDOR_PROCNO==0).
## The function never returns on the other nodes (sleeping).
## The created file structure:
## HostName1'-CONDOR-'ClusterId.ProcId.SubProcId 'slots='Allocated_CPUs 'max_slots='Allocated_CPUs
## HostName2'-CONDOR-'ClusterId.ProcId.SubProcId 'slots='Allocated_CPUs 'max_slots='Allocated_CPUs
## HostName3'-CONDOR-'ClusterId.ProcId.SubProcId 'slots='Allocated_CPUs 'max_slots='Allocated_CPUs
## ...
##**************************************************************
#function CONDOR_GET_PARALLEL_HOSTS_INFO() {
# getting parameters if _CONDOR_PARALLEL_HOSTS_FILE not set
: ${_CONDOR_PARALLEL_HOSTS_FILE:=$1}
# setting defaults
: ${_CONDOR_PARALLEL_HOSTS_FILE:=parallel_hosts}
#local hostname=`hostname -f`
if [ $_CONDOR_PROCNO -eq 0 ]; then
# collecting info on the main proc
#clusterid=`CONDOR_GET_JOB_ATTR ClusterId`
#local ret=$?
#if [ $ret -ne 0 ]; then
# echo Error: get_job_attr ClusterId
# return 1
#fi
#local line=""
#condor_q -l $clusterid | \
cat $_CONDOR_JOB_AD | \
awk '/^ProcId.=/ { ProcId=$3 } \
/^ClusterId.=/ { ClusterId=$3 } \
/^RequestCpus.=/ { RequestCpus=$3 } \
/^RemoteHosts.=/ { RemoteHosts=$3 } \
END { if (ClusterId != 0) print ClusterId" "ProcId" "RequestCpus" "RemoteHosts }' | \
while read line; do
CONDOR_PRINT_HOSTS $line
done | sort -d > ${_CONDOR_PARALLEL_HOSTS_FILE}
else
# endless loop on the workers
while true ; do
sleep 30
done
fi
# return 0
#}
To request a parallel job, add the following to the job submission script:
machine_count = NODES
request_cpus = CORES_PER_NODE
universe = parallel
And use condor_openmpi.sh instead of mpirun for parallel execution.
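Putting it together, a parallel submission file might look like the following sketch (the condor_openmpi.sh path, program name, and resource counts are placeholders):
universe      = parallel
executable    = /opt/openmpi/bin/condor_openmpi.sh
arguments     = ./my_mpi_program input.dat
machine_count = 2
request_cpus  = 8
+WallTime     = 43200
output        = mpi.out
error         = mpi.err
log           = mpi.log
queue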
SGE
Enter “qconf -mconf” and make the following changes:
min_uid 500
min_gid 500
execd_params ENABLE_ADDGRP_KILL=true
auto_user_fshare 1000
auto_user_delete_time 0
Enter “qconf -msconf” and make the following changes:
job_load_adjustments NONE
load_adjustment_decay_time 0
weight_tickets_share 10000
weight_ticket 10000.0
Enter “qconf -mq all.q” and make the following changes:
load_thresholds NONE
h_rt 96:00:00
Create a file (say “tmp_share_tree”):
id=0
name=Root
type=0
shares=1
childnodes=1
id=1
name=default
type=0
shares=1000
childnodes=NONE
And use it to create a share tree fair share policy:
qconf -Astree tmp_share_tree && rm tmp_share_tree
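The resulting tree and the scheduler settings can be verified with:
qconf -sstree    # show the share tree
qconf -ssconf    # show the scheduler configuration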
Kill zombie jobs
SGE sometimes fails to kill all processes of a job. Use the following script to clean up these zombie processes (as well as rogue sessions by users who directly ssh to compute nodes):
#!/bin/bash
launcher_pid=($(gawk 'NR==FNR{shepherd_pid[$0];next} ($1 in shepherd_pid){print $2}' <(pgrep sge_shepherd) <(ps -eo ppid,pid --no-headers)))
# Assume regular users have UIDs >=600
rogue_pid=($(gawk 'NR==FNR{launcher_pid[$0];next} ($1>=600)&&(!($2 in launcher_pid)){print $3}' <(printf "%s\n" "${launcher_pid[@]}") <(ps -eo uid,sid,pid --no-headers)))
# Do not allow any rogue processes if there are >1 jobs running on the
# same node; if a single job has the entire node, then allow the job
# owner to run unmanaged processes, while making sure that zombie
# processes from this user are still killed; if no jobs are running,
# then allow unmanaged processes (e.g., testing)
[ ${#launcher_pid[@]} -eq 0 ] && exit 0
uid=($(ps -p "$(echo ${launcher_pid[@]})" -o uid= | sort | uniq))
if [ ${#uid[@]} -gt 1 ]; then
# echo ${rogue_pid[@]}
kill -9 ${rogue_pid[@]}
elif [ ${#uid[@]} -eq 1 ]; then
stime=$(gawk '{print $22}' /proc/${launcher_pid[0]}/stat)
for (( i=0; i<${#rogue_pid[@]}; i++ )); do
rogue_uid=$(ps -p ${rogue_pid[i]} -o uid=)
if [ -n "$rogue_uid" ] && { [ $rogue_uid -ne $uid ] || [ $(gawk '{print $22}' /proc/${rogue_pid[i]}/stat) -lt $stime ]; }; then
# echo ${rogue_pid[i]}
kill -9 ${rogue_pid[i]}
fi
done
fi
It can be enforced as a system cron job by adding the following between the <post> and </post> tags in extend-compute.xml:
<file name="/etc/cron.d/kill-zombie-jobs" perms="0600">
*/15 * * * * root /opt/gridengine/util/kill-zombie-jobs.sh
</file>
Remember to escape ampersands, quotes, and less-than characters if you use extend-compute.xml to create this script.
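For example, the line “gawk ... <(pgrep sge_shepherd) <(ps -eo ppid,pid --no-headers)” from the script would have to appear in the XML as:
gawk ... &lt;(pgrep sge_shepherd) &lt;(ps -eo ppid,pid --no-headers)
(that is, “&” becomes “&amp;”, double quotes become “&quot;”, and “<” becomes “&lt;”).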
Disabling hyper-threading
Based on some crude benchmarks, Intel Hyper-Threading appears to be detrimental to CPU-intensive workloads. It can be turned off in the BIOS via IPMI, but if there are too many nodes, or IPMI does not allow scripting the change, an alternative is to extend the compute nodes. First figure out the CPU layout using the lstopo program from hwloc, then add the following between the <post> and </post> tags in extend-compute.xml:
<file name="/etc/rc.d/rocksconfig.d/post-89-disable-hyperthreading" perms="0755">
#!/bin/sh
for i in {24..47}; do echo 0 > /sys/devices/system/cpu/cpu$i/online; done
</file>
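Whether the sibling threads are actually offline can be checked after the node boots:
cat /sys/devices/system/cpu/online    # should read 0-23 instead of 0-47 in this example
grep -c ^processor /proc/cpuinfo      # number of CPUs the OS now sees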
Installing Software
After installing a new software package, add an entry, either a single file or a directory named some_software, in the modulefiles directory (/share/apps/modulefiles; see Configuring Environment Modules package). If multiple files (representing different software versions) exist in the directory, create a .version file to specify the default version.
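The .version file is itself a small modulefile; as a sketch, to make a hypothetical version 2.1 the default for some_software:
#%Module1.0
set ModulesVersion "2.1"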
Using Rocks Rolls
Refer to the Roll Developer’s Guide in the Support and Docs section of the Rocks cluster website for how to create your own Rolls.
rocks set host attr localhost roll_install_on_the_fly true shadow=yes # for installing Service Pack Rolls
rocks add roll /path/to/rollname.iso
rocks enable roll rollname
cd /export/rocks/install; rocks create distro
rocks run roll rollname | sh
reboot
After the frontend comes back up, run the following to populate the node list:
rocks sync config
Then kickstart all your nodes:
while read cn; do rocks run host $cn '/boot/kickstart/cluster-kickstart'; sleep 3; done < <(rocks list host compute|cut -d ':' -f 1)
If you are installing Service Pack Rolls, it is critical that you run cluster-kickstart-pxe, as it forces the compute nodes to PXE boot. PXE booting matters for the first install because the nodes will get their initrd from the frontend, and that initrd contains a new tracker-client compatible with the new tracker-server. After the first install, a new initrd will be on the hard disk of the installed nodes, and it is then safe to run /boot/kickstart/cluster-kickstart.
while read cn; do rocks run host $cn '/boot/kickstart/cluster-kickstart-pxe'; sleep 3; done < <(rocks list host compute|cut -d ':' -f 1)
Using YUM repositories
Several YUM repositories are configured but disabled by default. Add “--enablerepo=REPO_NAME” to yum commands to temporarily enable REPO_NAME.
yum repolist all #Display all configured software repositories
yum clean all #clean cache
yum [--enablerepo=REPO_NAME] check-update #update package information
yum list openmotif* #list packages
yum install openmotif openmotif-devel #requirement for Grace and NEdit
Adding a software package distributed as RPMs
Create a roll first:
cd /export/rocks/install/contrib/5.4/x86_64/RPMS
wget http://url/to/some_software.rpm
cd /export/rocks/install/site-profiles/5.4/nodes
cp skeleton.xml extend-compute.xml
Edit extend-compute.xml: remove the unused example sections from the skeleton and add a “<package>some_software</package>” line for the new RPM. Then rebuild the distribution:
cd /export/rocks/install; rocks create distro
Now reinstall the compute nodes:
while read cn; do rocks run host $cn '/boot/kickstart/cluster-kickstart-pxe'; sleep 3; done < <(rocks list host compute|cut -d ':' -f 1)
Adding a software application distributed as source code
Install it into the /share/apps/some_software directory. A typical process is shown below:
wget http://url/to/some_software.tar.bz2
tar xjf some_software.tar.bz2
cd some_software
./configure --prefix=/share/apps/some_software
make -j 8
sudo make install clean
Uninstalling Software
Removing Rolls
rocks disable roll rollname
rocks remove roll rollname
cd /export/rocks/install; rocks create distro
rocks sync config
while read cn; do rocks run host $cn '/boot/kickstart/cluster-kickstart'; sleep 3; done < <(rocks list host compute|cut -d ':' -f 1)
Upgrade
- Create an update roll:
rocks create mirror http://mirror.centos.org/centos/6/updates/x86_64/Packages/ rollname=CentOS_6_X_update_$(date '+%Y%m%d')
rocks create mirror http://mirror.centos.org/centos/6/os/x86_64/Packages/ rollname=Centos_6_X
X should be the current minor release number (e.g., X is 10 if the latest stable version of CentOS is 6.10).
- Add the created rolls to the installed distribution:
rocks add roll CentOS_6_X_update_$(date '+%Y%m%d')-*.iso
rocks add roll Centos_6_X-*.iso
rocks enable roll Centos_6_X
rocks enable roll CentOS_6_X_update_$(date '+%Y%m%d')
cd /export/rocks/install; rocks create distro
- Newly installed nodes will automatically get the updated packages. It is wise to test the update on a compute node to verify that the updates did not break anything. To force a node to reinstall, run the command:
rocks run host compute-0-0 /boot/kickstart/cluster-kickstart
If something goes wrong, you can always revert the updates by removing the update rolls:
rocks remove roll CentOS_6_X_update_$(date '+%Y%m%d')
rocks remove roll Centos_6_X
cd /export/rocks/install; rocks create distro
- After you have tested the update on some nodes in the previous step, you can update the frontend using the standard yum command:
yum update
Updating zfs-linux
Take the opportunity of a kernel update to rebuild and reinstall zfs-linux by following the steps in the Users Guide: Updating the zfs-linux Roll:
cd ~/tools
git clone https://github.com/rocksclusters/zfs-linux.git
cd zfs-linux
make binary-roll
rocks remove roll zfs-linux
rocks add roll zfs-linux*.iso
rocks enable roll zfs-linux
cd /export/rocks/install; rocks create distro
zfs umount -a
service zfs stop
rmmod zfs zcommon znvpair zavl zunicode spl zlib_deflate
rocks run roll zfs-linux | sh
Additional notes for Rocks 6
Apache httpd updates on Rocks 6 break the 411 service, which runs over the unencrypted HTTP protocol. Fix it with the following:
echo 'HttpProtocolOptions Unsafe' >> /etc/httpd/conf/httpd.conf
service httpd restart
Backup
Create a Restore Roll that will contain site-specific info and can be used to upgrade or reconfigure the existing cluster quickly.
cd /export/site-roll/rocks/src/roll/restore
make roll
Administration
Adding a user
- /usr/sbin/useradd -u UID USERNAME creates the home directory in /export/home/USERNAME (based on the settings in /etc/default/useradd) with UID as the user ID. If the desired user ID or the group ID has already been used, change them using:
usermod -u NEW_UID EXISTING_USER
# or
groupmod -g NEW_GID EXISTING_GROUP
- rocks sync users adjusts all home directories that are listed as /export/home as follows:
- edits /etc/passwd, replacing /export/home/ with /home/
- adds a line to /etc/auto.home pointing to the existing directory in /export/home
- updates 411 to propagate the changes in /etc/passwd and /etc/auto.home
In the default Rocks configuration, /home/ is an automount directory. By default, directories in an automount directory are not present until an attempt is made to access them, at which point they are (usually NFS) mounted. This means you CANNOT create a directory in /home/ manually! The contents of /home/ are under autofs control. To “see” a directory, it is not enough to run “ls /home”, as that only accesses the /home directory itself, not its contents; you must run “ls /home/username”.
Implementing disk quota
- Edit /etc/fstab, look for the partitions you want to have quotas on (“LABEL=” or “UUID=” lines), and change “defaults” to “grpquota,usrquota,defaults” on those lines.
- Reboot, check the quota state, and turn on quotas:
quotacheck -guvma
quotaon -guva
- Set up a prototype user quota:
edquota -u PROTOTYPE_USER # -t DAYS to edit the soft time limits
- Duplicate the quotas of the prototypical user to other users:
edquota -p PROTOTYPE_USER -u user1 user2 ...
- To get a quota summary for a file system:
repquota /export
Exporting a new directory from the frontend to all the compute nodes
- Add the directory you want to export to the file /etc/exports.
For example, if you want to export the directory /export/scratch1, add the following to /etc/exports:
/export/scratch1 10.0.0.0/255.0.0.0(rw)
This exports the directory only to nodes that are on the internal network (in the above example, the internal network is configured to be 10.0.0.0)
- Restart NFS:
/etc/rc.d/init.d/nfs restart
- Add an entry to /etc/auto.home (or /etc/auto.share).
For example, say you want /export/scratch1 on the frontend machine (named frontend-0) to be mounted as /home/scratch1 (or /share/scratch1) on each compute node. Add the following entry to /etc/auto.home (or /etc/auto.share):
scratch1 frontend-0:/export/scratch1
or
scratch1 frontend-0:/export/&
- Inform 411 of the change:
make -C /var/411
Now when you log in to any compute node and change your directory to /home/scratch1 (or /share/scratch1), it will be automounted.
Keeping files up to date on all nodes using the 411 Secure Information Service
Add the files to /var/411/Files.mk, and execute the following:
make -C /var/411
rocks run host command="411get --all" #force all nodes to retrieve the latest files from the frontend immediately
Removing old log files to prevent /var filling up
Place the following in /etc/cron.daily:
#!/bin/sh
rm -f /var/log/*-20??????
rm -f /var/log/slurm/*.log-*
rm -f /var/lib/ganglia/archives/ganglia-rrds.20??-??-??.tar
Cleaning up temporary directories on compute nodes
Add a system cron job by placing the following between the <post> and </post> tags in extend-compute.xml:
<file name="/etc/cron.weekly/clean-scratch" perms="0700">
#!/bin/sh
find /tmp /state/partition1 -mindepth 1 -mtime +7 -type f ! -wholename /state/partition1/condor_jobs -exec rm -f {} \;
find /tmp /state/partition1 -mindepth 1 -depth -mtime +7 -type d ! -wholename /state/partition1/condor_jobs -exec rmdir --ignore-fail-on-non-empty {} \;
</file>
This will be picked up by /etc/anacrontab or /etc/cron.d/0hourly.
Managing firewall
The following rules allow access to the web server from UMN IPs:
rocks remove firewall host=cluster rulename=A40-HTTPS-PUBLIC-LAN
rocks add firewall host=cluster rulename=A40-HTTPS-PUBLIC-LAN service=https protocol=tcp chain=INPUT action=ACCEPT network=public flags='-m state --state NEW --source 128.101.0.0/16,134.84.0.0/16,160.94.0.0/16,131.212.0.0/16,199.17.0.0/16'
rocks remove firewall host=cluster rulename=A40-WWW-PUBLIC-LAN
rocks add firewall host=cluster rulename=A40-WWW-PUBLIC-LAN service=www protocol=tcp chain=INPUT action=ACCEPT network=public flags='-m state --state NEW --source 128.101.0.0/16,134.84.0.0/16,160.94.0.0/16,131.212.0.0/16,199.17.0.0/16'
rocks sync host firewall cluster
These add a few national labs to the allowed IPs for SSH:
rocks remove firewall global rulename=A20-SSH-PUBLIC
rocks add firewall global rulename=A20-SSH-PUBLIC service=ssh protocol=tcp chain=INPUT action=ACCEPT network=public flags='-m state --state NEW --source 128.101.0.0/16,134.84.0.0/16,160.94.0.0/16,131.212.0.0/16,199.17.0.0/16,140.221.69.0/24,130.20.235.0/24,134.9.50.0/24,131.243.2.0/24,128.55.209.0/24,160.91.205.0/24,132.175.108.0/24'
rocks sync host firewall
Alternatively, install DenyHosts, which reads the log file for SSH authentication failures and adds the offending IPs to /etc/hosts.deny.
yum --enablerepo=epel install denyhosts
chkconfig denyhosts on
service denyhosts start
vim /etc/denyhosts.conf # configuration file
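A few settings worth reviewing in /etc/denyhosts.conf (the parameter names are standard DenyHosts options; the values below are only illustrative):
SECURE_LOG = /var/log/secure      # authentication log to watch on CentOS
HOSTS_DENY = /etc/hosts.deny      # where offending IPs are written
BLOCK_SERVICE = sshd              # deny only SSH
DENY_THRESHOLD_INVALID = 5        # failed attempts allowed for non-existent accounts
DENY_THRESHOLD_VALID = 10         # failed attempts allowed for valid accounts
DENY_THRESHOLD_ROOT = 1           # failed attempts allowed for root
PURGE_DENY = 4w                   # unblock entries after four weeks
ADMIN_EMAIL = root@localhost      # where reports are sent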
Changing the public IP address on the frontend
It is strongly recommended that the Fully-Qualified Host Name (e.g., cluster.dept.univ.edu) be chosen carefully and never be modified after the initial setup, because doing so will break several cluster services (e.g., NFS, AutoFS, and Apache). If you want to change the public IP address, you can do so by:
rocks set host interface ip frontend iface=eth1 ip=xxx.xxx.xxx.xxx
rocks set attr Kickstart_PublicAddress xxx.xxx.xxx.xxx
# Edit the IP address in /etc/hosts, /etc/sysconfig/network-scripts/ifcfg-eth1, /etc/yum.repos.d/rocks-local.repo
# It's important to enter the following commands in one line if you are doing this remotely, as the network interface will be stopped by the first command.
ifdown eth1; ifup eth1
rocks sync config
rocks sync host network
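The new address can be confirmed with:
rocks list host interface frontend
ip addr show eth1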