Rocks cluster

Management of computer clusters

Intelligent Platform Management Interface (IPMI) is usually available for server machines. It can use the dedicated IPMI Ethernet port or share the first LAN port (so make sure the first port is connected to the internal network switch) for remote monitoring and control.
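
For example, assuming a node's BMC is reachable at 10.1.1.100 with the (hypothetical) credentials ADMIN/PASSWORD, ipmitool can query sensors, control power, and open a serial-over-LAN console remotely:

ipmitool -I lanplus -H 10.1.1.100 -U ADMIN -P PASSWORD sdr list              # sensor readings
ipmitool -I lanplus -H 10.1.1.100 -U ADMIN -P PASSWORD chassis power status
ipmitool -I lanplus -H 10.1.1.100 -U ADMIN -P PASSWORD chassis power cycle
ipmitool -I lanplus -H 10.1.1.100 -U ADMIN -P PASSWORD sol activate          # serial-over-LAN console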

A KVM switch can be used for non-server workstations or older machines.

Please refer to the User Manuals page for details on how to use IPMI or KVM. SuperMicro has a suite of Server Management Utilities to perform health monitoring, power management and firmware maintenance (BIOS and IPMI/BMC firmware upgrade). Rocks also bundles the OpenIPMI console interface.

Installation

Follow the Users Guide in the Support and Docs section of Rocks cluster’s web site.

  • Reserve a certain amount of disk space on compute nodes that will not be overwritten when a node is reinstalled. 20G seems enough for the operating system and software. Remember: the gateway should be 128.101.162.54!
  • Update the kernel to the latest version, and update again when newer versions are released.
yum --enablerepo base upgrade kernel
yum --enablerepo base upgrade kernel-devel
yum --enablerepo base upgrade kernel-headers
cp /var/cache/yum/base/packages/kernel*.rpm /export/rocks/install/contrib/6.1.1/x86_64/RPMS/
cd /export/rocks/install; rocks create distro
reboot

Check that you indeed have the desired version, then kickstart the nodes:

uname -r
while read cn; do rocks run host $cn '/boot/kickstart/cluster-kickstart'; sleep 3; done < <(rocks list host compute|cut -d ':' -f 1)
  • Create user accounts (see Adding a user) before installing anything else, so that the desired UIDs/GIDs are less likely to conflict with software-generated accounts, and set disk quotas (see Implementing disk quota) so that a user who inadvertently generates a huge amount of data cannot affect the entire system.
  • Install ZFS on Linux (see Using the ZFS file system)
  • Install the most recent Torque roll
rocks add roll /path/to/torque/roll.iso
rocks enable roll torque
cd /export/rocks/install; rocks create distro
rocks run roll torque | sh
reboot

Configuring the Environment Modules package

It is recommended that modulefiles are stored in a directory shared among all nodes. For example, create the directory under /share/apps, and add it to /usr/share/Modules/init/.modulespath:

mkdir /share/apps/modulefiles
echo "/share/apps/modulefiles" >> /usr/share/Modules/init/.modulespath

Finally, make sure the .modulespath file is broadcast to all nodes (see how to keep files up to date on all nodes using the 411 Secure Information Service).
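
As a sketch, a minimal modulefile for a hypothetical package installed under /share/apps/some_software could look like the following (saved as /share/apps/modulefiles/some_software/1.0):

#%Module1.0
## Hypothetical modulefile for some_software 1.0
proc ModulesHelp { } {
    puts stderr "some_software 1.0 installed in /share/apps/some_software"
}
prepend-path PATH            /share/apps/some_software/bin
prepend-path LD_LIBRARY_PATH /share/apps/some_software/lib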

Using the ZFS file system

Due to the licensing of the software, ZFS on Linux is supplied as source code only, even if you have already selected the zfs-linux roll when installing Rocks cluster. Please refer to zfs-linux Roll: Users Guide for how to build the binaries.

  • Create a zpool from the additional hard drives that are not used as system disks, and create a ZFS file system for each active user with compression, NFS sharing, and quota turned on. Compression with ZFS carries very little overhead, and because of the reduced file size it sometimes even improves I/O.
zpool create space raidz2 /dev/sda /dev/sdb ... raidz2 /dev/sdp /dev/sdq ... raidz2 sdx sdy ... spare sdz ...
zfs set atime=off space
zfs set compression=gzip space

for u in active_user1 active_user2 ...; do
  zfs create space/$u
  zfs set compression=lz4 space/$u
  zfs set sharenfs=on space/$u
  zfs set quota=100G space/$u
  chown -R $u:$u /space/$u
done

To make these file systems available as /share/$USER/spaceX, add the following line to the end of /etc/auto.share:

* -fstype=autofs,-Dusername=& file:/etc/auto.zfsfs

Then create /etc/auto.zfsfs with the following contents, and propagate it using 411:

* -nfsvers=3 cluster.local:/&/${username}

The share points need to be enabled on every boot; add the following line to /etc/rc.d/rc.local:

zfs share -a

For how to enable them automatically, see ZFS Administration, Part XV- iSCSI, NFS and Samba.

NOTE: Sometimes “zfs share -a” does not populate “/var/lib/nfs/etab”, and /share/$USER/space does not become available on other nodes. A work-around is simply to execute “zfs set sharenfs=on space/SOME_USER” on any user's file system before calling “zfs share -a”.
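
For example, the rc.local addition combined with this work-around might look like the following sketch (SOME_USER is any existing user file system):

# appended to /etc/rc.d/rc.local: re-enable NFS sharing of the ZFS file systems on boot
zfs set sharenfs=on space/SOME_USER
zfs share -a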

Automatic backup

ZFS uses copy-on-write and, as a result, snapshots can be created very quickly and cheaply. Create the following script as /etc/cron.daily/zfs-snapshot to keep the last 7 daily, 5 weekly, 12 monthly, and 7 yearly backups.

#!/bin/bash

snapshot() {
  local root=$1
  local prefix=$2
  local keep=$3

  zfs list -t filesystem -o name -H -r "$root" | while read fs; do
    [ "$fs" == "$root" ] && continue

    # echo "zfs snapshot $fs@$prefix-$(date '+%Y%m%d')"
    zfs snapshot "$fs@$prefix-$(date '+%Y%m%d')"

    zfs list -t snapshot -o name -s creation -H -r "$fs" | grep "$prefix" | head -n "-$keep" | while read ss; do
      # echo "zfs destroy $ss"
      zfs destroy "$ss"
    done
  done
}

snapshot "space" "daily" 7
[ $(date +%w) -eq 0 ] && snapshot "space" "weekly" 5
[ $(date +%-d) -eq 1 ] && snapshot "space" "monthly" 12
[ $(date +%-j) -eq 1 ] && snapshot "space" "yearly" 7
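
Snapshots can be listed per file system, and individual files restored directly from the hidden .zfs directory; for example (the user and snapshot names are placeholders):

zfs list -t snapshot -r space/active_user1
cp -a /space/active_user1/.zfs/snapshot/daily-20180101/lost_file /space/active_user1/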

Periodic error checking

Hard drives can have silent data corruption. ZFS can detect and correct these errors on a live system. Create the following script as /etc/cron.monthly/zfs-scrub (or in /etc/cron.weekly if using cheap commodity disks):

#!/bin/sh

zpool scrub space
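
The result of the last scrub (and any repaired or unrecoverable errors) can be checked with:

zpool status -v space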

Slurm

Add new queues to /etc/slurm/partitions:

PartitionName=E5_2650v4 DEFAULT=YES STATE=UP TRESBillingWeights="CPU=1.0,Mem=0.25G,GRES/gpu=2.0" DefaultTime=60 DefMemPerCPU=512 nodes=compute-0-[0-139]
PartitionName=4170HE DEFAULT=YES STATE=UP TRESBillingWeights="CPU=1.0,Mem=0.25G,GRES/gpu=2.0" DefaultTime=60 DefMemPerCPU=512 nodes=compute-2-[0-31]

And make the following changes in /etc/slurm/slurm.conf:

AccountingStorageTRES=gres/gpu
AccountingStorageEnforce=all
FairShareDampeningFactor=5
GresTypes=gpu
PriorityType=priority/multifactor
PriorityFlags=FAIR_TREE
PriorityDecayHalfLife=14-0
PriorityUsageResetPeriod=NONE
PriorityFavorSmall=NO
PriorityMaxAge=1-0
PriorityWeightAge=10
PriorityWeightFairshare=10000
PriorityWeightJobSize=0
PriorityWeightPartition=10000
PriorityWeightQOS=0
PriorityWeightTRES=cpu=0,mem=0,gres/gpu=0

SelectType=select/cons_res
SelectTypeParameters=CR_Core
TmpFs=/state/partition1

Finally, update compute node attributes, sync the configuration to all nodes, and set a maximum walltime:

rocks report slurm_hwinfo | sh
rocks sync slurm
sacctmgr modify cluster where cluster=cluster set maxwall=96:00:00
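
As a quick check of the new partitions, a job script along the following lines can be submitted with sbatch (the executable name is a placeholder):

#!/bin/bash
#SBATCH --partition=E5_2650v4
#SBATCH --ntasks=8
#SBATCH --mem-per-cpu=512M
#SBATCH --time=01:00:00

srun ./my_program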

Slurm by default forbids logging in to compute nodes unless the user has jobs running on that node. If this behavior is not desired, disable it by:

rocks set host attr attr=slurm_pam_enable value=false
rocks sync slurm

Reservation

You can use reservations to drain the cluster for maintenance.

scontrol create reservation starttime=2018-07-06T09:00:00 duration=600 user=root flags=maint,ignore_jobs nodes=ALL
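
Existing reservations can be inspected, and removed once the maintenance is over (use the reservation name reported by scontrol, e.g. root_1):

scontrol show reservation
scontrol delete ReservationName=root_1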

Configuring Torque compute node settings

Edit /var/spool/torque/server_priv/nodes to include node specifications, such as:

compute-0-0 np=8  ntE5-2609 ps2400 E5-26xx
compute-1-0 np=8  ntE5430   ps2660 E54xx
compute-2-0 np=8  ntE5420   ps2500 E54xx
compute-3-0 np=8  ntE5410   ps2330 E54xx
compute-4-0 np=8  ntE5405   ps2000 E54xx
cluster.dept.univ.edu np=4 ntE5405 ps2000 E54xx

Then restart pbs_server by executing “service pbs_server restart”. In this example, the prefixes “nt” and “ps” (configured in maui.cfg) are used to denote node type and processor speed information.

Making your frontend run queued jobs for PBS (Torque/Maui)

If you have installed the Torque roll, issue the following commands as root on the frontend.

The first line setting $frontend just ensures that the name matches that returned by /bin/hostname (which is generally the FQDN). They must match, or pbs_mom will refuse to start/work.

The next two lines set the number of cores to be used for running jobs. You probably should reserve a few cores for all the Rocks overhead processes, and for interactive logins, compiling, etc. In this example, we save 4 cores for the overhead and assign the rest for jobs. This is accomplished by setting the “np = $N” (np means number of processors) value.

export frontend=`/bin/hostname`
export N=`cat /proc/cpuinfo | grep processor | wc -l`
export N=`expr $N - 4` # reserve 4 cores
#
qmgr -c "create node $frontend"
qmgr -c "set node $frontend np = $N"
qmgr -c "set node $frontend ntype=cluster"
service pbs_server restart

Alternatively, you can edit /opt/torque/server_priv/nodes by hand, and do “service pbs_server restart” to make it re-read the file. Next, make sure pbs_mom is started on the frontend:

scp compute-0-0:/etc/pbs.conf /etc
chkconfig --add pbs_mom
service pbs_mom start

If you have no compute nodes, you can create /etc/pbs.conf by hand. It should look like this:

pbs_home=/opt/torque
pbs_exec=/opt/torque
start_mom=1
start_sched=0
start_server=0

You should now be able to see the frontend listed in the output of “pbsnodes -a”, and any jobs submitted to the queue will run there.

Creating additional queues in Torque

Run the following commands as root to create two queues, E5-26xx and E54xx, which include only nodes with the corresponding features, as defined in /var/spool/torque/server_priv/nodes (see Configuring Torque compute node settings).

qmgr -c "create queue E5-26xx queue_type=execution,started=true,enabled=true,resources_max.walltime=360:00:00,resources_default.walltime=24:00:00,resources_default.neednodes=E5-26xx"
qmgr -c "create queue E54xx queue_type=execution,started=true,enabled=true,resources_max.walltime=360:00:00,resources_default.walltime=24:00:00,resources_default.neednodes=E54xx"

NOTE: Separate queues are not necessary for requesting jobs to be run on certain machines. A similar effect can be accomplished by specifying node features in the submission script, for example:

#PBS -l nodes=1:E5-26xx:ppn=1

Configuring Maui scheduler behavior

Change the settings in /opt/maui/maui.cfg to the following, and add the parameters if not already present. Restart maui to incorporate the changes: service maui restart

# Job Prioritization: http://www.adaptivecomputing.com/resources/docs/maui/5.1jobprioritization.php

QUEUETIMEWEIGHT       1
XFACTORWEIGHT         86400
XFMINWCLIMIT          00:15:00
FSWEIGHT              86400
FSUSERWEIGHT          1

# Fairshare: http://www.adaptivecomputing.com/resources/docs/maui/6.3fairshare.php

FSPOLICY              DEDICATEDPS
FSDEPTH               7
FSINTERVAL            1:00:00:00
FSDECAY               0.80

# Backfill: http://www.adaptivecomputing.com/resources/docs/maui/8.2backfill.php

BACKFILLPOLICY        BESTFIT
BACKFILLMETRIC        PROCSECONDS
RESERVATIONPOLICY     CURRENTHIGHEST

# Node Allocation: http://www.adaptivecomputing.com/resources/docs/maui/5.2nodeallocation.php

NODEALLOCATIONPOLICY  PRIORITY
NODECFG[DEFAULT]      PRIORITYF='-LOAD - 5*USAGE'

# Creds: http://www.adaptivecomputing.com/resources/docs/maui/6.1fairnessoverview.php

USERCFG[DEFAULT]      FSTARGET=25.0

# Node Set: http://www.adaptivecomputing.com/resources/docs/maui/8.3nodesetoverview.php

NODESETDELAY          0:00:00
NODESETPRIORITYTYPE   MINLOSS
NODESETATTRIBUTE      FEATURE
NODESETPOLICY         ONEOF
NODESETLIST           E5-26xx E54xx
NODESETTOLERANCE      0.0

# Node Attributes: http://www.adaptivecomputing.com/resources/docs/maui/12.2nodeattributes.php

FEATURENODETYPEHEADER nt
FEATUREPROCSPEEDHEADER ps$
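
After restarting Maui, the effect of these settings can be checked with its diagnostic commands:

showq          # queue overview
diagnose -p    # per-job priority breakdown
diagnose -f    # fairshare usage versus targets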

HTCondor

Basic settings

To implement a wall time limit (specified with “+WallTime = SECONDS” in the job submission file), set the default file system behavior, and ignore console activity, create /opt/condor/etc/config.d/98Rocks.conf with the following contents and propagate it using 411:

DefaultWallTime = 12 * $(HOUR)
EXECUTE = /state/partition1/condor_jobs
FILESYSTEM_DOMAIN = cluster.group.dept.univ.edu
MaxWallTime = 96 * $(HOUR)
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOTS_CONNECTED_TO_CONSOLE = 0
SLOTS_CONNECTED_TO_KEYBOARD = 0
SLOT_TYPE_1 = 100%
SLOT_TYPE_1_PARTITIONABLE = TRUE
START = ifThenElse(isUndefined(WallTime), $(DefaultWallTime), WallTime) <= $(MaxWallTime)
SYSTEM_PERIODIC_REMOVE = RemoteUserCpu + RemoteSysCpu > CpusProvisioned * ifThenElse(isUndefined(WallTime), $(DefaultWallTime), WallTime) || \
                         RemoteWallClockTime > ifThenElse(isUndefined(WallTime), $(DefaultWallTime), WallTime)
TRUST_UID_DOMAIN = True
UID_DOMAIN = group.dept.univ.edu

Then create the job directory on all compute nodes:

rocks run host command='mkdir -p /state/partition1/condor_jobs; chmod 755 /state/partition1/condor_jobs'
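
With this configuration in place, a job can request a longer wall time limit via +WallTime; a minimal submit file sketch (the program and file names are placeholders):

universe     = vanilla
executable   = my_program
output       = my_program.out
error        = my_program.err
log          = my_program.log
request_cpus = 1
# 24 hours, in seconds
+WallTime    = 86400
queue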

MPI jobs

Enable MPI:

rocks set attr Condor_EnableMPI true
rocks sync host condor frontend compute

Put the following two files, condor_openmpi.sh and condor_parallel_hosts.sh, in the $MPI_HOME/bin directory. First, condor_openmpi.sh:

#!/bin/bash

##**************************************************************
## This is a script to run openmpi jobs under the Condor parallel universe.
## Collects the host and job information into $_CONDOR_PARALLEL_HOSTS_FILE
## and executes
##   $MPIRUN --prefix $MPI_HOME --hostfile $_CONDOR_PARALLEL_HOSTS_FILE $@
## command
## The default value of _CONDOR_PARALLEL_HOSTS_FILE is 'parallel_hosts'
##
## The script assumes:
##  On the head node (_CONDOR_PROCNO == 0) :
##    * $MPIRUN points to the mpirun command
##    * condor_parallel_hosts.sh is in $PATH.
##  On all nodes:
##    * openmpi is installed into the $MPI_HOME directory
##**************************************************************

#----------------------------
MPIRUN=mpirun
MPI_HOME=$(which $MPIRUN)
MPI_HOME=${MPI_HOME%/bin/$MPIRUN}
_CONDOR_PARALLEL_HOSTS_FILE=parallel_hosts
_CONDOR_TEMP_DIR=/state/partition1
#----------------------------

_CONDOR_LIBEXEC=`condor_config_val libexec`
_CONDOR_PARALLEL_HOSTS=$MPI_HOME/bin/condor_parallel_hosts.sh
_CONDOR_SSH_TO_JOB_WRAPPER=$MPI_HOME/bin/condor_ssh_to_job_wraper.sh

# Creates parallel_hosts file containing contact info for hosts
# Returns on head node only
$_CONDOR_PARALLEL_HOSTS
ret=$?
if [ $ret -ne 0 ]; then
    echo Error: $ret creating $_CONDOR_PARALLEL_HOSTS_FILE
    exit $ret
fi

# Starting mpirun cmd
#exec $MPIRUN --prefix $MPI_HOME --mca orte_rsh_agent $_CONDOR_SSH_TO_JOB_WRAPPER --hostfile $_CONDOR_PARALLEL_HOSTS_FILE $@
exec $MPIRUN --prefix $MPI_HOME --hostfile $_CONDOR_PARALLEL_HOSTS_FILE -map-by core -bind-to core --tmpdir $_CONDOR_TEMP_DIR $@

rm -f $_CONDOR_PARALLEL_HOSTS_FILE

The second file, condor_parallel_hosts.sh:

#!/bin/bash

##**************************************************************
## This script collects host and job information about the running parallel job,
## and creates a hostfile including contact info for remote hosts
##**************************************************************

## Helper fn for getting specific machine attributes from $_CONDOR_MACHINE_AD
function CONDOR_GET_MACHINE_ATTR() {
    local attr="$1"
    awk '/^'"$attr"'[[:space:]]+=[[:space:]]+/ \
        { ret=sub(/^'"$attr"'[[:space:]]+=[[:space:]]+/,""); print; } \
        END { exit 1-ret; }' $_CONDOR_MACHINE_AD
    return $?
}

## Helper fn for getting specific job attributes from $_CONDOR_JOB_AD
function CONDOR_GET_JOB_ATTR() {
    local attr="$1"
    awk '/^'"$attr"'[[:space:]]+=[[:space:]]+/ \
        { ret=sub(/^'"$attr"'[[:space:]]+=[[:space:]]+/,""); print; } \
        END { exit 1-ret; }' $_CONDOR_JOB_AD
    return $?
}

## Helper fn for printing the host info
function CONDOR_PRINT_HOSTS() {
    local clusterid=$1
    local procid=$2
    local reqcpu=$3
    local rhosts=$4
    # tr ',"' '\n' <<< $rhosts | /bin/grep -v $hostname | \
    tr ',"' '\n' <<< $rhosts | \
    awk '{ sub(/slot.*@/,""); if ($1 != "") { slots[$1]+='$reqcpu'; subproc[$1]=id++; } } \
        END { for (i in slots) print i" slots="slots[i]" max_slots="slots[i]; }'
        #END { for (i in slots) print i"-CONDOR-"'$clusterid'".1."subproc[i]" slots="slots[i]" max_slots="slots[i]; }'
}

# Defaults for error testing
: ${_CONDOR_PROCNO:=0}
: ${_CONDOR_NPROCS:=1}
: ${_CONDOR_MACHINE_AD:="None"}
: ${_CONDOR_JOB_AD:="None"}

##**************************************************************
## Usage: CONDOR_GET_PARALLEL_HOSTS_INFO [hostfile]
## If hostfile omitted 'parallel_hosts' is used.
## Return:
##   The function returns with error status on main process (_CONDOR_PROCNO==0).
##   The function never returns on the other nodes (sleeping).
## The created file structure:
##   HostName1'-CONDOR-'ClusterId.ProcId.SubProcId 'slots='Allocated_CPUs 'max_slots='Allocated_CPUs
##   HostName2'-CONDOR-'ClusterId.ProcId.SubProcId 'slots='Allocated_CPUs 'max_slots='Allocated_CPUs
##   HostName3'-CONDOR-'ClusterId.ProcId.SubProcId 'slots='Allocated_CPUs 'max_slots='Allocated_CPUs
##   ...
##**************************************************************
#function CONDOR_GET_PARALLEL_HOSTS_INFO() {
    # getting parameters if _CONDOR_PARALLEL_HOSTS_FILE not set
    : ${_CONDOR_PARALLEL_HOSTS_FILE:=$1}
    # setting defaults
    : ${_CONDOR_PARALLEL_HOSTS_FILE:=parallel_hosts}
    #local hostname=`hostname -f`
    if [ $_CONDOR_PROCNO -eq 0 ]; then
    # collecting info on the main proc
        #clusterid=`CONDOR_GET_JOB_ATTR ClusterId`
        #local ret=$?
        #if [ $ret -ne 0 ]; then
        #    echo Error: get_job_attr ClusterId
        #    return 1
        #fi
        #local line=""
        #condor_q -l $clusterid | \
        cat $_CONDOR_JOB_AD | \
        awk '/^ProcId.=/ { ProcId=$3 } \
             /^ClusterId.=/ { ClusterId=$3 } \
             /^RequestCpus.=/ { RequestCpus=$3 } \
             /^RemoteHosts.=/ { RemoteHosts=$3 } \
             END { if (ClusterId != 0) print ClusterId" "ProcId" "RequestCpus" "RemoteHosts  }' | \
        while read line; do
            CONDOR_PRINT_HOSTS $line
        done | sort -d > ${_CONDOR_PARALLEL_HOSTS_FILE}
    else
    # endless loop on the workers
        while true ; do
            sleep 30
        done
    fi
#    return 0
#}

To request a parallel job, add the following to the job submission script:

machine_count = NODES
request_cpus = CORES_PER_NODE
universe = parallel

And use condor_openmpi.sh instead of mpirun for parallel execution.
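
A parallel submit file might therefore look like the following sketch (the script path, program name, and sizes are placeholders):

universe      = parallel
executable    = /share/apps/openmpi/bin/condor_openmpi.sh
arguments     = ./my_mpi_program
machine_count = 2
request_cpus  = 8
# 24 hours, in seconds
+WallTime     = 86400
output        = my_mpi_program.out
error         = my_mpi_program.err
log           = my_mpi_program.log
queue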

SGE

Enter

qconf -mconf

and make the following changes:

min_uid                      500
min_gid                      500
execd_params                 ENABLE_ADDGRP_KILL=true
auto_user_fshare             1000
auto_user_delete_time        0

Enter

qconf -msconf

and make the following changes:

job_load_adjustments              NONE
load_adjustment_decay_time        0
weight_tickets_share              10000
weight_ticket                     10000.0

Enter

qconf -mq all.q

and make the following changes:

load_thresholds       NONE
h_rt                  96:00:00

Create a file (say “tmp_share_tree”):

id=0
name=Root
type=0
shares=1
childnodes=1
id=1
name=default
type=0
shares=1000
childnodes=NONE

And use it to create a share tree fair share policy:

qconf -Astree tmp_share_tree && rm tmp_share_tree

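The resulting share tree and scheduler configuration can be verified with:

qconf -sstree    # show the share tree
qconf -ssconf    # show the scheduler configuration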

Kill zombie jobs

SGE sometimes fails to kill all processes of a job. Use the following script to clean up these zombie processes (as well as rogue sessions by users who directly ssh to compute nodes):

#!/bin/bash

launcher_pid=($(gawk 'NR==FNR{shepherd_pid[$0];next} ($1 in shepherd_pid){print $2}' <(pgrep sge_shepherd) <(ps -eo ppid,pid --no-headers)))
# Assume regular users have UIDs >=600
rogue_pid=($(gawk 'NR==FNR{launcher_pid[$0];next} ($1>=600)&&(!($2 in launcher_pid)){print $3}' <(printf "%s\n" "${launcher_pid[@]}") <(ps -eo uid,sid,pid --no-headers)))

# Do not allow any rogue processes if there are >1 jobs running on the
# same node; if a single job has the entire node, then allow the job
# owner to run unmanaged processes, while making sure that zombie
# processes from this user are still killed; if no jobs are running,
# then allow unmanaged processes (e.g., testing)
[ ${#launcher_pid[@]} -eq 0 ] && exit 0
uid=($(ps -p "$(echo ${launcher_pid[@]})" -o uid= | sort | uniq))
if [ ${#uid[@]} -gt 1 ]; then
  # echo ${rogue_pid[@]}
  kill -9 ${rogue_pid[@]}
elif [ ${#uid[@]} -eq 1 ]; then
  stime=$(gawk '{print $22}' /proc/${launcher_pid[0]}/stat)
  for (( i=0; i<${#rogue_pid[@]}; i++ )); do
    rogue_uid=$(ps -p ${rogue_pid[i]} -o uid=)
    if [ -n "$rogue_uid" ] && { [ $rogue_uid -ne $uid ] || [ $(gawk '{print $22}' /proc/${rogue_pid[i]}/stat) -lt $stime ]; }; then
      # echo ${rogue_pid[i]}
      kill -9 ${rogue_pid[i]}
    fi
  done
fi

It can be enforced as a system cron job by adding the following to extend-compute.xml, between “<post>” and “</post>”:

<file name="/etc/cron.d/kill-zombie-jobs" perms="0600">
*/15 * * * * root /opt/gridengine/util/kill-zombie-jobs.sh
</file>

Remember to escape ampersands, quotes, and less-than characters if you use extend-compute.xml to create this script.

Disabling hyper-threading

Based on some crude benchmarks, Intel Hyper-Threading appears to be detrimental to CPU-intensive workloads. It can be turned off in the BIOS via IPMI, but if there are too many nodes and IPMI does not allow scripting the change, an alternative is to extend the compute nodes. First figure out the CPU layout using the lstopo program of hwloc, then add the following between “<post>” and “</post>” in extend-compute.xml (assuming 24-47 are the virtual cores):

<file name="/etc/rc.d/rocksconfig.d/post-89-disable-hyperthreading" perms="0755">
#!/bin/sh
for i in {24..47}; do echo 0 > /sys/devices/system/cpu/cpu$i/online; done
</file>
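
If hwloc is installed, the text-mode lstopo output shows which logical CPU numbers are the hyper-threaded siblings, so the core range above can be adjusted per node type:

lstopo-no-graphics --no-io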

Installing Software

After installing a new software package, add an entry, either a single file or a directory named some_software, in the modulefiles directory (/share/apps/modulefiles above). If multiple files (representing different software versions) exist in the directory, create a file named .version to specify the default version.
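
The .version file simply names the default modulefile in that directory; a sketch for a hypothetical some_software directory containing modulefiles 1.0 and 2.0:

#%Module1.0
set ModulesVersion "2.0"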

Using Rocks Rolls

Refer to the Roll Developer’s Guide in the Support and Docs section of Rocks cluster’s web site for how to create your own Rolls.

rocks set host attr localhost roll_install_on_the_fly true shadow=yes # for installing Service Pack Rolls
rocks add roll /path/to/rollname.iso
rocks enable roll rollname
cd /export/rocks/install; rocks create distro
rocks run roll rollname | sh
reboot

After the frontend comes back up, you should do the following to populate the node list:

rocks sync config

then kickstart all your nodes

while read cn; do rocks run host $cn '/boot/kickstart/cluster-kickstart'; sleep 3; done < <(rocks list host compute|cut -d ':' -f 1)

If installing Service Pack Rolls, it is critical that you run cluster-kickstart-pxe, as it will force the compute nodes to PXE boot. It is important that you PXE boot the nodes for the first install because, with a PXE-boot-based install, the nodes will get their initrd from the frontend, and inside the initrd is a new tracker-client that is compatible with the new tracker-server. After the first install, a new initrd will be on the hard disk of the installed nodes, and then it is safe to run /boot/kickstart/cluster-kickstart.

while read cn; do rocks run host $cn '/boot/kickstart/cluster-kickstart-pxe'; sleep 3; done < <(rocks list host compute|cut -d ':' -f 1)

Using YUM repositories

Several YUM repositories are configured but disabled by default. Add “--enablerepo=REPO_NAME” to yum commands to temporarily enable REPO_NAME.

yum repolist all #Display all configured software repositories
yum clean all #clean cache
yum [--enablerepo=REPO_NAME] check-update #update package information
yum list openmotif* #list packages
yum install openmotif openmotif-devel #requirement for Grace and NEdit

Adding a software package distributed as RPMs

First, add the RPM to the distribution and extend the compute node configuration:

cd /export/rocks/install/contrib/5.4/x86_64/RPMS
wget http://url/to/some_software.rpm
cd /export/rocks/install/site-profiles/5.4/nodes
cp skeleton.xml extend-compute.xml

Edit extend-compute.xml and remove the unused “<package>” lines, adding an entry for the new package as sketched below.
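
The entry for the hypothetical RPM above would simply be:

<package>some_software</package>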

cd /export/rocks/install; rocks create distro

Now reinstall the compute nodes:

while read cn; do rocks run host $cn '/boot/kickstart/cluster-kickstart-pxe'; sleep 3; done < <(rocks list host compute|cut -d ':' -f 1)

Adding a software application distributed as source code

Install it into the /share/apps/some_software directory. A typical process is shown below:

wget http://url/to/some_software.tar.bz2
mkdir some_software && tar xjf some_software.tar.bz2 -C some_software
cd some_software
./configure --prefix=/share/apps/some_software
make -j 8
sudo make install clean

Uninstalling Software

Removing Rolls

rocks disable roll rollname
rocks remove roll rollname
cd /export/rocks/install; rocks create distro
rocks sync config
while read cn; do rocks run host $cn '/boot/kickstart/cluster-kickstart'; sleep 3; done < <(rocks list host compute|cut -d ':' -f 1)

Upgrade

  • Create an update roll:
rocks create mirror http://mirror.centos.org/centos/6/updates/x86_64/Packages/ rollname=CentOS_6_X_update_$(date '+%Y%m%d')
rocks create mirror http://mirror.centos.org/centos/6/os/x86_64/Packages/  rollname=Centos_6_X

X should be the current minor release number (e.g., X should be 10 if the latest stable version of CentOS is 6.10).

Add the created update rolls to the installed distribution:

rocks add roll CentOS_6_X_update_$(date '+%Y%m%d')-*.iso
rocks add roll Centos_6_X-*.iso
rocks enable roll Centos_6_X
rocks enable roll CentOS_6_X_update_$(date '+%Y%m%d')
cd /export/rocks/install; rocks create distro
  • Newly installed nodes will automatically get the updated packages. It is wise to test the update on a compute node to verify that the updates did not break anything. To force a node to reinstall, run the command:
rocks run host compute-0-0 /boot/kickstart/cluster-kickstart

If something goes wrong, you can always revert the updates by removing the update rolls:

rocks remove roll CentOS_6_X_update_$(date '+%Y%m%d')
rocks remove roll Centos_6_X
cd /export/rocks/install; rocks create distro
  • After you have tested the update on some nodes in the previous step, you can update the frontend using the standard yum command:
yum update

Updating zfs-linux

Use the opportunity of the kernel update to rebuild and reinstall zfs-linux by following the steps on Users Guide: Updating the zfs-linux Roll:

cd ~/tools
git clone https://github.com/rocksclusters/zfs-linux.git
cd zfs-linux
make binary-roll

rocks remove roll zfs-linux
rocks add roll zfs-linux*.iso
rocks enable roll zfs-linux
cd /export/rocks/install; rocks create distro

zfs umount -a
service zfs stop
rmmod zfs zcommon znvpair zavl zunicode spl zlib_deflate

rocks run roll zfs-linux | sh

Additional notes for Rocks 6

Apache httpd updates on Rocks 6 break the 411 service, which runs over the unencrypted HTTP protocol. Fix it with the following:

echo 'HttpProtocolOptions Unsafe' >> /etc/httpd/conf/httpd.conf
service httpd restart

Backup

Create a Restore Roll that will contain site-specific info and can be used to upgrade or reconfigure the existing cluster quickly.

cd /export/site-roll/rocks/src/roll/restore
make roll

Administration

Adding a user

  • /usr/sbin/useradd -u UID USERNAME creates the home directory in /export/home/USERNAME (based on the settings in /etc/default/useradd) with UID as the user ID. If the desired user ID or the group ID has already been used, change them using:
usermod -u NEW_UID EXISTING_USER
# or
groupmod -g NEW_GID EXISTING_GROUP
  • rocks sync users adjusts all home directories that are listed as /export/home as follows:
  1. edits /etc/passwd, replacing /export/home/ with /home/
  2. adds a line to /etc/auto.home pointing to the existing directory in /export/home
  3. updates 411 to propagate the changes in /etc/passwd and /etc/auto.home

In the default Rocks configuration, /home/ is an automount directory. By default, directories in an automount directory are not present until an attempt is made to access them, at which point they are (usually NFS) mounted. This means you CANNOT create a directory in /home/ manually! The contents of /home/ are under autofs control. To “see” the directory, it’s not enough to do a ls /home as that only accesses the /home directory itself, not its contents. To see the contents, you must ls /home/username.
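
Putting it together, a typical sequence for adding a user might look like this (USERNAME and the UID are placeholders):

/usr/sbin/useradd -u 1234 USERNAME
passwd USERNAME
rocks sync users
ls /home/USERNAME   # triggers the automount and verifies the home directory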

Implementing disk quota

  • Edit /etc/fstab, look for the partitions you want to have quotas on (“LABEL=” or “UUID=”), and change “defaults” to “grpquota,usrquota,defaults” in those lines.
  • Reboot, check quota state and turn on quota:
quotacheck -guvma
quotaon -guva
  • Set up a prototype user quota:
edquota -u PROTOTYPE_USER # use "edquota -t" to edit the grace periods
  • Duplicate the quotas of the prototypical user to other users:
edquota -p PROTOTYPE_USER -u user1 user2 ...
  • To get a quota summary for a file system:
repquota /export

Exporting a new directory from the frontend to all the compute nodes

  • Add the directory you want to export to the file /etc/exports.

For example, if you want to export the directory /export/scratch1, add the following to /etc/exports:

/export/scratch1 10.0.0.0/255.0.0.0(rw)

This exports the directory only to nodes that are on the internal network (in the above example, the internal network is 10.0.0.0/255.0.0.0).

  • Restart NFS:
/etc/rc.d/init.d/nfs restart
  • Add an entry to /etc/auto.home (or /etc/auto.share).

For example, say you want /export/scratch1 on the frontend machine (named frontend-0) to be mounted as /home/scratch1 (or /share/scratch1) on each compute node. Add the following entry to /etc/auto.home (or /etc/auto.share):

scratch1 frontend-0:/export/scratch1

or

scratch1 frontend-0:/export/&
  • Inform 411 of the change:
make -C /var/411

Now when you log in to any compute node and change your directory to /home/scratch1 (or /share/scratch1), it will be automounted.

Keeping files up to date on all nodes using the 411 Secure Information Service

Add the files to /var/411/Files.mk, and execute the following:

make -C /var/411
rocks run host command="411get --all" #force all nodes to retrieve the latest files from the frontend immediately
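
An entry in /var/411/Files.mk typically just appends the file to the FILES list, for example (the path here is the autofs map from the ZFS section above):

FILES += /etc/auto.zfsfs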

Removing old log files to prevent /var filling up

Place the following in /etc/cron.daily:

#!/bin/sh

rm -f /var/log/*-20??????
rm -f /var/log/slurm/*.log-*
rm -f /var/lib/ganglia/archives/ganglia-rrds.20??-??-??.tar

Cleaning up temporary directories on compute nodes

Add a system cron job between “<post>” and “</post>” in extend-compute.xml:

<file name="/etc/cron.weekly/clean-scratch" perms="0700">
#!/bin/sh
find /tmp /state/partition1 -mindepth 1 -mtime +7 -type f ! -wholename /state/partition1/condor_jobs -exec rm -f {} \;
find /tmp /state/partition1 -mindepth 1 -depth -mtime +7 -type d ! -wholename /state/partition1/condor_jobs -exec rmdir --ignore-fail-on-non-empty {} \;
</file>

This will be picked up by /etc/anacrontab or /etc/cron.d/0hourly.

Managing firewall

The following rules allow access to the web server from UMN IPs:

rocks remove firewall host=cluster rulename=A40-HTTPS-PUBLIC-LAN
rocks add firewall host=cluster rulename=A40-HTTPS-PUBLIC-LAN service=https protocol=tcp chain=INPUT action=ACCEPT network=public flags='-m state --state NEW --source 128.101.0.0/16,134.84.0.0/16,160.94.0.0/16,131.212.0.0/16,199.17.0.0/16'
rocks remove firewall host=cluster rulename=A40-WWW-PUBLIC-LAN
rocks add firewall host=cluster rulename=A40-WWW-PUBLIC-LAN service=www protocol=tcp chain=INPUT action=ACCEPT network=public flags='-m state --state NEW --source 128.101.0.0/16,134.84.0.0/16,160.94.0.0/16,131.212.0.0/16,199.17.0.0/16'
rocks sync host firewall cluster

These add a few national labs to the allowed IPs for SSH:

rocks remove firewall global rulename=A20-SSH-PUBLIC
rocks add firewall global rulename=A20-SSH-PUBLIC service=ssh protocol=tcp chain=INPUT action=ACCEPT network=public flags='-m state --state NEW --source 128.101.0.0/16,134.84.0.0/16,160.94.0.0/16,131.212.0.0/16,199.17.0.0/16,140.221.69.0/24,130.20.235.0/24,134.9.50.0/24,131.243.2.0/24,128.55.209.0/24,160.91.205.0/24,132.175.108.0/24'
rocks sync host firewall

Alternatively, install DenyHosts, which reads the log file for SSH authentication failures and adds the offending IPs to /etc/hosts.deny.

yum --enablerepo=epel install denyhosts
chkconfig denyhosts on
service denyhosts start
vim /etc/denyhosts.conf # configuration file

Changing the public IP address on the frontend

It is strongly recommended that the Fully-Qualified Host Name (e.g., cluster.dept.univ.edu) be chosen carefully and never be modified after the initial setup, because doing so will break several cluster services (e.g., NFS, AutoFS, and Apache). If you want to change the public IP address, you can do so by:

rocks set host interface ip frontend iface=eth1 ip=xxx.xxx.xxx.xxx
rocks set attr Kickstart_PublicAddress xxx.xxx.xxx.xxx
# Edit the IP address in /etc/hosts, /etc/sysconfig/network-scripts/ifcfg-eth1, /etc/yum.repos.d/rocks-local.repo
# It's important to enter the following commands in one line if you are doing this remotely, as the network interface will be stopped by the first command.
ifdown eth1; ifup eth1
rocks sync config
rocks sync host network
