High Performance Computing

Getting Started with henry2 Linux Cluster


Henry2 System Configuration

There are 1149 dual Xeon compute nodes in the henry2 cluster. Each node has two Xeon processors (a mix of single-, dual-, quad-, six-, and eight-core) and 2 to 4 gigabytes of memory per core.

The nodes all have 64-bit processors. Generally, either 32-bit or 64-bit x86 executables will run correctly. 64-bit executables are required in order to access more than about 3GB of memory for program data.

The compute nodes are managed by the LSF resource manager and must not be accessed except through LSF (accounts that access compute nodes directly are subject to immediate termination).

Logins for the cluster are handled by a set of login nodes which can be accessed as login.hpc.ncsu.edu using ssh.

Additional information on the initial henry2 configuration (c. 2004) is available in http://hpc.ncsu.edu/Documents/hpc_cluster_config.pdf.
Some additional information about the cluster architecture is available at http://hpc.ncsu.edu/Hardware/henry2_architecture.php.

Logging onto henry2 cluster

SSH access is supported to the login nodes (login.hpc.ncsu.edu). Logins are authenticated using Unity user names and passwords. Microsoft Windows users can obtain ssh clients from the ITECS remote access page. An X11 server for Microsoft Windows is also available from the same ITECS site.

Login nodes should not be used for interactive jobs that take any significant amount of system resources. The usual way to run CPU-intensive codes is to submit them as batch jobs to LSF, which schedules them for execution on compute nodes. Example LSF job submission files can be found on the Intel Compilers page.

It is sometimes necessary to use interactive, GUI-based serial pre- and post-processors for data resident in the HPC environment. Interactive computing in the HPC environment should be performed by requesting a Virtual Computing Lab (VCL) HPC environment. To use the VCL HPC environment, go to the web page http://vcl.ncsu.edu and click on "Make a VCL Reservation". If you have not already authenticated with your Unity ID and password, you will be prompted to do so.

From the list of environments, select "HPC (64-bit RedHat Linux)".

When the environment is ready, VCL will provide information regarding how to log in. VCL provides a dedicated environment, so heavy interactive use will not interfere with other users. If you have an HPC account but have problems accessing the VCL HPC environment, send e-mail to oit_hpc@help.ncsu.edu.

File Systems

AFS files are not available from the cluster (but are available on the VCL HPC environments described above).

Users have a home directory that is shared by all the cluster nodes. The /usr/local file system is also shared by all nodes. The home file system is backed up daily, with one copy of each file retained.

Three NFS-mounted shared scratch file systems, /share, /share2, and /share3, are also available to all users. These file systems are not backed up, and files may be deleted from them automatically at any time. Use of these file systems is at the user's own risk. There is a 1TB group quota on each of these file systems.

A parallel file system, /gpfs_share, is also available. Directories on /gpfs_share can be requested. There is a 1TB group quota imposed on /gpfs_share. The /gpfs_share file system is not backed up, and files are subject to being deleted at any time. Use is at the user's own risk.

Finally, from the login nodes the HPC mass storage file systems, /ncsu/volume1 and /ncsu/volume2, are available for storage in excess of what can be accommodated in /home. Since these file systems are not available from the compute nodes, they cannot be used for running jobs.

User files in /home, /ncsu/volume1, and /ncsu/volume2 are backed up daily. A single backup version is maintained for each file. User files in all other file systems are not backed up.

Important files should never be placed on storage that is not backed up unless another copy of the file exists in another location.

HPC projects are allocated 1TB of storage in one of the HPC mass storage systems (volume1 or volume2). Additional backed up space in these file systems can be purchased or leased.

In addition, a Storage Partner Program provides an option for faculty partners to purchase additional storage and have it network attached to the henry2 cluster using either NFS or GPFS.

Additional information about storage on HPC resources is available from http://hpc.ncsu.edu/Documents/GettingStartedstorage.php

Software

Many software packages have already been compiled to run on the cluster. If you click on Software in the left toolbar or on http://hpc.ncsu.edu/Software/Software.php , you'll see a list of software. In many cases, there are "HowTos" which explain how to get access and submit example jobs. Suggestions on documentation updates and on additional software are encouraged.

Compiling

There are three compiler flavors available on the cluster: 1) the standard GNU compilers supplied with Linux, 2) the Intel compilers, and 3) the Portland Group compilers.

The default GNU compilers are okay for compiling utility programs but in most cases are not appropriate for computationally intensive applications.

Overall the best performance has been observed using the Intel compilers. However, the Intel compilers support very few extensions of the Fortran standard - so codes written using non-standard Fortran may fail to compile without modifications.

The Portland Group compilers tend to be somewhat less syntactically strict than the Intel compilers while still generating more efficient code than the GNU compilers.

Additional information about use of each of these compilers is available from the following links. Generally objects and libraries built with different compiler flavors should not be mixed as unexpected behavior may result.

Users whose programs require more than ~1GB of memory should review the following information.
A note on compiling executables with large (> ~1 GB) memory requirements

Running Jobs

The cluster is designed to run computationally intensive jobs on compute nodes. Running resource intensive jobs on the login nodes, while technically possible, is not permitted.

Please limit your use of the login nodes to editing, compiling, and transferring files. Running more than one concurrent file transfer program (scp, sftp, cp) from the login nodes is also discouraged.

Running Serial Jobs

To run computationally intensive jobs on the cluster use the compute nodes. Access to the compute nodes is managed by LSF. All tasks for the compute nodes should be submitted to LSF.

The following steps are used to submit jobs to LSF:

  • Create a script file containing the commands to be executed for your job:
    #! /bin/csh
    #BSUB -o standard_output
    #BSUB -e standard_error
    
    # Stage the input on scratch, run there, then copy the results home
    cp input /share/myuserid/input
    cd /share/myuserid
    ./job.exe < input
    cp output /home/myuserid
    
    
  • Use the bsub command to submit the script to the batch system. In the following example two hours of run time are requested:
    bsub -W 2:00 < script.csh
    
  • The bjobs command can be used to monitor the progress of a job.
  • The -e and -o options specify the files for standard error and standard output respectively. If these are not specified the standard output and standard error will be sent by email to the account submitting the job.
  • The bpeek command can be used to view standard output and standard error for a running job.
  • The bkill command can be used to remove a job from LSF (regardless of current job status).
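The monitoring commands above all take a job ID. The following is a minimal sketch in POSIX sh (not csh) of one way to capture that ID at submission time. It assumes bsub prints its usual confirmation message of the form "Job <12345> is submitted to ... queue <...>."; extract_job_id is a hypothetical helper name, not an LSF command.

```shell
#!/bin/sh
# extract_job_id (hypothetical helper): read bsub's confirmation
# message on standard input and print only the numeric job ID.
extract_job_id() {
    sed -n 's/^Job <\([0-9][0-9]*\)> is submitted.*/\1/p'
}

# Demonstration on a sample message; prints 12345.
echo 'Job <12345> is submitted to queue <debug>.' | extract_job_id

# Possible use on a login node (not executed here):
#   jobid=$(bsub -W 2:00 < script.csh | extract_job_id)
#   bjobs "$jobid"     # check status
#   bpeek "$jobid"     # view output so far
#   bkill "$jobid"     # remove the job
```

If the confirmation message on your system is worded differently, the sed pattern would need to be adjusted.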

Running MPI Parallel Jobs with Hydra MPICH2

After an operating system update in May 2010, many codes compiled with MPICH-1 libraries exhibited "net_send" errors. Since the MPICH-1 libraries are no longer maintained by the mpich developers, MPICH was updated to MPICH-2. Compiling and running with the MPICH-2 parallel libraries uses the following syntax.

For MPICH-2, environment variables are set with "add pgi_hydra" or "add intel_hydra" for the PGI and Intel compilers, respectively. Alternatively, use "source /usr/local/apps/env/pgi_hydra.csh" or "source /usr/local/apps/env/intel_hydra.csh".

We discourage use of the GNU compilers for parallel computation, but if you do need an equivalent parallel tool chain, please contact HPC support. For MPICH-2 compiled codes, a job submission script bfoo looks like

#! /bin/csh
#BSUB -o standard_output.%J
#BSUB -e standard_error.%J
#BSUB -n 4
#BSUB -W 15
#BSUB -R span[ptile=2]
setenv MPICH_NO_LOCAL 1
mpiexec_hydra ./parjob.exe

The span[ptile=2] requests that job tasks be distributed two per node. This specification is optional and can range from 1 to 12. Specifying a particular number of tasks per node may result in a longer wait in the queue until available resources match the request.
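As a rough illustration of how -n and ptile interact, the following POSIX sh sketch estimates how many nodes a job occupies. nodes_for_job is a hypothetical helper, and the rounding assumes LSF fills nodes up to the ptile value and places any remainder on one more node.

```shell
#!/bin/sh
# nodes_for_job SLOTS PTILE (hypothetical helper, not an LSF command):
# estimate the number of nodes used by a job that requests SLOTS job
# slots (#BSUB -n) with PTILE tasks per node (span[ptile=...]).
nodes_for_job() {
    slots=$1
    ptile=$2
    # Round up: a partly filled last node still counts as a node.
    echo $(( (slots + ptile - 1) / ptile ))
}

# The sample script above requests -n 4 with ptile=2; prints 2.
nodes_for_job 4 2
```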

The setenv MPICH_NO_LOCAL 1 line specifies that all MPI messages will be passed through sockets rather than through the shared memory available on a node. If setenv MPICH_NO_LOCAL 1 is omitted, the span[ptile=...] specification must remain. Some possible alternative lines would be

unsetenv MPICH_NO_LOCAL
#BSUB -R span[ptile=4]
which would allocate 4 MPI processes on each node, or
unsetenv MPICH_NO_LOCAL
#BSUB -R span[ptile=8]
which would allocate 8 MPI processes on each node with two quad-core processors (8 cores total). "span[ptile=8]" restricts the choice of nodes on which LSF can schedule jobs to empty quad-core nodes. Quad-core nodes are usually not empty, so asking for 8 cores on a node can entail a long wait before running.

If the number of MPI processes on each node (set with -R span[ptile=...]) is not specified, then the line "setenv MPICH_NO_LOCAL 1" is necessary. Even with "setenv MPICH_NO_LOCAL 1", a ptile setting often helps job execution performance. (Absent a ptile setting, many processes may land on a few nodes, and runtime bottlenecks can occur as many processes communicate through a few sockets.)

Running MPI Parallel Jobs with Infiniband

The cluster nodes are connected by a Gigabit network. A limited number of nodes are also connected by a lower-latency InfiniBand network. To use the InfiniBand interconnect, codes should link to the mvapich libraries. For pgi compilers,

add pgi_mvapich

or inside a bsub job submission script
 
source /usr/local/apps/env/pgi_mvapich.csh 

or for intel compilers
add intel_mvapich

or inside a bsub job submission script
source /usr/local/apps/env/intel_mvapich.csh

will set environment variables so that mpif90 and mpicc use the pgf90 and pgcc compilers (or ifort and icc for the Intel setup) and link to the infiniband mpich (mvapich) libraries.

In order to run infiniband jobs, a couple of environmental variables need to be set on all nodes on which the job will run. To do that, edit a file .tcshrc in your home directory. .tcshrc will be executed as part of the setup process for parallel jobs.

cd
ls -l .tcshrc

will show whether you already have a .tcshrc file. Put or append the lines
 
setenv RLIMIT_MEMLOCK 1000000
limit memorylocked unlimited

to the .tcshrc file.
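Appending by hand can leave duplicate lines if the step is repeated. The following POSIX sh sketch (run from a login shell, not part of the csh job script) adds a line only when it is not already present. append_line_once is a hypothetical helper name.

```shell
#!/bin/sh
# append_line_once FILE LINE (hypothetical helper):
# append LINE to FILE unless an identical line is already there.
append_line_once() {
    file=$1
    line=$2
    touch "$file"    # create the file if it does not exist yet
    # -x: match the whole line, -F: fixed string, -q: no output
    if ! grep -qxF -- "$line" "$file"; then
        printf '%s\n' "$line" >> "$file"
    fi
}

# Possible use (not executed here); repeat for each line quoted above:
#   append_line_once "$HOME/.tcshrc" 'limit memorylocked unlimited'
```

Running the helper a second time with the same arguments leaves the file unchanged.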

Once you have a mvapich linked executable, you can submit your infiniband job to the standard_ib queue. Instead of mpiexec_hydra, use mpiexec_mvapich. A sample bsub script follows.

#! /bin/csh
#BSUB -W 15
#BSUB -n 16 
#BSUB -q standard_ib 
#BSUB -R "span[ptile=4] same[chassis]" 
#BSUB -o out.%J
#BSUB -e err.%J
source /usr/local/apps/env/pgi_mvapich.csh
mpiexec_mvapich ./ringping

Note that if you use the span[ptile=4] to specify 4 cores per node, you also need the same[chassis], else the standard_ib queue may put jobs on two different chassis, causing a runtime infiniband error.

For performance reasons, we do not recommend using the GNU compilers. If you find a need to use them (for example, your code only compiles with GNU, or you want to make sure your code works with open source compilers), please contact HPC support.

Running Shared Memory Parallel Jobs

Henry2 nodes are a mix of single-, dual-, quad-, six-, and eight-core processors. Total processor cores per node range from 2 to 16. All the processor cores on a node share access to all of the memory on the node. Individual nodes can be used to run programs written using a shared memory programming model - such as OpenMP.

To submit a shared memory job that uses multiple cores on a single node, use the bsub options -n 16 -x; the -x option requests exclusive use of a node. An example submission file might be

#! /bin/csh 
#BSUB -o out.%J
#BSUB -e err.%J
#BSUB -n 16
#BSUB -R "rusage[mem=128000]  span[hosts=1]" 
#BSUB -W 15
#BSUB -q shared_memory 
setenv OMP_NUM_THREADS 16
./exec

If the above file is shmemjob, it could be submitted by the command

 
bsub < shmemjob
and will run on a node with 16 cores.

As of September 2013, the maximum amount of RAM available to a shared_memory queue job is 512 GBytes (2 nodes). 9 nodes have 128 GBytes of RAM, and 3 nodes have 128 GBytes. To request memory, use the -R rusage[mem=xxxx] flag, where mem is expressed in megabytes. The bsub file above uses 128000 (128 followed by three zeros) to request 128 GBytes of RAM.
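The unit convention above (gigabytes times 1000) can be sketched as a small POSIX sh helper that prints the matching directive. rusage_directive is a hypothetical name, and pairing the request with span[hosts=1] simply mirrors the sample file above.

```shell
#!/bin/sh
# rusage_directive GB (hypothetical helper):
# print a #BSUB resource line requesting GB gigabytes, expressed in
# megabytes using the same "append three zeros" convention as above.
rusage_directive() {
    printf '#BSUB -R "rusage[mem=%s] span[hosts=1]"\n' "$(( $1 * 1000 ))"
}

# Prints: #BSUB -R "rusage[mem=128000] span[hosts=1]"
rusage_directive 128
```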

Shared memory jobs can also be run on other nodes, but with access to fewer total processor cores. A script such as the following would use nodes with two quad-core processors to access 8 total processor cores.

 
#! /bin/csh
#BSUB -o out.%J
#BSUB -e err.%J
#BSUB -n 8
#BSUB -R span[hosts=1]
#BSUB -W 15
setenv OMP_NUM_THREADS 8
./exec

The number of job slots requested, -n 8 in this example, needs to match the number of threads the parallel job will use (OMP_NUM_THREADS). The resource request must specify span[hosts=1] to ensure that LSF assigns all the requested job slots on the same node - so they will have access to the same physical memory.
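A mismatch between -n and OMP_NUM_THREADS is easy to introduce when editing a script. The following POSIX sh sketch checks a csh job script before submission. check_threads is a hypothetical helper, and it assumes the script sets the thread count with a literal "setenv OMP_NUM_THREADS N" line, as in the examples above.

```shell
#!/bin/sh
# check_threads SCRIPT (hypothetical helper):
# compare the "#BSUB -n" value with the "setenv OMP_NUM_THREADS" value
# in a csh job script; report the result and return non-zero on mismatch.
check_threads() {
    script=$1
    # Take the first "#BSUB -n N" line and the first setenv line.
    slots=$(awk '$1 == "#BSUB" && $2 == "-n" { print $3; exit }' "$script")
    threads=$(awk '$1 == "setenv" && $2 == "OMP_NUM_THREADS" { print $3; exit }' "$script")
    if [ -n "$slots" ] && [ "$slots" = "$threads" ]; then
        echo "ok: $slots slots, $threads threads"
    else
        echo "mismatch: -n is '$slots', OMP_NUM_THREADS is '$threads'"
        return 1
    fi
}

# Possible use before submitting (not executed here):
#   check_threads shmemjob && bsub < shmemjob
```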

See the individual compiler pages for the flags needed to compile codes with OpenMP shared memory parallelism enabled. Short course lecture notes on OpenMP from the fall of 2009 give some instructions for converting a Fortran or C code to use OpenMP parallelism.

Running Hybrid (MPI + Shared Memory) Parallel Jobs
Normally, when running a hybrid parallel job, you want to place 1 MPI process on each node and, under that MPI process, use all the cores available on that node. The following is a simple sample script that can be used to run a hybrid parallel job "hybrid-job".
#!/bin/csh

#BSUB -o standard_output.%J
#BSUB -e standard_error.%J
#BSUB -n 16 
#BSUB -x
#BSUB -R "qc span[ptile=1]"
#BSUB -W 15

source /usr/local/apps/env/intel_mpich2_hydra-101.csh

setenv OMP_NUM_THREADS `grep processor /proc/cpuinfo | wc -l`; mpiexec_hydra ./hybrid-job
If the script is named hybrid-job.csh, then it can be submitted by the command
bsub < hybrid-job.csh
The following specifications in the above script are necessary for running a hybrid parallel job:
  1. The specification of -x requests exclusive use of each node.
  2. The specification of span[ptile=1] requests that 1 MPI process be placed on each node. Thus, there are 16 nodes and each node gets 1 MPI process.
  3. The specification of qc means that you are requesting quad-core nodes. This makes it likely that you get nodes with the same type of cores and the same number of cores on each node. (If the nodes have different types of cores or different numbers of cores, then some nodes may be under-utilized.) You may change qc to dc to request dual-core nodes.
  4. The source step is necessary for setting up appropriate Hydra MPICH2 related environment variables. Depending on your situation, you may need to source a different file such as /usr/local/apps/env/pgi_mpich2_hydra-105.csh
  5. The command
    setenv OMP_NUM_THREADS `grep processor /proc/cpuinfo | wc -l`
    sets the environment variable OMP_NUM_THREADS to the number of cores on each node, regardless of how many cores the node has. This ensures that all the cores on each node are utilized.
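The counting step in item 5 can be tried on its own. The following POSIX sh sketch wraps it in a hypothetical helper, count_processors, that accepts an alternative file so it can be exercised away from the cluster. It anchors the match at the start of the line, which is slightly stricter than the grep in the script above, and it counts the logical processors the kernel reports.

```shell
#!/bin/sh
# count_processors [FILE] (hypothetical helper):
# count the "processor" entries in a cpuinfo-style file.
# FILE defaults to /proc/cpuinfo, as used in the script above.
count_processors() {
    grep -c '^processor' "${1:-/proc/cpuinfo}"
}

# Possible use on a Linux node (not executed here):
#   count_processors
```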

Job Queues and LSF

A number of LSF queues are configured on the henry2 cluster. Often the best queue will be selected without the user specifying a queue to the bsub command. In some cases LSF may override user queue choices and assign jobs to a more appropriate queue.

Jobs requesting 4 or fewer processors and 15 minutes or less time are assigned to the debug queue and run with minimal wait times. Once a user is satisfied a job is running well, more time will typically be requested.
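The debug queue limits above can be checked before submitting. The following POSIX sh sketch is only an illustration of the stated limits, not of how LSF itself assigns queues. Both helper names are hypothetical, and the -W value is assumed to be given either as minutes or as hours:minutes, as in the examples on this page.

```shell
#!/bin/sh
# w_to_minutes VALUE (hypothetical helper):
# convert a bsub -W value ("15" or "2:00") to minutes.
w_to_minutes() {
    echo "$1" | awk -F: '{ if (NF == 2) print $1 * 60 + $2; else print $1 + 0 }'
}

# debug_queue_eligible SLOTS WVALUE (hypothetical helper):
# print "yes" if the request is within the debug queue limits
# described above (4 or fewer processors, 15 minutes or less).
debug_queue_eligible() {
    minutes=$(w_to_minutes "$2")
    if [ "$1" -le 4 ] && [ "$minutes" -le 15 ]; then
        echo yes
    else
        echo no
    fi
}

debug_queue_eligible 4 15      # prints yes
debug_queue_eligible 4 2:00    # prints no
```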

Queues available to all users support jobs running on up to 128 processors for one day or jobs running for up to a week on up to 16 processors. Jobs that need up to two hours and up to 28 processors are run in a queue that has access to nearly all cluster nodes [generally the queues open to all users only have access to nodes that were purchased with central funding]. Jobs that require 28 or fewer processors (but more than 2 hours) are placed in the single chassis queue. Jobs in this queue are scheduled on nodes located within the same physical chassis - resulting in better message passing bandwidth and lower latency for messages.

Partners, those who have purchased nodes to add to the henry2 cluster, may add the bsub option -q partnerqueuename to place their job in the partner queue. Partner queues are dedicated for use of the partner and their project and have priority access to the quantity of processors the partner has added to the cluster.

A note on LSF job scheduling provides some additional details regarding how LSF is configured on the henry2 cluster.

LSF writes some intermediate files in the user's home directory as jobs are starting and running. If the user's disk quota has been exceeded, then the batch job will fail, often without any meaningful error messages or output. The quota command will display usage of the /home file system.