Slurm is a job scheduling system that distributes jobs to several machines while respecting given constraints. Extensive documentation and tutorials can be found here.
All IBR users who use, intend to use, or are interested in using the IBR Slurm infrastructure should subscribe to the slurm-users mailinglist. Members of the IBR LDAP group ibrslurm are automatically subscribed. Former mailinglist messages can be found in the public archive.
The CM and ALG research groups supplied servers (nodes) to our Slurm pool. Partitions and features are used to decide on which machines jobs get executed. Partitions are named after the research groups (cm, alg), and under usual circumstances users should only use their own group's partition.
Note that some servers are not used exclusively for Slurm; some also serve as GitLab runners or are used through direct SSH access. Check the exit codes of all your Slurm jobs: they may occasionally fail due to varying resource limitations.
On the other hand, some servers are only accessible to members of specific groups, e.g. algusers may access ALG nodes, and only members of the group ibrslurm may access some CM nodes.
You should avoid addressing nodes explicitly, since the list of nodes, as well as their current availability, may vary over time. Instead, make use of feature attributes when submitting your jobs. You may also want to coordinate your Slurm cluster usage through the mailinglist, especially for larger jobs and if you intend to use nodes of a "foreign" partition.
The command sinfo (or sinfo -Nl for more detail) gives an overview of the existing compute resources.
steinb@x1 ~/ 1094 $ sinfo -Nl
Mon Jun 26 14:50:00 2023
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
algpc01 1 alg drained* 4 4:1:1 62000 0 1 alggen02 ibr-shutdown-daemon
algpc02 1 alg drained* 4 4:1:1 62000 0 1 alggen02 ibr-shutdown-daemon
algpc03 1 alg drained* 4 4:1:1 62000 0 1 alggen02 ibr-shutdown-daemon
algpc04 1 alg drained* 4 4:1:1 62000 0 1 alggen02 ibr-shutdown-daemon
algpc05 1 alg drained* 4 4:1:1 32000 0 1 alggen01 ibr-shutdown-daemon
algpc06 1 alg drained* 4 4:1:1 32000 0 1 alggen01 ibr-shutdown-daemon
algpc07 1 alg drained* 4 4:1:1 32000 0 1 alggen01 ibr-shutdown-daemon
algpc08 1 alg drained* 4 4:1:1 32000 0 1 alggen01 ibr-shutdown-daemon
algrtx01 1 alg idle 32 32:1:1 120000 0 1 alggen04 none
algry01 1 alg drained* 16 16:1:1 120000 0 1 alggen03 ibr-shutdown-daemon
algry02 1 alg drained* 16 16:1:1 120000 0 1 alggen03 ibr-shutdown-daemon
algry03 1 alg drained* 16 16:1:1 120000 0 1 alggen03 ibr-shutdown-daemon
algry04 1 alg drained* 16 16:1:1 120000 0 1 alggen03 ibr-shutdown-daemon
crunch1 1 cm* idle 64 2:16:2 123000 800000 1 amd none
crunch2 1 cm* idle 64 2:16:2 60000 350000 1 intel,fa none
crunch3 1 cm* idle 32 1:16:2 60000 350000 1 amd,fast none
i1 1 cm* idle 16 2:4:2 28000 200000 1 intel none
i2 1 cm* idle 16 2:4:2 28000 700000 1 intel none
i3 1 cm* idle 8 1:4:2 28000 3000000 1 fastsing none
i4 1 cm* idle 36 1:18:2 248000 3000000 1 huge,fas none
i5 1 cm* idle 36 1:18:2 248000 3000000 1 huge,fas none
iz1202-01 1 cm* idle 8 1:4:2 20000 400000 1 intel,gp none
iz1202-02 1 cm* idle 8 1:4:2 20000 400000 1 intel,gp none
iz1202-03 1 cm* idle 8 1:4:2 20000 400000 1 intel,gp none
iz1202-04 1 cm* drained* 8 1:4:2 20000 400000 1 intel,gp ibr-shutdown-daemon
paccrunch 1 cm* idle 76 1:76:1 292000 1500000 1 fastmult none
Currently available features (addressable via the --constraint option) are:
- intel
- amd
- alggen1, alggen2, alggen3
- fastsingle: high single-core performance
- fastmulti: high multi-core performance
- huge: lots of RAM
- gpu: CUDA available
- tmpssd: large and fast SSD available at /opt/tmpssd
- sqm: students using these nodes are invited to use ibr-slurm-renice with negative values. Staff members should add students to the group ibrslurm.
Avoid addressing explicit nodes. Instead express your jobs' requirements through features and other resource constraints.
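As an illustration (not an official template; the job name, script contents and chosen feature are made up), a batch script could request resources via a partition and a feature like this:
#!/bin/bash
# placeholder job name; pick something common to your batch of jobs
#SBATCH --job-name=mysim
# use your own group's partition (cm or alg)
#SBATCH --partition=cm
# request a feature instead of a specific node
#SBATCH --constraint=fastsingle
# one log file per job id
#SBATCH --output=slurm-%j.out

srun ./my-simulation    # ./my-simulation is a placeholder for your program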
Remember to run ibr-slurm-prepare when the constraint nfs3 is not given, so that your NFS volumes remain usable throughout the job lifetime on NFSv4 nodes.
Consider running your jobs with appropriate nice values. If you have a good reason to run your jobs with a higher priority, you can use ibr-slurm-renice --nice=VALUE JOBNAME with a negative VALUE and the common job name of your spooled jobs. You may check current priorities with sprio -l.
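For example, you could submit jobs at reduced priority and later raise the priority of those still pending. This is only a sketch; the job name mysim and the script job.sh are placeholders:
$ sbatch --nice=100 --job-name=mysim job.sh
$ ibr-slurm-renice --nice=-50 mysim
$ sprio -l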
You should almost always try to estimate upper bounds of your jobs' memory needs and specify them through the --mem option. This allows the Slurm scheduler to run multiple jobs in parallel on multi-core nodes and therefore speed up large job arrays significantly.
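As a sketch, a job array whose tasks each need at most about 4 GB of memory could be submitted like this (array size, memory bound and script name are assumptions):
$ sbatch --array=0-99 --mem=4G --job-name=mysim job.sh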
In case of I/O-intensive jobs, try to avoid the slow and congestion-prone NFS volumes and prefer local storage, e.g. /opt/tmp/slurm, if possible. Those paths can later be accessed as /net/<node>/opt/tmp/slurm for post-processing on any host.
Use /tmp within your jobs for local temporary storage. It exists solely for the lifetime of each job.
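A sketch combining both kinds of local storage inside a job script (the file names, programs and the node name i4 are made up):
# scratch data that should vanish with the job goes to /tmp
./preprocess input.dat > /tmp/intermediate.dat
# results that should survive the job go to node-local storage
mkdir -p /opt/tmp/slurm/$USER
./my-simulation /tmp/intermediate.dat > /opt/tmp/slurm/$USER/result-$SLURM_JOB_ID.txt
# later, from any host, fetch the results for post-processing (12345 is a placeholder job id)
cp /net/i4/opt/tmp/slurm/$USER/result-12345.txt ~/Simulation-Data/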
Consider asking any Slurm questions on the mailinglist. Also use the mailinglist to announce urgent job schedules or to ask for jobs to be rearranged if required.
Dominik Krupke (ALG) developed slurminade, a Python module that gives Python programmers an elegant way to instrument their Python code for distributed Slurm processing. It is installed on all IBR hosts.
Servers located in the closed server room use NFSv3 and a fast switch interconnect. They should be preferred in most cases within the cm partition. A reasonable way to address them is --partition cm --constraint=nfs3. However, if you also intend to use nodes that access your code, logs, results, etc. via NFSv4, you have to deal with the fact that NFS access requires a valid Kerberos ticket. This can become annoying, since at first glance things seem to work fine for some time after you have logged into a node once, but tickets get destroyed upon SSH logout and cached NFS credentials expire some time later.
We think a common best practice is to manually store a fresh Kerberos credentials cache with a reasonable ticket lifetime on each target node. A simple way to achieve this is the command ibr-slurm-prepare. It takes a list of host and/or partition names as arguments:
$ ibr-slurm-prepare cm
sent wake packet to iz1202-04 (14:b3:1f:02:1d:84)
waiting for node iz1202-04............................................................... FAILED
no need to prepare node crunch1
no need to prepare node crunch2
no need to prepare node i1
no need to prepare node i2
no need to prepare node i3
no need to prepare node i4
no need to prepare node i5
preparing node iz1202-01... /tmp/krb5cc_1659_slurm expires 29/06/23 13:15:59
preparing node iz1202-02... /tmp/krb5cc_1659_slurm expires 29/06/23 13:15:59
preparing node iz1202-03... /tmp/krb5cc_1659_slurm expires 29/06/23 13:15:59
not preparing node iz1202-04 due to unexpected state
no need to prepare node paccrunch
Nodes that could not be prepared and remain drained will simply not be used by the Slurm scheduler.
This section explains different approaches to storing the output data of your simulations. The first approach uses the SSDs of specific nodes; the second uses your own IBR home directory, e.g. ~/Simulation-Data/.
Servers i3-i5 have semi-temporary SSD storage, mounted locally at /opt/tmpssd/. Furthermore, each server directly mounts the SSDs of all other servers via NFSv3, so that processes running on different servers can still write to the same location. For example, the SSD of i5 can be accessed from i4 via /net/i5/opt/tmpssd. This storage space is persistent and not cleaned up automatically; outdated data is expected to be moved to the NFS for archiving or deleted.
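As a sketch (the run directory name is made up), a job running on i5 could write to its local SSD, and the data could later be collected from any other node via the cross-mount:
# on i5, inside the job
mkdir -p /opt/tmpssd/$USER/run42
./my-simulation > /opt/tmpssd/$USER/run42/output.csv
# later, from i4 or any other host
ls /net/i5/opt/tmpssd/$USER/run42/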
This storage is very useful for storing large simulation results temporarily. The following approach has therefore proven to be practical: write your results to the local SSD during the job, and afterwards move the data you want to keep to the NFS for archiving (e.g. /ibr/messdaten/) or delete it.
Saving your simulation data inside your home directory has the benefit that your simulation data is automatically synchronized between the different nodes and that all data is always accessible at the same location. You therefore don't have to make sure that a specific network-mounted path is available on a specific node (you still have to use ibr-slurm-prepare), and you don't need to copy data between different nodes. However, this is only possible because the home directory itself is network mounted. Therefore, you should not use this approach if your simulations constantly write huge amounts of data into the output folder, since this may cause congestion in the network.
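A minimal sketch of this approach (the directory name Simulation-Data and the simulation program are placeholders):
# inside the job script: write directly into the network-mounted home directory
mkdir -p ~/Simulation-Data/run-$SLURM_JOB_ID
./my-simulation > ~/Simulation-Data/run-$SLURM_JOB_ID/output.csv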
One thing you have to keep in mind is the limited size of your home directory. Using ibr-quota you can see how large your home directory quota is and how much space is still available. If your simulation data causes the limit to be reached, certain functionalities (like e-mail) may stop working.
Another disadvantage are the ZFS snapshots that are constantly created to back up your home directory. Since they are incremental, they can grow very large, especially when you are constantly deleting old data and running new simulations. The command ibr-quota shows you how much storage is available to you, how much you are using, and how much you are using including your snapshots. You may run into a situation where you are using 50% of your quota for snapshots alone. To fix this, you can run ibr-snapshot-clean -k {N} to clean old snapshots, where N is the number of snapshots you want to keep. Setting N to zero removes all snapshots.
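For example (keeping 5 snapshots is an arbitrary choice):
$ ibr-quota                  # shows quota, current usage, and usage including snapshots
$ ibr-snapshot-clean -k 5    # keep only the 5 most recent snapshots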