Queues#
SMU’s high-performance computing (HPC) clusters use SLURM to schedule and manage resources.
See also
For examples and tips on SLURM usage, see our SLURM documentation.
Note
The efficiency of SLURM and our HPC systems depends on reasonable resource requests. Whenever possible, request only the resources your job needs, including memory, number of cores, number of GPUs, and run time.
Requesting more than you need makes your job and everyone else's jobs run less efficiently.
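As a rough sketch, a batch script that follows this advice requests only the cores, memory, and wall time the job actually uses. The job name, script name, and resource values below are placeholders; pick the partition from the tables that follow.

```bash
#!/bin/bash
#SBATCH --job-name=my_analysis     # placeholder name
#SBATCH --partition=standard-s     # choose the queue that fits your job (see the tables below)
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4          # only the cores the job needs
#SBATCH --mem=16G                  # only the memory the job needs
#SBATCH --time=02:00:00            # only the run time the job needs

# my_script.py stands in for your own program; load your software environment first.
python my_script.py
```

Submit the script with sbatch.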
M3 Queues#
M3 has the following queues:
| Partition Name | Number of Nodes | Cores Per Node | Memory Per Node | Time Limit | Notes |
|---|---|---|---|---|---|
| dev | 4 | 128 | 500GB | 2 hours | |
| gpu-dev | 3 | 36 | 734GB | 2 hours | 8 Nvidia V100 GPUs per node |
| htc | 10 + 10 shared* | 128 | 500GB | 24 hours | *Shared nodes are listed in both htc and standard-s |
| standard-s | 136 + 10 shared* | 128 | 500GB | 24 hours | *Shared nodes are listed in both htc and standard-s |
| standard-l | 20 | 128 | 500GB | 7 days | |
| highmem | 8 | 128 | 2TB | 5 days | |
| dtn | 2 | 128 | 500GB | 7 days | Approval required |
All M3 nodes are identical and contain dual AMD EPYC 7763 64-Core Processors, with the exception of the nodes in the highmem partition, which have more memory.
- The dev nodes are meant for interactive and development work.
- The gpu-dev nodes are meant for interactive and development work; production and batch GPU jobs should run on the SuperPod.
- The htc nodes are for “High Throughput Computing” and are intended for jobs that launch a large number of tasks, each requiring a single core and a small amount of memory for a short time. This is also an appropriate queue for many interactive jobs requiring longer than 2 hours.
- The standard-s and standard-l nodes are where most jobs should be run, with the choice depending on job duration.
- The highmem nodes are for jobs requiring more than 500GB of memory.
- The dtn nodes are “Data Transfer Nodes.” Access requires approval. These nodes are meant only for transferring data to/from M3.
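For instance, the kind of workload htc targets is commonly submitted as a SLURM job array of small, short, single-core tasks. The sketch below assumes a hypothetical per-task script named process_chunk.sh and uses illustrative resource values.

```bash
#!/bin/bash
#SBATCH --job-name=htc_example     # placeholder name
#SBATCH --partition=htc
#SBATCH --array=0-99               # 100 independent array tasks
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1          # single core per task
#SBATCH --mem=2G                   # small memory footprint per task
#SBATCH --time=00:30:00            # short run time per task

# process_chunk.sh is hypothetical; each array element processes the piece
# of input selected by its SLURM_ARRAY_TASK_ID.
./process_chunk.sh "$SLURM_ARRAY_TASK_ID"
```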
Note
Select the queue most appropriate for your job rather than the one that currently has the most availability.
SuperPod (MP) Queues#
The SuperPod has 2 queues:
| Partition Name | Number of Nodes | Cores Per Node | Memory Per Node | GPUs Per Node | Time Limit |
|---|---|---|---|---|---|
| batch | 18 | 128 | 2TB | 8 Nvidia A100s | 48 hours |
| short | 2 | 128 | 2TB | 8 Nvidia A100s | 4 hours |
All SuperPod nodes are identical and contain dual AMD EPYC 7742 64-Core Processors.
We have partitioned the nodes on the SuperPod, so for every GPU you request, you will also receive 16 CPU cores and 96GB of memory automatically.
Note
CPU-only jobs should be run on M3, not the SuperPod.
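For example, a SuperPod batch job only needs to state how many GPUs it wants; the matching cores and memory follow from the per-GPU partitioning described above. The job name and Python script below are placeholders.

```bash
#!/bin/bash
#SBATCH --job-name=gpu_job         # placeholder name
#SBATCH --partition=batch
#SBATCH --nodes=1
#SBATCH --gres=gpu:2               # 2 GPUs, which also provide 32 cores and 192GB of memory
#SBATCH --time=24:00:00

# Load your software environment here, then run your GPU workload.
# train.py stands in for your own program.
python train.py
```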
Short queue#
The short queue is intended for interactive work, development, and other short tasks. The following job restrictions apply:
- 4 hour time limit
- Default allocation: 1 GPU, 16 cores, 96GB memory, 2 hours
- Default of 96GB of memory and 16 cores per GPU
- Max 2 GPUs
- Max 128GB per GPU
- Max 16 cores per GPU
- Max 1 running job per user
- SSH access to compute nodes is enabled for running jobs
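Within these limits, an interactive session can be started with srun; the resource values below simply spell out the defaults listed above.

```bash
# One GPU (plus 16 cores and 96GB of memory) on the short queue for 2 hours;
# --pty attaches an interactive shell on the compute node.
srun --partition=short --gres=gpu:1 --time=02:00:00 --pty bash

# While the job runs, you may also ssh to the assigned compute node,
# e.g. to watch GPU utilization with nvidia-smi.
```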
Special Requests#
Exceptions may be requested, for example, to run a job on all 20 SuperPod nodes. Approval for any such request will be the responsibility of the O’Donnell Data Science and Research Computing Institute and will be based on demonstrated need and the impact on other users.