The cluster currently provides several generations of NVIDIA GPUs, optimized for different workloads ranging from inference to large-scale training.
Nodes: catfish-01 … catfish-05
GPUs per node: 8 × NVIDIA L4
GPU memory: 24 GB per GPU
Profile:
Power-efficient, inference-optimized GPUs
Excellent for serving models, video processing, and light-to-medium AI workloads
Supports CUDA, TensorRT, and modern AI frameworks
Best for:
Inference
Small–medium model training
Cost-efficient GPU jobs
Nodes: salmon-01 … salmon-10
GPUs per node: 8 × NVIDIA L40S
GPU memory: 48 GB per GPU
Profile:
High-end workstation / data-center hybrid GPU
Strong FP32 / Tensor Core performance
Very good price/performance ratio
Best for:
AI training
Fine-tuning large models
Graphics, simulation, and mixed workloads
Nodes: goldfish-01
GPUs per node: 8 × NVIDIA H200
GPU memory: 141 GB HBM3e per GPU
Profile:
Top-tier NVIDIA data-center GPU
Extremely high memory bandwidth and capacity
Designed for very large models and HPC
Best for:
Large language models (LLMs)
Multi-GPU distributed training
Memory-intensive workloads
Nodes: dogfish-01, dogfish-02
GPUs per node: 8 × NVIDIA A100 (80 GB)
Status: DOWN / DRAINED
Note: These nodes are currently unavailable due to hardware issues.
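As a rough aid for choosing among the node types above, the per-GPU memory figures can be turned into a small shell helper. This is only a sketch: the function name pick_gpu_type is hypothetical, it looks at memory alone (not throughput, cost, or queue load), and the type names l4, l40s, and h200 are taken from the --gres examples in this document. The A100 nodes are left out because they are currently drained.

```shell
#!/bin/bash
# Hypothetical helper: suggest the smallest available GPU type whose
# per-GPU memory (from the list above) covers the requested amount in GB.
pick_gpu_type() {
    need_gb="$1"
    # Reject empty input and anything that is not a whole number.
    case "$need_gb" in
        ''|*[!0-9]*)
            echo "usage: pick_gpu_type <memory in GB>" >&2
            return 2
            ;;
    esac
    if   [ "$need_gb" -le 24 ];  then echo "l4"     # 24 GB per GPU
    elif [ "$need_gb" -le 48 ];  then echo "l40s"   # 48 GB per GPU
    elif [ "$need_gb" -le 141 ]; then echo "h200"   # 141 GB per GPU
    else
        echo "no single GPU has ${need_gb} GB; consider a multi-GPU job" >&2
        return 1
    fi
}

# Example: a job that needs about 40 GB on one GPU.
pick_gpu_type 40
```

The printed name can be used directly in a request such as --gres=gpu:l40s:1.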
srun) Request NVIDIA L4 GPUs
srun --gres=gpu:l4:1 --pty bash    # request 1 GPU
srun --gres=gpu:l4:4 --pty bash    # request 4 GPUs
Request NVIDIA L40S GPUs
srun --gres=gpu:l40s:1 --pty bash  # request 1 GPU
srun --gres=gpu:l40s:4 --pty bash  # request 4 GPUs
Request NVIDIA H200 GPUs
srun --gres=gpu:h200:1 --pty bash  # request 1 GPU
srun --gres=gpu:h200:4 --pty bash  # request 4 GPUs
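Once inside the interactive shell, it can be useful to confirm that the allocation matches the request. The sketch below assumes Slurm exports CUDA_VISIBLE_DEVICES for GPU jobs, which is common with --gres=gpu but depends on how the cluster is configured. count_visible_gpus is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: count the GPUs listed in CUDA_VISIBLE_DEVICES
# (a comma-separated list such as "0,1,2,3"). Prints 0 if unset or empty.
count_visible_gpus() {
    devices="${CUDA_VISIBLE_DEVICES:-}"
    if [ -z "$devices" ]; then
        echo 0
        return 0
    fi
    # Number of entries = number of commas + 1.
    commas=$(printf '%s' "$devices" | tr -cd ',' | wc -c)
    echo $((commas + 1))
}

echo "GPUs visible to this job: $(count_visible_gpus)"

# If the NVIDIA driver tools are on the PATH, list the devices as well.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -L
fi
```

If the count differs from the number requested with --gres, the variable may not be set on this cluster, and nvidia-smi -L is the more reliable check.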
sbatch) Example batch job requesting 2 L40S GPUs:
#!/bin/bash
#SBATCH --job-name=gpu_job
#SBATCH --output=output_%j.log
#SBATCH --error=error_%j.log
#SBATCH --time=1:00:00
#SBATCH --gres=gpu:l40s:2
python my_script.py
Example batch job requesting 1 H200 GPU (same script as above, with only the --gres line changed):
#SBATCH --gres=gpu:h200:1
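Because the two batch examples differ only in the --gres line, the script can be generated from a GPU type and a count. A minimal sketch follows. make_gpu_job is a hypothetical helper, my_script.py is the placeholder from the example above, and the limit of 8 reflects the per-node GPU counts listed in this document (it assumes a single-node job).

```shell
#!/bin/bash
# Hypothetical helper: print a batch script for <gpu_type> <gpu_count>.
make_gpu_job() {
    gpu_type="$1"
    gpu_count="$2"
    # Only the GPU types that are currently available are accepted.
    case "$gpu_type" in
        l4|l40s|h200) ;;
        *) echo "unknown GPU type: $gpu_type" >&2; return 1 ;;
    esac
    # Each node has 8 GPUs, so a single-node job can ask for 1 to 8.
    case "$gpu_count" in
        [1-8]) ;;
        *) echo "GPU count must be 1-8 (8 GPUs per node)" >&2; return 1 ;;
    esac
    cat <<EOF
#!/bin/bash
#SBATCH --job-name=gpu_job
#SBATCH --output=output_%j.log
#SBATCH --error=error_%j.log
#SBATCH --time=1:00:00
#SBATCH --gres=gpu:${gpu_type}:${gpu_count}
python my_script.py
EOF
}

# Print the script for 1 H200 GPU.
make_gpu_job h200 1
```

Redirect the output to a file (for example, make_gpu_job h200 1 > h200_job.sh) and submit it with sbatch h200_job.sh.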