Using job launchers (Slurm...)¶

RAITAP supports launcher-based remote execution via Hydra's Submitit plugin.

This is useful for Slurm-managed HPC clusters, shared GPU servers, and other environments where you want Hydra to submit or fan out jobs for you.

Prerequisites¶

Install the launcher extra when setting up RAITAP:

uv add "raitap[launcher]"

pip

pip install "raitap[launcher]"

Ensure you have access to a Slurm-managed remote server and know your:
- Partition name (e.g., gpu, compute)
- Account name (required by most clusters for job accounting)

Slurm sweeps¶

Use Hydra's submitit_slurm launcher for multi-runs: each combination of config choices becomes a Slurm array task.
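To make "each combination becomes an array task" concrete, here is a minimal sketch of how comma-separated sweep values expand into a Cartesian product. It only handles the plain `key=a,b` form, not Hydra's richer sweep syntax. The helper name `expand_sweep` and the `data.source=set_a,set_b` axis are hypothetical, and the assumption that job numbers follow this enumeration order is illustrative only.

```python
from itertools import product


def expand_sweep(overrides):
    """Expand comma-separated override values into one override list per job."""
    keys, choices = [], []
    for item in overrides:
        key, _, values = item.partition("=")
        keys.append(key)
        choices.append(values.split(","))
    # Cartesian product: one job per combination of choices.
    return [
        [f"{key}={value}" for key, value in zip(keys, combo)]
        for combo in product(*choices)
    ]


# Two axes with two choices each give four combinations.
jobs = expand_sweep(["+transparency=captum,shap", "data.source=set_a,set_b"])
for num, job in enumerate(jobs):
    print(num, job)
```

Counting the combinations this way before submitting is a quick check that the sweep is the size you expect.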

YAML example¶

Compose the sweep, launcher selection, and Slurm resources in one experiment YAML. Use Hydra's defaults list to choose the dataset, transparency presets, and the submitit_slurm launcher; see Composing YAML files. Submitit's Slurm resource fields live under hydra.launcher.

Note

Slurm launching is CLI-only: the hydra.launcher / hydra.sweep blocks shown on this page are Hydra-plugin configuration with no Python-API equivalent. To submit jobs from Python, invoke the raitap CLI through subprocess.

# assessment.yaml
defaults:
  - raitap_schema   # bind AppConfig schema — required for unset optional fields
  - _self_
  - override hydra/launcher: submitit_slurm # this is a preset from the plugin

transparency:
  captum:
    algorithm: IntegratedGradients
    call:
      target: 0
    visualisers:
      - _target_: CaptumImageVisualiser
  shap:
    algorithm: GradientExplainer
    call:
      target: 0
    visualisers:
      - _target_: ShapImageVisualiser

data:
  source: my_dataset

hydra:
    launcher:
        partition: gpu
        account: myproject
        timeout_min: 240
        cpus_per_task: 8
        gpus_per_node: 1
        mem_gb: 48
        array_parallelism: 8
    sweep:
        dir: outputs/my_experiment/${now:%Y-%m-%d}/${now:%H-%M-%S} # see the output section lower on this page
        subdir: ${hydra.job.num}

uv run raitap --multirun --config-name assessment

pip

raitap --multirun --config-name assessment
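For Python users, the subprocess route mentioned in the note above might look like the following minimal sketch. It assumes the raitap executable is on PATH and only reuses flags shown on this page. The helpers `build_multirun_command` and `submit` are hypothetical names, not part of RAITAP.

```python
import shlex
import subprocess


def build_multirun_command(config_name, overrides=()):
    """Build the argv list for a RAITAP multirun (hypothetical helper)."""
    # Same flags as the CLI examples on this page, followed by any Hydra overrides.
    return ["raitap", "--multirun", "--config-name", config_name, *overrides]


def submit(config_name, overrides=()):
    """Run the CLI; Hydra's launcher plugin performs the actual Slurm submission."""
    cmd = build_multirun_command(config_name, overrides)
    print("Running:", shlex.join(cmd))
    # check=True raises CalledProcessError if the CLI exits with a non-zero status.
    return subprocess.run(cmd, check=True)


# Example call (requires raitap to be installed):
# submit("assessment", ["hydra.launcher.partition=gpu"])
```

Passing the command as a list rather than a single string avoids shell quoting problems with overrides that contain special characters.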

CLI override example¶

The same sweep and resource knobs expressed as overrides:

uv run raitap \
  --multirun \
  hydra/launcher=submitit_slurm \
  +transparency=captum,shap \
  "transparency.captum.algorithm=IntegratedGradients" \
  "transparency.shap.algorithm=GradientExplainer" \
  hydra.launcher.partition=gpu \
  hydra.launcher.account=myproject \
  hydra.launcher.timeout_min=240 \
  hydra.launcher.cpus_per_task=8 \
  hydra.launcher.gpus_per_node=1 \
  hydra.launcher.mem_gb=48 \
  hydra.launcher.array_parallelism=8

pip

raitap \
  --multirun \
  hydra/launcher=submitit_slurm \
  +transparency=captum,shap \
  "transparency.captum.algorithm=IntegratedGradients" \
  "transparency.shap.algorithm=GradientExplainer" \
  hydra.launcher.partition=gpu \
  hydra.launcher.account=myproject \
  hydra.launcher.timeout_min=240 \
  hydra.launcher.cpus_per_task=8 \
  hydra.launcher.gpus_per_node=1 \
  hydra.launcher.mem_gb=48 \
  hydra.launcher.array_parallelism=8

YAML keeps Slurm settings next to the rest of the experiment under version control. CLI overrides are handy for one-off runs or quick resource adjustments.

Reusing the same launcher preset¶

If you run sweeps across several experiments on the same cluster, extract the hydra.launcher.* settings into a standalone YAML file.

# my_launcher.yaml

# @package hydra.launcher
defaults:
  - submitit_slurm

submitit_folder: ${hydra.sweep.dir}/.submitit/%j
timeout_min: 240
cpus_per_task: 8
gpus_per_node: 1
tasks_per_node: 1
mem_gb: 48
nodes: 1
name: ${hydra.job.name}
partition: gpu
account: myproject
qos: null
constraint: null
gres: null
array_parallelism: 8
signal_delay_s: 120

Reference it in your experiment YAML in place of submitit_slurm and the inline hydra.launcher.* block:

# assessment.yaml
defaults:
  - _self_
  - override hydra/launcher: my_launcher  # replaces submitit_slurm + inline hydra.launcher.*

# ...your options

The run command is unchanged:

uv run raitap --multirun --config-name assessment

pip

raitap --multirun --config-name assessment

Remote environment setup¶

Many shared servers and HPC systems require some environment setup before running jobs. Here's a typical workflow:

1. Configure UV cache location (if using uv):

Some clusters share /tmp between users or enforce quotas on home directories. Set a user-specific cache location:

export UV_CACHE_DIR="/cluster/home/$USER/.uv"

2. Load required modules:

# Load Python (adjust version to what's available; raitap supports 3.11+)
module load python/3.13.2

# Load uv if provided as a module
VENV=raitap-env module load uv/0.6.12

3. Add dependencies:

See 2. Pick assessment extras for the dependencies you need to add. Below is an example:

uv add "raitap[launcher,torch-cuda,captum,shap,metrics]"

4. Submit your multirun:

uv run raitap --multirun --config-name assessment

pip

raitap --multirun --config-name assessment

Best practices¶

Resource management:

- Set timeout_min generously; some explainability methods (especially SHAP) can take quite a while
- Adjust mem_gb based on your model size and batch configuration
- Use cpus_per_task to match the number of dataloader workers
- For single-GPU jobs, set gpus_per_node=1, tasks_per_node=1, nodes=1
- Use array_parallelism to limit concurrent jobs (e.g., 8 for an 8-GPU node)
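As a rough planning aid for the settings above, the sketch below estimates how many sequential batches a sweep needs under a given array_parallelism, and the worst-case running time if every task runs until its timeout. The function names are hypothetical, and the bound ignores time spent waiting in the Slurm queue.

```python
import math


def estimate_waves(num_jobs, array_parallelism):
    """Number of sequential batches when at most array_parallelism tasks run at once."""
    if array_parallelism < 1:
        raise ValueError("array_parallelism must be at least 1")
    return math.ceil(num_jobs / array_parallelism)


def worst_case_minutes(num_jobs, array_parallelism, timeout_min):
    """Upper bound on running time (queue wait excluded) if every task hits its timeout."""
    return estimate_waves(num_jobs, array_parallelism) * timeout_min


print(estimate_waves(20, 8))           # 3
print(worst_case_minutes(20, 8, 240))  # 720
```

In practice Slurm starts the next array task as soon as a slot frees up, so real sweeps usually finish well under this bound.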

Output¶

Each job in the array writes its results under hydra.sweep.dir. Configure it in your experiment YAML to organize outputs by experiment:

hydra:
  sweep:
    dir: outputs/my_experiment/${now:%Y-%m-%d}/${now:%H-%M-%S}
    subdir: ${hydra.job.num}

This creates a structure like:

outputs/my_experiment/2026-04-13/14-30-00/
├── 0/                                     # First job in the array
│   ├── transparency/
│   └── .hydra/
├── 1/                                     # Second config combination
├── ...
└── .submitit/
    ├── 12345_0_log.out
    ├── 12345_0_log.err
    └── 12345_submission.sh
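Given the layout above, a small script can report which numbered job directories have produced output. This is a minimal sketch: the helper name `summarise_sweep` is hypothetical, and treating the presence of a transparency/ folder as a sign that the job produced results is an assumption, not a RAITAP guarantee.

```python
from pathlib import Path


def summarise_sweep(sweep_dir):
    """Map each numbered job directory to whether it contains a transparency/ folder."""
    # Job directories are named after ${hydra.job.num}; .submitit and other entries are skipped.
    job_dirs = [
        path for path in Path(sweep_dir).iterdir()
        if path.is_dir() and path.name.isdigit()
    ]
    return {
        int(path.name): (path / "transparency").is_dir()
        for path in sorted(job_dirs, key=lambda path: int(path.name))
    }


# Example call, using the sweep directory from the tree above:
# summarise_sweep("outputs/my_experiment/2026-04-13/14-30-00")
```

Jobs that map to False are a starting point for checking the matching log files under .submitit/.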

Useful Slurm advice¶

The following tips apply to Slurm generally and are not specific to RAITAP.

Data staging:

- For large datasets, consider copying data to fast local storage before processing
- Many clusters provide job-local scratch space (e.g., /scratch, /tmp)
- Use a setup script in your Slurm job to handle staging logic
- Clean up temporary files after job completion
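If you prefer to handle staging inside Python rather than in a shell setup script, the copy-then-clean-up pattern from the list above could be sketched as a context manager. The name `staged_copy` is hypothetical, and the assumption that TMPDIR points at job-local scratch depends on your cluster; pass scratch_root explicitly if it does not.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def staged_copy(source, scratch_root=None):
    """Copy source into a fresh scratch directory and delete the copy on exit."""
    # Assumption: TMPDIR points at job-local scratch; otherwise use the system default.
    root = scratch_root or os.environ.get("TMPDIR") or tempfile.gettempdir()
    workdir = Path(tempfile.mkdtemp(prefix="staged-", dir=root))
    try:
        target = workdir / Path(source).name
        shutil.copytree(source, target)
        yield target
    finally:
        # Clean up even if the processing step raises.
        shutil.rmtree(workdir, ignore_errors=True)


# Example call (the dataset path is a placeholder):
# with staged_copy("/path/to/dataset") as local_data:
#     ...  # read from local_data instead of the shared filesystem
```

The try/finally block ensures the scratch copy is removed whether or not processing succeeds.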

Monitoring jobs:

# Check job status
squeue -u $USER

# Check detailed job info
scontrol show job <job-id>

# Cancel a job / all your jobs
scancel <job-id>
scancel -u $USER

# View job logs (Hydra writes these under the sweep dir)
tail -f outputs/multirun/.../.submitit/<job-id>/<job-id>_0_log.out

For more details on Submitit configuration, see the Hydra Submitit documentation.