Host Configuration In-Depth#

The hosts.cfg file defines the execution environments in which your tasks run: local machines, HPC clusters, and cloud resources, along with how to access the software environments on each.

Structure Overview#

A host configuration has the following overall structure:

[hostname]
scheduler = slurm
scratch_dir = /scratch/$USER

    [[patterns]]
    # Auto-detect this host

    [[queues]]
        [[[normal]]]
        # Queue definitions

    [[envs]]
        [[[myenv]]]
        # Environment configurations

Host Basics#

Minimal Host#

[local]
scheduler = background
scratch_dir = /tmp

Required Fields:

  • Host section name (e.g., [local], [hpc_cluster])

  • scheduler - How to run jobs (background, slurm, pbspro)

  • scratch_dir - Temporary working directory
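
The required fields above can be checked before a run. The following is a minimal sketch, assuming each host section has already been parsed into a plain dictionary; the function name `check_host` and the dictionaries are hypothetical and are not part of woom.

```python
# Illustrative check of the required fields listed above.
# The dictionaries are hand-written stand-ins for parsed hosts.cfg sections.
REQUIRED_FIELDS = ("scheduler", "scratch_dir")
KNOWN_SCHEDULERS = ("background", "slurm", "pbspro")


def check_host(name, host):
    """Return a list of problems found in one host section (empty if none)."""
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in host:
            problems.append(f"[{name}] is missing required field '{field}'")
    scheduler = host.get("scheduler")
    if scheduler is not None and scheduler not in KNOWN_SCHEDULERS:
        problems.append(f"[{name}] has unknown scheduler '{scheduler}'")
    return problems


local = {"scheduler": "background", "scratch_dir": "/tmp"}
broken = {"scheduler": "sge"}  # unknown scheduler, no scratch_dir

print(check_host("local", local))
for problem in check_host("broken", broken):
    print(problem)
```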

Host Selection#

Explicit Selection:

woom run --host hpc_cluster

Auto-Detection:

Configure patterns to auto-detect:

[datarmor]
scheduler = slurm

    [[patterns]]
    hostname = datarmor.*

When the machine's hostname matches the pattern, this host is selected automatically.

Multiple Hosts#

Define different configurations for different systems:

[laptop]
scheduler = background
scratch_dir = /tmp

[workstation]
scheduler = background
scratch_dir = /scratch

[university_cluster]
scheduler = slurm
scratch_dir = /scratch/$USER

[national_supercomputer]
scheduler = pbspro
scratch_dir = /work/$USER

Schedulers#

Background Scheduler#

For local execution without a batch scheduler:

[local]
scheduler = background
scratch_dir = /tmp

Characteristics:

  • Jobs run as background processes

  • No queuing system

  • Immediate execution

  • Good for development and testing

  • Limited parallelism (bounded by the local machine's resources)

SLURM Scheduler#

For systems using SLURM Workload Manager:

[slurm_cluster]
scheduler = slurm
scratch_dir = /scratch/$USER

    [[queues]]
        [[[normal]]]
        partition = compute
        qos = normal
        account = myproject

        [[[gpu]]]
        partition = gpu
        gres = gpu:4
        account = myproject

SLURM-Specific Options:

  • partition - SLURM partition name

  • qos - Quality of service

  • account - Billing account

  • reservation - Reservation name

  • gres - Generic resources (like GPUs)

  • constraint - Node constraints
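
To make the role of these options concrete, here is a rough sketch of how a queue definition could be turned into sbatch command-line flags. The flags themselves are standard sbatch options; the mapping table and the function `sbatch_args` are assumptions made for illustration and do not describe how woom builds its submission command.

```python
# Option name -> sbatch long flag. The flags exist in sbatch; the table
# itself is an illustration, not woom's internal mapping.
SBATCH_FLAGS = {
    "partition": "--partition",
    "qos": "--qos",
    "account": "--account",
    "reservation": "--reservation",
    "gres": "--gres",
    "constraint": "--constraint",
}


def sbatch_args(queue):
    """Build sbatch arguments for the options present in a queue definition."""
    args = []
    for key, flag in SBATCH_FLAGS.items():
        if key in queue:
            args.append(f"{flag}={queue[key]}")
    return args


# Stand-in for the [[[gpu]]] queue shown above.
gpu_queue = {"partition": "gpu", "gres": "gpu:4", "account": "myproject"}
print(" ".join(["sbatch"] + sbatch_args(gpu_queue) + ["job.sh"]))
```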

PBS Pro Scheduler#

For systems using PBS Professional:

[pbspro_cluster]
scheduler = pbspro
scratch_dir = /work/$USER

    [[queues]]
        [[[normal]]]
        queue_name = workq
        project = PROJ001

PBS-Specific Options:

  • queue_name - PBS queue name

  • project - Project code for accounting

Queue Configuration#

Queues define resource pools and policies.

Basic Queue#

[[queues]]
    [[[normal]]]
    partition = compute
    account = myaccount

Generic vs Scheduler-Specific#

Generic Options (work across schedulers):

[[[normal]]]
# Translated automatically to scheduler syntax

Scheduler-Specific Options (SLURM example):

[[[gpu_queue]]]
partition = gpu
gres = gpu:4
qos = high

Multiple Queues#

Define different queues for different resource needs:

[[queues]]
    [[[debug]]]
    partition = debug
    # Fast queue with limited time

    [[[normal]]]
    partition = compute
    account = project123

    [[[highmem]]]
    partition = highmem
    account = project123

    [[[gpu]]]
    partition = gpu
    gres = gpu:4
    account = project123

    [[[long]]]
    partition = compute
    qos = long
    account = project123

Queue Inheritance#

Avoid repetition with inheritance:

[[queues]]
    [[[base]]]
    account = myproject
    partition = compute

    [[[normal]]]
    inherit = base

    [[[highmem]]]
    inherit = base
    partition = highmem

    [[[gpu]]]
    inherit = base
    partition = gpu
    gres = gpu:4
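
One way to picture how `inherit` resolves is the sketch below. It assumes that a queue first takes every option from its parent and then applies its own values on top; that override order, and the helper `resolve_queue`, are assumptions for illustration rather than a description of woom's code.

```python
def resolve_queue(queues, name, _seen=()):
    """Return the options of queue `name` with inherited values filled in."""
    if name in _seen:
        raise ValueError(f"circular inheritance involving '{name}'")
    queue = dict(queues[name])
    parent = queue.pop("inherit", None)
    if parent is None:
        return queue
    resolved = resolve_queue(queues, parent, _seen + (name,))
    resolved.update(queue)  # the child's own values override inherited ones
    return resolved


# Stand-in for the [[queues]] section shown above.
queues = {
    "base": {"account": "myproject", "partition": "compute"},
    "normal": {"inherit": "base"},
    "highmem": {"inherit": "base", "partition": "highmem"},
    "gpu": {"inherit": "base", "partition": "gpu", "gres": "gpu:4"},
}

for name in ("normal", "highmem", "gpu"):
    print(name, resolve_queue(queues, name))
```

Under these assumptions, `normal` ends up identical to `base`, while `highmem` and `gpu` keep the shared account and replace only the partition.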

Environment Configuration#

Environments define software stacks for tasks.

No Environment#

Tasks can run without a special environment:

# No [[envs]] section needed
# Tasks with env = None use the default environment

Module Environment#

Load software via environment modules:

[[envs]]
    [[[ocean_model]]]
    modules = netcdf/4.8.1, hdf5/1.12.0, openmpi/4.1.1

    [[[python_env]]]
    modules = python/3.9, scipy-stack/2023a

Multiple Modules:

Modules in a comma-separated list are loaded in the order given.

Conda/Mamba Environment#

Activate conda environments:

[[envs]]
    [[[analysis]]]
    conda = analysis_env

    [[[forecast]]]
    mamba = forecast_env

Virtualenv/venv#

Activate Python virtual environments:

[[envs]]
    [[[python_analysis]]]
    venv = /home/user/envs/analysis

UV Virtual Environment#

Activate UV environments:

[[envs]]
    [[[fast_python]]]
    uv_venv = /home/user/.venv

Combined Environments#

Combine multiple environment types:

[[envs]]
    [[[full_stack]]]
    modules = gcc/11.2, openmpi/4.1
    conda = scientific_py
    exports = OMP_NUM_THREADS=8, MKL_NUM_THREADS=8
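
A combined environment can be thought of as a short shell preamble that runs before the task. The sketch below generates such a preamble from the `full_stack` definition. The ordering (modules, then conda, then exports) and the helper names `split_list` and `env_preamble` are assumptions for illustration; the script woom actually generates may differ.

```python
def split_list(value):
    """Split a comma-separated config value into stripped, non-empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def env_preamble(env):
    """Return shell lines that set up an environment (assumed ordering)."""
    lines = []
    for module in split_list(env.get("modules", "")):
        lines.append(f"module load {module}")
    if "conda" in env:
        lines.append(f"conda activate {env['conda']}")
    for assignment in split_list(env.get("exports", "")):
        lines.append(f"export {assignment}")
    return lines


# Stand-in for the [[[full_stack]]] environment shown above.
full_stack = {
    "modules": "gcc/11.2, openmpi/4.1",
    "conda": "scientific_py",
    "exports": "OMP_NUM_THREADS=8, MKL_NUM_THREADS=8",
}
print("\n".join(env_preamble(full_stack)))
```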

Raw Text Environments#

For complex setups, provide raw shell commands:

[[envs]]
    [[[custom]]]
    raw_text = '''
        source /opt/custom/setup.sh
        export CUSTOM_VAR=value
        module load special_software
        '''

Environment Variables#

Set variables in the environment:

[[envs]]
    [[[model_env]]]
    modules = netcdf/4.8
    exports = '''
        OMP_NUM_THREADS=16
        DATA_ROOT=/data/ocean
        MODEL_VERSION=v2.3
        '''
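
The examples in this document write `exports` both as a comma-separated line and as a multi-line block. As a sketch of how the two forms could be read into the same name/value pairs, the hypothetical helper `parse_exports` below splits on commas and newlines; it is an illustration, not woom's parser.

```python
import re


def parse_exports(text):
    """Split an exports value on commas or newlines into a {name: value} dict."""
    result = {}
    for item in re.split(r"[,\n]", text):
        item = item.strip()
        if not item:
            continue  # skip blank lines and trailing separators
        name, _, value = item.partition("=")
        result[name.strip()] = value.strip()
    return result


multi_line = """
    OMP_NUM_THREADS=16
    DATA_ROOT=/data/ocean
    MODEL_VERSION=v2.3
"""
one_line = "OMP_NUM_THREADS=8, MKL_NUM_THREADS=8"

print(parse_exports(multi_line))
print(parse_exports(one_line))
```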

Auto-Detection Patterns#

Configure pattern matching to automatically select hosts.

Hostname Pattern#

[university_hpc]
scheduler = slurm

    [[patterns]]
    hostname = login[0-9]+.hpc.university.edu

Matches: login1.hpc.university.edu, login2.hpc.university.edu, etc.
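
If patterns are interpreted as regular expressions (an assumption here, suggested by the `[0-9]+` syntax), an unescaped `.` matches any character, so the pattern above is slightly more permissive than it looks. The sketch below compares it with an escaped variant using Python's `re` module; whether woom uses full-string matching or prefix matching is not stated in this document.

```python
import re

# Pattern as written in the example, and a stricter variant with escaped dots.
pattern = r"login[0-9]+.hpc.university.edu"
escaped = r"login[0-9]+\.hpc\.university\.edu"

hosts = (
    "login1.hpc.university.edu",
    "login12.hpc.university.edu",
    "login1xhpc.university.edu",    # 'x' where a dot is expected
    "compute1.hpc.university.edu",  # not a login node
)

for host in hosts:
    loose = bool(re.fullmatch(pattern, host))
    strict = bool(re.fullmatch(escaped, host))
    print(f"{host:32} as written: {loose}  escaped: {strict}")
```

Escaping the dots is only needed if stray matches such as the third hostname are a realistic concern on your network.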

Environment Variables#

[national_center]
scheduler = pbspro

    [[patterns]]
    env_vars = CLUSTER_NAME=national_hpc

Matches when the environment variable CLUSTER_NAME is set to national_hpc.

Multiple Patterns#

Combine patterns; all of them must match (AND logic):

[specific_cluster]
scheduler = slurm

    [[patterns]]
    hostname = compute.*
    env_vars = SITE=facility_a

Complete Host Examples#

Example 1: Development Laptop#

[laptop]
scheduler = background
scratch_dir = /tmp/woom

    [[envs]]
        [[[python]]]
        conda = dev_env

Example 2: University SLURM Cluster#

[university_hpc]
scheduler = slurm
scratch_dir = /scratch/$USER

    [[patterns]]
    hostname = login.*.hpc.edu

    [[queues]]
        [[[debug]]]
        partition = debug
        account = course101
        # 30 min limit

        [[[normal]]]
        partition = compute
        account = research_proj
        qos = normal

        [[[highmem]]]
        partition = highmem
        account = research_proj

    [[envs]]
        [[[ocean_model]]]
        modules = gcc/11, netcdf-fortran/4.5, openmpi/4.1

        [[[python_analysis]]]
        modules = python/3.10, scipy/1.9
        exports = PYTHONUNBUFFERED=1

Example 3: National Supercomputer (PBS)#

[national_hpc]
scheduler = pbspro
scratch_dir = /work/$USER/scratch

    [[patterns]]
    hostname = login.*.national.gov

    [[queues]]
        [[[standard]]]
        queue_name = standard
        project = ATMO12345

        [[[large]]]
        queue_name = capability
        project = ATMO12345

    [[envs]]
        [[[intel_mpi]]]
        raw_text = '''
            module purge
            module load intel/2023
            module load impi/2021
            module load netcdf/4.9
            '''

        [[[analysis]]]
        modules = python/3.11
        venv = /work/$USER/venvs/analysis

Example 4: Multi-Site Configuration#

# Site A - SLURM
[site_a]
scheduler = slurm
scratch_dir = /scratch/$USER

    [[patterns]]
    hostname = login-a.*

    [[queues]]
        [[[normal]]]
        partition = compute
        account = proj_a

    [[envs]]
        [[[model]]]
        modules = netcdf/4.8, openmpi/4.1

# Site B - PBS
[site_b]
scheduler = pbspro
scratch_dir = /work/$USER

    [[patterns]]
    hostname = login-b.*

    [[queues]]
        [[[normal]]]
        queue_name = standard
        project = proj_b

    [[envs]]
        [[[model]]]
        modules = netcdf/4.7, mpt/2.25

# Local development
[local]
scheduler = background
scratch_dir = /tmp

    [[envs]]
        [[[model]]]
        conda = ocean_dev

Advanced Features#

Host Inheritance#

Share configuration between similar hosts:

[base_slurm]
scheduler = slurm

    [[queues]]
        [[[normal]]]
        account = myproject

[cluster_a]
inherit = base_slurm
scratch_dir = /scratch/a/$USER

    [[patterns]]
    hostname = login-a.*

[cluster_b]
inherit = base_slurm
scratch_dir = /scratch/b/$USER

    [[patterns]]
    hostname = login-b.*
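
Because hosts contain nested sections, inheriting from a base host amounts to a recursive merge. The sketch below assumes that nested sections such as [[queues]] are merged key by key and that the child's scalar values win; the helpers `deep_merge` and `resolve_host` are hypothetical and only illustrate that idea.

```python
import copy


def deep_merge(parent, child):
    """Merge child into a copy of parent, recursing into nested sections."""
    merged = copy.deepcopy(parent)
    for key, value in child.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def resolve_host(hosts, name):
    """Return host `name` with everything from its `inherit` chain applied."""
    host = dict(hosts[name])
    parent = host.pop("inherit", None)
    if parent is None:
        return copy.deepcopy(host)
    return deep_merge(resolve_host(hosts, parent), host)


# Stand-in for the base_slurm and cluster_a sections shown above.
hosts = {
    "base_slurm": {
        "scheduler": "slurm",
        "queues": {"normal": {"account": "myproject"}},
    },
    "cluster_a": {
        "inherit": "base_slurm",
        "scratch_dir": "/scratch/a/$USER",
        "patterns": {"hostname": "login-a.*"},
    },
}

print(resolve_host(hosts, "cluster_a"))
```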

Custom Scheduler Options#

Pass extra options to the scheduler:

[[queues]]
    [[[special]]]
    partition = compute
    # Custom SLURM options
    extra_options = --constraint=ib&haswell

Parameter Overrides#

Override workflow parameters per host:

[laptop]
scratch_dir = /tmp

    [[params]]
    nprocs = 4  # Laptop has fewer cores

[supercomputer]
scratch_dir = /scratch/$USER

    [[params]]
    nprocs = 1024  # Use many cores

Access from templates:

mpirun -n {{ params.nprocs }} ./model
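
The effect of a per-host override can be sketched as follows. The placeholder syntax resembles Jinja; to stay within the standard library, this sketch uses a small regex substitution as a stand-in for the real template engine. The workflow-level default of 16 and the helper `render` are assumptions for illustration.

```python
import re


def render(template, params):
    """Replace each {{ params.NAME }} placeholder with params[NAME]."""
    def substitute(match):
        return str(params[match.group(1)])
    return re.sub(r"\{\{\s*params\.(\w+)\s*\}\}", substitute, template)


workflow_defaults = {"nprocs": 16}  # hypothetical workflow-level default
host_params = {
    "laptop": {"nprocs": 4},
    "supercomputer": {"nprocs": 1024},
}

template = "mpirun -n {{ params.nprocs }} ./model"
for host, overrides in host_params.items():
    params = {**workflow_defaults, **overrides}  # host values take precedence
    print(host, "->", render(template, params))
```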

Best Practices#

  1. Use Auto-Detection: Configure patterns for automatic host selection

  2. Organize by Purpose: Group queues logically (debug, normal, long, highmem, gpu)

  3. Document Requirements: Comment what modules/software are needed

  4. Test Locally First: Have a local/background host for testing

  5. Use Inheritance: Avoid repeating common configurations

  6. Keep Secrets Out: Don’t put passwords or keys in configuration

  7. Environment Modules: Prefer modules over hardcoded paths

  8. Validate Accounts: Ensure account/project codes are correct

  9. Check Queue Limits: Know walltime and resource limits

  10. Version Control: Track host configs in git (without secrets)

Common Patterns#

Pattern 1: Development + Production#

[dev]
scheduler = background
scratch_dir = /tmp

[prod]
scheduler = slurm
scratch_dir = /scratch/$USER
    [[queues]]
        [[[normal]]]
        partition = compute

Pattern 2: Multi-Tier Queues#

[[queues]]
    [[[debug]]]
    # Fast, limited
    partition = debug

    [[[normal]]]
    # Standard
    partition = compute

    [[[long]]]
    # Extended time
    partition = compute
    qos = long

    [[[highmem]]]
    # More memory
    partition = highmem

    [[[gpu]]]
    # GPU access
    partition = gpu
    gres = gpu:4

Pattern 3: Software Stacks#

[[envs]]
    [[[gnu_stack]]]
    modules = gcc/11, openmpi/4, netcdf/4.8

    [[[intel_stack]]]
    modules = intel/2023, impi/2021, netcdf/4.9

    [[[python_stack]]]
    modules = python/3.10
    conda = analysis_env

Troubleshooting#

Auto-Detection Not Working#

Check:

  • The pattern matches the actual hostname: echo $HOSTNAME

  • The expected environment variables are set: env | grep CLUSTER

  • The pattern syntax contains no typos

  • Only one host matches (several matching hosts create ambiguity)

Module Load Fails#

Check:

  • The module exists: module avail modulename

  • Module dependencies are loaded first

  • The correct module version is specified

  • The module system is initialized

Environment Not Activated#

Check:

  • Conda/venv path is correct

  • Environment exists: conda env list

  • Correct environment type specified (conda vs mamba vs venv vs uv_venv)

  • Shell initialization allows activation

Jobs Not Submitting#

Check:

  • Queue/partition exists: sinfo (SLURM) or qstat -q (PBS)

  • Account/project is valid

  • Resource requests within limits

  • User has access to queue

Migration Guide#

Moving Between Systems#

When moving a workflow to a new system:

  1. Create new host configuration

  2. Test with a simple task: woom run --host newhpc

  3. Adjust paths (scratch_dir, data locations)

  4. Update queue/partition names

  5. Verify environment modules/software

  6. Test full workflow

Scheduler Migration#

Moving from SLURM to PBS (or vice versa):

  1. Change scheduler setting

  2. Update queue configuration (partition → queue_name, etc.)

  3. Test submission with a simple job

  4. Adjust resource request translations if needed

  5. Update any scheduler-specific custom options
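
Step 2 can be partly mechanized. The sketch below renames SLURM queue keys to their PBS counterparts and reports the options that have no direct equivalent. The `partition` to `queue_name` rename comes from the step above; pairing `account` with `project` is an assumption based on both being described as accounting fields, and `translate_queue` is a hypothetical helper.

```python
# Key renames when moving a queue definition from SLURM to PBS Pro.
# partition -> queue_name is taken from the step above; account -> project
# is an assumption.
SLURM_TO_PBS = {"partition": "queue_name", "account": "project"}


def translate_queue(slurm_queue):
    """Return (translated options, options needing manual review)."""
    translated, untranslated = {}, {}
    for key, value in slurm_queue.items():
        if key in SLURM_TO_PBS:
            translated[SLURM_TO_PBS[key]] = value
        else:
            untranslated[key] = value
    return translated, untranslated


slurm_long = {"partition": "compute", "qos": "long", "account": "project123"}
pbs_queue, needs_review = translate_queue(slurm_long)
print("PBS options:  ", pbs_queue)
print("Review by hand:", needs_review)
```

Only the key names carry over. The values (queue names, project codes) are site-specific and still have to be replaced with those of the target system.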

See Also#