Artifacts In-Depth#

Artifacts are output files that your tasks produce. Woom tracks artifacts to verify task completion, provide visibility into outputs, and help with workflow debugging and data management.

What Are Artifacts?#

Artifacts represent important output files that:

  • Indicate successful task completion

  • Serve as inputs to downstream tasks

  • Represent final products of your workflow

  • Need to be validated and tracked

Examples:

  • Model output files (NetCDF, HDF5)

  • Restart/checkpoint files

  • Analysis results (CSV, plots)

  • Log files

  • Processed data products

Why Track Artifacts?#

  1. Validation: Verify that tasks completed successfully by checking that their outputs exist

  2. Debugging: Quickly identify which files are missing

  3. Documentation: See what files your workflow produces

  4. Dependencies: Understand data flow between tasks

  5. Data Management: Identify files to archive or clean up

Basic Configuration#

Simple Artifact#

[run_model]
    [[artifacts]]
        [[[output]]]
        path = output.nc
        check = True

Fields:

  • path: File path (absolute or relative to run_dir)

  • check: Whether to verify file exists after task completes (default: True)

  • callable: Whether path names a registered generator function instead of a file (default: False)

Multiple Artifacts#

[run_model]
    [[artifacts]]
        [[[output]]]
        path = output.nc
        check = True

        [[[restart]]]
        path = restart.nc
        check = True

        [[[log]]]
        path = model.log
        check = False  # Optional file

Absolute vs Relative Paths#

Relative Paths#

Relative paths are resolved against the task's run_dir:

[task]
    [[content]]
    run_dir = /scratch/run

    [[artifacts]]
        [[[output]]]
        path = results/output.nc  # → /scratch/run/results/output.nc

Absolute Paths#

[[artifacts]]
    [[[shared_output]]]
    path = /data/shared/results.nc  # Absolute path
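The two cases above can be sketched in plain Python. This is an illustration only, not Woom's implementation: resolve_artifact_path is a hypothetical helper, and the rule it encodes (absolute paths pass through, relative ones are joined to run_dir) is taken from the field description earlier in this guide. POSIX path rules are assumed.

```python
import posixpath

def resolve_artifact_path(path, run_dir):
    """Return the absolute location of an artifact (hypothetical helper, not Woom API)."""
    # Absolute paths pass through; relative ones are anchored at run_dir.
    if posixpath.isabs(path):
        return posixpath.normpath(path)
    return posixpath.normpath(posixpath.join(run_dir, path))

relative = resolve_artifact_path("results/output.nc", "/scratch/run")
absolute = resolve_artifact_path("/data/shared/results.nc", "/scratch/run")
print(relative)  # /scratch/run/results/output.nc
print(absolute)  # /data/shared/results.nc
```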

Template Paths#

Use template variables:

[[artifacts]]
    [[[output]]]
    path = {{ task_run_dir }}/output_{{ cycle.token }}.nc

    [[[dated_output]]]
    path = /data/outputs/{{ cycle.date.year }}/{{ cycle.date.month }}/data.nc
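One detail worth checking in dated paths is zero padding. The sketch below assumes that cycle.date behaves like a Python datetime and that the template engine renders integers the way str() does; neither assumption is confirmed by this guide. Under those assumptions, a January cycle would produce a directory named 1 rather than 01.

```python
from datetime import datetime

# Stand-in for cycle.date; assumed here to behave like a datetime.
cycle_date = datetime(2020, 1, 5)

# Integer attributes stringify without zero padding...
unpadded = f"/data/outputs/{cycle_date.year}/{cycle_date.month}/data.nc"
# ...so a zero-padded directory needs an explicit format.
padded = f"/data/outputs/{cycle_date.year}/{cycle_date.month:02d}/data.nc"

print(unpadded)  # /data/outputs/2020/1/data.nc
print(padded)    # /data/outputs/2020/01/data.nc
```

If the task itself writes to a zero-padded directory, the artifact path has to be padded the same way, otherwise the check looks in the wrong place.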

Artifact Checking#

Mandatory Artifacts#

[[artifacts]]
    [[[critical_output]]]
    path = output.nc
    check = True  # Task fails if missing

If check = True and the file doesn't exist after the task completes, Woom marks the task as failed.

Optional Artifacts#

[[artifacts]]
    [[[debug_log]]]
    path = debug.log
    check = False  # Warning only if missing

If check = False, missing files generate warnings but don’t fail the task.
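The difference between the two settings can be sketched as follows. verify_artifacts is a hypothetical helper written for this illustration, not part of Woom; it only mirrors the rule described above (a missing file with check = True is a failure, a missing file with check = False is a warning).

```python
import os
import tempfile

def verify_artifacts(artifacts):
    """Classify missing artifacts (hypothetical helper, not Woom API).

    `artifacts` maps a name to (path, check). Returns two lists of names:
    missing required artifacts and missing optional artifacts.
    """
    failed, warned = [], []
    for name, (path, check) in artifacts.items():
        if os.path.exists(path):
            continue
        # check = True means failure, check = False means warning only.
        (failed if check else warned).append(name)
    return failed, warned

with tempfile.TemporaryDirectory() as run_dir:
    # Only restart.nc is created; output.nc and debug.log are absent.
    open(os.path.join(run_dir, "restart.nc"), "w").close()
    failed, warned = verify_artifacts({
        "critical_output": (os.path.join(run_dir, "output.nc"), True),
        "restart": (os.path.join(run_dir, "restart.nc"), True),
        "debug_log": (os.path.join(run_dir, "debug.log"), False),
    })

print(failed)  # ['critical_output']
print(warned)  # ['debug_log']
```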

Multiple Files#

Lists of Files#

[[artifacts]]
    [[[outputs]]]
    path = output1.nc, output2.nc, output3.nc
    check = True

Woom checks each file in the list.
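As a rough picture of what checking each file means, the sketch below splits a comma-separated value and reports the entries that are missing. missing_from_list is a hypothetical helper; depending on the configuration parser, the value may already arrive as a list, in which case the split is unnecessary.

```python
import os
import tempfile

def missing_from_list(value, run_dir):
    """Split a comma-separated path value and return the entries missing under run_dir."""
    names = [item.strip() for item in value.split(",") if item.strip()]
    return [n for n in names if not os.path.exists(os.path.join(run_dir, n))]

with tempfile.TemporaryDirectory() as run_dir:
    # Create two of the three expected files.
    for name in ("output1.nc", "output3.nc"):
        open(os.path.join(run_dir, name), "w").close()
    missing = missing_from_list("output1.nc, output2.nc, output3.nc", run_dir)

print(missing)  # ['output2.nc']
```

Note that a file name containing a comma cannot be expressed in this form.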

Wildcards (Not Supported)#

Woom doesn’t support glob patterns in paths. Use callable generators instead:

[[artifacts]]
    [[[all_outputs]]]
    path = generate_output_list
    callable = True
    check = True
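A generator that stands in for a glob pattern might look like the sketch below. The registry import and the func(context, **kwargs) calling convention are taken from this guide; the base_dir and pattern keyword arguments are hypothetical names chosen for the example, and the fallback dictionary exists only so the sketch runs where Woom is not installed.

```python
import glob
import os
import tempfile

try:
    from woom.tasks import ARTIFACTS_GENERATORS  # registry used throughout this guide
except ImportError:
    ARTIFACTS_GENERATORS = {}  # stand-in so the sketch runs without Woom installed

def generate_output_list(context, base_dir=".", pattern="*.nc"):
    """Return the files in base_dir matching pattern (both kwargs are hypothetical)."""
    return sorted(glob.glob(os.path.join(base_dir, pattern)))

ARTIFACTS_GENERATORS["generate_output_list"] = generate_output_list

with tempfile.TemporaryDirectory() as run_dir:
    for name in ("b.nc", "a.nc", "notes.txt"):
        open(os.path.join(run_dir, name), "w").close()
    found = [os.path.basename(p) for p in generate_output_list({}, base_dir=run_dir)]

print(found)  # ['a.nc', 'b.nc']
```

One design caveat: a glob only returns files that already exist, so a generator like this can document what a task produced but cannot flag an individual file as missing. When the expected names are known in advance, building the list explicitly keeps check = True meaningful.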

Dynamic Artifact Paths#

Using Templates#

Cycle-Dependent:

[[artifacts]]
    [[[daily_output]]]
    path = {{ task_run_dir }}/output_{{ cycle.date_str }}.nc

Member-Dependent:

[[artifacts]]
    [[[ensemble_output]]]
    path = {{ task_run_dir }}/output_{{ member.label }}.nc

Combined:

[[artifacts]]
    [[[result]]]
    path = {{ scratch_dir }}/{{ task_path }}/result_{{ cycle.token }}_{{ member.label }}.nc

Using Callable Generators#

For complex path generation:

tasks.cfg:

[[artifacts]]
    [[[ensemble_outputs]]]
    path = generate_ensemble_outputs
    callable = True
    check = True

        [[[[kwargs]]]]
        base_dir = {{ task_run_dir }}
        prefix = member

ext/artifacts_generators.py:

from woom.tasks import ARTIFACTS_GENERATORS

def generate_ensemble_outputs(context, base_dir, prefix):
    """Generate list of ensemble output files"""
    outputs = []
    if context.get('member'):
        # Single member - return its file
        member = context['member']
        outputs.append(f"{base_dir}/{prefix}_{member.label}.nc")
    else:
        # No member context - return all expected files
        workflow = context['workflow']
        for member in workflow.members:
            outputs.append(f"{base_dir}/{prefix}_{member.label}.nc")
    return outputs

ARTIFACTS_GENERATORS['generate_ensemble_outputs'] = generate_ensemble_outputs
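Because a generator is a plain function, it can be exercised without a running workflow. The sketch below condenses the generator above and calls it with stand-in objects; the SimpleNamespace members and the labels m01 to m03 are invented for the example and only provide the attributes the function reads.

```python
from types import SimpleNamespace

def generate_ensemble_outputs(context, base_dir, prefix):
    """Condensed version of the generator above, same logic."""
    member = context.get("member")
    members = [member] if member else context["workflow"].members
    return [f"{base_dir}/{prefix}_{m.label}.nc" for m in members]

# Stand-in objects: only the attributes the generator reads are provided.
members = [SimpleNamespace(label=f"m{i:02d}") for i in range(1, 4)]
workflow = SimpleNamespace(members=members)

single = generate_ensemble_outputs({"member": members[0]}, "/scratch/run", "member")
everything = generate_ensemble_outputs({"workflow": workflow}, "/scratch/run", "member")

print(single)      # ['/scratch/run/member_m01.nc']
print(everything)  # one path per member, m01 to m03
```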

Advanced: Time Series#

Generate daily file list:

import pandas as pd

from woom.tasks import ARTIFACTS_GENERATORS

def generate_daily_outputs(context, output_dir, pattern):
    """Generate daily output files for cycle"""
    outputs = []
    cycle = context.get('cycle')
    if not cycle:
        return outputs

    # One file per day; the cycle end date itself is excluded
    current_date = cycle.begin_date
    while current_date < cycle.end_date:
        filename = pattern.format(
            year=current_date.year,
            month=current_date.month,
            day=current_date.day
        )
        outputs.append(f"{output_dir}/{filename}")
        current_date += pd.Timedelta(days=1)

    return outputs

ARTIFACTS_GENERATORS['daily_outputs'] = generate_daily_outputs

Usage:

[[artifacts]]
    [[[daily_files]]]
    path = daily_outputs
    callable = True

        [[[[kwargs]]]]
        output_dir = {{ task_run_dir }}/daily
        pattern = output_{year:04d}{month:02d}{day:02d}.nc
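To see which names this pattern yields, the loop can be reproduced with the standard library alone. daily_names is a hypothetical helper that mirrors the generator's loop, with datetime.date standing in for the cycle dates; the dates themselves are arbitrary.

```python
from datetime import date, timedelta

pattern = "output_{year:04d}{month:02d}{day:02d}.nc"

def daily_names(begin, end, pattern):
    """File names for each day in [begin, end), mirroring the generator's loop."""
    names = []
    current = begin
    while current < end:
        names.append(pattern.format(year=current.year, month=current.month, day=current.day))
        current += timedelta(days=1)
    return names

names = daily_names(date(2020, 1, 30), date(2020, 2, 2), pattern)
print(names)
# ['output_20200130.nc', 'output_20200131.nc', 'output_20200201.nc']
```

The end date is exclusive, so a cycle ending on 2 February expects no file for that day.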

Viewing Artifacts#

Command Line#

List all artifacts:

woom show artifacts

Filter by task:

woom show artifacts --task-name run_model

Filter by cycle:

woom show artifacts --cycle 2020-01-01

From Python#

import os

# Get all artifacts for a task
artifacts = workflow.get_task_artifacts('run_model', cycle='2020-01-01')

for name, paths in artifacts.items():
    print(f"{name}: {paths}")

# Get the paths of a specific artifact
output_paths = workflow.get_task_artifact_paths(
    'output',
    'run_model',
    cycle='2020-01-01'
)

# Check that each path exists (assumes a list of paths is returned)
for path in output_paths:
    if os.path.exists(path):
        print(f"Output file exists: {path}")

Artifacts DataFrame#

# Get DataFrame of all artifacts
df = workflow.get_artifacts()
print(df)

# Filter
df_model = workflow.get_artifacts(task_name='run_model')

# Check existence
missing = df[~df['EXISTS?']]
print(f"Missing files: {len(missing)}")

Common Patterns#

Model Outputs#

[run_ocean_model]
    [[artifacts]]
        [[[output]]]
        path = {{ task_run_dir }}/ocean_{{ cycle.token }}.nc
        check = True

        [[[restart]]]
        path = {{ task_run_dir }}/restart_{{ cycle.end_date_str }}.nc
        check = True

        [[[diagnostics]]]
        path = {{ task_run_dir }}/diagnostics.nc
        check = False  # Optional

Analysis Results#

[compute_statistics]
    [[artifacts]]
        [[[stats]]]
        path = {{ scratch_dir }}/analysis/stats_{{ cycle.token }}.csv
        check = True

        [[[plots]]]
        path = plot_sst.png, plot_currents.png, plot_salinity.png
        check = False  # Plots are optional

Post-Processing#

[merge_outputs]
    [[artifacts]]
        [[[merged_file]]]
        path = /data/final/merged_{{ app.exp }}_{{ cycle.date.year }}.nc
        check = True

Data Download#

[download_forcing]
    [[artifacts]]
        [[[forcing_file]]]
        path = {{ params.forcing_dir }}/era5_{{ cycle.date_str }}.nc
        check = True

Ensemble Processing#

[ensemble_mean]
    [[artifacts]]
        [[[mean]]]
        path = {{ scratch_dir }}/ensemble/mean_{{ cycle.token }}.nc
        check = True

        [[[std]]]
        path = {{ scratch_dir }}/ensemble/std_{{ cycle.token }}.nc
        check = True

        [[[individual_members]]]
        path = list_ensemble_files
        callable = True
        check = False  # Don't fail if individual files missing

Artifact Best Practices#

  1. Track Important Outputs: Define artifacts for critical files only

  2. Use Meaningful Names: Artifact names should describe what the file contains

  3. Set Appropriate Check Flags:

       • check=True for required outputs

       • check=False for optional/debug files

  4. Use Template Variables: Make paths dynamic with cycle/member information

  5. Organize Output Directories: Use consistent directory structures

  6. Document Expectations: Comment what each artifact represents

  7. Validate Paths: Test that paths are correct before running workflow

  8. Handle Missing Gracefully: Use check=False for truly optional outputs

  9. Consider Downstream: Think about which files downstream tasks need

  10. Archive Strategy: Identify which artifacts to keep long-term

Troubleshooting#

Artifact Not Found#

Symptoms: Task marked as failed, “Artifact not found” message

Causes:

  1. Task didn’t create the file

  2. Wrong path in configuration

  3. File created in different location

  4. Permissions prevent access

Debug:

# Check what files task created
ls -R /path/to/run/dir

# Compare to expected artifact path
woom show artifacts --task-name my_task

# Check job output
cat jobs/*/my_task/job.out

Solutions:

  • Verify task command actually creates file

  • Check path template rendering

  • Ensure run_dir is set correctly

  • Use absolute paths if needed

  • Check file permissions
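When the file was created but in a different location, a search by file name often finds it faster than reading a full directory listing. The sketch below is a standalone helper, not a Woom command; find_misplaced and the directory layout in the example are hypothetical.

```python
import os
import tempfile

def find_misplaced(expected_path, search_root):
    """Return files under search_root that share the expected file name
    but live at a different path."""
    target = os.path.basename(expected_path)
    expected = os.path.abspath(expected_path)
    hits = []
    for dirpath, _dirnames, filenames in os.walk(search_root):
        if target in filenames:
            candidate = os.path.abspath(os.path.join(dirpath, target))
            if candidate != expected:
                hits.append(candidate)
    return sorted(hits)

with tempfile.TemporaryDirectory() as run_dir:
    # The task wrote to work/ while the artifact expects results/.
    os.makedirs(os.path.join(run_dir, "work"))
    open(os.path.join(run_dir, "work", "output.nc"), "w").close()
    expected = os.path.join(run_dir, "results", "output.nc")
    hits = [os.path.relpath(h, run_dir) for h in find_misplaced(expected, run_dir)]

print(hits)  # the copy under work/
```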

Path Template Errors#

Symptoms: Path doesn’t render correctly

Causes:

  • Syntax error in template

  • Variable undefined in context

  • Wrong variable used

Debug:

# Check rendered path
context = workflow.get_context(task_name='my_task', cycle='2020-01-01')
task = workflow.get_task('my_task')
task.set_context(context)
artifacts = task.render_artifacts()
print(artifacts)

Solutions:

  • Test template syntax

  • Verify variables exist in context

  • Use | default() for optional variables

  • Check for typos in variable names

Callable Not Working#

Symptoms: Artifact generator doesn't run or raises an error

Causes:

  • Function not registered

  • Wrong function signature

  • Runtime error in function

Debug:

# Check if registered
from woom.tasks import ARTIFACTS_GENERATORS
print('my_generator' in ARTIFACTS_GENERATORS)

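# Test function directly: build the context and kwargs by hand first
# (kwargs should mirror the artifact's [[[[kwargs]]]] section; values here are examples)
context = workflow.get_context(task_name='my_task', cycle='2020-01-01')
kwargs = {'base_dir': '/scratch/run', 'prefix': 'member'}
result = ARTIFACTS_GENERATORS['my_generator'](context, **kwargs)
print(result)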

Solutions:

  • Ensure function is registered in ARTIFACTS_GENERATORS

  • Check function signature matches: func(context, **kwargs)

  • Add error handling in generator function

  • Test with simple case first
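One way to add error handling without repeating it in every generator is a small decorator. The sketch below is an assumption-laden illustration: checked_generator is a hypothetical helper, and normalizing a single string to a one-element list is a choice made here, not documented Woom behavior.

```python
import functools

def checked_generator(func):
    """Wrap an artifact generator so failures name the generator (hypothetical helper)."""
    @functools.wraps(func)
    def wrapper(context, **kwargs):
        try:
            result = func(context, **kwargs)
        except Exception as exc:
            raise RuntimeError(
                f"artifact generator {func.__name__!r} failed: {exc}"
            ) from exc
        # Assumption: callers want a list of path strings.
        if isinstance(result, str):
            result = [result]
        result = list(result)
        if not all(isinstance(p, str) for p in result):
            raise TypeError(f"{func.__name__!r} must return path strings")
        return result
    return wrapper

@checked_generator
def my_generator(context, base_dir="."):
    return f"{base_dir}/output.nc"

@checked_generator
def broken_generator(context, **kwargs):
    return context["missing_key"]  # raises KeyError on an empty context

paths = my_generator({}, base_dir="/scratch/run")
print(paths)  # ['/scratch/run/output.nc']

try:
    broken_generator({})
except RuntimeError as err:
    message = str(err)
    print(message)
```

The wrapped function keeps the func(context, **kwargs) signature, so it can be registered in ARTIFACTS_GENERATORS like any other generator.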

Wrong Files Checked#

Symptoms: Task succeeds even though it didn't create the expected files

Causes:

  • check=False on critical artifacts

  • Wrong artifact configured

  • Files created with different names

Solutions:

  • Set check=True for required outputs

  • Verify artifact names match actual outputs

  • Use callable to list actual files created

  • Review task output logs

Performance Issues#

Symptoms: Artifact checking is slow

Causes:

  • Too many artifacts defined

  • Network file system latency

  • Large file lists from callables

Solutions:

  • Only track essential artifacts

  • Use check=False for non-critical files

  • Optimize callable generators

  • Consider aggregate artifacts (a single check on a directory instead of one check per file)
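For custom checks or generators that handle long file lists, one possible optimization is to list each parent directory once rather than querying every file separately. Whether this helps depends on the file system, and the sketch says nothing about how Woom performs its own checks; missing_paths is a hypothetical helper.

```python
import os
import tempfile
from collections import defaultdict

def missing_paths(paths):
    """Return the paths that do not exist, listing each parent directory once."""
    by_dir = defaultdict(list)
    for path in paths:
        by_dir[os.path.dirname(path) or "."].append(path)

    missing = []
    for directory, members in by_dir.items():
        try:
            present = set(os.listdir(directory))
        except OSError:
            present = set()  # directory absent or unreadable: all members missing
        missing.extend(p for p in members if os.path.basename(p) not in present)
    return missing

with tempfile.TemporaryDirectory() as run_dir:
    open(os.path.join(run_dir, "output1.nc"), "w").close()
    paths = [os.path.join(run_dir, n) for n in ("output1.nc", "output2.nc")]
    paths.append(os.path.join(run_dir, "no_such_dir", "output3.nc"))
    missing = [os.path.basename(p) for p in missing_paths(paths)]

print(missing)  # ['output2.nc', 'output3.nc']
```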

Integration with Workflow#

Artifacts as Dependencies#

Woom doesn't automatically create task dependencies from artifacts, but you can order your workflow stages so that a task producing an artifact runs before the tasks that consume it:

# Stage 1: Create data
[[prolog]]
prepare = download_forcing

# Stage 2: Use data
[[cycles]]
simulate = run_model  # Uses forcing from prepare

Checking Before Run#

import os

# Verify previous task's artifacts before running
artifacts = workflow.get_task_artifacts('previous_task', cycle='2020-01-01')

for name, paths in artifacts.items():
    for path in paths:
        if not os.path.exists(path):
            raise RuntimeError(f"Required input missing: {path}")

# Now safe to run dependent task
workflow.run()

Cleanup Strategy#

# Show all artifacts with existence status
woom show artifacts > artifact_inventory.txt

# Use artifact info for cleanup decisions
# Keep final products, remove intermediate files
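A cleanup decision can also be prepared from Python as a dry run. The sketch below assumes a dictionary shaped like the artifact mappings used earlier in this guide (artifact name to list of paths); plan_cleanup and the artifact names are hypothetical, and nothing is deleted.

```python
import os
import tempfile

def plan_cleanup(artifacts, keep_names):
    """Split existing artifact files into those to keep and removal candidates."""
    keep, remove = [], []
    for name, paths in artifacts.items():
        target = keep if name in keep_names else remove
        # Files that were never written need no cleanup, so skip them.
        target.extend(p for p in paths if os.path.exists(p))
    return keep, remove

with tempfile.TemporaryDirectory() as run_dir:
    artifacts = {
        "merged_file": [os.path.join(run_dir, "merged.nc")],
        "intermediate": [
            os.path.join(run_dir, "tmp1.nc"),
            os.path.join(run_dir, "tmp2.nc"),
        ],
    }
    for name in ("merged.nc", "tmp1.nc"):  # tmp2.nc was never written
        open(os.path.join(run_dir, name), "w").close()
    keep, remove = plan_cleanup(artifacts, keep_names={"merged_file"})
    keep = [os.path.basename(p) for p in keep]
    remove = [os.path.basename(p) for p in remove]

print(keep)    # ['merged.nc']
print(remove)  # ['tmp1.nc']
```

Reviewing the removal list before deleting anything keeps final products safe if an artifact name was misclassified.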

See Also#