william le roux edited this page 2026-04-01 11:13:16 +03:00

Storage And Artifacts

This document describes the repository's current storage model and artifact contract.

Artifact Flow At A Glance

flowchart LR
	Intake[Intake] --> Source[source_docx]
	Source --> Preflight[preflight_report]
	Source --> Extraction[extraction_manifest]
	Extraction --> QA[qa_report]
	QA --> Review[review_tmx or review_import_tmx]
	QA --> Reassembly[reassembly_manifest]
	Reassembly --> Final[final_docx]
	Review --> Replay[replay_package]

Storage Backends

The checked-in code supports two artifact storage backends:

  • local
  • s3

local uses Django's filesystem storage under IRIS_MEDIA_ROOT. s3 uses storages.backends.s3boto3.S3Boto3Storage and is intended for AWS S3 or S3-compatible systems such as MinIO.

Object Key Layout

Artifacts are stored under deterministic hierarchical object keys built by services/storage/keys.py.

Pattern:

{project_code}/{external_reference}/{version_label}/{artifact_type}/{filename}

Example:

nuclear-proj/lanl-q-001/rev-a/source/original.docx

Rules:

  • project_code, external_reference, and version_label are slugified to lowercase safe path segments
  • filename is preserved as-is
  • keys have no leading slash and always use forward slashes
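The rules above can be sketched as a small pure function. This is illustrative only: the real builder lives in services/storage/keys.py, and the simple slugifier here is an assumption about its behavior.

```python
import re


def slugify(value: str) -> str:
    """Reduce a segment to a lowercase, path-safe slug (hypothetical
    re-implementation of the repo's slugification)."""
    value = value.strip().lower()
    value = re.sub(r"[^a-z0-9]+", "-", value)
    return value.strip("-")


def build_object_key(project_code: str, external_reference: str,
                     version_label: str, artifact_type: str,
                     filename: str) -> str:
    """Build a deterministic object key with forward slashes and no
    leading slash; the filename is preserved as-is."""
    return "/".join([
        slugify(project_code),
        slugify(external_reference),
        slugify(version_label),
        artifact_type,
        filename,
    ])
```

For example, `build_object_key("Nuclear Proj", "LANL Q 001", "Rev A", "source", "original.docx")` yields the key shown above.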

Artifact Types

apps/jobs/models.py defines these artifact types:

Artifact type         Typical producer
source_docx           intake
preflight_report      intake / rerun
extraction_manifest   extraction
review_tmx            TMX export
review_import_tmx     TMX import
final_docx            reassembly
qa_report             QA finalization
reassembly_manifest   reassembly
replay_package        replay export

Artifact Records

Each Artifact row stores:

  • artifact type
  • storage backend
  • object key
  • original filename
  • content type
  • SHA-256 checksum
  • size in bytes
  • retention class and expiry
  • arbitrary JSON metadata
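The checksum and size fields can be computed in a single streaming pass over the file. This is an illustrative helper, not the repository's code:

```python
import hashlib


def checksum_and_size(path: str, chunk_size: int = 1 << 20) -> tuple[str, int]:
    """Stream a file once to get the SHA-256 hex digest and size in
    bytes, as recorded on an Artifact row."""
    digest = hashlib.sha256()
    size = 0
    with open(path, "rb") as fh:
        # Read in 1 MiB chunks so large DOCX files never load fully into memory.
        while chunk := fh.read(chunk_size):
            digest.update(chunk)
            size += len(chunk)
    return digest.hexdigest(), size
```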

Download Behavior

Artifact listing is available through GET /api/v1/jobs/{job_id}/artifacts.

Current download behavior depends on the backend:

  • for s3 artifacts, the app first asks the storage backend for a URL; if that URL is absolute, the API returns that direct pre-signed URL
  • for local artifacts, the API returns a Django download URL protected by a signed token
  • if an S3-compatible backend returns only a relative URL, the code falls back to the Django-signed download view

Artifact link expiry is based on IRIS_ARTIFACT_DOWNLOAD_TTL_SECONDS.

Retention Classes

services/storage/artifacts.py maps artifact types to retention classes as follows:

Artifact type         Retention class
source_docx           source
preflight_report      intermediate
extraction_manifest   intermediate
qa_report             intermediate
reassembly_manifest   intermediate
review_tmx            reviewed
review_import_tmx     reviewed
replay_package        reviewed
final_docx            delivered
(anything else)       debug

retention_days is taken from the related project's retention_days value when the artifact record is created.
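The mapping and the expiry computation can be sketched as follows; the dictionary mirrors the table above, while the function name and return shape are assumptions for illustration:

```python
from datetime import datetime, timedelta, timezone

# Mapping from the table above; anything unlisted falls back to "debug".
RETENTION_CLASSES = {
    "source_docx": "source",
    "preflight_report": "intermediate",
    "extraction_manifest": "intermediate",
    "qa_report": "intermediate",
    "reassembly_manifest": "intermediate",
    "review_tmx": "reviewed",
    "review_import_tmx": "reviewed",
    "replay_package": "reviewed",
    "final_docx": "delivered",
}


def retention_for(artifact_type: str, project_retention_days: int):
    """Return the retention class and expiry timestamp for a new
    artifact, using the project's retention_days."""
    retention_class = RETENTION_CLASSES.get(artifact_type, "debug")
    expires_at = datetime.now(timezone.utc) + timedelta(days=project_retention_days)
    return retention_class, expires_at
```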

Retention Enforcement

Retention is now enforced by the checked-in maintenance flow:

  • tasks.workflow.maintenance_tick
  • python manage.py maintenance_tick

When an artifact's retention_expires_at is in the past, maintenance deletes the storage object and then removes the Artifact row.

Use python manage.py maintenance_tick --dry-run to see how many artifacts would be affected before a live cleanup run.
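A simplified sketch of that sweep, with the storage and database deletions injected as callables (the real task goes through the Django ORM; this interface is hypothetical):

```python
from datetime import datetime, timezone


def sweep_expired(artifacts, delete_object, delete_row, dry_run: bool = False) -> int:
    """Delete expired artifacts: storage object first, then the row.

    With dry_run=True nothing is deleted; the function only counts how
    many artifacts would be affected.
    """
    now = datetime.now(timezone.utc)
    affected = 0
    for artifact in artifacts:
        if artifact.retention_expires_at and artifact.retention_expires_at < now:
            affected += 1
            if not dry_run:
                delete_object(artifact)  # remove the storage object first
                delete_row(artifact)     # then remove the Artifact row
    return affected
```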

Governance Deletion Controls

The API now also exposes an admin-only artifact deletion endpoint:

  • DELETE /api/v1/artifacts/{artifact_id}

Current safeguards:

  • the related job must already be in a terminal state
  • retention_expires_at must already be in the past
  • the storage object is deleted before the database row is removed
  • an artifact_deleted audit event is written with the authenticated actor and source IP
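The first two safeguards amount to a precondition check before any deletion happens. A sketch, using placeholder terminal-state names (the repo defines its own job states):

```python
from datetime import datetime, timezone

# Placeholder terminal states for illustration; not the repo's actual names.
TERMINAL_STATES = ("completed", "failed", "cancelled")


def can_delete_artifact(job_state: str, retention_expires_at) -> tuple[bool, str]:
    """Return (allowed, reason) for an admin delete request."""
    if job_state not in TERMINAL_STATES:
        return False, "job is not in a terminal state"
    if retention_expires_at is None or retention_expires_at >= datetime.now(timezone.utc):
        return False, "retention has not expired yet"
    return True, "ok"
```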

Candidate and approved memory entries use separate admin-only deactivation endpoints rather than hard delete:

  • DELETE /api/v1/memory/candidates/{candidate_id}
  • DELETE /api/v1/memory/approved/{entry_id}

Those endpoints flip is_active to false and record an audit event.
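The soft-delete behavior amounts to flipping a flag and emitting an audit record; a minimal sketch (the audit payload fields here are assumptions):

```python
def deactivate(entry, write_audit) -> None:
    """Soft-delete a memory entry: flip is_active instead of removing
    the row, and record an audit event (payload shape is illustrative)."""
    entry.is_active = False
    write_audit({"event": "memory_entry_deactivated", "id": entry.id})
```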

Materialization For DOCX Processing

Some workflow stages need a real local file path. The storage layer supports that through materialization helpers:

  • local filesystem artifacts are used directly when a storage path is available
  • remote-only artifacts are temporarily copied to a local temp file and cleaned up afterward

That is why the workflow can use S3-compatible storage without requiring every stage to rely on storage.path(...).
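A materialization helper of that shape can be sketched as a context manager; the interface (a path plus an injected `open_remote` callable) is illustrative, not the repo's actual API:

```python
import contextlib
import os
import shutil
import tempfile


@contextlib.contextmanager
def materialized_path(local_path, open_remote):
    """Yield a real local filesystem path for an artifact.

    If the artifact already has a usable local path, yield it directly;
    otherwise copy the remote stream to a temp file and remove the temp
    file afterwards.
    """
    if local_path and os.path.exists(local_path):
        yield local_path
        return
    fd, tmp = tempfile.mkstemp(suffix=".docx")
    try:
        with os.fdopen(fd, "wb") as out, open_remote() as src:
            shutil.copyfileobj(src, out)  # download remote bytes to the temp file
        yield tmp
    finally:
        os.unlink(tmp)  # always clean up the temporary copy
```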

Replay Package Contents

The replay package service writes a ZIP archive that contains:

  • reviewed.tmx
  • unit-map.json
  • document-manifest.json
  • job-manifest.json
  • qa-report.json
  • reassembly-manifest.json
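Assembling that archive can be sketched with the standard zipfile module; the `parts` mapping interface is an assumption, but the member names match the list above:

```python
import io
import zipfile

# Member names the replay package is documented to contain.
REPLAY_MEMBERS = {
    "reviewed.tmx", "unit-map.json", "document-manifest.json",
    "job-manifest.json", "qa-report.json", "reassembly-manifest.json",
}


def build_replay_package(parts: dict[str, bytes]) -> bytes:
    """Write the replay ZIP from a name -> bytes mapping, refusing to
    produce an incomplete archive."""
    missing = REPLAY_MEMBERS - parts.keys()
    if missing:
        raise ValueError(f"missing replay members: {sorted(missing)}")
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in sorted(REPLAY_MEMBERS):
            zf.writestr(name, parts[name])
    return buf.getvalue()
```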

For the job lifecycle that produces these artifacts, see Jobs And Workflow. For configuration of local versus s3, see Setup And Configuration.