# Storage And Artifacts
This document describes the repository's current storage model and artifact contract.
## Related Docs
## Artifact Flow At A Glance

```mermaid
flowchart LR
    Intake[Intake] --> Source[source_docx]
    Source --> Preflight[preflight_report]
    Source --> Extraction[extraction_manifest]
    Extraction --> QA[qa_report]
    QA --> Review[review_tmx or review_import_tmx]
    QA --> Reassembly[reassembly_manifest]
    Reassembly --> Final[final_docx]
    Review --> Replay[replay_package]
```
## Storage Backends

The checked-in code supports two artifact storage backends:

- `local` — Django's filesystem storage under `IRIS_MEDIA_ROOT`
- `s3` — `storages.backends.s3boto3.S3Boto3Storage`, intended for AWS S3 or S3-compatible systems such as MinIO
## Object Key Layout

Artifacts are stored under deterministic hierarchical object keys built by `services/storage/keys.py`.

Pattern:

```
{project_code}/{external_reference}/{version_label}/{artifact_type}/{filename}
```

Example:

```
nuclear-proj/lanl-q-001/rev-a/source/original.docx
```
Rules:

- `project_code`, `external_reference`, and `version_label` are slugified to lowercase safe path segments
- `filename` is preserved as-is
- keys have no leading slash and always use forward slashes
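The rules above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of `services/storage/keys.py`; the helper names `build_object_key` and `_slug` are assumptions.

```python
import re


def _slug(segment: str) -> str:
    """Lowercase a segment and collapse unsafe characters to hyphens (assumed slugify behavior)."""
    cleaned = re.sub(r"[^a-z0-9]+", "-", segment.lower()).strip("-")
    return cleaned or "unknown"


def build_object_key(project_code: str, external_reference: str,
                     version_label: str, artifact_type: str,
                     filename: str) -> str:
    """Join slugified segments with forward slashes; the filename is kept verbatim."""
    return "/".join([
        _slug(project_code),
        _slug(external_reference),
        _slug(version_label),
        artifact_type,   # fixed vocabulary, no slugify needed
        filename,        # preserved as-is
    ])
```

For instance, `build_object_key("Nuclear Proj", "LANL-Q-001", "Rev A", "source", "original.docx")` yields the example key shown above.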
## Artifact Types

`apps/jobs/models.py` defines these artifact types:

| Artifact type | Typical producer |
|---|---|
| `source_docx` | intake |
| `preflight_report` | intake / rerun |
| `extraction_manifest` | extraction |
| `review_tmx` | TMX export |
| `review_import_tmx` | TMX import |
| `final_docx` | reassembly |
| `qa_report` | QA finalization |
| `reassembly_manifest` | reassembly |
| `replay_package` | replay export |
## Artifact Records

Each `Artifact` row stores:
- artifact type
- storage backend
- object key
- original filename
- content type
- SHA-256 checksum
- size in bytes
- retention class and expiry
- arbitrary JSON metadata
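The checksum and size fields can be computed in a single streaming pass over the uploaded file. This is a sketch of that idea, not the repository's actual code; the helper name is hypothetical.

```python
import hashlib
from typing import BinaryIO


def checksum_and_size(stream: BinaryIO, chunk_size: int = 1 << 20) -> tuple[str, int]:
    """Read the stream once, returning (SHA-256 hex digest, size in bytes),
    the two integrity fields recorded on an Artifact row."""
    digest = hashlib.sha256()
    size = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        digest.update(chunk)
        size += len(chunk)
    return digest.hexdigest(), size
```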
## Download Behavior

Artifact listing is available through `GET /api/v1/jobs/{job_id}/artifacts`.

Current download behavior depends on the backend:

- for `s3` artifacts, the app first asks the storage backend for a URL; if that URL is absolute, the API returns that direct pre-signed URL
- for `local` artifacts, the API returns a Django download URL protected by a signed token
- if an S3-compatible backend returns only a relative URL, the code falls back to the Django-signed download view

Artifact link expiry is based on `IRIS_ARTIFACT_DOWNLOAD_TTL_SECONDS`.
## Retention Classes

`services/storage/artifacts.py` maps artifact types to retention classes as follows:

| Artifact type | Retention class |
|---|---|
| `source_docx` | source |
| `preflight_report` | intermediate |
| `extraction_manifest` | intermediate |
| `qa_report` | intermediate |
| `reassembly_manifest` | intermediate |
| `review_tmx` | reviewed |
| `review_import_tmx` | reviewed |
| `replay_package` | reviewed |
| `final_docx` | delivered |
| anything else | debug |
The artifact's `retention_days` is taken from the related project's `retention_days` value when the artifact record is created.
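The table above amounts to a lookup with a `debug` default. A minimal sketch (names assumed, not the literal contents of `services/storage/artifacts.py`):

```python
# Mapping reproduced from the retention-class table above.
RETENTION_CLASSES = {
    "source_docx": "source",
    "preflight_report": "intermediate",
    "extraction_manifest": "intermediate",
    "qa_report": "intermediate",
    "reassembly_manifest": "intermediate",
    "review_tmx": "reviewed",
    "review_import_tmx": "reviewed",
    "replay_package": "reviewed",
    "final_docx": "delivered",
}


def retention_class(artifact_type: str) -> str:
    """Unknown artifact types fall through to the debug retention class."""
    return RETENTION_CLASSES.get(artifact_type, "debug")
```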
## Retention Enforcement

Retention is enforced by the checked-in maintenance flow:

- `tasks.workflow.maintenance_tick`
- `python manage.py maintenance_tick`
When an artifact's `retention_expires_at` is in the past, maintenance deletes the storage object and then removes the `Artifact` row.

Use `python manage.py maintenance_tick --dry-run` to see how many artifacts would be affected before a live cleanup run.
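The enforcement loop, including the `--dry-run` behavior and the object-before-row deletion order, can be sketched as below. This is an illustration only: artifacts are modeled as plain dicts and the two delete callbacks stand in for the storage and ORM calls.

```python
from datetime import datetime, timezone


def enforce_retention(artifacts, delete_object, delete_row, *, now=None, dry_run=False):
    """Delete every artifact whose retention_expires_at is in the past:
    storage object first, then the database row. With dry_run=True,
    only report how many artifacts would be affected."""
    now = now or datetime.now(timezone.utc)
    expired = [
        a for a in artifacts
        if a["retention_expires_at"] is not None and a["retention_expires_at"] < now
    ]
    if dry_run:
        return len(expired)
    for artifact in expired:
        delete_object(artifact["object_key"])  # remove the stored object first
        delete_row(artifact["id"])             # then remove the Artifact row
    return len(expired)
```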
## Governance Deletion Controls

The API now also exposes an admin-only artifact deletion endpoint:

`DELETE /api/v1/artifacts/{artifact_id}`

Current safeguards:

- the related job must already be in a terminal state
- `retention_expires_at` must already be in the past
- the storage object is deleted before the database row is removed
- an `artifact_deleted` audit event is written with the authenticated actor and source IP
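The first two safeguards are simple precondition checks before any deletion happens. A sketch, with the caveat that the concrete terminal state names here are assumptions, not the repository's actual job states:

```python
from datetime import datetime

# Assumed terminal job states for illustration only.
TERMINAL_STATES = {"completed", "failed", "cancelled"}


def can_delete_artifact(job_state: str, retention_expires_at, now: datetime) -> bool:
    """Both preconditions must hold before the admin delete proceeds."""
    return (
        job_state in TERMINAL_STATES
        and retention_expires_at is not None
        and retention_expires_at < now
    )
```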
Candidate and approved memory entries use separate admin-only deactivation endpoints rather than hard delete:

- `DELETE /api/v1/memory/candidates/{candidate_id}`
- `DELETE /api/v1/memory/approved/{entry_id}`

Those endpoints flip `is_active` to false and record an audit event.
## Materialization For DOCX Processing
Some workflow stages need a real local file path. The storage layer supports that through materialization helpers:
- local filesystem artifacts are used directly when a storage path is available
- remote-only artifacts are temporarily copied to a local temp file and cleaned up afterward
That is why the workflow can use S3-compatible storage without requiring every stage to rely on `storage.path(...)`.
## Replay Package Contents

The replay package service writes a ZIP archive that contains:

- `reviewed.tmx`
- `unit-map.json`
- `document-manifest.json`
- `job-manifest.json`
- `qa-report.json`
- `reassembly-manifest.json`
For the job lifecycle that produces these artifacts, see Jobs And Workflow. For configuration of `local` versus `s3`, see Setup And Configuration.