william le roux edited this page 2026-04-01 16:51:37 +03:00

Operations And Scaling

This document covers the repository's current operational behavior and the scaling guidance that is justified by the checked-in code and Docker setup.

Current Runtime Baseline

The checked-in local runtime is designed as a development stack, not a finished production deployment.

Current baseline facts:

  • the web process starts with python manage.py runserver 0.0.0.0:8000
  • the web startup script runs python manage.py migrate --noinput on container start
  • the checked-in worker uses Celery with --pool=solo --concurrency=1
  • local development now defaults to PostgreSQL, with SQLite kept as a fallback option
  • artifact storage can be local or S3-compatible
  • async workflow can be off or on depending on configuration
  • /console/ is a session-authenticated operator UI served by the Django web process
  • /api/v1/* is authenticated and role-gated, while health and signed artifact downloads stay public
  • Django admin remains an internal-only support surface, not the supported operator workflow
  • maintenance can be run through a checked-in task/command, but no scheduler topology is shipped
  • readiness checks cover database, storage, Aspose, broker, and optional OpenSearch

Scaling Order At A Glance

flowchart TD
	A[Enable async workflow with a real broker] --> B[Move artifacts to S3-compatible storage]
	B --> C[Add worker replicas]
	C --> D[Run scheduled maintenance]
	D --> E[Split workers by queue family if needed]
	E --> F[Separate migrations from web startup]
	F --> G[Replace runserver with a production web process]
	G --> H[Harden PostgreSQL for deployment-grade operation]

Health And Readiness

Liveness

GET /health/live returns a simple service-ok response.

Readiness

GET /health/ready checks:

  • database connectivity
  • storage write/read/delete round-trip
  • Aspose runtime loading when enabled
  • broker connectivity when async workflow is enabled
  • OpenSearch connectivity when OpenSearch is enabled

Status behavior:

  • ok when a check passes
  • disabled when the related optional feature is off
  • error when the check fails

The endpoint returns HTTP 503 when any enabled check fails.
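The aggregation behavior described above can be sketched as a small pure function. This is an illustrative model, not the repository's implementation; the function name and return shape are assumptions, while the status vocabulary (ok, disabled, error) and the 503-on-failure rule come from this document.

```python
from http import HTTPStatus

def summarize_readiness(checks: dict) -> tuple:
    """Aggregate per-check statuses ('ok', 'disabled', 'error') into a
    readiness payload and HTTP status code. A 'disabled' check never
    fails readiness; any 'error' yields 503."""
    any_error = any(status == "error" for status in checks.values())
    http_code = HTTPStatus.SERVICE_UNAVAILABLE if any_error else HTTPStatus.OK
    return checks, int(http_code)
```

For example, a stack with async workflow disabled and a healthy database would report 200 even though the broker check is skipped: `summarize_readiness({"database": "ok", "broker": "disabled"})` returns the payload with HTTP 200.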

Logging

settings.py configures:

  • console logging
  • a rotating file handler at var/logs/iris.log

The root log level is DEBUG when Django debug mode is on, otherwise INFO.
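A dictConfig-style sketch of that setup might look like the following. The console handler, rotating file handler, and DEBUG/INFO switch come from this document; the rotation size, backup count, and function name are assumptions.

```python
def build_logging_config(log_path: str, debug: bool) -> dict:
    """Sketch of a Django LOGGING dict: console output plus a rotating
    file handler, with the root level driven by debug mode."""
    return {
        "version": 1,
        "disable_existing_loggers": False,
        "handlers": {
            "console": {"class": "logging.StreamHandler"},
            "file": {
                "class": "logging.handlers.RotatingFileHandler",
                "filename": log_path,
                "maxBytes": 10 * 1024 * 1024,  # assumed rotation threshold
                "backupCount": 5,              # assumed backup count
            },
        },
        "root": {
            "handlers": ["console", "file"],
            "level": "DEBUG" if debug else "INFO",
        },
    }
```

In a Django settings module this dict would be assigned to LOGGING (or applied directly with logging.config.dictConfig), with log_path pointing at var/logs/iris.log.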

Audit records now also capture the authenticated actor and request source IP for API-driven control, configuration, review, and governance actions.

What Can Be Scaled Today

1. Worker Replicas

The safest checked-in scaling lever is horizontal Celery worker replication.

Why this is the first scaling step:

  • async tasks already persist job state in Django models
  • queue routing is already defined
  • the worker is intentionally configured with --concurrency=1, so more throughput should come from more worker processes or replicas, not from assuming high in-process concurrency
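In practice, replication means running more worker containers or processes, not raising per-worker concurrency. The commands below are a hedged sketch: the Compose service name "worker" and the Celery app module placeholder are assumptions, not taken from the checked-in files.

```shell
# Scale the Compose worker service to three replicas
# ("worker" is an assumed service name):
docker compose up -d --scale worker=3

# Equivalent idea outside Compose: launch additional solo-pool workers,
# each keeping the checked-in single-task concurrency:
celery -A <app_module> worker --pool=solo --concurrency=1 --hostname=w2@%h
```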

2. Queue Specialization With The Same Image

The repo does not ship separate worker services, but the queue contract already exists. A deployment can run multiple Celery worker processes from the same image with narrower --queues lists if isolation is needed.

Examples of useful splits based on current routed queues:

  • extraction and reassembly workers for docx_extract,docx_reassemble
  • translation workers for translate_batch,qa_verify
  • control and review workers for job_control,review_io

This is an operational pattern supported by the current queue names; it is not a checked-in Compose topology.
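As a sketch, the splits above map onto Celery routing plus narrower --queues values. The queue names come from this document; the task module paths (tasks.workflow.*) are assumptions modeled on the tasks.workflow.maintenance_tick naming used elsewhere on this page.

```python
# Hypothetical routing table consistent with the routed queue names above.
TASK_ROUTES = {
    "tasks.workflow.docx_extract": {"queue": "docx_extract"},
    "tasks.workflow.docx_reassemble": {"queue": "docx_reassemble"},
    "tasks.workflow.translate_batch": {"queue": "translate_batch"},
    "tasks.workflow.qa_verify": {"queue": "qa_verify"},
    "tasks.workflow.job_control": {"queue": "job_control"},
    "tasks.workflow.review_io": {"queue": "review_io"},
}

def queues_flag(family):
    """Build the comma-separated --queues value for one worker family."""
    return ",".join(family)
```

A translation-focused worker would then be started with `--queues translate_batch,qa_verify` while a DOCX worker takes `--queues docx_extract,docx_reassemble`, all from the same image.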

3. Remote Artifact Storage

Artifact storage can already be moved off the local filesystem by setting IRIS_STORAGE_BACKEND=s3 and configuring an S3-compatible backend.

That reduces coupling to a single host's local disk and is already exercised by the Docker Compose stack through MinIO.
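The backend switch can be modeled as a small resolver. The environment variable name IRIS_STORAGE_BACKEND and the local/s3 pair come from this document; the function name, the "local" default, and the validation behavior are assumptions.

```python
import os

def resolve_storage_backend(env=None) -> str:
    """Resolve the artifact storage backend from IRIS_STORAGE_BACKEND,
    falling back to local filesystem storage when unset (assumed default)."""
    env = os.environ if env is None else env
    backend = env.get("IRIS_STORAGE_BACKEND", "local").lower()
    if backend not in {"local", "s3"}:
        raise ValueError("unsupported storage backend: " + backend)
    return backend
```

With the Compose stack, setting IRIS_STORAGE_BACKEND=s3 plus the MinIO connection settings exercises the same code path a production S3 bucket would use.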

4. Optional OpenSearch For Glossary Lookup

When glossary volume or glossary-query latency becomes a concern, OpenSearch can already be enabled for glossary retrieval without changing the rest of the workflow.

What Blocks Safe High-Scale Deployment Right Now

PostgreSQL Exists, But The Deployment Contract Is Still Thin

The repo now ships a real PostgreSQL configuration path, and the checked-in local setup uses PostgreSQL by default. That improves parity between the local runtime and a production-style database engine, but it does not by itself define a full production database contract.

What is still missing:

  • connection-pooling guidance
  • backup/restore expectations
  • HA or failover guidance
  • PostgreSQL-specific operational validation beyond the local dev baseline

Web Startup Is Not Replica-Friendly As-Is

docker/start-web.sh runs migrations every time the web container starts and then launches Django's development server.

Operational implication:

  • before scaling web replicas, move migrations to a one-off deployment step
  • use a production web server or process model outside the checked-in runserver command

The repository does not currently provide that production web-process configuration.
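A deployer would need to add something along these lines. This is a hedged sketch: the WSGI module path iris.wsgi and the worker count are assumptions, while the migrate command and bind address come from the checked-in startup script.

```shell
# Run migrations exactly once per deploy, outside the web containers:
python manage.py migrate --noinput

# Then serve each web replica with a production WSGI server instead of
# runserver ("iris.wsgi" is an assumed module path):
gunicorn iris.wsgi:application --bind 0.0.0.0:8000 --workers 4
```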

Not Every Routed Queue Has An Implementation Yet

settings.py still routes retrieve_context, but retrieval continues to run inside translation processing and there is no matching checked-in tasks.workflow.retrieve_context task.

That means queue specialization should be based on the tasks that actually exist today.

Maintenance Exists, But Scheduling Topology Does Not

The codebase now ships a concrete maintenance path through tasks.workflow.maintenance_tick and python manage.py maintenance_tick.

Current maintenance behavior:

  • deletes expired artifacts using their stored retention metadata
  • deletes stale non-promoted candidate memory tied to terminal jobs
  • refreshes review-coverage snapshots
  • reports integrity issues for missing storage objects and completed jobs missing delivery artifacts

What is still not shipped:

  • a dedicated Celery Beat service or cron container in Compose
  • worker-role specialization in the checked-in Docker topology
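If Celery Beat is the chosen scheduler, the wiring is a short schedule entry. The task path tasks.workflow.maintenance_tick comes from this document; the 15-minute interval and entry name are assumptions.

```python
from datetime import timedelta

# Hypothetical Celery Beat schedule for the checked-in maintenance task.
BEAT_SCHEDULE = {
    "maintenance-tick": {
        "task": "tasks.workflow.maintenance_tick",
        "schedule": timedelta(minutes=15),  # assumed cadence
    },
}
```

A cron-based alternative would invoke the management command instead, e.g. a crontab line running python manage.py maintenance_tick on the same cadence.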

Practical Scaling Order

If you need to scale this application beyond a single-machine development setup, the codebase best supports the following order of operations:

  1. enable async workflow with a real broker
  2. move artifact storage to S3-compatible storage
  3. add more Celery worker replicas
  4. run scheduled maintenance regularly through Celery Beat, cron, or another scheduler
  5. split worker replicas by queue family if DOCX and translation workloads interfere with each other
  6. separate migrations from web startup
  7. replace the development-server web command with a production process manager
  8. harden PostgreSQL into a production-grade shared database contract

Steps 1 through 3 are directly supported by the current code and configuration surface. Step 4 is supported at the task/command level but still expects an external scheduler. Steps 5 through 8 are operational work that the repo still expects the deployer to add.

Compose-Specific Notes

The checked-in Compose stack is useful for:

  • end-to-end local development
  • async workflow testing
  • local MinIO-backed S3 storage testing
  • local OpenSearch glossary testing

It should not be described as a production topology because it still uses:

  • Django runserver
  • startup-time migrations in the web container
  • a single generic worker service
  • development-grade PostgreSQL defaults rather than a production database contract