william le roux edited this page 2026-04-01 16:51:37 +03:00

Operations And Scaling

This document covers the repository's current operational behavior and the scaling guidance that is justified by the checked-in code and Docker setup.

Current Runtime Baseline

The checked-in local runtime is designed as a development stack, not a finished production deployment.

Current baseline facts:

  • the web process starts with python manage.py runserver 0.0.0.0:8000
  • the web startup script runs python manage.py migrate --noinput on container start
  • the checked-in worker uses Celery with --pool=solo --concurrency=1
  • local development now defaults to PostgreSQL, with SQLite kept as a fallback option
  • artifact storage can be local or S3-compatible
  • async workflow can be off or on depending on configuration
  • /console/ is a session-authenticated operator UI served by the Django web process
  • /api/v1/* is authenticated and role-gated, while health and signed artifact downloads stay public
  • Django admin remains an internal-only support surface, not the supported operator workflow
  • maintenance can be run through a checked-in task/command, but no scheduler topology is shipped
  • readiness checks cover database, storage, Aspose, broker, and optional OpenSearch

Scaling Order At A Glance

flowchart TD
	A[Enable async workflow with a real broker] --> B[Move artifacts to S3-compatible storage]
	B --> C[Add worker replicas]
	C --> D[Run scheduled maintenance]
	D --> E[Split workers by queue family if needed]
	E --> F[Separate migrations from web startup]
	F --> G[Replace runserver with a production web process]
	G --> H[Harden PostgreSQL for deployment-grade operation]

Health And Readiness

Liveness

GET /health/live returns a simple service-ok response.

Readiness

GET /health/ready checks:

  • database connectivity
  • storage write/read/delete round-trip
  • Aspose runtime loading when enabled
  • broker connectivity when async workflow is enabled
  • OpenSearch connectivity when OpenSearch is enabled

Status behavior:

  • ok when a check passes
  • disabled when the related optional feature is off
  • error when the check fails

The endpoint returns HTTP 503 when any enabled check fails.
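The aggregation behavior described above can be sketched as a small pure function. This is an illustrative model, not the repository's implementation; the function name and return shape are assumptions, while the status vocabulary (ok, disabled, error) and the 503-on-failure rule come from this document.

```python
from http import HTTPStatus

def summarize_readiness(checks: dict) -> tuple:
    """Aggregate per-check statuses ('ok', 'disabled', 'error') into a
    readiness payload and HTTP status code. A 'disabled' check never
    fails readiness; any 'error' yields 503."""
    any_error = any(status == "error" for status in checks.values())
    http_code = HTTPStatus.SERVICE_UNAVAILABLE if any_error else HTTPStatus.OK
    return checks, int(http_code)
```

For example, a stack with async workflow disabled and a healthy database would report 200 even though the broker check is skipped: `summarize_readiness({"database": "ok", "broker": "disabled"})` returns the payload with HTTP 200.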

Logging

settings.py configures:

  • console logging
  • a rotating file handler at var/logs/iris.log

The root log level is DEBUG when Django debug mode is on, otherwise INFO.
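A dictConfig-style sketch of that setup might look like the following. The console handler, rotating file handler, and DEBUG/INFO switch come from this document; the rotation size, backup count, and function name are assumptions.

```python
def build_logging_config(log_path: str, debug: bool) -> dict:
    """Sketch of a Django LOGGING dict: console output plus a rotating
    file handler, with the root level driven by debug mode."""
    return {
        "version": 1,
        "disable_existing_loggers": False,
        "handlers": {
            "console": {"class": "logging.StreamHandler"},
            "file": {
                "class": "logging.handlers.RotatingFileHandler",
                "filename": log_path,
                "maxBytes": 10 * 1024 * 1024,  # assumed rotation threshold
                "backupCount": 5,              # assumed backup count
            },
        },
        "root": {
            "handlers": ["console", "file"],
            "level": "DEBUG" if debug else "INFO",
        },
    }
```

In a Django settings module this dict would be assigned to LOGGING (or applied directly with logging.config.dictConfig), with log_path pointing at var/logs/iris.log.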

Audit records now also capture the authenticated actor and request source IP for API-driven control, configuration, review, and governance actions.

What Can Be Scaled Today

1. Worker Replicas

The safest checked-in scaling lever is horizontal Celery worker replication.

Why this is the first scaling step:

  • async tasks already persist job state in Django models
  • queue routing is already defined
  • the worker is intentionally configured with --concurrency=1, so more throughput should come from more worker processes or replicas, not from assuming high in-process concurrency
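In practice, replication means running more worker containers or processes, not raising per-worker concurrency. The commands below are a hedged sketch: the Compose service name "worker" and the Celery app module placeholder are assumptions, not taken from the checked-in files.

```shell
# Scale the Compose worker service to three replicas
# ("worker" is an assumed service name):
docker compose up -d --scale worker=3

# Equivalent idea outside Compose: launch additional solo-pool workers,
# each keeping the checked-in single-task concurrency:
celery -A <app_module> worker --pool=solo --concurrency=1 --hostname=w2@%h
```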

2. Queue Specialization With The Same Image

The repo does not ship separate worker services, but the queue contract already exists. A deployment can run multiple Celery worker processes from the same image with narrower --queues lists if isolation is needed.

Examples of useful splits based on current routed queues:

  • extraction and reassembly workers for docx_extract,docx_reassemble
  • translation workers for translate_batch,qa_verify
  • control and review workers for job_control,review_io

This is an operational pattern supported by the current queue names; it is not a checked-in Compose topology.
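As a sketch, the splits above map onto Celery routing plus narrower --queues values. The queue names come from this document; the task module paths (tasks.workflow.*) are assumptions modeled on the tasks.workflow.maintenance_tick naming used elsewhere on this page.

```python
# Hypothetical routing table consistent with the routed queue names above.
TASK_ROUTES = {
    "tasks.workflow.docx_extract": {"queue": "docx_extract"},
    "tasks.workflow.docx_reassemble": {"queue": "docx_reassemble"},
    "tasks.workflow.translate_batch": {"queue": "translate_batch"},
    "tasks.workflow.qa_verify": {"queue": "qa_verify"},
    "tasks.workflow.job_control": {"queue": "job_control"},
    "tasks.workflow.review_io": {"queue": "review_io"},
}

def queues_flag(family):
    """Build the comma-separated --queues value for one worker family."""
    return ",".join(family)
```

A translation-focused worker would then be started with `--queues translate_batch,qa_verify` while a DOCX worker takes `--queues docx_extract,docx_reassemble`, all from the same image.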

3. Remote Artifact Storage

Artifact storage can already be moved off the local filesystem by setting IRIS_STORAGE_BACKEND=s3 and configuring an S3-compatible backend.

That reduces coupling to a single host's local disk and is already exercised by the Docker Compose stack through MinIO.
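The backend switch can be modeled as a small resolver. The environment variable name IRIS_STORAGE_BACKEND and the local/s3 pair come from this document; the function name, the "local" default, and the validation behavior are assumptions.

```python
import os

def resolve_storage_backend(env=None) -> str:
    """Resolve the artifact storage backend from IRIS_STORAGE_BACKEND,
    falling back to local filesystem storage when unset (assumed default)."""
    env = os.environ if env is None else env
    backend = env.get("IRIS_STORAGE_BACKEND", "local").lower()
    if backend not in {"local", "s3"}:
        raise ValueError("unsupported storage backend: " + backend)
    return backend
```

With the Compose stack, setting IRIS_STORAGE_BACKEND=s3 plus the MinIO connection settings exercises the same code path a production S3 bucket would use.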

4. Optional OpenSearch For Glossary Lookup

When glossary volume or glossary-query latency becomes a concern, OpenSearch can already be enabled for glossary retrieval without changing the rest of the workflow.

What Blocks Safe High-Scale Deployment Right Now

PostgreSQL Exists, But The Deployment Contract Is Still Thin

The repo now ships a real PostgreSQL configuration path, and the checked-in local setup uses PostgreSQL by default. That improves parity between the local runtime and a production-style database engine, but it does not by itself define a full production database contract.

What is still missing:

  • connection-pooling guidance
  • backup/restore expectations
  • HA or failover guidance
  • PostgreSQL-specific operational validation beyond the local dev baseline

Web Startup Is Not Replica-Friendly As-Is

docker/start-web.sh runs migrations every time the web container starts and then launches Django's development server.

Operational implication:

  • before scaling web replicas, move migrations to a one-off deployment step
  • use a production web server or process model outside the checked-in runserver command

The repository does not currently provide that production web-process configuration.
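A deployer would need to add something along these lines. This is a hedged sketch: the WSGI module path iris.wsgi and the worker count are assumptions, while the migrate command and bind address come from the checked-in startup script.

```shell
# Run migrations exactly once per deploy, outside the web containers:
python manage.py migrate --noinput

# Then serve each web replica with a production WSGI server instead of
# runserver ("iris.wsgi" is an assumed module path):
gunicorn iris.wsgi:application --bind 0.0.0.0:8000 --workers 4
```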

Not Every Routed Queue Has An Implementation Yet

settings.py still routes retrieve_context, but retrieval continues to run inside translation processing and there is no matching checked-in tasks.workflow.retrieve_context task.

That means queue specialization should be based on the tasks that actually exist today.

Maintenance Exists, But Scheduling Topology Does Not

The codebase now ships a concrete maintenance path through tasks.workflow.maintenance_tick and python manage.py maintenance_tick.

Current maintenance behavior:

  • deletes expired artifacts using their stored retention metadata
  • deletes stale non-promoted candidate memory tied to terminal jobs
  • refreshes review-coverage snapshots
  • reports integrity issues for missing storage objects and completed jobs missing delivery artifacts

What is still not shipped:

  • a dedicated Celery Beat service or cron container in Compose
  • worker-role specialization in the checked-in Docker topology
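If Celery Beat is the chosen scheduler, the wiring is a short schedule entry. The task path tasks.workflow.maintenance_tick comes from this document; the 15-minute interval and entry name are assumptions.

```python
from datetime import timedelta

# Hypothetical Celery Beat schedule for the checked-in maintenance task.
BEAT_SCHEDULE = {
    "maintenance-tick": {
        "task": "tasks.workflow.maintenance_tick",
        "schedule": timedelta(minutes=15),  # assumed cadence
    },
}
```

A cron-based alternative would invoke the management command instead, e.g. a crontab line running python manage.py maintenance_tick on the same cadence.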

Practical Scaling Order

If you need to scale this application beyond a single-machine development setup, the codebase best supports the following order of operations:

  1. enable async workflow with a real broker
  2. move artifact storage to S3-compatible storage
  3. add more Celery worker replicas
  4. run scheduled maintenance regularly through Celery Beat, cron, or another scheduler
  5. split worker replicas by queue family if DOCX and translation workloads interfere with each other
  6. separate migrations from web startup
  7. replace the development-server web command with a production process manager
  8. harden PostgreSQL into a production-grade shared database contract

Steps 1 through 3 are directly supported by the current code and configuration surface. Step 4 is supported at the task/command level but still expects an external scheduler. Steps 5 through 8 are operational work that the repo still expects the deployer to add.

Compose-Specific Notes

The checked-in Compose stack is useful for:

  • end-to-end local development
  • async workflow testing
  • local MinIO-backed S3 storage testing
  • local OpenSearch glossary testing

It should not be described as a production topology because it still uses:

  • Django runserver
  • startup-time migrations in the web container
  • a single generic worker service
  • development-grade PostgreSQL defaults rather than a production database contract