Table of Contents
- Operations And Scaling
- Related Docs
- Current Runtime Baseline
- Scaling Order At A Glance
- Health And Readiness
- Logging
- What Can Be Scaled Today
- 1. Worker Replicas
- 2. Queue Specialization With The Same Image
- 3. Remote Artifact Storage
- 4. Optional OpenSearch For Glossary Lookup
- What Blocks Safe High-Scale Deployment Right Now
- PostgreSQL Exists, But The Deployment Contract Is Still Thin
- Web Startup Is Not Replica-Friendly As-Is
- Not Every Routed Queue Has An Implementation Yet
- Maintenance Exists, But Scheduling Topology Does Not
- Practical Scaling Order
- Compose-Specific Notes
Operations And Scaling
This document covers the repository's current operational behavior and the scaling guidance that is justified by the checked-in code and Docker setup.
Related Docs
Current Runtime Baseline
The checked-in local runtime is designed as a development stack, not a finished production deployment.
Current baseline facts:
- the web process starts with `python manage.py runserver 0.0.0.0:8000`
- the web startup script runs `python manage.py migrate --noinput` on container start
- the checked-in worker uses Celery with `--pool=solo --concurrency=1`
- local development now defaults to PostgreSQL, with SQLite kept as a fallback option
- artifact storage can be local or S3-compatible
- async workflow can be off or on depending on configuration
- `/console/` is a session-authenticated operator UI served by the Django web process
- `/api/v1/*` is authenticated and role-gated, while health and signed artifact downloads stay public
- Django admin remains an internal-only support surface, not the supported operator workflow
- maintenance can be run through a checked-in task/command, but no scheduler topology is shipped
- readiness checks cover database, storage, Aspose, broker, and optional OpenSearch
Scaling Order At A Glance
```mermaid
flowchart TD
    A[Enable async workflow with a real broker] --> B[Move artifacts to S3-compatible storage]
    B --> C[Add worker replicas]
    C --> D[Run scheduled maintenance]
    D --> E[Split workers by queue family if needed]
    E --> F[Separate migrations from web startup]
    F --> G[Replace runserver with a production web process]
    G --> H[Harden PostgreSQL for deployment-grade operation]
```
Health And Readiness
Liveness
`GET /health/live` returns a simple service-ok response.
Readiness
`GET /health/ready` checks:
- database connectivity
- storage write/read/delete round-trip
- Aspose runtime loading when enabled
- broker connectivity when async workflow is enabled
- OpenSearch connectivity when OpenSearch is enabled
Status behavior:
- `ok` when a check passes
- `disabled` when the related optional feature is off
- `error` when the check fails
The endpoint returns HTTP 503 when any enabled check fails.
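The 200/503 contract above maps directly onto an orchestrator healthcheck. The following is a minimal sketch, assuming the web process listens on `localhost:8000` (the host and port are assumptions; only the `/health/ready` path and the status behavior come from this document):

```shell
#!/bin/sh
# Map the HTTP status from /health/ready to a probe exit code:
# 200 means every enabled check passed; anything else (e.g. 503) means not ready.
ready_exit() {
  if [ "$1" = "200" ]; then echo 0; else echo 1; fi
}

# Query the readiness endpoint and return the probe result.
probe_ready() {
  code=$(curl -s -o /dev/null -w '%{http_code}' "http://${1:-localhost:8000}/health/ready")
  return "$(ready_exit "$code")"
}
```

A probe like this can back a Compose `healthcheck` or a Kubernetes readiness probe without parsing the response body.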
Logging
`settings.py` configures:
- console logging
- a rotating file handler at `var/logs/iris.log`
The root log level is DEBUG when Django debug mode is on, otherwise INFO.
Audit records now also capture the authenticated actor and request source IP for API-driven control, configuration, review, and governance actions.
What Can Be Scaled Today
1. Worker Replicas
The safest checked-in scaling lever is horizontal Celery worker replication.
Why this is the first scaling step:
- async tasks already persist job state in Django models
- queue routing is already defined
- the worker is intentionally configured with `--concurrency=1`, so more throughput should come from more worker processes or replicas, not from assuming high in-process concurrency
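With Docker Compose, adding replicas is a one-line operation. This is a sketch under one assumption: the worker service is named `worker` (check the checked-in compose file for the actual name):

```shell
# Scale the single worker service horizontally; each replica keeps the
# checked-in --pool=solo --concurrency=1 settings, so throughput grows
# with the replica count rather than per-process concurrency.
docker compose up -d --scale worker=4
```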
2. Queue Specialization With The Same Image
The repo does not ship separate worker services, but the queue contract already exists. A deployment can run multiple Celery worker processes from the same image with narrower --queues lists if isolation is needed.
Examples of useful splits based on current routed queues:
- extraction and reassembly workers for `docx_extract` and `docx_reassemble`
- translation workers for `translate_batch` and `qa_verify`
- control and review workers for `job_control` and `review_io`
This is an operational pattern supported by the current queue names; it is not a checked-in Compose topology.
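The splits above can be sketched as three worker invocations from the same image. The queue names are the routed queues listed in this document; `iris` as the Celery app module is an assumption, so substitute the project's actual app path:

```shell
# Three specialized workers sharing one image, isolated by --queues.
celery -A iris worker --pool=solo --concurrency=1 \
  --queues=docx_extract,docx_reassemble --hostname=docx@%h
celery -A iris worker --pool=solo --concurrency=1 \
  --queues=translate_batch,qa_verify --hostname=translate@%h
celery -A iris worker --pool=solo --concurrency=1 \
  --queues=job_control,review_io --hostname=control@%h
```

Distinct `--hostname` values keep the workers distinguishable in broker and monitoring output.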
3. Remote Artifact Storage
Artifact storage can already be moved off the local filesystem by setting `IRIS_STORAGE_BACKEND=s3` and configuring an S3-compatible backend.
That reduces coupling to a single host's local disk and is already exercised by the Docker Compose stack through MinIO.
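A configuration sketch follows. Only `IRIS_STORAGE_BACKEND=s3` is documented here; every other variable name is a hypothetical placeholder, so confirm the exact names in the repo's settings before use:

```shell
# Switch artifact storage to an S3-compatible backend.
export IRIS_STORAGE_BACKEND=s3
export IRIS_S3_ENDPOINT_URL=http://minio:9000   # hypothetical name; MinIO endpoint in the local Compose stack
export IRIS_S3_BUCKET=iris-artifacts            # hypothetical name
export IRIS_S3_ACCESS_KEY=changeme              # hypothetical name
export IRIS_S3_SECRET_KEY=changeme              # hypothetical name
```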
4. Optional OpenSearch For Glossary Lookup
When glossary volume or glossary-query latency becomes a concern, OpenSearch can already be enabled for glossary retrieval without changing the rest of the workflow.
What Blocks Safe High-Scale Deployment Right Now
PostgreSQL Exists, But The Deployment Contract Is Still Thin
The repo now ships a real PostgreSQL configuration path and the checked-in local setup uses PostgreSQL by default. That improves local/runtime parity, but it does not by itself define a full production database contract.
What is still missing:
- connection-pooling guidance
- backup/restore expectations
- HA or failover guidance
- PostgreSQL-specific operational validation beyond the local dev baseline
Web Startup Is Not Replica-Friendly As-Is
`docker/start-web.sh` runs migrations every time the web container starts and then launches Django's development server.
Operational implication:
- before scaling web replicas, move migrations to a one-off deployment step
- use a production web server or process model instead of the checked-in `runserver` command
The repository does not currently provide that production web-process configuration.
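A replica-friendly split could look like the following sketch. Neither piece is shipped by the repo, and `iris.wsgi` is an assumed WSGI module path:

```shell
# One-off deployment step, run once per release rather than per replica:
python manage.py migrate --noinput

# Per-replica web process using a production server instead of runserver:
gunicorn iris.wsgi:application --bind 0.0.0.0:8000 --workers 4
```

Separating the two means a bad migration fails the deployment job instead of crash-looping every web replica.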
Not Every Routed Queue Has An Implementation Yet
`settings.py` still routes `retrieve_context`, but retrieval continues to run inside translation processing and there is no matching checked-in `tasks.workflow.retrieve_context` task.
That means queue specialization should be based on the tasks that actually exist today.
Maintenance Exists, But Scheduling Topology Does Not
The codebase now ships a concrete maintenance path through `tasks.workflow.maintenance_tick` and `python manage.py maintenance_tick`.
Current maintenance behavior:
- deletes expired artifacts using their stored retention metadata
- deletes stale non-promoted candidate memory tied to terminal jobs
- refreshes review-coverage snapshots
- reports integrity issues for missing storage objects and completed jobs missing delivery artifacts
What is still not shipped:
- a dedicated Celery Beat service or cron container in Compose
- worker-role specialization in the checked-in Docker topology
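Until a Beat or cron container is added, the checked-in command can be driven from host cron. This is a sketch; the hourly interval and the `web` service name are assumptions:

```shell
# Run the documented maintenance entry point every hour via host cron.
# m h  dom mon dow  command
0 * * * * docker compose exec -T web python manage.py maintenance_tick
```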
Practical Scaling Order
If you need to scale this application beyond a single-machine dev setup, the codebase supports this order of operations best:
1. enable async workflow with a real broker
2. move artifact storage to S3-compatible storage
3. add more Celery worker replicas
4. run scheduled maintenance regularly through Celery Beat, cron, or another scheduler
5. split worker replicas by queue family if DOCX and translation workloads interfere with each other
6. separate migrations from web startup
7. replace the development-server web command with a production process manager
8. harden PostgreSQL into a production-grade shared database contract
Steps 1 through 3 are directly supported by the current code and configuration surface. Step 4 is supported at the task/command level but still expects an external scheduler. Steps 5 through 8 are operational work that the repo still expects the deployer to add.
Compose-Specific Notes
The checked-in Compose stack is useful for:
- end-to-end local development
- async workflow testing
- local MinIO-backed S3 storage testing
- local OpenSearch glossary testing
It should not be described as a production topology because it still uses:
- Django `runserver`
- startup-time migrations in the web container
- a single generic worker service
- development-grade PostgreSQL defaults rather than a production database contract