3 implementation plan
william le roux edited this page 2026-04-01 16:51:37 +03:00

Implementation Plan: Verified Next Work

This file is not a speculative roadmap. It records the next delivery work that is justified by the current repository state and the current repo-truth docs.

What Is Already In Place

The checked-in repo already has:

  • DOCX intake and preflight
  • extraction, translation, QA, and reassembly workflow code
  • TMX export/import review flow
  • glossary and memory APIs
  • project, provider, and policy configuration APIs
  • artifact retention metadata and signed download behavior
  • optional OpenSearch-backed glossary retrieval
  • Celery-based staged async workflow dispatch
  • maintenance execution for retention cleanup, stale-candidate cleanup, review-coverage refresh, and integrity checks
  • job comparison, audit search, and review-coverage reporting
  • authenticated API access with operator/admin role enforcement
  • governance deletion/deactivation controls for artifacts and memory records
  • a session-authenticated Django-template operator console for the core workflow
  • background batch intake through queued submit_batch_item tasks
  • filename-based file history with per-run artifact access and run-to-run comparison
  • startup recovery for stale in_progress jobs
  • an enforced 80% coverage floor for the maintained app surface plus an opt-in live integration smoke harness

Recent Plan Items That Are No Longer Gaps

The older version of this plan treated these as pending work. They are now part of the baseline and should not be planned again unless they need follow-on refinement:

  • API authentication and authorization
  • audit attribution hardening
  • governance deletion controls
  • maintenance task implementation
  • the minimum operator console
  • queued batch intake and monitoring
  • the file-history console and run comparison baseline

Highest-Value Verified Gaps

These are the most important remaining gaps after reviewing Current Product Scope and Runtime Architecture.

1. File Identity And Lineage Contract

The repo now ships a useful file-history console, but it is still a filename-based grouping layered over DocumentVersion, Job, and Artifact rows.

What is missing:

  • an explicit supported identity contract for "the same file across runs" beyond project + source_filename
  • a stronger collision policy than "collapse and warn" when two unrelated files share the same filename inside one project
  • a rename story if the same logical file later arrives under a different filename
  • optional storage provenance capture such as object etag or provider version_id, without making bucket versioning the primary lineage key
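One way to move past "collapse and warn" is to classify each upload at intake, before it is grouped into a history. The sketch below is illustrative only: `IntakeDecision`, `classify_intake`, and the checksum index are hypothetical names, and it assumes a content checksum is available at intake. Different content under the same filename may be a legitimate revision, so that case is routed to an explicit operator decision rather than resolved automatically.

```python
from enum import Enum


class IntakeDecision(Enum):
    """Hypothetical outcomes of an intake-time identity check."""
    NEW_FILE = "new_file"              # filename not seen in this project
    RERUN = "rerun"                    # same filename, identical content
    NEEDS_OPERATOR = "needs_operator"  # same filename, different content


def classify_intake(known_checksums, project, source_filename, sha256):
    """Classify an upload against previously seen uploads.

    known_checksums maps (project, source_filename) to the set of content
    checksums already recorded under that key (an assumed index, not an
    existing model in the repo).
    """
    seen = known_checksums.get((project, source_filename))
    if seen is None:
        return IntakeDecision.NEW_FILE
    if sha256 in seen:
        return IntakeDecision.RERUN
    # Could be a new revision or an unrelated file that reuses the name;
    # the filename alone cannot tell these apart.
    return IntakeDecision.NEEDS_OPERATOR


index = {("acme", "contract.docx"): {"aaa111"}}
print(classify_intake(index, "acme", "contract.docx", "bbb222").value)
```

Whether the `NEEDS_OPERATOR` case blocks intake or only records an explicit confirmation is a product decision that this sketch does not settle.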

2. File-Centric Comparison And Monitoring

The console can already inspect segments, compare runs, and expose final DOCX and TMX artifacts per run. The remaining gap is turning that into a first-class, supportable file-centric contract rather than a helpful operator-only view.

What is missing:

  • file-centric APIs that mirror the /console/files/ lineage and comparison views
  • richer diff filtering and pagination for large TMX-style comparisons
  • a clearer version summary for extracted, translated, review-required, and delivered state across many runs
  • explicit retention and missing-artifact behavior when older versions age out

3. Production Runtime Baseline

The current product and architecture docs still describe a development-first runtime:

  • PostgreSQL is available and now used by the checked-in host-local path, but the production database contract is still thin
  • the web process still runs through Django runserver
  • migrations still run on every web-container start in the checked-in Docker path
  • the repo does not ship a production WSGI/ASGI contract

This is now the biggest gap between the current implementation and a supportable non-dev deployment.

4. Supported Worker And Scheduler Topology

The architecture docs still say the checked-in stack ships one generic worker and no dedicated scheduler.

What is missing:

  • a supported split-worker deployment pattern
  • a checked-in scheduler path for maintenance
  • a clear production broker recommendation and validation path

This is a runtime-contract gap, not just a docs gap.

5. Retrieval Contract Cleanup

The docs and settings still expose a retrieve_context queue even though retrieval happens inside translation processing and there is no separate checked-in task.

What is missing:

  • either a real retrieval task and queue boundary, or
  • removal of the dead queue route and matching docs cleanup

In parallel, the memory/glossary retrieval strategy is still too loosely defined to serve as a production-supportable contract.

6. Database And Index Strategy For Retrieval

The older plan correctly anticipated this gap, and it still exists:

  • there is no PostgreSQL-backed retrieval strategy shipped in code
  • there is no checked-in indexing/reindexing workflow for richer memory retrieval
  • the current docs do not yet define whether approved-memory lookup should stay simple, move to pg_trgm, add embeddings, or adopt pgvector

This needs to become an explicit supported strategy rather than a forward-looking placeholder.
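To make the pg_trgm option concrete, the following sketch approximates trigram similarity in plain Python. It is a simplification for discussion, not pg_trgm's exact implementation: it splits on whitespace only, and the function names are illustrative. It shows why near-identical segments score high while unrelated ones score zero, which is the property an approved-memory lookup would rely on.

```python
def trigrams(text):
    """Return the set of 3-character windows for each lowercased word,
    padded with two leading spaces and one trailing space."""
    grams = set()
    for word in text.lower().split():
        padded = f"  {word} "
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


# Rank hypothetical approved-memory entries against a new source segment.
memory = ["Invoice totals are due monthly", "Delivery address is required"]
query = "Invoice total is due monthly"
best = max(memory, key=lambda entry: similarity(query, entry))
print(best)
```

Embeddings or pgvector would replace the scoring function with vector distance, but they raise the same indexing and reindexing questions listed above.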

7. Large Artifact Ingress Hardening

The product and architecture docs still describe the web process as the entry point for uploads.

What is missing:

  • signed direct-to-object-storage upload/finalize flows for large DOCX and TMX artifacts
  • explicit finalize validation and failure behavior
  • a documented upload contract for large files
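Finalize validation could, for example, compare what the client declared when it requested the signed upload with what actually landed in storage. The sketch below is a minimal illustration under assumptions: `validate_finalize`, `FinalizeError`, the size limit, and the use of SHA-256 are hypothetical choices, and a real flow would likely stream the object from storage instead of holding it in memory.

```python
import hashlib


class FinalizeError(ValueError):
    """Raised when an uploaded object does not match its declared metadata."""


def validate_finalize(payload, declared_size, declared_sha256,
                      max_size=200 * 1024 * 1024):
    """Check size and checksum before the artifact enters the workflow.

    Returns the verified hex digest on success and raises FinalizeError
    otherwise. max_size is an assumed limit, not a documented one.
    """
    if declared_size > max_size:
        raise FinalizeError("declared size exceeds the upload limit")
    if len(payload) != declared_size:
        raise FinalizeError("stored object size differs from declared size")
    actual = hashlib.sha256(payload).hexdigest()
    if actual != declared_sha256.lower():
        raise FinalizeError("stored object checksum differs from declared checksum")
    return actual


data = b"example upload body"
digest = hashlib.sha256(data).hexdigest()
print(validate_finalize(data, len(data), digest) == digest)
```

Failing closed here means a truncated or substituted upload never becomes a DocumentVersion.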

This is the clearest remaining storage/runtime scaling gap after signed downloads were added.

8. Delivery Integrity Hardening

The workflow already reassembles and delivers final DOCX files, but the older plan's delivery-hardening gap remains:

  • stronger post-reassembly structural validation is not shipped
  • delivery is not yet blocked on those deeper structural failures

That matters because the product promise is not merely artifact generation, but reliable translated DOCX delivery.

9. Observability And Operability

The runtime docs still do not describe OpenTelemetry-based traces, metrics, or end-to-end correlation across web and worker flows.

What is missing:

  • trace propagation across web -> Celery -> provider/storage paths
  • operational metrics for queue latency, job-stage durations, and failure rates
  • consistent correlation IDs for logs and audit-friendly diagnostics
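As a rough illustration of the metrics named above, queue latency and per-stage durations can be derived from timestamps that a job already records or could record. The function name, the event tuple shape, and the stage names below are assumptions made for this sketch, not existing repo structures.

```python
from datetime import datetime, timedelta, timezone


def job_timing(enqueued_at, stage_events):
    """Derive queue latency and per-stage durations in seconds.

    stage_events is an ordered list of (stage_name, started_at, finished_at)
    tuples; this shape is hypothetical.
    """
    if not stage_events:
        return {"queue_latency_s": None, "stages": {}}
    first_start = stage_events[0][1]
    return {
        # Time between enqueue and the first stage starting on a worker.
        "queue_latency_s": (first_start - enqueued_at).total_seconds(),
        "stages": {
            name: (finished - started).total_seconds()
            for name, started, finished in stage_events
        },
    }


t0 = datetime(2026, 1, 1, tzinfo=timezone.utc)
events = [
    ("extract", t0 + timedelta(seconds=2), t0 + timedelta(seconds=5)),
    ("translate", t0 + timedelta(seconds=5), t0 + timedelta(seconds=65)),
]
print(job_timing(t0, events))
```

Exporting these values through OpenTelemetry or another metrics backend is a separate step that this sketch leaves out.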

This is now one of the main supportability gaps rather than a product-feature gap.

10. Product Contract Clarification

Three open questions are still significant enough to block a final deployment and support story:

  • production database target
  • supported worker topology
  • private deployment minimum, including whether the target promise is private-network deployment only or something closer to fully disconnected execution

The product docs are now intentionally narrower than a marketing roadmap, so these decisions should be resolved explicitly rather than left implicit.

Suggested Work Order

Based on the current product and runtime docs, the next work should start by making the new file-centric workflow contract explicit and only then continue with runtime hardening.

  1. define the supported file-identity and collision contract for version history
  2. expand file-centric comparison, monitoring, and API coverage
  3. establish a production runtime baseline: PostgreSQL, production web process, explicit migration step
  4. ship a supported worker/scheduler topology and a production broker recommendation
  5. resolve the retrieval queue mismatch and define the real retrieval/index strategy
  6. harden large artifact upload and finalize flows
  7. add stronger post-reassembly structural validation
  8. add observability and correlation across web and worker processes
  9. close the remaining product/deployment contract questions in docs and code

Phase 12: File Identity And Lineage

Goal: Turn the current filename-based history view into an explicit supported lineage contract.

Tasks:

  • Decide whether the supported near-term identity remains project + source_filename or moves to an explicit tracked-file key.
  • If filename-based identity remains for now, enforce and document collision behavior at intake instead of only warning after grouping.
  • If a tracked-file key is introduced, backfill existing DocumentVersion rows into deterministic file timelines.
  • Persist optional storage provenance such as checksum, etag, and provider version_id when available, but keep lineage resolution application-side.
  • Document how reruns, reuploads, and renamed files are represented in the lineage model.
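If a tracked-file key is introduced, deriving it deterministically makes the backfill safe to rerun. The sketch below is one possible approach, not the repo's design: the namespace UUID, the `tracked_file_key` field, and the dict-shaped rows are all hypothetical, and filename normalization and rename handling are deliberately left out.

```python
import uuid

# Hypothetical fixed namespace; any constant UUID works as long as it never changes.
TRACKED_FILE_NAMESPACE = uuid.UUID("6f1c2a9e-3b7d-4c58-9a41-2e5d8f0b7c13")


def tracked_file_key(project_id, source_filename):
    """Derive a stable key from project + filename using a name-based UUID."""
    # A control character separator avoids ambiguity between the two fields.
    name = f"{project_id}\x1f{source_filename}"
    return uuid.uuid5(TRACKED_FILE_NAMESPACE, name)


def backfill(rows):
    """Assign keys to rows that lack one; return how many rows changed.

    rows stands in for DocumentVersion records and is a list of dicts here.
    """
    updated = 0
    for row in rows:
        if row.get("tracked_file_key") is None:
            row["tracked_file_key"] = tracked_file_key(
                row["project_id"], row["source_filename"]
            )
            updated += 1
    return updated


rows = [
    {"project_id": 1, "source_filename": "contract.docx"},
    {"project_id": 1, "source_filename": "contract.docx"},
    {"project_id": 2, "source_filename": "contract.docx"},
]
print(backfill(rows), backfill(rows))
```

A second run changes nothing, which is the idempotency property the backfill tests below would need to assert.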

Checks:

  • every DocumentVersion belongs to one deterministic file timeline
  • collisions are either impossible or explicit to operators
  • object-storage versioning remains supplementary provenance rather than the primary lineage key

Tests to add:

  • filename-collision behavior tests
  • lineage backfill and idempotency tests
  • rename/history continuity tests if renamed-file support is added
  • artifact provenance persistence tests

Exit criteria:

  • operators can rely on file history as a supported contract rather than a best-effort filename grouping

Phase 13: File-Centric Comparison And Monitoring

Goal: Make version inspection and run comparison first-class at the file level.

Tasks:

  • Add file-history and comparison API endpoints that mirror the /console/files/ views.
  • Add comparison filters for changed text, verification_state, result_source, provider model, and review status.
  • Add per-run version summaries for extracted, translated, review-required, and delivered state, including percentages.
  • Add pagination or similar controls so large segment tables and comparisons stay usable.
  • Keep DOCX access versioned and downloadable per run; do not promise structural DOCX diffing until such a feature is actually shipped.
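A per-run version summary with percentages might look like the sketch below. It assumes each segment carries exactly one state label and uses hypothetical state names; the real model may track these states differently.

```python
from collections import Counter

# Assumed state labels, taken from the wording of the task above.
STATES = ("extracted", "translated", "review_required", "delivered")


def summarize_run(segment_states):
    """Count segments per state and express each as a percentage of the run."""
    counts = Counter(segment_states)
    total = sum(counts.values())  # includes any labels outside STATES
    summary = {"total": total}
    for state in STATES:
        n = counts.get(state, 0)
        percent = round(100.0 * n / total, 1) if total else 0.0
        summary[state] = {"count": n, "percent": percent}
    return summary


run = ["translated", "translated", "review_required", "delivered"]
print(summarize_run(run))
```

Guarding the empty-run case matters because a version whose artifacts have aged out may have no segments left to count.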

Checks:

  • users can answer "what changed between runs?" without falling back to raw database inspection
  • per-run DOCX and TMX access remains available from the file-centric surface
  • large files remain inspectable without loading every segment at once
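The "without loading every segment at once" check points toward cursor-style pagination combined with filters. The following in-memory sketch shows the idea; the row shape and the `compare_page` name are assumptions, and a database-backed version would express the same cursor as a query condition with a limit.

```python
def compare_page(rows, *, changed_only=False, after_index=-1, page_size=50):
    """Return one page of comparison rows and a cursor for the next page.

    rows is assumed to be sorted by segment_index, with each row holding
    old_text and new_text for one segment across two runs.
    """
    page = []
    for row in rows:
        if row["segment_index"] <= after_index:
            continue  # already served on an earlier page
        if changed_only and row["old_text"] == row["new_text"]:
            continue  # filter: keep only segments whose text changed
        page.append(row)
        if len(page) == page_size:
            break
    # A full page may have more rows after it; a short page is the last one.
    next_cursor = page[-1]["segment_index"] if len(page) == page_size else None
    return page, next_cursor


rows = [
    {"segment_index": i, "old_text": "a", "new_text": "b" if i in (1, 3, 4) else "a"}
    for i in range(5)
]
first, cursor = compare_page(rows, changed_only=True, page_size=2)
print([r["segment_index"] for r in first], cursor)
```

Using the last served segment index as the cursor keeps pages stable even when filters skip many rows.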

Tests to add:

  • file-history API tests
  • comparison filtering and pagination tests
  • artifact-summary regression tests
  • large-history performance smoke tests

Exit criteria:

  • file/version monitoring is clearly file-centric instead of primarily job-centric

Phase 14: Production Runtime Baseline

Goal: Close the gap between the current development-first runtime and a supportable production deployment baseline.

Tasks:

  • Add a supported PostgreSQL configuration path and migration/test guidance.
  • Replace the implicit runserver deployment assumption with a documented production web-process contract.
  • Move migrations out of normal web-container startup and define a one-off migration step.
  • Define the supported storage/broker/database matrix for non-dev deployments.
  • Update Compose or add a separate checked-in deployment example that reflects the supported baseline more honestly.

Checks:

  • the app boots cleanly on PostgreSQL
  • the web process runs without runserver
  • migrations can be executed independently of normal web startup

Tests to add:

  • PostgreSQL smoke tests
  • deployment-startup smoke tests for the production web-process path
  • migration compatibility tests against PostgreSQL

Exit criteria:

  • the runtime docs no longer depend on development-only assumptions for the baseline deployment story

Phase 15: Worker Topology And Retrieval Contract

Goal: Make the queue topology and retrieval behavior match what the runtime docs claim and what operations can actually support.

Tasks:

  • Decide whether retrieve_context becomes a real task or is removed from routing.
  • Ship a supported worker-role topology, including scheduler execution for maintenance.
  • Validate and document the recommended broker path for production use.
  • Define the approved-memory and glossary retrieval strategy on PostgreSQL.
  • Add the indexing/reindexing workflow required by the chosen retrieval strategy.

Checks:

  • every routed queue corresponds to a real supported contract
  • the worker split and scheduler story can be run repeatedly without undocumented steps
  • retrieval behavior is explicit in code and docs
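The first check can be automated with a small test that compares routing configuration against registered tasks. The sketch below assumes routes are expressed as a dict of task name to `{"queue": ...}`, which is one of the forms Celery's `task_routes` setting accepts; the task names are hypothetical and glob-style route patterns are not handled.

```python
def unbacked_routes(task_routes, registered_tasks):
    """Return {queue: [task names]} for routes that point at unregistered tasks."""
    missing = {}
    for task_name, route in task_routes.items():
        if task_name not in registered_tasks:
            missing.setdefault(route["queue"], []).append(task_name)
    return missing


# Hypothetical configuration resembling the mismatch described in this plan.
routes = {
    "workflow.extract": {"queue": "extract"},
    "workflow.retrieve_context": {"queue": "retrieve_context"},
    "workflow.translate": {"queue": "translate"},
}
registered = {"workflow.extract", "workflow.translate"}
print(unbacked_routes(routes, registered))
```

A check like this would fail until `retrieve_context` either gains a real task or is removed from routing, whichever way the decision goes.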

Tests to add:

  • split-worker and scheduler smoke tests
  • retrieval integration tests on the chosen database backend
  • indexing/reindexing idempotency tests

Exit criteria:

  • the runtime architecture no longer contains queue contracts or retrieval promises that are only partially implemented

Phase 16: Artifact Transfer And Delivery Hardening

Goal: Reduce web-tier bottlenecks for large files and tighten the correctness bar for final delivery artifacts.

Tasks:

  • Add signed direct-to-object-storage upload and finalize flows for large DOCX and TMX files.
  • Define server-side finalize validation for uploaded artifacts before they enter the workflow.
  • Add stronger post-reassembly structural validation for delivered DOCX files.
  • Block delivery when those structural checks fail.

Checks:

  • large uploads do not require the web tier to proxy the entire payload
  • finalize validation rejects incomplete or invalid uploads
  • structurally invalid reassemblies fail closed instead of shipping as completed jobs
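A minimal form of post-reassembly structural validation is to confirm that the output is a readable ZIP package whose core parts exist and parse as XML. The sketch below uses only the standard library. It assumes the main document part lives at the conventional `word/document.xml` path, and the function names are illustrative. Deeper checks, such as relationship integrity, would sit on top of this.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Conventional part names; a strict check would resolve the main part
# through the package relationships instead of assuming its path.
REQUIRED_PARTS = ("[Content_Types].xml", "word/document.xml")


def structural_problems(docx_bytes):
    """Return a list of structural problems; an empty list means none found."""
    try:
        archive = zipfile.ZipFile(io.BytesIO(docx_bytes))
    except zipfile.BadZipFile:
        return ["not a readable ZIP container"]
    problems = []
    with archive:
        names = set(archive.namelist())
        for part in REQUIRED_PARTS:
            if part not in names:
                problems.append(f"missing part: {part}")
                continue
            try:
                ET.fromstring(archive.read(part))
            except ET.ParseError as exc:
                problems.append(f"malformed XML in {part}: {exc}")
    return problems


def can_deliver(docx_bytes):
    """Fail closed: deliver only when no structural problem was found."""
    return not structural_problems(docx_bytes)


print(can_deliver(b"not a docx"))
```

Wiring `can_deliver` into the delivery stage is what turns validation from a report into a gate.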

Tests to add:

  • signed upload/finalize flow tests
  • invalid finalize-path tests
  • post-reassembly structural validation regression tests

Exit criteria:

  • the delivery path is materially safer and more scalable than the current proxy-through-web model

Phase 17: Observability And Operability

Goal: Make the system diagnosable and supportable across web, worker, storage, and provider boundaries.

Tasks:

  • Add OpenTelemetry-based tracing for web requests and Celery tasks.
  • Add metrics for queue latency, job-stage durations, throughput, and failure classes.
  • Add correlation IDs across web logs, worker logs, and task execution.
  • Define the minimum operational dashboards or log queries needed to support the system.
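Correlation IDs can be carried with a context variable and stamped onto log records by a logging filter. The sketch below shows only that mechanism. The header key and helper names are assumptions, and how the headers travel with a queued task is left to the task framework.

```python
import contextvars
import logging
import uuid

# Holds the ID for the current request or task; "-" means none is set.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationIdFilter(logging.Filter):
    """Attach the current correlation ID to every log record."""

    def filter(self, record):
        record.correlation_id = correlation_id.get()
        return True


def new_correlation_id():
    """Start a new correlation scope, e.g. at the start of a web request."""
    cid = uuid.uuid4().hex
    correlation_id.set(cid)
    return cid


def outgoing_headers():
    """Headers the web side could attach when it enqueues work (hypothetical key)."""
    return {"correlation_id": correlation_id.get()}


def adopt_from_headers(headers):
    """Restore the ID on the worker side before the task body runs."""
    correlation_id.set(headers.get("correlation_id", "-"))


handler = logging.StreamHandler()
handler.addFilter(CorrelationIdFilter())
handler.setFormatter(logging.Formatter("%(correlation_id)s %(message)s"))
logger = logging.getLogger("correlation_demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

new_correlation_id()
logger.info("job accepted")
```

With the same ID present in web logs, worker logs, and audit entries, the check below about tracing one job from intake to delivery becomes a log query rather than a database investigation.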

Checks:

  • one job can be traced from intake through worker execution and delivery
  • failures can be correlated across logs and task boundaries

Tests to add:

  • tracing/metrics smoke tests where practical
  • correlation-ID propagation tests

Exit criteria:

  • operators can diagnose real workflow failures without relying on ad hoc database inspection

Phase 18: Product Contract Closure

Goal: Resolve the remaining open decisions that still affect deployment claims and future plan stability.

Tasks:

  • Decide the supported production database target.
  • Decide the supported worker topology and scheduler story.
  • Decide the private deployment minimum, including whether fully disconnected execution is a real target.
  • Update Open Questions to remove decisions that become resolved.
  • Re-check Current Product Scope and Runtime Architecture after those decisions land.

Checks:

  • the product and architecture docs stop carrying critical deployment ambiguity

Exit criteria:

  • the next implementation-plan revision can focus on concrete delivery work rather than unresolved platform promises

Explicit Non-Goals Unless The Product Docs Change

Do not put these back into the next plan without first changing the repo-truth docs:

  • a separate SPA or standalone frontend beyond the Django-template console
  • Django admin as a supported operator workflow
  • non-DOCX source intake as if it were already on the near-term product path

Documentation Rule For Future Changes

When any of the gaps above is closed, update the scoped docs in this order:

  1. Runtime Architecture
  2. Setup And Configuration
  3. Operations And Scaling
  4. the workflow, storage, API, or product docs affected by the change

If a change is still undecided, put it in Open Questions instead of documenting it as current behavior.