Enterprise Readiness Assessment

Subject: someli-api (the backend HTTP API + background job fleet)
Audience: CTO, engineering leadership, security & compliance, ops
Out of scope: the web client, the embedded dashboard frontend, business-side cron jobs not present in this repo. Those will be assessed separately.

This document is an enterprise auditor's view of what it takes to turn the current backend into a platform that can credibly serve enterprise customers, pass a SOC 2 / ISO 27001 audit, scale horizontally, and be operated by an on-call rotation without heroics.

It does not prescribe specific code changes. It identifies the structural, architectural, and operational gaps, names the standards we should hold the platform to, and proposes a phased modernization roadmap.

1. Executive Summary

1.1 Current state in one paragraph

someli-api today is a distributed monolith: one Express HTTP application sharing a single MySQL database with ~108 background workers. The HTTP layer is concentrated in two single-file route modules totaling ~46,000 lines (routes/auth.js at 24,432 lines and routes/routes.js at 21,631 lines). Workers communicate via DB polling rather than a job queue. There are no automated tests, no structured logging, no rate limiting, no global error handler, no API contract document, and no observability pipeline beyond pm2 logs. Secrets — including a Slack bot token, the Polotno render key, and the Express session secret — are hardcoded into source. CI/CD deploys to a single host via SSH. The system ships features and serves real users today despite its design rather than because of it. It cannot pass a serious enterprise procurement security review, it cannot horizontally scale to a second host without correctness regressions (double-posting, race conditions on the polling lock, real-time state divergence), and it cannot survive a credible technical due-diligence pass by an acquirer or a sophisticated enterprise customer.

1.2 Verdict

Readiness rating: Level 1 of 5 (see Maturity Model).

This rating is below the floor that most enterprise procurement processes will accept and below the level at which the current owners' stated ambitions for the platform can be realized. Level 1 is recoverable — many successful enterprise platforms started here — but recovery is a deliberate, multi-quarter program, not a series of sprints. Many of the gaps in this report are consequences of trading engineering rigor for shipping velocity, which is rational at small scale and self-defeating at the scale this platform is now operating at.

1.3 The shape of the work

Three categories of investment, roughly in order of urgency:

Priority	Theme	Why now
P0 — Risk	Security gaps, secrets in source, no rate limiting, no global error handler, hardcoded credentials	These are findings that block enterprise contracts and pose immediate breach risk
P1 — Operability	Observability, structured logging, job queue, deployment automation, IaC	Without these, scaling the team or the customer base will produce on-call pain at a faster rate than headcount can absorb
P2 — Architecture	Modular boundaries, language/runtime strategy, data partitioning, contract-first APIs	These are the long-lead investments. They unlock everything else but take time

1.4 Recommended posture

Do not rewrite. Do not decompose into microservices on day one. The value-bearing path is:

Stabilize — close the highest-severity security and operability gaps in the existing monolith (3–6 months).
Modularize — refactor the monolith into bounded, independently-testable modules sharing one process (6–12 months).
Extract selectively — only break out modules into separate services when there is a specific operational reason (different scaling profile, different team, different compliance boundary) — typically year 2+.

Important framing. "No rewrite" and "modular monolith" are not synonyms for "small change." Phases 1 and 2 of this roadmap touch substantially every file in the codebase. The label "no rewrite" describes a strategy for preserving running behavior in production — it does not describe the size of the engineering investment. The investment is large. Treating it as small is the most common way these programs fail.

On language: the platform does not need a wholesale runtime change, but it does need to stop being a single-runtime monolith. The recommended target is Node.js + TypeScript for the HTTP layer and Go (or JVM/Kotlin) for the worker fleet, joined by a real job queue. The earlier all-Node-forever framing has been revised after a candid review of Node.js as an operational platform — see §5.1 Language & Runtime.

Calibration for team size. This document was originally drafted assuming a larger, more experienced engineering organization. It has been re-tuned for a small team with limited prior experience refactoring at this scale. Concretely, that means the recommendations bias toward:

Boring choices over clever ones. Direct function calls over event buses. One database over many. One process over many. Synchronous over distributed. The complexity tax on every "interesting" choice falls hardest on a small team that has to operate it.
Incremental change over rewrites. No restructuring PR should be too large for one engineer to review meaningfully. If it is, split it.
Tooling that enforces patterns automatically. Linters, CI gates, and templates do the work that a senior architect would otherwise do in code review. With a small team and limited senior bandwidth, these are essential.
A bias toward not adopting things. When in doubt about whether to add a pattern, library, framework, language, or service — don't. If the current code works, leaving it alone has a lower risk than changing it. Re-evaluate when the team has evidence of a real pain it would solve.
One step at a time. Phases in this roadmap are sequential for a reason. Skipping ahead (e.g., trying to extract microservices before the modular monolith exists) is a leading cause of failure.

The recommendations below have been written with these constraints in mind. Where the doc says something can wait — it can. Where it says something must happen now — it must, regardless of team size.

1.5 Hard Truths

This section exists because the rest of this document is written in audit-report register. The owners of this platform have set a high bar, and a high bar deserves a direct read of what is actually being looked at. The following statements cannot be diplomatically softened:

The two route files — routes/auth.js (24,432 lines) and routes/routes.js (21,631 lines), ~46,000 lines combined — are engineering malpractice. There is no defensible reading of this. No code review process produced these files; they accumulated over 3.5 years across 21 contributors and 1,490+ commits to routes.js alone. Any senior engineer evaluating this codebase in a hiring or due-diligence context identifies the problem inside their first 30 minutes and forms an opinion that is hard to recover. The files are not the output of automation or AI generation — they are the residue of an organizational gap (no architectural review process with authority to block merges that grow these files).
Hardcoded secrets in committed source code are grounds for failing any responsible vendor security review. A Slack bot token, a Polotno render key, the Express session secret, and likely more are currently in source. They are in git history. They must be considered compromised and rotated today, before anything else in this roadmap.
Zero automated tests on a paid B2B platform in 2026 is below the industry minimum acceptable threshold. This is not a "we'll get to it" item. It is the reason every refactor is risky, every PR review is shallow, every regression is found by a customer, and every senior hire that reads the codebase declines.
The double-posting risk under retries is not theoretical. Without idempotency keys on social-publishing jobs, a single retried network blip can post the same content to a customer's audience twice. A high-follower brand experiencing this is a customer-relationship event that no apology fully repairs and no contractual SLA covers.
The platform cannot horizontally scale today. A second host running these workers would race the first host on the polling pattern and double-process records. The Socket.IO global means real-time state would be wrong on a fraction of requests. The platform's growth ceiling is the largest VM that can be purchased — and that ceiling is closer than the current load suggests.
Technical due diligence by an acquirer or a sophisticated investor would identify the issues in this document inside week 1 and meaningfully discount the valuation. This is not hypothetical. It is what every technical due-diligence checklist screens for, and the findings here would not survive that screen.
The current state actively impairs senior hiring. Engineers with 7+ years of experience who read this codebase as part of an interview process will decline. The exact level of engineer needed to fix it is the level least likely to accept the role as it stands. The codebase improvements are the recruiting tool.
The duplications in the codebase are an organizational symptom, not a technical one. Two RAG implementations, three database drivers, dual auth paths, 86 separate S3 client instantiations, 49 separate Polotno instantiations — none of these are individually catastrophic. Collectively, they indicate that there is no architectural review with authority before code merges. The fix lives in the org chart and the PR template, not in the IDE.
The platform is one bad incident away from a public reputation problem. Any one of the following is presently within reach: a double-post on a high-follower brand, an AI-generated post containing PII surfaced to an unintended audience, or a leaked customer list via a CORS misconfiguration. The probability is not zero, and it grows with the customer base.
Maintaining the current trajectory is the most expensive option on the table. The cost of the modernization roadmap below is real but bounded; the cost of continuing to defer it compounds with every new customer, every new feature, and every new engineer who has to learn the existing implicit conventions instead of reading explicit ones.

These are the items that make the roadmap urgent, not aspirational. The rest of this document describes how to address them in a way that does not destroy value to create it.

2. Methodology & Scope

2.1 What was audited

The someli-api repository (HTTP API + background jobs)
Configuration, deployment manifests, ecosystem files
The codebase documentation (under doc/) generated from prior code analysis

2.2 What was NOT audited

The web client (separate repo / deliverable)
The dashboard frontend (server-rendered analytics UI)
Out-of-band data pipelines or business cron jobs not in this repo
The actual production infrastructure (we audited Dockerfile, nginx config, ecosystem.config.js — not the running cluster)
Vendor-side configuration (SendGrid, Paddle, AWS account hardening)

2.3 Frameworks referenced

This assessment is structured against widely-adopted frameworks so findings map cleanly to procurement and audit requirements:

AWS Well-Architected Framework (six pillars: operational excellence, security, reliability, performance efficiency, cost optimization, sustainability)
OWASP API Security Top 10 (2023)
CIS Controls v8
SOC 2 Trust Services Criteria (security, availability, processing integrity, confidentiality, privacy)
ISO/IEC 27001:2022 (information security management)
DORA metrics (deployment frequency, lead time, MTTR, change failure rate) for delivery performance
Twelve-Factor App (config, dependencies, processes, port-binding) — a baseline check

3. Maturity Model

A 5-level maturity rating per pillar. 1 = ad-hoc / heroic, 5 = optimized / measured / continuously improving.

Pillar	Current	12-mo Target	24-mo Target
Architecture & Modularity	1	3	4
API Contract & Versioning	1	3	4
Data Architecture	1	3	4
Background Processing	1	4	4
Security & Compliance	1	3	4
Observability	1	4	4
Reliability & Resilience	1	3	4
Scalability	1	3	4
Testing & Quality Gates	1	3	4
CI/CD & Deployment	2	4	4
Infrastructure as Code	1	3	4
Cost Visibility & FinOps	1	2	3
Documentation & Knowledge Mgmt	3	4	4
Team Practices & Governance	2	3	4

Note on the rating revision. Data Architecture, Reliability & Resilience, and Scalability were dropped from 2 to 1 in this revision. A platform that cannot run on a second host without correctness regressions does not earn a "repeatable" rating on Scalability. A platform with no migration tool, three competing DB drivers, and analytics tables in the OLTP store does not earn a 2 on Data Architecture. A platform with no idempotency, no graceful shutdown, and a single-DB SPOF does not earn a 2 on Reliability. The previous ratings were generous.

Rating rubric

1 — Initial: ad-hoc, heroic, undocumented, single point of human failure
2 — Repeatable: same person can do it twice the same way; partly documented
3 — Defined: codified, anyone on the team can do it, low truck-factor risk
4 — Managed: instrumented, measured, SLO-bounded
5 — Optimizing: continuous improvement loop, automated remediation

Most healthy enterprise platforms operate at level 3–4 across most pillars. Level 5 is rare and usually only worth it for the 1–2 pillars where you have a competitive moat.

4. Current-State Findings (by pillar)

This section reads like an audit report — dispassionate, quantified where possible, citing the artifact in the codebase that grounds the finding.

4.1 Architecture & Modularity

Finding	Severity	Evidence
HTTP routes concentrated in two files totaling ~46,000 lines	High	`routes/auth.js` (24,432 lines), `routes/routes.js` (21,631 lines)
No internal module boundaries — billing logic, content generation, social publishing, AI orchestration, and analytics all share the same module graph and import each other freely	High	Codebase structure
108 background jobs at the project root with no namespacing or grouping	High	`ls job_*.js`
Three different MySQL access patterns coexist (`mysql` async, `mysql2/promise`, `sync-mysql`)	Medium	See `architecture-overview.md`
Dual authentication paths (Passport JWT + custom AES-encrypted token in `methods.js`)	Medium	`middlewares/auth.js`, `methods.js`
Two parallel RAG implementations (in-memory + Vertex Cloud RAG) with no clear policy on which to use when	Medium	`helper/ragProcess.js`, `helper/functionsForAi/cloudRag.js`
The dashboard sub-app is partly a separate process and partly mounted into the main app — the boundary is unclear	Low	`dashboard/server.js` and `server.js`

The two route modules totaling ~46,000 lines (routes/auth.js at 24,432 lines and routes/routes.js at 21,631 lines) are the single most visible architectural failure in the codebase. Neither file can be edited safely (the surface area is too large for any individual to hold in working memory), neither can be reviewed effectively (a meaningful PR-level review requires holding the same context), and neither can be tested in isolation. Splitting them is on the same order of magnitude of work as rebuilding the rest of the system combined, and yet it is unavoidable: as long as they exist in their current form, every other improvement in this document is bottlenecked by them.

Forensic note. A natural question is whether these files are the residue of automated or AI-driven code generation that nobody dared touch. They are not. The git history shows ~1,490 commits to routes/routes.js alone, distributed across 21 distinct human contributors over 3.5 years (August 2022 → March 2026), with no AI-tooling co-author trailers and no machine-generation markers in the source. Stylistic markers — inconsistent var/const, persistent typo "carousal" propagated through commits, three S3 client instances named s3 / s2 / s32, 235 commented-out code blocks, terse commit messages like "changes" and "carousal edit changes" — are all consistent with hand-authored code under shipping pressure. AI tooling appears in the codebase only as a product feature (calling Claude / GPT / Gemini APIs on behalf of users), never as a code-generation aid. The fix is therefore organizational as much as technical — without a code-review practice that has authority to block file-growth, the same pattern will reproduce in whatever modules replace these.

Risk: any meaningful change requires a developer to hold either route file in their head. Onboarding new engineers takes weeks; senior engineers form a negative opinion of the codebase before they have read 5% of it. Refactoring is high-risk because there are no tests to catch regressions, and so refactoring stops happening.

4.2 API Contract & Versioning

Finding	Severity	Evidence
No OpenAPI / Swagger / AsyncAPI specification	High	Repo grep
No versioning strategy — endpoints have no `/v1/` prefix or header-based versioning	High	All routes
~728 endpoints (per the API inventory) without a published contract	High	`API-inventory.md`
No request/response schema validation beyond one `registerVal` rule	High	`middlewares/validation.js`
Inconsistent response shape (envelope vs raw JSON vs plain string)	Medium	See `error-handling.md`
No deprecation policy or sunset headers	Medium	—

Risk: breaking changes ship to clients without warning. External integrators cannot generate SDKs. The web client and the API are coupled through tribal knowledge.
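To make the schema-validation gap concrete, the following is a minimal sketch of declarative request validation with a single error envelope. It is illustrative only: `createPostSchema`, its field names, `validate`, `validateBody`, and the `VALIDATION_FAILED` code are all hypothetical and not taken from the someli-api codebase. A maintained schema library would normally replace the hand-written `validate` function.

```javascript
// Hypothetical schema for an illustrative "create post" request body.
// Field names and limits are assumptions, not taken from the codebase.
const createPostSchema = {
  caption: { type: 'string', required: true, maxLength: 2200 },
  scheduledAt: { type: 'string', required: false },
  accountId: { type: 'number', required: true },
};

// Returns a list of human-readable problems; an empty list means valid.
function validate(schema, body) {
  const errors = [];
  const input = body !== null && typeof body === 'object' ? body : {};

  for (const [field, rule] of Object.entries(schema)) {
    const value = input[field];
    if (value === undefined || value === null) {
      if (rule.required) errors.push(`${field} is required`);
      continue;
    }
    if (typeof value !== rule.type) {
      errors.push(`${field} must be of type ${rule.type}`);
      continue;
    }
    if (rule.maxLength !== undefined && value.length > rule.maxLength) {
      errors.push(`${field} must be at most ${rule.maxLength} characters`);
    }
  }

  // Reject fields the schema does not declare, so the contract stays explicit.
  for (const field of Object.keys(input)) {
    if (!Object.prototype.hasOwnProperty.call(schema, field)) {
      errors.push(`${field} is not allowed`);
    }
  }
  return errors;
}

// Express-style middleware factory using the standard (req, res, next) shape.
function validateBody(schema) {
  return (req, res, next) => {
    const errors = validate(schema, req.body);
    if (errors.length > 0) {
      return res
        .status(400)
        .json({ error: { code: 'VALIDATION_FAILED', details: errors } });
    }
    return next();
  };
}
```

The design point is that the schema is data, not code scattered through a handler, so the same object could later feed a published contract document.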

4.3 Data Architecture

Finding	Severity	Evidence
Single MySQL instance for the entire system (~140+ tables)	High	`data-model.md`
No schema migration tool — schema changes are applied manually	High	No `migrations/` directory
Mixed sync and async drivers in the same process — `sync-mysql` blocks the event loop	High	Multiple jobs
Heavy use of polling-style queries (`SELECT WHERE status IN (0, 2)` every 10–30s) creates persistent DB load	High	All `job_*.js` files
No read replicas, no separation of OLTP and analytics	Medium	Infrastructure
Analytics tables (`tSPostInsights`, `tSAccountIns`) sit alongside transactional tables in the same DB	Medium	`data-model.md`
No query-performance monitoring (no slow-query log surfacing, no APM-level DB tracing)	Medium	—
No data retention policy visible	Medium	—
Backups: not auditable from this repo	Unknown	Out of scope here

Risk: schema drift between environments, no rollback plan for destructive migrations, the database is a single point of failure for every runtime process (the HTTP app plus the ~108 workers), and analytics queries can starve the OLTP workload.

4.4 Background Processing

Finding	Severity	Evidence
108+ workers polling MySQL on cron schedules with no proper job queue	Critical	`ecosystem.config.js`, `job_*.js`
`ecosystem.config.js` has at least 7 broken or wrong-path script references	High	See `jobs-inventory.md` § Known Discrepancies
Status flow (`0 → 2 → 3`) is encoded inconsistently across jobs — some retry, some give up after one failure	High	Per-job code
No dead-letter queue, no exponential backoff	High	—
Concurrency control is per-process via an `isOnProcess` flag — does not survive multi-host deployment	High	All cron jobs
Polotno renderers (49 callsites) instantiate fresh Chromium per call — no pooling	High	`media-processing.md`
Failed external API calls (social publishing, AI providers) have no idempotency keys, so retries can double-post	Critical	All publishing jobs
No observability — there is no dashboard showing queue depth, throughput, or per-job error rate	High	—

Risk: scaling to a second host today would cause double-posts to social media and double-charges to AI providers because the polling pattern lacks distributed locking. The DB absorbs a steady stream of needless polling queries: 108 workers polling every 10–30 seconds is roughly 4–11 polls per second, or hundreds per minute, before counting any follow-up queries each poll triggers. Memory pressure from un-pooled Chromium instances is the single biggest source of crashes.
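The locking gap described above can be illustrated with a conditional claim, where the database rather than a per-process flag decides which worker owns a job. This is a sketch under stated assumptions: the `jobs` table, the `claimed_by` column, and the meanings given to status values `0` and `2` are hypothetical, and the table is simulated in memory so the example runs without MySQL.

```javascript
// The SQL a worker would issue. Table and column names are hypothetical.
// The WHERE clause makes the claim conditional: only one worker's UPDATE
// can match a row that is still pending.
const CLAIM_SQL =
  'UPDATE jobs SET status = 2, claimed_by = ? WHERE id = ? AND status = 0';

// Assumed meanings for this sketch only.
const PENDING = 0;
const IN_PROGRESS = 2;

// In-memory stand-in for the table. It mimics the "affected rows" count
// that a database driver reports after an UPDATE.
class FakeJobsTable {
  constructor(rows) {
    this.rows = new Map(rows.map((row) => [row.id, { ...row }]));
  }

  claim(jobId, workerId) {
    const row = this.rows.get(jobId);
    if (!row || row.status !== PENDING) return { affectedRows: 0 };
    row.status = IN_PROGRESS;
    row.claimedBy = workerId;
    return { affectedRows: 1 };
  }
}

// A worker processes the job only if its own UPDATE changed the row.
function tryClaim(table, jobId, workerId) {
  return table.claim(jobId, workerId).affectedRows === 1;
}

const table = new FakeJobsTable([{ id: 42, status: PENDING }]);
console.log(tryClaim(table, 42, 'host-a')); // true: host-a owns the job
console.log(tryClaim(table, 42, 'host-b')); // false: already claimed
```

The sketch leaves out what happens when a worker crashes after claiming; a real design would need a lease or timeout so abandoned claims can be recovered.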

4.5 Security & Compliance

Plainly: the security posture documented below would not be approved for production by a competent CISO. Any vendor security questionnaire (SIG, CAIQ, the AWS shared-responsibility checklist) submitted in this state would surface multiple disqualifying findings. Several individual items below are independently sufficient grounds for a procurement team to reject the vendor — the cumulative effect is that the platform is currently un-sellable to a buyer with mature security review.

Finding	Severity	Evidence
Hardcoded Slack bot token (`xoxb-...`) in source	Critical	Multiple `job_*.js` files
Hardcoded Polotno render key in 49 files	High	`media-processing.md`
Hardcoded Express session secret	High	`server.js`
No rate limiting anywhere — including on `/auth/login` and `/auth/register`	Critical	`security.md`
CORS open to all origins (`cors()` with no options)	High	`server.js`
`helmet` not loaded — no security headers	High	`server.js`
150 MB JSON body limit globally — DoS-friendly	High	`server.js`
Only one validation rule (`registerVal`) across ~728 endpoints	High	`middlewares/validation.js`
No global error handler — stack traces may leak to clients	Medium	See `error-handling.md`
No SAST in CI, no Dependabot/Snyk dependency scanning	High	`.github/`
No penetration testing record visible	Medium	—
No formal threat model or security review process	High	—
User input is templated into AI prompts — no prompt-injection mitigations	Medium	`helper/functionsForAi/getAiParameters.js`
AES key for token encryption: rotation policy undocumented	High	`helper/tokenGenerator.js`
GDPR data subject rights (access, deletion) — no documented endpoint or process	High (if EU customers)	—
No SOC 2 readiness — no access controls audit trail, no formal change management	High	—
Audit logging of admin actions: none observed	High	—

Risk: the platform cannot pass an enterprise security review (Vendor Risk Assessment, SOC 2 audit, GDPR DPA negotiation). The hardcoded Slack token and the open CORS configuration are the most pressing because they can be exploited today with public information.
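As an illustration of the missing rate limiting on routes such as `/auth/login`, here is a minimal fixed-window limiter. It is a sketch, not a recommendation of this exact algorithm: `createRateLimiter`, `rateLimit`, the `RATE_LIMITED` code, and the limits in the wiring comment are hypothetical, and the counter lives in process memory, so it is only correct on a single host and never evicts old keys.

```javascript
// Fixed-window counter: at most `limit` hits per key in each window.
// `now` is injectable so the behaviour can be tested without real time.
function createRateLimiter({ limit, windowMs, now = Date.now }) {
  const windows = new Map(); // key -> { startedAt, count }

  return function isAllowed(key) {
    const current = now();
    const entry = windows.get(key);
    if (!entry || current - entry.startedAt >= windowMs) {
      // First hit for this key, or the previous window has expired.
      windows.set(key, { startedAt: current, count: 1 });
      return true;
    }
    entry.count += 1;
    return entry.count <= limit;
  };
}

// Express-style middleware wrapper keyed on the caller's IP address.
function rateLimit(options) {
  const isAllowed = createRateLimiter(options);
  return (req, res, next) => {
    if (!isAllowed(req.ip)) {
      return res.status(429).json({ error: { code: 'RATE_LIMITED' } });
    }
    return next();
  };
}

// Hypothetical wiring, with illustrative limits:
// app.post('/auth/login', rateLimit({ limit: 5, windowMs: 60000 }), handler);
```

Once the platform runs on more than one host, the counter would have to move to a shared store for the limit to hold across hosts.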

⚠ Assume the hardcoded Slack token, Polotno key, and session secret are compromised. They have been committed to a git repository. Even if the repo is private now, treat them as exposed and rotate.
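After rotation, the replacement values should come from the environment rather than from source. The sketch below shows one way to load them and fail fast when any are missing. The environment variable names and the `loadSecrets` helper are hypothetical; the real names depend on whatever secret manager or process manager the deployment uses.

```javascript
// Hypothetical variable names, chosen for illustration only.
const REQUIRED_SECRETS = [
  'SLACK_BOT_TOKEN',
  'POLOTNO_RENDER_KEY',
  'SESSION_SECRET',
];

// Reads every required secret from the given environment object and fails
// fast, listing all missing names at once, instead of starting the process
// half-configured.
function loadSecrets(env = process.env) {
  const missing = REQUIRED_SECRETS.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing required secrets: ${missing.join(', ')}`);
  }
  // Frozen so no later code can swap a secret at runtime by accident.
  return Object.freeze({
    slackBotToken: env.SLACK_BOT_TOKEN,
    polotnoRenderKey: env.POLOTNO_RENDER_KEY,
    sessionSecret: env.SESSION_SECRET,
  });
}
```

Calling `loadSecrets()` once at startup and passing the result down keeps every secret lookup in one place, which also makes a later move to a managed secret store a single-file change.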

4.6 Observability

Finding	Severity	Evidence
No structured logging — `console.log` everywhere	Critical	`logging-observability.md`
`winston` is in `package.json` but never used	Medium	—
No request logger — incoming requests are not captured	High	`server.js`
No correlation IDs / trace IDs	High	—
No metrics endpoint, no Prometheus/StatsD/CloudWatch publication	High	—
No distributed tracing	High	—
No APM (no Datadog / New Relic / Sentry agent)	High	—
No alerting on error rate, latency, or job failures	Critical	—
`/health` and `/db-health` exist — that's the entire health-monitoring surface	(Positive baseline)	`server.js`
No SLOs / SLIs defined	High	—
Logs survive only as long as `pm2 logrotate` keeps them on the host	High	`ecosystem.config.js`

Risk: when something breaks in production, the engineer on call has pm2 logs | grep as their primary tool. There is no way to:
- Correlate a customer report to a request, a request to a downstream call, or a call to a database query.
- Detect a regression, because there is no baseline to deviate from.
- Demonstrate uptime to enterprise customers (no measured SLA).
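The correlation gap above can be sketched with Node's built-in `AsyncLocalStorage`, which carries a per-request ID through asynchronous calls so every log line can include it. This is a minimal illustration, not a logging stack: the `x-correlation-id` header name, the log field names, and the `logger` and `correlationMiddleware` helpers are assumptions.

```javascript
const { AsyncLocalStorage } = require('node:async_hooks');
const { randomUUID } = require('node:crypto');

// Holds per-request context for the lifetime of each request.
const requestContext = new AsyncLocalStorage();

// Builds one JSON log line. The field names are illustrative, not a standard.
function formatLog(level, message, fields = {}) {
  const store = requestContext.getStore();
  return JSON.stringify({
    time: new Date().toISOString(),
    level,
    message,
    correlationId: store ? store.correlationId : null,
    ...fields,
  });
}

const logger = {
  info: (message, fields) => console.log(formatLog('info', message, fields)),
  error: (message, fields) => console.error(formatLog('error', message, fields)),
};

// Express-style middleware: reuse an inbound ID if the caller sent one,
// otherwise mint a new one, and expose it to everything downstream.
function correlationMiddleware(req, res, next) {
  const correlationId = req.headers['x-correlation-id'] || randomUUID();
  res.setHeader('x-correlation-id', correlationId);
  requestContext.run({ correlationId }, next);
}
```

With this in place, a call such as `logger.info('publish started', { jobId })` made anywhere inside a request handler emits a line that can be joined to the originating request by its `correlationId`.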

4.7 Reliability & Resilience

Finding	Severity	Evidence
No circuit breakers on external API calls (Gemini, Bedrock, Vertex, social platforms)	High	All AI helpers + publishing jobs
No retry-with-backoff helper — retries are per-job with hardcoded counts	Medium	Per-job code
No idempotency keys on outbound calls	Critical	All publishing jobs
No bulkhead / resource isolation — one runaway job can starve the whole process	High	Single-process design
No graceful shutdown handling (SIGTERM → drain → exit)	Medium	`server.js`
Single MySQL = single point of failure	High	Infrastructure
Single Polotno key, single SendGrid key — no failover providers	Medium	`helper/constants.js`
No chaos testing	Medium	—
No disaster recovery plan or RTO/RPO commitments	High	—
No backup verification (untested backups = no backups)	High	—

Risk: a single bad deploy or a single AI-provider outage can cause cascading failures. The double-posting risk under retries is the most acute correctness issue in the platform: a multi-thousand-follower brand getting the same post from your platform twice is a customer-relationship event that no apology fully repairs and no contractual SLA covers. The probability is not theoretical; it is "every retry on every flaky outbound API call until idempotency keys are introduced."
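To show what an idempotency key buys, the sketch below derives a stable key from the identity of a publish attempt and refuses to run the same publish twice. The inputs to the key (`postId`, `accountId`, `scheduledAt`), the `PublishLedger` class, and the in-memory map are hypothetical; in practice the uniqueness check would have to live in shared storage, for example a table with a unique constraint on the key.

```javascript
const { createHash } = require('node:crypto');

// Derives a stable key from the identity of the publish attempt. A retry of
// the same job always produces the same key. Input fields are hypothetical.
function idempotencyKey({ postId, accountId, scheduledAt }) {
  return createHash('sha256')
    .update(`${postId}:${accountId}:${scheduledAt}`)
    .digest('hex');
}

// In-memory stand-in for a table with a UNIQUE constraint on the key.
class PublishLedger {
  constructor() {
    this.completed = new Map(); // key -> result of the first successful publish
  }

  // Runs `publish` at most once per key; retries get the recorded result.
  publishOnce(job, publish) {
    const key = idempotencyKey(job);
    if (this.completed.has(key)) {
      return { result: this.completed.get(key), duplicate: true };
    }
    const result = publish(job); // if this throws, nothing is recorded
    this.completed.set(key, result);
    return { result, duplicate: false };
  }
}
```

This covers retries that follow a recorded success. The harder case, where the remote platform accepted the post but the response was lost, still needs either provider-side idempotency support or a lookup against the platform before retrying.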

4.8 Scalability

Finding	Severity	Evidence
Single-host deployment — workers cannot run on a second host without coordination	Critical	`ecosystem.config.js`
`socketConnection` global stores only the most recently connected client — Socket.IO does not horizontally scale	High	`server.js`, see `realtime-events.md`
Polling-based job queue does not scale (more workers = more wasted DB queries)	High	All cron jobs
Sync MySQL calls block the Node event loop	High	`sync-mysql` users
Polotno renderers (Chromium) consume ~100 MB+ each — no pooling, no quota	High	`media-processing.md`
150 MB body limit means a single request can monopolize memory	High	`server.js`
No read replicas	Medium	Infrastructure
No CDN for generated media (per the docs)	Medium	`media-processing.md`

Capacity ceiling: in its current form the platform tops out at the size of the largest VM you can buy + the largest MySQL instance you can buy. This is fine for tens of thousands of users; it is not fine for hundreds of thousands or for spiky enterprise customers.

4.9 Testing & Quality Gates

Finding	Severity	Evidence
Zero automated tests	Critical	`package.json` — `npm test` is a placeholder
No CI test gate	Critical	`.github/workflows/`
No type system (plain JS, not TS) — typos are caught at runtime	High	`package.json`
No linting policy (no `.eslintrc` enforced in CI)	Medium	—
No code-coverage tracking	Medium	—
Code review process: not visible from this repo	Unknown	—
No staging environment is documented as "production-like" — UAT is the only intermediate	Medium	`deployment.md`

Risk: zero automated tests on a paid B2B platform in 2026 is below the industry minimum acceptable threshold. Every refactor is a coin flip; the route files have grown to 24k and 21k lines respectively because splitting them is scarier than appending to them; the codebase actively repels the senior engineers required to fix it. Tests would change all of these calculations. They are the single highest-leverage cultural investment available — once tests exist, refactoring becomes possible, code review becomes meaningful, and onboarding accelerates.

4.10 CI/CD & Deployment

Finding	Severity	Evidence
Jenkins deploys via SSH + PM2 to a single host	High	`deployment.md`
GitHub Actions exist but their scope is unclear from this audit	Medium	`.github/workflows/`
No blue/green or canary deployment	High	—
No automated rollback on health-check failure	High	—
Container images are built ad-hoc (the `Dockerfile` exists but the deployment uses PM2 directly, not the container)	Medium	`Dockerfile`
No artifact signing or supply-chain controls (SBOM, cosign, Sigstore)	Medium	—
No environment promotion gate (dev → uat → prod is branch-merge based, no automated tests gate the merge)	High	`deployment.md`

Risk: a bad deploy at 4 PM Friday is a ticket to a long weekend. Rollback requires a human to SSH in and pm2 restart the prior version after manually checking out the prior commit.

4.11 Infrastructure as Code

Finding	Severity	Evidence
No Terraform / Pulumi / CloudFormation visible	High	Repo
nginx config and PM2 ecosystem are committed — partial IaC	Medium (positive)	`Dockerfile`, `ecosystem.config.js`
No environment definition — dev/uat/prod differences are tribal knowledge	High	—
No documented infrastructure topology	High	`deployment.md` (high level only)

Risk: rebuilding production from scratch in a disaster scenario is a guess-and-check exercise. Compliance auditors will ask "where is the documented architecture?" and the answer is "in someone's head."

4.12 Cost Visibility & FinOps

Finding	Severity	Evidence
No per-feature cost tracking (AI spend, S3 spend, Polotno render hours)	High	—
No per-tenant cost attribution — cannot answer "how much does customer X cost us?"	High	—
No cost-of-goods-sold tracking on AI calls	High	—
AI calls fan out across multiple providers (Gemini, Bedrock, Vertex, OpenAI) — provider mix not measured	Medium	`helper/aiLogics.js`
No anomaly detection on cloud spend	Medium	—

Risk: unit economics are opaque. A handful of high-usage customers could be costing more than they pay; without tagging and metering you cannot tell.
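The metering this finding asks for can be small to start with. The sketch below totals AI spend per tenant from per-call usage records. Everything in it is an assumption made for illustration: the record shape (`tenantId`, `provider`, `tokens`), the `costByTenant` helper, and the price table, whose numbers are invented and expressed in an arbitrary integer unit per 1,000 tokens.

```javascript
// Invented prices in an arbitrary integer unit per 1,000 tokens.
// Real provider pricing differs and changes over time.
const EXAMPLE_PRICES = { gemini: 500, bedrock: 800 };

// Sums cost per tenant from usage records, one record per AI call.
function costByTenant(records, prices) {
  const totals = new Map(); // tenantId -> accumulated cost
  for (const { tenantId, provider, tokens } of records) {
    const price = prices[provider];
    if (price === undefined) {
      // Fail loudly: an unpriced provider would silently understate cost.
      throw new Error(`No price configured for provider: ${provider}`);
    }
    const cost = (tokens * price) / 1000;
    totals.set(tenantId, (totals.get(tenantId) || 0) + cost);
  }
  return totals;
}

const usage = [
  { tenantId: 'tenant-a', provider: 'gemini', tokens: 2000 },
  { tenantId: 'tenant-a', provider: 'bedrock', tokens: 1000 },
  { tenantId: 'tenant-b', provider: 'gemini', tokens: 1000 },
];
console.log(costByTenant(usage, EXAMPLE_PRICES));
```

The prerequisite is that every AI call records which tenant triggered it; without that tag at call time, no later aggregation can recover the attribution.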

4.13 Documentation & Knowledge Management

Finding	Severity	Evidence
The `doc/` directory has been recently filled out — this is a positive	(Positive)	`doc/`
No architecture decision records (ADRs)	Medium	—
No runbooks for on-call (what to do if X is failing)	High	—
No public API documentation portal	High	—
No internal wiki referenced from this repo	Unknown	—

Risk (mitigated): documentation has improved recently. The remaining gap is operational documentation — runbooks and decision records — not architectural documentation.

4.14 Team Practices & Governance

These cannot be fully audited from a code repository, but observable signals:

Signal	Reading
Branching: main / uat_api / dev_api with explicit promotion	Positive — disciplined release cadence
Co-author commits visible	Positive — team collaboration
Long-lived files (24k-line `auth.js`, 21k-line `routes.js`)	Concerning — refactoring is being deferred
Hardcoded secrets in committed code	Concerning — code review is not catching security issues
7 broken references in `ecosystem.config.js`	Concerning — config drift between code and infra

Recommendation: the code-review checklist needs to include security (no committed secrets, no missing rate limit on auth routes, no missing validation) and observability (every new endpoint logs, every new job has metrics).

5. Strategic Architecture Decisions

This is where the assessment becomes prescriptive. These are the 5–7 decisions that everything else flows from. Get these right and the rest is execution; get them wrong and the rest is rework.

5.0 Two Streams: Feature Development and Refactoring Coexistence

Before any of the architectural decisions below can be executed, the platform has to solve a structural problem that defeats most modernization programs: two streams of work running in parallel without aligned standards.

The trap

   Without gates:                            With gates:
   ──────────────                            ──────────

   Feature stream ──► creates new mess        Feature stream ──► follows new practices
                                                                       │
                                                                       ▼
   Refactor stream ──► cleans up                                  cleans up new code's
        ▲                                                         contribution to debt
        │                                                              ▲
        └─ races feature stream ─┘                                     │
                                                          Refactor stream ──► cleans legacy
        (entropy wins long-term)                          (the legacy is finite; debt shrinks)

The feature stream and the refactor stream are not adversaries — but if they operate under different standards, they collide on every PR and the cleanup is endless. The only escape is gating new code behind the new practices so that the feature stream contributes to the cleanup rather than fighting it.

This is the single most consequential decision in the entire roadmap. Without it, the rest of the §5 architectural recommendations cannot converge.

What gates the streams

A documented engineering charter, a catalog of concrete practices, and a Definition of Done that every PR passes. These three documents — produced as a stake-in-the-ground before Phase 2 begins — are the gating mechanism.

Document	Purpose
Engineering Charter	The 12 commitments. Leadership-signed. The authority engineers cite when pushed to skip a practice
Engineering Practices	The full operational catalog: code style, PR rules, test rules, security defaults, observability defaults, DB hygiene, dep hygiene, doc requirements. Each practice has a rule, a why, an enforcement mechanism, and an escape valve
Definition of Done	The per-PR checklist. Lives in the GitHub PR template (see pr-template-reference.md for the content)

Three-tier enforcement¶

Practices are enforced through three concentric tiers, in descending order of strength:

Tier	Mechanism	Examples
Automated gates	Lint rules, CI checks, branch protection, pre-commit hooks	gitleaks, no-`console.log`, lint clean, tests pass, ≥1 review, no self-approval
Process gates	PR template, CODEOWNERS, ADR requirement, release checklist	Definition of Done, sensitive-path reviewers, design decisions captured
Cultural gates	Charter, onboarding, postmortem culture, recognition	"Boring code is good code"; tech-debt rotation; blameless review

Tier 1 (automated gates) is what we rely on most. Humans forget; CI doesn't. Investment in Tier 1 makes the right thing the easy thing.

Adoption waves¶

Practices are not adopted all at once; a big-bang rollout hands the team more new rules than it can absorb and enforce at the same time. They are adopted in waves, each wave closing before the next opens:

Wave	Timing	Focus	Closes when
Wave 1	Week 1	Pre-commit hooks, PR template, branch protection	Hooks installed; one week of zero direct-pushes to main
Wave 2	Weeks 2–4	CI gates: lint, typecheck, no-secrets, format	One week of zero CI bypasses
Wave 3	Months 1–2	Structured logging, validation on new endpoints, migrations framework, module-boundary linter (advisory)	One sprint of new code 100% compliant
Wave 4	Months 2–4	Patch coverage gate, OpenAPI required, module-boundary linter blocking, dep justification, ADR process	New code passes all gates routinely
Wave 5	Months 4+	SLO budgets, postmortem process, maturity rating refresh, cadence policies	Steady-state operation

The critical sequencing rule¶

Phase 2 of this roadmap (modular monolith refactor) does not start until Wave 3 of practices is in force. Otherwise the feature stream produces code in the legacy style faster than the refactor stream can clean it up, and the modular monolith never converges.

Phase 0 (security hygiene) and Phase 1 (queue, observability, migrations) can run in parallel with Waves 1–3 — they're improvements to infrastructure, orthogonal to the feature/refactor balance. But the actual code restructure waits for the gates to be in place.

How this changes day-to-day work¶

The practices fundamentally change what "ship a feature" looks like. Concrete examples in engineering-practices.md §How to use this document; summary here:

New endpoint = OpenAPI spec + validation + auth declaration + rate limit + structured logging + integration test, all in the same PR
New background job = consumes from the queue (no polling), idempotency key, retry/backoff, structured events
Schema change = migration file in migrations/, never manual ALTER TABLE
New dependency = paragraph of justification in the PR; license / size / maintenance verified
Touching a legacy file = extract the handler being modified into the appropriate module; add a test as you extract (the boy-scout rule)

The boy-scout rule is the linchpin: it converts every legacy-file touch into refactor-stream contribution. Over 6–12 months, every legacy line that anyone touches gets upgraded; lines nobody touches were stable anyway. The 21k/24k-line route files shrink as work flows through them, rather than as a separate cleanup project.

The cost of skipping this¶

If the team skips this stream-coexistence work and goes straight to Phase 2:

The feature stream keeps appending to routes/routes.js and routes/auth.js. The "modular monolith" is a half-built parallel structure that nobody trusts. Engineers don't know whether to add a new endpoint to the legacy file or the new module — and inconsistent choice is itself a debt category.
The refactor stream rebuilds modules from a moving target. As they extract, the source file is being modified concurrently. Merge conflicts dominate. Velocity collapses.
Feature delivery slows because every PR has to navigate ambiguous structure. Cleanup velocity slows because of merge conflicts. Both streams underperform.
Within ~6 months, leadership concludes that "the cleanup isn't working" and pulls engineers back to feature work. The half-completed restructure becomes a third structural pattern alongside the legacy and what was meant to replace it.

This is the most common failure mode of platform-modernization programs. The waves-and-gates approach in this section is what prevents it.

5.1 Runtime Strategy: Polyglot with a Clear Seam¶

Recommendation: polyglot with a clear seam. Node.js + TypeScript for the HTTP API, Go for the background worker fleet (or JVM / Kotlin if the team has any prior exposure). The job queue is the seam; the two sides do not share code.

This recommendation was strengthened in this revision after a candid review of Node.js as an operational platform, not just as a language. The earlier "stay all-Node" position underweighted the operational debt that Node accumulates at the scale of a 100+ process worker fleet.

The honest assessment of Node.js¶

Node.js is a defensible choice for the HTTP layer and a poor choice for the worker fleet. The reasons it is poor for the worker fleet are not aesthetic — they are concrete, well-known to the platform owners, and have been ignored in earlier framings of this work:

Concern	Reality
JavaScript the language	The default number type is an IEEE-754 double with no separate integer type (BigInt exists, but it does not mix with ordinary numbers in arithmetic and does not serialize to JSON), which is hostile to billing and metering math; implicit coercion and `this` semantics produce a steady drip of subtle bugs that TypeScript can mitigate but not eliminate; mutability is the default; async error propagation remains subtle even with async/await.
npm supply chain	Materially worse hygiene than Maven, Go modules, or Cargo. Recurring malicious-package incidents (event-stream, ua-parser-js, the debug/chalk attacks). Deeper transitive trees, lower per-package quality bar, micro-package culture. The 86 separate S3 client instantiations and 49 separate Polotno instantiations in this codebase are partly a cultural artifact of "just `require` it again" rather than build a shared abstraction.
Optimization ceiling	V8 is excellent within its model but the model has hard limits. Single-threaded event loop makes CPU-bound work awkward (worker threads help but require careful design). Shared-memory parallelism is limited to SharedArrayBuffer and Atomics between worker threads; ordinary objects cannot be shared. GC tuning is opaque compared with the JVM. Memory profiling tooling is generations behind JFR / async-profiler / pprof.
Operational maturity	Node.js as a platform to operate is genuinely worse than Go or the JVM. PM2 vs. systemd vs. Docker vs. Kubernetes-native — there is no convention. Memory leaks are common and harder to diagnose. Process lifecycle (graceful shutdown, signal handling, supervision) is left to the developer to reinvent. There is no "this is how you run a Node service in production" the way there is for the JVM (systemd + JMX + tuning playbooks) or Go (single static binary under systemd, done).
Habits the language encourages	Loose typing leads to defensive `try/catch` walls instead of correctness. The "just install a package" reflex erodes engineering judgment. The lack of a standard project structure (compare to Rails, Django, Spring Boot) means every Node codebase is a snowflake — including this one.

These concerns are technically correct. None of them are old-fashioned. None of them disappear with TypeScript.
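
To make the first row of the table concrete, here is a minimal sketch of the floating-point drift and one common mitigation: keeping money in integer minor units. The helper names (`sumCents`, `formatCents`) are illustrative assumptions, not functions from this codebase, and amounts beyond the safe-integer range would need BigInt or a decimal library instead.

```typescript
// Illustrative only: sumCents and formatCents are hypothetical helpers,
// not functions that exist in someli-api.

// Binary floating point cannot represent most decimal fractions exactly.
const driftedTotal = 0.1 + 0.2; // not exactly 0.3

// Keeping money in integer minor units (cents) avoids the drift, as long as
// every value stays inside the safe-integer range.
function sumCents(amounts: number[]): number {
  let total = 0;
  for (const amount of amounts) {
    if (!Number.isSafeInteger(amount)) {
      throw new RangeError(`not a safe integer amount: ${amount}`);
    }
    total += amount;
  }
  if (!Number.isSafeInteger(total)) {
    throw new RangeError("total exceeds the safe-integer range");
  }
  return total;
}

// Render integer cents as a decimal string without going through a float.
function formatCents(cents: number): string {
  const sign = cents < 0 ? "-" : "";
  const abs = Math.abs(cents);
  const whole = Math.floor(abs / 100);
  const fraction = String(abs % 100).padStart(2, "0");
  return `${sign}${whole}.${fraction}`;
}

console.log(driftedTotal === 0.3);            // false
console.log(formatCents(sumCents([10, 20]))); // "0.30"
```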

Where Node.js + TypeScript is still the right answer¶

The HTTP API layer specifically benefits from Node:

Web ecosystem maturity — Express / Fastify / Hono, Passport for OAuth, the strongest selection of social-platform SDKs, mature OpenAPI tooling, and the largest selection of AI provider SDKs (Anthropic, OpenAI, Google, AWS Bedrock, Vertex) with first-class TypeScript types.
Iteration velocity for request/response handlers is genuinely faster than the alternatives.
Workload fit — HTTP request handling is exactly what the event loop is good at: I/O-bound, short-lived, high concurrency, low CPU per request.
Team familiarity — the cost of a language migration is highest on the most-changed part of the codebase, which is the HTTP layer.
TypeScript with strict + a discipline-enforcing lint configuration is good enough for this domain. It is not as principled as a properly-typed language, but the gap is smaller than the gap between Node and Go on the worker side.

The HTTP layer's pain is architectural (two route files at 24k and 21k lines, no contract, no tests) — not language-driven. Solving it via language migration spends 12–18 months of feature delivery for benefits you can largely obtain from TypeScript + modular structure + an OpenAPI contract.

Where Node.js is the wrong answer: the worker fleet¶

The 108-job background fleet is where Node's operational weaknesses compound:

Workers are long-lived processes, not short-lived request handlers. Memory leaks are no longer an annoyance; they are an outage.
Workers do CPU-bound work (Polotno renders, image processing, AI orchestration with parsing) that the event loop is poorly suited for.
Workers fan out to multiple external APIs in parallel — the natural concurrency model is goroutines or virtual threads, not Promise.all orchestration on a shared event loop.
Workers are deployed as a fleet of similar processes — a single static binary deploying onto a fleet is a fundamentally simpler operational story than PM2 supervising N Node processes per host with no convention for log rotation, graceful drain, or memory bounding.
Workers are memory-sensitive at scale. A Node worker baseline is ~80–120 MB resident. 108 workers × ~80–120 MB ≈ 8.6–13 GB of overhead before any actual work happens (≈ 10.8 GB at ~100 MB each). Go worker baselines are ~10–20 MB, or ≈ 1.1–2.2 GB for the same fleet. The infrastructure-cost delta scales linearly with worker count and is real money on a multi-host fleet.
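
The memory arithmetic in the last point can be reproduced with a few lines. This is a back-of-envelope sketch using the per-worker baselines quoted above and decimal gigabytes (1 GB = 1000 MB); `fleetBaselineGb` is a hypothetical helper, and real resident sizes should come from measurement.

```typescript
// Hypothetical helper: idle memory overhead of a worker fleet, in decimal GB.
function fleetBaselineGb(workerCount: number, mbPerWorker: number): number {
  return (workerCount * mbPerWorker) / 1000;
}

const workers = 108;

// Node baseline range quoted in the text: ~80-120 MB per worker.
const nodeLowGb = fleetBaselineGb(workers, 80);   // 8.64
const nodeHighGb = fleetBaselineGb(workers, 120); // 12.96

// Go baseline range quoted in the text: ~10-20 MB per worker.
const goLowGb = fleetBaselineGb(workers, 10);     // 1.08
const goHighGb = fleetBaselineGb(workers, 20);    // 2.16

console.log(`Node: ${nodeLowGb}-${nodeHighGb} GB, Go: ${goLowGb}-${goHighGb} GB`);
```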

The two acceptable target architectures¶

Option A: Node.js + TypeScript (HTTP API) + Go (worker fleet).

Strongest case if the team is willing to grow Go expertise.
Single static binary deployment for workers — no PM2, no Node-runtime debugging on the worker side.
Goroutines fit the social-platform fan-out pattern naturally.
Excellent profiling tooling (pprof), runtime metrics (runtime/metrics), built-in race detector, mature memory model.
AI provider SDKs exist for Go (the official Anthropic and OpenAI Go SDKs, plus the official AWS and Google Cloud SDKs) and are production-ready.
Hiring quality bar is higher than Node hiring in most markets — selects for systems thinkers more reliably.

Option B: Node.js + TypeScript (HTTP API) + JVM/Kotlin (worker fleet).

Strongest case if the team has any JVM background.
Java 21+ virtual threads ("Project Loom") make I/O-bound concurrency clean without reactive ceremony; they are the JVM's direct analogue of goroutines.
Best-in-class observability: JFR (always-on, near-zero overhead), async-profiler, JMX, Java Mission Control.
Mature work-stealing thread pools, mature memory tuning, mature operational playbooks.
Higher memory footprint than Go, slower startup (mitigable with GraalVM AOT), more boilerplate (less so with Kotlin).
Larger ecosystem than Go.

Pick one. Do not pick both. A single auxiliary runtime is a tractable operational addition; two is a recipe for fragmented expertise and incoherent tooling.

For this platform — given its size, the team profile suggested by the codebase, and the absence of existing JVM tooling — Option A (Go) is the recommended default. It is the simpler operational story to introduce alongside Node.

The seam¶

The two runtimes communicate exclusively through the job queue (BullMQ on Redis is acceptable short-term but pins both sides to Node consumers; for a polyglot future the queue should be NATS, RabbitMQ, or Temporal — see §5.5). The HTTP API enqueues jobs; the worker fleet drains them. Neither side imports the other's code. Neither side reads the other's database tables directly. Each side optimizes for its own constraints.
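
Because neither side imports the other's code, the only thing the two runtimes share is the shape of the message on the queue. A minimal sketch of such a contract on the TypeScript side, assuming a versioned JSON envelope; the field names (`schemaVersion`, `jobType`, `idempotencyKey`, `enqueuedAt`, `payload`) are illustrative assumptions, not an existing format in this codebase.

```typescript
// Hypothetical job envelope: the only contract shared across the seam.
interface JobEnvelope {
  schemaVersion: 1;
  jobType: string;
  idempotencyKey: string;
  enqueuedAt: string; // ISO 8601 timestamp
  payload: Record<string, unknown>;
}

// The HTTP API side serializes the envelope before enqueueing it.
function encodeJob(job: JobEnvelope): string {
  return JSON.stringify(job);
}

function isPlainObject(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// A consumer validates before trusting anything; a worker in another
// language would implement the same checks against the same JSON shape.
function decodeJob(raw: string): JobEnvelope {
  const value: unknown = JSON.parse(raw);
  if (!isPlainObject(value)) {
    throw new Error("job envelope must be a JSON object");
  }
  if (value.schemaVersion !== 1) {
    throw new Error(`unsupported schemaVersion: ${String(value.schemaVersion)}`);
  }
  for (const field of ["jobType", "idempotencyKey", "enqueuedAt"]) {
    if (typeof value[field] !== "string" || value[field] === "") {
      throw new Error(`missing or invalid field: ${field}`);
    }
  }
  if (!isPlainObject(value.payload)) {
    throw new Error("payload must be a JSON object");
  }
  return {
    schemaVersion: 1,
    jobType: value.jobType as string,
    idempotencyKey: value.idempotencyKey as string,
    enqueuedAt: value.enqueuedAt as string,
    payload: value.payload,
  };
}
```

Rejecting unknown schema versions outright is a deliberate choice in this sketch: a consumer that guesses at a newer format is harder to debug than one that fails loudly.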

What we explicitly reject¶

Rewriting the HTTP API in Go or Java/Kotlin. The HTTP layer's pain is architectural, not language-driven (see above). Spending 12–18 months migrating the layer that changes most often is the worst possible use of senior engineering time.
Adding Python for the AI work. The AI surface in this codebase is thin enough that Node and Go SDKs are sufficient. Python earns its way in only if and when significant ML training, model serving, or framework-heavy LangChain-style work emerges. Do not pre-introduce a third runtime.
Rust. Almost certainly never warranted for this domain. Mentioned here only to close it out.
Staying single-runtime indefinitely. This is the path of least resistance and has been the implicit position to date. It is no longer acceptable under enterprise-scale ambitions because the worker fleet's operational ceiling is Node's operational ceiling, which is too low.

What this costs¶

A polyglot architecture is not free. Costs to budget for:

Hiring or upskilling. Typically 3–6 months for a strong Node engineer to become productive in Go; longer for the JVM with no prior exposure.
Operational tooling for two runtimes — CI build pipelines, deployment, observability agents, security scanning, dependency management.
A genuine queue infrastructure — already in the roadmap regardless (§5.5).
A short period of duplicated worker logic during migration as old Node workers are retired one at a time. Plan on 9–15 months from first Go worker to last Node worker retired.

The benefits — predictable memory, real concurrency, single-binary deployment, lower infra cost on the largest fleet, and operational tooling that has been mature for a decade — pay this back within 12 months for a fleet of this size.

Risk of NOT acting¶

The all-Node baseline locks the platform into the current operational profile permanently. Worker memory pressure will be a recurring source of incidents; PM2-supervised Node fleets will not survive a serious enterprise availability target without continuous heroics; senior backend hiring becomes harder in 2026+ as strong systems engineers increasingly select against pure-Node shops. The longer the worker fleet stays in Node, the more the architecture petrifies around its limitations and the more expensive the eventual migration becomes.

5.2 Architecture Pattern: Modular Monolith¶

Recommendation: Modular monolith. Stop here unless an extraction is forced by a real, observed pain.

This is an engineering exercise, not a domain-modelling one. someli-api is a focused product in a single business domain (social media management for SMBs). It is not a multi-domain enterprise platform. The word "module" in this section means a unit of code organization — not a separately-deployed service, not a separate database, not a separate team's territory. Earlier drafts of this doc used "bounded context" terminology imported from Domain-Driven Design; that framing was heavier than this codebase warrants and has been removed.

The simple rules¶

Rule	What it means in practice
One process	Single Node.js application. No microservices.
One database	Single MySQL (or, after migration, single PostgreSQL). No per-module databases.
One team	The same engineers own everything. No "module owners."
One deployment	One Docker image, one PM2 / Fargate task definition for the API. (Workers may have their own task definitions; for as long as they remain on Node they share the codebase.)
Direct function calls	Cross-module communication is `await content.generateDraft(...)`, not events / queues / HTTP.
Lint-enforced boundaries	A linter prevents one module from reaching into another's internals. The boundary lives in the file system, not in the network.

If anyone proposes deviating from these rules — separate services, message buses between modules, separate databases per area — the burden of proof is on them, and the bar should be a specific observed problem the simpler approach cannot solve. Theoretical purity is not a reason.

The path¶

Today (Month 0): Single Express process, no internal boundaries. Two files (routes.js, auth.js) hold most of the HTTP code.

Modular monolith (Month 6–18): Same process, same database, same team, but the codebase is reorganized into clearly-named feature modules:

src/
  modules/
    identity/         # auth, users, accounts, sessions, OAuth
      index.ts        ← public API: login, getUser, refreshToken, ...
      internal/       ← private to this module
    content/          # AI generation, RAG, design rendering, library
      index.ts        ← public API: generateDraft, getLibraryItem, ...
      internal/
    publishing/       # social platform integrations, scheduling
      index.ts
      internal/
    analytics/        # post insights, account insights
      index.ts
      internal/
    billing/          # Paddle, subscriptions, quotas
      index.ts
      internal/
    notifications/    # email queue, Slack alerts
      index.ts
      internal/
  platform/           # shared infrastructure
    db.ts             # the ONLY place that constructs a DB pool
    queue.ts          # the ONLY place that constructs a queue client
    s3.ts
    logger.ts
    config.ts

Other modules import only from modules/<name> (which resolves to index.ts). They do not import from modules/<name>/internal/.... A linter (eslint-plugin-boundaries or dependency-cruiser) enforces this in CI.
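
The rule the linter enforces is small enough to state as a function. The sketch below is not how eslint-plugin-boundaries or dependency-cruiser are configured; it is a hypothetical `isAllowedImport` helper that illustrates the decision those tools would make, assuming project-relative paths under the layout shown above.

```typescript
// Hypothetical illustration of the boundary rule, assuming project-relative
// paths such as "src/modules/content/internal/render".

// Returns the module name for a path under src/modules/, or null otherwise.
function moduleOf(path: string): string | null {
  const match = /^src\/modules\/([^/]+)(\/|$)/.exec(path);
  return match ? match[1] : null;
}

function isAllowedImport(fromFile: string, importPath: string): boolean {
  // Ignore a trailing .ts / .js extension so both spellings compare equal.
  const target = importPath.replace(/\.(ts|js)$/, "");
  const targetModule = moduleOf(target);

  // Imports of platform/ code or third-party packages are not restricted here.
  if (targetModule === null) return true;

  // A module may freely use its own internals.
  if (moduleOf(fromFile) === targetModule) return true;

  // From outside, only the module root (which resolves to index.ts) is public.
  return (
    target === `src/modules/${targetModule}` ||
    target === `src/modules/${targetModule}/index`
  );
}
```

For example, `isAllowedImport("src/modules/publishing/internal/schedule.ts", "src/modules/content")` is allowed, while the same file importing `"src/modules/content/internal/render"` is rejected.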

Stop here. This is the destination for the foreseeable future. Don't extract.

When (if ever) to extract a module into its own service¶

Module extraction is optional. It may never be needed. Don't plan it; respond to it.

A real trigger looks like this:

Trigger	Implication
One module's traffic spikes break unrelated modules	Extract the noisy one to isolate it
One area's compliance scope is materially different (e.g., billing must be PCI-isolated)	Extract for the audit boundary
The team grows to ~15+ engineers and release coordination is slowing everyone down	Extract along the team boundary
A specific module needs different language / runtime characteristics that the monolith can't accommodate	Extract that one

Without one of these triggers, stay in the monolith. The cost of a service (separate deploy, separate observability, networking, idempotency, eventual consistency, on-call) is real and recurring. Don't pay it without a reason.

Anti-patterns to avoid (especially with a small team)¶

These are tempting but wrong for a codebase at this scale:

Don't introduce an event bus between modules. Direct function calls are simpler, debuggable, transactional, and good enough for a single-process app. Events earn their way in only when you have multiple processes or genuinely async cross-module workflows.
Don't give each module its own database. One DB. Tables can be conceptually owned by a module — naming convention {module}_{table} (e.g., content_library, publishing_scheduled_posts) makes ownership visible — but it's all one schema in one connection pool.
Don't write "anti-corruption layers" between modules. That's DDD vocabulary for a problem you don't have. Just call the public function.
Don't introduce CQRS, event sourcing, or saga patterns. None of these solve a problem this codebase has today.
Don't enforce strict typed contracts at every module boundary. TypeScript types on the public functions are enough. JSON schemas, gRPC, protobuf — all overkill in a single process.
Don't over-decompose. Six or seven modules is plenty. If someone proposes splitting content/ into content-generation/, content-storage/, content-rendering/ — push back. Premature splitting is harder to undo than late splitting.

Concrete first steps for the team¶

If the team is small and not used to refactoring large codebases, do not attempt the whole reorganization in one PR. The wrong way to do this is to produce a giant restructure that nobody can review. The right way is incremental, with the running code working at every step.

Sequence:

Create the directory skeleton. src/modules/{identity,content,publishing,analytics,billing,notifications}/index.ts and src/platform/. All empty for now. Land that PR.
Move the centralized DB module first (this is also a §5.4 prerequisite). Every place that constructs new mysql({...}) or mysql2.createPool is replaced with import { db } from 'src/platform/db'. One PR per N call sites until done. ~892 references — expect a month or two of small PRs.
Pick one feature area to migrate first. Notifications is the right starting choice: it's small (~5 files), it already has a queue-shaped pattern (tEmailSchedule), and a mistake there is low-blast-radius (a delayed email is recoverable). Move the SendGrid / Slack code into modules/notifications/. Establish the public API. Make all other code import from modules/notifications instead of from the helper files. Validate with manual testing.
Then do Billing. Small surface, well-bounded. Paddle integration plus a few endpoints in routes.js.
Then Identity. This means slicing routes/auth.js (24,432 lines) into pieces. Hard but high-value. Allocate weeks, not days.
Content, Publishing, Analytics in roughly that order — these are the largest and most entangled.
Add the linter rule only after the structure exists. Turning on the boundaries rule with the codebase still entangled produces a PR full of // eslint-disable lines that defeats the purpose.

A useful default while the team gets comfortable: "if you can't justify a change with a current pain, don't make it." Boring, predictable code is the goal. Cleverness costs more than it pays at this team size.

Be clear-eyed about scope. "Modular monolith" is not a small change. Reorganizing 46,000 lines of route code, 108 worker files, and 892 DB-construction sites into module-shaped boxes is roughly equivalent in scope to rebuilding the platform's internal structure from the ground up. The label "no rewrite" describes a strategy for preserving running production behavior — it does not describe the size of the engineering effort. Phase 2 is a controlled, incremental refactor executed without taking the system offline. Treating it as small is the most common way these programs fail.

Risk of NOT acting: without internal module boundaries, every new feature touches the same files as every old feature. Code review degrades into "looks fine to me" because no one can hold the surface area in their head. New engineers take months to become productive. The codebase grows surface area faster than the team grows headcount, and that gap only ever shows up one way: more bugs, slower delivery, more painful on-call.

5.3 API Layer: Contract-First, Versioned, Gateway-Fronted¶

Recommendation: OpenAPI 3.1 specification as the source of truth. Generate types and validators from the spec.

Specifically:

- Adopt an OpenAPI-first workflow. Either approach below is acceptable; spec-first is more enterprise-friendly because it gives non-engineers (product, partners) a contract to design against.
  - Spec-first: write the YAML/JSON, generate request/response types and validators (e.g., openapi-typescript, zod-openapi).
  - Code-first: derive the spec from typed handlers (e.g., @hono/zod-openapi, tsoa).
- Introduce a versioning scheme. URL-path versioning (/v1/, /v2/) is the most operationally simple.
- Adopt a deprecation policy: 6-month minimum sunset, signaled with the Sunset header (RFC 8594) and the Deprecation header (RFC 9745).
- Place an API gateway in front of the service for cross-cutting concerns: rate limiting, auth verification, request/response logging, schema validation, transformation. Options: AWS API Gateway, Kong, Apigee, Tyk, or a lightweight in-process equivalent (Hono + middleware).
- Publish a developer portal (Redoc, Stoplight, ReadMe) for external consumers. Even if today's only consumer is the in-house web client, the portal is a low-cost forcing function for clean API design.
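
A minimal sketch of what the deprecation policy looks like on the wire. The Sunset header is defined by RFC 8594 and carries an HTTP-date; the Deprecation header is defined by RFC 9745 and carries a Structured Fields date (`@` followed by Unix seconds). The helper names and the way the 6-month minimum is checked are assumptions for illustration.

```typescript
// Hypothetical helper: add calendar months in UTC. Month-end dates can roll
// over (for example Aug 31 + 6 months lands in early March); a real policy
// check may want stricter calendar handling.
function addUtcMonths(date: Date, months: number): Date {
  const result = new Date(date.getTime());
  result.setUTCMonth(result.getUTCMonth() + months);
  return result;
}

// Builds the response headers for a deprecated endpoint and enforces the
// 6-month minimum between deprecation and sunset.
function deprecationHeaders(
  deprecatedAt: Date,
  sunsetAt: Date,
): Record<string, string> {
  if (sunsetAt.getTime() < addUtcMonths(deprecatedAt, 6).getTime()) {
    throw new RangeError("sunset must be at least 6 months after deprecation");
  }
  return {
    // RFC 9745: Structured Fields date, "@" followed by Unix seconds.
    Deprecation: `@${Math.floor(deprecatedAt.getTime() / 1000)}`,
    // RFC 8594: HTTP-date, the format Date#toUTCString produces.
    Sunset: sunsetAt.toUTCString(),
  };
}

const headers = deprecationHeaders(
  new Date("2025-01-01T00:00:00Z"),
  new Date("2025-07-01T00:00:00Z"),
);
console.log(headers);
// { Deprecation: "@1735689600", Sunset: "Tue, 01 Jul 2025 00:00:00 GMT" }
```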

Authentication:

- Consolidate the dual auth paths (Passport JWT + custom AES). The custom encryption layer adds complexity without clear benefit over a properly configured JWT.
- Move toward a standard OAuth 2.0 / OIDC identity flow. The current implementation already uses Passport with OAuth providers; extend that to the first-party login flow.
- Issue short-lived access tokens (e.g., 15 min) + long-lived refresh tokens. Keep access tokens stateless; refresh tokens need a server-side record so they can be revoked and rotated.
- Long-term: consider a managed identity provider (Auth0, Cognito, WorkOS, Okta) to handle SSO/SAML for enterprise customers.

Risk of NOT acting: every endpoint becomes a private contract. SDK generation, OpenAPI-driven testing, and API monetization are all blocked.

5.4 Data Architecture¶

Recommendation: One database for the whole platform. Introduce a managed schema migration tool. Add a read-replica when analytics queries start contending with transactional writes.

This is intentionally simpler than the typical enterprise data-architecture playbook. someli-api is a focused product; one well-managed database is correct for it and will be correct for years. The complexity of multiple databases (cross-database joins becoming application-layer joins, eventual consistency, distributed transactions, separate backup / DR plans, separate tuning) is real and should not be paid for without a specific operational reason.

Phased approach:

Phase A (Months 0–6): Discipline in the current DB

- Adopt a migration tool: Knex migrations, Prisma Migrate, or Flyway (language-agnostic). Lock in every schema change behind a versioned migration file. No more manual ALTER TABLEs.
- Eliminate sync-mysql entirely. It blocks the event loop and is a footgun.
- Standardize on one async driver (mysql2/promise is fine; this also becomes the seam for the eventual PG migration if you decide to do it).
- Introduce a single connection pool in platform/db.ts. Every other file imports from there.
- Add slow-query logging and surface it in observability.

Phase B (Months 6–12): Decouple analytics reads

- The analytics tables (tSPostInsights, tSAccountIns) and aggregate queries (the dashboard) should run against a read replica. RDS provides this with a few clicks; no application change is needed beyond pointing the dashboard at the replica's connection string.
- This single change protects transactional writes from being slowed by dashboard scans. It does not require a separate analytics database.

Phase C (Months 12+): Optional further moves, only if Phase B turns out to be insufficient

- A dedicated analytics store (TimescaleDB, ClickHouse, BigQuery) populated via CDC (Debezium) or scheduled ETL. Justified once dashboard queries take seconds even on the replica.
- A separate database for an extracted module, only if and when a module is genuinely extracted into its own service per §5.2. This is a consequence of service extraction, not a goal in itself.

For someli-api at its current size, Phases A and B are likely the destination. Phase C should be deferred until specific operational pain emerges. Resist the temptation to plan it earlier.

Specific technology recommendations:

- PostgreSQL is a defensible target if and when migration is on the table. PostgreSQL's transactional DDL alone is worth the effort for any system running schema migrations under load, and JSONB, partial indexes, pgvector, and TimescaleDB all unlock real wins. (FOR UPDATE SKIP LOCKED, often cited in this comparison, is also available in MySQL 8.0+, so it is not by itself a reason to migrate.) But for a focused product on a small team, a wholesale MySQL → PostgreSQL migration is a 12–18 month investment that competes for the same engineering attention as the modular-monolith refactor, the queue migration, and the security work. The honest sequence is: fix the hot queries on MySQL today, finish the modular monolith, evaluate PG migration as a separate decision in year 2. See PostgreSQL migration playbook if/when that decision is taken.
- Redis for caching, session store, and as the backing for short-term job-queue infrastructure (BullMQ).
- A vector store for the RAG pipeline at scale: pgvector inside PostgreSQL is the recommended default; Pinecone or Weaviate as managed alternatives; Vertex managed corpora (already in use) for the production path.
- An object store with CDN for media: already on S3, but a CDN (CloudFront, Fastly, Cloudflare) in front of generated media is a low-cost performance and bandwidth-cost win that should not wait for Phase 3.
- A columnar / OLAP store (ClickHouse, BigQuery, Snowflake) when analytics queries on tSPostInsights outgrow the read replica. The typical trigger is when post counts exceed ~10M and dashboard queries consistently exceed 1s.

Risk of NOT acting: schema drift between environments, manual rollbacks, analytics queries blocking transactional writes, and an ever-growing monolithic DB that nobody can shard later.

5.5 Background Processing: Replace Polling with a Real Job Queue¶

Recommendation: Move from MySQL-polling cron jobs to a message-queue-backed job system. This is the single highest-impact operational change.

Options:

Option	Pros	Cons	Best for
BullMQ on Redis	Native to Node.js, low operational complexity, great DX	Tied to Node ecosystem	Default recommendation if staying on Node
Temporal	Durable workflows, handles retries / sagas / long-running ops natively, polyglot	Heavier ops, learning curve	If workflows have multiple steps with state
AWS SQS + Lambda or ECS	Managed, scales to zero, integrates with AWS estate	Vendor lock-in, harder local dev	If the rest of the infra is heavily AWS
RabbitMQ / Kafka	Industry-standard, language-agnostic, durable	More ops overhead, more concepts	If event-sourcing is on the roadmap

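For this platform, BullMQ on Redis is the lowest-friction starting point. It replaces the polling pattern with at-least-once delivery (a stalled job can be picked up and run again, which is why the idempotency keys described in this section are mandatory, not optional) and gives you out-of-the-box:

- Visibility (queue depth, throughput, failed job inspection)
- Retries with exponential backoff
- Scheduled / delayed jobs
- Concurrency control across hosts
- Dead-letter handling via the failed-job set (jobs that exhaust their retries are retained there for inspection rather than moved to a separate queue)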
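
To show what "retries with exponential backoff" means in practice, here is a minimal sketch of a capped exponential schedule. It illustrates the general technique only and does not claim to reproduce BullMQ's internal formula; `backoffDelayMs` and its default base and cap are assumptions.

```typescript
// Hypothetical helper: delay before retry number `attempt` (1-based),
// doubling each time and capped so a long outage does not push retries
// out indefinitely.
function backoffDelayMs(
  attempt: number,
  baseMs: number = 1000,
  capMs: number = 60000,
): number {
  if (!Number.isInteger(attempt) || attempt < 1) {
    throw new RangeError("attempt must be a positive integer");
  }
  return Math.min(capMs, baseMs * 2 ** (attempt - 1));
}

// Schedule for the first five retries with the defaults.
const schedule = [1, 2, 3, 4, 5].map((attempt) => backoffDelayMs(attempt));
console.log(schedule); // [1000, 2000, 4000, 8000, 16000]
```

In production, adding random jitter to each delay is a common refinement so that many failed jobs do not all retry at the same instant.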

Migration approach:

1. Stand up Redis + BullMQ alongside the existing system.
2. New jobs go through BullMQ from day 1.
3. Migrate existing polling jobs one at a time, starting with the highest-impact (publishing, which has the double-post risk) and the highest-traffic (content generation).
4. Decommission the tJobs polling pattern incrementally.

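Idempotency: at the same time, every external-API-calling job must adopt idempotency keys. Generate a deterministic key from the job payload and send it on every retry. Where the provider supports idempotency keys, the server deduplicates on it; where it does not (support varies across the social platforms and AI providers), persist the key together with the call's outcome and check it before retrying. The current absence of this is the most acute correctness risk in the platform.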
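
A minimal sketch of deriving a deterministic key from a job payload, assuming the payload is plain JSON data. The key is a SHA-256 hash over the job type and a canonical (key-sorted) serialization, so the same logical job produces the same key on every retry regardless of property order. `canonicalize` and `idempotencyKey` are hypothetical helpers.

```typescript
import { createHash } from "node:crypto";

// Serialize JSON-like data with object keys sorted, so that property order
// does not change the result. Undefined properties are skipped, matching
// JSON.stringify's behaviour for objects.
function canonicalize(value: unknown): string {
  if (value === null || typeof value !== "object") {
    return JSON.stringify(value) ?? "null";
  }
  if (Array.isArray(value)) {
    return `[${value.map(canonicalize).join(",")}]`;
  }
  const record = value as Record<string, unknown>;
  const entries = Object.keys(record)
    .filter((key) => record[key] !== undefined)
    .sort()
    .map((key) => `${JSON.stringify(key)}:${canonicalize(record[key])}`);
  return `{${entries.join(",")}}`;
}

// Hypothetical helper: the same job type and payload always yield the same key.
function idempotencyKey(jobType: string, payload: unknown): string {
  return createHash("sha256")
    .update(`${jobType}\n${canonicalize(payload)}`)
    .digest("hex");
}

const first = idempotencyKey("publish-post", { postId: 42, platform: "facebook" });
const retry = idempotencyKey("publish-post", { platform: "facebook", postId: 42 });
console.log(first === retry); // true
```

The payload has to contain something that identifies the logical operation (a scheduled-post ID, for instance). Hashing only the content would make two intentionally identical posts collide.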

Polotno pooling — alongside the queue migration, run Polotno renderers as a renderer pool service (a small fleet of pre-warmed Chromium workers consumed via a queue). This eliminates the cold-start cost on every render and makes capacity planning tractable.
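
The pooling idea can be sketched in-process: a fixed set of pre-warmed workers, with callers waiting for a free one instead of starting a new renderer per job. This is an illustration of the pattern only; the recommendation above is a separate renderer service fed by the queue, and `RendererPool` and its worker type are assumptions.

```typescript
// Hypothetical in-process pool: W stands for a pre-warmed renderer handle.
class RendererPool<W> {
  private readonly idle: W[];
  private readonly waiters: Array<(worker: W) => void> = [];

  constructor(workers: W[]) {
    this.idle = [...workers];
  }

  // Runs the task on a free worker, waiting if all workers are busy.
  async run<T>(task: (worker: W) => Promise<T>): Promise<T> {
    const worker = await this.acquire();
    try {
      return await task(worker);
    } finally {
      this.release(worker);
    }
  }

  private acquire(): Promise<W> {
    const worker = this.idle.pop();
    if (worker !== undefined) {
      return Promise.resolve(worker);
    }
    // No idle worker: park the caller until one is released.
    return new Promise<W>((resolve) => {
      this.waiters.push(resolve);
    });
  }

  private release(worker: W): void {
    const next = this.waiters.shift();
    if (next !== undefined) {
      next(worker); // hand the worker straight to the oldest waiter
    } else {
      this.idle.push(worker);
    }
  }
}
```

The pool size is the capacity-planning knob: render concurrency never exceeds the number of workers handed to the constructor.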

Risk of NOT acting: double-posting under retries (already a present risk, not theoretical), DB pressure that grows linearly with worker count, and an inability to scale workers horizontally.

5.6 Observability: OpenTelemetry End-to-End¶

Recommendation: Adopt OpenTelemetry (OTel) as the observability instrumentation standard. Pick one vendor for the backend.

The three signals:

- Logs: structured JSON, correlation ID per request, ship to centralized aggregation (Loki, ELK, Datadog, CloudWatch Logs Insights).
- Metrics: Prometheus-style or vendor-native; track the golden signals per endpoint and per job type: request rate, error rate, latency (p50/p95/p99), and saturation. Job-specific metrics: queue depth, processing time, failure rate, retry rate.
- Traces: distributed traces with OTel context propagation. Even in a monolith this is valuable: a request that touches 12 helper functions and 4 external APIs becomes legible.
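
A minimal sketch of the "structured JSON, correlation ID per request" requirement, using Node's built-in AsyncLocalStorage so the ID does not have to be passed through every function call. This is an illustration under stated assumptions, not a substitute for an OTel-instrumented logger; `runWithCorrelationId` and `formatLog` are hypothetical names.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";
import { randomUUID } from "node:crypto";

interface RequestContext {
  correlationId: string;
}

// Holds the current request's context for everything called beneath it.
const requestContext = new AsyncLocalStorage<RequestContext>();

// Hypothetical entry point: wrap each request (or job) so that all log lines
// emitted inside `fn` share one correlation ID.
function runWithCorrelationId<T>(
  fn: () => T,
  correlationId: string = randomUUID(),
): T {
  return requestContext.run({ correlationId }, fn);
}

// Produces one JSON object per log line; correlationId is null when called
// outside of any request context.
function formatLog(
  level: "info" | "warn" | "error",
  message: string,
  fields: Record<string, unknown> = {},
): string {
  return JSON.stringify({
    ...fields,
    timestamp: new Date().toISOString(),
    level,
    message,
    correlationId: requestContext.getStore()?.correlationId ?? null,
  });
}

runWithCorrelationId(() => {
  console.log(formatLog("info", "post published", { jobType: "publish-post" }));
}, "req-123");
```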

Vendor choices (pick one):
- Datadog: best-in-class UX, expensive at scale.
- New Relic: strong APM, generous free tier.
- Grafana Cloud (Loki + Tempo + Mimir): open standards, good price-performance.
- Sentry + a logs/metrics solution: Sentry for errors, something else for the rest.
- AWS native (CloudWatch + X-Ray): if AWS-aligned and cost-conscious.

The one absolute requirement: instrument with OTel SDKs, not vendor-specific SDKs. This avoids vendor lock-in and makes future migration possible.

SLOs and SLIs:
- Define an availability SLO (e.g., 99.9% of requests succeed) and a latency SLO (e.g., p95 < 500 ms for read endpoints, < 2000 ms for AI-generation endpoints).
- Track an error budget per service. When the budget is consumed, freeze feature work and pay down reliability debt.
- Make SLOs visible to the team and to customers (status page).
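The error-budget arithmetic is simple enough to show directly. The sketch below assumes a request-based availability SLO; the function names and the sample traffic numbers are made up for illustration.

```typescript
// Number of failed requests an availability SLO tolerates over a window.
function errorBudget(sloTarget: number, totalRequests: number): number {
  if (sloTarget <= 0 || sloTarget >= 1) {
    throw new RangeError("sloTarget must be strictly between 0 and 1");
  }
  // Round to absorb floating-point noise in (1 - sloTarget).
  return Math.round((1 - sloTarget) * totalRequests);
}

// Fraction of the budget still unspent; negative means the SLO is already violated.
function budgetRemaining(sloTarget: number, totalRequests: number, failedRequests: number): number {
  const budget = errorBudget(sloTarget, totalRequests);
  return budget === 0 ? 0 : (budget - failedRequests) / budget;
}

// Hypothetical month: 10 million requests against a 99.9% SLO, 2,500 failures so far.
console.log(errorBudget(0.999, 10000000)); // 10000
console.log(budgetRemaining(0.999, 10000000, 2500)); // 0.75
```

A remaining fraction at or below zero is the signal to freeze feature work, as described above.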

Status page: - Adopt a public status page (Statuspage.io, Better Uptime, Instatus, or self-hosted Cachet). Enterprise customers will ask for this in procurement.

Risk of NOT acting: mean-time-to-detect (MTTD) is currently "until a customer complains." That is incompatible with enterprise SLAs.

5.7 Security: Defense-in-Depth, Compliance-Ready¶

Recommendation: treat the security gaps as a first-quarter project, not a slow rolling effort. Most are low-effort, high-impact. A subset must happen this week and is not negotiable.

This week (zero-tolerance items):

Rotate every secret currently in source. Slack bot token, Polotno render key, Express session secret, AES token-encryption key. Treat all as compromised; they have been in git history. Rotation comes first; a secrets manager comes second.
Run gitleaks or trufflehog over the full git history. Inventory anything else that has ever been committed. Rotate everything found.
Add helmet() to the Express middleware chain. One line. Five-minute change, immediate security-header coverage.
Replace cors() with an allow-list. Fifteen minutes. Stops accepting any origin.
Add rate limiting on /auth/login and /auth/register. express-rate-limit with sensible defaults. One hour. Closes the most exploitable open door.

Total effort: under one day of focused work for a single engineer. There is no acceptable reason to defer these.
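For the CORS item, the core of an allow-list is an exact-match check on the `Origin` header. The sketch below is framework-free; the two origins are placeholders, since the real list depends on where the web client and dashboard are served from.

```typescript
// Hypothetical allow-list; replace with the real client origins.
const ALLOWED_ORIGINS: ReadonlySet<string> = new Set([
  "https://app.example.com",
  "https://dashboard.example.com",
]);

// Exact-match check. Substring or suffix matching is easy to bypass
// (for example "https://app.example.com.evil.test").
function isOriginAllowed(requestOrigin: string | undefined): boolean {
  // Requests without an Origin header (curl, server-to-server) are not subject to CORS.
  if (requestOrigin === undefined) return true;
  return ALLOWED_ORIGINS.has(requestOrigin);
}

console.log(isOriginAllowed("https://app.example.com")); // true
console.log(isOriginAllowed("https://app.example.com.evil.test")); // false
```

The `cors` middleware accepts a function for its `origin` option, so a check like this can be wired in by calling the supplied callback with the boolean result.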

Top of the list (next 90 days):

Secrets management. Adopt AWS Secrets Manager, HashiCorp Vault, Doppler, or 1Password Secrets Automation. Wire conf.js to read from it. No secret should ever land in a .env file, committed or otherwise.
Rate limiting on every high-cost endpoint (AI generation, media uploads, content posting), not just auth. Per-tenant rate limits as a precondition for any future "fair use" enforcement.
Reduce body limit to 1 MB globally; raise per-endpoint where required. Use multipart/form-data for large uploads via express-fileupload (already in use).
Global error handler that strips stack traces in production, logs structured errors, and returns the standard envelope. See error-handling.md.
Input validation everywhere. Schema-validate every request body and query parameter. Use Zod, Yup, or express-validator consistently. Once an OpenAPI spec exists, generate the validators from it and enforce them at the gateway level as well.
Audit logging. Every privileged action (admin operations, billing changes, account modifications, social-platform credential changes) writes to an append-only audit log. Required for SOC 2.
Dependency scanning. Enable Dependabot or Snyk; gate merges on no high-severity findings. Given the supply-chain history of npm specifically, this is not optional.
SAST in CI. Add SonarCloud / Semgrep / CodeQL. Block merges on critical findings.
An access-review process. Who can deploy? Who can read production logs? Who can read the database directly? Document, restrict, and review quarterly.
A documented incident response runbook. When (not if) a security incident happens, the first 24 hours determine whether the company survives the press cycle.
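The global error handler item reduces to one pure function: map any thrown value to a status code and a safe response body. The sketch below is an assumption-laden illustration. The envelope shape and the `status` / `code` fields are hypothetical, since the actual contract lives in error-handling.md.

```typescript
// Hypothetical envelope shape; the real contract is defined in error-handling.md.
interface ErrorEnvelope {
  success: false;
  error: { code: string; message: string; stack?: string };
}

// Hypothetical convention: errors thrown on purpose carry a status and a stable code.
interface AppErrorLike {
  status: number;
  code: string;
  message: string;
}

function isAppError(err: unknown): err is AppErrorLike {
  if (typeof err !== "object" || err === null) return false;
  const candidate = err as Record<string, unknown>;
  return (
    typeof candidate.status === "number" &&
    typeof candidate.code === "string" &&
    typeof candidate.message === "string"
  );
}

function toErrorResponse(err: unknown, isProduction: boolean): { status: number; body: ErrorEnvelope } {
  // Only messages from known application errors are exposed to the client.
  const status = isAppError(err) ? err.status : 500;
  const code = isAppError(err) ? err.code : "INTERNAL_ERROR";
  const message = isAppError(err) ? err.message : "Internal server error";
  const error: ErrorEnvelope["error"] = { code, message };
  // Stack traces are attached outside production only.
  if (!isProduction && err instanceof Error && err.stack) error.stack = err.stack;
  return { status, body: { success: false, error } };
}

const unexpected = toErrorResponse(new Error("ER_ACCESS_DENIED for user root"), true);
console.log(unexpected.status, unexpected.body.error.message); // 500 Internal server error
```

In Express, a function like this would be called from an error-handling middleware, which Express identifies by its four-argument signature `(err, req, res, next)` and which is registered after all routes.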

Compliance roadmap (12–24 months):

Standard	Effort	Why
SOC 2 Type I (then Type II)	High — 6 months Type I, +12 months Type II	Required for most enterprise customers; demonstrates security maturity
GDPR readiness	Medium	Data-subject rights endpoints, DPIA, DPA template, breach notification process
ISO 27001	High	Required for some EU enterprise customers; compounds with SOC 2
PCI DSS	N/A if Paddle handles all card data — verify and document	Otherwise: SAQ-A applies
HIPAA / GxP	Out of scope unless entering healthcare/pharma	—
EU AI Act	Watch closely	If AI features influence high-risk decisions, may apply

Threat modeling: adopt a lightweight model (STRIDE per feature) at the design phase of every new module. Most of the issues in this audit would have been caught by a 30-minute STRIDE review.

Risk of NOT acting: procurement security reviews will fail. A breach is a matter of when, not if, given the open CORS, no rate limiting, and committed secrets.

5.8 CI/CD & Infrastructure as Code¶

Recommendation: containerize, codify infrastructure, automate deployments with health gates.

Phase 1 (Months 0–3):
- Make the existing Dockerfile the source of truth for runtime. Stop deploying via pm2 restart directly on a host; deploy container images.
- Adopt a registry (ECR, GHCR, Docker Hub) with image signing.
- GitHub Actions for CI: lint → test → build → push image → deploy.
- Tag images with git SHA + semver; never use :latest.

Phase 2 (Months 3–9):
- IaC: Terraform (or Pulumi, AWS CDK) for every piece of infrastructure. Networking, compute, databases, queues, and observability are all codified.
- Environments: at least three (dev, staging, prod), defined identically in IaC and parameterized by environment variables.
- Deployment platform: ECS Fargate, AWS App Runner, Cloud Run, or Kubernetes (GKE / EKS / AKS / DigitalOcean). For most startups, ECS Fargate or Cloud Run hits the sweet spot of "managed but not vendor-locked."
- Blue/green or rolling deploys with automated rollback on health-check failure.
- Secrets injection at runtime, not build time.

Phase 3 (Months 9–18):
- Canary deployments for high-risk changes: route 1% of traffic to the new version, monitor error rate, ramp up.
- Feature flags (LaunchDarkly, Unleash, Flagsmith, or self-hosted): decouple deploy from release.
- GitOps: declarative config in git, synchronized to runtime by ArgoCD / Flux (if Kubernetes) or via Terraform / pipeline. Eliminates drift.
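The percentage-based ramp behind both canaries and feature flags usually rests on deterministic bucketing: the same account always lands in the same bucket, so raising the percentage only ever adds accounts. Managed flag services handle this for you; the sketch below only illustrates the idea, with hypothetical function, flag, and account names.

```typescript
import { createHash } from "node:crypto";

// Maps (flagName, accountId) to a stable bucket in the range [0, 10000).
function rolloutBucket(flagName: string, accountId: string): number {
  const digest = createHash("sha256").update(`${flagName}:${accountId}`).digest();
  return digest.readUInt32BE(0) % 10000;
}

// An account is enabled when its bucket falls below the rollout threshold.
function isEnabled(flagName: string, accountId: string, rolloutPercent: number): boolean {
  return rolloutBucket(flagName, accountId) < rolloutPercent * 100;
}

// Hypothetical flag and account IDs.
const accounts = ["acct-1", "acct-2", "acct-3", "acct-4", "acct-5"];
const enabledAtFiftyPercent = accounts.filter((id) => isEnabled("new-publisher", id, 50));
console.log(enabledAtFiftyPercent);
```

Including the flag name in the hash keeps the same accounts from always being first in line for every rollout.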

DORA targets (benchmarks for "elite" performers; aspirational year-2 state):
- Deployment frequency: multiple times per day
- Lead time for changes: < 1 day
- Change failure rate: < 15%
- MTTR: < 1 hour

Current state is unknown but almost certainly far from these. Measure them. What gets measured gets improved.
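Measuring the four metrics does not require special tooling at first; a list of deployment records is enough. The sketch below assumes a hypothetical record shape that would in practice be assembled from CI/CD and incident data, and it simplifies time-to-restore by measuring from the deploy rather than from detection.

```typescript
// Hypothetical record shape; real data would come from CI/CD and incident tooling.
interface Deployment {
  commitAt: Date; // when the change was committed
  deployedAt: Date; // when it reached production
  causedFailure: boolean; // whether the deploy degraded production
  restoredAt?: Date; // when service was restored, if it failed
}

const HOUR_MS = 60 * 60 * 1000;

function median(values: number[]): number {
  if (values.length === 0) return NaN;
  const sorted = values.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

function doraSnapshot(deployments: Deployment[], windowDays: number) {
  const leadTimes = deployments.map((d) => (d.deployedAt.getTime() - d.commitAt.getTime()) / HOUR_MS);
  const failures = deployments.filter((d) => d.causedFailure);
  const restoreTimes: number[] = [];
  for (const d of failures) {
    // Simplification: measured from the deploy, not from when the failure was detected.
    if (d.restoredAt) restoreTimes.push((d.restoredAt.getTime() - d.deployedAt.getTime()) / HOUR_MS);
  }
  return {
    deploysPerDay: deployments.length / windowDays,
    medianLeadTimeHours: median(leadTimes),
    changeFailureRate: deployments.length === 0 ? 0 : failures.length / deployments.length,
    medianTimeToRestoreHours: median(restoreTimes), // NaN when nothing failed
  };
}

// Made-up sample: four deploys over a two-day window, one of which failed.
const sampleDeployments: Deployment[] = [
  { commitAt: new Date("2026-01-05T08:00:00Z"), deployedAt: new Date("2026-01-05T10:00:00Z"), causedFailure: false },
  { commitAt: new Date("2026-01-05T09:00:00Z"), deployedAt: new Date("2026-01-05T15:00:00Z"), causedFailure: true, restoredAt: new Date("2026-01-05T15:30:00Z") },
  { commitAt: new Date("2026-01-06T08:00:00Z"), deployedAt: new Date("2026-01-06T12:00:00Z"), causedFailure: false },
  { commitAt: new Date("2026-01-06T10:00:00Z"), deployedAt: new Date("2026-01-06T18:00:00Z"), causedFailure: false },
];
const snapshot = doraSnapshot(sampleDeployments, 2);
console.log(snapshot);
```

For this sample the snapshot works out to 2 deploys per day, a 5-hour median lead time, a 25% change failure rate, and a 0.5-hour median time to restore.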

5.9 Testing Strategy¶

Recommendation: a pragmatic test pyramid, rolled out incrementally.

Layers:

Layer	Coverage	Tool	When to add
Static analysis	All code	TypeScript, ESLint, Semgrep	Phase 1
Unit tests	Pure functions, helpers	Jest, Vitest	Phase 1 (start with new code)
Integration tests	Module-level: HTTP handler → DB → response	Jest + supertest, testcontainers for DB	Phase 2
Contract tests	API schema conformance	Pact, Dredd, openapi-validator	Phase 2
E2E tests	Critical user journeys	Playwright	Phase 3
Load tests	Capacity validation pre-release	k6, Artillery	Phase 3
Chaos tests	Resilience validation	Chaos Mesh, Gremlin	Phase 4+

Strategy:
- Do not aim for 80% coverage on day 1. That is a year-long project. Instead:
  - Mandate that new code ships with tests. PRs without tests do not merge, enforced by CI rather than by reviewers.
  - Backfill tests when modifying existing code (the "boy scout rule").
  - Critical-path priority: 90% coverage on auth, billing, and publishing within 6 months. These are the modules where untested code is a customer-impact event waiting to happen, and they are small enough to make the goal tractable.
  - Within 12 months, the codebase trends toward decent coverage organically across the rest.
- Test data management: factories (factory-girl, fishery) for fixtures; never share state between tests; testcontainers for real DB integration tests, not mocks.
- Coverage as a quality signal, not a target. 100% coverage of trivial code does not protect against bugs. Aim for the high-value paths first.

Risk of NOT acting: every change is a coin flip. Velocity decays. Beyond risk to the product, the absence of a test suite is a recruiting filter — strong engineers in 2026 do not voluntarily join codebases with no automated tests, and the platform cannot fix itself without those engineers. The absence of tests is both a technical debt and an HR debt.

5.10 Cost & FinOps¶

Recommendation: basic cost attribution by Month 6, per-tenant cost visibility by Month 18.

Phase 1 (Months 0–6):
- Tag every resource (Environment, Service, CostCenter, Owner).
- AWS Cost Explorer views per service and per environment.
- AI provider cost dashboards: a daily report of $/day per provider, broken down by feature (content gen, RAG, agents).
- Budgets and alerts: Slack alert if daily spend deviates more than 30% from baseline.
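The 30% deviation alert comes down to a comparison against a baseline. The sketch below assumes the baseline is the mean of the preceding days' spend, which is one reasonable choice among several (a median or a same-weekday average would also work); the function names and sample figures are illustrative.

```typescript
// Relative deviation of today's spend from the mean of the prior days.
function spendDeviation(todayUsd: number, priorDaysUsd: number[]): number {
  if (priorDaysUsd.length === 0) throw new RangeError("need at least one prior day");
  const baseline = priorDaysUsd.reduce((sum, value) => sum + value, 0) / priorDaysUsd.length;
  if (baseline === 0) return todayUsd === 0 ? 0 : Infinity;
  return (todayUsd - baseline) / baseline;
}

// Alert on deviation in either direction: a sharp drop can mean a broken pipeline.
function shouldAlert(todayUsd: number, priorDaysUsd: number[], threshold = 0.3): boolean {
  return Math.abs(spendDeviation(todayUsd, priorDaysUsd)) > threshold;
}

// Made-up figures: a baseline of $100/day.
const priorWeek = [100, 110, 90, 100];
console.log(shouldAlert(135, priorWeek)); // true  (+35%)
console.log(shouldAlert(120, priorWeek)); // false (+20%)
console.log(shouldAlert(60, priorWeek)); // true  (-40%)
```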

Phase 2 (Months 6–18):
- Per-tenant cost attribution. Tag every AI call, S3 upload, and queue job with accountId. Build a daily mart that joins cost + usage to surface unit economics.
- Surface internal pricing. Engineers should be able to answer "how much does this feature cost per request?" via a dashboard, not a spreadsheet.
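Once each billable event is tagged with `accountId`, the unit-economics questions become simple aggregations. The sketch below assumes a hypothetical event shape with an already-attributed `costUsd` per request; in practice this logic would more likely live in SQL over the daily mart.

```typescript
// Hypothetical event shape: one row per billable request.
interface UsageEvent {
  accountId: string;
  feature: string; // e.g. "content-gen", "rag", "agents"
  costUsd: number; // cost attributed to this single request
}

interface CostSummary {
  requests: number;
  totalUsd: number;
  usdPerRequest: number;
}

// Groups events by account or by feature and computes totals and cost per request.
function summarizeBy(usage: UsageEvent[], key: "accountId" | "feature"): Map<string, CostSummary> {
  const summaries = new Map<string, CostSummary>();
  for (const item of usage) {
    const current = summaries.get(item[key]) ?? { requests: 0, totalUsd: 0, usdPerRequest: 0 };
    current.requests += 1;
    current.totalUsd += item.costUsd;
    current.usdPerRequest = current.totalUsd / current.requests;
    summaries.set(item[key], current);
  }
  return summaries;
}

// Made-up sample data.
const sampleUsage: UsageEvent[] = [
  { accountId: "acct-1", feature: "content-gen", costUsd: 0.5 },
  { accountId: "acct-1", feature: "rag", costUsd: 0.25 },
  { accountId: "acct-2", feature: "content-gen", costUsd: 1.5 },
];

console.log(summarizeBy(sampleUsage, "feature").get("content-gen")); // { requests: 2, totalUsd: 2, usdPerRequest: 1 }
console.log(summarizeBy(sampleUsage, "accountId").get("acct-1")); // { requests: 2, totalUsd: 0.75, usdPerRequest: 0.375 }
```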

Phase 3 (Months 18+):
- FinOps process. A monthly review where finance, engineering, and product look at unit economics together. The review drives decisions about feature tiering, throttling, and pricing.

Risk of NOT acting: you cannot price the product confidently or identify customers who are unprofitable.

6. Roadmap¶

A phased plan calibrated to a small-to-mid engineering team (5–15 engineers). Larger teams can compress; smaller teams will need to extend.

6.0 Phase 0a — This Week¶

These items must precede everything else in the roadmap. They are not engineering work — they are basic hygiene that should not have been deferred and cannot be deferred further.

Item	Effort	Why
Rotate Slack bot token	30 min	Currently committed in source
Rotate Polotno render key	30 min	Currently committed in 49 files
Rotate Express session secret	30 min	Currently committed in `server.js`
Rotate AES token-encryption key (with grace period for live tokens)	1–2 hrs	Reduces blast radius if previously exposed
Run `gitleaks` over the full git history	30 min	Find anything else committed
Rotate every additional secret found	varies	Same logic as above
Add `helmet()` middleware	5 min	Security headers, instant improvement
Replace `cors()` with allow-list	15 min	Stop accepting any origin
Add `express-rate-limit` on `/auth/*`	1 hr	Close the brute-force door
Audit `/uploads` static route exposure	30 min	Reduce attack surface

Total effort: under one day of focused work for one engineer. There is no acceptable reason to defer. Every day these items remain open is a day the platform is needlessly exposed.

6.1 Phase 0 — Stop the Bleeding (Months 0–3)¶

Eliminate the most acute risks. Pure investment in security and basic operability.

Workstream	Deliverable	Effort
Secrets	All secrets removed from source, rotated, in a secrets manager	M
Auth surface	Rate limiting on all `/auth/*` endpoints	S
Headers	helmet configured	XS
CORS	Allow-list defined	XS
Body limits	Reduced to 1 MB global, raised per-route only where needed	XS
Validation	Validation chain for every endpoint that accepts input	M
Error handling	Global Express error handler	S
Logging	Structured logging via Winston / Pino with correlation IDs	M
Healthcheck	Verify `/health` and `/db-health` are wired to load balancer	XS
CI gate	Lint + Dependabot alerts in CI	S
Idempotency	Idempotency keys on social-publishing jobs	M
Backups	Verify MySQL backups + restore procedure	S
Documentation	Runbooks for top 5 incidents	M

Exit criterion: the platform passes a basic external security scan and an internal "what happens if X breaks" tabletop exercise.

6.2 Phase 1 — Foundation (Months 3–9)¶

Build the things that make further work possible.

Workstream	Deliverable
TypeScript	Migrate codebase incrementally (allowJs + strict per-file). Target: all new code in TS by Month 6.
Migrations	Adopt Knex / Prisma / Flyway for schema; backfill the current schema as a baseline migration.
DB drivers	Eliminate `sync-mysql`. Standardize on `mysql2/promise`.
Job queue	Stand up Redis + BullMQ. New jobs use it; migrate top 10 highest-impact existing jobs.
Observability	OTel SDK installed. Logs, metrics, traces flowing to chosen vendor. SLOs defined.
API contract	OpenAPI spec for the 50 most-used endpoints. Spec generated in CI.
API versioning	All endpoints behind `/v1/`. Deprecation policy published.
Testing	Test infrastructure (Jest + supertest + testcontainers). New code requires tests.
Containerization	All services deployed as containers. PM2 retired in favor of a process supervisor inside the container.
IaC	Terraform for all infrastructure. Three environments codified.
CI/CD	GitHub Actions pipeline: lint → test → build → deploy → smoke test. Automated rollback.

Exit criterion: any engineer can deploy a change end-to-end without touching a server. MTTD on a production incident is under 15 minutes.

6.3 Phase 2 — Modular Monolith (Months 9–18)¶

Reorganize the codebase into named feature modules with internal boundaries. No service extractions, no separate databases, no message buses. One process, one DB, one team — see §5.2 for the rules and the step-by-step migration sequence calibrated for a small team.

Workstream	Deliverable
Module structure	Codebase organized into `modules/{identity,content,publishing,analytics,billing,notifications}` plus `platform/`. Single deployment unit, single connection pool, one team.
Module APIs	Each module exposes a typed public surface via `index.ts`. Other modules import only from the public surface.
Lint enforcement	`eslint-plugin-boundaries` (or `dependency-cruiser`) gates cross-module imports. Turn this on after the structure exists, not before.
Polotno pool	Polotno renderers run as a pool (in-process or as a sidecar). Cold-start cost eliminated.
Vector store	RAG migrated to a single implementation. Pick one (Vertex managed corpora is the production default; pgvector is fine if you've migrated to PG). The two-implementation state is resolved.
Auth consolidation	Dual auth paths (Passport + custom AES) consolidated into one. Refresh tokens introduced.
Audit logging	Append-only audit log for privileged actions. SIEM integration optional.
SOC 2 Type I	Prepare evidence; complete Type I audit.
Status page	Public status page reflecting real SLOs.
Cost attribution	Per-feature cost dashboards.

Exit criterion: the codebase is legible to a new hire within 1 week. A change inside one module rarely requires touching another module's files. Procurement security reviews are routinely passed.

6.4 Phase 3 — Optional, Triggered Service Extraction (Months 18+ — may never happen)¶

This phase is optional and trigger-driven, not scheduled. For a focused product on a small team, the modular monolith from Phase 2 is likely the destination. Don't plan extractions; respond to specific, observed pain.

A real trigger for extracting a module looks like:

If this happens	Then consider extracting
Publishing-fleet load makes the rest of the API unstable during peak posting hours	Publishing
Dashboard queries impact transactional performance even with a read replica	Analytics (or just give it its own database first)
AI-generation costs need per-tenant quotas and rate limits at the runtime level	AI / Content Generation
Compliance requires email delivery audit trail in isolation	Notifications
The team grows to ~15+ engineers and release coordination is the bottleneck	The module aligned with the new team's ownership

If none of these triggers fires, do not extract. The cost of a service is real, recurring, and proportional to the operational maturity required to run one: a separate deploy, separate observability, a separate database, and a separate on-call rotation, plus networking, idempotency, and eventual consistency. Pay it only when the alternative is worse. Don't extract for aesthetic reasons or because a blog post said you should.

Exit criterion (if reached): the extracted modules are genuinely independent — separate teams could own them, separate failure modes are isolated, separate scaling profiles are accommodated. Note that "exit criterion" is conditional on entering Phase 3 in the first place; not entering it is a valid outcome.

6.5 What gets cut¶

Be explicit about things not in this roadmap, so the team isn't surprised:

Multi-region active-active. Likely never needed for the customer profile. A single-region deployment with cross-region backups is sufficient until a specific customer asks.
Kubernetes, unless the team is already strong on it. ECS Fargate or Cloud Run is sufficient for years.
Event sourcing / CQRS. Buzzwords. Adopt if you have a specific domain that needs them; do not adopt prophylactically.
GraphQL, unless there's a strong product reason. REST + OpenAPI works.
Polyglot persistence on day 1. One DB tech for the whole platform is enough. Adding Mongo, DynamoDB, Cassandra, Redis-as-primary-store, etc. should each be a deliberate, justified decision — not a default.
Per-module databases. One database for the whole platform. Splitting it produces operational complexity that a small team cannot absorb without dropping the ball elsewhere.
Event-driven internal communication. Direct function calls between modules are simpler, debuggable, and transactional. Event buses earn their way in only when extraction has happened or genuinely-async cross-module workflows exist.
Anti-corruption layers and sagas. Neither solves a problem this codebase has at this scale.

7. Risk Register¶

Top risks with current likelihood / impact and the mitigation cost.

ID	Risk	Likelihood	Impact	Mitigation cost	Mitigation in roadmap
R-1	Hardcoded secrets compromise	High	Critical	Low	Phase 0
R-2	Double-posting to social platforms under retry	High	High	Medium	Phase 0 + Phase 1 (idempotency + queue)
R-3	Auth brute-force / credential stuffing (no rate limit)	Medium	High	Low	Phase 0
R-4	DoS via 150 MB body	Low	High	Low	Phase 0
R-5	DB single point of failure	Low	Critical	Medium	Phase 1 (managed RDS Multi-AZ + read replica). Per-module databases are NOT recommended.
R-6	OOM under Polotno load	Medium	High	Medium	Phase 2 (renderer pool)
R-7	Failed deploy with no rollback	Medium	High	Medium	Phase 1 (CI/CD)
R-8	Long MTTD due to no observability	High	Medium	Medium	Phase 1 (OTel)
R-9	Inability to pass enterprise procurement review	High	High	High	Phase 0 + Phase 2 (SOC 2 Type I)
R-10	Schema drift between environments	High	Medium	Low	Phase 1 (migrations)
R-11	Talent retention / single-engineer dependencies	Unknown	High	Medium	Phase 2 (modular monolith, ADRs, runbooks)
R-12	Vendor lock-in (AWS, specific AI providers)	Medium	Medium	High	Phase 2+ (abstraction layers, OTel, OpenAPI)
R-13	Cost run-away on AI features	Medium	Medium	Medium	Phase 1 (cost dashboards) + Phase 2 (per-tenant attribution)
R-14	Compliance: GDPR data subject request unhandled	High (if EU customers)	High	Medium	Phase 1 (basic endpoints) + Phase 2 (full process)
R-15	Supply-chain attack via dependency	Low	Critical	Low	Phase 0 (Dependabot/Snyk)
R-16	Reputational damage from a high-profile double-post incident	Medium	Critical	Medium	Phase 0 (idempotency)
R-17	Senior engineering hires decline after reading the codebase	High	High	High	Phase 1+ — the codebase improvements are the recruiting tool
R-18	Acquisition / fundraising valuation discounted on technical due diligence	High	High	High	Phase 1–2 (passing technical due diligence is a Phase 2 outcome)
R-19	Hidden blast radius — a small change breaks something distant due to entanglement	High	Medium	High	Phase 2 (modular monolith)
R-20	Vendor lock-in to specific AI providers under unilateral pricing/policy changes	Medium	Medium	Medium	Phase 2 (provider abstraction at the boundary)
R-21	A junior PR introduces a regression to a critical path because no test catches it	High	Medium	Medium	Phase 1 (testing on critical paths)
R-22	Long-running PM2 worker leaks memory and crashes during a large-customer batch	Medium	High	Medium	Phase 2 (Go/JVM workers)
R-23	A multi-region or large enterprise customer signs and immediately discovers the single-host scaling ceiling	Medium	Critical	High	Phase 1–2 (queue + horizontal-ready workers)

8. Standards & Best Practices Compliance Matrix¶

A quick view of where the platform stands against each major standard. C = compliant, P = partial, N = not compliant, NA = not applicable, Unknown = not yet audited.

OWASP API Security Top 10 (2023)¶

Item	Status	Notes
API1: Broken Object Level Authorization	P	Auth middleware exists; per-resource authorization is per-handler, no central enforcement
API2: Broken Authentication	P	Two parallel auth paths; no rate limiting on login
API3: Broken Object Property Level Authorization	N	No mass-assignment protection observed
API4: Unrestricted Resource Consumption	N	No rate limit, 150 MB body limit, no per-tenant quotas
API5: Broken Function Level Authorization	P	Role checks exist but are scattered
API6: Unrestricted Access to Sensitive Business Flows	N	No rate limiting on AI generation, posting, registration
API7: Server Side Request Forgery	Unknown	Web scraping helpers may be vectors — not audited
API8: Security Misconfiguration	N	CORS *, no helmet, hardcoded session secret
API9: Improper Inventory Management	N	No API spec, no versioning, no deprecation policy
API10: Unsafe Consumption of APIs	P	External calls are not wrapped in circuit breakers / timeouts uniformly

Twelve-Factor App¶

Factor	Status
1. Codebase	C
2. Dependencies	C
3. Config	P (most in `.env`, some hardcoded)
4. Backing services	P (DB and S3 OK; no abstraction layer for swapping)
5. Build/release/run	P (Docker exists but deploy bypasses it)
6. Processes	P (sessions in memory, Socket.IO global)
7. Port binding	C
8. Concurrency	P (PM2 cluster mode not used)
9. Disposability	N (no graceful shutdown handling)
10. Dev/prod parity	P
11. Logs	N (write to stdout but unstructured)
12. Admin processes	P (jobs are co-deployed; not separable)

SOC 2 readiness (high-level)¶

Trust Service Criterion	Status
Security (CC) — access control, change management, risk assessment	N — most missing
Availability (A) — SLO, monitoring, incident response	N
Processing Integrity (PI) — data accuracy, completeness	P
Confidentiality (C) — encryption, key management	P
Privacy (P) — data subject rights, consent	N (if applicable to customer profile)

9. Operating Model Recommendations¶

Beyond technology, an enterprise platform requires operating-model maturity.

9.1 On-call & incident response¶

24×7 on-call rotation with clear escalation paths.
Alerts routed to PagerDuty / OpsGenie / Slack-native alerting.
Runbooks for top-10 incidents — pinned, versioned, reviewed quarterly.
Postmortems for every Sev-1 / Sev-2 incident. Blameless. Action items tracked in the same backlog as features.

9.2 Change management¶

All production changes via PR.
Required reviews from CODEOWNERS for security-sensitive paths (auth, billing, secrets).
Change-management ticketing for high-risk changes (DB migrations, auth changes, infrastructure changes).
Deploy windows respected for high-risk customers if contractually required.

9.3 Architecture decision records (ADRs)¶

Every "this decision is hard to reverse" choice documented as an ADR in the documentation tree (e.g. someli-doc/audit/someli-api/adr/####-title.md).
Format: context, decision, consequences. Lightweight (1 page).
Examples for this codebase that should already exist as ADRs but don't:
Why two auth paths (Passport JWT + custom AES)?
Why two RAG implementations (in-memory + Vertex)?
Why MySQL over PostgreSQL?
Why ad-hoc cron polling over a queue?

9.4 Code review checklist (security-aware)¶

Every PR review explicitly checks:
- [ ] No secrets committed
- [ ] Inputs validated
- [ ] Errors handled and logged
- [ ] New endpoints have auth + rate limit
- [ ] New external calls have timeout + retry policy
- [ ] New schema changes have a migration
- [ ] Tests added or modified
- [ ] OpenAPI spec updated

9.5 Documentation as code¶

Documentation lives in the repo, reviewed in the same PR as the code change.
doc/ already exists — keep it as the source of truth.
Generate API docs from OpenAPI; do not handwrite endpoint lists.

10. Recommended Reading¶

For the team and for whoever owns this transformation:

"Accelerate" — Forsgren, Humble, Kim. The DORA metrics, why they matter.
"Building Microservices, 2nd ed." — Sam Newman. When to split, when not to.
"Designing Data-Intensive Applications" — Martin Kleppmann. Foundational data-architecture thinking.
"The DevOps Handbook" — Kim, Humble, Debois, Willis. Operational excellence patterns.
"Software Engineering at Google" — Winters, Manshreck, Wright. How a mature org thinks about scale.
OpenTelemetry docs — for the observability migration.
AWS Well-Architected Framework whitepapers — concise statements of the six pillars.
OWASP API Security Top 10 cheat sheet.

11. What We're Asking the Reader to Decide¶

This document is structured so that reading it should produce a small set of executive decisions. Specifically:

Do we accept the modular-monolith-then-extract trajectory? Or are we committed to microservices from day 1? (Recommendation: accept the trajectory; do not decompose prematurely.)
Do we adopt TypeScript? (Recommendation: yes, incrementally.)
Do we adopt OpenAPI as the API contract source of truth? (Recommendation: yes.)
Do we replace polling cron jobs with a real queue? (Recommendation: yes, BullMQ + Redis as the default starting point.)
Do we standardize on OpenTelemetry? (Recommendation: yes; vendor choice can be decided later.)
Which compliance certifications do we commit to and by when? (Recommendation: SOC 2 Type I in 12 months, Type II in 18, GDPR readiness in 6.)
What is the headcount investment? (Recommendation: this roadmap requires 1 platform / SRE engineer at minimum; ideally 2. Without that, the roadmap stretches by 50–100%.)

Once those decisions are made, the rest of this document becomes an execution plan rather than a strategic question.

12. Appendix: Mapping Findings to Other Documents in `doc/`¶

For traceability, here is where each finding can be cross-referenced:

Finding theme	Reference document
Auth model, dual paths	`authentication.md`
Response shape, error contract	`error-handling.md`
Observability gaps	`logging-observability.md`
Secrets, CORS, helmet, rate limit	`security.md`
Job inventory, ecosystem.config.js bugs	`jobs-inventory.md`
Polling pattern, AI workers	`user-specific-ai.md`, `content-pipeline.md`
Polotno, S3, media	`media-processing.md`
RAG implementations	`rag-pipeline.md`
Dashboard sub-app	`dashboard-analytics.md`
DB schema, table count	`data-model.md`
Architecture, route file size	`architecture-overview.md`
Deploy via SSH + PM2	`deployment.md`
Env vars, configuration	`configuration.md`
33 external services	`Integration-inventory.md`
Socket.IO global	`realtime-events.md`
Conversational agents	`agents-and-ai.md`
Notifications	`notifications.md`
Endpoint count	`API-inventory.md`

End of assessment.