First Deployment: Standing Up a New Environment

This document walks through deploying someli-api to a brand-new environment from zero — for example, when creating a new staging tenant, a new region, a recovered-after-disaster instance, or a previously undocumented production setup.

For ongoing day-to-day deploys to existing environments, see deployment.md. For local-machine setup, see getting-started.md.

[VERIFY] markers in this document indicate places where the codebase doesn't fully tell the answer and the team needs to confirm. Fill them in once and they're done forever.

1. Decide the target topology

someli-api currently uses two production-shape topologies (per deployment.md):

Topology	Where it's used today	When to use for a new env
EC2 / Lightsail + nginx + PM2	Dev (Lightsail), production (assumed)	Simpler, lower per-month cost, good for low-to-moderate traffic
Fargate (one image, many tasks)	UAT (`uapi.someli.ai`)	Better for variable load, isolates worker types into separate tasks, native to ECS

For a brand-new environment, Fargate is the recommended pattern — it's the more modern of the two, gives per-task scaling, and aligns with the §5.5 queue migration in the readiness roadmap. The EC2 + PM2 pattern is operationally heavier (manual provisioning, no auto-recovery beyond PM2's per-process restart, no per-job autoscaling).

This doc covers the Fargate pattern. If you're standing up a Lightsail / EC2 environment instead, see Appendix A.

2. AWS account prerequisites

The following must exist before any deploy can happen. Most are one-time setup per AWS account.

2.1 Networking

[VERIFY] VPC and subnets — does the team have an existing VPC layout for someli-api environments, or does each new environment get its own VPC? Recommended: one VPC per environment with public + private subnets in 2+ AZs.

Required networking primitives:

- VPC with public and private subnets in at least 2 availability zones
- Internet Gateway for the public subnets
- NAT Gateway in each AZ (for private-subnet egress to AWS APIs and the internet)
- Security groups:
  - One for the ALB (allows inbound 80/443 from anywhere)
  - One for Fargate tasks (allows inbound from the ALB SG only)
  - One for RDS (allows inbound 3306 from the Fargate task SG)
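The subnet layout listed above could be planned before touching the console. The following is a minimal sketch, assuming a hypothetical 10.0.0.0/16 VPC CIDR, /20 subnets, and the AZ names us-west-2a and us-west-2b; none of these values come from the existing environments.

```python
import ipaddress


def plan_subnets(vpc_cidr: str, azs: list[str], new_prefix: int = 20) -> dict[str, dict[str, str]]:
    """Assign one public and one private subnet per AZ from a VPC CIDR."""
    vpc = ipaddress.ip_network(vpc_cidr)
    # subnets() yields equally sized, non-overlapping blocks in address order.
    blocks = vpc.subnets(new_prefix=new_prefix)
    plan = {}
    for az in azs:
        plan[az] = {"public": str(next(blocks)), "private": str(next(blocks))}
    return plan


# Assumed CIDR and AZ names, for illustration only.
plan = plan_subnets("10.0.0.0/16", ["us-west-2a", "us-west-2b"])
for az, subnets in plan.items():
    print(az, subnets)
```

Carving all subnets from one generator guarantees they never overlap, which is the property that matters when the same plan is reused across environments.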

2.2 RDS MySQL

Engine:        MySQL 8.0 (or 5.7 to match existing — verify)
Instance type: db.t4g.medium for dev, db.m6g.large+ for prod
Multi-AZ:      Yes for prod, optional for dev
Storage:       gp3, 100 GB to start, autoscaling on
Backups:       7 day retention minimum
Encryption:    KMS-encrypted (storage and snapshots)
Subnet group:  private subnets only

[VERIFY] MySQL version — production today is on which exact MySQL version? Match it for consistency.

[VERIFY] Do existing environments use Multi-AZ? Prod should. Dev usually doesn't. Confirm before standing up.

After RDS is provisioned:

1. Connect from a bastion or jumpbox (RDS is in private subnets).
2. Create the application database and user:

CREATE DATABASE someli CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
CREATE USER 'someli_app'@'%' IDENTIFIED BY '<strong-password>';
GRANT SELECT, INSERT, UPDATE, DELETE, EXECUTE ON someli.* TO 'someli_app'@'%';
-- For schema bootstrapping you may need DDL temporarily:
-- GRANT ALL ON someli.* TO 'someli_app'@'%';
FLUSH PRIVILEGES;

3. Bootstrap the schema (see §6 below).

2.3 S3 buckets

The application uses two S3 buckets in two regions:

Bucket	Region	Purpose	Env var
Primary	us-west-2	User uploads, generated designs, RAG documents	`S3_Bucket_Name`, `S3_Region`
Secondary	us-west-1	Region-specific delivery / media library	`S3_Bucket_Name2`, `S3_Region2`

[VERIFY] Naming convention — what's the team's standard for bucket names per environment? Suggested: someli-{env}-primary-{region} and someli-{env}-media-{region}.
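If the suggested convention is adopted, the names can be generated rather than typed. This is a sketch only: the pattern is the suggestion above, not a confirmed team standard, and the regular expression encodes the general S3 rules (3 to 63 characters, lowercase letters, digits, dots and hyphens, starting and ending with a letter or digit).

```python
import re

# General S3 bucket-name shape: 3-63 chars, lowercase alphanumerics, dots, hyphens.
BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$")


def bucket_names(env: str, primary_region: str = "us-west-2", media_region: str = "us-west-1") -> dict[str, str]:
    """Build the two bucket names for an environment using the suggested pattern."""
    names = {
        "primary": f"someli-{env}-primary-{primary_region}",
        "media": f"someli-{env}-media-{media_region}",
    }
    for name in names.values():
        if not BUCKET_RE.match(name):
            raise ValueError(f"invalid S3 bucket name: {name}")
    return names


print(bucket_names("staging"))
```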

For each bucket:

- Versioning enabled (recommended)
- Server-side encryption (SSE-S3 or SSE-KMS)
- Bucket policy that allows the Fargate task IAM role to GetObject/PutObject/DeleteObject
- A lifecycle policy for cost control (e.g., move objects > 90 days to Standard-IA)

[VERIFY] Are the prod buckets behind CloudFront? If yes, also provision a CloudFront distribution per bucket and document the distribution domain.

2.4 ECR repository

Region:     us-west-2 (matches existing UAT — account 255061853867)
Repository: someli-api  (or per-environment: someli-api-dev, someli-api-uat, ...)
Lifecycle:  Keep last 50 images

[VERIFY] Single ECR repo or per-environment? UAT today uses 255061853867.dkr.ecr.us-west-2.amazonaws.com/uat. New environments should use a parallel naming scheme.

2.5 Secrets Manager

Per the security recommendations in §5.7, all secrets should live in AWS Secrets Manager rather than .env files baked into images.

Create one secret per environment:

Name:        someli-{env}-config
Type:        Other type of secret (key/value pairs)
Contents:    All the .env keys (host, user, password, AWS_BEDROCK_KEY, etc.)
KMS key:     a dedicated key per environment, recommended

The Fargate task definition pulls secrets from this at runtime via the secrets: block of the container definition.
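The secrets: entries follow a regular shape, so they can be generated from a list of key names. A minimal sketch, assuming a hypothetical secret ARN and only the four database keys that appear in §5.1; the real key list lives in configuration.md.

```python
import json


def secrets_block(secret_arn: str, keys: list[str]) -> list[dict[str, str]]:
    """Map each JSON key inside one Secrets Manager secret to a container env var."""
    # "<arn>:<json-key>::" selects a single key from the secret's key/value pairs.
    return [{"name": key, "valueFrom": f"{secret_arn}:{key}::"} for key in keys]


# Hypothetical ARN, for illustration only.
EXAMPLE_ARN = "arn:aws:secretsmanager:us-west-2:123456789012:secret:someli-dev-config-AbCdEf"
print(json.dumps(secrets_block(EXAMPLE_ARN, ["host", "user", "password", "database"]), indent=2))
```

Generating the block keeps the task definition in step with the secret's contents when keys are added later.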

[VERIFY] Does this team currently use Secrets Manager? If not, this is a Phase 0 step that will block the new-env bootstrap. Either fix the gap (recommended) or document a temporary workaround using .env baked into the image (not recommended).

2.6 GCP service account (for Vertex AI / GCS)

Per rag-pipeline.md, GCP credentials are pulled from AWS Secrets Manager. So:

1. Create a GCP service account in the GCP project (CLOUD_PROJECT_ID).
2. Grant Vertex AI User + Storage Object Admin on the relevant buckets.
3. Download the JSON key.
4. Store the JSON key in AWS Secrets Manager under the name referenced by GCS_SECRET_NAME.

[VERIFY] GCP project organization — is there one shared GCP project across environments, or one per environment? Recommended: per-environment for blast-radius isolation.

2.7 ALB / Route 53

ALB:           Application Load Balancer
Listeners:     HTTPS:443 (with ACM cert) → forward to target group
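Target group:  IP-target-type, port 80 (matches the container port in §5.1), health check on /health, healthy threshold 2
Route 53:      A record (alias) for {env}.api.someli.ai → ALB
ACM cert:      wildcard or per-host (*.someli.ai does not cover {env}.api.someli.ai; that form needs *.api.someli.ai)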

[VERIFY] Hostname convention per environment: UAT uses uapi.someli.ai. Suggested:

- dev: dev-api.someli.ai
- staging: staging-api.someli.ai
- production: api.someli.ai

3. Build the container image

[VERIFY] Dockerfile fix required first. The current Dockerfile (per deployment.md §6.3) has no yarn install step. You must fix this before standing up any new environment. See the suggested multi-stage Dockerfile in deployment.md.

Once the Dockerfile installs dependencies properly, build and push:

# Set these for the target account and environment
ACCOUNT_ID="<account-id>"
ENV_NAME="<env>"   # e.g. dev, uat
REGISTRY="$ACCOUNT_ID.dkr.ecr.us-west-2.amazonaws.com"
GIT_SHA=$(git rev-parse --short HEAD)

# Authenticate to ECR
aws ecr get-login-password --region us-west-2 \
  | docker login --username AWS --password-stdin "$REGISTRY"

# Build (with the SHA as a tag for traceability)
docker build \
  --platform linux/amd64 \
  -t "someli-api:$GIT_SHA" \
  -t someli-api:latest \
  .

# Tag for ECR
docker tag "someli-api:$GIT_SHA" "$REGISTRY/someli-api:$GIT_SHA"
docker tag "someli-api:$GIT_SHA" "$REGISTRY/someli-api:${ENV_NAME}-latest"

# Push
docker push "$REGISTRY/someli-api:$GIT_SHA"
docker push "$REGISTRY/someli-api:${ENV_NAME}-latest"

Use a SHA-tagged image (not just :latest) in the task definition — see §5.

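If you're standing up an ARM-based environment (Graviton, recommended for the worker fleet for cost), build with --platform linux/arm64 instead and set runtimePlatform.cpuArchitecture to ARM64 in each task definition, since Fargate defaults to X86_64. [VERIFY] ARM portability of native dependencies (the Chromium runtime libraries in particular) before committing to Graviton.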

4. IAM roles for Fargate

Two IAM roles are needed:

4.1 Task execution role

Used by ECS itself to pull the image, fetch secrets, and write logs.

Trust:    ecs-tasks.amazonaws.com
Policies: AmazonECSTaskExecutionRolePolicy
          + permission to read the Secrets Manager secret(s)
          + permission to KMS-decrypt (if KMS-encrypted secrets)

4.2 Task role

Used by the application code at runtime. This is what your Node.js code's AWS SDK will use.

Trust:    ecs-tasks.amazonaws.com
Policies: S3 read/write to the two app buckets
          + Bedrock InvokeModel for the relevant models
          + Secrets Manager read for the GCP creds secret (so the app can fetch it dynamically)
          + KMS Decrypt where applicable
          + CloudWatch Logs PutLogEvents (if you're logging that way)
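The S3 portion of the task role can be expressed as a small generated policy document. This is a sketch under assumptions: the bucket names are the hypothetical ones from the naming suggestion in §2.3, and the action list is only the GetObject/PutObject/DeleteObject set named there. The team's existing policies may grant more (for example listing) or less.

```python
import json


def s3_object_rw_policy(bucket_names: list[str]) -> dict:
    """Build an IAM policy document granting object read/write/delete on the given buckets."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AppBucketObjectAccess",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                # Object-level actions apply to keys inside the bucket, hence the /* suffix.
                "Resource": [f"arn:aws:s3:::{name}/*" for name in bucket_names],
            }
        ],
    }


# Hypothetical bucket names, for illustration only.
policy = s3_object_rw_policy(["someli-dev-primary-us-west-2", "someli-dev-media-us-west-1"])
print(json.dumps(policy, indent=2))
```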

[VERIFY] Production IAM policies — what's the team's existing IAM policy structure? Reuse rather than reinvent.

5. Task definitions

The Fargate "one image, many tasks" pattern means defining multiple task definitions per environment, each pointing at the same ECR image with a different command override.

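5.1 API task definition

The JSON below is annotated for readability. Strip the // comments and fill in the elided entries before registering it, because the ECS API accepts only strict JSON.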

{
  "family": "someli-api-{env}-api",
  "requiresCompatibilities": ["FARGATE"],
  "networkMode": "awsvpc",
  "cpu": "1024",
  "memory": "2048",
  "executionRoleArn": "<task-execution-role-arn>",
  "taskRoleArn": "<task-role-arn>",
  "containerDefinitions": [
    {
      "name": "api",
      "image": "<ecr>/someli-api:<sha>",
      "essential": true,
      "command": ["/entrypoint.sh"],   // existing entrypoint runs nginx + node server.js
      "portMappings": [
        { "containerPort": 80, "protocol": "tcp" }
      ],
      "secrets": [
        { "name": "host",     "valueFrom": "<secrets-arn>:host::" },
        { "name": "user",     "valueFrom": "<secrets-arn>:user::" },
        { "name": "password", "valueFrom": "<secrets-arn>:password::" },
        { "name": "database", "valueFrom": "<secrets-arn>:database::" },
        // ... all the other env vars from configuration.md
      ],
      "environment": [
        { "name": "NODE_ENV", "value": "production" }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/someli-api-{env}",
          "awslogs-region": "us-west-2",
          "awslogs-stream-prefix": "api"
        }
      },
      "healthCheck": {
        "command": ["CMD-SHELL", "curl -f http://localhost/health || exit 1"],
        "interval": 30, "timeout": 5, "retries": 3, "startPeriod": 60
      }
    }
  ]
}

5.2 Worker task definitions

For each worker that should run continuously, define a task with the same image but a different command. Example for the email-send worker:

{
  "family": "someli-api-{env}-job-send-mail",
  "...": "...",
  "containerDefinitions": [
    {
      "name": "worker",
      "image": "<ecr>/someli-api:<sha>",
      "command": ["node", "job_send_mail.js"],
      "secrets": [/* same as API */],
      "environment": [
        { "name": "NODE_ENV", "value": "production" }
      ],
      "logConfiguration": { /* per-worker log group */ }
    }
  ]
}
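Because every worker differs only in its command and name, the per-worker definitions can be generated from a list of script names. The sketch below assumes 0.25 vCPU / 0.5 GB sizing (the figure used in the cost estimate), a per-worker log group name of the form /ecs/someli-api-{env}-{worker}, and placeholder role ARNs; all three are illustrative choices, not confirmed settings.

```python
import json


def worker_task_definition(env: str, image: str, script: str, execution_role_arn: str,
                           task_role_arn: str, secrets: list[dict],
                           cpu: str = "256", memory: str = "512",
                           region: str = "us-west-2") -> dict:
    """Build a strict-JSON Fargate task definition for one worker script."""
    # job_send_mail.js -> job-send-mail, matching the family naming shown above.
    name = script.removesuffix(".js").replace("_", "-")
    return {
        "family": f"someli-api-{env}-{name}",
        "requiresCompatibilities": ["FARGATE"],
        "networkMode": "awsvpc",
        "cpu": cpu,
        "memory": memory,
        "executionRoleArn": execution_role_arn,
        "taskRoleArn": task_role_arn,
        "containerDefinitions": [
            {
                "name": "worker",
                "image": image,
                "essential": True,
                "command": ["node", script],
                "secrets": secrets,
                "environment": [{"name": "NODE_ENV", "value": "production"}],
                "logConfiguration": {
                    "logDriver": "awslogs",
                    "options": {
                        "awslogs-group": f"/ecs/someli-api-{env}-{name}",  # assumed naming
                        "awslogs-region": region,
                        "awslogs-stream-prefix": name,
                    },
                },
            }
        ],
    }


# Placeholder image and role ARNs, for illustration only.
definition = worker_task_definition(
    env="dev",
    image="123456789012.dkr.ecr.us-west-2.amazonaws.com/someli-api:abc1234",
    script="job_send_mail.js",
    execution_role_arn="arn:aws:iam::123456789012:role/example-execution-role",
    task_role_arn="arn:aws:iam::123456789012:role/example-task-role",
    secrets=[],
)
print(json.dumps(definition, indent=2))
```

Looping this function over the list of worker scripts yields one file per worker, each of which can be registered as-is since the output contains no comments or elisions.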

[VERIFY] Which workers should run as Fargate tasks vs. one consolidated PM2-inside-Fargate task? The cleanest pattern is one task per worker. The cheapest is one Fargate task that runs all workers under PM2 (matching the current EC2 PM2 pattern). Per-worker tasks scale better and isolate failures; PM2-in-Fargate is cheaper. Document the choice for this environment.

[VERIFY] Memory / CPU sizing per worker — Polotno-rendering workers need more memory than database-housekeeping workers. Establish per-job-type sizing.

[VERIFY] Which workers can be scheduled vs. always-on? Most poll continuously and should be always-on (desiredCount: 1+). Some are pure cron (e.g., daily-report style) and could run as ECS Scheduled Tasks instead, saving cost.

5.3 ECS service per task definition

For each task definition that should run continuously, create an ECS service:

Cluster:        someli-api-{env}
Service:        someli-api-{env}-api  (or worker name)
Launch type:    FARGATE
Desired count:  1 (more for the API tier in prod)
Network:        private subnets, task-SG only
Load balancer:  (API service only) Target group from §2.7
Auto scaling:   (API tier in prod) target tracking on CPU/memory

6. Database schema bootstrap

The application has no migration tool. Schema must be applied manually before the API can start.

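# From a host with MySQL client and network access to RDS.
# Use an account with DDL rights: someli_app as granted in §2.2 has DML only,
# so connect as the RDS admin user or apply the temporary GRANT ALL first.
mysql -h <rds-endpoint> -u <admin-user> -p someli < schema.sql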

[VERIFY] Where does schema.sql come from? This is the same gap noted in getting-started.md §7. Either:

- Dump from a known-good environment (mysqldump --no-data ...)
- A canonical schema.sql checked into the repo (does not exist today)
- A Confluence / wiki version

This is a precondition for new-environment bootstrap and must be answered before this doc is fully usable.

[VERIFY] Initial seed data — for non-empty environments (e.g., a new staging tenant), what reference data is needed? Plan tiers, default templates, seed admin user, etc.

[VERIFY] Migration discipline going forward — once a migration tool is adopted (Knex / Prisma / Flyway per the readiness roadmap), this section becomes "run yarn migrate before deploy."

7. DNS, certificates, OAuth callbacks

7.1 DNS

In Route 53, create the alias record for the new environment's hostname pointing at the ALB. Allow ~5 minutes for DNS propagation.

7.2 ACM certificate

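Either:

- Use a wildcard ACM cert that covers the new hostname. A wildcard matches exactly one label, so *.someli.ai covers dev-api.someli.ai but not {env}.api.someli.ai, which needs *.api.someli.ai.
- Issue a per-hostname cert via DNS validation.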

Attach to the ALB HTTPS listener.
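Whether a wildcard cert covers a planned hostname is easy to get wrong when the hostname has an extra label. The check can be sketched as follows; the function name is hypothetical and it models only the single-label wildcard rule, not full certificate validation.

```python
def cert_covers(cert_name: str, hostname: str) -> bool:
    """Return True if a certificate name (exact or wildcard) matches the hostname."""
    cert_labels = cert_name.lower().split(".")
    host_labels = hostname.lower().split(".")
    if cert_labels[0] != "*":
        return cert_labels == host_labels
    # The wildcard stands in for exactly one left-most label.
    return len(host_labels) == len(cert_labels) and host_labels[1:] == cert_labels[1:]


for host in ("dev-api.someli.ai", "dev.api.someli.ai"):
    print(host, cert_covers("*.someli.ai", host))
```

Running the check against the hostname chosen in §2.7 before requesting the cert avoids a failed TLS handshake on first deploy.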

7.3 OAuth callback URLs

For each OAuth provider in use, add the new environment's callback URL:

Google:    https://{env}.api.someli.ai/social/google/callback
Facebook:  https://{env}.api.someli.ai/social/facebook/callback
LinkedIn:  https://{env}.api.someli.ai/social/linkedin/callback
GitHub:    https://{env}.api.someli.ai/social/github/callback
TikTok:    https://{env}.api.someli.ai/social/tiktok/callback
Twitter:   https://{env}.api.someli.ai/social/twitter/callback

[VERIFY] Exact callback paths — confirm against routes/social.js for each provider.
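Once the paths are confirmed, the full set of URLs to register can be generated per environment so no provider is missed. A minimal sketch, assuming the /social/{provider}/callback pattern listed above holds for every provider (which is exactly what the [VERIFY] asks to confirm) and using a hypothetical hostname.

```python
PROVIDERS = ["google", "facebook", "linkedin", "github", "tiktok", "twitter"]


def callback_urls(hostname: str, providers: list[str] = PROVIDERS) -> dict[str, str]:
    """Build the OAuth callback URL to register with each provider for one environment."""
    return {p: f"https://{hostname}/social/{p}/callback" for p in providers}


# Hypothetical hostname following the suggested convention.
for provider, url in callback_urls("staging-api.someli.ai").items():
    print(f"{provider:10s} {url}")
```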

7.4 Paddle webhook URL

If billing is in scope for this environment, register the webhook URL with Paddle:

https://{env}.api.someli.ai/paddle/webhook

[VERIFY] Sandbox vs. production Paddle — does this environment use Paddle Sandbox or Production? Set PADDLE_ENV and the corresponding webhook key accordingly.

8. First-deploy verification

After standing up the environment, run these checks before announcing it:

Check	How
API task is Running	`aws ecs describe-services --cluster someli-api-{env} --services someli-api-{env}-api`
Target group is healthy	ALB target group health → expect "healthy" within ~60s of task start
`/health` returns 200 via the ALB	`curl https://{env}.api.someli.ai/health`
`/db-health` returns 200 via the ALB	`curl https://{env}.api.someli.ai/db-health`
Logs are flowing	CloudWatch Logs → `/ecs/someli-api-{env}` log group has lines
Workers are running	One task per worker in ECS, each with healthy log lines
MySQL has the schema	`mysql -h <rds> -u someli_app -p someli -e 'SHOW TABLES;'` returns ~140
S3 reachable from a task	Trigger a content-generation that writes to S3, verify object lands
Bedrock reachable	Trigger an AI-call endpoint, verify a response
Vertex AI reachable	Trigger a RAG-augmented call, verify response
Email sends	Enqueue a test email via the API; verify SendGrid delivers it
OAuth login works	Run through one provider's full login dance

[VERIFY] Post-deploy smoke-test script — would benefit from a single verify-env.sh that runs through this whole list automatically.
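As a starting point for such a script, the two HTTP checks from the table can be automated with the standard library. This sketch is written in Python rather than shell and covers only /health and /db-health; the ECS, MySQL, S3, and OAuth checks would still need to be added. The function names and the base URL in the note below are hypothetical.

```python
import urllib.error
import urllib.request


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status for a GET, or 0 if the host is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # server answered, but with a 4xx/5xx status
    except (urllib.error.URLError, OSError):
        return 0  # DNS failure, refused connection, timeout, TLS error


def check_endpoints(base_url: str, paths=("/health", "/db-health"), fetch=http_status) -> dict[str, bool]:
    """Map each path to True when it returns 200 via the ALB."""
    base = base_url.rstrip("/")
    return {path: fetch(base + path) == 200 for path in paths}
```

Calling check_endpoints("https://dev-api.someli.ai") returns a dict that a wrapper can turn into an exit code. The fetch parameter exists so the logic can be exercised without network access.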

9. Hand off / runbook

Before declaring the new environment "live":

Update deployment.md with this environment's specifics (hostname, ECR tag, ALB ARN, RDS endpoint).
Add the environment to whatever monitoring / alerting is configured.
Hand the credentials (Secrets Manager ARNs, IAM role names) to whoever will operate it.
Document the deployment process for ongoing updates to this environment (which is deployment.md's job, not this doc's).

10. Cost estimate (rough order of magnitude)

For a single environment running 1 API task + ~30 worker tasks on Fargate:

Component	Approximate monthly cost (us-west-2, on-demand)
Fargate compute (1 vCPU, 2 GB API + ~30 workers @ 0.25 vCPU / 0.5 GB)	$200–400
RDS db.m6g.large Multi-AZ + 100 GB storage	$250
ALB	$25
NAT Gateway (per AZ × 2 AZs)	$65
Data transfer	varies
S3 storage + requests	$20–100
CloudWatch Logs	$10–50
Secrets Manager	$1
ECR storage	$5
Approximate total	$600–900 per environment per month

[VERIFY] Real costs from AWS Cost Explorer — replace these estimates with actuals from existing environments once one or two months of data exist.

Production with higher traffic, a larger RDS instance, and more worker concurrency will likely be 3–5× this number. Multi-AZ RDS is already included in the estimate above.
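The Fargate line of the estimate can be reproduced with simple arithmetic. The per-hour rates below are assumptions (Linux/x86 on-demand prices of roughly $0.04048 per vCPU-hour and $0.004445 per GB-hour) and should be checked against the current AWS pricing page for us-west-2 before being relied on.

```python
HOURS_PER_MONTH = 730        # average hours in a month
VCPU_HOUR_USD = 0.04048      # assumed on-demand rate, verify before use
GB_HOUR_USD = 0.004445       # assumed on-demand rate, verify before use


def fargate_monthly_cost(vcpu: float, memory_gb: float, task_count: int = 1) -> float:
    """Monthly cost in USD for always-on Fargate tasks of one size."""
    hourly = vcpu * VCPU_HOUR_USD + memory_gb * GB_HOUR_USD
    return hourly * HOURS_PER_MONTH * task_count


api_cost = fargate_monthly_cost(vcpu=1, memory_gb=2)
worker_cost = fargate_monthly_cost(vcpu=0.25, memory_gb=0.5, task_count=30)
total_cost = api_cost + worker_cost
print(f"API: ${api_cost:.2f}  workers: ${worker_cost:.2f}  total: ${total_cost:.2f}")
```

With these assumed rates the total comes out a little over $300 per month, inside the $200–400 range quoted in the table; larger worker sizes move it quickly toward the upper bound.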

11. Common problems and fixes

Symptom	Likely cause	Fix
Task stuck "PROVISIONING" → "STOPPED" with exit code 1	Image fails to start (most often: missing env var)	Check CloudWatch Logs for the task; usually a `.env` key is missing
ALB health check fails	Task is up but `/health` is unreachable	Confirm task SG allows inbound from ALB SG; confirm port 80 in target group matches container port
`ER_ACCESS_DENIED_ERROR` in API logs	DB credentials wrong or RDS SG doesn't allow Fargate task SG	Verify Secrets Manager values and SG rules
Workers crash-loop on startup	Often: a missing AWS or GCP credential	Check the task's logs for the specific missing env var
Polotno renders fail with "browser failed to launch"	Chromium runtime libs missing in the image (shouldn't happen with the existing Dockerfile but verify)	Re-confirm the Dockerfile installs `libnss3 libexpat1 fontconfig`
Cross-region S3 calls slow	App is in us-west-2 but media bucket is in us-west-1	Either move bucket or accept the latency (current production has this trade-off)
OAuth callback returns "redirect_uri mismatch"	New environment's callback URL not registered with the provider	Add it in the provider's developer console
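Since missing configuration keys are behind several of the symptoms above, a preflight check that lists absent keys can shorten the debugging loop. This is a sketch: the key list contains only names mentioned in this document, the authoritative list is in configuration.md, and the function name is hypothetical.

```python
import os

# Keys named in this document; configuration.md holds the full list.
REQUIRED_KEYS = [
    "host", "user", "password", "database",
    "S3_Bucket_Name", "S3_Region", "S3_Bucket_Name2", "S3_Region2",
    "AWS_BEDROCK_KEY", "GCS_SECRET_NAME", "CLOUD_PROJECT_ID",
]


def missing_keys(env, required=REQUIRED_KEYS) -> list[str]:
    """Return the required keys that are absent or blank in the given mapping."""
    return [key for key in required if not str(env.get(key, "")).strip()]


if __name__ == "__main__":
    missing = missing_keys(os.environ)
    if missing:
        raise SystemExit("Missing configuration keys: " + ", ".join(missing))
    print("All required configuration keys are set.")
```

Run inside the container (for example through a one-off task) it reports every missing key at once instead of one crash per key.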

Appendix A: EC2 / Lightsail + PM2 (if you must)

If for some reason you need to stand up an environment on the older EC2 + nginx + PM2 pattern (matches dev / production today):

1. Provision an EC2 / Lightsail instance in the appropriate region. Ubuntu 22.04 minimum.
2. Install Node 20.18.1 via NVM (nvm install 20.18.1).
3. Install yarn (npm install -g yarn@1.22.22).
4. Install pm2 (npm install -g pm2).
5. Install nginx (apt-get install nginx).
6. Clone the repo to /home/ubuntu/someli-api.
7. Create the .env file with all required values (manual; no Secrets Manager integration in this pattern).
8. Run yarn install.
9. Bootstrap the schema as in §6.
10. Start the API: pm2 start server.js --name {env}_api --node-args="--max-old-space-size=2048".
11. Start the worker fleet: pm2 start ecosystem.config.js.
12. Persist PM2 across reboots: run pm2 startup, execute the sudo command it prints, then run pm2 save.
13. Configure nginx with the nginx.conf from the repo (proxy 80 → 3000).
14. Provision SSL (Let's Encrypt + Certbot, or terminate TLS at an ALB in front of the instance).
15. Create a Route 53 A record for the hostname pointing at the instance's static IP (an Elastic IP on EC2).

This pattern has none of the auto-scaling, isolated failure, or per-task observability of Fargate. It's documented here only because it matches existing environments.

[VERIFY] Should new environments be standardized on Fargate? If yes, deprecate Appendix A entirely. Recommended.

12. What to do when this doc is wrong

This document was scaffolded from codebase analysis. Several entries are marked [VERIFY] because the codebase doesn't fully tell the answer. Several others assume infrastructure choices that the team should confirm or adjust. If you stand up an environment and hit a gap or an instruction doesn't work:

Note the actual fix.
Update this doc in the same PR.
Remove the [VERIFY] marker once verified.

The first time someone successfully follows this doc end-to-end is the first time it earns its keep. Each revision after that gets faster.