First Deployment: Standing Up a New Environment
This document walks through deploying someli-api to a brand-new environment from zero — for example, when creating a new staging tenant, a new region, a recovered-after-disaster instance, or a previously undocumented production setup.
For ongoing day-to-day deploys to existing environments, see deployment.md. For local-machine setup, see getting-started.md.
[VERIFY] markers in this document indicate places where the codebase doesn't fully answer the question and the team needs to confirm. Fill them in once and they're done forever.
1. Decide the target topology
someli-api currently uses two production-shape topologies (per deployment.md):
| Topology | Where it's used today | When to use for a new env |
|---|---|---|
| EC2 / Lightsail + nginx + PM2 | Dev (Lightsail), production (assumed) | Simpler, lower per-month cost, good for low-to-moderate traffic |
| Fargate (one image, many tasks) | UAT (uapi.someli.ai) | Better for variable load, isolates worker types into separate tasks, native to ECS |
For a brand-new environment, Fargate is the recommended pattern — it's the more modern of the two, gives per-task scaling, and aligns with the §5.5 queue migration in the readiness roadmap. The EC2 + PM2 pattern is operationally heavier (manual provisioning, no auto-recovery beyond PM2's per-process restart, no per-job autoscaling).
This doc covers the Fargate pattern. If you're standing up a Lightsail / EC2 environment instead, see Appendix A.
2. AWS account prerequisites
The following must exist before any deploy can happen. Most are one-time setup per AWS account.
2.1 Networking
[VERIFY] VPC and subnets — does the team have an existing VPC layout for someli-api environments, or does each new environment get its own VPC? Recommended: one VPC per environment with public + private subnets in 2+ AZs.
Required networking primitives:
- VPC with public and private subnets in at least 2 availability zones
- Internet Gateway for the public subnets
- NAT Gateway in each AZ (for private-subnet egress to AWS APIs and the internet)
- Security groups:
  - One for the ALB (allows inbound 80/443 from anywhere)
  - One for Fargate tasks (allows inbound from the ALB SG only)
  - One for RDS (allows inbound 3306 from the Fargate task SG)
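The security-group chain above can be written down as data and sanity-checked before it becomes real rules. This is a minimal sketch: the group labels `alb`, `task`, and `rds` are hypothetical names, and the task port of 80 is an assumption taken from the container port in the API task definition in §5.1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IngressRule:
    group: str   # security group receiving the traffic
    port: int
    source: str  # a CIDR block, or the label of another security group


# Hypothetical labels for the three groups described above.
RULES = [
    IngressRule("alb", 80, "0.0.0.0/0"),
    IngressRule("alb", 443, "0.0.0.0/0"),
    IngressRule("task", 80, "alb"),    # assumed container port, see §5.1
    IngressRule("rds", 3306, "task"),
]


def sources_for(group: str) -> set[str]:
    """Return every source allowed to reach `group`."""
    return {rule.source for rule in RULES if rule.group == group}


def publicly_reachable(group: str) -> bool:
    """True if any rule opens `group` to a CIDR rather than to another group."""
    return any("/" in source for source in sources_for(group))
```

Only the ALB group should ever report as publicly reachable; if `publicly_reachable("task")` or `publicly_reachable("rds")` turns true, a rule has been opened too widely.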
2.2 RDS MySQL
Engine: MySQL 8.0 (or 5.7 to match existing — verify)
Instance type: db.t4g.medium for dev, db.m6g.large+ for prod
Multi-AZ: Yes for prod, optional for dev
Storage: gp3, 100 GB to start, autoscaling on
Backups: 7 day retention minimum
Encryption: KMS-encrypted (storage and snapshots)
Subnet group: private subnets only
[VERIFY] MySQL version — production today is on which exact MySQL version? Match it for consistency.
[VERIFY] Do existing environments use Multi-AZ? Prod should. Dev usually doesn't. Confirm before standing up.
After RDS is provisioned:
1. Connect from a bastion or jumpbox (RDS is in private subnets).
2. Create the application database and user:
CREATE DATABASE someli CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
CREATE USER 'someli_app'@'%' IDENTIFIED BY '<strong-password>';
GRANT SELECT, INSERT, UPDATE, DELETE, EXECUTE ON someli.* TO 'someli_app'@'%';
-- For schema bootstrapping you may need DDL temporarily:
-- GRANT ALL ON someli.* TO 'someli_app'@'%';
FLUSH PRIVILEGES;
2.3 S3 buckets
The application uses two S3 buckets in two regions:
| Bucket | Region | Purpose | Env var |
|---|---|---|---|
| Primary | us-west-2 | User uploads, generated designs, RAG documents | S3_Bucket_Name, S3_Region |
| Secondary | us-west-1 | Region-specific delivery / media library | S3_Bucket_Name2, S3_Region2 |
[VERIFY] Naming convention: what's the team's standard for bucket names per environment? Suggested: someli-{env}-primary-{region} and someli-{env}-media-{region}.
For each bucket:
- Versioning enabled (recommended)
- Server-side encryption (SSE-S3 or SSE-KMS)
- Bucket policy that allows the Fargate task IAM role to GetObject/PutObject/DeleteObject
- A lifecycle policy for cost control (e.g., move objects > 90 days to Standard-IA)
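The bucket policy item above could look roughly like the document this sketch builds. The function name, the `Sid`, and the example bucket and role ARN used in the usage note are all hypothetical; only the three object actions come from the list above.

```python
def bucket_policy(bucket: str, task_role_arn: str) -> dict:
    """Build a bucket policy granting object read/write/delete to one IAM role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowTaskRoleObjectAccess",  # hypothetical statement ID
                "Effect": "Allow",
                "Principal": {"AWS": task_role_arn},
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                # Object-level actions apply to keys inside the bucket,
                # so the resource is the bucket ARN plus "/*".
                "Resource": f"arn:aws:s3:::{bucket}/*",
            }
        ],
    }
```

For example, `json.dumps(bucket_policy("someli-dev-primary-us-west-2", "arn:aws:iam::123456789012:role/someli-dev-task"), indent=2)` produces text that can be pasted into the bucket's policy editor. The account ID and role name there are placeholders.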
[VERIFY] Are the prod buckets behind CloudFront? If yes, also provision a CloudFront distribution per bucket and document the distribution domain.
2.4 ECR repository
Region: us-west-2 (matches existing UAT — account 255061853867)
Repository: someli-api (or per-environment: someli-api-dev, someli-api-uat, ...)
Lifecycle: Keep last 50 images
[VERIFY] Single ECR repo or per-environment? UAT today uses
255061853867.dkr.ecr.us-west-2.amazonaws.com/uat. New environments should use a parallel naming scheme.
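The "keep last 50 images" lifecycle setting above corresponds to an ECR lifecycle policy along these lines. This is a sketch, not the team's existing policy: the function name and the description text are assumptions.

```python
def keep_last_n_policy(n: int = 50) -> dict:
    """Build an ECR lifecycle policy that expires all but the newest `n` images."""
    return {
        "rules": [
            {
                "rulePriority": 1,
                "description": f"Keep only the {n} most recent images",
                "selection": {
                    "tagStatus": "any",                 # tagged and untagged images
                    "countType": "imageCountMoreThan",
                    "countNumber": n,
                },
                "action": {"type": "expire"},
            }
        ]
    }
```

The JSON form of this dictionary is what `aws ecr put-lifecycle-policy --lifecycle-policy-text` expects. A count-based rule does not know which images are still referenced by a task definition, so choose `n` large enough that a rollback target is never expired.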
2.5 Secrets Manager
Per the security recommendations in §5.7, all secrets should live in AWS Secrets Manager rather than .env files baked into images.
Create one secret per environment:
Name: someli-{env}-config
Type: Other type of secret (key/value pairs)
Contents: All the .env keys (host, user, password, AWS_BEDROCK_KEY, etc.)
KMS key: a dedicated key per environment, recommended
The Fargate task definition pulls secrets from this at runtime via the secrets: block of the container definition.
[VERIFY] Does this team currently use Secrets Manager? If not, this is a Phase 0 step that will block the new-env bootstrap. Either fix the gap (recommended) or document a temporary workaround using .env baked into the image (not recommended).
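Because the secret holds one JSON key per .env entry, the `secrets` block of a container definition can be generated from the key list rather than typed by hand. A minimal sketch, assuming the key names match the environment variable names the application reads; the function name is hypothetical.

```python
def secrets_block(secret_arn: str, keys: list[str]) -> list[dict]:
    """Map each JSON key of one Secrets Manager secret to a container secret entry.

    The valueFrom format is <secret-arn>:<json-key>:<version-stage>:<version-id>;
    leaving the last two fields empty selects the current version.
    """
    return [{"name": key, "valueFrom": f"{secret_arn}:{key}::"} for key in keys]
```

Calling `secrets_block("<secrets-arn>", ["host", "user", "password", "database"])` yields entries in the same shape as the API task definition in §5.1.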
2.6 GCP service account (for Vertex AI / GCS)
Per rag-pipeline.md, GCP credentials are pulled from AWS Secrets Manager. So:
- Create a GCP service account in the GCP project (CLOUD_PROJECT_ID).
- Grant Vertex AI User + Storage Object Admin on the relevant buckets.
- Download the JSON key.
- Store the JSON key in AWS Secrets Manager under the name referenced by GCS_SECRET_NAME.
[VERIFY] GCP project organization — is there one shared GCP project across environments, or one per environment? Recommended: per-environment for blast-radius isolation.
2.7 ALB / Route 53
ALB: Application Load Balancer
Listeners: HTTPS:443 (with ACM cert) → forward to target group
Target group: IP-target-type, port 80 (the container port in §5.1; nginx inside the container proxies to Node on 3000), health check on /health, healthy threshold 2
Route 53: A record (alias) for {env}.api.someli.ai → ALB
ACM cert: *.someli.ai or per-host
[VERIFY] Hostname convention per environment: UAT uses uapi.someli.ai. Suggested:
- dev: dev-api.someli.ai
- staging: staging-api.someli.ai
- production: api.someli.ai
3. Build the container image
[VERIFY] Dockerfile fix required first. The current Dockerfile (per deployment.md §6.3) has no yarn install step. You must fix this before standing up any new environment. See the suggested multi-stage Dockerfile in deployment.md.
Once the Dockerfile installs dependencies properly, build and push:
# Authenticate to ECR
aws ecr get-login-password --region us-west-2 \
| docker login --username AWS --password-stdin \
<account-id>.dkr.ecr.us-west-2.amazonaws.com
# Build (with the SHA as a tag for traceability)
GIT_SHA=$(git rev-parse --short HEAD)
docker build \
--platform linux/amd64 \
-t someli-api:$GIT_SHA \
-t someli-api:latest \
.
# Tag for ECR
docker tag someli-api:$GIT_SHA \
<account-id>.dkr.ecr.us-west-2.amazonaws.com/someli-api:$GIT_SHA
docker tag someli-api:$GIT_SHA \
<account-id>.dkr.ecr.us-west-2.amazonaws.com/someli-api:{env}-latest
# Push
docker push <account-id>.dkr.ecr.us-west-2.amazonaws.com/someli-api:$GIT_SHA
docker push <account-id>.dkr.ecr.us-west-2.amazonaws.com/someli-api:{env}-latest
Use a SHA-tagged image (not just :latest) in the task definition — see §5.
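The two tags pushed above serve different purposes: the SHA tag pins an exact build, while `{env}-latest` moves with each push. A small sketch of how the two URIs are composed, with a hypothetical function name and placeholder arguments:

```python
def image_uris(account_id: str, region: str, repo: str, git_sha: str, env: str) -> dict[str, str]:
    """Return the pinned and moving ECR image URIs for one build."""
    registry = f"{account_id}.dkr.ecr.{region}.amazonaws.com"
    return {
        # Reference this one from task definitions: it never changes meaning.
        "pinned": f"{registry}/{repo}:{git_sha}",
        # Convenience pointer only: it is overwritten by the next push.
        "moving": f"{registry}/{repo}:{env}-latest",
    }
```

Pinning the task definition to the SHA tag means a rollback is just a redeploy of the previous task definition revision, with no image rebuild.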
If you're standing up an ARM-based environment (Graviton — recommended for the worker fleet for cost), build with --platform linux/arm64 instead. See the previous discussion of ARM portability for the dependency notes.
4. IAM roles for Fargate
Two IAM roles are needed:
4.1 Task execution role
Used by ECS itself to pull the image, fetch secrets, and write logs.
Trust: ecs-tasks.amazonaws.com
Policies: AmazonECSTaskExecutionRolePolicy
+ permission to read the Secrets Manager secret(s)
+ permission to KMS-decrypt (if KMS-encrypted secrets)
4.2 Task role
Used by the application code at runtime. This is what your Node.js code's AWS SDK will use.
Trust: ecs-tasks.amazonaws.com
Policies: S3 read/write to the two app buckets
+ Bedrock InvokeModel for the relevant models
+ Secrets Manager read for the GCP creds secret (so the app can fetch it dynamically)
+ KMS Decrypt where applicable
+ CloudWatch Logs PutLogEvents (if you're logging that way)
[VERIFY] Production IAM policies — what's the team's existing IAM policy structure? Reuse rather than reinvent.
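As a starting point for the two roles, the sketch below builds the shared trust policy and a partial task-role policy. It is an assumption-laden outline, not the team's existing IAM structure: the function names and `Sid` values are hypothetical, the Bedrock resource is left as `*` until the relevant model ARNs are known, and the KMS and CloudWatch Logs statements are omitted.

```python
def ecs_tasks_trust_policy() -> dict:
    """Trust policy letting ECS tasks assume the role (used by both roles)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "ecs-tasks.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def task_role_policy(buckets: list[str], gcp_secret_arn: str) -> dict:
    """Runtime permissions for the application code (partial sketch)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AppBucketObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": [f"arn:aws:s3:::{b}/*" for b in buckets],
            },
            {
                "Sid": "InvokeBedrock",
                "Effect": "Allow",
                "Action": "bedrock:InvokeModel",
                "Resource": "*",  # assumption: narrow to specific model ARNs
            },
            {
                "Sid": "ReadGcpCredentials",
                "Effect": "Allow",
                "Action": "secretsmanager:GetSecretValue",
                "Resource": gcp_secret_arn,
            },
        ],
    }
```

Keeping the GCP credentials secret on the task role, and the application config secret on the execution role, matches the split described in §4.1 and §4.2.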
5. Task definitions
The Fargate "one image, many tasks" pattern means defining multiple task definitions per environment, each pointing at the same ECR image with a different command override.
5.1 API task definition
{
"family": "someli-api-{env}-api",
"requiresCompatibilities": ["FARGATE"],
"networkMode": "awsvpc",
"cpu": "1024",
"memory": "2048",
"executionRoleArn": "<task-execution-role-arn>",
"taskRoleArn": "<task-role-arn>",
"containerDefinitions": [
{
"name": "api",
"image": "<ecr>/someli-api:<sha>",
"essential": true,
"command": ["/entrypoint.sh"], // existing entrypoint runs nginx + node server.js
"portMappings": [
{ "containerPort": 80, "protocol": "tcp" }
],
"secrets": [
{ "name": "host", "valueFrom": "<secrets-arn>:host::" },
{ "name": "user", "valueFrom": "<secrets-arn>:user::" },
{ "name": "password", "valueFrom": "<secrets-arn>:password::" },
{ "name": "database", "valueFrom": "<secrets-arn>:database::" },
// ... all the other env vars from configuration.md
],
"environment": [
{ "name": "NODE_ENV", "value": "production" }
],
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "/ecs/someli-api-{env}",
"awslogs-region": "us-west-2",
"awslogs-stream-prefix": "api"
}
},
"healthCheck": {
"command": ["CMD-SHELL", "curl -f http://localhost/health || exit 1"],
"interval": 30, "timeout": 5, "retries": 3, "startPeriod": 60
}
}
]
}
5.2 Worker task definitions
For each worker that should run continuously, define a task with the same image but a different command. Example for the email-send worker:
{
"family": "someli-api-{env}-job-send-mail",
"...": "...",
"containerDefinitions": [
{
"name": "worker",
"image": "<ecr>/someli-api:<sha>",
"command": ["node", "job_send_mail.js"],
"secrets": [/* same as API */],
"environment": [
{ "name": "NODE_ENV", "value": "production" }
],
"logConfiguration": { /* per-worker log group */ }
}
]
}
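With around 30 workers, writing each task definition by hand invites drift. The sketch below generates one per worker script, following the shape of the examples above. The function name is hypothetical, the default 0.25 vCPU / 0.5 GB sizing is borrowed from the cost table in §10 rather than measured, and the per-worker log group name is an assumption.

```python
def worker_task_definition(
    env: str,
    image: str,
    script: str,
    secrets: list[dict],
    execution_role_arn: str,
    task_role_arn: str,
    cpu: str = "256",      # 0.25 vCPU, assumed default
    memory: str = "512",   # 0.5 GB, assumed default
) -> dict:
    """Build a Fargate task definition that runs one worker script."""
    # job_send_mail.js -> job-send-mail
    name = script.removesuffix(".js").replace("_", "-")
    return {
        "family": f"someli-api-{env}-{name}",
        "requiresCompatibilities": ["FARGATE"],
        "networkMode": "awsvpc",
        "cpu": cpu,
        "memory": memory,
        "executionRoleArn": execution_role_arn,
        "taskRoleArn": task_role_arn,
        "containerDefinitions": [
            {
                "name": "worker",
                "image": image,
                "essential": True,
                "command": ["node", script],
                "secrets": secrets,
                "environment": [{"name": "NODE_ENV", "value": "production"}],
                "logConfiguration": {
                    "logDriver": "awslogs",
                    "options": {
                        # Assumed naming for the per-worker log group.
                        "awslogs-group": f"/ecs/someli-api-{env}-{name}",
                        "awslogs-region": "us-west-2",
                        "awslogs-stream-prefix": name,
                    },
                },
            }
        ],
    }
```

Looping this over the worker script list and writing each result to a JSON file gives inputs for `aws ecs register-task-definition --cli-input-json`. Memory-heavy workers such as the Polotno renderers would override `cpu` and `memory`.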
[VERIFY] Which workers should run as Fargate tasks vs. one consolidated PM2-inside-Fargate task? The cleanest pattern is one task per worker. The cheapest is one Fargate task that runs all workers under PM2 (matching the current EC2 PM2 pattern). Per-worker tasks scale better and isolate failures; PM2-in-Fargate is cheaper. Document the choice for this environment.
[VERIFY] Memory / CPU sizing per worker — Polotno-rendering workers need more memory than database-housekeeping workers. Establish per-job-type sizing.
[VERIFY] Which workers can be scheduled vs. always-on? Most poll continuously and should be always-on (
desiredCount: 1+). Some are pure cron (e.g., daily-report style) and could run as ECS Scheduled Tasks instead, saving cost.
5.3 ECS service per task definition
For each task definition that should run continuously, create an ECS service:
Cluster: someli-api-{env}
Service: someli-api-{env}-api (or worker name)
Launch type: FARGATE
Desired count: 1 (more for the API tier in prod)
Network: private subnets, task-SG only
Load balancer: (API service only) Target group from §2.7
Auto scaling: (API tier in prod) target tracking on CPU/memory
6. Database schema bootstrap
The application has no migration tool. Schema must be applied manually before the API can start.
# From a host with MySQL client and network access to RDS:
mysql -h <rds-endpoint> -u someli_app -p someli < schema.sql
[VERIFY] Where does schema.sql come from? This is the same gap noted in getting-started.md §7. Either:
- Dump from a known-good environment (mysqldump --no-data ...)
- A canonical schema.sql checked into the repo (does not exist today)
- A Confluence / wiki version

This is a precondition for new-environment bootstrap and must be answered before this doc is fully usable.
[VERIFY] Initial seed data — for non-empty environments (e.g., a new staging tenant), what reference data is needed? Plan tiers, default templates, seed admin user, etc.
[VERIFY] Migration discipline going forward: once a migration tool is adopted (Knex / Prisma / Flyway per the readiness roadmap), this section becomes "run yarn migrate before deploy."
7. DNS, certificates, OAuth callbacks
7.1 DNS
In Route 53, create the alias record for the new environment's hostname pointing at the ALB. Allow ~5 minutes for DNS propagation.
7.2 ACM certificate
Either:
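- Use a wildcard *.someli.ai ACM cert if it covers the new hostname. A wildcard matches exactly one DNS label, so *.someli.ai covers dev-api.someli.ai but not {env}.api.someli.ai; the latter needs *.api.someli.ai or a per-hostname cert.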
- Issue a per-hostname cert via DNS validation.
Attach to the ALB HTTPS listener.
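Whether a wildcard cert covers a hostname depends on label count, which matters here because this document uses both {env}.api.someli.ai and dev-api.someli.ai style names. The sketch below is a simplified matcher (hypothetical function name, leftmost-label wildcards only) for checking a planned hostname before requesting a cert.

```python
def wildcard_covers(cert_name: str, hostname: str) -> bool:
    """Check whether a certificate name covers a hostname.

    A leading "*" stands in for exactly one DNS label; it never spans a dot.
    """
    cert_labels = cert_name.lower().split(".")
    host_labels = hostname.lower().split(".")
    if cert_labels[0] == "*":
        return (
            len(cert_labels) == len(host_labels)
            and cert_labels[1:] == host_labels[1:]
        )
    return cert_labels == host_labels
```

With this, `wildcard_covers("*.someli.ai", "dev-api.someli.ai")` is true while `wildcard_covers("*.someli.ai", "dev.api.someli.ai")` is false, which is the case to watch for if the {env}.api.someli.ai pattern is adopted.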
7.3 OAuth callback URLs
For each OAuth provider in use, add the new environment's callback URL:
Google: https://{env}.api.someli.ai/social/google/callback
Facebook: https://{env}.api.someli.ai/social/facebook/callback
LinkedIn: https://{env}.api.someli.ai/social/linkedin/callback
GitHub: https://{env}.api.someli.ai/social/github/callback
TikTok: https://{env}.api.someli.ai/social/tiktok/callback
Twitter: https://{env}.api.someli.ai/social/twitter/callback
[VERIFY] Exact callback paths: confirm against routes/social.js for each provider.
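Generating the callback list from the API hostname avoids typos when registering six providers by hand. This sketch assumes the /social/{provider}/callback pattern shown above, which is itself still unverified against routes/social.js; the function name is hypothetical.

```python
PROVIDERS = ["google", "facebook", "linkedin", "github", "tiktok", "twitter"]


def callback_urls(api_host: str) -> dict[str, str]:
    """Return the OAuth callback URL to register with each provider."""
    # Path pattern is an assumption taken from the list above.
    return {p: f"https://{api_host}/social/{p}/callback" for p in PROVIDERS}
```

For a dev environment following the suggested hostname convention, `callback_urls("dev-api.someli.ai")` gives the six URLs to paste into each provider's developer console.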
7.4 Paddle webhook URL
If billing is in scope for this environment, register the webhook URL with Paddle:
[VERIFY] Sandbox vs. production Paddle: does this environment use Paddle Sandbox or Production? Set PADDLE_ENV and the corresponding webhook key accordingly.
8. First-deploy verification
After standing up the environment, run these checks before announcing it:
| Check | How |
|---|---|
| API task is Running | aws ecs describe-services --cluster someli-api-{env} --services someli-api-{env}-api |
| Target group is healthy | ALB target group health → expect "healthy" within ~60s of task start |
| /health returns 200 via the ALB | curl https://{env}.api.someli.ai/health |
| /db-health returns 200 via the ALB | curl https://{env}.api.someli.ai/db-health |
| Logs are flowing | CloudWatch Logs → /ecs/someli-api-{env} log group has lines |
| Workers are running | One task per worker in ECS, each with healthy log lines |
| MySQL has the schema | mysql -h <rds> -u someli_app -p someli -e 'SHOW TABLES;' returns ~140 |
| S3 reachable from a task | Trigger a content-generation that writes to S3, verify object lands |
| Bedrock reachable | Trigger an AI-call endpoint, verify a response |
| Vertex AI reachable | Trigger a RAG-augmented call, verify response |
| Email sends | Enqueue a test email via the API; verify SendGrid delivers it |
| OAuth login works | Run through one provider's full login dance |
[VERIFY] Post-deploy smoke-test script: this list would benefit from a single verify-env.sh that runs through it automatically.
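As a first step toward that script, the two HTTP checks from the table can be automated with the standard library alone. This is a partial sketch: it covers only /health and /db-health, the function names are hypothetical, and the remaining checks (ECS, S3, Bedrock, email, OAuth) would still need to be added.

```python
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status of a GET, or 0 if the request could not complete."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # server answered, but with a 4xx/5xx status
    except OSError:
        return 0         # DNS failure, refused connection, timeout, TLS error


def check_endpoints(
    base_url: str, fetch: Callable[[str], int] = http_status
) -> dict[str, bool]:
    """Report which health endpoints return 200 behind the given base URL."""
    paths = ["/health", "/db-health"]
    return {path: fetch(base_url.rstrip("/") + path) == 200 for path in paths}
```

Running `check_endpoints("https://dev-api.someli.ai")` returns a path-to-boolean map; any false value should block the announcement. The `fetch` parameter exists so the logic can be exercised without network access.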
9. Hand off / runbook
Before declaring the new environment "live":
- Update deployment.md with this environment's specifics (hostname, ECR tag, ALB ARN, RDS endpoint).
- Add the environment to whatever monitoring / alerting is configured.
- Hand the credentials (Secrets Manager ARNs, IAM role names) to whoever will operate it.
- Document the deployment process for ongoing updates to this environment (which is deployment.md's job, not this doc's).
10. Cost estimate (rough order of magnitude)
For a single environment running 1 API task + ~30 worker tasks on Fargate:
| Component | Approximate monthly cost (us-west-2, on-demand) |
|---|---|
| Fargate compute (1 vCPU, 2 GB API + ~30 workers @ 0.25 vCPU / 0.5 GB) | $200–400 |
| RDS db.m6g.large Multi-AZ + 100 GB storage | $250 |
| ALB | $25 |
| NAT Gateway (per AZ × 2 AZs) | $65 |
| Data transfer | varies |
| S3 storage + requests | $20–100 |
| CloudWatch Logs | $10–50 |
| Secrets Manager | $1 |
| ECR storage | $5 |
| Approximate total | $600–900 per environment per month |
[VERIFY] Real costs from AWS Cost Explorer — replace these estimates with actuals from existing environments once one or two months of data exist.
Production with higher traffic, multi-AZ, larger RDS, more worker concurrency will be 3–5× this number.
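The total in the table is a rounded sum of the component ranges, which the sketch below reproduces so the estimate can be re-run when actuals replace the placeholders. All figures are copied from the table; data transfer is excluded because the table lists it as "varies".

```python
# (low, high) monthly USD estimates copied from the cost table above.
COMPONENTS = {
    "Fargate compute": (200, 400),
    "RDS": (250, 250),
    "ALB": (25, 25),
    "NAT Gateway": (65, 65),
    "S3": (20, 100),
    "CloudWatch Logs": (10, 50),
    "Secrets Manager": (1, 1),
    "ECR storage": (5, 5),
}


def total_range(components: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Sum the low and high ends of every component estimate."""
    low = sum(lo for lo, _ in components.values())
    high = sum(hi for _, hi in components.values())
    return low, high


def production_range(base: tuple[int, int]) -> tuple[int, int]:
    """Apply the 3x (low end) to 5x (high end) production multiplier."""
    return base[0] * 3, base[1] * 5
```

The component sum comes to $576–896 per month, which the table rounds to $600–900. Applying the 3–5× multiplier puts a production environment at roughly $1,700–4,500 per month before data transfer.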
11. Common problems and fixes
| Symptom | Likely cause | Fix |
|---|---|---|
| Task stuck "PROVISIONING" → "STOPPED" with exit code 1 | Image fails to start (most often: missing env var) | Check CloudWatch Logs for the task; usually a .env key is missing |
| ALB health check fails | Task is up but /health is unreachable | Confirm task SG allows inbound from ALB SG; confirm port 80 in target group matches container port |
| ER_ACCESS_DENIED_ERROR in API logs | DB credentials wrong or RDS SG doesn't allow Fargate task SG | Verify Secrets Manager values and SG rules |
| Workers crash-loop on startup | Often: a missing AWS or GCP credential | Check the task's logs for the specific missing env var |
| Polotno renders fail with "browser failed to launch" | Chromium runtime libs missing in the image (shouldn't happen with the existing Dockerfile but verify) | Re-confirm the Dockerfile installs libnss3 libexpat1 fontconfig |
| Cross-region S3 calls slow | App is in us-west-2 but media bucket is in us-west-1 | Either move bucket or accept the latency (current production has this trade-off) |
| OAuth callback returns "redirect_uri mismatch" | New environment's callback URL not registered with the provider | Add it in the provider's developer console |
Appendix A: EC2 / Lightsail + PM2 (if you must)
If for some reason you need to stand up an environment on the older EC2 + nginx + PM2 pattern (matches dev / production today):
- Provision an EC2 / Lightsail instance in the appropriate region. Ubuntu 22.04 minimum.
- Install Node 20.18.1 via NVM (nvm install 20.18.1).
- Install yarn (npm install -g yarn@1.22.22).
- Install pm2 (npm install -g pm2).
- Install nginx (apt-get install nginx).
- Clone the repo to /home/ubuntu/someli-api.
- Create the .env file with all required values (manual; no Secrets Manager integration in this pattern).
- Run yarn install.
- Bootstrap the schema as in §6.
- Start the API: pm2 start server.js --name {env}_api --node-args="--max-old-space-size=2048".
- Start the worker fleet: pm2 start ecosystem.config.js.
- Save the PM2 state: pm2 save && pm2 startup.
- Configure nginx with the nginx.conf from the repo (proxy 80 → 3000).
- Provision SSL (Let's Encrypt + Certbot, or terminate at an ALB upstream).
- Add the host's IP to Route 53.
This pattern has none of the auto-scaling, isolated failure, or per-task observability of Fargate. It's documented here only because it matches existing environments.
[VERIFY] Should new environments be standardized on Fargate? If yes, deprecate Appendix A entirely. Recommended.
12. What to do when this doc is wrong
This document was scaffolded from codebase analysis. Several entries are marked [VERIFY] because the codebase doesn't fully tell the answer. Several others assume infrastructure choices that the team should confirm or adjust. If you stand up an environment and hit a gap or an instruction doesn't work:
- Note the actual fix.
- Update this doc in the same PR.
- Remove the [VERIFY] marker once verified.
The first time someone successfully follows this doc end-to-end is the first time it earns its keep. Each revision after that gets faster.