Observability¶
Effectively none, same as sibling backends.
What exists¶
- console.log / console.error everywhere
- One job logs "Starting Job..." with new Date() to mark cron ticks (job_check_color_with_json.js)
- Slack notifications for missing content (teamsnotification.js)
What does not exist¶
- No structured logging
- No error tracker (Sentry, Bugsnag)
- No tracing / metrics
- No /health / /ready endpoint
- No DB connection pool metrics
- No OpenAI usage tracking (cost visibility)
- No Polotno render duration tracking
Specific concerns¶
- 57 cron jobs × console.log("\\n Starting Job... ", new Date()); = a noisy log stream. With no structure, finding the right job's failure requires grep-by-filename.
- No correlation between routes and jobs. A user-reported issue in the FE that turns out to be a stale-content problem requires reading both routes/routes.js logs and several job_*.js logs to piece together.
- OpenAI cost is invisible. The bots burn OpenAI tokens hourly; no per-bot cost reporting. Adding a tApiCall log table would close this gap and pay for itself in a month if it surfaces a bot calling wastefully.
- Polotno render duration is invisible. If renders slow down (Chrome memory leak, network slow to S3), no one knows until the queue backs up.
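A minimal sketch of the tApiCall idea, assuming the bots use the OpenAI Chat Completions API, whose responses carry a usage object with prompt_tokens, completion_tokens and total_tokens. The column names, the botName argument and the injected callOpenAI / insertRow callbacks are hypothetical; the real table schema and DB client are not described here.

```javascript
// Build a row for a hypothetical tApiCall table from an OpenAI response.
// Column names are illustrative assumptions, not an existing schema.
function buildApiCallRow(botName, model, response, calledAt = new Date()) {
  const usage = (response && response.usage) || {};
  return {
    bot_name: botName,
    model,
    prompt_tokens: usage.prompt_tokens ?? 0,
    completion_tokens: usage.completion_tokens ?? 0,
    total_tokens: usage.total_tokens ?? 0,
    called_at: calledAt.toISOString(),
  };
}

// Wrap an OpenAI call so its usage is recorded once it resolves.
// `callOpenAI` and `insertRow` are injected (assumptions). A failure to
// record usage is reported but never breaks the bot itself.
async function trackedCall(botName, model, callOpenAI, insertRow) {
  const response = await callOpenAI();
  try {
    await insertRow(buildApiCallRow(botName, model, response));
  } catch (err) {
    console.error(`[${botName}] failed to record API usage:`, err.message);
  }
  return response;
}

// Sum total tokens per bot: the starting point for a per-bot cost report.
function tokensPerBot(rows) {
  const totals = {};
  for (const row of rows) {
    totals[row.bot_name] = (totals[row.bot_name] || 0) + row.total_tokens;
  }
  return totals;
}
```

Keeping the row builder separate from the persistence call means the same function can feed a DB table now and a metrics exporter later.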
Minimal improvements¶
In order of cost:
- Add /health endpoint (1-liner)
- Add morgan access log to server.js (1 line)
- Add structured logging via pino: replace console.log with logger.info({...})
- Add request-id middleware for log correlation
- Tag every job's log lines with [<job-name>] prefix: minimal change, big readability win
- Add tJobRun audit table: every cron tick inserts a row with start, end, rows-processed, errors
- Track Polotno render duration: wrap instance.render(...) in a timer; warn on > 10s
- Track OpenAI token usage: every OpenAI call inserts a row in tApiCall
- Sentry integration for unhandled errors
- Prometheus metrics exposed at /metrics (if a Grafana board exists)
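The render-duration item could look roughly like the following. This is a sketch using only the Node.js standard library; the timed helper, its option names and the JSON line shape are assumptions for illustration, and the log argument is injectable so console can be swapped for a pino logger later.

```javascript
const { performance } = require("node:perf_hooks");

// Run an async function, measure its wall-clock duration, and warn when it
// exceeds `warnMs` (10 s by default, matching the threshold in the list).
// The duration is logged whether `fn` resolves or rejects.
async function timed(label, fn, { warnMs = 10_000, log = console } = {}) {
  const start = performance.now();
  try {
    return await fn();
  } finally {
    const durationMs = Math.round(performance.now() - start);
    const line = { event: "timed", label, durationMs };
    if (durationMs > warnMs) {
      log.warn(JSON.stringify({ ...line, slow: true }));
    } else {
      log.info(JSON.stringify(line));
    }
  }
}

// Hypothetical usage around the render call mentioned above:
//   const result = await timed("polotno-render", () => instance.render(...));
```

Because the measurement sits in a finally block, failed renders are timed too, which helps distinguish slow failures from fast ones.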
Recommendations¶
Same as sibling backends. The biggest leverage here is tagging logs with job names and adding a tJobRun audit table — both are cheap and immediately useful for the content team.
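The two recommendations can be sketched together as a small wrapper around each cron tick. Everything named here is an assumption for illustration: jobLogger, runJob, the tJobRun column names, and the injected insertJobRun callback standing in for whatever DB client the service uses.

```javascript
// Create a logger that prefixes every line with the job name, so one job's
// output can be grepped out of a shared log stream.
function jobLogger(jobName, sink = console) {
  const tag = `[${jobName}]`;
  return {
    info: (...args) => sink.log(tag, ...args),
    error: (...args) => sink.error(tag, ...args),
  };
}

// Run one cron tick and record it. `work(log)` should resolve to the number
// of rows processed. `insertJobRun` persists the audit row (hypothetical).
async function runJob(jobName, work, insertJobRun, sink = console) {
  const log = jobLogger(jobName, sink);
  const run = {
    job_name: jobName,
    started_at: new Date(),
    ended_at: null,
    rows_processed: 0,
    error: null,
  };
  log.info("Starting Job...");
  try {
    run.rows_processed = (await work(log)) ?? 0;
  } catch (err) {
    // Record the failure instead of letting it vanish in the log stream.
    run.error = String(err && err.message ? err.message : err);
    log.error("Job failed:", run.error);
  } finally {
    run.ended_at = new Date();
    try {
      await insertJobRun(run);
    } catch (err) {
      log.error("Could not write tJobRun row:", err.message);
    }
  }
  return run;
}
```

A job file would then reduce to a call such as runJob("job_check_color_with_json", async (log) => { /* existing job body */ return rowsProcessed; }, insertJobRun), leaving the existing job logic untouched.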
Cross-component implication¶
Whatever observability stack is adopted platform-wide should support mixed Node.js + PM2 + bots/jobs. Sentry, OpenTelemetry, and Pino all do this. The pattern would be:
- Pick the stack (recommend Sentry for errors + Pino for logs)
- Implement in one of the smaller services first (e.g., someli-dashboard-be or Someli-admin-api)
- Roll out to designer-api next (high benefit because of the cron-job sprawl)
- Roll out to someli-api last (biggest blast radius)