Replace the fragile in-memory buffer with a durable outbox, and let Sequin stream it to PCM and to your customers' webhooks.
Today regulated evidence is buffered in process memory and submitted by a dev script on a cron. The target: producers write durable semantic events to an outbox; Sequin change-data-captures that one table and delivers it — to NATS for the PCM/OPIN workers, and to customer webhook endpoints directly. One backbone, run the same everywhere.
Two problems, both structural. Capture buffers regulated evidence in an in-memory queue that drops on overflow and vanishes on crash. And the part that submits to the regulator is a dev-tooling script run by a cron — not a service.
A bounded queue with drop-oldest sheds regulated evidence under load — and counts it as captured, so the SLA ratio silently slips.
Events live in process memory until drained. A pod restart or deploy mid-window is unrecoverable.
The pipeline submits to OFB and nothing else. Webhooks, OPIN, analytics — every new consumer is a rebuild.
The production regulatory submitter is a category:'dev' script ticked by a cron — no health, no continuity.
PCM is not a feature on one route. The regulators require PCM metrics for essentially every regulated API — so capture has to be ambient across the whole surface. That scale is the whole argument: it decides where evidence can safely live, and it's exactly where an in-memory queue fails hardest.
Why a shared backbone wins at this scale
A single edge middleware instruments all 131 endpoints — and OPIN's — at once. A new route is reportable the moment it ships; no per-endpoint PCM wiring to remember.
OPIN is an adapter plus its own subjects. The capture boundary and the backbone are untouched. Scaling the obligation is additive.
The same captured event feeds PCM submission, customer webhooks and analytics. You capture once and Sequin routes to many sinks; you never re-instrument per consumer.
Funnel 131+ endpoints at high request rate through one bounded in-memory queue per pod, and every routine operation becomes a silent data-loss event.
Rolling deploys and autoscale-downs happen several times a week. Each terminating pod's in-flight queue is discarded — routine operations cause routine, silent under-reporting.
131 endpoints share a single 10k-cap queue per pod. A burst overflows it; drop-oldest sheds the SLA-closest evidence first.
A dropped event is still counted as captured. Dashboards stay green until the regulator's pairing report exposes the gap.
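The failure mode above can be reproduced in a few lines. This is a minimal sketch, not the actual capture code: `DropOldestQueue`, its counters and the tiny capacity are all hypothetical, chosen only to show how a drop-oldest buffer keeps the "captured" count rising while evidence disappears.

```typescript
// Hypothetical stand-in for the per-pod capture buffer described above.
class DropOldestQueue<T> {
  private items: T[] = [];
  private readonly capacity: number;
  capturedCount = 0; // what a dashboard would report
  droppedCount = 0;  // what goes unnoticed if nobody exports it

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  push(item: T): void {
    if (this.items.length >= this.capacity) {
      this.items.shift(); // the oldest evidence is shed first
      this.droppedCount++;
    }
    this.items.push(item);
    this.capturedCount++; // still counted, even though an older event was just lost
  }

  drain(): T[] {
    const drained = this.items;
    this.items = [];
    return drained;
  }
}

// A burst of 5 events into a queue that holds 3.
const burstQueue = new DropOldestQueue<number>(3);
for (let i = 1; i <= 5; i++) burstQueue.push(i);
const deliveredAfterBurst = burstQueue.drain();

console.log(burstQueue.capturedCount, deliveredAfterBurst); // 5 [ 3, 4, 5 ]
```

Five events are reported as captured, three are ever submitted, and the two that vanished are the oldest ones.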
Keep the durable store and the crash-safe submit logic. Delete the in-memory queue. Producers write a semantic event durably to an outbox; Sequin change-data-captures that one table and delivers it to its sinks. Nothing is buffered in memory; nothing is tied to a single consumer; and the relay is a configured product, not code you maintain.
Producers write a semantic event (claim.approved) to the outbox, never a raw row change. Sequin CDCs the outbox, never the business tables. That event is the versioned, stable contract every sink reads. Storage stays private; the event is the API. You can evolve the database without breaking a single customer webhook or regulator mapping.
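One way to picture "storage stays private; the event is the API" is a mapping function between the two. The sketch below is illustrative only: the envelope fields, `ClaimRow`, `ClaimApprovedV1` and `toClaimApproved` are assumptions, not a confirmed schema. Only the event name `claim.approved` and the tenant/ecosystem tagging come from this document.

```typescript
// Assumed envelope shape for an outbox row; field names are hypothetical.
interface OutboxEvent<P> {
  id: string;         // idempotency key every consumer dedups on
  type: string;       // semantic name, e.g. "claim.approved"
  version: number;    // version of the payload contract
  tenantId: string;
  ecosystem: string;
  occurredAt: string; // ISO-8601 timestamp
  payload: P;
}

// The public contract: only what consumers are allowed to depend on.
interface ClaimApprovedV1 {
  claimId: string;
  approvedBy: string;
}

// A private storage row (hypothetical columns). It can change freely.
interface ClaimRow {
  id: string;
  tenant_id: string;
  ecosystem: string;
  status_code: number;
  approver_user_id: string;
  approved_at: string;
  internal_risk_score: number; // never leaves the database
}

// Producers translate storage into the contract inside their transaction.
function toClaimApproved(row: ClaimRow, eventId: string): OutboxEvent<ClaimApprovedV1> {
  return {
    id: eventId,
    type: "claim.approved",
    version: 1,
    tenantId: row.tenant_id,
    ecosystem: row.ecosystem,
    occurredAt: row.approved_at,
    payload: { claimId: row.id, approvedBy: row.approver_user_id },
  };
}

const sampleClaimRow: ClaimRow = {
  id: "claim-42",
  tenant_id: "tenant-a",
  ecosystem: "OFB",
  status_code: 3,
  approver_user_id: "user-7",
  approved_at: "2025-01-15T10:00:00Z",
  internal_risk_score: 0.87,
};
const sampleClaimEvent = toClaimApproved(sampleClaimRow, "evt-0001");
console.log(JSON.stringify(sampleClaimEvent));
```

Renaming `status_code` or dropping `internal_risk_score` changes nothing a sink can observe; changing the payload shape means publishing a new `version`.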
Four moves: producers write durable semantic events to the outbox; Sequin CDCs that one table; it fans the stream to a NATS sink (for the PCM/OPIN workers) and a webhook sink (straight to customer endpoints). The only custom code is the regulator brain in the workers.
Domain logic writes a semantic event in its transaction; PCM capture records interactions. One durable write — no dual-write, no in-memory hop.
Sequin CDCs the outbox (one slot, that table only) and fans it out. It replaces the relay you'd otherwise build and maintain.
NATS feeds the worker pool (durable, replayable). The webhook sink delivers straight to customers with retry/DLQ/ordering built in.
PCM/OPIN workers do the regulator-specific work — batch, mTLS, JWS, receipts, SLA. That's the one piece that is genuinely yours.
Two flows on the same outbox. A PCM report takes the NATS path to a worker that talks to the regulator; a customer webhook is delivered by Sequin's webhook sink directly. Then: how one event fans out to both, and the delivery lifecycle behind each.
"Outbox vs Sequin" was a false choice: the outbox is the source of truth, Sequin is the transport that reads it. Checked against current docs, Sequin's delivery model is exactly the one we want — so it replaces the relay we'd otherwise hand-build, and most of webhook delivery too.
Verified via context7 · /sequinstream/sequin
| Capability | What Sequin provides | Fit |
|---|---|---|
| Sinks | NATS · Kafka · GCP Pub/Sub · Webhook/HTTP · more | fits |
| Delivery guarantee | at-least-once + idempotency keys (exactly-once isn't always possible) | exactly our model |
| Retries | exponential backoff up to 3 min; configurable max-retry → discard / DLQ | built in |
| Ordering | per-group serial (default = row PK), ordered by commit timestamp | per-entity order |
| Batching · filters · transforms · routing · backfill | all configurable | yes |
| Replication slot | required (pgoutput); docs warn "slots accumulate data if not in use" | manage it |
CDC the outbox → NATS (for the workers) and → customer webhooks directly. Retries, backoff, dead-letter, ordering and idempotency come configured. The hand-built NOTIFY relay and a bespoke webhook worker both disappear.
Sequin can't be the PCM submitter. That job means batching to ≤5000/10 MB, mTLS client_credentials token rotation, OPIN JWS PS256 signing, per-report receipts, 207/429 semantics, crash-safe posting, and D+1/D+7 reconcile. Sequin feeds the worker; it can't be the worker.
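As an illustration of the first item in that list, here is a minimal sketch of splitting serialized reports into batches that respect both caps. `splitIntoBatches` is a hypothetical helper. It assumes each report is already a JSON string, that a batch is sent as a JSON array, and that "10 MB" is counted over the encoded body; the regulator's exact accounting should be checked.

```typescript
// Hypothetical batching helper: bounded by report count and by encoded bytes.
function splitIntoBatches(
  serializedReports: string[],
  maxCount: number,
  maxBytes: number,
): string[][] {
  const encoder = new TextEncoder();
  const batches: string[][] = [];
  let current: string[] = [];
  let currentBytes = 2; // the enclosing "[" and "]"

  for (const report of serializedReports) {
    const size = encoder.encode(report).length;
    if (size + 2 > maxBytes) {
      throw new Error("a single report exceeds the byte cap");
    }
    const separator = current.length > 0 ? 1 : 0; // the "," between reports
    const overflows =
      current.length + 1 > maxCount || currentBytes + separator + size > maxBytes;
    if (overflows) {
      batches.push(current);
      current = [];
      currentBytes = 2;
    }
    currentBytes += (current.length > 0 ? 1 : 0) + size;
    current.push(report);
  }
  if (current.length > 0) batches.push(current);
  return batches;
}

// Count cap: 12 reports with at most 5 per batch gives batches of 5, 5 and 2.
const byCount = splitIntoBatches(new Array<string>(12).fill("1"), 5, 1_000_000);
console.log(byCount.map((b) => b.length));

// Byte cap: "[1234,1234]" is 11 bytes, so a 12-byte cap fits two reports.
const byBytes = splitIntoBatches(["1234", "1234", "1234"], 100, 12);
console.log(byBytes.map((b) => b.length));
```

In the real worker the two limits would be 5000 and the regulator's 10 MB figure, and the oversized-report case would route to dead-letter rather than throw.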
Sequin's webhook sink can POST straight to customers (least code). The pivot is per-customer HMAC signing: if Sequin signs requests with a per-subscription secret, deliver direct; if not, route webhooks through NATS to a thin worker that signs, and let Sequin still own the relay. Verify Sequin's request-signing before committing to direct delivery.
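If signing does end up in a thin worker, the core of it is small. The sketch below uses Node's built-in `node:crypto` and shows one common scheme, HMAC-SHA256 over `timestamp.body`. The scheme, the function names and the idea of sending the timestamp alongside the signature are assumptions for illustration; nothing here describes what Sequin itself does.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical signer: one secret per webhook subscription.
function signWebhook(secret: string, timestamp: string, body: string): string {
  return createHmac("sha256", secret).update(`${timestamp}.${body}`).digest("hex");
}

// What a customer would run on receipt. Constant-time comparison avoids
// leaking how many leading characters of the signature matched.
function verifyWebhook(
  secret: string,
  timestamp: string,
  body: string,
  signature: string,
): boolean {
  const expected = Buffer.from(signWebhook(secret, timestamp, body), "hex");
  const given = Buffer.from(signature, "hex");
  return expected.length === given.length && timingSafeEqual(expected, given);
}

const subscriptionSecret = "whsec_example_only";
const sentAt = "1736935200";
const webhookBody = JSON.stringify({ id: "evt-0001", type: "claim.approved" });

const webhookSignature = signWebhook(subscriptionSecret, sentAt, webhookBody);
console.log(verifyWebhook(subscriptionSecret, sentAt, webhookBody, webhookSignature));
```

Including the timestamp in the signed string lets the receiver reject replays of an old, validly signed delivery.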
The outbox still exists (Sequin needs a table to CDC, and it remains the regulatory authority) — so it's always "outbox + Sequin." One replication slot, on the outbox table only — monitor its lag and set max_slot_wal_keep_size so a stalled Sequin can't fill the disk. And Sequin is a standing service in every deployment — the single-architecture cost, traded for deleting the relay + delivery code.
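Slot monitoring can start from PostgreSQL's own catalog. The query below uses the `pg_replication_slots` view and `pg_wal_lsn_diff`, both standard in PostgreSQL 13+; the classifier around it is a hypothetical sketch, and the thresholds (warn level, 80% of `max_slot_wal_keep_size`) are assumptions to tune, not recommendations.

```typescript
// Run with whatever Postgres client the service already uses (on the primary).
const SLOT_LAG_SQL = `
  SELECT slot_name, active, wal_status,
         pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS lag_bytes
  FROM pg_replication_slots
  WHERE slot_type = 'logical'`;

interface SlotRow {
  slot_name: string;
  active: boolean;
  wal_status: string; // 'reserved' | 'extended' | 'unreserved' | 'lost'
  lag_bytes: number;
}

type SlotHealth = "ok" | "warn" | "critical";

// maxKeepBytes mirrors the configured max_slot_wal_keep_size.
function classifySlot(row: SlotRow, warnBytes: number, maxKeepBytes: number): SlotHealth {
  // 'unreserved' and 'lost' mean the slot is about to lose, or has lost, its WAL.
  if (row.wal_status === "unreserved" || row.wal_status === "lost") return "critical";
  if (row.lag_bytes >= 0.8 * maxKeepBytes) return "critical";
  // An inactive slot retains WAL with nobody consuming it.
  if (!row.active || row.lag_bytes >= warnBytes) return "warn";
  return "ok";
}

const GIB = 1024 ** 3;
const healthySlot: SlotRow = { slot_name: "sequin_outbox", active: true, wal_status: "reserved", lag_bytes: 5_000_000 };
console.log(SLOT_LAG_SQL.trim().split("\n").length, classifySlot(healthySlot, 1 * GIB, 10 * GIB));
```

The slot name `sequin_outbox` is made up. The point of the cap is the trade it encodes: once the slot exceeds `max_slot_wal_keep_size`, Postgres invalidates it to protect the disk, and Sequin then needs a backfill from the outbox, which is possible because the outbox remains the durable record.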
The capture middleware is a proper package, but the submitter is a dev-tooling script on a cron. The fix isn't a rewrite — the domain logic is already a package — it's promoting the entrypoint to a first-class event-runtime service that hosts the workers, exposes health, and drains gracefully.
The production regulatory submitter and reconcile are dev-category scripts, invoked by path by two Kubernetes CronJobs.
Why a script-on-a-cron is structurally unsafe
Each tick boots a fresh process, opens a 1-connection pool, re-scans the table, and exits. No durable consumer cursor, no warm state — polling disguised as a service.
Depending on concurrencyPolicy, an overrunning tick either races a second pod over the same tables or is silently skipped — a gap against a hard deadline.
activeDeadlineSeconds (600s/1800s) or a node drain kills a slow run mid-batch. Recovery waits for the stale-draft sweep on a later tick — the gap widens.
Cron fires every N minutes regardless of the D+1 08:00 deadline or a near-real-time need. It can't flush proactively or escalate aging events before D+7.
A failed tick surfaces only via failedJobsHistoryLimit — no alert, no SLO. Against a hard D+7 reject, silent failure for hours is a breach.
It polls a table — no durable delivery, no ack, no DLQ. A permanently-bad (poison) batch fails every tick forever with no escalation.
A k8s CronJob running a dev script is a polling batch job masquerading as a service: crude scheduling, killed-mid-flight risk, invisible failure, no backpressure, no continuity, no observability. The correctness primitives in the package are sound; the form factor around them is not.
What the service needs
outbox (semantic events; the table Sequin CDCs) · export_batches (crash-safe FSM) · submission_receipts · submission_attempts · consumer_idempotency (dedup) · payload_evidence (hashes) · webhook_subscriptions · append-only audit_ledger · config. All carry tenant_id + ecosystem; partition by (tenant, day) + retention/purge.
DB-backed, editable without restart: enabled, environment, submission mode (near-real-time / scheduled) + cadence, dry-run, retention, capture durability. Credential references (BRCAC, client_id, JWS keys) — never the secrets. Eligibility rules, webhook subscriptions + secrets + retry policy, access tiers (viewer / approver).
Structured pino logs (event names in the event field). OTEL health mirror (low-cardinality, identifier denylist): capture rate, capture-persist-failure, outbox depth, Sequin slot lag, consumer lag, delivery/retry/DLQ counts, submission latency, SLA-window freshness. Liveness/readiness endpoints. Alerts on capture-failure, SLA-at-risk, DLQ growth, slot lag.
Scripts that do not help — retire or replace
The scripts are thin shells over @finnest/banking-pcm (runPcmSubmitterOnce, createPcmConfigProvider…). Building the service is mostly a new entrypoint + deployment, not new logic — and it deletes the cron scripts, the relay, and a bespoke webhook worker in one move.
The worker is the one piece you build — and it is a long-running consumer, not a script. It pulls events durably, decides ack/retry/dead-letter explicitly, accumulates batches on a flush policy, submits with crash-safety, and records what happened in three separate places. Here's the anatomy.
The cursor survives restarts — no table re-scan. Each message resolves to ack (done), nak (transient → backoff redeliver), or term (poison → DLQ + alert). Bounded MaxAckPending is backpressure. Handlers are idempotent (dedup on event id), so at-least-once redelivery is safe.
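The ack / nak / term decision and the dedup check can be sketched without any broker client. Everything below is hypothetical: the names, the synchronous handler (a real one would be async), the in-memory `Set` standing in for the consumer_idempotency table, and the retry numbers. It shows the shape of the decision, not the NATS API.

```typescript
type Disposition =
  | { kind: "ack" }
  | { kind: "nak"; delayMs: number }
  | { kind: "term"; reason: string };

type HandlerOutcome = "ok" | "transient" | "poison";

// Assumed retry policy: exponential backoff, capped, with a delivery limit.
const MAX_DELIVERIES = 5;
const BASE_DELAY_MS = 1_000;
const MAX_DELAY_MS = 180_000;

function settle(outcome: HandlerOutcome, deliveryCount: number): Disposition {
  if (outcome === "ok") return { kind: "ack" };
  if (outcome === "poison") return { kind: "term", reason: "poison message" };
  if (deliveryCount >= MAX_DELIVERIES) return { kind: "term", reason: "retries exhausted" };
  const delayMs = Math.min(BASE_DELAY_MS * 2 ** (deliveryCount - 1), MAX_DELAY_MS);
  return { kind: "nak", delayMs };
}

class IdempotentConsumer {
  // Stand-in for the durable consumer_idempotency table.
  private processed = new Set<string>();

  handle(eventId: string, deliveryCount: number, work: () => HandlerOutcome): Disposition {
    // A redelivery of an already-processed event is acknowledged without re-running work.
    if (this.processed.has(eventId)) return { kind: "ack" };
    const outcome = work();
    if (outcome === "ok") this.processed.add(eventId);
    return settle(outcome, deliveryCount);
  }
}

const demoConsumer = new IdempotentConsumer();
let workRuns = 0;
const firstDelivery = demoConsumer.handle("evt-1", 1, () => { workRuns++; return "ok"; });
const redelivery = demoConsumer.handle("evt-1", 2, () => { workRuns++; return "ok"; });
console.log(firstDelivery.kind, redelivery.kind, workRuns); // ack ack 1
```

In production the dedup insert and the side effect it guards should commit together, otherwise a crash between them reintroduces the duplicate.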
Submission timing is a per-tenant flush policy — near-real-time (micro-batch on size/short timer) or scheduled (window) — and deadline-aware: flush proactively as 08:00 D+1 nears, escalate aging events before D+7. The timer lives in the service (leader-elected or tenant-partitioned), not a CronJob.
Sequin's slot cursor marks transport (the outbox stays append-only). consumer_idempotency marks dedup. The export_batches FSM marks submission state. Receipts mark evidence. A trace_id + idempotency key thread them. No double-marking.
The cron model couples cadence to two places (schedule × DB gate) and is blind to the deadline. The service owns timing: a per-tenant scheduler that flushes on size, time, or approaching deadline, coordinated across replicas by a single durable consumer per partition — demand-driven and SLA-aware, with one source of truth.
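A flush decision that looks at size, age and the approaching deadline fits in one pure function. This is a sketch under stated assumptions: the field names, the example policies and the idea of passing the deadline in as a precomputed timestamp are all hypothetical. Computing "08:00 D+1" correctly involves the regulator's timezone and calendar and is deliberately left out.

```typescript
interface FlushPolicy {
  maxBatchSize: number;     // flush when this many events are pending
  maxWaitMs: number;        // flush when the oldest pending event is this old
  deadlineSafetyMs: number; // flush when the nearest deadline is this close
}

interface PendingBatch {
  size: number;
  oldestEnqueuedAtMs: number;
  earliestDeadlineMs: number;
}

type FlushReason = "deadline" | "size" | "timer" | null;

function flushReason(batch: PendingBatch, nowMs: number, policy: FlushPolicy): FlushReason {
  if (batch.size === 0) return null;
  // Deadline pressure wins over everything else.
  if (batch.earliestDeadlineMs - nowMs <= policy.deadlineSafetyMs) return "deadline";
  if (batch.size >= policy.maxBatchSize) return "size";
  if (nowMs - batch.oldestEnqueuedAtMs >= policy.maxWaitMs) return "timer";
  return null;
}

// Two per-tenant policies over the same pipeline (values are illustrative).
const HOUR_MS = 3_600_000;
const nearRealTime: FlushPolicy = { maxBatchSize: 500, maxWaitMs: 5_000, deadlineSafetyMs: 2 * HOUR_MS };
const scheduled: FlushPolicy = { maxBatchSize: 5_000, maxWaitMs: 6 * HOUR_MS, deadlineSafetyMs: 2 * HOUR_MS };

const pendingNow: PendingBatch = { size: 40, oldestEnqueuedAtMs: 0, earliestDeadlineMs: 20 * HOUR_MS };
console.log(flushReason(pendingNow, 10_000, nearRealTime)); // timer
console.log(flushReason(pendingNow, 10_000, scheduled));    // null
console.log(flushReason(pendingNow, 19 * HOUR_MS, scheduled)); // deadline
```

Near-real-time and scheduled differ only in the numbers, which is why the mode can be a per-tenant setting rather than a separate code path.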
What happened, and how long. Structured logs (event name in the event field) + low-cardinality OTEL spans/metrics with a hard identifier denylist — no PII, no tokens, no raw paths. Drives dashboards and alerts. Short retention.
Who changed or approved what. Config edits, submission-mode changes, batch approvals, retention changes — actor + old/new hash, tamper-evident. ≥5 years. Not metrics, not evidence.
What we actually submitted. Regulator receipts + canonical payload hashes, fail-closed (only a real receipt mints an evidence id). 13 months. The compliance record, distinct from logs.
A single distributed trace spans capture → outbox → Sequin → NATS → worker.map → worker.submit → regulator. Capture stores the W3C traceparent on the event; the worker links its spans by trace_id. Attributes stay low-cardinality (tenant, ecosystem, regulator, endpoint template, status class, batch size). You can see end-to-end latency and exactly where a report stalled — impossible with a per-tick script.
The rule: each fact has one authority; the other signals are mirrors. Never log PII, never use metrics as audit, never use audit as evidence.
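For the traceparent stored on the event, the worker needs to recover the trace id before it can link spans. Below is a minimal parser for the version-00 layout defined by W3C Trace Context (`version-traceid-parentid-flags`, lowercase hex). `parseTraceparent` and `TraceContext` are hypothetical names; a real service would more likely rely on its OpenTelemetry SDK's propagator.

```typescript
interface TraceContext {
  traceId: string;      // 32 hex chars, shared by every span in the trace
  parentSpanId: string; // 16 hex chars, the span that wrote the event
  sampled: boolean;     // lowest bit of the flags byte
}

function parseTraceparent(header: string): TraceContext | null {
  const match = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header.trim());
  if (!match) return null;
  const [, version, traceId, parentSpanId, flags] = match;
  if (version === "ff") return null; // reserved, invalid version
  // All-zero ids are explicitly invalid in the specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentSpanId)) return null;
  return { traceId, parentSpanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Example value taken from the W3C Trace Context specification.
const storedTraceparent = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01";
const linkedContext = parseTraceparent(storedTraceparent);
console.log(linkedContext);
```

A null result should not fail the message; the worker can start a fresh trace and carry on, since tracing is a mirror and not the record.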
No "minimal vs standard vs SaaS" profiles — that matrix is its own burden. There is one topology: outbox + Sequin + NATS + workers, run identically from a single self-hosted box (N=1) to a Finnest-operated multi-tenant plane. You scale by adding worker replicas and tenants, never by swapping components.
tenant_id + ecosystem on every row, subject and Sequin sink. At N=1 it's one tenant — same code. You chose multi-tenant, so it's load-bearing from row one.
Near-real-time ↔ scheduled is a flush policy on the worker's batch stage, set per tenant + ecosystem. Identical pipeline; only the flush trigger differs.
OFB / OPIN are adapters; a new consumer is a new Sequin sink. The capture boundary, outbox and relay never change.
Every deployment runs Sequin + a worker deployment, even at N=1. Accepted because it deletes more code than it adds (no hand-built relay, no bespoke webhook delivery) and keeps dev / self-hosted / SaaS on one code path. What gets deployed on top of the existing services + Postgres + NATS: Sequin, one event-runtime deployment, and the outbox/evidence tables.
It's an assembly of proven, verified building blocks, at a scale inside the tools' envelope. The risk is whether the handful of genuinely hard problems are designed correctly, and most of those are now handled by Sequin rather than hand-built.
The scale, at 50M events/day (~580/s avg · ~6–12k/s peak)
| Layer | Load | Headroom |
|---|---|---|
| Outbox writes (Postgres) | 580–12k INSERT/s, batched, partitioned by (tenant, day) | 10k–50k+/s batched ✓ |
| Sequin CDC + NATS | tails one slot → 580–12k msg/s | CDC + NATS handle it ✓ |
| Submission (regulator) | bounded by batching: ~10k POSTs/day at 50M if every event is reported (≤5000 per batch) | batch API caps it ✓ |
| Storage + slot | ~10 GB/day raw · 13-mo retention · 1 replication slot | partition-purge + slot monitoring |
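The figures in the table can be cross-checked with plain arithmetic. The sketch below takes the document's inputs (50M events/day, ~10 GB/day raw, 6–12k/s peak, ≤5000 per batch, 13-month retention) and derives the rest. Two assumptions are mine: 13 months is approximated as 395 days, and the POST count assumes every event is reportable and every batch is full, so it is a lower bound for that case.

```typescript
const EVENTS_PER_DAY = 50_000_000;
const SECONDS_PER_DAY = 86_400;
const RAW_BYTES_PER_DAY = 10e9;      // ~10 GB/day, decimal gigabytes
const MAX_REPORTS_PER_POST = 5_000;
const RETENTION_DAYS = 395;          // assumption: roughly 13 months

const avgEventsPerSecond = EVENTS_PER_DAY / SECONDS_PER_DAY;
const bytesPerEvent = RAW_BYTES_PER_DAY / EVENTS_PER_DAY;
const minPostsPerDay = Math.ceil(EVENTS_PER_DAY / MAX_REPORTS_PER_POST);
const peakToAverageLow = 6_000 / avgEventsPerSecond;
const peakToAverageHigh = 12_000 / avgEventsPerSecond;
const retainedTerabytes = (RAW_BYTES_PER_DAY * RETENTION_DAYS) / 1e12;

console.log({
  avgEventsPerSecond: Math.round(avgEventsPerSecond),   // 579
  bytesPerEvent,                                        // 200
  minPostsPerDay,                                       // 10000
  peakToAverage: [peakToAverageLow.toFixed(1), peakToAverageHigh.toFixed(1)],
  retainedTerabytes,
});
```

The implied ~200 bytes per event is a useful sanity check on the storage row: if real events carry larger payloads, the daily volume and the retained total scale with them.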
The hard problems — and how each is solved
Transactional outbox: one durable write, everything else off the committed row.
Verified: Sequin delivers at-least-once with idempotency keys + exponential backoff + DLQ. Workers dedup on event id; the regulator path adds posting-marker/receipts.
Per-group serial by commit timestamp. Group by entity for per-entity order; carry a version for consumers.
Configurable max-retry → dead-letter, with health surfaced per sink consumer.
The outbox is the buffer — spikes land as durable rows. Sequin's load_shedding: pause_on_full bounds the rest.
Monitor slot lag; set max_slot_wal_keep_size so a stalled Sequin can't fill the disk. The one new operational duty.
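The "carry a version for consumers" point in the ordering item above deserves a concrete shape. Per-entity ordering plus at-least-once delivery still allows a stale or duplicate event to arrive after a newer one has been applied, so a consumer can guard on a per-entity version. `VersionGuard` is a hypothetical sketch with an in-memory map; it assumes versions are positive integers that increase per entity.

```typescript
class VersionGuard {
  // Stand-in for durable per-entity state on the consumer side.
  private lastApplied = new Map<string, number>();

  // Returns true if the event should be applied, false if it is stale or a duplicate.
  shouldApply(entityId: string, version: number): boolean {
    const last = this.lastApplied.get(entityId) ?? 0;
    if (version <= last) return false;
    this.lastApplied.set(entityId, version);
    return true;
  }
}

const claimVersions = new VersionGuard();
const versionDecisions = [
  claimVersions.shouldApply("claim-42", 1), // first event: apply
  claimVersions.shouldApply("claim-42", 2), // newer: apply
  claimVersions.shouldApply("claim-42", 1), // stale redelivery: skip
  claimVersions.shouldApply("claim-42", 2), // duplicate: skip
  claimVersions.shouldApply("claim-77", 1), // different entity: independent
];
console.log(versionDecisions); // [ true, true, false, false, true ]
```

This is why global total ordering is not needed: each entity only has to agree with its own history.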
Deliberately not attempted, with the alternative in parentheses: strict exactly-once to external endpoints (use at-least-once + idempotency, Sequin's model) · global total ordering (use per-entity + versioning) · raw CDC-on-domain-tables as the contract (CDC the outbox, semantic events only) · Sequin as the PCM submitter (the regulator brain is a custom worker).
Feasible, conditioned on: at-least-once + idempotency, per-entity ordering, the custom workers, and slot monitoring. Before committing, prove it with two tests. A soak test runs capture → outbox → Sequin → NATS/webhook → worker/customer at peak rate. A chaos test kills a submission mid-POST, kills the worker, fails over Postgres, stalls Sequin to watch the slot, and stalls the endpoint, then asserts no loss, no duplicate submission, clean recovery, and that the slot can't fill the disk. One open item to close first: does Sequin's webhook sink do per-customer HMAC signing?