Target Architecture OFB / OPIN · Webhooks 2026·06·26

In-memory queue → one outbox, Sequin to everywhere.

Replace the fragile in-memory buffer with a durable outbox, and let Sequin stream it to PCM and to your customers' webhooks.

Today regulated evidence is buffered in process memory and submitted by a dev script on a cron. The target: producers write durable semantic events to an outbox; Sequin change-data-captures that one table and delivers it — to NATS for the PCM/OPIN workers, and to customer webhook endpoints directly. One backbone, run the same everywhere.

Verdict

The outbox is the authority; Sequin is the relay — it replaces the hand-built relay and most of webhook delivery. Custom workers do only the regulator brain (batch, mTLS, JWS, receipts, SLA). Less code than building the relay ourselves.

Authority: Postgres outbox
Relay + delivery: Sequin CDC → NATS / webhook sinks
Guarantee: at-least-once + idempotency (verified)
01

What exists today — and why it can't stay

Two problems, both structural. Capture buffers regulated evidence in an in-memory queue that drops on overflow and vanishes on crash. And the part that submits to the regulator is a dev-tooling script run by a cron — not a service.

[Diagram: request → capture middleware → in-memory queue (drop on overflow, lost on crash) → Postgres (authority) → cron → dev script (tools/dx/dev-stack, category: 'dev') → submit, OFB only. Lossy buffer before the durable store · single consumer · single regulator · dev-script submitter on a cron.]
The durable store and the crash-safe submit logic are sound. The in-memory queue and the dev-script-on-a-cron are the fragility, covered in detail in §02 and §07.
FAILURE 01

Loss on overflow

A bounded queue with drop-oldest sheds regulated evidence under load — and counts it as captured, so the SLA ratio silently slips.

FAILURE 02

Loss on crash

Events live in process memory until drained. A pod restart or deploy mid-window is unrecoverable.

FAILURE 03

One consumer, one regulator

The pipeline submits to OFB and nothing else. Webhooks, OPIN, analytics — every new consumer is a rebuild.

FAILURE 04

A dev script as submitter

The production regulatory submitter is a category:'dev' script ticked by a cron — no health, no continuity.

02

Almost every endpoint feeds PCM

PCM is not a feature on one route. The regulators require PCM metrics for essentially every regulated API — so capture has to be ambient across the whole surface. That scale is the whole argument: it decides where evidence can safely live, and it's exactly where an in-memory queue fails hardest.

131 OFB endpoints report to PCM today: channels, open-data, accounts, cards, loans, financings, payments…
+ OPIN: transactional-metrics + consent-funnel across every insurance family, which roughly doubles the surface and adds domain events
Every regulated call is a reportable interaction, so instrumentation must be one boundary, not per-route

Why a shared backbone wins at this scale

CAPTURE ONCE

One boundary, whole surface

A single edge middleware instruments all 131 endpoints — and OPIN's — at once. A new route is reportable the moment it ships; no per-endpoint PCM wiring to remember.

ADAPTER, NOT REWORK

Add a regulator, not 100 integrations

OPIN is an adapter plus its own subjects. The capture boundary and the backbone are untouched. Scaling the obligation is additive.

FAN-OUT

One signal, many sinks

The same captured event feeds PCM submission, customer webhooks and analytics. You capture once and Sequin routes to many sinks; you never re-instrument per consumer.

…which is exactly why an in-memory queue is the wrong place to hold it

Funnel 131+ endpoints at high request rate through one bounded in-memory queue per pod, and every routine operation becomes a silent data-loss event.

[Diagram: 131 OFB endpoints + OPIN → ambient capture → in-memory queue (one per pod · 10k cap · regulated evidence held in RAM) → Postgres (banking.pcm_*). Loss points: ⚠ overflow → drop-oldest · ⚠ pod crash → buffer lost · ⚠ rolling deploy → buffer lost. Every drop is still counted as captured → silent under-report across the whole surface, against a hard D+1 / D+7 SLA that may be breached at most 4× per 30 days.]
At 131+ endpoints the queue is always hot — and it sits before the durable store, so every loss is unrecoverable and invisible.
ROUTINE OPS = LOSS

Every deploy drops the buffer

Rolling deploys and autoscale-downs happen several times a week. Each terminating pod's in-flight queue is discarded — routine operations cause routine, silent under-reporting.

SHARED CAP

One queue for the whole surface

131 endpoints share a single 10k-cap queue per pod. A burst overflows it; drop-oldest sheds the SLA-closest evidence first.

INVISIBLE

Loss looks like success

A dropped event is still counted as captured. Dashboards stay green until the regulator's pairing report exposes the gap.
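
To make the "loss looks like success" failure concrete, here is a minimal sketch of a bounded drop-oldest buffer. The class and its counters are hypothetical stand-ins, not the real capture middleware; the point is only that the captured count keeps rising while evidence is evicted.

```typescript
// Hypothetical model of a bounded drop-oldest buffer (not the real middleware).
class DropOldestQueue<T> {
  private items: T[] = [];
  private readonly cap: number;
  captured = 0; // incremented on every push, even when an older event is evicted
  dropped = 0;  // the number the current dashboards never see

  constructor(cap: number) {
    this.cap = cap;
  }

  push(item: T): void {
    this.captured += 1;
    if (this.items.length === this.cap) {
      this.items.shift(); // evict the oldest event to make room
      this.dropped += 1;
    }
    this.items.push(item);
  }

  // Hands everything still buffered to the durable store and empties the buffer.
  drain(): T[] {
    const out = this.items;
    this.items = [];
    return out;
  }
}

const queue = new DropOldestQueue<number>(3);
for (let i = 1; i <= 5; i++) queue.push(i);
const persisted = queue.drain();
console.log(queue.captured, persisted.length, queue.dropped); // 5 3 2
```

Five events are reported as captured, three reach the store, and the two oldest are gone with no error raised anywhere.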

03

The shift, in one move

Keep the durable store and the crash-safe submit logic. Delete the in-memory queue. Producers write a semantic event durably to an outbox; Sequin change-data-captures that one table and delivers it to its sinks. Nothing is buffered in memory; nothing has a single consumer; and the relay is a configured product, not code you maintain.

✕ In-memory queue + cron script

  • Buffered in process memory — lost on crash, dropped on overflow
  • One consumer — the OFB submitter, hard-wired
  • Hand-built movement — and a dev script on a cron to submit
  • Batch-only — cron cadence, no near-real-time path
  • Loss is invisible — counted as captured, erodes the SLA silently

✓ Outbox + Sequin CDC

  • Durable on write — committed to Postgres before release; the outbox is the buffer
  • Many sinks — NATS for workers, webhook for customers, more later
  • Sequin is the relay — CDC + at-least-once delivery + retry + DLQ, configured not coded
  • Both modes — near-real-time or scheduled, a per-customer flush policy
  • Loss is prevented by default: the one residual path (capture cannot reach Postgres) raises a hard alert
The one rule that makes it safe

Producers write a semantic event (claim.approved) to the outbox — never a raw row change, and Sequin CDCs the outbox, never the business tables. That event is the versioned, stable contract every sink reads. Storage stays private; the event is the API. You can evolve the database without breaking a single customer webhook or regulator mapping.
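
A sketch of what such a semantic event could look like. The envelope is an assumption: tenant_id, ecosystem, idempotency_key and trace_id are columns this document names, while the remaining fields, the payload shape and the key format are illustrative only.

```typescript
// Hypothetical envelope for one outbox row. Only tenant_id, ecosystem,
// idempotency_key and trace_id come from this document; the rest is assumed.
type Ecosystem = "OFB" | "OPIN";

interface OutboxEvent<P> {
  event_id: string;
  type: string;            // semantic name, e.g. "claim.approved"
  version: number;         // contract version that sinks can branch on
  tenant_id: string;
  ecosystem: Ecosystem;
  idempotency_key: string; // stable across redeliveries of the same fact
  trace_id: string;
  occurred_at: string;     // ISO-8601 timestamp
  payload: P;
}

// The payload carries domain facts, never table columns or row diffs.
interface ClaimApprovedV1 {
  claim_id: string;
  approved_at: string;
}

function claimApproved(
  tenantId: string,
  eventId: string,
  traceId: string,
  payload: ClaimApprovedV1,
): OutboxEvent<ClaimApprovedV1> {
  return {
    event_id: eventId,
    type: "claim.approved",
    version: 1,
    tenant_id: tenantId,
    ecosystem: "OPIN", // assumption: claims belong to the insurance ecosystem
    idempotency_key: `${tenantId}:claim.approved:${payload.claim_id}`,
    trace_id: traceId,
    occurred_at: payload.approved_at,
    payload,
  };
}

const example = claimApproved("tenant-a", "evt-1", "4bf92f3577b34da6a3ce929d0e0e4736", {
  claim_id: "clm-42",
  approved_at: "2026-06-26T12:00:00Z",
});
console.log(example.idempotency_key); // tenant-a:claim.approved:clm-42
```

Because sinks read only this envelope, a column rename in the business tables changes the producer's mapping and nothing downstream.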

04

The architecture

Four moves: producers write durable semantic events to the outbox; Sequin CDCs that one table; it fans the stream to a NATS sink (for the PCM/OPIN workers) and a webhook sink (straight to customer endpoints). The only custom code is the regulator brain in the workers.

[Diagram, left to right:
PRODUCERS: domain logic (claim.approved, in-txn) · PCM capture (interaction events)
OUTBOX: Postgres · AUTHORITY · semantic events · tenant+eco · by day
SEQUIN · relay: CDC relay · 1 slot · outbox only
SINKS: NATS sink (durable stream) · Webhook sink (retry · DLQ · order)
WORKERS / DEST: PCM·OFB worker (mTLS · batch) and PCM·OPIN worker (JWS · batch) → Regulator · receipts; customer endpoints (Sequin delivers direct)
Return path: receipts · payload hashes · delivery evidence → Postgres
Custom regulator brain: batch · mTLS · JWS · receipts · SLA
One write on the hot path (the outbox) · Sequin replaces the relay · NATS carries the worker stream · workers are the only custom code]
The outbox is the authority. Sequin is the relay + webhook delivery. NATS carries the worker stream. The PCM/OPIN workers are the only thing you build.
PRODUCE

Emit, durably

Domain logic writes a semantic event in its transaction; PCM capture records interactions. One durable write — no dual-write, no in-memory hop.
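
The "one durable write, no dual-write" property can be illustrated with an in-memory stand-in for a database transaction. FakeDb and approveClaim are hypothetical names, and a real implementation would use a Postgres transaction; the sketch only shows that the state change and the outbox row commit or roll back together.

```typescript
// In-memory stand-in for a Postgres transaction so the sketch runs without a database.
interface Row {
  table: string;
  data: Record<string, unknown>;
}

class FakeDb {
  readonly committed: Row[] = [];

  // Rows are staged while fn runs and become visible only if fn returns normally.
  transaction(fn: (insert: (row: Row) => void) => void): void {
    const staged: Row[] = [];
    fn((row) => staged.push(row));
    this.committed.push(...staged);
  }
}

// Hypothetical handler: the business change and its event share one transaction.
function approveClaim(db: FakeDb, claimId: string, failBeforeCommit = false): void {
  db.transaction((insert) => {
    insert({ table: "claims", data: { id: claimId, status: "approved" } });
    insert({ table: "outbox", data: { type: "claim.approved", claim_id: claimId } });
    if (failBeforeCommit) throw new Error("simulated failure before commit");
  });
}

const db = new FakeDb();
approveClaim(db, "c-1");
try {
  approveClaim(db, "c-2", true);
} catch {
  // Rolled back: neither the claim update nor its event was committed.
}
console.log(db.committed.map((r) => r.table)); // two rows, both for c-1
```

There is never a state where the claim is approved but the event is missing, which is the case a dual-write to a queue cannot rule out.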

RELAY

Sequin, not handwritten

Sequin CDCs the outbox (one slot, that table only) and fans it out. It replaces the relay you'd otherwise build and maintain.

SINKS

NATS + webhook

NATS feeds the worker pool (durable, replayable). The webhook sink delivers straight to customers with retry/DLQ/ordering built in.

WORKERS

Only the brain

PCM/OPIN workers do the regulator-specific work — batch, mTLS, JWS, receipts, SLA. That's the one piece that is genuinely yours.

05

How it works — step by step

Two flows on the same outbox. A PCM report takes the NATS path to a worker that talks to the regulator; a customer webhook is delivered by Sequin's webhook sink directly. Then: how one event fans out to both, and the delivery lifecycle behind each.

Flow A · PCM report (NATS sink → worker → regulator)
[Sequence diagram. Lanes: API edge · Outbox·PG · Sequin · NATS · PCM worker · Regulator. Steps: ① durable write · ② CDC tails slot · ③ → NATS sink · ④ pull ≥1× · ⑤ map (adapter) + batch · flush policy · ⑥ posting · POST mTLS/JWS · ⑦ receipts · ⑧ persist receipt = evidence · ack (idempotent)]
Sequin relays the outbox to NATS; the worker owns the regulator brain. Reconciliation runs as a separate pass.
  1. Durable capture. The middleware writes one interaction event to the outbox synchronously — fail-open + hard capture-loss alert if Postgres is unreachable (never breaks the regulated call).
  2. Sequin CDC. Sequin tails the outbox's logical-replication slot — that one table only.
  3. NATS sink. Sequin publishes each event to the interactions.<tenant> subject on NATS.
  4. Durable pull. The PCM worker consumes at-least-once from its durable consumer.
  5. Map & batch. The OFB/OPIN adapter maps the payload and accumulates a batch — micro-batch (near-real-time) or windowed to ≤5000/10 MB (scheduled), per the tenant's flush policy.
  6. Submit. The worker sets the posting marker before the call, then POSTs over mTLS (OFB) or signs JWS (OPIN) — a crash here fails closed to submit_unconfirmed, never a blind re-POST.
  7. Receipts. The regulator returns per-report outcomes.
  8. Evidence & ack. The worker persists the receipt (the only evidence-id source) and acks. Replays are idempotent. A separate reconcile pass handles failed/discarded/unexported and the D+1/D+7 SLA.
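
Step 6 is the subtle one, so here is a minimal sketch of the posting marker. The state names draft, posting, submitted and submit_unconfirmed come from this document; the functions and the in-memory batch are assumptions standing in for durable rows in export_batches.

```typescript
type BatchState = "draft" | "posting" | "submitted" | "submit_unconfirmed";

interface Batch {
  id: string;
  state: BatchState;
  receiptId?: string;
}

// The marker is written BEFORE the network call, so a crash always leaves
// evidence that a POST may have reached the regulator.
function submit(batch: Batch, post: () => string): void {
  if (batch.state !== "draft") {
    throw new Error(`cannot submit from ${batch.state}`);
  }
  batch.state = "posting";  // a durable write in the real system
  const receiptId = post(); // may throw, or the process may die here
  batch.receiptId = receiptId;
  batch.state = "submitted";
}

// On restart, a batch still in "posting" has an unknown outcome: fail closed.
function recover(batch: Batch): void {
  if (batch.state === "posting") batch.state = "submit_unconfirmed";
}

const batch: Batch = { id: "b-1", state: "draft" };
try {
  submit(batch, () => {
    throw new Error("connection reset mid-POST");
  });
} catch {
  // The process would normally die here; the "posting" marker is already stored.
}
recover(batch);
console.log(batch.state); // submit_unconfirmed
```

A second submit on this batch throws instead of re-POSTing, which leaves resolution to the reconcile pass rather than to a blind retry.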
Flow B · Customer webhook (Sequin's webhook sink, direct)
[Sequence diagram. Lanes: domain logic · Outbox·PG · Sequin · webhook sink · customer. Steps: ① emit claim.approved (in-txn) · ② CDC tails slot · ③ route · transform + idempotency key · sign(?) · ④ POST /webhook (≥1×) · ⑤ 2xx → delivered. ✕ non-2xx / timeout → Sequin backoff retry; after max retries → dead-letter + alert. The idempotency key lets the customer dedup · HMAC signing = open item to verify]
Sequin's webhook sink owns delivery — at-least-once, backoff, dead-letter, per-group ordering — POSTing straight to the customer.
  1. Emit in transaction. The business handler commits its state change and a semantic claim.approved event to the outbox in one transaction — atomic, no lost notification.
  2. Sequin CDC. Sequin tails the outbox slot.
  3. Route & shape. A routing function picks the customer's endpoint; a transform builds the versioned payload and idempotency key. (HMAC signing: confirm Sequin signs requests — else add a thin signing step.)
  4. Deliver. Sequin POSTs to the customer endpoint, at-least-once.
  5. Confirm or retry. 2xx → delivered. Non-2xx/timeout → exponential backoff (verified: up to 3 min); after max retries → dead-letter + alert. The customer dedups via the idempotency key.
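
The retry step can be sketched as a capped exponential schedule. Only the 3-minute ceiling and the dead-letter outcome come from this document; the base delay, the doubling factor and the function names are assumptions, and Sequin's actual curve and jitter may differ.

```typescript
const CAP_MS = 3 * 60 * 1000; // the 3-minute ceiling quoted above

// Assumed shape: base * 2^attempt, capped. Not Sequin's documented formula.
function backoffMs(attempt: number, baseMs = 1000): number {
  return Math.min(CAP_MS, baseMs * 2 ** attempt);
}

type Outcome = { kind: "retry"; delayMs: number } | { kind: "dead_letter" };

// attempt counts failed deliveries so far; maxRetries is a per-sink setting.
function nextStep(attempt: number, maxRetries: number): Outcome {
  return attempt >= maxRetries
    ? { kind: "dead_letter" }
    : { kind: "retry", delayMs: backoffMs(attempt) };
}

console.log([0, 1, 2, 8].map((a) => backoffMs(a))); // 1000, 2000, 4000, 180000
console.log(nextStep(3, 3).kind); // dead_letter
```

Because delivery is at-least-once, any of these retries can arrive after an earlier attempt actually succeeded, which is why the customer-side dedup on the idempotency key is part of the contract.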
Flow C · One event, fanned out
[Diagram: consent.advanced (one outbox event) → Sequin → NATS sink → PCM·OPIN → regulator; Sequin → Webhook sink → customer endpoint]
One write, two sinks. A new consumer is a new Sequin sink — not a new pipeline.
Flow D · Delivery lifecycle
[State diagram: pending → claimed → delivering → delivered; fail → retrying (backoff); max retries → deadletter. Idempotent: a replay never double-delivers · webhooks: Sequin owns this · PCM: the worker's batch FSM · every state is durable · resumable after any crash]
Sequin runs this lifecycle for webhooks; the PCM worker runs its own crash-safe batch FSM for regulator submission.
06

Sequin — what it replaces, what stays yours

"Outbox vs Sequin" was a false choice: the outbox is the source of truth, Sequin is the transport that reads it. Checked against current docs, Sequin's delivery model is exactly the one we want — so it replaces the relay we'd otherwise hand-build, and most of webhook delivery too.

Verified via context7 · /sequinstream/sequin

Capability | What Sequin provides | Fit for us
Sinks | NATS · Kafka · GCP Pub/Sub · Webhook/HTTP · more | fits
Delivery guarantee | at-least-once + idempotency keys (exactly-once isn't always possible) | exactly our model
Retries | exponential backoff up to 3 min; configurable max-retry → discard / DLQ | built in
Ordering | per-group serial (default = row PK), ordered by commit timestamp | per-entity order
Batching · filters · transforms · routing · backfill | all configurable | yes
Replication slot | required (pgoutput); docs warn "slots accumulate data if not in use" | manage it
SEQUIN REPLACES

The relay + webhook delivery

CDC the outbox → NATS (for the workers) and → customer webhooks directly. Retries, backoff, dead-letter, ordering and idempotency come configured. The hand-built NOTIFY relay and a bespoke webhook worker both disappear.

STAYS YOURS

The regulator brain

Sequin can't be the PCM submitter. That job needs batching to ≤5000/10 MB, mTLS client_credentials token rotation, OPIN JWS PS256 signing, per-report receipts, 207/429 semantics, the posting crash-safety and the D+1/D+7 reconcile. Sequin feeds the worker; it cannot replace it.

One open decision — webhooks: direct vs via a worker

Sequin's webhook sink can POST straight to customers (least code). The pivot is per-customer HMAC signing: if Sequin signs requests with a per-subscription secret, deliver direct; if not, route webhooks through NATS to a thin worker that signs, and let Sequin still own the relay. Verify Sequin's request-signing before committing to direct delivery.

The price of letting Sequin be the relay

The outbox still exists (Sequin needs a table to CDC, and it remains the regulatory authority) — so it's always "outbox + Sequin." One replication slot, on the outbox table only — monitor its lag and set max_slot_wal_keep_size so a stalled Sequin can't fill the disk. And Sequin is a standing service in every deployment — the single-architecture cost, traded for deleting the relay + delivery code.

07

The PCM service — and the scripts to retire

The capture middleware is a proper package, but the submitter is a dev-tooling script on a cron. The fix isn't a rewrite — the domain logic is already a package — it's promoting the entrypoint to a first-class event-runtime service that hosts the workers, exposes health, and drains gracefully.

Verified in the codebase
package.json → "pcm:submit": "bun run tools/dx/dev-stack/src/pcm/pcm-submit.ts"
pcm-submit.ts → meta = { category: 'dev' }
infra/…/pcm-cron.ts → scriptPath: 'tools/dx/dev-stack/src/pcm/pcm-submit.ts' (runtime-ops image)

The production regulatory submitter and the reconcile pass are both dev-category scripts, invoked by path from two Kubernetes CronJobs.

Why a script-on-a-cron is structurally unsafe

DANGER 01

Cold process every tick

Each tick boots a fresh process, opens a 1-connection pool, re-scans the table, and exits. No durable consumer cursor, no warm state — polling disguised as a service.

DANGER 02

Overlap or silent gaps

Depending on concurrencyPolicy, an overrunning tick either races a second pod over the same tables or is silently skipped — a gap against a hard deadline.

DANGER 03

Killed mid-flight

activeDeadlineSeconds (600s/1800s) or a node drain kills a slow run mid-batch. Recovery waits for the stale-draft sweep on a later tick — the gap widens.

DANGER 04

A fixed clock, blind to deadlines

Cron fires every N minutes regardless of the D+1 08:00 deadline or a near-real-time need. It can't flush proactively or escalate aging events before D+7.

DANGER 05

Failure is invisible

A failed tick surfaces only via failedJobsHistoryLimit — no alert, no SLO. Against a hard D+7 reject, silent failure for hours is a breach.

DANGER 06

No consumption model

It polls a table — no durable delivery, no ack, no DLQ. A permanently-bad (poison) batch fails every tick forever with no escalation.

The core problem

A k8s CronJob running a dev script is a polling batch job masquerading as a service: crude scheduling, killed-mid-flight risk, invisible failure, no backpressure, no continuity, no observability. The correctness primitives in the package are sound; the form factor around them is not.

What the service needs

TABLES

Outbox + evidence

outbox (semantic events; the table Sequin CDCs) · export_batches (crash-safe FSM) · submission_receipts · submission_attempts · consumer_idempotency (dedup) · payload_evidence (hashes) · webhook_subscriptions · append-only audit_ledger · config. All carry tenant_id + ecosystem; partition by (tenant, day) + retention/purge.

CUSTOMIZATION

Per-tenant + per-eco config

DB-backed, editable without restart: enabled, environment, submission mode (near-real-time / scheduled) + cadence, dry-run, retention, capture durability. Credential references (BRCAC, client_id, JWS keys) — never the secrets. Eligibility rules, webhook subscriptions + secrets + retry policy, access tiers (viewer / approver).

LOGS & HEALTH

Observable by design

Structured pino logs (event names in the event field). OTEL health mirror (low-cardinality, identifier denylist): capture rate, capture-persist-failure, outbox depth, Sequin slot lag, consumer lag, delivery/retry/DLQ counts, submission latency, SLA-window freshness. Liveness/readiness endpoints. Alerts on capture-failure, SLA-at-risk, DLQ growth, slot lag.

Scripts that do not help — retire or replace

✕ Dev scripts on a cron

  • dev-stack/pcm-submit.ts as the prod entrypoint → logic moves into the service
  • pcm-reconcile.ts cron script → a continuous consumer or first-class Job
  • pcm-status.ts → replaced by the service's health/metrics endpoints
  • pcm-runtime.ts dev-stack wiring → folded into the service
  • k8s CronJobs (pcm-cron.ts) → a Deployment (+ optional catch-up Job)
  • dry-run artifact files + committed ofb-pcm-report.json as a UI source → dev/evidence only

✓ event-runtime service

  • First-class Deployment hosting the workers — no dev path in prod
  • Continuous durable consumers off NATS — near-real-time or scheduled by flush policy
  • Health endpoints + structured metrics — failures surfaced, not discovered late
  • Graceful drain on SIGTERM; scales by replicas
  • Reconcile is a consumer / first-class Job, coordinated by design
  • Domain logic unchanged — it already lives in @finnest/banking-pcm
Why the move is small

The scripts are thin shells over @finnest/banking-pcm (runPcmSubmitterOnce, createPcmConfigProvider…). Building the service is mostly a new entrypoint + deployment, not new logic — and it deletes the cron scripts, the relay, and a bespoke webhook worker in one move.

08

Inside the worker

The worker is the one piece you build — and it is a long-running consumer, not a script. It pulls events durably, decides ack/retry/dead-letter explicitly, accumulates batches on a flush policy, submits with crash-safety, and records what happened in three separate places. Here's the anatomy.

[Diagram: NATS consumer (durable pull · MaxAckPending) → classify (eligibility) → map (adapter) → batch (flush policy · posting marker) → submit (mTLS / JWS) → evidence (receipts) → ack ✓. Failure exits: nak → backoff redeliver (transient) · term → DLQ + alert (poison). SIGTERM → stop pulling · finish in-flight acks · flush the open batch durably · exit (no mid-flight loss)]
A durable consumer with explicit ack / nak / term decisions — not a cold process per tick.
CONSUME

Durable pull, explicit decisions

The cursor survives restarts — no table re-scan. Each message resolves to ack (done), nak (transient → backoff redeliver), or term (poison → DLQ + alert). Bounded MaxAckPending is backpressure. Handlers are idempotent (dedup on event id), so at-least-once redelivery is safe.
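
A minimal sketch of that decision, kept independent of any NATS client so it runs on its own. The Set stands in for the consumer_idempotency table, and the decide function and Result type are hypothetical names; a real worker would translate the returned decision into the message's ack, nak or term call.

```typescript
type Decision = "ack" | "nak" | "term";
type Result = "ok" | "transient" | "poison";

// seen stands in for the consumer_idempotency table (dedup on event id).
function decide(eventId: string, seen: Set<string>, work: () => Result): Decision {
  if (seen.has(eventId)) return "ack"; // replay of finished work: acknowledge, do nothing
  const result = work();
  if (result === "ok") {
    seen.add(eventId); // record completion so a redelivery becomes a no-op
    return "ack";
  }
  // poison can never succeed: terminate to the DLQ; transient: ask for redelivery
  return result === "poison" ? "term" : "nak";
}

const seen = new Set<string>();
console.log(decide("evt-1", seen, () => "transient")); // nak
console.log(decide("evt-1", seen, () => "ok"));        // ack
console.log(decide("evt-1", seen, () => "poison"));    // ack (replay, work is skipped)
console.log(decide("evt-2", seen, () => "poison"));    // term
```

In a real worker the dedup record and the side effect would need to commit together; the sketch leaves that out.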

SCHEDULE

Flush policy, not a cron tick

Submission timing is a per-tenant flush policy — near-real-time (micro-batch on size/short timer) or scheduled (window) — and deadline-aware: flush proactively as 08:00 D+1 nears, escalate aging events before D+7. The timer lives in the service (leader-elected or tenant-partitioned), not a CronJob.
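
One way to picture a flush policy is as a pure function over the open batch. The field names and the margin value below are assumptions; only the 5000-report and 10 MB caps and the idea of flushing early near the deadline come from this document.

```typescript
interface FlushPolicy {
  maxReports: number;       // hard cap per batch (5000 in the text)
  maxBytes: number;         // 10 MB in the text; the binary interpretation is assumed
  maxAgeMs: number;         // micro-batch timer or scheduled window length
  deadlineMarginMs: number; // flush early once the SLA deadline is this close
}

interface OpenBatch {
  reports: number;
  bytes: number;
  openedAt: number;   // epoch ms
  deadlineAt: number; // epoch ms of the earliest SLA deadline in the batch
}

type FlushReason = "size" | "bytes" | "deadline" | "age" | null;

function shouldFlush(b: OpenBatch, p: FlushPolicy, now: number): FlushReason {
  if (b.reports === 0) return null; // nothing to send
  if (b.reports >= p.maxReports) return "size";
  if (b.bytes >= p.maxBytes) return "bytes";
  if (b.deadlineAt - now <= p.deadlineMarginMs) return "deadline";
  if (now - b.openedAt >= p.maxAgeMs) return "age";
  return null;
}

const policy: FlushPolicy = {
  maxReports: 5000,
  maxBytes: 10 * 1024 * 1024,
  maxAgeMs: 2000,
  deadlineMarginMs: 3600000,
};
const current: OpenBatch = { reports: 120, bytes: 48000, openedAt: 0, deadlineAt: 10000000 };
console.log(shouldFlush(current, policy, 500));     // null
console.log(shouldFlush(current, policy, 2500));    // age
console.log(shouldFlush(current, policy, 6500000)); // deadline
```

Switching a tenant between near-real-time and scheduled then means changing maxAgeMs in config, not deploying a different pipeline.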

MARK

One marker per concern

Sequin's slot cursor marks transport (the outbox stays append-only). consumer_idempotency marks dedup. The export_batches FSM marks submission state. Receipts mark evidence. A trace_id + idempotency key thread them. No double-marking.

Scheduling — the difference

The cron model couples cadence to two places (schedule × DB gate) and is blind to the deadline. The service owns timing: a per-tenant scheduler that flushes on size, time, or approaching deadline, coordinated across replicas by a single durable consumer per partition — demand-driven and SLA-aware, with one source of truth.

Marking the event — its trail through the pipeline
[Diagram: outbox (append-only · idempotency_key + trace_id) → Sequin → NATS (slot cursor = transport progress) → consumer_idempotency (dedup(event_id)) → export_batch FSM (draft → posting → submitted) → receipt (evidence · evidence_id). One trace spans the whole path · correlation by trace_id. With CDC the outbox is never stamped published; Sequin's cursor is the transport marker.]
Each fact is marked exactly once, by its single authority — append-only source, slot cursor, dedup table, batch FSM, receipt.
Spans & logging — three signals, one authority each
OPERATIONAL

pino + OTEL

What happened, and how long. Structured logs (event name in the event field) + low-cardinality OTEL spans/metrics with a hard identifier denylist — no PII, no tokens, no raw paths. Drives dashboards and alerts. Short retention.

AUDIT

Append-only ledger

Who changed or approved what. Config edits, submission-mode changes, batch approvals, retention changes — actor + old/new hash, tamper-evident. ≥5 years. Not metrics, not evidence.

EVIDENCE

Receipts + hashes

What we actually submitted. Regulator receipts + canonical payload hashes, fail-closed (only a real receipt mints an evidence id). 13 months. The compliance record, distinct from logs.

One trace, end to end

A single distributed trace spans capture → outbox → Sequin → NATS → worker.map → worker.submit → regulator. Capture stores the W3C traceparent on the event; the worker links its spans by trace_id. Attributes stay low-cardinality (tenant, ecosystem, regulator, endpoint template, status class, batch size). You can see end-to-end latency and exactly where a report stalled — impossible with a per-tick script. The rule: each fact has one authority; the other signals are mirrors. Never log PII, never use metrics as audit, never use audit as evidence.
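
For the trace link, the worker has to recover the trace id from the stored traceparent. This is a small sketch of parsing the W3C Trace Context header (version 00 only); the function name and return shape are assumptions, and a production worker would more likely rely on its OpenTelemetry propagator.

```typescript
interface TraceContext {
  traceId: string;  // 32 lowercase hex characters
  parentId: string; // 16 lowercase hex characters
  sampled: boolean;
}

// Accepts only version "00" of the W3C traceparent format:
// version "-" trace-id "-" parent-id "-" trace-flags
function parseTraceparent(header: string): TraceContext | null {
  const m = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!m) return null;
  const [, traceId = "", parentId = "", flags = ""] = m;
  // The specification treats all-zero ids as invalid.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { traceId, parentId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

const ctx = parseTraceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
console.log(ctx?.traceId); // 4bf92f3577b34da6a3ce929d0e0e4736
console.log(parseTraceparent("not-a-traceparent")); // null
```

Storing the whole header on the event rather than only the trace id keeps the parent span available if the worker wants a proper parent-child link.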

09

One architecture, every deployment

No "minimal vs standard vs SaaS" profiles — that matrix is its own burden. There is one topology: outbox + Sequin + NATS + workers, run identically from a single self-hosted box (N=1) to a Finnest-operated multi-tenant plane. You scale by adding worker replicas and tenants, never by swapping components.

FLEXES AS CONFIG

Tenancy

tenant_id + ecosystem on every row, subject and Sequin sink. At N=1 it's one tenant — same code. You chose multi-tenant, so it's load-bearing from row one.

FLEXES AS CONFIG

Submission mode

Near-real-time ↔ scheduled is a flush policy on the worker's batch stage, set per tenant + ecosystem. Identical pipeline; only the flush trigger differs.

FLEXES AS CONFIG

Regulator & consumers

OFB / OPIN are adapters; a new consumer is a new Sequin sink. The capture boundary, outbox and relay never change.

The deal "single architecture" makes

Every deployment runs Sequin + a worker deployment, even at N=1. Accepted because it deletes more code than it adds (no hand-built relay, no bespoke webhook delivery) and keeps dev / self-hosted / SaaS on one code path. What gets deployed on top of the existing services + Postgres + NATS: Sequin, one event-runtime deployment, and the outbox/evidence tables.

10

Is it feasible? Yes — here's the gate

It's an assembly of proven, verified building blocks at a scale inside the tools' envelope. The risk is whether the handful of genuinely hard problems are designed correctly — and most are now provided by Sequin rather than hand-built.

The scale, at 50M events/day (~580/s avg · ~6–12k/s peak)

Layer | Load | Headroom
Outbox writes (Postgres) | 580–12k INSERT/s, batched, partitioned by (tenant, day) | 10k–50k+/s batched ✓
Sequin CDC + NATS | tails one slot → 580–12k msg/s | CDC + NATS handle it ✓
Submission (regulator) | bounded by batching: roughly 10k POSTs/day at 50M if every event is reportable (50M ÷ 5000 per batch) | batch API caps it ✓
Storage + slot | ~10 GB/day raw · 13-mo retention · 1 replication slot | partition-purge + slot monitoring

The hard problems — and how each is solved

SOLVED

Dual-write

Transactional outbox: one durable write, everything else off the committed row.

SEQUIN PROVIDES

At-least-once + idempotency

Verified: Sequin delivers at-least-once with idempotency keys + exponential backoff + DLQ. Workers dedup on event id; the regulator path adds posting-marker/receipts.

SEQUIN PROVIDES

Ordering

Per-group serial by commit timestamp. Group by entity for per-entity order; carry a version for consumers.
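
On the consumer side, "carry a version" could look like the sketch below. The entity_id and version fields and the applyIfNewer helper are hypothetical; the idea is only that a consumer ignores anything it has already seen or that arrives behind a newer version.

```typescript
interface Versioned {
  entity_id: string;
  version: number; // increases with every change to the entity
}

// Applies an event only if it is newer than the last version seen for that entity.
function applyIfNewer(latest: Map<string, number>, e: Versioned): boolean {
  const seenVersion = latest.get(e.entity_id) ?? 0;
  if (e.version <= seenVersion) return false; // duplicate or stale redelivery
  latest.set(e.entity_id, e.version);
  return true;
}

const latest = new Map<string, number>();
console.log(applyIfNewer(latest, { entity_id: "clm-42", version: 1 })); // true
console.log(applyIfNewer(latest, { entity_id: "clm-42", version: 2 })); // true
console.log(applyIfNewer(latest, { entity_id: "clm-42", version: 1 })); // false
console.log(applyIfNewer(latest, { entity_id: "clm-7", version: 1 }));  // true
```

Ordering is tracked per entity, so an unrelated entity is never blocked by another one's backlog.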

SEQUIN PROVIDES

Poison messages

Configurable max-retry → dead-letter, with health surfaced per sink consumer.

SOLVED

Backpressure

The outbox is the buffer — spikes land as durable rows. Sequin's load_shedding: pause_on_full bounds the rest.

MUST MANAGE

The replication slot

Monitor slot lag; set max_slot_wal_keep_size so a stalled Sequin can't fill the disk. The one new operational duty.
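
A sketch of what that monitoring duty might check. The query uses the standard pg_replication_slots view (wal_status exists from PostgreSQL 13); the thresholds, the slot name and the classify helper are assumptions, and the function works on an already-fetched row so the sketch runs without a database.

```typescript
// Query a monitoring job could run; lag is the WAL the slot still holds back.
const SLOT_LAG_SQL = `
  SELECT slot_name, wal_status,
         pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS lag_bytes
  FROM pg_replication_slots
  WHERE slot_type = 'logical'`;

interface SlotRow {
  slot_name: string;
  wal_status: string; // reserved | extended | unreserved | lost
  lag_bytes: number;
}

type Severity = "ok" | "warn" | "page";

// Thresholds are hypothetical; tune them against max_slot_wal_keep_size.
function classify(row: SlotRow, warnBytes: number, pageBytes: number): Severity {
  // unreserved or lost means WAL the slot needs is at risk or already removed
  if (row.wal_status === "unreserved" || row.wal_status === "lost") return "page";
  if (row.lag_bytes >= pageBytes) return "page";
  if (row.lag_bytes >= warnBytes) return "warn";
  return "ok";
}

const GiB = 1024 ** 3;
const sample: SlotRow = { slot_name: "sequin_outbox", wal_status: "reserved", lag_bytes: 2 * GiB };
console.log(classify(sample, 1 * GiB, 8 * GiB)); // warn
```

Note the trade-off behind max_slot_wal_keep_size: it protects the disk by invalidating a slot that falls too far behind, so the page threshold should fire well before that limit.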

Not feasible — so we design around it

Strict exactly-once to external endpoints (use at-least-once + idempotency — Sequin's model) · global total ordering (use per-entity + versioning) · raw CDC-on-domain-tables as the contract (CDC the outbox, semantic events only) · Sequin as the PCM submitter (the regulator brain is a custom worker).

Verdict — and how to prove it for real

Feasible, conditioned on: at-least-once + idempotency, per-entity ordering, the custom workers, and slot monitoring. Before committing, prove it with a soak test (capture → outbox → Sequin → NATS/webhook → worker/customer at peak rate) and a chaos test (kill mid-POST, kill the worker, fail over Postgres, stall Sequin to watch the slot, stall the endpoint → assert no loss, no duplicate submission, clean recovery, and that the slot can't fill the disk). One open item to close first: does Sequin's webhook sink do per-customer HMAC signing?