Deploy Platforms Instead Of Charts

The central object in this article is the Torque stack file. It is not a values file for one chart. It is an ordered deployment contract for a full fraud platform: host setup, Kubernetes access, cloud storage, data services, application workloads, public checks, batch processing, replay, and final verification.

Helmfile can coordinate Helm releases. Argo CD and Flux can keep Kubernetes objects in sync with Git. Terraform and Pulumi can create cloud resources. Argo Workflows can run jobs. This stack uses that kind of tooling where it fits, but the stack boundary is wider. The deployment starts with a remote Linux host, creates a Firecracker-backed k3s deployment environment, opens a controlled tunnel for local Kubernetes access, creates or checks the S3 bucket, installs the platform services, deploys the fraud workloads, runs Spark, runs replay, and verifies the data path from outside the cluster.

The source package lives under stacks/fraud-platform. The rendered stack file is attached here as stack.yaml. The production-shaped entrypoint is stack-packaged.yaml, with profile values such as values-prod.yaml. The proof-run profile expects TORQUE_LAB_SSH for the target host and TORQUE_LAB_PUBLIC_IP for public endpoint checks.

Platform Architecture

The platform has two data paths. The stream path starts at the payments API. That API is the public event ingress. It accepts generated card payment events, writes each raw event to the payments.raw Redpanda topic, and stores a raw JSON object in S3. Redpanda is more than a queue in this stack. It also exposes a schema registry with subjects for raw payment events, risk events, and payment decisions, and the verification step checks those subjects for backward compatibility.

Flink consumes the raw topic and builds the scoring request. It keeps the stream processor separate from the model service. Ray Serve owns the model endpoint and returns the score. Flink writes the decision to ClickHouse and also produces risk output back through Redpanda. ClickHouse is the fast operator store in this design: decisions, fraud rates, merchant behavior, and batch summaries are available without reading lake files.

The batch path starts from S3. Argo submits a Spark workflow after the platform is reachable. Spark reads raw payment objects and risk decisions, computes aggregate fraud features, writes a curated JSON artifact back to S3, inserts batch rows into ClickHouse, and commits three Iceberg tables through the REST catalog: raw_payments, risk_events, and batch_feature_summary. Trino is the analyst surface over both stores. It queries ClickHouse for low-latency decisions and Iceberg for lakehouse tables. SigNoz covers service health and uses ClickHouse as its telemetry store.
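
As a rough illustration of what "aggregate fraud features" can mean, the sketch below computes a per-merchant fraud rate in plain Python. It is not the Spark job from the stack. The field names merchant_id and flagged and the sample rows are hypothetical, chosen only to show the shape of the aggregation.

```python
from collections import defaultdict

def fraud_rate_by_merchant(decisions):
    """Return {merchant_id: flagged_count / total_count} for the given rows."""
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for row in decisions:
        totals[row["merchant_id"]] += 1
        if row["flagged"]:
            flagged[row["merchant_id"]] += 1
    return {merchant: flagged[merchant] / totals[merchant] for merchant in totals}

# Made-up rows for illustration; real input would be the risk decisions in S3.
sample = [
    {"merchant_id": "m-1", "flagged": True},
    {"merchant_id": "m-1", "flagged": False},
    {"merchant_id": "m-2", "flagged": False},
]
rates = fraud_rate_by_merchant(sample)
print(rates)
```

In the real pipeline the same grouping would run as a bounded Spark job, and the result would be written to S3, ClickHouse, and the batch_feature_summary Iceberg table.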

The Kubernetes layout is also part of the architecture. The platform labels Firecracker nodes by workload role: control, observability, events, processing, machine learning and batch, and analytics. Redpanda runs in the data namespace. Flink runs in the stream namespace. Ray and Spark run in the machine learning namespace. The API and generator run in the apps namespace. SigNoz and its ClickHouse store run in observability. Trino and Iceberg REST run on the analytics node so SQL access is separate from stream processing and model serving.

Stream And Scoring

stream: Payments API (public event ingress)
raw: Redpanda (payments.raw topic)
consume: Flink (continuous risk job)
scoring: Flink (feature window)
score: Ray Serve (risk model endpoint)
write: ClickHouse (payment decisions)

Batch And Telemetry

batch: Argo (scheduled workflow)
run: Spark (bounded feature job)
persist: S3 + Iceberg (curated features)
telemetry: Workloads (API, jobs, model service)
signals: SigNoz (service health surface)
store: ClickHouse (observability backend)

Stack Structure

The stack graph makes the operating order explicit. The host must exist before the tunnel can be opened. S3 must exist before workloads receive bucket credentials. The platform services must be ready before the application workloads are installed. Public access must be available before the batch job and final checks run.

nodes:
  - name: fc-k8s-bootstrap
  - name: fc-k8s-tunnel
    needs: [fc-k8s-bootstrap]
  - name: aws-s3-bootstrap
    needs: [fc-k8s-tunnel]
  - name: platform-install
    needs: [aws-s3-bootstrap]
  - name: workloads-install
    needs: [platform-install]
  # Excerpt: argo-spark-batch, which replay-backfill needs, is not shown here.
  - name: replay-backfill
    needs: [argo-spark-batch]
  - name: verify-e2e
    needs: [replay-backfill]

Each node can own a different kind of work. The bootstrap nodes run host commands. The platform node applies Kubernetes resources and labels nodes for control, observability, events, processing, machine learning and batch, and analytics workloads. The workload node installs Redpanda, Flink, Ray, Spark, Trino, Iceberg REST, the payments API, and schema registration jobs. The batch and replay nodes submit Argo workflows. The final node is a verification program, not a note in a runbook.
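
The needs edges in the graph excerpt are enough to derive a run order. The sketch below does that with Python's standard graphlib module. It is an illustration of dependency ordering, not how Torque schedules nodes. One edge is an assumption: the excerpt references argo-spark-batch but does not show what that node needs, so the sketch attaches it to workloads-install.

```python
from graphlib import TopologicalSorter

# Each key is a node; each value lists the nodes it needs first.
NEEDS = {
    "fc-k8s-bootstrap": [],
    "fc-k8s-tunnel": ["fc-k8s-bootstrap"],
    "aws-s3-bootstrap": ["fc-k8s-tunnel"],
    "platform-install": ["aws-s3-bootstrap"],
    "workloads-install": ["platform-install"],
    # ASSUMPTION: the excerpt does not show this node's needs.
    "argo-spark-batch": ["workloads-install"],
    "replay-backfill": ["argo-spark-batch"],
    "verify-e2e": ["replay-backfill"],
}

def run_order(needs):
    """Return one valid execution order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(needs).static_order())

order = run_order(NEEDS)
print(" -> ".join(order))
```

Because every node here has at most one predecessor, the order is a single chain that starts at fc-k8s-bootstrap and ends at verify-e2e. A graph with independent branches would admit several valid orders.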

Profiles keep environment choices out of the command text. The deployment profile owns the Firecracker host, public IP checks, NodePort exposure, and generated traffic. A stage or production profile can point at an existing kubeconfig, pinned images, real ingress, and secret references. The same graph remains readable while the operational inputs change by environment.
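
One way to picture a profile is as an overlay on shared defaults. The sketch below merges a profile dictionary over a base dictionary, key by key. The keys and values are hypothetical and only echo the choices named above (NodePort versus ingress, generated traffic, pinned images, an existing kubeconfig). How Torque actually resolves profile values is not shown in this article, so treat this as an illustration.

```python
import copy

def overlay(base, profile):
    """Return base with profile values layered on top; nested dicts merge key by key."""
    merged = copy.deepcopy(base)
    for key, value in profile.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = overlay(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged

# Hypothetical keys, for illustration only.
base = {
    "exposure": {"type": "NodePort"},
    "traffic": {"generated": True},
    "images": {"pinned": False},
}
prod = {
    "exposure": {"type": "Ingress"},
    "traffic": {"generated": False},
    "images": {"pinned": True},
    "kubeconfig": "existing",
}

resolved = overlay(base, prod)
print(resolved)
```

The base dictionary is left untouched, so the same defaults can be combined with several profiles in turn.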

host.command.run:
  command: scripts/replay-backfill.sh apply
  deleteCommand: scripts/replay-backfill.sh delete
  timeout: 20m

The replay node consumes from the start of payments.raw, submits a Spark backfill workflow, writes a new Iceberg run id, and queries Trino to prove that the rows exist. This is why the stack is more useful than a directory of manifests. The deploy step and the data proof live in the same graph.
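
The command, deleteCommand, and timeout fields suggest a simple lifecycle: run one command within a time limit and treat a non-zero exit or a timeout as failure. The sketch below shows that idea with Python's subprocess module. It is an assumption about the general pattern, not Torque's implementation, and run_step and parse_timeout are hypothetical helper names. A stand-in command is used so the sketch runs without the replay script.

```python
import re
import subprocess
import sys

_UNITS = {"s": 1, "m": 60, "h": 3600}

def parse_timeout(text):
    """Convert a duration such as '20m' into seconds."""
    match = re.fullmatch(r"(\d+)([smh])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]

def run_step(argv, timeout):
    """Run one command; return True on exit code 0, False on failure or timeout."""
    try:
        result = subprocess.run(argv, timeout=parse_timeout(timeout))
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0

# Stand-in for ["scripts/replay-backfill.sh", "apply"] so the sketch runs anywhere.
ok = run_step([sys.executable, "-c", "print('apply')"], "20m")
print("step succeeded" if ok else "step failed")
```

Teardown would call the same helper with the delete argument, which keeps apply and delete symmetric.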

Verification

The final verification step checks public endpoints first: payments API, SigNoz, Ray, Spark, Flink, and Trino. It then runs inside the cluster to count S3 raw and curated objects, query ClickHouse decision and batch rows, check Redpanda schema subjects, query Trino over ClickHouse and Iceberg, and consume a sample risk message from Redpanda.

s3_raw_objects=4365
s3_curated_objects=11
trino_payment_decisions=632
iceberg_raw_payments=1100
iceberg_risk_events=1100
iceberg_batch_features=108
risk_high_watermark=855
backfill_run_id=replay-20260524130620
iceberg_backfill_batch_features=36

Those checks are the release evidence. The stack does more than install services. It proves that data entered the API, reached Redpanda, was scored through Flink and Ray, landed in ClickHouse, was rebuilt by Spark, was committed to Iceberg, and was readable through Trino. That is the practical difference between deploying a chart and deploying a platform.
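
Evidence in key=value form is easy to gate on. The sketch below parses the output lines from the verification run and flags any counter that is zero or any value that is empty. The pass rule is an assumption made for illustration; the actual thresholds used by the verify-e2e node are not stated in this article.

```python
# Output lines copied from the verification run shown earlier.
EVIDENCE = """\
s3_raw_objects=4365
s3_curated_objects=11
trino_payment_decisions=632
iceberg_raw_payments=1100
iceberg_risk_events=1100
iceberg_batch_features=108
risk_high_watermark=855
backfill_run_id=replay-20260524130620
iceberg_backfill_batch_features=36
"""

def parse_evidence(text):
    """Parse key=value lines; digit-only values become ints, others stay strings."""
    values = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        if not sep:
            continue  # skip blank or malformed lines
        values[key] = int(value) if value.isdigit() else value
    return values

def failed_checks(values):
    """ASSUMED rule: a check fails if its count is zero or its value is empty."""
    return [key for key, value in values.items() if value == 0 or value == ""]

evidence = parse_evidence(EVIDENCE)
failures = failed_checks(evidence)
print("release evidence ok" if not failures else f"failed: {failures}")
```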
