External Services
This guide covers how to point the Rulebricks Helm chart at your own Kafka and Redis infrastructure instead of using the bundled instances. It is written for platform engineers deploying Rulebricks into a self-hosted Kubernetes environment.
Supabase externalization is not covered here; see the Supabase section at the end of this document for the rationale.
Architecture Overview
Self-hosted Rulebricks deployments are split into two services:
- App — the main application. Serves the dashboard UI, certain admin APIs, rule/flow editor, and user management. This is the control plane.
- HPS (High Performance Server) — a performance-optimized server dedicated to handling rule execution at scale. When a client calls the solve API, Traefik routes the request to HPS rather than the main app. HPS is built for high throughput and low latency.
HPS uses a correlation-ID request/response pattern over Kafka. When a solve request arrives:
- HPS produces a message to the `solution` Kafka topic with a unique correlation ID and a designated response partition.
- A worker pod consumes the message, evaluates the rule, and produces the result to the `solution-response` topic on the exact partition the originating HPS replica is listening on.
- HPS resolves the pending request and returns the result to the caller.
This architecture means throughput is bottlenecked primarily by rule-evaluation compute rather than per-request network overhead. Clients submit bulk payloads (typically 500–1,000 items per request), and workers process them as capacity allows. The idempotent Kafka producer guarantees exactly-once delivery at the broker level, and the correlation-ID mechanism ensures responses always route back to the correct HPS replica — giving you a request-idempotent server-side architecture consumable through standard REST APIs and SDKs.
Redis sits in front of Supabase as a shared cache layer. API key authentication, rule/flow definitions, and named-environment lookups are all cached in Redis with short TTLs (60–180 seconds), backed by an in-process LRU per pod. In practice, Supabase sees very few direct queries under normal operation.
For a deeper walkthrough of component interactions and request flow, see the Architecture & Operations Guide.
Externalizing Kafka
How Rulebricks Uses Kafka
Rulebricks uses three Kafka topics:
| Topic | Purpose | Producers | Consumers |
|---|---|---|---|
| solution | Inbound solve requests | HPS API pods | Worker pods (generic-workers consumer group) |
| solution-response | Outbound results routed back to originating HPS replica | Worker pods | HPS API pods (hps-response-consumer consumer group) |
| logs | Structured decision logs | HPS and app pods | Vector aggregator (vector-consumers consumer group) |
The logs topic is only truly optional if both `vector.enabled: false` and `rulebricks.app.logging.enabled: false`. In the default chart configuration the Vector pod is deployed and consumes this topic, so it needs to exist on your cluster. See Vector and the logs topic below.
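For example, a log-less deployment (no Vector pod, and no logs topic required on your cluster) would set both flags in the values override; this is a sketch based on the flag names above:

```yaml
# Disable the Vector subchart and decision logging together;
# with both off, the logs topic does not need to exist.
vector:
  enabled: false

rulebricks:
  app:
    logging:
      enabled: false
```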
Messages are plain JSON — no schema registry, no Kafka Connect, no custom broker plugins. The client library is KafkaJS, which speaks the standard Kafka wire protocol.
Helm Values for External Kafka
To switch from the bundled Kafka to your own cluster, set the following in your values override:
```yaml
kafka:
  enabled: false

rulebricks:
  app:
    logging:
      enabled: true
      kafkaBrokers: 'broker-1.example.com:9092,broker-2.example.com:9092'

vector:
  env:
    - name: KAFKA_BOOTSTRAP_SERVERS
      value: 'broker-1.example.com:9092,broker-2.example.com:9092'
```

The `kafkaBrokers` value flows into the `KAFKA_BROKERS` environment variable on HPS, workers, and the main app via the shared ConfigMap. When `kafkaBrokers` is empty (the default), the chart auto-generates the internal cluster address `<release>-kafka.<namespace>.svc.cluster.local:9092`.
Vector is a separate subchart and does not read rulebricks.app.logging.kafkaBrokers. Its bootstrap address comes from vector.env[KAFKA_BOOTSTRAP_SERVERS], which is substituted into vector.customConfig.sources.kafka.bootstrap_servers at render time (default: rulebricks-kafka:9092). If you externalize Kafka without updating vector.env, HPS and workers will happily talk to the new broker while the Vector pod idles on the old in-cluster address — decision logs silently stop reaching ClickHouse / S3 / your SIEM, even though solves continue to succeed. See Vector and the logs topic for verification steps.
The KEDA ScaledObject for worker autoscaling also reads kafkaBrokers and will point at your external cluster automatically (see KEDA Autoscaling with External Kafka).
The current HPS codebase instantiates the KafkaJS client without `ssl` or `sasl` options. If your Kafka cluster requires TLS or SASL authentication (AWS MSK with IAM/SCRAM, Confluent Cloud, etc.), a small code change in the HPS image is needed to wire those credentials through. Contact Rulebricks support if your cluster requires authenticated connections.
Topics to Pre-Create
The HPS producer sets `allowAutoTopicCreation: true`, but auto-created topics inherit the broker's `default.num.partitions` — often just 1. A single-partition `solution-response` will cause request timeouts under any meaningful replica count. Always pre-create topics explicitly.
Ask your Kafka team to create:
| Topic | Partition Count | Notes |
|---|---|---|
| solution | ≥ number of worker replicas (e.g. 12) | More partitions = more parallelism for workers |
| solution-response | Must match hps.workers.keda.maxReplicaCount | See Partition Sizing below |
| logs | 2–4 is plenty | Required when the Vector pod is deployed (chart default). Vector does not benefit from high partition counts the way the solve path does. |
Retention can be short (hours). Messages are consumed almost immediately. Replication factor of 2–3 is recommended for production.
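With the stock Kafka CLI, the three topics can be pre-created along these lines. The broker address, replication factor, and the solution partition count of 12 are illustrative; size solution-response to your actual maxReplicaCount:

```shell
BROKERS=broker-1.example.com:9092   # illustrative broker address

# Partition counts follow the sizing table above; adjust to your replica counts.
kafka-topics.sh --bootstrap-server "$BROKERS" --create \
  --topic solution --partitions 12 --replication-factor 3

# Partitions here must equal rulebricks.hps.workers.keda.maxReplicaCount.
kafka-topics.sh --bootstrap-server "$BROKERS" --create \
  --topic solution-response --partitions 12 --replication-factor 3

# Short retention is fine; messages are consumed almost immediately.
kafka-topics.sh --bootstrap-server "$BROKERS" --create \
  --topic logs --partitions 4 --replication-factor 3 \
  --config retention.ms=21600000   # 6 hours
```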
Partition Sizing for solution-response
The solution-response topic is partition-sensitive. HPS replicas share a consumer group over this topic, so each replica is assigned a subset of partitions. When producing a response, the worker writes to the exact partition the originating HPS replica is consuming. If the topic has fewer partitions than expected, responses land on partitions no replica is watching, causing 30-second timeouts.
The expected partition count is set in the Helm values:
```yaml
rulebricks:
  hps:
    workers:
      keda:
        maxReplicaCount: 12
```

The HPS StatefulSet template passes `maxReplicaCount` to HPS as the `MAX_WORKERS` environment variable.
Keep `maxReplicaCount` equal to the actual partition count of `solution-response` on your external cluster. A mismatch here is the most common cause of 30-second timeouts on solve requests.
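To confirm the live topic matches your values, count the partitions the broker reports and compare against maxReplicaCount. A sketch using the stock Kafka CLI (broker address is illustrative):

```shell
# Count "Partition:" lines in the describe output; the number printed
# must equal rulebricks.hps.workers.keda.maxReplicaCount.
kafka-topics.sh --bootstrap-server broker-1.example.com:9092 \
  --describe --topic solution-response | grep -c 'Partition:'
```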
ZooKeeper vs KRaft
HPS does not care how your Kafka cluster manages metadata. It never connects to ZooKeeper directly — it only speaks the Kafka wire protocol to brokers via a bootstrap address. Both ZooKeeper-backed clusters and KRaft-mode clusters work identically. Managed services like AWS MSK (in either mode), Confluent Cloud, Aiven, and Redpanda are all compatible.
The bundled Kafka subchart ships in KRaft mode (kraft.enabled: true, zookeeper.enabled: false), but this is a deployment choice for the internal broker — it has no bearing on external cluster compatibility.
Tuning and Idempotency
The HPS Kafka client is pre-tuned for a low-latency request/response workload. These settings are baked into the application and do not need external configuration, but are worth understanding:
Producer:
- Idempotent mode (`idempotent: true`, `acks: -1`) — guarantees exactly-once produce semantics per session. If a network blip causes a retry, Kafka deduplicates the message at the broker. This is Kafka-level idempotency, not HTTP-level: a client retrying an HTTP request produces a new message with a new correlation ID.
- Snappy compression — reduces wire bytes. The broker must support Snappy (enabled by default on Apache Kafka and Confluent). If your cluster has disabled Snappy, contact Rulebricks support.
- `lingerMs: 0` — send immediately. Latency is prioritized over batch throughput because these are synchronous, user-facing requests.
Consumer:
- `sessionTimeout: 60000`, `heartbeatInterval: 15000` — tolerates brief idle periods without spurious rebalancing.
- `maxWaitTimeInMs: 50` — broker returns fetched messages quickly for low-latency response delivery.
- Workers process up to 2 partitions concurrently per pod. Throughput scales by adding worker replicas, not by raising concurrency.
No tuning is needed on the external Kafka side beyond ensuring the three topics exist with correct partition counts and reasonable ISR settings (e.g., min.insync.replicas=1 for single-broker, 2 for multi-broker production).
KEDA Autoscaling with External Kafka
The KEDA ScaledObject for HPS workers monitors consumer lag on the solution topic. When you externalize Kafka, KEDA's bootstrapServers is automatically derived from your kafkaBrokers value. No additional KEDA configuration is needed.
However, ensure KEDA can reach your external Kafka brokers from within the cluster (network policies, VPC peering, security groups, etc.). If your Kafka requires SASL, KEDA's Kafka trigger also needs authentication — see the KEDA Kafka trigger docs (opens in a new tab) for details.
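For SASL-protected clusters, KEDA's Kafka trigger typically references a TriggerAuthentication resource. A minimal sketch, assuming a kafka-credentials Secret holding username, password, and the SASL mechanism; consult the KEDA docs for the mechanisms your broker supports:

```yaml
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: kafka-sasl-auth
spec:
  secretTargetRef:
    - parameter: sasl          # e.g. scram_sha512 or plaintext
      name: kafka-credentials  # assumed Secret name
      key: sasl
    - parameter: username
      name: kafka-credentials
      key: username
    - parameter: password
      name: kafka-credentials
      key: password
```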
Vector and the logs topic
The Rulebricks chart ships a Vector pod by default as the consumer of the logs topic. HPS and the main app produce structured decision-log entries to Kafka after each request completes (non-blocking, post-response); Vector reads them and forwards to whatever sink the chart is configured for — commonly ClickHouse, S3, or an HTTP endpoint into a SIEM.
A few things to know when externalizing Kafka:
- Consumer group. Vector joins Kafka as `vector-consumers` (configured in `vector.customConfig.sources.kafka.group_id`). This is the group ID to use in `kafka-consumer-groups.sh` commands and any ACL rules.
- Consume-only. In the default chart configuration Vector only reads from the `logs` topic; it does not produce back to Kafka. If your cluster enforces ACLs, the Vector principal needs `Read` and `Describe` on `logs` with group ID `vector-consumers`, and no `Write`.
- Network reachability. Vector must be able to reach the external brokers from inside the cluster, the same as HPS and the workers. Check network policies, security groups, and VPC peering accordingly.
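Under ACL enforcement, the grants described above could be issued along these lines. The principal name User:vector and the broker address are illustrative:

```shell
BROKER=broker-1.example.com:9092   # illustrative

# Read + Describe on the logs topic for the Vector principal...
kafka-acls.sh --bootstrap-server "$BROKER" --add \
  --allow-principal User:vector \
  --operation Read --operation Describe \
  --topic logs

# ...and Read on its consumer group. No Write grants are needed.
kafka-acls.sh --bootstrap-server "$BROKER" --add \
  --allow-principal User:vector \
  --operation Read \
  --group vector-consumers
```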
To verify Vector is healthy after pointing it at external Kafka:
```shell
kubectl get pods | grep vector
kubectl logs <vector-pod> | grep -i "kafka\|partition"
kafka-consumer-groups.sh --bootstrap-server <broker> \
  --describe --group vector-consumers
```

The pod should be Running, the logs should show a successful Kafka connection and partition assignment at startup, and consumer-group lag on `logs` should stay near zero under normal load. Growing lag almost always means Vector cannot reach the brokers or is missing ACL permissions.
Disabling Kafka Entirely
Setting `kafka.enabled: false` without providing external `kafkaBrokers` effectively disables HPS. All solve endpoints return HTTP 503; only the `/health` endpoint continues to respond. This is only useful for running a control-plane-only deployment (dashboard and admin APIs without rule execution).
It also breaks the decision-log pipeline: Vector loses its data source, so any downstream sink (ClickHouse, S3, SIEM) will see a complete gap while Kafka is disabled. See Vector and the logs topic for the consumer-side details.
Externalizing Redis
How Rulebricks Uses Redis
Redis serves as a shared cache between all Rulebricks components — the main app, HPS, and workers. It sits in a three-tier caching hierarchy:
- L1 — In-process LRU (per pod, always present)
- L2 — Redis (shared across all pods and replicas)
- L3 — Supabase (source of truth, queried on full cache miss)
What lives in Redis:
| Data | TTL | Written By |
|---|---|---|
| API key auth payloads | 60s | HPS |
| Rule/flow definitions (compressed) | 180s | HPS, workers |
| Named-environment release mappings | No expiry | Main app (not HPS) |
| Flow node results (API, SOAP, DB, vault) | 1–300s | Workers |
Redis is accessed using only basic commands: GET, SET with EX, EXPIRE, and DEL. No Lua scripts, no pub/sub, no streams. A vanilla Redis instance is all that's needed — no clustering, persistence configuration, or eviction policy requirements.
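That access pattern can be mimicked with redis-cli to sanity-check an instance before wiring it in. The key names here are made up for illustration, not the actual cache keys:

```shell
# SET with a TTL, adjust expiry, read back, delete: the full command
# surface Rulebricks needs from Redis.
redis-cli -h your-redis.example.com SET apikey:demo '{"team":"t_1"}' EX 60
redis-cli -h your-redis.example.com EXPIRE apikey:demo 120
redis-cli -h your-redis.example.com GET apikey:demo
redis-cli -h your-redis.example.com DEL apikey:demo
```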
What You Provide
To use your own Redis instance, you provide one thing: a Redis host and port.
```yaml
rulebricks:
  redis:
    enabled: false
    external:
      host: 'your-redis.example.com'
      port: 6379
      password: 'your-password'
      # existingSecret: "my-redis-secret"
      # existingSecretKey: "redis-password"
      tls:
        enabled: false
```

Setting `redis.enabled: false` stops the chart from deploying its own internal Redis pod and PVC. The chart takes care of everything else — all internal components are wired to your Redis automatically.
What the Chart Does Behind the Scenes
The Rulebricks stack has two types of Redis consumers:
- HPS and workers connect to Redis directly via the native Redis protocol (`ioredis`). This is the fast path (~1–2 ms per operation) and is configured via the `REDIS_URL` environment variable, which the chart constructs from your `external.host` and `external.port`.
- The main app (Next.js) uses an HTTP-based Redis client (`@vercel/kv`). It cannot connect to Redis natively, so it needs an HTTP translation layer.
To bridge this, the chart always deploys a lightweight internal proxy (serverless-redis-http) that speaks HTTP on one side and the Redis protocol on the other. When you externalize Redis, this proxy is automatically pointed at your external instance — you do not need to configure or think about it. It's an internal implementation detail.
In short: you provide a standard Redis endpoint, and the chart handles routing each component to it through the appropriate protocol.
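As a rough sketch of what the fast-path connection string looks like when assembled from the external.* values. The redis:// vs rediss:// scheme choice is an assumption based on standard Redis URL conventions, not chart source:

```shell
# Illustrative values matching the override example above.
HOST="your-redis.example.com"
PORT=6379
PASSWORD="your-password"
TLS_ENABLED=false

# rediss:// (TLS) vs redis:// (plaintext) per Redis URL convention.
SCHEME=$([ "$TLS_ENABLED" = "true" ] && echo "rediss" || echo "redis")
REDIS_URL="${SCHEME}://:${PASSWORD}@${HOST}:${PORT}"
echo "$REDIS_URL"
# → redis://:your-password@your-redis.example.com:6379
```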
Connection Examples
| Provider | host | port | tls.enabled |
|---|---|---|---|
| AWS ElastiCache (no TLS) | primary.my-cluster.use1.cache.amazonaws.com | 6379 | false |
| AWS ElastiCache (TLS) | primary.my-cluster.use1.cache.amazonaws.com | 6379 | true |
| AWS MemoryDB | clustercfg.my-cluster.use1.memorydb.amazonaws.com | 6379 | true |
| Redis Cloud | redis-12345.c1.us-east-1.redns.redis-cloud.com | 12345 | true |
| GCP Memorystore | 10.x.x.x | 6379 | false |
| Self-hosted (another namespace) | redis-svc.other-ns.svc.cluster.local | 6379 | false |
The Redis client is pre-configured with TCP keepalive, auto-pipelining, and exponential-backoff retries. No client-side tuning is typically required.
Disabling Redis Entirely
If `redis.enabled` is false and no external host is provided, Redis operates in a no-op mode: cache writes are silently discarded and reads return empty. HPS will start and log a warning, but the consequences are significant:
- Every authentication and entity lookup hits Supabase directly. Expect a ~60x increase in database queries at steady state.
- Named-environment URLs stop working. Requests like `/api/v1/solve/my-rule/prod` return 404 because the `releases_*` cache keys are only written by the main app into Redis. Numeric versions and `latest` still work.
- Flow-node caching is lost. API, SOAP, and database nodes inside flows execute on every invocation.
Supabase
While technically feasible, pointing Supabase at an external PostgreSQL instance is not currently supported by Rulebricks, because Supabase depends on a very specific PostgreSQL configuration (extensions, roles, and internal schemas).
We can, however, help you self-host the Supabase stack, which includes the following services:
- PostgreSQL — stores users, teams, rules, flows, API keys, usage data, and all application state.
- GoTrue (Auth) — handles user signup, login, password recovery, email verification, SSO/OIDC, and JWT issuance. The main app delegates all authentication to Supabase Auth.
- PostgREST — provides the REST API layer over PostgreSQL that the application queries.
- Realtime — powers live-update features in the dashboard.
- Kong — API gateway that routes and authenticates requests to the above services.
Beyond providing a clear database layer for Rulebricks, Supabase is our interface for authentication, JWT management, and real-time features. There are also database triggers between Supabase-managed identity tables (e.g., auth.users) and Rulebricks-managed application tables that depend on Supabase's internal wiring. In any case, Rulebricks must maintain an internal users table to support API key management, usage tracking, tenancy, and role-based access control.
It is worth noting the database is likely not a bottleneck. Because Redis aggressively caches entity definitions, and every pod has an in-process LRU in front of Redis, actual database read/write volumes under normal operation are generally minimal.
Database Backup Recommendations
While the Supabase PostgreSQL instance is lightweight, it holds all application state and should be backed up regularly if self-hosting. The bundled deployment uses a PersistentVolumeClaim, so data survives pod restarts — but PVCs are not a meaningful backup strategy.
Scheduled pg_dump via CronJob
Create a Kubernetes CronJob that runs pg_dump against the Supabase database and stores the output in S3 or a persistent volume:
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: supabase-db-backup
spec:
  schedule: '0 2 * * *' # Daily at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: pg-dump
              image: postgres:15
              command:
                - /bin/sh
                - -c
                - |
                  PGPASSWORD=$DB_PASSWORD pg_dump \
                    -h <release>-supabase-db.<namespace>.svc.cluster.local \
                    -U postgres -d postgres \
                    --format=custom \
                    -f /backup/rulebricks-$(date +%Y%m%d-%H%M%S).dump
              env:
                - name: DB_PASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: <release>-supabase-db
                      key: password
              volumeMounts:
                - name: backup-vol
                  mountPath: /backup
          restartPolicy: OnFailure
          volumes:
            - name: backup-vol
              persistentVolumeClaim:
                claimName: supabase-db-backup
```

Replace `<release>` and `<namespace>` with your Helm release name and namespace. For S3, substitute the volume mount with an `aws s3 cp` command using IRSA credentials.
EBS Snapshots
If running on AWS with EBS-backed PVCs, schedule EBS snapshots of the Supabase database volume. This captures a point-in-time copy at the block level without impacting the running database.
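With the CSI snapshot controller installed, a snapshot can also be requested through the Kubernetes API instead of the EC2 console. A sketch, assuming an ebs-csi-snapclass VolumeSnapshotClass exists and guessing at the database PVC name (verify the actual claim name with kubectl get pvc):

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: supabase-db-snapshot
spec:
  volumeSnapshotClassName: ebs-csi-snapclass      # assumed class name
  source:
    persistentVolumeClaimName: <release>-supabase-db   # assumed PVC name
```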
Verification Checklist
After switching to external Kafka or Redis, verify the deployment is healthy:
1. HPS health endpoint
```shell
kubectl port-forward svc/<release>-hps 3000:3000
curl -s http://localhost:3000/health | jq
```

Expected:
```json
{
  "status": "ok",
  "redis": true,
  "kafka": {
    "enabled": true,
    "ready": true,
    "partitions": 4
  }
}
```

- `redis: true` confirms the KV client connected. If `false`, check your `REDIS_URL` or `redis.external.*` values.
- `kafka.ready: true` confirms the consumer joined its group. If `false` on startup, retry after 10–15 seconds — group join can take a moment.
- `kafka.partitions` shows how many `solution-response` partitions this pod owns. In a 3-replica HPS against a 12-partition topic, each pod should report ~4.
2. End-to-end smoke test
```shell
curl -X POST http://localhost:3000/api/v1/solve/<rule-slug>/latest \
  -H "x-api-key: $API_KEY" \
  -H "content-type: application/json" \
  -d '{"input": "value"}'
```

A 200 response confirms Kafka, Redis, Supabase, and the worker pipeline are all healthy.
3. Common failure modes
| Symptom | Likely Cause | Fix |
|---|---|---|
| All solve requests return 503 | Kafka disabled or unreachable | Check kafkaBrokers, verify network connectivity to brokers |
| Requests timeout after 30s | solution-response partition count ≠ maxReplicaCount | Recreate topic with correct partition count, or update maxReplicaCount to match |
| Named-environment URLs return 404 | Redis not connected, or main app and HPS point at different Redis instances | Verify redis: true on /health; ensure shared Redis |
| Persistent consumer lag on solution | Not enough worker replicas | Scale up hps.workers.replicas or hps.workers.keda.minReplicaCount |
| Decision logs stop reaching ClickHouse / S3 / SIEM after externalizing Kafka | Vector still pointing at the old in-cluster broker | Set vector.env[KAFKA_BOOTSTRAP_SERVERS] to the external broker list; restart the Vector pod |