Senior Kafka Platform Engineer (Automation & Kubernetes)
Listed on 2026-03-01
IT/Tech
Cloud Computing, Systems Engineer, SRE/Site Reliability, Data Engineer
Join to apply for the Senior Kafka Platform Engineer (Automation & Kubernetes) role at Balyasny Asset Management L.P.
We’re seeking a seasoned Kafka engineer to design, operate, and scale our event streaming platform. You’ll own the Kafka core (brokers, storage, security, observability) and the automation that powers it—building infrastructure-as-code, operators/Helm charts, and CI/CD to enable safe, self‑service provisioning. You’ll run Kafka on Kubernetes and/or cloud‑managed offerings, ensure reliability and performance, and partner with application teams on best practices.
What you’ll do
- Architect, deploy, and operate production‑grade Kafka clusters (self‑managed and/or Confluent/MSK), including upgrades, capacity planning, multi‑AZ/region DR, and performance tuning.
- Operate Kafka on Kubernetes using Operators, Helm, and GitOps, and build IaC‑driven automation with guardrails for repeatable, compliant, zero‑downtime provisioning and deployments.
- Implement and manage Kafka Connect, Schema Registry, and MirrorMaker 2/Cluster Linking; standardize connectors (e.g., Debezium) and build self‑service patterns.
- Drive reliability: define SLOs/error budgets, on‑call rotations, incident response, postmortems, runbooks, and automated remediation.
- Implement observability: metrics, logs, traces, lag monitoring, and capacity dashboards (e.g., Prometheus/Grafana, Burrow, Cruise Control, OpenTelemetry).
- Secure the platform: TLS/mTLS, SASL (OAuth/SCRAM), RBAC/ACLs, secrets management, network policies, audit, and compliance automation.
- Guide event‑streaming best practices: topic design, partitioning, compaction/retention, idempotency, ordering, schema evolution/compatibility, DLQs, EOS semantics.
- Partner with app, data, and SRE teams; provide enablement, documentation, and internal tooling for a great developer experience.
- Lead/mentor engineers and contribute to roadmap, standards, and platform strategy.
- Excellent communication and partnership skills with platform and application teams.
- Deep hands‑on experience operating Kafka in production at scale (brokers, controllers, partitions, ISR, tiered storage/retention, rebalancing, replication, recovery).
- Automation first: Infrastructure as Code (Terraform), Helm, Operators, GitOps (Argo CD/Flux), and CI/CD (e.g., GitHub Actions/Jenkins) for platform lifecycle.
- Proficiency with one or more languages for tooling/automation: Python, Go, or Java; plus Bash and solid Linux fundamentals (networking, file systems, JVM tuning basics).
- Observability and reliability engineering for Kafka: Prometheus/Grafana, logging, alerting, lag monitoring, capacity/throughput modeling, performance tuning.
- Security for data in motion: TLS/mTLS, SASL/OAuth, ACL/RBAC, secrets management (e.g., Vault), and audit/compliance practices.
- Experience with Kafka ecosystem components: Kafka Connect, Schema Registry, MirrorMaker 2/Cluster Linking; familiarity with Cruise Control.
- Cloud experience (AWS/Azure/GCP) with networking, IAM, and one or more managed offerings (e.g., Confluent Cloud or AWS MSK).
- Proven track record designing runbooks, leading incidents/postmortems, and driving platform roadmaps.
- Data processing frameworks (Kafka Streams, Flink, Spark Structured Streaming) and EOS semantics.
- Experience with Strimzi or Confluent for Kubernetes in production.
- Knowledge of CDC patterns and tools (e.g., Debezium) and database connectors at scale.
- Multi‑region architectures, cluster linking strategies, and disaster recovery drills.
Seniority level: Associate
Employment type: Full‑time
Job function: Engineering, Finance, and Strategy/Planning
Industries: Financial Services, Capital Markets, and Investment Management