Platform Engineer Job, Fort Worth area, Texas, USA, IT/Tech

Position: Staff Platform Engineer

Responsibilities

Architect and ship multitenant, planet-scale services (checkout, subscriptions, payments orchestration, invoicing, and tax hooks) with clear domain boundaries (DDD) and hard SLOs.
Be the custodian of API and schema design: own protobuf/ConnectRPC conventions, versioning policy, deprecation playbooks, and Buf breaking-change checks so our contracts stand the test of time.
Guarantee resilience and availability of core payment paths: timeouts, retries with jitter, circuit breakers, idempotency keys, outbox/Saga patterns, hedged requests, and graceful degradation.
Ensure complete auditability: append-only double-entry ledger, immutable event streams, trace-linked entities (OTel trace/span IDs), tamper-evident trails, and reconciliations that tie out to the cent.
Own error boundaries end-to-end: enumerate failure domains (PSP, network, data, concurrency, quota, browser, device); design uniform error contracts; implement compensations/backfills and automated replay.
Keep track of every deployed thing: services, workers, triggers, cron, subscriptions, the service catalog, and scorecards (owners, SLOs, runbooks, PDBs, HPA/VPA, budgets, quotas, and timeouts).
Configuration and limits stewardship: enforce sane defaults across GKE, Pub/Sub, Redis, Firestore/Mongo, and ClickHouse: connection pools, ACK deadlines, batch sizes, TTLs, memory/FD limits, and GCP quotas.
Observability as a product: pervasive OpenTelemetry, RED/USE metrics, exemplars, trace sampling, SLO dashboards, and alerting that wakes humans only for user-impacting issues.
Production excellence: canary/blue-green rollouts, automated rollbacks, chaos drills, DR playbooks (RPO/RTO), multiregion failover strategies, and incident command on rotation.
Security and compliance by design: PCI scope minimization, tokenization/vaulting, secrets/KMS hygiene, data retention/archival, and privacy controls embedded as checks in CI/CD.
Developer acceleration: pave golden paths (service templates, ADR/RFC process, linting/formatting, contract tests, ephemeral envs, load/perf harnesses) to make the right thing the easy thing.
Core domain evolution: evolve orchestration, ledger, and reconciliation flows with crisp invariants and consistency guarantees (read-your-writes where needed, eventual where appropriate).
Reliability strategy: SLIs/SLOs, error budgets, capacity planning, cost/FinOps guardrails, multiregion posture, and DR exercises.
API and data governance: canonical models, schema lifecycle (compatibility matrix, migrations), and data lifecycle (retention, archival, and compliance).
Practice leadership for High Level: design reviews, postmortems, technical strategy, coding standards, and mentorship across teams to raise the bar for the org.
Hiring and team growth: help us hire, scale, and train the right team; shape interview loops, rubrics, onboarding, and ongoing learning (brown bags, reviews, and pair design).
Cross-functional partnership: collaborate with Product/Marketing/Support to translate platform capabilities and constraints into roadmaps, GTM narratives, and reliable customer outcomes.
Risk and roadmap: maintain a technical risk register, make build vs. buy calls, and propose simplifications or deprecations that meaningfully reduce complexity and MTTR.

Requirements

10+ years building and operating backend systems (at least 5 years in Go), with 2-3+ years acting as a staff/principal-level IC or tech lead for critical paths.
Deep proficiency with protobuf + ConnectRPC/gRPC and API lifecycle management (versioning, compatibility, contract testing, Buf).
Distributed systems fundamentals: idempotency, exactly-once semantics via dedupe/outbox, ordering, consensus basics, backpressure, and concurrency control.
Event-driven architectures on GCP (Pub/Sub), plus Redis for fast paths; strong schema design in MongoDB/Firestore and analytics/reporting patterns on ClickHouse.
Kubernetes/GKE operations at scale: autoscaling (HPA/VPA), PDBs, resource limits/requests, multiregion topologies, CI/CD, and canary/blue-green.
Reliability engineering: SLIs/SLOs, error budgets, capacity and load testing, incident management, and DR/BCP.
Security and compliance: secrets/KMS best practices, PCI basics (scope reduction, key rotation), and data governance (retention/archival).
Testing discipline: unit, integration, contract, property-based, and performance; test data management and deterministic environments.
Frontend collaboration: solid understanding of Vue.js + TanStack Query to shape clean API surfaces and performance budgets across the boundary.
Exceptional technical writing and communication: design docs, ADRs/RFCs, postmortems, and stakeholder updates.

Nice to have

Hands-on integrations with major PSPs/local rails (e.g., UPI, wallets, BNPL, cards/3DS2) and reconciliation at scale.
Experience with active-active or multiregion designs, chaos engineering, and traffic management.
Observability leadership with OpenTelemetry at org scale (tail-based sampling, exemplars).
FinOps experience: cost baselining, quotas, budget…

