About AppDirect
Become a digital, global citizen and enable the new generation of digital entrepreneurs around the world. AppDirect offers a subscription commerce platform to sell any product, through any channel, on any device, as a service. We power millions of subscriptions worldwide for organizations. We do this through our values-driven culture, one that enables you to Be Seen, Be Yourself, and Do Your Best Work.
About the DevOps Platform Team
Our mission is to provide a robust Internal Developer Platform to AppDirect's engineering teams, one that makes it easy, safe, and fun to design, implement, release, and maintain the world's leading subscription commerce platform. We are proud to be core contributors to and maintainers of AppDirect's Software Development Lifecycle (SDLC), in close alignment with Reliability, Quality, Data, InfoSec, Cloud, and other technology leadership.
We enable DevOps culture through our self-service, automated CI/CD platform. Teams currently use the platform to make more than 3,000 code deliveries every month, to 700 applications, on AWS, Azure, and on-premises environments, while remaining ISO 27001, SOC2, and PCI compliant. Our Datadog instrumentation gives teams clear insights, monitoring, and alerting so they can maintain the availability of their experiences.
What you'll do and how you'll have an impact
Be the founding SRE for India within the DevOps Platform Team, establishing operating rhythms, guardrails, and best practices that raise reliability across hundreds of services and 30+ Kubernetes clusters.
Lead global incident management from India time zones: triage and drive resolution as Incident Commander, coordinate war rooms, manage stakeholder communications, and publish timely status page updates.
Maintain automations to enable on-call rotations, escalation policies, and incident workflows in PagerDuty, Datadog, and Slack.
Create actionable runbooks to reduce MTTA/MTTR.
Define and operationalize SLIs/SLOs and error budgets with product and engineering teams; coach teams on using error budgets for release decisions and reliability trade-offs.
Create high-signal observability: instrument services, tune alerts to reduce noise, and build reliability dashboards in Datadog.
Own planned maintenance: plan and schedule maintenance windows, coordinate execution across teams and environments (AWS, Azure, on-prem), communicate broadly, and verify recovery with clear rollback plans.
Eliminate toil through automation: build ChatOps, status page automation, auto-remediation workflows, and runbooks-as-code; integrate incident and maintenance workflows into CI/CD (Jenkins, Argo).
Drive production readiness: define PRR checklists, bake reliability gates into pipelines, and improve deployment strategies (blue/green, progressive delivery).
Partner with DevOps Platform Engineers to harden the Internal Developer Platform and improve developer experience while maintaining compliance requirements (e.g., ISO 27001, SOC2, PCI).
Lead blameless postmortems, track corrective actions, and maintain a reliability backlog that measurably improves availability, latency, and change success rate.
Mentor engineers and evangelize SRE principles through documentation, training, and a reliability guild/community of practice.
What we're looking for
4+ years in SRE/Production Engineering/DevOps operating distributed systems and microservices at scale, including Kubernetes and containerized workloads.
Proven incident response leadership: incident triage and coordination, clear stakeholder/customer communications, status page management, and creation of robust runbooks.
Strong observability skills: ideally in Datadog (metrics, logs, traces, dashboards, monitors) or familiarity with Prometheus/Grafana, New Relic, Dynatrace, or similar tools.
Expertise designing actionable alerts tied to SLIs/SLOs and managing error budgets.
Hands-on with CI/CD and release engineering: GitHub Actions, Argo (or similar), progressive delivery, feature flags, and safe rollout/rollback patterns.
Proficiency in at least one programming language (Golang preferred) plus Bash.
Ability to automate incident workflows,…
Position Requirements
10+ years of work experience