Senior Site Reliability Engineer (SRE)
Company: Pocket FM
About the Role
Pocket FM is a global audio entertainment platform serving millions of listeners across multiple geographies. We are looking for an experienced Senior Site Reliability Engineer (SRE) to ensure the reliability, scalability, and performance of our large-scale audio streaming platform, which is built on a Kubernetes-first, cloud-native architecture.
In this role, you will own platform stability, improve operational excellence, and work closely with engineering teams to deliver a seamless listening experience to users worldwide.
Key Responsibilities
Reliability & Engineering Excellence
Own and improve the reliability, availability, and performance of globally distributed, Kubernetes-based production systems.
Define and continuously improve SLIs, SLOs, and SLAs using metrics derived from Prometheus and Grafana.
Drive reliability best practices across the entire software development lifecycle.
Kubernetes & Platform Operations
Operate and scale production-grade Kubernetes clusters (EKS/GKE) running critical audio streaming and backend services.
Troubleshoot complex production issues across pods, nodes, networking, storage, and the Kubernetes control plane.
Implement autoscaling, rollout strategies, and resilience patterns for containerized workloads.
CI/CD & GitOps
Own and improve CI/CD pipelines using GitHub Actions and Jenkins to ensure safe, reliable, and repeatable deployments.
Implement and operate GitOps workflows using Argo CD for Kubernetes application and configuration management.
Enforce deployment best practices including canary, blue-green, and rollback strategies.
Observability & Monitoring
Build and maintain a strong observability stack using Prometheus (metrics), Grafana (visualization), and Loki (logs).
Design effective alerting strategies that reduce noise and improve signal quality.
Use observability insights to drive performance tuning, capacity planning, and reliability improvements.
Incident Management & Operational Excellence
Lead and participate in incident response for platform, Kubernetes, and database-related issues.
Perform post-incident reviews (PIRs) with clear root cause analysis and preventive actions.
Improve on-call readiness, runbooks, and operational maturity for 24x7 global systems.
Databases & State Management
Support and improve reliability of MySQL in production, including monitoring, backups, failover, and performance tuning.
Collaborate with backend teams on schema changes, query performance, and scaling strategies.
Infrastructure & Automation
Design and manage cloud infrastructure integrated with Kubernetes using Infrastructure-as-Code (Terraform).
Automate operational tasks using Python and/or Go to reduce toil and improve system resilience.
Drive cost and capacity optimization across cloud and Kubernetes environments.
Collaboration & Innovation
Work closely with backend, mobile, data, product, and QA teams to embed reliability principles early.
Contribute to Pocket FM’s engineering roadmap with a focus on scale, resilience, and operational efficiency.
Apply modern SRE and cloud-native best practices pragmatically in production.
Required Skills & Experience
Experience
3+ years of experience in Site Reliability Engineering or platform engineering roles.
Proven experience operating large-scale, Kubernetes-based, consumer-facing systems.
Technical Expertise (Must-Have)
Strong hands-on expertise with Kubernetes in production environments.
Experience with Prometheus, Grafana, and Loki for monitoring, alerting, and logging.
Strong experience with CI/CD systems such as GitHub Actions and Jenkins.
Hands-on experience with GitOps workflows using Argo CD.
Solid experience managing and supporting MySQL in production.
Strong experience with AWS and/or GCP.
Proficiency in Python and/or Go.
Strong Infrastructure-as-Code experience using Terraform.
Solid understanding of Linux, networking, and cloud security fundamentals.
Preferred Qualifications
Kubernetes certifications (CKA / CKAD / CKS).
Cloud certifications (AWS / GCP).
Experience supporting platforms with millions of users across multiple regions.
Familiarity with structured incident management practices.
Why Pocket FM?
Pocket FM is a global product with a rapidly growing international user base, offering the opportunity to work deeply across Kubernetes, observability, and GitOps while solving complex reliability challenges at scale.
Position Requirements
10+ years of work experience