Site Reliability Engineer

Job in Bengaluru, 560001, Bangalore, Karnataka, India
Listing for: Pocket FM
Full Time position
Listed on 2026-02-26
Job specializations:
  • IT/Tech
    Cloud Computing, Systems Engineer, Cybersecurity, AI Engineer
Position: Staff Site Reliability Engineer
Location: Bengaluru

Pocket FM is a leading audio entertainment platform that brings engaging, serialized fiction to millions of listeners across genres like romance, thriller, fantasy, and more. With over 130 million users globally and strong traction in markets like the US and Europe, we’re revolutionizing storytelling through audio.
Our unique model combines free listening with micropayments for premium content, powering strong business growth. In FY25, we reached an ARR of INR 2,000 crore, with over 100,000 hours of content on the platform. We're also at the forefront of innovation, leveraging AI-generated content to scale efficiently.

Role Overview
We are looking for a Staff SRE to lead reliability engineering efforts while driving AI-native solutioning and platform strategy. This role requires a blend of deep SRE expertise, distributed systems knowledge, applied AI/ML understanding, and strong security fundamentals to build resilient, scalable, intelligent, and secure infrastructure.
You will play a key role in shaping how AI-powered systems are designed, deployed, monitored, optimized — and secured — across the organization.

The Role:

What You Build and Own

SRE & Platform Engineering
Design, build, and operate highly reliable, scalable distributed systems
Define and implement SLIs, SLOs, SLAs, and error budgets
Lead incident management, root cause analysis (RCA), and postmortems
Drive an automation-first approach for operations, deployment, and recovery
Improve observability (logs, metrics, tracing) across systems

AI-Native Solutioning
Architect and implement AI-driven operational workflows (AIOps)
Build systems leveraging LLMs, intelligent automation, and predictive analytics
Integrate AI into monitoring, alerting, anomaly detection, and remediation
Evaluate and adopt AI-powered developer and SRE tooling (e.g., LLM-based copilots, auto-debugging tools)

Information Security & Resilience
Embed security-by-design principles into infrastructure and platform architecture
Partner with Security teams to implement cloud security best practices (IAM, RBAC, network segmentation, encryption)
Lead secure configuration and hardening of Kubernetes clusters and cloud environments
Implement and maintain DevSecOps practices across CI/CD pipelines
Drive vulnerability management, patching strategy, and secure dependency management
Define and monitor security-related SLIs/SLOs (e.g., patch latency, vulnerability remediation time)
Implement runtime security, anomaly detection, and threat monitoring for AI and distributed systems
Ensure compliance with relevant frameworks (SOC 2, ISO 27001, GDPR, etc.)
Conduct security reviews, threat modeling, and participate in incident response for security events
Secure AI/ML systems, including model security, prompt injection mitigation, data protection, and access controls

Strategy & Leadership
Define and drive AI-native SRE strategy and roadmap
Partner with engineering, platform, product, and security teams to embed reliability and security by design
Mentor engineers and establish best practices for SRE + AI + Security integration
Lead initiatives for cost optimization, performance tuning, system resilience, and risk reduction

The Ideal Candidate: Who You Are

Experience
8–12+ years in SRE / DevOps / Platform Engineering
Proven experience operating production-grade distributed systems at scale

Strong Experience With
Cloud platforms (AWS / GCP)
Kubernetes & container orchestration
Infrastructure as Code (Terraform, etc.)
CI/CD systems and automation frameworks

Deep Understanding Of
Distributed systems, scalability, and fault tolerance
Observability tools (Prometheus, Grafana, Datadog, OpenTelemetry)
Incident management frameworks and reliability engineering best practices
Cloud security architecture and DevSecOps principles

Programming
Strong programming experience in Python / Go

Your AI/ML Toolkit

Hands-On Experience With
LLMs (OpenAI, open-source models, etc.)
AI/ML pipelines or inference systems

Understanding Of
Prompt engineering, embeddings, vector databases
AI-driven automation or AIOps platforms
Secure AI system design and model lifecycle governance

Experience Integrating AI Into
Monitoring / alerting
Incident response
Developer productivity workflows
Security monitoring and anomaly detection