Senior Site Reliability Engineer, Bengaluru, Karnataka, India (IT/Tech)

Location: Bengaluru

Position Summary:

We are looking for a Senior Site Reliability Engineer (SRE) with deep expertise in observability, cloud-native infrastructure, and large-scale distributed systems. This role is highly hands-on and focuses on designing, building, and operating reliable, observable, and scalable platforms running on Kubernetes, preferably on Google Cloud Platform (GCP), or on AWS.

Job Responsibilities:

Design, implement, and operate highly available and resilient Kubernetes-based systems.
Define, monitor, and enforce SLIs, SLOs, and error budgets to ensure service reliability.
Lead incident response, root cause analysis (RCA), and postmortems, driving continuous improvement.
Architect and manage observability platforms for metrics, logging, tracing, and alerting.
Work hands-on with Prometheus, Alertmanager, OpenTelemetry, Grafana, and Loki / ELK / OpenSearch.
Implement cloud-native monitoring and logging, with preference for GCP Cloud Monitoring & Logging.
Establish actionable alerting standards to reduce noise and improve response effectiveness.
Build and manage cloud infrastructure on GCP (preferred) or AWS.
Operate and scale Kubernetes clusters (GKE preferred) and deploy services using Helm.
Manage containerized workloads using Docker.
Develop automation and internal tooling using Python to improve reliability and observability.
Integrate CI/CD pipelines with reliability and monitoring checks.
Mentor junior engineers, influence architectural decisions, and collaborate across engineering teams.

Required Skills and Qualifications:

6+ years of experience as a DevOps Engineer, SRE, or in a related software engineering role, supporting production-grade systems.
Strong hands-on experience with cloud infrastructure on GCP (preferred) or AWS.
Proven expertise in operating Kubernetes-based platforms in production environments (GKE preferred).
Solid experience designing and maintaining highly available and resilient systems using SRE best practices.
Hands-on knowledge of SLIs, SLOs, error budgets, and reliability engineering principles.
Strong experience with observability and monitoring tools, including Prometheus, Grafana, Alertmanager, OpenTelemetry, and log platforms such as Loki / ELK / OpenSearch.
Demonstrated experience in incident management, on-call support, root cause analysis, and postmortems.
Proficiency in automation and tooling using Python, with additional scripting experience in Shell or Groovy.
Experience integrating CI/CD pipelines (Jenkins, GitHub) with deployment, monitoring, and reliability checks.
Strong understanding of microservices architectures, distributed systems, and containerized workloads.
Hands-on experience with Infrastructure as Code (IaC) tools such as Terraform or CloudFormation.
Good knowledge of cloud networking, security fundamentals, and access controls.
Strong analytical and problem-solving skills with a proactive operational mindset.
Excellent communication skills and the ability to collaborate effectively with cross-functional engineering teams.

