Position Summary:
We are looking for a Senior Site Reliability Engineer (SRE) with deep expertise in observability, cloud-native infrastructure, and large-scale distributed systems. This role is highly hands-on and focuses on designing, building, and operating reliable, observable, and scalable platforms running on Kubernetes, with a strong preference for Google Cloud Platform (GCP) and AWS.
Job Responsibilities:
Design, implement, and operate highly available and resilient Kubernetes-based systems.
Define, monitor, and enforce SLIs, SLOs, and error budgets to ensure service reliability.
Lead incident response, root cause analysis (RCA), and postmortems, driving continuous improvement.
Architect and manage observability platforms for metrics, logging, tracing, and alerting.
Work hands-on with Prometheus, Alertmanager, OpenTelemetry, Grafana, and Loki / ELK / OpenSearch.
Implement cloud-native monitoring and logging, with preference for GCP Cloud Monitoring & Logging.
Establish actionable alerting standards to reduce noise and improve response effectiveness.
Build and manage cloud infrastructure on GCP (preferred) or AWS.
Operate and scale Kubernetes clusters (GKE preferred) and deploy services using Helm.
Manage containerized workloads using Docker.
Develop automation and internal tooling using Python to improve reliability and observability.
Integrate CI/CD pipelines with reliability and monitoring checks.
Mentor junior engineers, influence architectural decisions, and collaborate across engineering teams.
Required Skills and Qualifications:
6+ years of experience as a DevOps Engineer, SRE, or related software engineering role, supporting production-grade systems.
Strong hands-on experience with cloud infrastructure on GCP (preferred) or AWS.
Proven expertise in operating Kubernetes-based platforms in production environments (GKE preferred).
Solid experience designing and maintaining highly available and resilient systems using SRE best practices.
Hands-on knowledge of SLIs, SLOs, error budgets, and reliability engineering principles.
Strong experience with observability and monitoring tools, including Prometheus, Grafana, Alertmanager, OpenTelemetry, and log platforms such as Loki / ELK / OpenSearch.
Demonstrated experience in incident management, on-call support, root cause analysis, and postmortems.
Proficiency in automation and tooling using Python, with additional scripting experience in Shell or Groovy.
Experience integrating CI/CD pipelines (Jenkins, GitHub) with deployment, monitoring, and reliability checks.
Strong understanding of microservices architectures, distributed systems, and containerized workloads.
Hands-on experience with Infrastructure as Code (IaC) tools such as Terraform or CloudFormation.
Good knowledge of cloud networking, security fundamentals, and access controls.
Strong analytical and problem-solving skills with a proactive operational mindset.
Excellent communication skills and the ability to collaborate effectively with cross-functional engineering teams.
Position Requirements:
10+ years of work experience