Job Description & How to Apply Below
Location: Bengaluru
Title: Site Reliability Engineer (SRE)
Experience: 10+ Years
Location: Bangalore (Onsite)
Notice Period: Immediate Joiner
Job Description
We are seeking a highly experienced Site Reliability Engineer (SRE) with 10+ years of experience in designing, implementing, and maintaining highly available, scalable, and resilient systems. The ideal candidate will have deep expertise in AWS, Kubernetes, Elasticsearch, Grafana, and modern SRE practices, with a strong focus on automation, observability, and operational excellence.
Key Responsibilities:
Design, build, and operate highly reliable, scalable, and fault-tolerant systems in AWS cloud environments.
Implement and manage Kubernetes (EKS) clusters, including deployment strategies, scaling, upgrades, and security hardening.
Own and improve SLIs, SLOs, and SLAs, driving reliability through data-driven decisions.
Architect and maintain observability platforms using Grafana, Prometheus, and Elasticsearch.
Manage and optimize Elasticsearch clusters, including indexing strategies, performance tuning, scaling, and backup/restore.
Develop and maintain monitoring, alerting, and logging solutions to ensure proactive incident detection and response.
Lead incident management, root cause analysis (RCA), postmortems, and continuous improvement initiatives.
Automate infrastructure and operations using Infrastructure as Code (IaC) and scripting.
Collaborate with development teams to improve system reliability, deployment pipelines, and release processes.
Implement CI/CD best practices and reduce deployment risk through canary, blue-green, and rolling deployments.
Ensure security, compliance, and cost optimization across cloud infrastructure.
Mentor junior SREs and drive adoption of SRE best practices across teams.
Required Skills & Qualifications
Core Technical Skills:
10+ years of experience in Site Reliability Engineering, DevOps, or Platform Engineering.
Strong hands-on experience with AWS services (EC2, EKS, S3, RDS, IAM, VPC, CloudWatch, Auto Scaling).
Advanced expertise in Kubernetes (EKS preferred), Helm, and container orchestration.
Deep knowledge of Elasticsearch (cluster management, indexing, search optimization, performance tuning).
Strong experience with Grafana and observability stacks (Prometheus, Loki, ELK).
Proficiency in Linux system administration and networking fundamentals.
Experience with Infrastructure as Code tools (Terraform, CloudFormation).
Strong scripting skills in Python, Bash, or Go.
Skills:
SRE, Kubernetes, Linux, Site Reliability Engineer, Grafana, AWS, Python, Elasticsearch, Reliability, ELK, DevOps