Job Description & How to Apply Below
The Site Reliability Engineer is responsible for ensuring the reliability, availability, performance, and scalability of infrastructure and applications. The role emphasizes automation, monitoring, incident management, and continuous improvement, working closely with development and operations teams.
Key Responsibilities
Ensure high availability, reliability, and performance of production systems
Design, implement, and maintain monitoring, alerting, and observability solutions
Automate infrastructure provisioning, deployments, and operational tasks
Lead incident response, troubleshooting, and root cause analysis (RCA)
Optimize system performance, scalability, and capacity planning
Collaborate with development teams to improve application reliability and operability
Define, track, and improve SLAs
Reduce operational toil through automation and process improvement
Ensure security, compliance, and best operational practices
Participate in on-call rotations and provide production support
Required Skills / Must-Have
Technical Skills
Linux/Unix system administration
Kubernetes / OpenShift administration and troubleshooting
Cloud platforms: AWS / Azure
Monitoring & observability: Prometheus, Grafana, ELK, Datadog
Scripting: Shell, Python, or Go
Infrastructure as Code: Terraform, Ansible, Helm
CI/CD pipelines and DevOps practices
Experience
Experience in SRE / DevOps / Platform Engineering / Production Support
Experience managing production-grade distributed systems
Nice-to-Have / Preferred Skills
Service mesh experience (Istio, Linkerd)
Messaging systems: Kafka, ActiveMQ, RabbitMQ
Performance testing and load testing tools
Security and compliance experience in regulated environments
Exposure to Google SRE principles and practices
Education & Qualifications
Primary / Preferred Education
Bachelor's degree in Computer Science, Information Technology, or related field (preferred)
Certifications / Licenses
Preferred (Not Mandatory)
Red Hat OpenShift certification
Skills Grouping & Synonyms (for AI Matching)
Operations & Reliability
Site Reliability Engineering / Production Support / Platform Engineering
Incident management / Major incident / RCA / Postmortem
Cloud & Containers
Kubernetes / OpenShift / Container orchestration
Cloud infrastructure / IaaS / PaaS
Automation & DevOps
Infrastructure as Code / IaC / Terraform / Ansible
CI/CD / Continuous delivery / Automation
Monitoring & Observability
Monitoring / Alerting / Metrics / Logging / Tracing
Prometheus / Grafana / ELK / APM
Location & Work Mode
Location: Gurugram, Haryana
Work Mode: Onsite