Job Description & How to Apply Below
RESPONSIBILITIES
Operate and optimize Kubernetes-based infrastructure using Helm/Kustomize for deployment and configuration management.
Build and maintain CI/CD pipelines for infrastructure and application deployments.
Manage and monitor cloud infrastructure on AWS (EKS, EC2, S3, IAM, VPC, etc.) and on-premises infrastructure.
Ensure observability through logging, monitoring, and alerting systems (e.g., Prometheus, Grafana, CloudWatch, Datadog).
Implement and enforce security best practices across infrastructure components.
Participate in on-call rotations, incident response, and root cause analysis.
Support scaling of systems to meet demand while maintaining reliability.
Collaborate with engineering and security teams on architecture and deployment strategies.
Ensure the implementation of security standards and compliance requirements across all operational aspects of the cloud platforms.
MUST HAVE SKILLS
3 - 6+ years of hands-on experience in SRE roles
2 - 4+ years of managing production Kubernetes environments
Currently operating production EKS clusters (hands-on, not observational)
Deep expertise in Kubernetes (EKS or self-managed) and Helm
Strong understanding of networking fundamentals: TCP/IP, DNS, VPNs, firewalls, load balancing
Practical experience with AWS services: EKS, EC2, IAM, S3, CloudWatch, VPC
Solid exposure to containerization (Docker) and CI/CD pipelines (e.g., Bitbucket Pipelines, GitHub Actions, Argo CD, Flux CD)
Proven experience handling production systems, on-call rotations, and real-time incident response
Proficiency in at least one programming language (Python or Go preferred)
Clear understanding of the Software Development Life Cycle (SDLC)
Strong automation mindset with a bias toward eliminating manual toil
Ability to build and maintain Grafana dashboards using PromQL (or equivalent)
Strong grasp of SRE principles: SLIs, SLOs, error budgets, incident and post-incident management
NICE TO HAVE
Experience in regulated industries (healthcare, fintech).
Experience with incident management and disaster recovery.
QUALIFICATIONS/EXPERIENCE
Minimum of 3 years of experience, including 2+ years of SRE experience.
BTech/BE/BS or MTech/MCA/ME/MS
2+ years of work experience with Amazon Web Services (AWS)
2+ years of work experience with Kubernetes
2+ years of work experience with Site Reliability Engineering
Working in a hybrid setting
WHAT DAY TO DAY LOOKS LIKE
Monitoring Service-Level Indicators (SLIs)
Setting Service-Level Objectives (SLOs) and Service-Level Agreements (SLAs)
Responding to Incidents
Writing Postmortems
Automating System Tasks
Cross-Department Collaboration
Building Software for DevOps, SRE, and Support Teams
Fixing Support Escalation Issues
Optimizing On-Call Rotations and Processes
Documenting "Tribal" Knowledge
Conducting Post-Incident Reviews