We are looking for a highly skilled Site Reliability Engineer (SRE) to ensure the reliability, scalability, and performance of our cloud-native infrastructure. The ideal candidate will bring strong hands-on experience in AWS, Kubernetes, Docker, CI/CD pipelines, monitoring, and automation using Python, and will work closely with development and operations teams to build resilient, highly available systems.
Key Responsibilities
- Design, deploy, and maintain highly available and scalable systems on AWS
- Manage and operate containerized applications using Docker and Kubernetes (EKS)
- Build, maintain, and optimize CI/CD pipelines using Jenkins
- Automate operational workflows and routine tasks using Python scripting
- Implement and manage monitoring, alerting, and observability using Grafana and Prometheus
- Ensure system reliability, performance, uptime, and scalability
- Participate in incident response, root cause analysis (RCA), and post-incident reviews
- Implement Infrastructure as Code (IaC) and automation best practices
- Collaborate with development teams to improve system architecture and deployment strategies
- Enforce security, compliance, and operational best practices in cloud environments
- Continuously improve system efficiency through automation, tooling, and process optimization
Required Skills & Qualifications
- Strong hands-on experience with AWS services (EC2, S3, IAM, VPC, RDS, EKS, etc.)
- Solid experience with Kubernetes (EKS) and Docker
- Proficiency in Python scripting for automation and monitoring
- Experience designing and managing CI/CD pipelines using Jenkins
- Strong understanding of DevOps principles and CI/CD best practices
- Hands-on experience with Grafana and Prometheus for monitoring and alerting
- Strong knowledge of Linux systems and networking fundamentals
- Experience with Git or other version control systems
- Understanding of microservices architecture
Good to Have
- Experience with Terraform or CloudFormation
- Knowledge of Helm, ArgoCD, or similar deployment tools
- Familiarity with log management tools (ELK / EFK stack)
- Understanding of SRE practices such as SLIs, SLOs, SLAs, and error budgets
- AWS certifications and/or Kubernetes certifications (CKA / CKAD)