Senior DevOps Engineer
Listed on 2026-02-28
IT/Tech
Cloud Computing, SRE/Site Reliability, Systems Engineer, IT Support
We are seeking a highly experienced Senior DevOps Engineer (Production Support) with deep expertise in AWS, Kubernetes, CI/CD, and cloud-native platforms. This role will focus on operating, stabilizing, and continuously improving production environments, ensuring high availability, performance, and scalability of mission-critical applications.
The ideal candidate is a hands-on DevOps/SRE professional who thrives in fast-paced production environments and can automate, troubleshoot, and optimize distributed systems at scale.
You will work extensively with AWS, Kubernetes (Rancher), Jenkins, GitHub, Terraform, Kafka, Harness, and Python while partnering with engineering, platform, and product teams.
Key Responsibilities
Production Operations & Reliability
- Provide L2/L3 production support for cloud-native applications running on AWS and Kubernetes.
- Own incident triage, root cause analysis (RCA), and resolution for high-severity production issues.
- Participate in on-call rotations and drive post-incident improvements.
- Improve system reliability, resilience, and observability using SRE best practices.
- Design and operate scalable AWS environments using:
- EC2, EKS, VPC, ALB/NLB
- S3, RDS, DynamoDB
- IAM, CloudWatch, EventBridge
- Optimize cloud cost, performance, and security posture.
- Manage and operate Kubernetes clusters (Rancher-managed or EKS).
- Troubleshoot:
- Pod failures
- Resource constraints
- Improve:
- Autoscaling strategies
- Deployment reliability
- Design and maintain CI/CD pipelines using:
- Jenkins
- GitHub Actions
- Harness (preferred)
- Implement:
- Blue/green and canary deployments
- GitOps workflows
- Automated rollbacks
- Build and maintain infrastructure using:
- IaC modules
- Platform templates
- Deployment accelerators
- Automate provisioning, scaling, and recovery workflows.
- Design and manage Kafka infrastructure including:
- Producers/consumers
- Ensure:
- High availability
- Throughput optimization
- Secure connectivity
- Integrate Kafka with AWS and Kubernetes ecosystems.
- Implement monitoring and alerting using:
- CloudWatch / Splunk Observability
- Define:
- SLIs/SLOs
- Alerting thresholds
- Runbooks
- Proactively identify bottlenecks and prevent outages.
- Secrets management
- IAM least privilege
- Container scanning
- Supply chain security
- Ensure infrastructure adheres to security and compliance standards.
- Partner with development teams to:
- Reduce operational toil
- Increase automation coverage
- Drive:
- Developer experience improvements
- Operational excellence initiatives
- 4-10 years in DevOps / SRE / Production Support roles
- Strong experience managing production-grade cloud environments
- Proven track record in live incident management
- Splunk
- Terraform
- Jenkins / GitHub
- Kafka
- Python or Shell scripting
- Harness CI/CD
- Observability tools (New Relic, Datadog, Prometheus)
- Strong troubleshooting and debugging mindset
- Ability to work in high-pressure production environments
- Ownership-driven and automation-first approach
Overall: DevOps, AWS, Kubernetes/Helm, Terraform/Ansible, Jenkins/Harness, Python/Groovy scripting, Linux, Splunk, Production Support