Senior DevOps Engineer
Listed on 2026-03-02
IT/Tech
Systems Engineer, Cloud Computing, Cybersecurity
At Forrester, we're trusted to work on trailblazing, mission-critical problems that business and technology leaders face today. That's why we're always looking to empower talented individuals to perform at their best every single day. We're proud of our community of smart people and vibrant voices who come together to do what's right by our clients and each other. Our success is driven by curiosity, courage, and customer obsession.
Bring the confidence and drive to be bold. Join us and build an extraordinary future.
The Senior DevOps Systems Engineer will play a pivotal role in designing, securing, scaling, and operating modern cloud-native platforms, with a strong emphasis on agentic AI systems, Kubernetes, Karpenter, Ray Serve, Terraform-based infrastructure-as-code, and Amazon Web Services (AWS)-centric architectures. This role demands a hands-on technical leader who thrives in a highly collaborative environment, takes ownership of complex problems, and drives them to resolution with minimal oversight.
You will partner across engineering, SRE, security, product, data, and AI-focused teams to ensure system resiliency, observability, and strong security practices at every layer. This role expects a strong troubleshooting instinct, the ability to navigate a broad observability stack, and an obsession with identifying root cause rather than symptoms.
Job Description:
- Design, build, maintain, and automate infrastructure supporting various platforms and technologies across the organization.
- Implement and enforce security best practices across cloud, network, and application layers; security must be foundational, not an afterthought.
- Ensure maximum availability and reliability of our mission‑critical platforms, complying with our SLAs.
- Drive root cause analysis using logs, traces, metrics, and dashboards across multiple observability platforms.
- Troubleshoot complex production issues across the stack (infrastructure, network, and application), ensuring minimal downtime and rapid recovery.
- Collaborate closely with engineering, SRE, QA, security, data/AI, and product teams.
- Participate in routine disaster recovery/business continuity (DRBC) exercises.
- Participate in an on‑call rotation, improving incident response, runbooks, and documentation.
- Lead initiatives with minimal oversight, clearly communicating progress, risks, and outcomes to technical and nontechnical stakeholders.
- Master's degree in a technology-related field, engineering, or computer science (a plus).
- Relevant work experience (eight-plus years) in software development or systems engineering.
- Deep experience with AWS (EC2, EKS, IAM, VPC, networking, load balancers, S3, Lambda, RDS, MSK, Secrets Manager, etc.).
- Experience in supporting AI/ML or agentic AI systems, especially in production environments.
- Extensive experience with continuous integration/continuous delivery (CI/CD) tools such as Argo CD and Jenkins.
- Experience working collaboratively with application development teams across the organization to resolve problems.
- Strong Kubernetes proficiency: cluster operations, Karpenter, Helm, networking, and cluster security.
- Expertise with Terraform: maintaining existing modules and developing new ones from scratch.
- Strong troubleshooting capabilities across distributed systems, with the ability to interpret logs, metrics, and traces to rapidly identify root cause.
- Familiarity with observability stacks (e.g., Prometheus/Grafana, CloudWatch, OpenTelemetry, Dynatrace).
- Solid understanding of security best practices (network segmentation, IAM least privilege, secrets management, pipeline integrity, and patching).
- Excellent written and oral communication skills necessary to produce and process technical documentation.
- Demonstrated ability to independently lead initiatives, drive tasks to completion, and manage priorities in a fast‑paced environment.
- Professional IT certifications, such as CKA, CKS, and AWS certifications (a plus).
- The ability to participate in an on‑call rotation.
- The ability to provide mission‑critical production support outside business hours in case of an outage, if necessary.
Please note that the base salary range indicated here is inclusive…