Site Reliability Engineer II
Listed on 2026-03-01
IT/Tech
Cloud Computing, Systems Engineer, SRE/Site Reliability, IT Support
Job Posting
Title:
Site Reliability Engineer (Irving, TX)
About Gartner:
Join a world-class team of skilled engineers who build creative digital solutions to support our colleagues and clients. We make a broad organizational impact by delivering cutting-edge technology solutions that power Gartner. Gartner IT values its culture of nonstop innovation, an outcome-driven approach to success, and the notion that great ideas can come from anyone on the team.
What we're looking for:
Gartner is looking for a Site Reliability Engineer to join our collaborative, Agile team. This position will improve Gartner's customer experience and increase the value of our products by increasing the reliability and performance of our client-facing application and service offerings.
Why you'll want to come to work:
Measure performance against SLOs in partnership with stakeholders, and ensure systems continue to meet SLOs over time.
Work to improve performance, scalability, and stability of applications.
Participate in operational support and on-call rotation shifts for supported systems and products.
Respond to production incidents, help triage application and system issues, and identify root causes or remediations to restore services quickly.
Conduct blameless postmortems to troubleshoot priority incidents.
Use automation to reduce the probability and/or impact of problem recurrence.
Identify and evaluate alerting posture.
Create dashboards and reports to communicate key metrics.
Implement and manage DevOps capabilities using continuous integration/continuous delivery toolsets and automation.
Collaborate and share lessons learned regarding performance and reliability issues with all stakeholders, including developers, other SREs, operations teams, and project management teams.
Participate in continuous improvement in software quality and infrastructure reliability and resilience.
Build and maintain documentation for all assigned projects.
Build and maintain performance testing frameworks, tools, and methodologies.
Automate manual operational work (i.e., "toil") using pipelines, new software, or other appropriate mechanisms.
Conduct analytics on previous incidents to understand root causes and better predict and prevent future issues. Keep a proactive approach to spotting problems, areas for improvement, and performance bottlenecks.
Partner with stakeholders such as Dev teams or product owners to define service level objectives (SLOs) for application and system operations.
Collaborate with development teams to promote the concept of reliability engineering during all phases of the SDLC to detect and correct performance issues and meet availability goals.
What you'll bring to the team:
Must have:
5+ years of information technology experience, with 3+ years working on a DevOps/SRE team or similar.
Experience with incident and response management.
Experience with AWS cloud, specifically services such as EC2, EKS, API Gateway, Lambda, etc., or similar cloud technologies and services.
Experience with back-end technologies such as J2EE, JDBC, Tomcat, .NET Core/C#, Spring, Hibernate, etc.
Experience building tools that automate production support activities and improve the efficiency and productivity of Support teams.
Prior experience working as a Cloud DevOps Engineer, Build & Release Engineer, or System Administrator is preferred.
Prior experience with Docker container orchestration using Kubernetes, including creating pods, ConfigMaps, and deployments using Jenkins.
Working knowledge of client-side technologies such as Node.js, JavaScript, React, and jQuery.
Experience with troubleshooting, root-cause analysis, application design, and implementing components.
Working experience with monitoring tools like Splunk and APM tools such as Dynatrace, Datadog, New Relic, AppDynamics, etc.
Working knowledge of production support processes such as incident/change/problem management, call triaging and escalation procedures.
Exposure to Akamai, Cloudflare, or CloudFront as a CDN.
Strong Operating Systems (UNIX/Linux) background.
Preferred:
Exposure to performance engineering concepts.
Desired:
Exposure to chaos testing or chaos engineering.
Experience collaborating with Dev/DBA/Architecture teams or other relevant teams and performing root cause analysis, with good working knowledge of applications, processes, and operating systems.
Advanced analytical and problem-solving skills, and strong oral and written communication skills.
Highly adaptable to changing circumstances. Interested in and capable of continuously learning new skills and technologies.
Don't meet every single requirement? We encourage you to apply anyway. You might just be the right candidate for this, or other roles.
What you'll receive:
Competitive compensation.
Limitless growth and learning opportunities.
A collaborative and positive culture - join a diverse team of professionals that are as smart and driven as you.
A chance to make an impact - your work will contribute directly to our strategy.
Hybrid Work Environment - enjoy the flexibility of…