Site Reliability Engineer
Listed on 2026-01-12
IT/Tech
Systems Engineer, Cloud Computing
Position Summary
The (USA) Staff, Site Reliability Engineer will design and implement scalable, secure, and resilient infrastructure solutions to support critical systems. This role involves developing automation, optimizing performance, and ensuring high availability through rigorous reliability testing and monitoring. The engineer will collaborate across teams to translate business requirements into detailed designs, drive continuous improvement, and lead root cause analysis efforts. A strong focus on disaster recovery, system security, and operational excellence is essential to maintain robust and efficient environments aligned with organizational goals.
The Team
Focusing on customer, associate, and business needs, this team works with Walmart International, which includes more than 5,200 retail units operating in 23 countries, with markets including Canada, Central America, Chile, Mexico, and South Africa. The Site Reliability Engineering team designs and develops scalable applications to enhance system reliability and performance. Using Node.js, Java, and Python, along with cloud platforms such as Google Cloud and Microsoft Azure, the team automates operational tasks through CI/CD pipelines.
They work with databases and messaging systems including MySQL, Redis, and Kafka, and deploy AI/ML models to address complex challenges. Collaborating across functions, the team ensures high‑quality, responsive solutions while focusing on software architecture, automation, and root cause analysis to maintain robust and efficient infrastructure.
Responsibilities
- Design and develop scalable, modular infrastructure solutions aligned with business and technical requirements.
- Implement automation scripts to enhance system operability and streamline deployment pipelines.
- Optimize system performance through tuning and reliability testing using open-source chaos engineering tools.
- Conduct root cause analysis and corrective actions to resolve performance and availability issues.
- Collaborate on disaster recovery planning and ensure compliance with security standards and frameworks.
- Monitor system health using key performance indicators and refine alerting logic to maintain service reliability.
- Train team members on reliability tools and best practices to support continuous improvement.
Qualifications
- Proven expertise in software architecture, distributed systems, and scalability design patterns.
- Strong knowledge of infrastructure automation, coding standards, and scripting for CI/CD pipelines.
- Experience with performance tuning and optimization on Unix/Linux platforms and JavaScript/Node.js environments.
- Hands‑on experience with AI/ML technologies, frameworks, and libraries such as TensorFlow and PyTorch.
- Proficiency in reliability engineering, including root cause analysis and corrective action implementation.
- Familiarity with disaster recovery planning and execution within enterprise environments.
- Ability to design and implement monitoring, alerting, and telemetry solutions aligned with business goals.
- Skilled in Kubernetes and open-source chaos engineering tools to validate system resiliency.
Imagine working in an environment where one line of code can make life easier for hundreds of millions of people. That’s what we do at Walmart Global Tech. We’re a team of software engineers, data scientists, cybersecurity experts, and service professionals within the world’s leading retailer who make an epic impact and are at the forefront of the next retail disruption.
People are why we innovate, and people power our innovations. We are people‑led and tech‑empowered. We train our team in the skillsets of the future and bring in experts like you to help us grow. We have roles for those chasing their first opportunity as well as those looking for the opportunity that will define their career. Here, you can kickstart a great career in tech, gain new skills and experience for virtually every industry, or leverage your expertise to innovate on a scale that impacts millions and reimagine the future of retail.
Beyond our great compensation package, you can receive incentive awards for your…