
Distinguished, Architect - AI/ML

Job in Sunnyvale, Santa Clara County, California, 94087, USA
Listing for: Walmart
Full Time position
Listed on 2026-01-06
Job specializations:
  • IT/Tech
    Systems Engineer, Cloud Computing
Job Description & How to Apply Below

Position Summary...

Building the right technology foundation for infrastructure and platforms is vital to success at the scale of Walmart. Our team builds and maintains the foundational technologies that support the tech organization, including data platforms, enterprise architecture, DevOps, cloud computing, and infrastructure. All of these products and services are supported by scalable and powerful infrastructure, ensuring a secure and seamless employee and customer experience across stores, digital channels, and distribution centers.

What you'll do...

Join Walmart Global Tech's Site Reliability Engineering organization as a Distinguished AI/ML Engineer to architect revolutionary agentic AI systems that autonomously monitor, predict, and resolve issues across the world's largest retailer's technology ecosystem, impacting millions of customers and associates globally. You'll lead the transformation of traditional SRE practices into cutting‑edge, self‑healing platforms that serve as the intelligent backbone for reliability engineering across all of Walmart’s systems, from e‑commerce to stores to supply chain.

You’ll be responsible for designing and building Tier 0 high‑availability, resilient agentic platforms that serve as the backbone for reliability engineering across all of Walmart’s systems, stores, and facilities in the US and international markets. You’ll also define and implement unified, intelligent, operationally robust technical solutions and tools for all Walmart Technology organizations across all channels and geographies.


AI/ML & Agentic Systems Technical Leadership:
  • Architect and develop advanced agentic AI systems that can autonomously handle complex reliability engineering workflows, predictive failure analysis, and self‑optimization across all Walmart technology systems.
  • Design and implement multi‑agent orchestration platforms that coordinate between different AI agents for automated incident response, capacity planning, and performance optimization across e‑commerce, supply chain, and in‑store systems.
  • Build intelligent observability and monitoring systems using ML‑driven anomaly detection, predictive analytics, and autonomous incident resolution capabilities that span all of Walmart’s technology ecosystem.
  • Develop self‑healing infrastructure platforms that leverage AI to predict, prevent, and automatically resolve system issues before they impact customers, associates, or business operations across any Walmart system.
Site Reliability Engineering Technical Excellence:
  • Design, write, and build advanced tools to improve the reliability, latency, availability, and scalability of all Walmart Tech systems. This includes:
    1) Engineering reliability and availability, starting with metrics and measurements across all domains,
    2) Enabling scaling by providing technical solutions, developing automation, and/or optimizing processes for all engineering teams,
    3) Building tools and automation to prevent recurrence of problems across all mission‑critical Walmart services,
    4) Augmenting existing instrumentation to build a cohesive picture of system characteristics across the entire Walmart technology landscape, with special attention to points of failure.
  • Architect and implement fault‑tolerant systems and services across Walmart’s hybrid cloud infrastructure with focus on autonomous recovery and intelligent failure prediction for e‑commerce, supply chain, financial services, and in‑store technology.
  • Collaborate with engineering teams and leadership across all Walmart technology organizations to establish technical strategies and solutions to improve mean time to detect (MTTD) and mean time to restore (MTTR) through intelligent automation and predictive capabilities.
  • Work with service owners across all domains (e‑commerce, supply chain, stores, fintech, etc.) to define SLOs and build SLIs to ensure all critical systems are meeting SLAs while maintaining optimal performance and user experience.
  • Perform complex troubleshooting and analysis of large‑scale distributed systems across Walmart’s entire technology stack, using expertise in coding, algorithms, and distributed system design.
Strategic Technical Inn…