Field Reliability Engineer
Sunnyvale area, California, USA
Engineering

Cerebras Systems builds the world's largest AI chip, 56 times larger than GPUs. Our novel wafer-scale architecture provides the AI compute power of dozens of GPUs on a single chip, with the programming simplicity of a single device. This approach allows Cerebras to deliver industry-leading training and inference speeds and empowers machine learning users to effortlessly run large-scale ML applications, without the hassle of managing hundreds of GPUs or TPUs.

Cerebras' current customers include global corporations across multiple industries, national labs, and top-tier healthcare systems. In January, we announced a multi-year, multi-million-dollar partnership with Mayo Clinic, underscoring our commitment to transforming AI applications across various fields. In August, we launched Cerebras Inference, the fastest Generative AI inference solution in the world, over 10 times faster than GPU-based hyperscale cloud inference services.

About The Role

Quality, reliability, and uptime are foundational to scaling Cerebras’ systems and their impact. We are looking for engineers passionate about diagnosing complex field failures, extracting insights from large-scale telemetry and service datasets, and partnering across hardware, software, operations, and supply chain teams to improve reliability at fleet scale. This role blends deep engineering domain knowledge with data analytics and reliability statistics to drive continuous improvement across Cerebras’ growing deployed base.

Responsibilities

Use reliability statistics (e.g., Weibull and other parametric/non-parametric survival models) to identify and address trends and risks in the fleet-level performance of Cerebras’ datacenter compute hardware.
Lead physics-of-failure–based root-cause investigations using telemetry, log data, stress/usage analysis, and engineering intuition.
Build and maintain statistical and large-scale data analyses (e.g., event logs, thermal/power telemetry, workload patterns).
Develop reliability forecasts to inform design decisions, manufacturing quality, capacity planning, service readiness, and supply chain strategy.
Build warranty cost and failure-forecast models by integrating failure rates, usage profiles, reliability statistics, and component risk factors.
Analyze real-world stress, workload, thermal, and environmental conditions to refine design requirements, qualification plans, and reliability tests.
Partner cross-functionally to prioritize issues, align mitigations, drive corrective actions, and turn learnings into design/process guidelines to prevent issue recurrence.

Skills & Qualifications

Required

Bachelor’s degree in Electrical Engineering, Materials Science, Mechanical Engineering, or a related field.
5+ years of industry experience in reliability engineering, hardware quality, or field failure analysis.
Strong proficiency in applied statistics and reliability methods (e.g., Weibull/survival analysis modeling, accelerated aging models).
Experience applying Weibull analysis and fleet-scale failure modeling to drive reliability priorities and quantify risk.
Working knowledge of Python and SQL for data extraction, cleaning, time-series analysis, reliability modeling, and visualization.
Demonstrated ability to build structured problem-solving approaches and lead cross-functional teams through complex root-cause investigations.
Excellent communication skills, with the ability to distill complex data and engineering concepts into clear, concise insights for technical and executive audiences.

Preferred

Physics-of-failure knowledge related to datacenter compute: thermal cycling, solder/interconnect fatigue, power electronics degradation, connector reliability, and cooling system failure modes.
Familiarity with the design and manufacturing process for IC packaging, server hardware, and PCBA.
Understanding of datacenter operating conditions: airflow, thermal management, power quality, workload variation, and system-level interactions.
Experience analyzing large-scale system telemetry, preferably from instrumented hardware fleets.

The base salary range for this position is $150,000 to $250,000 annually. Actual compensation may include bonus and equity, and…

