
Field Reliability Engineer

Job in Sunnyvale, Santa Clara County, California, 94087, USA
Listing for: Cerebras Systems Inc.
Full Time position
Listed on 2026-01-12
Job specializations:
  • Engineering
    Systems Engineer, Electrical Engineering
Salary/Wage Range or Industry Benchmark: 150,000 - 250,000 USD yearly
Job Description & How to Apply Below

Cerebras Systems builds the world's largest AI chip, 56 times larger than GPUs. Our novel wafer-scale architecture provides the AI compute power of dozens of GPUs on a single chip, with the programming simplicity of a single device. This approach allows Cerebras to deliver industry-leading training and inference speeds and empowers machine learning users to effortlessly run large-scale ML applications, without the hassle of managing hundreds of GPUs or TPUs.

Cerebras' current customers include global corporations across multiple industries, national labs, and top-tier healthcare systems. In January, we announced a multi-year, multi-million-dollar partnership with Mayo Clinic, underscoring our commitment to transforming AI applications across various fields. In August, we launched Cerebras Inference, the fastest Generative AI inference solution in the world, over 10 times faster than GPU-based hyperscale cloud inference services.

About The Role

Quality, reliability, and uptime are foundational to scaling Cerebras systems and impact. We are looking for engineers passionate about diagnosing complex field failures, extracting insights from large-scale telemetry and service datasets, and partnering across hardware, software, operations, and supply chain teams to improve reliability at fleet scale. This role blends deep engineering domain knowledge with data analytics and reliability statistics to drive continuous improvement across Cerebras’ growing deployed base.

Responsibilities
  • Use reliability statistics (e.g., Weibull and other parametric/non-parametric survival models) to identify trends and risks in, and assess the fleet-level performance of, Cerebras’ datacenter compute hardware.
  • Lead physics-of-failure–based root-cause investigations using telemetry, log data, stress/usage analysis, and engineering intuition.
  • Build and maintain statistical and large-scale data analyses (e.g., event logs, thermal/power telemetry, workload patterns).
  • Develop reliability forecasts to inform design decisions, manufacturing quality, capacity planning, service readiness, and supply chain strategy.
  • Build warranty cost and failure-forecast models by integrating failure rates, usage profiles, reliability statistics, and component risk factors.
  • Analyze real-world stress, workload, thermal, and environmental conditions to refine design requirements, qualification plans, and reliability tests.
  • Partner cross-functionally to prioritize issues, align mitigations, drive corrective actions, and turn learnings into design/process guidelines to prevent issue recurrence.

Skills & Qualifications

Required

  • Bachelor’s degree in Electrical Engineering, Materials Science, Mechanical Engineering, or a related field.
  • 5+ years of industry experience in reliability engineering, hardware quality, or field failure analysis.
  • Strong proficiency in applied statistics and reliability methods (e.g., Weibull/survival analysis modeling, accelerated aging models).
  • Experience applying Weibull analysis and fleet-scale failure modeling to drive reliability priorities and quantify risk.
  • Working knowledge of Python and SQL for data extraction, cleaning, time-series analysis, reliability modeling, and visualization.
  • Demonstrated ability to build structured problem-solving approaches and lead cross-functional teams through complex root-cause investigations.
  • Excellent communication skills, with the ability to distill complex data and engineering concepts into clear, concise insights for technical and executive audiences.

Preferred

  • Physics-of-failure knowledge related to datacenter compute: thermal cycling, solder/interconnect fatigue, power electronics degradation, connector reliability, and cooling system failure modes.
  • Familiarity with the design and manufacturing process for IC packaging, server hardware, and PCBA.
  • Understanding of datacenter operating conditions: airflow, thermal management, power quality, workload variation, and system-level interactions.
  • Experience analyzing large-scale system telemetry, preferably from instrumented hardware fleets.

The base salary range for this position is $150,000 to $250,000 annually. Actual compensation may include bonus and equity, and…
