Principal GPU and Diagnostic Software Architect
Listed on 2026-03-01
IT/Tech
Systems Engineer, AI Engineer
WHAT YOU DO AT AMD CHANGES EVERYTHING
At AMD, our mission is to build great products that accelerate next‑generation computing experiences, from AI and data centers to PCs, gaming and embedded systems. Grounded in a culture of innovation and collaboration, we believe real progress comes from bold ideas, human ingenuity and a shared passion to create something extraordinary. When you join AMD, you’ll discover the real differentiator is our culture.
We push the limits of innovation to solve the world’s most important challenges—striving for execution excellence, while being direct, humble, collaborative, and inclusive of diverse perspectives. Join us as we shape the future of AI and beyond.
Together, we advance your career.
THE ROLE: AMD is seeking a senior‑level engineer to lead company‑level innovation in GPU microarchitecture performance measurement, parallel programming optimization, and advanced software diagnostics. This role centers on deep technical leadership in performance attribution, hardware/software observability, and defect localization across GPU compute stacks. In this role you will define and architect next‑generation methodologies for microarchitectural analysis, counter design, instrumentation, and diagnostic tooling that enable precise performance understanding from silicon through runtime and application layers.
This individual will work closely with GPU architecture, silicon design, firmware, drivers, compiler/runtime, and tools teams to ensure AMD platforms deliver measurable, explainable, and reproducible performance across generations.
THE PERSON: Are you a hands‑on architect in areas like GPU/accelerator or HPC performance engineering, microarchitecture analysis, compilers, runtime systems, and diagnostics?
Microarchitecture Performance Measurement & Attribution
- Define AMD’s methodology for cycle‑accurate and counter‑driven performance attribution across GPU generations.
- Architect performance measurement frameworks that correlate workload behavior to microarchitectural structures (CUs/SIMDs, wavefront schedulers, issue pipelines, register files, memory hierarchy, cache systems, fabric/interconnect).
- Drive counter architecture definition and validation to ensure observability of pipeline stalls, cache contention, memory divergence, synchronization overhead, and scheduling inefficiencies.
- Establish rigorous approaches for bottleneck classification: compute‑bound, memory‑bound, latency‑bound, fabric‑bound, and occupancy‑limited regimes.
- Develop scalable performance modeling techniques linking pre‑silicon simulation, emulation, and post‑silicon telemetry.
- Architect end‑to‑end performance workflows: microbenchmarks, workload decomposition, instrumentation, trace capture, and guided optimization.
- Lead development of profiling and visualization systems exposing pipeline stages, wave occupancy, cache behavior, memory bandwidth utilization, atomic/synchronization costs, and interconnect utilization.
- Influence compiler and runtime optimizations including code generation, scheduling, register allocation, vectorization, tiling, kernel fusion, and launch configuration strategies.
- Drive auto‑tuning and kernel optimization frameworks for AI/HPC workloads (GEMM, convolution, attention, graph workloads) across GPU generations and heterogeneous system configurations.
- Ensure strong correlation between synthetic benchmarks, application kernels, and real‑world workloads.
- Architect diagnostic frameworks capable of detecting, isolating, and reproducing defects across silicon, firmware, driver, runtime, and application layers.
- Develop static and dynamic analysis tools tailored to GPU execution and memory consistency models.
- Lead development of GPU‑focused sanitizers, race detectors, memory checkers, hang analysis tools, and fuzzing frameworks.
- Build automated triage systems integrating telemetry, crash signatures, counter anomalies, and workload traces to accelerate root cause identification.
- Drive methodologies for deterministic repro, workload minimization, and differential testing across hardware stepping and driver/compiler…