Mechanical Engineer
Listed on 2026-02-28
Engineering
Systems Engineer, Electrical Engineering, Automation Engineering
Description
LOCATION & ON-SITE EXPECTATIONS:
This role requires regular onsite presence at the designated data center (one role will be in MWE, one in PHX). Onsite work is critical for:
- Understanding mechanical and liquid cooling system behavior
- Building trust with mechanical, electrical, and site services teams
- Validating telemetry, dashboards, and fault detection logic directly within the environment
Some travel may be required for cross‑site alignment and system comparison. All travel outside of a 75-mile round‑trip commute is eligible for reimbursement.
Team Overview
The Signals Quality team within Microsoft's CO+I IDEA group specializes in designing, customizing, and deploying telemetry‑driven detections across global datacenter environments. You will join a cross‑functional engineering and data/telemetry project team composed of a PM, a data scientist, and two developers. The team's mission is to ensure high‑density compute racks operate with minimal downtime and zero customer impact by improving monitoring, diagnostics, and liquid‑cooling system performance.
Project Overview
High‑density compute racks rely on advanced liquid and hybrid cooling systems (including CDUs, HRUs, and rack‑level cooling loops). The program's goal is to strengthen:
- Telemetry completeness
- Monitoring and alerting capabilities
- Diagnostic logic
- System reliability and operational efficiency
Telemetry from Atlanta (a more mature site) will serve as an early reference model for what the MWE/PHX environment will eventually receive as data onboarding progresses.
Role Contribution
The Mechanical Engineer will serve as the diagnostics and monitoring lead for liquid cooling. Your work directly supports the project by:
- Ensuring relevant telemetry is available, complete, and accurate
- Rapidly detecting cooling system degradations
- Defining requirements for dashboards, alerts, and monitoring tools
- Ground‑truthing data signals through onsite inspection
- Helping developers and PMs align monitoring logic with real equipment behavior
- Bridging operational engineering teams with the data/analytics team
Responsibilities
- Lead detection, monitoring, alarm triage, and fault isolation for CDUs, HRUs, and rack‑level liquid‑cooling systems.
- Respond to alarms and perform rapid root‑cause analysis using telemetry, system data, logs, and trend patterns.
- Conduct onsite validation of telemetry, dashboard logic, and system behavior by spending time on the datacenter floor.
- Identify and close telemetry gaps, including missing, inaccurate, or incomplete signals critical to system health and performance.
- Analyze system performance trends to detect early signs of degradation and recommend corrective actions.
- Build and maintain strong working relationships with mechanical, electrical, and site services engineers to ensure seamless collaboration without disrupting operations.
- Define requirements for dashboards, alerting logic, KPIs, and visualization tools, ensuring they support fast and accurate diagnostics.
- Collaborate closely with PMs and developers to translate real‑world system behavior into monitoring logic and tooling features.
- Provide weekly progress updates, documenting findings, issues, risks, and opportunities for improvement.
- Support continuous enhancements to reliability, thermal performance, redundancy, and energy efficiency across liquid‑cooling systems.
Qualifications
- Hands‑on experience with CDUs, HRUs, direct‑to‑chip cooling, racks, and cooling loops.
- Ability to diagnose system behavior, understand flow, temperature, pressure dynamics, and respond to system faults.
- Ability to perform root‑cause analysis based on telemetry streams, sensors, alarms, and system logs.
- Skilled at identifying degraded performance early through trend analysis and correlating telemetry with physical system behavior.
- 8+ years in data centers, HPC facilities, or mechanically intensive industrial environments.
- Strong understanding of mechanical, thermal, and facilities engineering fundamentals in…