Senior IT Platform Engineer, Compute & Resilience
Listed on 2026-03-03
IT/Tech
Systems Engineer, Cloud Computing
Job Category: Exempt
Requisition Number: SENIO003356
- Posted: February 24, 2026 - Full-Time
- Hybrid
Petaluma
109 Kentucky Street
Petaluma, CA 94952, USA
The Senior Platform Engineer designs, builds, modernizes, and operates the enterprise compute, virtualization, storage, and backup platform across plants, datacenter, offices, cloud environments, and remote users. This role owns the compute and resilience platform end to end, including architecture, automation, capacity management, disaster recovery, and operational performance.
The position emphasizes Infrastructure as Code, automation-first practices, AI-enabled operations, disaster recovery readiness, and reduction of technical debt to deliver resilient, scalable, and secure compute services aligned to enterprise strategy.
The full salary range for this position is $111,200 – $166,800. However, our current budget for a new hire is $111,200 – $150,000, depending on the candidate's specific experience and skills.
ESSENTIAL DUTIES AND RESPONSIBILITIES may include the following. Other duties may be assigned.
PLATFORM OWNERSHIP
- Own the enterprise compute, virtualization, storage, and backup platforms across plants, warehouses, offices, cloud, and remote environments
- Design for high availability, fault tolerance, scalability, and rapid recovery
- Ensure platform reliability supports manufacturing uptime, enterprise operations, and business continuity
- Serve as technical authority for compute architecture, virtualization standards, storage design, and resilience strategy
- Drive modernization, standardization, and lifecycle management of servers, hypervisors, storage arrays, and backup platforms
- Reduce technical debt and eliminate configuration drift
- Act as a technical mentor and escalation point within the platform domain
- Design and implement resilient, secure, and scalable compute, virtualization, and storage architectures
- Define and maintain standards, reference designs, and best practices for server builds, cluster design, hypervisor configuration, and storage layout
- Lead platform upgrades, hypervisor migrations, storage refreshes, and modernization initiatives
- Ensure integration with adjacent platforms such as network, security, cloud, identity, data, and applications
- Support hybrid environments spanning on-premises infrastructure, cloud compute platforms (Azure, AWS), and SaaS workloads
- Design and maintain high-availability clusters and disaster recovery configurations
- Define compute and infrastructure configurations using code, templates, or structured configuration management tools
- Establish version-controlled configurations as the system of record for server builds, hypervisor configurations, and storage policies
- Enable repeatable, low-risk changes through standardized deployment models
- Reduce manual changes and operational inconsistencies
- Contribute to CI/CD practices for infrastructure or platform changes
- Maintain version-controlled repositories as the authoritative source of platform configuration
- Automate server provisioning, patching, lifecycle management, validation, recovery, and compliance validation
- Reduce manual operational effort through scripting and workflow automation
- Partner with MSPs to ensure consistent execution of backup, recovery, and infrastructure runbooks
- Improve monitoring signal quality across compute, storage, and virtualization layers
- Design self-healing or auto-remediation capabilities where appropriate
- Continuously optimize resource utilization, performance, and capacity planning
- Ensure compute platform resilience, redundancy, backup, and disaster recovery alignment
- Own backup, recovery, and disaster recovery design and testing processes
- Maintain documented recovery procedures and conduct periodic DR exercises
- Partner with Security teams to maintain compliance, segmentation, access controls, and monitoring standards
- Support enterprise risk management initiatives related to infrastructure stability, ransomware protection, and business continuity
- Leverage AI-driven monitoring and analytics to detect…