Lead Software Architect – AI Infrastructure & Cluster
Job in Singapore, Singapore
Listed on 2026-03-13
Listing for: CREW by HRNET
Full Time position
Job specializations:
- IT/Tech: Systems Engineer, Cybersecurity, IT Support
Job Description & How to Apply Below
Location: Singapore (travel-ready for regional AI cluster deployments)
Competitive Salary: From SGD $12,000
Work Hours: 9 AM – 6 PM, Monday – Friday
Location: Ubi Road (East)
Key Responsibilities
Leadership Responsibilities:
- Lead multi-disciplinary engineering teams in AI cluster performance and deployment projects.
- Define technical roadmaps, standards, and best practices for large-scale AI infrastructure.
- Mentor and upskill engineers in high-performance AI frameworks, cluster optimization, and security hardening.
- Manage stakeholder communications, performance reporting, and deployment planning.
- Drive decision-making for hardware-software trade-offs, contingencies, and multi-region deployments.
- Foster a culture of reliability, proactive monitoring, and continuous performance improvement.
- Collaborate with cross-functional teams to enforce governance, Zero-Trust access, and infrastructure hardening.
Cluster & Hardware Optimization:
- Conduct cluster-level audits for software consistency across 3,456 GPUs and 48 racks.
- Fine-tune BIOS, firmware, kernel, and network parameters (NVLink, InfiniBand, PCIe Gen5/6) for maximum throughput.
- Validate collective communications using NVIDIA NCCL and SHARP for zero-bottleneck AI training.
AI Frameworks & Orchestration:
- Deploy, integrate, and optimize PyTorch, JAX, Slurm, and Kubernetes with GPUDirect Storage.
- Implement real-time telemetry for GPU/NPU health, power, and thermal metrics.
Security & Compliance:
- Implement Hardware Root-of-Trust, Secure Boot, and Zero-Trust IAM policies.
- Enforce inline encryption (AES-256-GCM) for AI fabrics and secure sensitive training data.
- Conduct vulnerability scanning and penetration testing, and ensure compliance with ISO 27001, SOC 2, and NIST AI RMF.
- Implement secrets management for API keys, SSH keys, and SSL certificates.
Networking & Performance Tuning:
- Optimize ultra-low-latency networks (400G/800G fabrics) and eliminate congestion using SHARPv4 and RoCEv2.
- Audit network paths, validate topology alignment with SDN, and monitor fabric performance proactively.
- Configure and validate GPUDirect RDMA/Storage for direct GPU-to-storage data movement.
Send Full Name, Contact Number & Resume to:
📱
📲 Email: athirah.rosli
Please include your availability, notice period and expected salary in your application.
* Only shortlisted candidates will be contacted.
CREW by HRNet | HRnet Ventures Pte Ltd (24C2435)
Athirah Bte Rosli (R2197227)