Principal Cloud Native Platform Engineer
Listed on 2026-02-16
Engineering
Systems Engineer
UK
About Nscale
Nscale is the GPU cloud engineered for AI. We provide cost-effective, high-performance infrastructure for AI start-ups and large enterprise customers. Nscale enables AI-focused companies to achieve superior results by reducing the complexity of AI development. Our GPU cloud bolsters technical capabilities and directly supports strategic business outcomes, including cost management, rapid innovation, and environmental responsibility.
We thrive on a culture of relentless innovation, ownership, and accountability, where every team member takes pride in their work and drives it with excellence and urgency. As an Nscaler, you’ll build trust through openness and transparency, where everyone is inspired to do their best work. If you join our team, you’ll be contributing to building the technology that powers the future.
About The Role
The Principal Cloud Native Platform Engineer is a senior technical leader responsible for the long-term integrity, coherence, and evolution of Nscale’s cloud-native platform. This role extends beyond individual systems, focusing on architecture, standards, and engineering excellence as the organisation and platform scale.
The role combines deep hands-on engineering with strong architectural stewardship. You will act as a technical escalation point, a mentor to senior engineers, and a trusted advisor to engineering leadership, helping shape the direction of the platform and the practices used to build it.
Working closely with the Director of Cloud Native Platform Engineering, Principal engineers in this role are expected to accelerate the delivery of Nscale’s platform service offerings, drawing on experience in technical direction to balance innovation with efficiency.
What You'll be Doing (Responsibilities)
- Own and evolve the core platform architecture across multiple subsystems
- Design and review complex, multi-controller Kubernetes-native systems
- Maintain a strong bias toward simplicity, explicitness, and long-term maintainability
- Act as a technical escalation point for the most complex platform problems
Standardisation & Technical Governance
- Define and maintain platform-wide engineering standards, including:
- Controller and operator design patterns
- API and CRD design guidelines
- Versioning, compatibility, and deprecation strategies
- Ensure consistency across teams in:
- Reconciliation behaviour
- Error handling and retry semantics
- Review and influence designs to prevent:
- Unnecessary divergence
- Overlapping abstractions
- Establish reference implementations and shared libraries where appropriate
Mentoring & Capability Building
- Actively mentor Senior and mid-level engineers in:
- Kubernetes internals and control plane design
- Distributed systems thinking
- Production readiness and failure analysis
- Raise the overall technical bar through:
- Design reviews
- Code reviews focused on correctness and clarity
- Knowledge sharing and documentation
- Identify skill gaps within the team and contribute to closing them through guidance and example
- Serve as a trusted technical advisor to engineering leadership
Cross-Team Influence
- Align platform engineering decisions with:
- SRE operational requirements
- Infrastructure and hardware roadmaps
- Product and customer needs
- Communicate architectural intent clearly through:
- Reviews and technical discussions
- Ensure that platform changes are understandable, supportable, and well-documented
- Demonstrated experience designing and building Kubernetes-native systems, including custom controllers, operators, CRDs, and reconciliation logic that runs reliably in production.
- Proven ability to design coherent, multi-component platform architectures that evolve over time without accumulating excessive complexity or technical debt.
- Production-Grade Software Engineering in Go
- Strong track record of writing maintainable, testable, and resilient Go code for long-lived distributed systems.
- Experience designing Kubernetes APIs and internal abstractions that are explicit, stable, and aligned with real operational constraints.
- Deep understanding of failure modes in Kubernetes and distributed systems, and the ability to design for graceful degradation, recovery, and operability.
- Experienc…