Cloud Engineer
Listed on 2026-01-27
IT/Tech
Cloud Computing, Systems Engineer, SRE/Site Reliability, IT Support
Overview
Altana is the network for trusted trade. Our AI-powered product network empowers governments and businesses to build a more resilient and secure global economy while keeping trade flowing.
The Opportunity
The Cloud Engineering team is looking for an experienced Staff Cloud Engineer to help build out our vision. You'll work closely with our Developers, Data Scientists, and Customers on projects to analyze and observe world-scale datasets, build systems that can scale to produce never-before-seen insights, and construct infrastructure and applications that help deliver our product vision.
In this role, you will be instrumental in ensuring the availability, performance, and scalability of Altana's critical production services across our cloud-native environments and data pipelines. You will drive reliability into our architecture and operations through automation, proactive monitoring, and comprehensive observability. Success will be measured by the resilience of our production systems, the effectiveness of our observability stack, and our continuous improvement in operational efficiency.
Your Responsibilities
- Observability & Monitoring: Design, implement, and maintain comprehensive observability solutions across the platform stack, including metrics, logging, tracing, and alerting using modern tools (Prometheus, Grafana, Datadog, OpenTelemetry). Develop dashboards and runbooks that provide deep insights into system health and behavior.
- Internal Developer Platforms: Build and maintain internal developer platforms using infrastructure as code (Terraform) to enable self-service provisioning across multi-cloud environments (AWS, Azure).
- Automation & CI/CD: Design and implement automation pipelines for infrastructure provisioning, application deployments, and operational tasks using GitLab CI/CD, GitHub Actions, or similar tools.
- Kubernetes & Container Platforms: Develop and maintain Kubernetes platforms, including writing Helm charts, managing cluster operations, implementing pod security policies, and optimizing resource utilization.
- Reliability Engineering: Champion SRE principles, including establishing and monitoring Service Level Objectives (SLOs) and error budgets for critical services. Drive initiatives to improve system reliability, availability, performance, and efficiency.
- Platform Abstractions: Create platform abstractions and tooling that enable development teams to deploy and operate services independently while maintaining security and compliance standards.
- Security & Compliance: Build and maintain secure container images and deployment pipelines with automated security scanning, vulnerability management, and compliance checks. Support deployments in highly regulated customer environments.
- Incident Management: Participate in the incident response lifecycle, including detection, triage, mitigation, and resolution. Lead blameless postmortems to identify root causes and implement preventative measures.
- Toil Reduction: Automate operational tasks to reduce toil and improve system reliability through scripting, tooling development, and process improvement.
- Collaboration & Mentorship: Collaborate with engineering teams to understand their needs and translate them into platform capabilities. Mentor team members on cloud best practices, platform patterns, and automation techniques.
- On-Call Rotation: Participate in a periodic on-call rotation, responding to critical alerts and ensuring rapid resolution of production incidents.
- 5+ years of experience building developer platforms, infrastructure automation, or cloud infrastructure in a production environment.
- Expertise in designing, implementing, and managing observability platforms for cloud-native environments (e.g., Prometheus, Grafana, Datadog, ELK stack, OpenTelemetry, Jaeger).
- Strong understanding and practical application of SRE principles, including SLOs, error budgets, toil reduction, and blameless culture.
- Production experience building and operating environments in AWS and/or Azure.
- Strong Infrastructure as Code skills with Terraform, OpenTofu, or similar tools.
- Hands-on Kubernetes experience including cluster management, application deployments, and operational maintenance.
- Proficiency in at least one programming/scripting language (e.g., Python, Go) for automation and tool development.
- Proven experience participating in and improving incident management processes for critical systems.
- Knowledge of modern software delivery paradigms, including microservices architectures and CI/CD pipelines.
- Excellent problem-solving, analytical, and troubleshooting skills in complex distributed systems.
- Strong written and verbal communication skills, comfortable working with technical teams to understand requirements and design solutions.
- Track record of delivering platform capabilities that improved team productivity or system reliability.
- Care deeply about developer experience, automation, security, and operational excellence.
- Experience at a startup or…