Site Reliability Engineer Job, New York, New York, USA (IT/Tech)

Position: Staff Site Reliability Engineer
Location: New York

AI can be a powerful tool for good in the world – at Altana we apply AI to the world’s largest organized body of supply chain data to power a more resilient, more secure, and more sustainable model of global commerce. Our customers connect to the Altana network to build resilience for critical industries and infrastructure, automate and safeguard cross-border trade, transform insurance underwriting, protect national security, combat modern slave labor, disrupt fentanyl trafficking, and ensure that their products are sustainable.

Altana is backed by leading investors and used by the world’s most important organizations, including Lloyd’s, Maersk, multiple government agencies across the US, UK, EU, Singapore, and Australia, General Atomics, Boston Scientific, and more. We are building a global platform connecting the public and private sectors into an AI-powered network for building trusted supply chains.

We operate in accordance with our values: we focus on value creation, not capture; we foster diversity and embrace difference; we embrace reality; we get things done; we amaze our clients. When you join Altana, you’ll be joining a vibrant, collaborative team working together to solve complex problems with the potential for global societal impact.

The Opportunity at Altana

At Altana, we believe that software that ships must be reliable and efficient. As a Staff Site Reliability Engineer, you will be instrumental in ensuring the availability, performance, and scalability of Altana’s critical production services, with a strong focus on our cloud-native environments and data pipelines. You will apply Google-style SRE principles, embedding reliability into our architecture and operations through automation, proactive monitoring, and a commitment to reducing toil.

You will work hands‑on with engineering teams, influencing system design for operability and contributing to the development of robust, self‑healing infrastructure. This role emphasizes a deep understanding of observability practices to gain comprehensive insights into system behavior, proactive incident prevention, and efficient incident response. Success will be measured by the resilience of our production systems, the effectiveness of our observability stack, and our continuous improvement in operational efficiency and reliability.

Your Responsibilities

Reliability Engineering:
Champion and implement SRE principles, including establishing and monitoring Service Level Objectives (SLOs) and error budgets for critical services. Drive initiatives to improve system reliability, availability, performance, and efficiency.
Observability & Monitoring:
Design, implement, and maintain advanced monitoring, logging, and tracing solutions for our cloud‑native applications and infrastructure (e.g., Kubernetes, microservices). Develop dashboards, alerts, and runbooks that provide deep insights into system health and behavior.
Automation & Toil Reduction:
Identify and automate repetitive operational tasks and manual processes across our production environment. Develop tools and scripts to enhance system operations, deployment pipelines, and incident response.
Incident Management & Postmortems:
Actively participate in the incident response lifecycle, including detection, triage, mitigation, and resolution of production issues. Lead thorough blameless postmortems to identify root causes and implement preventative measures and lasting improvements.
System Design & Optimization:
Collaborate closely with development teams to influence the design of new services, ensuring they are built for operability, reliability, and cost‑efficiency. Proactively identify and address performance bottlenecks and architectural weaknesses.
On‑Call Rotation:
Participate in a periodic on‑call rotation, responding to critical alerts and ensuring rapid resolution of production incidents.
Data Reliability:
Implement and maintain reliability and observability for critical data pipelines and data infrastructure, ensuring data integrity, availability, and…

