Senior Site Reliability Engineer, New York, New York, USA (IT/Tech)

Location: New York

To get the best candidate experience, please consider applying for a maximum of 3 roles within 12 months to ensure you are not duplicating efforts.

Job Category

Software Engineering

Job Details

About Salesforce

Salesforce is the #1 AI CRM, where humans with agents drive customer success together. Here, ambition meets action. Tech meets trust. And innovation isn’t a buzzword — it’s a way of life. The world of work as we know it is changing and we're looking for Trailblazers who are passionate about bettering business and the world through AI, driving innovation, and keeping Salesforce's core values at the heart of it all.

Ready to level-up your career at the company leading workforce transformation in the agentic era? You’re in the right place! Agentforce is the future of AI, and you are the future of Salesforce.

Job Title

Senior Site Reliability Engineer

About the Role

The Site Reliability Engineering team is part of the Digital Enterprise Technology Platform Engineering organization, responsible for architecting, scaling, and maintaining the IT monitoring and observability ecosystem. You will ensure the reliability of Enterprise IT services by driving proactive telemetry strategies and deep system visibility.

We're looking for a self-starter with the ability to take ownership of tasks, work under pressure, and balance multiple assignments simultaneously while maintaining a positive outlook. You'll lead the evolution of observability frameworks, contribute ideas and feedback on complex monitoring architectures, and provide expertise for IT projects and enhancements across various IT organizations.

Responsibilities

Manage, assess, plan, and support core observability platform operations and strategy.
Lead process changes and implementations related to the monitoring and logging stack (e.g., Splunk, Grafana, New Relic).
Provide escalation support for configuration and platform issues, participating in on-call schedules to resolve major incidents using deep-dive observability data.
Collaborate with key stakeholders (Service Managers, Product Managers, Application Architects, Business Support, and Operations) to gather and develop complex monitoring and alerting requirements.
Develop AI, automation, and integrations to deliver predictive monitoring and automated anomaly detection.
Work with third-party vendors and partners to address platform-related enhancements and evaluate next-gen observability tooling.
Support and manage the introduction of new monitoring tools and orchestrate migrations to modern OpenTelemetry-based standards.
Present reports on Service Level Indicators (SLIs), Service Level Objectives (SLOs), and correlation metrics to the Enterprise Operations team periodically.
Work within an Agile Scrum methodology and provide technical mentorship on observability best practices to junior team members.
Create standard operating procedures for monitoring-as-code and share them with the team for effective execution.

Minimum Qualifications

Bachelor's degree in Computer Science or related technical field, or equivalent experience in technical leadership
7 - 10 years of experience designing and implementing distributed systems to handle large-scale telemetry and log data
7 - 10 years of experience building and scaling high-volume observability pipelines.
Proven mastery of full-stack observability suites (Splunk, ThousandEyes, or similar).
Direct experience implementing OpenTelemetry (OTel) standards.
Strong background in "Monitoring as Code" using Terraform or similar automation tools.
Demonstrable ability in Bash/PowerShell, Python, and JavaScript (Node.js), especially program comprehension
Understanding of REST-based API design principles and best practices
Experience with server administration (Linux and Windows)
Knowledge of monitoring tools like Zabbix, Splunk, Grafana, New Relic, or ThousandEyes
Experience with AWS public cloud and VMware vSphere
Knowledge of configuration management and orchestration tools like Puppet, Ansible, or Terraform
Experience with Docker and containerized applications
Strong troubleshooting and debug skills (reading log files, analyzing memory leaks)
Strong analytical skills and…

