Job Description & How to Apply Below
The Salesforce Monitoring Cloud team is looking for an engineering leader with deep expertise in AI/ML and analytical modeling for infrastructure systems to lead high-caliber engineering teams building the next generation of intelligent availability monitoring platforms at massive scale.
Monitoring Cloud is a foundational part of Salesforce Infrastructure that ensures the reliability and availability of Salesforce products globally. We own the entire telemetry stack, from lightweight agents that emit metrics, logs, traces and events to large-scale distributed backend systems that process petabytes of telemetry data in real time across public cloud environments.
We are building advanced machine learning-driven detection and analytical systems that power proactive incident identification, anomaly detection, capacity forecasting, failure prediction and automated remediation across distributed cloud infrastructure. This role focuses on deep applied ML and statistical modeling for infrastructure observability, not generative AI or enterprise application ML.
As an engineering leader, you will drive the architecture and implementation of scalable ML systems embedded directly into our monitoring cloud. You will lead teams building analytical pipelines, detection models, signal correlation engines and intelligent automation systems that improve availability, reduce MTTR and enhance system resilience. You are equally passionate about technical depth, operational excellence, and building high-performing teams.
Responsibilities
Define and drive the vision for ML-powered infrastructure observability with a focus on Availability, Reliability, Detection Accuracy and Operational Excellence.
Architect and scale analytical modeling systems for:
Real-time anomaly detection across metrics, logs and traces
Signal correlation and root cause analysis
Failure prediction and risk scoring
Capacity forecasting and saturation prediction
Intelligent alert noise reduction
Build ML pipelines that operate at hyperscale across distributed systems in Azure and other public cloud environments.
Lead the development of statistical, time-series and deep learning models tailored for infrastructure telemetry data.
Integrate ML models directly into monitoring and incident management workflows.
Ensure models are production-grade: reliable, explainable, scalable and cost-efficient.
Drive execution in partnership with infrastructure engineering, product and architecture teams.
Establish strong service ownership practices including SLOs, SLAs and operational metrics for ML-powered services.
Build and mentor a high-caliber team of ML engineers and distributed systems engineers.
Promote rigorous experimentation, model evaluation frameworks and data-driven decision making.
Recruit top talent in ML systems and infrastructure engineering.
Required Skills / Experience
12+ years of experience in software development with 3+ years managing engineering teams.
Strong background in Machine Learning applied to large-scale systems or infrastructure problems.
Large-scale data analytics.
Experience productionizing ML models in distributed cloud environments.
Strong foundation in Distributed Systems architecture.
Experience building or operating observability or telemetry platforms at scale.
Experience with public cloud platforms such as Azure, AWS or GCP.
Experience with large-scale data processing frameworks (e.g., Kafka, Spark, stream processing systems, NoSQL stores).
Strong service ownership mindset with experience defining and operating services with SLOs/SLAs.
Ability to balance research-oriented thinking with practical production delivery.
Proven track record of recruiting and developing high-performing technical teams.
Excellent written and verbal communication skills.
Preferred Qualifications
Experience building ML-driven monitoring or availability platforms.
Background in infrastructure reliability engineering or SRE environments.
Experience designing low-latency ML inference systems.
Contributions to research, patents, or technical publications in applied ML or distributed systems.
Position Requirements
10+ years of work experience