Head of Observability and Monitoring
Listed on 2026-01-18
IT/Tech
Cybersecurity, IT Support
ESSENTIAL DUTIES AND RESPONSIBILITIES
Technical Leadership & Expertise
Develop and execute a comprehensive observability strategy, integrating logging, metrics, and distributed tracing across the Bank’s technology stack.
Lead the design and deployment of monitoring platforms, ensuring real-time visibility into system performance, availability, and security threats.
Own the end-to-end observability architecture, including tools selection, automation, and integration with cloud, on-prem, and hybrid environments.
Drive the adoption of AI/ML-powered monitoring to enhance anomaly detection, predictive analytics, and automated incident response.
Ensure robust service level indicators (SLIs), service level objectives (SLOs), and error budgets are established and tracked for critical services.
Strategic Planning & Governance
Define and implement observability governance frameworks, ensuring compliance with regulatory requirements (e.g., FFIEC, OCC, Basel III, GDPR).
Develop strategies to support real-time monitoring, root cause analysis, and proactive remediation to minimize downtime and business impact.
Partner with engineering, security, business unit, risk, and compliance teams to align observability initiatives with operational stability and performance targets, as well as with continuity and disaster recovery plans.
Champion operational resilience by ensuring monitoring covers end-to-end customer journeys, critical business services, and third-party dependencies.
Establish and maintain a centralized observability platform, standardizing logging and metrics collection across microservices, APIs, databases, and infrastructure.
Collaboration & Stakeholder Management
Work closely with platform teams to embed observability best practices into CI/CD pipelines and software development life cycles.
Partner with Cybersecurity to integrate security monitoring, anomaly detection, and threat intelligence into observability solutions.
Engage with business and operations teams to ensure monitoring capabilities support customer experience, regulatory reporting, and incident management.
Serve as the Bank’s SME on observability, engaging with industry forums, vendors, and regulatory bodies to stay ahead of trends and compliance needs.
Technical Skills
Proven expertise in modern observability stacks, including Splunk, Dynatrace, AppDynamics, ThousandEyes, ServiceNow AIOps, or Datadog.
Deep understanding of cloud-native monitoring across AWS, Azure, and Google Cloud, including serverless, Kubernetes, and container-based architectures.
Strong hands-on experience with log aggregation, tracing (Jaeger, Zipkin), and APM (Application Performance Monitoring).
Knowledge of AI-driven monitoring, automated remediation, and self-healing infrastructure.
Familiarity with SIEM tools and security monitoring, ensuring alignment with SOC and threat detection capabilities.
Experience in API monitoring, network telemetry, and database performance tuning.
Leadership & Strategic Experience
10+ years of experience in observability, monitoring, or infrastructure resilience roles within regulated financial services or banking environments.
Proven track record of designing and implementing enterprise-scale observability platforms in a complex, multi-cloud environment.
Experience leading cross-functional teams to drive cultural adoption of observability and monitoring best practices.
Strong knowledge of regulatory and compliance requirements related to operational resilience, incident management, and monitoring.
Soft Skills & Collaboration
Ability to translate complex technical monitoring data into actionable insights for senior executives and non-technical stakeholders.
Strong problem-solving skills with a proactive and forward-thinking approach to technology and resilience.
Excellent communication and leadership abilities, fostering collaboration across engineering, risk, and business teams.
Compliance and Regulatory Knowledge
In-depth understanding of compliance in regulated industries (e.g., financial services, healthcare).
Experience working with audit and risk management processes.
Stakeholder Engagement & Communication
Facilitate collaboration between application,…