Observability Analyst
Listed on 2026-02-21
IT/Tech
IT Support, Cloud Computing, Cybersecurity
Are you an experienced, passionate pioneer in technology who wants to work in a collaborative environment? If so, consider an opportunity with Deloitte under our Project Delivery Talent Model. As an experienced Applications Observability Analyst, you will be able to share new ideas and collaborate on projects as a consultant without the extensive demands of travel. The Project Delivery Model (PDM) is a talent model tailored specifically for long‑term, onsite client service delivery.
The Application Operations – Observability role supports the availability, reliability, and performance of business‑critical applications and infrastructure through 24x7x365 monitoring, incident triage/support, knowledge management, and basic automation. You will use Dynatrace, Grafana, and Azure Monitor to maintain dashboards, baselines, and alerts, and partner with internal technology teams and third‑party vendors to detect incidents and to accelerate and support their resolution. You will also help reduce manual effort through shift‑left practices, including well‑defined SOPs/runbooks and scripting.
Responsibilities
- Monitor applications and infrastructure using Dynatrace, Grafana, and Azure Monitor; maintain dashboards, baselines, and alert thresholds to improve signal quality and reduce noise.
- Participate in incident triage and troubleshooting, following defined procedures to assess impact, determine priority, and elevate to the correct resolver teams/vendors.
- Support major incident response by joining bridge calls, executing assigned actions under Incident Coordinator direction, and capturing accurate notes, timelines, and updates.
- Create and maintain standard operating procedures (SOPs), knowledge articles, and known error documentation to improve L1 effectiveness and consistency.
- Identify repetitive issues and contribute to automation through scripts/runbooks (e.g., PowerShell, Python, Bash) to improve detection and support remediation.
- Assist with operational reporting and continuous improvement, including metrics such as mean time to detect (MTTD), mean time to restore (MTTR), ticket volumes, change validations, and major incident summaries.
- Provide scheduled coverage for 24x7x365 operations, including off‑hours, weekends, and holidays as required.
- Communicate regularly with Engagement Managers (Directors), project team members, and representatives from various functional and/or technical teams, including escalating any matters that require additional attention and consideration from engagement management.
AI & Engineering leverages cutting‑edge engineering capabilities to build, deploy, and operate integrated/verticalized sector solutions in software, data, AI, network, and hybrid cloud infrastructure. These solutions are powered by engineering for business advantage, transforming mission‑critical operations. We enable clients to stay ahead with the latest advancements by transforming engineering teams and modernizing technology & data platforms. Our delivery models are tailored to meet each client’s unique requirements.
Our AI & Data practice offers comprehensive solutions for designing, developing, and operating advanced Data and AI platforms, products, insights, and services. We help clients innovate, enhance, and manage their data, AI, and analytics capabilities, ensuring they can grow and scale effectively.
Qualifications
Required
- 5+ years IT operations, incident management, or application support in a 24x7 environment
- 2+ years hands‑on observability/monitoring using Dynatrace, Grafana, and/or Azure Monitor (alerting, configuration, dashboarding)
- 3+ years participating in incident resolution (documentation, stakeholder communications); 1+ year exposure to major incident bridge calls (preferred)
- 2+ years working with IT service management (ITSM) tools and workflows (e.g., ServiceNow)
- 2+ years scripting/automation with PowerShell, Python, and/or Bash, plus documenting SOPs/knowledge articles
- 3+ years cross‑team collaboration and operational documentation in a production support environment
- 3+ years exceptional verbal and written communication, including clear incident and root cause analysis (RCA)…