Incident Manager
Job in Austin, Travis County, Texas, 78716, USA
Listed on 2026-03-11
Listing for: TEKsystems
Full Time position
Job specializations:
- IT/Tech: IT Support, Cloud Computing
Description
We are seeking a proactive, technology-forward Incident Manager to oversee and enhance our growing cloud-based operations as we transition from legacy to modern platforms. This role is pivotal in supporting large-scale enterprise environments, driving migration and modernization initiatives, and cultivating a resilient, automation-focused incident management culture. The ideal candidate thrives in dynamic, fast-paced settings, excels at cross-team collaboration, and can communicate effectively with both technical and executive audiences.
Top Technical Requirements & Qualifications
- 5+ years of experience in Incident Management supporting business-critical applications.
- Expertise with cloud-based monitoring and incident management tools, including Datadog (or similar tools such as Splunk), PagerDuty for alerting and escalation, and ServiceNow for incident workflow automation and ITSM processes.
- Proficiency with cloud platforms (GCP preferred; AWS/Azure also valued), including analysis of cloud console logs and troubleshooting at the network, platform, and application levels.
- Strong understanding of modern technology stacks: cloud infrastructure, application/infrastructure monitoring, networking, APIs, and distributed systems concepts.
- Solid grasp of ITIL frameworks, specifically Incident, Problem, and Change Management, and experience enforcing best practices and continuous improvement.
- Demonstrated ability in process automation, workflow improvement, and leveraging or developing AI-enabled solutions for incident response.
Key Responsibilities
- Incident Leadership & Response: Act as the single point of accountability (“Incident Commander”) for major incidents, leading real-time incident bridges, prioritizing and driving resolution, and engaging appropriate technical resources.
- Classify, triage, and escalate incidents per defined severity and SLAs, ensuring rapid service restoration and adherence to ITIL best practices.
- Monitor cloud-based systems using Datadog and native platform tools, interpreting alerts and coordinating responses across application, infrastructure, and network layers.
- Use PagerDuty for alerting and escalation, maintain escalation policies, and manage incidents across cloud-native and hybrid environments (including networking, APIs, databases, and edge components).
- Coordinate and manage incident workflows using ServiceNow, including supporting automation, workflow improvements, and bidirectional sync initiatives.
- Lead or participate in resilience activities and failover testing to shift the team toward proactive incident management.
- Support migration from legacy to modern platforms, understanding both environments and navigating the challenges of large-scale enterprise transitions.
- Communicate incident status and recaps to technical teams, business stakeholders, and executive leadership, translating technical issues into business-impact-focused updates.
- Produce post-incident summaries, impact assessments, and executive reports, and contribute to root cause analysis and post-incident reviews.
- Contribute to sprint/project work, including developing Slack automations and AI-enabled solutions for enhanced incident response and onboarding.
- Continuously refine incident workflows, escalation paths, and runbooks for improved effectiveness, and track recurring incidents to drive service stability.
- Support integration among monitoring, alerting, collaboration, and ITSM tools to streamline workflows and promote automation.
Preferred Attributes
- Self-starter attitude with an eagerness to innovate and drive improvements beyond the status quo.
- Collaborative mindset, with ability to mentor others in best practices and modern incident management approaches.
- Strong critical thinking, adaptability in ambiguous environments, and effective decision-making under pressure.
- Desire to contribute to a resilient, proactive incident management culture focused on preparation, simulation, and automation.
Incident management, Incident response, Problem management, ServiceNow, Monitoring tools, Root cause analysis, ITIL, Splunk, ITSM