Location: Bengaluru
Role: GenAI Site Reliability Engineer
Level: Senior Associate
Tower: AI Operations & Platform Support (AI Managed Services)
Experience: 5-10 years
Key Skills: Monitoring & Alerting; Incident Investigation; Troubleshooting; Automation/Scripting; Cloud Operations; GenAI Platform Operations
Educational Qualification: Bachelor’s degree in Computer Science/IT or relevant field (Master’s or relevant certifications preferred)
Work Location: Bangalore / Hyderabad
Job Description
As an AC Senior Associate GenAI Site Reliability Engineer, you will operate and improve monitoring for in-scope GenAI services and AI workloads, investigate incidents, and implement reliability improvements. You will build dashboards, tune alerts, document runbooks, and automate repetitive operational tasks to improve stability and reduce time to restore.
Key Responsibilities:
1. Monitoring, Alerting & Service Health:
- Build and maintain dashboards and alerts for availability, latency, error rates, and overall service health for in-scope GenAI services.
- Tune thresholds and alert routing to reduce noise and surface actionable alerts, lowering mean time to acknowledge (MTTA) and mean time to restore (MTTR).
2. Incident Triage, Investigation & Restoration:
- Triage incidents, gather evidence, and perform structured troubleshooting using logs/metrics/traces and documented runbooks.
- Execute restoration steps and coordinate escalations to platform owners, engineering teams, or vendors for complex issues.
- Provide clear technical updates during live events and document resolution details for future reference and trend analysis.
3. Problem Prevention & Reliability Improvements:
- Contribute to root-cause investigations and implement corrective actions (monitoring improvements, configuration changes, resilience enhancements).
- Identify recurring failure modes and propose fixes that reduce repeat incidents and improve overall service stability.
- Support verification of corrective actions by monitoring outcomes and validating that improvements reduce incident recurrence.
4. Performance Troubleshooting Support:
- Assist with latency and error investigations by gathering diagnostics, isolating contributing factors, and proposing mitigations.
- Partner with engineering teams to validate fixes and monitor post-deployment impact on service health and performance.
5. Automation & Scripting:
- Automate diagnostics and routine operational tasks to reduce manual effort and improve consistency (scripts, repeatable checks, standardized steps).
- Maintain and document operational scripts and ensure they are usable and supportable by the broader team.
6. Documentation & Knowledge Management:
- Maintain runbooks, troubleshooting guides, and knowledge articles for frequent scenarios and standard operating procedures.
- Document known issues, standard resolutions, and escalation paths to improve first-time fix rate and onboarding efficiency.
7. Change Readiness & Post-Change Validation:
- Support operational readiness for changes by validating monitoring readiness, runbook updates, and post-change verification steps.
- Execute post-change checks and report regressions or unexpected behavior promptly to ensure rapid remediation.
8. Continuous Improvement & Service Reporting Inputs:
- Identify operational pain points and recommend improvements to monitoring, alerting, runbooks, and support workflows.
- Provide inputs to service reporting on incident trends, recurring issues, and improvement opportunities related to GenAI reliability.
9. Quality, Controls & Operational Discipline:
- Follow defined operational processes (incident, request, change) and maintain high-quality ticket hygiene and documentation discipline.
- Comply with security and access controls for supported tools and environments; proactively raise operational risks or control gaps for mitigation.
10. Collaboration & Team Support:
- Collaborate with peers and leads to coordinate workload, share knowledge, and support consistent execution standards across the pod.
- Support onboarding and knowledge transfer by maintaining clear documentation and participating in team enablement activities.
Required Skills:
- Hands-on experience supporting production services in a cloud environment,…