Senior Engineer - Site Reliability (T500-22752), Hyderabad area, Telangana, India, IT/Tech

Position: Senior Engineer - Site Reliability [T500-22752]
About T-Mobile:
T-Mobile US, Inc. (NASDAQ: TMUS), headquartered in Bellevue, Washington, is America’s supercharged Un-carrier, connecting millions through its strong nationwide network and flagship brands, T-Mobile and Metro by T-Mobile. Customers benefit from an unmatched combination of value, quality, and exceptional service experience.

About TMUS Global Solutions:
TMUS Global Solutions is a world-class technology powerhouse accelerating the company’s global digital transformation. With a culture built on growth, inclusivity, and global collaboration, the teams here drive innovation at scale, powered by bold thinking.
TMUS India Private Limited operates as TMUS Global Solutions.

This role ensures the reliability and resilience of digital infrastructure to support efficient software development and deployment. It involves automating processes and reducing manual effort to prevent operational incidents and improve system performance. The role requires expertise in programming, scripting, incident response management, and various technical tools to maintain system robustness. Success is measured by system stability, incident reduction, and continuous improvement in operational efficiency.

The work directly impacts organizational stability and customer experience by maintaining high-performing and reliable systems.
The Sr Engineer, Site Reliability is the core operations engineer, capable of resolving complex incidents, improving automation, and mentoring Engineer(s). They bridge operations and engineering by identifying recurring issues and creating scalable fixes.

Responsibilities:
Resolve escalated incidents across Kubernetes, API Proxy, WAF, databases, and infrastructure platforms.
Design and improve runbooks, automating manual steps wherever possible.
Lead and contribute to building self-healing systems and self-service tooling for users.
Analyze incident trends and propose improvements in monitoring, capacity, and reliability.
Collaborate with engineering teams on deployment, upgrades, and performance optimization.
Conduct postmortems, document RCA, and ensure learning is captured.
Mentor and coach Engineer(s).

Skills:

Mandatory Skills (Must-Have)
Advanced Incident Troubleshooting & Resolution:
Expectation:
Diagnose and resolve escalated incidents that Engineer(s) cannot handle, often across multiple layers (infrastructure, application, network).
Example:
For an API outage, identify whether the root cause is in Kubernetes pod networking, an API gateway misconfiguration, or backend DB latency, then apply the fix.

Kubernetes & Container Orchestration Expertise:
Expectation:
Comfortable with deployments, scaling, networking, and debugging cluster-level issues.
Example:
Troubleshoot why pods are pending by checking node capacity, taints/tolerations, and cluster autoscaler logs.
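
As an illustration of the taint check mentioned in this example, here is a minimal Python sketch, not part of the original posting. It assumes taints and tolerations have already been read from the cluster into plain dictionaries with the usual key, value, operator, and effect fields; the function names are hypothetical, and the matching logic is a simplified reading of the Kubernetes rules (it ignores tolerationSeconds and treats PreferNoSchedule as non-blocking).

```python
from typing import Dict, List

# Taint effects that can keep a pod from being scheduled onto a node.
BLOCKING_EFFECTS = {"NoSchedule", "NoExecute"}


def tolerates(toleration: Dict[str, str], taint: Dict[str, str]) -> bool:
    """Return True if a single toleration matches a single taint."""
    # An empty effect on the toleration matches every taint effect.
    effect = toleration.get("effect", "")
    if effect and effect != taint.get("effect"):
        return False

    operator = toleration.get("operator", "Equal")
    key = toleration.get("key", "")
    if operator == "Exists":
        # An empty key with Exists tolerates every taint.
        return key == "" or key == taint.get("key")

    # Default operator Equal: key and value must both match.
    return key == taint.get("key") and toleration.get("value", "") == taint.get("value", "")


def untolerated_taints(
    node_taints: List[Dict[str, str]],
    pod_tolerations: List[Dict[str, str]],
) -> List[Dict[str, str]]:
    """List the blocking taints on a node that the pod does not tolerate."""
    return [
        taint
        for taint in node_taints
        if taint.get("effect") in BLOCKING_EFFECTS
        and not any(tolerates(tol, taint) for tol in pod_tolerations)
    ]


if __name__ == "__main__":
    taints = [{"key": "dedicated", "value": "gpu", "effect": "NoSchedule"}]
    print(untolerated_taints(taints, []))
```

A non-empty result for every node is one possible explanation for a pending pod; node capacity and autoscaler logs would still need to be checked separately.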

Automation & Scripting (Python, Go, Bash, Ansible, Terraform):
Expectation:
Write scripts and automation to reduce manual toil, enhance monitoring, and improve incident resolution speed.
Example:
Develop a Python script to automatically collect pod and system logs when a service crashes.
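
A script of this kind might look like the following sketch, which is an illustration rather than anything specified by the posting. It assumes kubectl is installed and configured for the target cluster; the function names, the output file naming, and the choice of three commands are hypothetical. Detecting the crash itself (for example from an alert webhook) is left out.

```python
import subprocess
from pathlib import Path
from typing import Callable, List


def build_log_commands(namespace: str, pod: str) -> List[List[str]]:
    """Build the kubectl commands used to capture diagnostics for one pod."""
    base = ["kubectl", "--namespace", namespace]
    return [
        # Logs from the previous (crashed) container instances.
        base + ["logs", pod, "--all-containers", "--previous"],
        # Pod status, restart counts, and recent pod-level events.
        base + ["describe", "pod", pod],
        # Namespace events ordered by time.
        base + ["get", "events", "--sort-by=.lastTimestamp"],
    ]


def collect_crash_logs(
    namespace: str,
    pod: str,
    out_dir: Path,
    runner: Callable = subprocess.run,
) -> List[Path]:
    """Run each command and write its output to a file under out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for index, command in enumerate(build_log_commands(namespace, pod)):
        # check=False: a failing command should not stop the collection.
        result = runner(command, capture_output=True, text=True, check=False)
        target = out_dir / f"{pod}-{index}-{command[3]}.txt"
        target.write_text(result.stdout + result.stderr)
        written.append(target)
    return written
```

Passing the runner as a parameter keeps the script testable without a live cluster; in normal use the default subprocess.run is enough.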

Observability & Monitoring Tooling:
Expectation:
Deep understanding of monitoring, alerting, tracing, and logging systems.
Example:
Build Prometheus alert rules to detect DB query spikes; configure Grafana dashboards for API latency.

CI/CD & Infrastructure as Code (IaC):
Expectation:
Familiarity with GitOps workflows, CI/CD pipelines, and infrastructure provisioning.
Example:
Enhance a Jenkins pipeline to add automated smoke tests before promoting Kubernetes deployments.

Database Troubleshooting (SQL & NoSQL):
Expectation:
Identify performance bottlenecks, connection issues, and basic tuning opportunities.
Example:
Run queries to detect slow-running SQL statements causing latency in an application.

Incident Management & RCA:
Expectation:
Act as incident commander for escalated issues, lead bridge calls, and produce Root Cause Analyses.
Example:
After a WAF misconfiguration causes downtime, lead the investigation, document the timeline, and propose preventive actions.

Mentorship & Runbook Improvement:
Expectation:
Coach Engineer(s), refine runbooks, and introduce new automated workflows.
Example:
Update a…

