Principal Engineer - Site Reliability [T500-22750] | Hyderabad area, Telangana, India | IT/Tech

Position: Principal Engineer - Site Reliability [T500-22750]
About T-Mobile:
T-Mobile US, Inc. (NASDAQ: TMUS), headquartered in Bellevue, Washington, is America’s supercharged Un-carrier, connecting millions through its strong nationwide network and flagship brands, T-Mobile and Metro by T-Mobile. Customers benefit from an unmatched combination of value, quality, and exceptional service experience.

About TMUS Global Solutions:
TMUS Global Solutions is a world-class technology powerhouse accelerating the company’s global digital transformation. With a culture built on growth, inclusivity, and global collaboration, the teams here drive innovation at scale, powered by bold thinking.
TMUS India Private Limited operates as TMUS Global Solutions.

This role supports organizational goals by enhancing system reliability and resilience to improve software development and deployment efficiency. It involves applying site reliability engineering principles and automation to reduce manual work and prevent operational incidents. The role requires strong problem-solving and analytical skills to resolve complex technical issues effectively. Success is measured by system stability, incident reduction, and acceleration of software delivery processes.

The work directly impacts the performance and reliability of the organization's digital infrastructure and customer experience.
The Principal Engineer, Site Reliability is the subject matter expert (SME) and automation leader. They focus on high-severity incidents, advanced automation, AI/ML-enabled operations, and driving the evolution of the operations model toward self-healing, proactive reliability engineering.

Responsibilities:
Lead resolution of high-severity/complex incidents across hybrid infrastructure.
Architect and implement automation frameworks, self-healing workflows, and AI-driven ops.
Define SRE best practices, reliability SLIs/SLOs/SLAs, and operational standards.
Partner with application and platform engineering teams to improve resilience.
Drive observability maturity: predictive monitoring, anomaly detection, automated RCA.
Own continuous improvement of the runbooks and automation pipelines used by Engineers and Senior Engineers.
Provide technical leadership, mentor junior SREs, and conduct training.
Identify new technologies, tools, and processes that elevate operational excellence.

Skills:

Mandatory Skills (Must-Have):
Incident Command & Complex Troubleshooting:
Expectation:
Take leadership during high-severity outages, orchestrating technical response across teams.
Example:
Lead a Sev-1 bridge call where multiple microservices are failing due to cascading Kubernetes issues; coordinate DB, infra, network, security and app teams to isolate the problem.

Deep Kubernetes & Distributed Systems Expertise:
Expectation:
Design, troubleshoot, and optimize complex Kubernetes clusters and multi-region deployments.
Example:
Diagnose why inter-cluster communication in a service mesh is causing intermittent API failures and propose architectural fixes.

Automation Framework Design (Infra & Ops):
Expectation:
Architect automation platforms to reduce manual toil, enable self-service, and support auto-remediation.
Example:
Build an Ansible/Terraform-based automation pipeline that provisions, configures, and tests new app environments with zero manual steps.

Observability Strategy & Advanced Monitoring:
Expectation:
Define enterprise-wide observability standards (SLIs/SLOs/SLAs) and implement anomaly detection and predictive monitoring.
Example:
Roll out a metrics-based SLO framework for all API services with automated burn-rate alerts in Prometheus.
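Illustration:
The burn-rate idea behind such alerts can be sketched in a few lines. This is a minimal sketch, not T-Mobile's implementation: the function names, the request counts, and the 14.4 threshold (a commonly cited paging threshold for a 99.9% SLO over 30 days) are all assumptions chosen for illustration. In practice the same check would typically be written as a Prometheus alerting rule.

```python
def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """Observed error ratio divided by the error budget the SLO allows.

    A value of 1.0 means the budget is being spent exactly at the
    sustainable pace; higher values exhaust it early.
    """
    if total == 0:
        return 0.0  # no traffic in the window, so nothing was burned
    error_budget = 1.0 - slo_target
    return (errors / total) / error_budget


def should_page(long_window: tuple[int, int],
                short_window: tuple[int, int],
                slo_target: float,
                threshold: float = 14.4) -> bool:
    """Multi-window check: page only if both windows exceed the threshold.

    Each window is a hypothetical (errors, total_requests) pair. The long
    window shows the burn is significant; the short window shows it is
    still happening.
    """
    return (burn_rate(*long_window, slo_target) >= threshold
            and burn_rate(*short_window, slo_target) >= threshold)


if __name__ == "__main__":
    # Assumed sample counts for a 99.9% availability SLO.
    print(should_page((20, 1000), (3, 100), slo_target=0.999))
```

Requiring both windows to breach the threshold keeps a short error spike, or an incident that has already recovered, from paging anyone.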

Database & Application Performance Engineering:
Expectation:
Tune databases, caching layers, and app performance to handle scale.
Example:
Identify DB query patterns that degrade API performance and recommend schema/index optimizations.

Cross-Domain SME Knowledge (Networking, Storage, APIs):
Expectation:
Act as a go-to expert across infrastructure layers.
Example:
Troubleshoot why API gateway latency spikes correlate with storage backend bottlenecks.

AI/ML in Operations (AIOps):
Expectation:
Integrate AI-driven platforms for anomaly detection, auto-remediation, and incident prediction.
Example:
Deploy an ML model that predicts…

