Job Description & How to Apply Below
About T-Mobile:
T-Mobile US, Inc. (NASDAQ: TMUS), headquartered in Bellevue, Washington, is America’s supercharged Un-carrier, connecting millions through its strong nationwide network and flagship brands, T-Mobile and Metro by T-Mobile. Customers benefit from an unmatched combination of value, quality, and exceptional service experience.
TMUS Global Solutions:
TMUS Global Solutions is a world-class technology powerhouse accelerating the company’s global digital transformation. With a culture built on growth, inclusivity, and global collaboration, the teams here drive innovation at scale, powered by bold thinking.
TMUS India Private Limited operates as TMUS Global Solutions.
About the Role:
The Senior Site Reliability Engineer for Container Platforms at T-Mobile plays a crucial role in shaping and maintaining the foundational infrastructure that powers our next-generation platforms and services. They design, implement, and manage large-scale Kubernetes clusters and related automation systems that ensure high availability, scalability, and reliability across T-Mobile’s technology ecosystem. They also utilize their strong problem-solving and analytical skills to automate processes, reducing manual effort and preventing operational incidents.
Their expertise in Kubernetes and scripting languages, incident response management, and various tech tools contributes to the robustness and efficiency of our systems. By continuously learning new skills and technologies, they adapt to changing circumstances and drive innovation. Their work and expertise contribute significantly to the stability and performance of T-Mobile's digital infrastructure. They are also responsible for diagnosing and resolving complex issues across networking, storage, and compute layers, driving continuous improvement through data-driven insights and DevOps best practices.
This engineer is also responsible for contributing to the overall architecture and strategy of technical systems, mentoring junior engineers, and ensuring solutions are aligned with T-Mobile's business and technical goals.
What You’ll Do:
Design, build, and maintain large-scale, production-grade Kubernetes (K8s) clusters to ensure high availability, scalability, and security across T-Mobile’s hybrid infrastructure.
Develop and manage Infrastructure as Code (IaC) using tools such as Terraform, CloudFormation, and ARM templates, enabling consistent and automated infrastructure deployment across AWS, Azure, and on-premises data centers.
Resolve platform-related customer tickets by diagnosing and addressing infrastructure, deployment, and performance issues to ensure reliability and a seamless user experience.
Lead incident response, root cause analysis (RCA), and post-mortems, implementing automation to prevent recurrence.
Implement and optimize CI/CD pipelines leveraging GitLab, Argo, and Flux to support seamless software delivery, continuous integration, and progressive deployment strategies.
Automate system operations through scripting and development in Go, Bash, and Python, driving efficiency, repeatability, and reduced operational overhead.
Monitor, analyze, and enhance system performance, proactively identifying bottlenecks and ensuring reliability through data-driven observability and capacity planning.
Troubleshoot complex issues across the full stack (network, storage, compute, and application layers) using advanced tools like pcap, telnet, and Linux-native diagnostics.
Apply deep Kubernetes expertise to diagnose, resolve, and prevent infrastructure-related incidents while mentoring team members on container orchestration best practices.
Drive a culture of automation, resilience, and continuous improvement, contributing to the evolution of T-Mobile’s platform engineering and cloud infrastructure strategies.
Drive innovation by recommending new technologies, frameworks, and tools.
Perform additional duties and strategic projects as assigned.
What You’ll Bring:
Bachelor’s degree in Computer Science, Software Engineering, or a related field.
5–9 years of hands-on experience in Site Reliability Engineering roles supporting large-scale, production-grade systems.
Extensive…