Sr Engineer, Site Reliability T500-22222 Job, Hyderabad area, Telangana, India, IT/Tech

Position: Sr Engineer, Site Reliability [T500-22222]
About T-Mobile:
T-Mobile US, Inc. (NASDAQ: TMUS), headquartered in Bellevue, Washington, is America’s supercharged Un-carrier, connecting millions through its strong nationwide network and flagship brands, T-Mobile and Metro by T-Mobile. Customers benefit from an unmatched combination of value, quality, and exceptional service experience.

About TMUS Global Solutions:
TMUS Global Solutions is a world-class technology powerhouse accelerating the company’s global digital transformation. With a culture built on growth, inclusivity, and global collaboration, the teams here drive innovation at scale, powered by bold thinking.
TMUS India Private Limited operates as TMUS Global Solutions.

Job Overview:
At T-Mobile, we don’t just build technology; we empower people. We believe in investing in YOU: your growth, your impact, and your future. We’re unstoppable when individuals like you come together to solve bold challenges, inspire innovation, and build platforms that serve millions.
As a Senior Site Reliability Engineer (SRE), you will help ensure the availability, performance, and stability of platforms powering T-Mobile’s finance, credit, collections, document management, and supply chain systems. You will collaborate with application developers, DevOps, and cloud teams to build reliable, observable, and automated systems. This role is ideal for engineers passionate about operational excellence, learning distributed systems, and scaling production environments using code and data.

Key Responsibilities:

Reliability Engineering & Operations:
Contribute to the availability and performance of large-scale, customer-facing systems through monitoring, alerting, and incident response.
Assist in designing and implementing resiliency strategies, including health checks, failovers, circuit breakers, and retries.
Participate in on-call rotations, help triage incidents, and assist in root cause analysis and post-incident reviews.

Automation & CI/CD Support:
Develop scripts, tools, and automation to reduce manual toil and improve operational efficiency.
Support infrastructure deployment and service rollout via CI/CD pipelines and Infrastructure-as-Code workflows (e.g., Terraform, Helm).
Work with developers to improve service deployment, configuration management, and rollback strategies.

Observability & Metrics:
Help build and maintain dashboards, alerts, and logs that provide visibility into system health and application behavior.
Use tools such as Prometheus, Grafana, Splunk, or OpenTelemetry to monitor services and infrastructure.
Analyze system performance data to guide optimizations and proactively detect issues.

Cross-Team Collaboration:
Work with DevOps, SREs, and software engineers to ensure that services are built for reliability and observability.
Contribute to documentation, runbooks, playbooks, and operational readiness reviews.
Support development teams in designing systems that meet SLOs and operational standards.

Qualifications:

Bachelor’s degree in Computer Science, Engineering, or a related technical field.
8+ years of experience in infrastructure, operations, DevOps, or SRE roles.
Proficiency in scripting or programming languages such as Java, Python, Go, and Bash.
Strong familiarity with Linux systems, container orchestration (Kubernetes), and cloud platforms (Azure preferred; GCP also relevant).
Hands-on experience with monitoring and observability tools such as Grafana, Splunk, and OpenTelemetry.
Expertise in Kubernetes and container orchestration, including Docker templates, Helm charts, and GitLab templates.
Knowledge of authentication, authorization, encryption, SSL/TLS, SSH/SFTP, PKI, X.509 certificates, and PGP.
Solid understanding of incident management tools such as ServiceNow.

Preferred Skills:

Exposure to incident management frameworks, including alerting, escalation, and postmortem practices.
Understanding of SRE principles: SLOs, SLIs (service-level indicators), and error budgets.
Familiarity with tools like HAProxy, Envoy Proxy, Kafka, RabbitMQ, or other core infrastructure components.
Experience with performance tuning of Kubernetes runtime components.
Experience with CI/CD…

