This role owns production reliability outcomes as systems scale, migrate, and evolve within regulated aviation environments.
Job Title: Senior Site Reliability Engineer
Experience Required: 5+ Years
Location: Pune / Ahmedabad
Educational Qualification: Bachelor's degree in Computer Science, Software Engineering, or a related field
Roles and Responsibilities
Reliability Ownership and Service Health
Own availability, latency, throughput, and durability for production systems
Define and maintain service level indicators and service level objectives
Manage error budgets to guide engineering and operational decisions
Ensure reliability targets are met consistently
Production Architecture and Resilience
Design and operate highly available multi-availability-zone and multi-region architectures
Ensure controlled and observable failure behavior
Define redundancy, graceful degradation, and automated recovery strategies
Validate failover and recovery through testing
Incident Response and Operational Maturity
Lead response to production incidents
Own root cause analysis focused on systemic contributors
Drive remediation actions to completion
Reduce incident frequency, severity, and blast radius over time
Observability and Operational Insight
Design centralized logging, metrics, alerting, and dashboards
Define observability standards tied to customer impact
Ensure alerts are actionable and low-noise
Use operational data for capacity planning and scaling decisions
Automation and Toil Reduction
Identify and eliminate manual or repetitive operational tasks
Build automation to reduce operational risk
Standardize operational workflows
Treat simplicity as a reliability requirement
Data and Database Reliability
Own production database reliability
Design replication, backup, restore, and failover strategies
Validate recovery procedures regularly
Lead migrations to managed cloud databases such as AWS RDS or Aurora
Technical Qualifications:
Cloud and Infrastructure
Hands-on experience operating production systems on AWS or Azure
Strong understanding of networking, IAM, load balancing, and managed services
Ability to balance cost, reliability, and operational complexity
Distributed Systems
Experience operating distributed systems in production
Strong understanding of partial failure and recovery patterns
Ability to diagnose cross-stack production issues
Observability and Operations
Experience with centralized logging, metrics, and alerting
Ability to design alerts based on service impact
Experience driving improvement from operational data
Programming and Automation
Strong scripting skills using Python, Node.js, or shell
Ability to write production-grade operational tooling
Comfort modifying application code to improve reliability
Databases
Experience operating relational databases in production
Experience with replication, backup, restore, and failover
Experience migrating legacy databases to managed services (preferred)
Preferred Experience
Experience in regulated or safety-critical industries such as aviation
Familiarity with compliance, auditability, and traceability requirements
Experience supporting systems with direct operational impact
Position Requirements
10+ Years work experience