Site Reliability Developer Job, Bangalore area, Bengaluru, Karnataka, India, IT/Tech

Position: Site Reliability Developer 3
Location: Bengaluru

Description
Solve complex problems related to infrastructure cloud services and build automation to prevent problem recurrence. Design, write, and deploy software to improve the availability, scalability, and efficiency of Oracle products and services. Design and develop architectures, standards, and methods for large-scale distributed systems. Facilitate service capacity planning and demand forecasting, software performance analysis, and system tuning.

Design, implement, and operate scalable, secure, and highly available infrastructure for cloud and AI-driven applications on OCI.
Apply SRE best practices including SLI/SLO definition, error budgets, automated monitoring, incident response, and post-incident reviews.
Instrument systems using observability tools (Grafana, Prometheus, APM) to monitor performance, availability, latency, and resource utilization.
Lead major incident management, perform deep root-cause analysis, and implement long-term preventive fixes.
Drive large-scale noise reduction initiatives by tuning alerts, eliminating duplicate alarms, and improving monitoring quality.
Automate common operational tasks to minimize manual intervention and improve MTTR.
Automation & DevOps
Build and maintain automation for infrastructure provisioning, deployments, monitoring, and remediation using Terraform, Ansible, Python, Shell, or PowerShell.
Develop CI/CD pipelines and Infrastructure-as-Code frameworks to ensure repeatable and reliable deployments.
Identify and eliminate toil by continuously improving operational processes through automation.
Collaborate closely with engineering, DevOps, and platform teams to improve system resilience and scalability.

Strong problem-solving and critical-thinking skills with attention to detail.
Proactive, solution-oriented mindset with a focus on fixing root causes.
Passion for automation and continuous improvement.
Ability to work effectively under pressure in high-stakes environments.
Eagerness to learn, innovate, and mentor others.

Work with the Site Reliability Engineering (SRE) team on the shared full-stack ownership of a collection of services and/or technology areas. Understand the end-to-end configuration, technical dependencies, and overall behavioral characteristics of production services. Responsible for the design and delivery of the mission-critical stack, with a focus on security, resiliency, scale, and performance. Authority for end-to-end performance and operability. Partner with development teams in defining and implementing improvements in service architecture.

Articulate the technical characteristics of services and technology areas and guide development teams to engineer and add premier capabilities to the Oracle Cloud service portfolio. Understand and communicate the scale, capacity, security, performance attributes, and requirements of the service and technology stack. Demonstrate a clear understanding of automation and orchestration principles. Act as the ultimate escalation point for complex or critical issues that have not yet been documented as Standard Operating Procedures (SOPs).

Use a deep understanding of service topology and dependencies to troubleshoot issues and define mitigations. Understand and explain the effect of product architecture decisions on distributed systems. Bring professional curiosity and a desire to develop a deep understanding of services and technologies.

