Site Reliability Developer Job, Bangalore area, Bengaluru, Karnataka, India, IT/Tech

Position: Site Reliability Developer 3
Location: Bengaluru

Solve complex problems related to cloud infrastructure services and build automation to prevent problem recurrence. Design, write, and deploy software to improve the availability, scalability, and efficiency of Oracle products and services. Develop designs, architectures, standards, and methods for large-scale distributed systems. Facilitate service capacity planning and demand forecasting, software performance analysis, and system tuning.

Design, implement, and operate scalable, secure, and highly available infrastructure for cloud and AI-driven applications on OCI.
Apply SRE best practices including SLI/SLO definition, error budgets, automated monitoring, incident response, and post-incident reviews.
Instrument systems using observability tools (Grafana, Prometheus, APM) to monitor performance, availability, latency, and resource utilization.
Lead major incident management, perform deep root-cause analysis, and implement long-term preventive fixes.
Drive large-scale noise reduction initiatives by tuning alerts, eliminating duplicate alarms, and improving monitoring quality.
Automate common operational tasks to minimize manual intervention and improve MTTR.
Automation & DevOps
Build and maintain automation for infrastructure provisioning, deployments, monitoring, and remediation using Terraform, Ansible, Python, Shell, or PowerShell.
Develop CI/CD pipelines and Infrastructure-as-Code frameworks to ensure repeatable and reliable deployments.
Identify and eliminate toil by continuously improving operational processes through automation.
Collaborate closely with engineering, DevOps, and platform teams to improve system resilience and scalability.

Strong problem-solving and critical-thinking skills with attention to detail.
Proactive, solution-oriented mindset with a focus on fixing root causes.
Passion for automation and continuous improvement.
Ability to work effectively under pressure in high-stakes environments.
Eagerness to learn, innovate, and mentor others.
Work with the Site Reliability Engineering (SRE) team on the shared full-stack ownership of a collection of services and/or technology areas. Understand the end-to-end configuration, technical dependencies, and overall behavioral characteristics of production services. Responsible for the design and delivery of the mission-critical stack, with a focus on security, resiliency, scale, and performance. Authority for end-to-end performance and operability. Partner with development teams in defining and implementing improvements in service architecture.

Articulate technical characteristics of services and technology areas and guide development teams to engineer and add premier capabilities to the Oracle Cloud service portfolio. Understand and communicate the scale, capacity, security, performance attributes, and requirements of the service and technology stack. Demonstrate a clear understanding of automation and orchestration principles. Act as the ultimate escalation point for complex or critical issues that have not yet been documented as Standard Operating Procedures (SOPs).

Use a deep understanding of service topologies and their dependencies to troubleshoot issues and define mitigations. Understand and explain the effect of product architecture decisions on distributed systems. Professional curiosity and a desire to develop a deep understanding of services and technologies.
----
Database Reliability Engineering
Manage, monitor, and optimize Oracle and Exadata databases to ensure high availability, scalability, and peak performance.
Perform advanced performance tuning including SQL optimization, indexing strategies, storage optimization, and configuration tuning.
Design, validate, and maintain robust backup, recovery, and disaster recovery solutions for 24x7 mission-critical environments.
Lead database incident and problem management, including capacity planning and resource utilization analysis.
Partner with application and architecture teams on database design, data modeling, and reliability improvements.
Required
Bachelor's or Master's degree in Computer Science, Engineering, or related field.
6+ years of experience in SRE, Cloud Engineering, DevOps, or Database Reliability roles.
Expert-level experience with Oracle Database and Exadata administration, performance tuning, and high-availability architectures.
Strong hands-on experience with automation and scripting (Python, Shell, PowerShell).
Deep understanding of cloud computing concepts, distributed systems, and reliability engineering.
Experience operating large-scale, mission-critical systems in 24x7 environments.
Strong analytical, troubleshooting, and communication skills.
Good to Have

Experience with cloud platforms (OCI preferred; AWS/Azure/GCP acceptable).
Familiarity with Kubernetes, Docker, and containerized workloads.
Knowledge of AI technologies such as LLMs, RAG, and AI Agents.

Experience with Infrastructure-as-Code tools (Terraform, Ansible).
Exposure to database observability and APM…

