Location: Bengaluru
Job Summary:
We’re looking for a Site Reliability Engineer (SRE) with 4–6 years of experience and strong technical and analytical skills to ensure the reliability, scalability, and performance of our core applications. This role focuses on improving the stability and efficiency of distributed systems built on Java and a microservices architecture, driving operational excellence through monitoring, automation, and incident management.
Key Responsibilities:
Application Reliability & Performance
- Monitor and maintain the health, performance, and reliability of production applications.
- Define, measure, and track SLIs/SLOs for key services, driving improvements proactively.
- Identify performance bottlenecks, memory leaks, and slow transactions in Java-based microservices.
- Partner with development teams to design and deploy resilient, fault-tolerant systems.
- Mentor developers and operations engineers on observability and debugging techniques.
Incident Management & Troubleshooting
- Actively participate in incident response: triage application issues and restore services quickly.
- Perform deep root-cause analysis for recurring incidents and ensure permanent fixes are implemented.
- Own the incident lifecycle — from detection to resolution and post-incident review.
- Ensure observability tools and alert thresholds are tuned to reduce false positives and improve signal quality.
Monitoring & Automation
- Enhance visibility across systems through better metrics, logs, and traces using Prometheus, Grafana, and Loki (or similar).
- Automate repetitive tasks — deployments, rollbacks, scaling, and diagnostics.
- Build or improve runbooks and self-healing mechanisms to reduce operational toil.
- Integrate AIOps capabilities for smarter alert correlation, anomaly detection, and incident prediction.
Operational Ownership
- Ensure production systems meet availability and performance targets.
- Track open issues, follow up on root cause actions, and drive closure with responsible teams.
- Collaborate with developers, infrastructure, and QA to maintain a consistent and stable release cycle.
- Contribute to continuous improvement of deployment, monitoring, and rollback processes.
Collaboration & Communication
- Work closely with product and platform engineering to integrate reliability into system design.
- Communicate incident status, RCA findings, and reliability metrics to stakeholders.
- Foster a reliability-first culture and advocate for operational excellence across teams.
Skills & Qualifications
Required Skills:
- 4–6 years of experience in Site Reliability Engineering or Application Operations.
- Solid understanding of Java, Spring Boot, and microservices architecture.
- Proficiency in monitoring and observability tools (Prometheus, Grafana, Loki, New Relic, or equivalent).
- Familiarity with Kubernetes, containers, and CI/CD pipelines.
- Familiarity with incident management, RCA, and performance debugging.
- Experience with cloud platforms (AWS, Azure, or GCP).
- Strong scripting skills (Bash, Python, or Go) for automation and diagnostics.
- Good communication and stakeholder collaboration skills.
Preferred Skills:
- Experience with modern observability tools.
- Familiarity with ITSM or ticketing tools (Jira, ServiceNow) for issue tracking.
- Knowledge of security and compliance in production environments.
- Hands-on experience with JVM performance tuning and runtime diagnostics to improve Java service performance.
- Exposure to using AI or LLM-based tools for alert correlation, log analysis, root-cause detection, or automated