Job Description & How to Apply Below
Location - Chennai
Job type - Permanent, Full-time
Work mode - Work from office, 5 days a week (UK and non-US hours)
Experience - 4 to 10 Years
Immediate Joiners only
Role Overview
We are looking for a Senior Support Engineer who will play a critical role in transforming our traditional support and monitoring teams into a modern Site Reliability Engineering (SRE) function. This role combines Level 1 monitoring responsibilities and Level 2 support duties, ensuring end-to-end accountability for system reliability. The ideal candidate will have strong technical troubleshooting skills, experience with operational support, and a mindset for automation and proactive problem-solving.
Key Responsibilities
Act as the first point of contact for system alerts and proactively monitor on-prem and cloud environments.
Triage alerts, resolve issues using playbooks, and escalate when necessary.
Own incidents end-to-end, ensuring timely resolution and communication.
Troubleshoot and resolve complex issues such as incomplete file processing, manual data loads, and system alerts.
Handle customer support issues related to file processing and integrations.
Implement automation for recurring issues and manual interventions.
Integrate proactive monitoring and self-healing mechanisms into systems.
Drive root cause analysis and implement permanent fixes to prevent recurrence.
Apply SRE principles and AWS Well-Architected Framework best practices for reliability, scalability, and cost optimization.
Identify gaps in current processes and propose improvements.
Work closely with development teams to ensure reliability and operability of new features.
Participate in on-call rotations and incident reviews to improve system resilience.
Required Skills & Qualifications
Minimum of 4 years of experience in application support, operations, or reliability engineering.
Strong troubleshooting skills across on-prem systems and cloud environments.
Familiarity with Java, MySQL, and Python for debugging and support.
Experience with monitoring tools (e.g., Nagios, Prometheus, CloudWatch) and alert management.
Experience with automation scripting (Python, Shell, or similar).
Knowledge of incident management frameworks and ITIL processes.
Understanding of cloud platforms (AWS preferred) and migration considerations.
Preferred:
Exposure to SRE principles and practices.
Experience with CI/CD pipelines and DevOps tools.
Knowledge of observability concepts (metrics, logs, traces).
Soft Skills
Strong communication and collaboration skills.
Ability to work under pressure and manage critical incidents.
Analytical mindset with a focus on continuous improvement.
Position Requirements
10+ years of work experience