Senior DevOps Engineer / Platform Reliability Lead
Experience: 10-12+ years
Location: Kolkata
Role Overview
We are seeking a Senior DevOps Engineer / Platform Reliability Lead who can take an end-to-end view of our systems, identify areas for improvement across architecture, infrastructure, deployment pipelines, and reliability, and guide the platform toward higher scalability, stability, and operational maturity.
This role requires strong system thinking, sound architectural judgment, and the ability to clearly call out risks and improvements.
Key Responsibilities
Review the complete backend ecosystem (Node.js, Golang services, cloud infrastructure, CI/CD).
Identify architectural, scalability, reliability, and security gaps following the in-house migration.
Recommend and prioritise short-term fixes and long-term platform improvements.
Own containerized infrastructure using Docker and Kubernetes in production.
Design and maintain robust CI/CD pipelines with safe deployment and rollback strategies.
Implement and improve monitoring, logging, alerting, and incident response practices.
Define and track meaningful SLIs, SLOs, and error budgets.
Prepare systems for OTT traffic spikes during releases and live events.
Improve caching, queuing, and backend performance in collaboration with backend teams.
Drive secure access, secrets management, and cloud cost optimisation.
Act as a technical partner to backend, product, and leadership teams.
Required Technical Skills
Cloud & Infrastructure
Strong experience with AWS (EC2, EKS/ECS, S3, RDS/DynamoDB, IAM)
Docker and Kubernetes (production environments)
Infrastructure as Code – Terraform (preferred)
CI/CD & Operations
GitHub Actions / GitLab CI / Jenkins
Blue-green / canary deployments and rollback strategies
Backend Awareness
Node.js (working understanding of Express / NestJS)
Golang (microservices, concurrency, profiling basics)
Observability
Prometheus, Grafana
Centralised logging (ELK / OpenSearch / Loki)
Distributed tracing (Jaeger / OpenTelemetry)
Data, Cache & Messaging
Redis (cache and/or queues)
Kafka / SQS / RabbitMQ (deep experience with at least one)
MongoDB (understanding of NoSQL databases; experience with Atlas offerings is a bonus)
Security & Reliability
Secrets management (Vault / AWS Secrets Manager)
IAM and least-privilege access design
Production incident handling experience
Personality & Mindset
Strong ownership and accountability for platform reliability.
Comfortable identifying what is wrong and explaining how to fix it.
Calm and structured during incidents and high-pressure situations.
Clear communication with engineers and non-technical stakeholders.
Systems thinker who understands end-to-end impact, not just isolated components.
Pragmatic, data-driven, and collaborative.
Reach out to: sushim / shirin
Position Requirements
10+ years of work experience