Location: Bengaluru
Job Description
Software Engineer (SDE-2) – DevOps, SRE & MLOps Platform Engineering
Employment Type: Full-time
Team: Platform Engineering / Reliability
About Blue Machines
Blue Machines powers large-scale, real-time Voice AI platforms and Agentic Workflows for global enterprises across BFSI, Healthcare, HRTech and customer experience domains.
Built and scaled from India, our platform has processed 14.5M+ minutes of production-grade AI agent conversations, operating latency-sensitive, always-on voice systems across geographies.
About the Role
We are hiring a hands-on DevOps / SRE engineer who will own platform reliability, observability and automation, and grow into MLOps and AI platform engineering.
This role focuses on designing, operating and evolving the infrastructure behind real-time Voice AI systems. You will work directly on production systems at global scale, driving uptime, performance and resilience.
Key Responsibilities
Platform Reliability & SRE
Own 99.9%+ platform uptime for real-time Voice AI workloads.
Participate in on-call rotations, incident response and post-incident reviews.
Lead root cause analysis (RCA) and drive permanent reliability improvements.
Design and implement self-healing systems using automation, retries, circuit breakers and failover strategies.
Kubernetes & Cloud Infrastructure
Design, operate and scale Kubernetes clusters in public cloud environments.
Work with managed Kubernetes platforms such as GKE, and apply cloud-native best practices.
Implement auto-scaling strategies (HPA, VPA, node pools, GPU workloads).
Manage infrastructure using Infrastructure as Code (Terraform).
Optimize infrastructure for performance, reliability and cost efficiency.
Observability & Incident Intelligence
Build and maintain monitoring, logging and alerting systems using Prometheus, Grafana, Loki and OpenTelemetry.
Define SLIs, SLOs and error budgets for platform and AI workloads.
Drive signal-based alerting to reduce noise and improve response quality.
Implement anomaly detection and predictive alerting for infrastructure and AI pipelines.
CI/CD & Platform Automation
Design and maintain CI/CD pipelines for services and infrastructure.
Build internal automation tooling for:
Progressive and canary deployments
Auto-scaling and capacity planning
Faster incident diagnosis and recovery
Enable self-service DevOps workflows for engineering teams.
MLOps & AI Platform Reliability
Own the reliability and performance of speech-to-text (STT), text-to-speech (TTS) and LLM inference pipelines.
Design provider routing, failover and SLA enforcement mechanisms.
Deploy, version and roll back AI models and inference services.
Monitor inference latency, quality and drift in production systems.
Operate GPU-backed inference workloads where applicable.
Security, Compliance & Resilience
Enforce DevSecOps practices across build and deploy pipelines.
Implement network policies, encryption, secrets management and access controls.
Drive disaster recovery, backup strategies and resilience testing.
Contribute to SOC 2 / ISO compliance and audits.
Collaboration & Engineering Excellence
Partner with backend, AI and platform teams on architecture and reliability.
Influence system design through a reliability-first mindset.
Mentor junior engineers and raise the overall bar for operational excellence.
Qualifications
Must-Have
3–6 years of experience in DevOps, SRE or Platform Engineering roles.
Strong hands-on experience with Kubernetes and Docker in production environments.
Familiarity with public cloud platforms and managed Kubernetes services (such as GKE).
Strong understanding of distributed systems and production debugging.
Hands-on experience with observability systems.
Proficiency with Infrastructure as Code (Terraform).
Strong incident ownership and communication skills.
Good-to-Have
Experience with MLOps or AI inference platforms.
Familiarity with LLM pipelines, real-time streaming or telephony systems.
Experience operating GPU workloads.
Knowledge of AIOps, anomaly detection or intelligent alerting.
Cloud cost optimization experience.
Why Blue Machines
Build global-scale AI infrastructure from India.
Operate real-time Voice AI systems with 14.5M+ minutes in production.
Work on low-latency, high-reliability platforms.
Grow from DevOps/SRE into MLOps and AI platform engineering.
High ownership, deep technical impact and real production scale.