
DevOps/MLOps and Platform Engineer

Job in Bengaluru, 560001, Bangalore, Karnataka, India
Listing for: apna
Full Time position
Listed on 2026-02-09
Job specializations:
  • IT/Tech
    Systems Engineer, Cloud Computing, SRE/Site Reliability, Network Engineer

Job Description

Software Engineer (SDE-2) – DevOps, SRE & MLOps Platform Engineering

Location: Bengaluru
Employment Type: Full-time
Team: Platform Engineering / Reliability

About Blue Machines

Blue Machines powers large-scale, real-time Voice AI platforms and Agentic Workflows for global enterprises across BFSI, Healthcare, HRTech and customer experience domains.
Built and scaled from India, our platform has processed 14.5M+ minutes of production-grade AI agent conversations, operating latency-sensitive, always-on voice systems across geographies.

About the Role

We are hiring a hands-on DevOps / SRE engineer who owns platform reliability, observability and automation and grows into MLOps and AI platform engineering.
This role focuses on designing, operating and evolving the infrastructure behind real-time Voice AI systems. You work directly on production systems at global scale, driving uptime, performance and resilience.

Key Responsibilities

Platform Reliability & SRE

Own 99.9%+ platform uptime for real-time Voice AI workloads.
Participate in on-call rotations, incident response and post-incident reviews.
Lead root cause analysis (RCA) and drive permanent reliability improvements.
Design and implement self-healing systems using automation, retries, circuit breakers and failover strategies.

Kubernetes & Cloud Infrastructure

Design, operate and scale Kubernetes clusters in public cloud environments.
Work with managed Kubernetes platforms such as GKE and apply cloud-native best practices.
Implement auto-scaling strategies (HPA, VPA, node pools, GPU workloads).
Manage infrastructure using Infrastructure as Code (Terraform).
Optimize infrastructure for performance, reliability and cost efficiency.

Observability & Incident Intelligence

Build and maintain monitoring, logging and alerting systems using Prometheus, Grafana, Loki and OpenTelemetry.
Define SLIs, SLOs and error budgets for platform and AI workloads.
Drive signal-based alerting to reduce noise and improve response quality.
Implement anomaly detection and predictive alerting for infrastructure and AI pipelines.

CI/CD & Platform Automation

Design and maintain CI/CD pipelines for services and infrastructure.
Build internal automation tooling for:
  • Progressive and canary deployments
  • Auto-scaling and capacity planning
  • Faster incident diagnosis and recovery
Enable self-service DevOps workflows for engineering teams.

MLOps & AI Platform Reliability

Own reliability and performance of STT, TTS and LLM inference pipelines.
Design provider routing, failover and SLA enforcement mechanisms.
Deploy, version and roll back AI models and inference services.
Monitor inference latency, quality and drift in production systems.
Operate GPU-backed inference workloads where applicable.

Security, Compliance & Resilience

Enforce DevSecOps practices across build and deploy pipelines.
Implement network policies, encryption, secrets management and access controls.
Drive disaster recovery, backup strategies and resilience testing.
Contribute to SOC2 / ISO compliance and audits.

Collaboration & Engineering Excellence

Partner with backend, AI and platform teams on architecture and reliability.
Influence system design through a reliability-first mindset.
Mentor junior engineers and raise the overall bar for operational excellence.

Qualifications

Must-Have

3–6 years of experience in DevOps, SRE or Platform Engineering roles.
Strong hands-on experience with Kubernetes and Docker in production environments.
Familiarity with public cloud platforms and managed Kubernetes services (such as GKE).
Strong understanding of distributed systems and production debugging.
Hands-on experience with observability systems.
Proficiency with Infrastructure as Code (Terraform).
Strong incident ownership and communication skills.

Good-to-Have

Experience with MLOps or AI inference platforms.
Familiarity with LLM pipelines, real-time streaming or telephony systems.
Experience operating GPU workloads.
Knowledge of AIOps, anomaly detection or intelligent alerting.
Cloud cost optimization experience.

Why Blue Machines

Build global-scale AI infrastructure from India.
Operate real-time Voice AI systems with 14.5M+ minutes in production.
Work on low-latency, high-reliability platforms.
Grow from DevOps/SRE into MLOps and AI platform engineering.
High ownership, deep technical impact and real production scale.