We are seeking an experienced Site Reliability Engineer / Platform Engineer to join our team and help build and maintain a resilient, scalable infrastructure supporting our applications across multiple cloud providers. In this role, you will design and implement infrastructure solutions, automate operational processes, and work closely with development teams to ensure reliable, efficient systems that scale with our business.
What You'll Do:
Design, build, and maintain infrastructure across AWS, GCP, and Azure using Infrastructure as Code (IaC) principles.
Implement and optimize CI/CD pipelines using tools like Argo and CircleCI to enable rapid, reliable deployments.
Manage and scale Kubernetes clusters in production environments, ensuring high availability and optimal resource utilization.
Administer and optimize cloud databases including MongoDB, Redis, RDS, and other data stores for performance and reliability.
Develop monitoring, alerting, and observability solutions to identify and resolve issues before they impact users.
Automate routine operational tasks to reduce manual toil and improve system reliability.
Conduct incident response and post-mortem analysis to drive continuous improvement.
Collaborate with development teams to design systems with reliability, scalability, and operational excellence in mind.
Document infrastructure architecture, runbooks, and operational procedures.
Evaluate and implement new tools and technologies to improve platform capabilities.
What You'll Bring:
3+ years of experience in Site Reliability Engineering, DevOps, or Platform Engineering.
Strong hands-on experience with at least two major cloud providers (AWS, GCP, Azure).
Proficiency with Kubernetes for container orchestration and management.
Demonstrated expertise with IaC tools (Terraform, CloudFormation, Pulumi, or similar).
Experience with CI/CD platforms, particularly Argo and/or CircleCI.
Solid understanding of database technologies including MongoDB, Redis, and relational databases (RDS).
Proficiency in at least one programming or scripting language (Python, Go, Bash, TypeScript, etc.).
Experience with monitoring and observability tools (e.g., Prometheus, Grafana, ELK, CloudWatch).
Experience implementing and managing OpenTelemetry (OTel) for distributed tracing, metrics, and logging.
Strong understanding of networking, security, and infrastructure best practices.
Nice to Have:
Experience managing multi-cloud or hybrid cloud environments.
Familiarity with service mesh technologies (Istio, Linkerd).
Knowledge of security hardening and compliance in cloud environments.
Experience with cost optimization in cloud infrastructure.
Contributions to open-source infrastructure or DevOps projects.
Certifications from major cloud providers.