SRE - Sr. Reliability Engineering
Listed on 2026-01-06
IT/Tech
Systems Engineer, Cloud Computing
Job Description:
Top Skills:
• Cloud Infrastructure & Automation
  ◦ Design and manage scalable systems on GCP.
  ◦ Use Infrastructure as Code (IaC) tools such as Terraform.
• Performance & Reliability Engineering
  ◦ Experience in capacity planning, performance tuning, and predictive analytics.
  ◦ Knowledge of distributed systems and high-availability architectures.
• Monitoring & Observability
  ◦ Proficiency with APM tools like Dynatrace, New Relic, or AppDynamics.
  ◦ Proactive incident detection.
• Programming & Scripting
  ◦ Strong coding skills in Python, Go, or Java for automation and reliability improvements.
Experience Required:
4+ years of experience in the specific skill set (SRE)
Overall IT experience of 6-8+ years
Job Description
As we expand our customer deployments to build software that improves our customer's experience, we are seeking an experienced SRE to bring fresh ideas and demonstrate a unique and informed viewpoint to our business. The ideal candidate will be someone who enjoys collaborating with a cross-functional team to develop real-world solutions and positive user experiences at every interaction.
As an SRE, you will work with leading-edge technologies both on-premises and in the cloud. Your mindset will center on automation, resiliency, and superior software quality and performance. You will be an expert resource in high-performance software and operational design patterns, and you will support development, architecture, and operational teams from start to finish to create scalable and resilient solutions.
Responsibilities
• Support development, architecture, and operational teams on performance- and capacity-related issues associated with complex multi-tier distributed platforms during the SDLC and post-production.
• Support and coordinate new Build/Run initiatives prior to production and assure product readiness, including infrastructure recommendations, software/script development, load/chaos testing, optimization, SLO definition, capacity planning, and observability/alerting.
• Review services and applications to identify bottlenecks and opportunities to improve performance and scale.
• Perform POCs for newer technologies and architectural patterns to help teams make informed decisions.
• Define new SLOs for services and applications to meet non-functional SLA requirements defined by the business.
• Work to reduce ongoing runtime costs through efficient throttling, queuing, pooling, and autoscaling across application and infrastructure tiers.
• Proactively identify anomalies and opportunities in production platforms to achieve greater performance and scale, and recommend them to impacted teams for future planning.
• Define performance quality gates and support canary deployment CI/CD scenarios around performance for teams.
Required Skills and Qualifications
• Experience supporting and troubleshooting large-scale, multi-tier distributed on-premises and cloud applications.
• Experience architecting, developing, and setting up new infrastructure solutions for GCP (leveraging Terraform) and for on-premises applications.
• Experience in capacity planning or performance engineering, leveraging predictive analytics to determine the scaling patterns platforms need.
• Experience programming in languages such as Java, Node.js, Go, Python, and JavaScript.
• Experience in web development and/or web service creation.
• Demonstrable cross-functional knowledge of systems, storage, networking, security, and databases.
• Experience using APM tools such as Dynatrace, New Relic, or AppDynamics.
Preferred Qualifications
• Experienced architect in GCP, Kubernetes, and serverless technologies.
• Experience collaborating with development teams to define infrastructure requirements and implement scalable, resilient cloud architecture using Terraform.
• Experience in migrating legacy applications to cloud-native architecture.
• Strong understanding of the Spring Framework.
• Experienced in performance tracing/profiling using Google Developer Tools.
• Experience with SQL and database scaling/replication schemes.
• Familiar with tools used for front-end analysis such as Lighthouse, Page Speed Metrics, WebPageTest, GTmetrix, and browser developer tools.
• Experience using MongoDB/Atlas, Oracle OCI, Postgres, and GCP Cloud SQL.
• Experience with AngularJS, React, and Vue.
• Experience tuning and optimizing runtime environments for Java (JVMs), Node.js, and Python for the best performance.
• Experience with DevOps/quality gating concepts, canary deployments, and automation associated with CI/CD deployments.
• Experience in Enterprise Architecture integration patterns and domain-model-driven design addressing proper separation of concerns for applications, microservices, and core web services.
• Experience using observability tools like Dynatrace or any APM tool is a must.
• Experience using cloud profiling tools and JVM tools like JProfiler and Java Flight Recorder.
• Experience in testing methodologies and metrics using tools like JMeter, NeoLoad, LoadRunner, or similar.
• Systematic problem-solving approach, coupled with strong…