Job Description & How to Apply Below
Experience: 8+ years
Solid understanding of Google SRE principles and practices.
Hands-on experience implementing SLIs, SLOs, and error budgets.
Hands-on automation experience, preferably in Python (Observability as
Code).
Expertise in incident management, postmortems, and reliability improvement
cycles.
Experience with monitoring and observability tools (e.g., Prometheus, Grafana,
New Relic, Datadog, OpenTelemetry).
Strong expertise in logging, tracing, and metrics-based troubleshooting.
Ability to design alerts that reflect customer and business impact.
Hands-on experience with Linux, Bash, Git, CI/CD, Docker, and Kubernetes (K8s).
Experience with Infrastructure as Code (Terraform, ARM, CloudFormation,
etc.).
Familiarity with CI/CD pipelines and deployment automation.
Strong focus on eliminating toil through automation.
Good understanding of AWS cloud concepts.
Good SQL knowledge, useful for writing NRQL (New Relic Query Language) queries.
Fair understanding of REST APIs and GraphQL.
Good understanding of networking concepts such as CDN, DNS, API gateways, and traffic routing.
Good understanding of business continuity planning (BCP) and disaster recovery
activities.
Strong analytical and troubleshooting skills under pressure.
Ability to diagnose complex production issues across multiple system layers.
Comfortable making data-driven decisions during incidents.