Location: Bengaluru
Site Reliability Engineering (SRE) is an engineering discipline that combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. At Goldman Sachs, SRE is responsible for the availability and reliability of our firm's most critical platform applications and services, and ensures they meet the requirements of our internal and external users. We look for engineers who are motivated to collaborate with our businesses to build and run sustainable production systems that can evolve and adapt to changes in our fast-paced, global business environment.
We are seeking a highly skilled and experienced Senior Lead/Manager – Application Site Reliability Engineer (SRE) to lead and mentor a team of junior SREs while driving the reliability, scalability, and performance of our mission-critical applications. The ideal candidate will have a strong background in SRE practices, software engineering, and infrastructure management, coupled with proven leadership abilities. You will be responsible for ensuring the stability of our systems, implementing best practices and fostering a culture of continuous improvement within the team.
Additionally, you will play a key role in reviewing, planning and setting up reliable UAT environments to ensure seamless testing and deployment processes.
Responsibilities
Leadership & Team Management:
Lead and mentor a team of junior SREs, fostering a culture of learning and technical excellence
Define and drive best practices for reliability and automation
Set clear goals and expectations for the team, ensuring alignment with organizational objectives
Develop training plans and career growth opportunities for the SRE team
Partner with cross-functional teams to define and execute the SRE roadmap, aligning with business goals
Champion the adoption of SRE principles across the organization, promoting a shift-left mindset
Technical Responsibilities:
Design, implement and maintain scalable, reliable and efficient systems to support application infrastructure
Develop and improve observability strategies using monitoring, logging and tracing tools (e.g. Prometheus, Grafana, Splunk)
Optimize CI/CD pipelines and ensure high system reliability during releases
Review, plan and set up reliable UAT environments that mirror production systems, ensuring consistency and stability for testing purposes
Ensure UAT environments are properly versioned, documented and maintained to support seamless testing and deployment workflows
Improve application performance by identifying bottlenecks, optimizing code and enhancing infrastructure configurations
Collaborate with development teams to build SRE principles into software architecture and infrastructure design
Conduct performance testing, profiling and tuning to ensure applications meet or exceed performance benchmarks
Collaborate with development and product teams to prioritize and implement performance improvements.
Adhere to and drive the incident management process and support a blameless post-mortem culture.
SKILLS AND EXPERIENCE WE ARE LOOKING FOR
12+ years in Site Reliability Engineering, DevOps or related fields, with at least 3 years in a leadership or management role.
REQUIRED QUALIFICATIONS
Technical:
MS degree in Computer Science or a related technical field involving coding and/or systems engineering.
Hands-on experience with coding, debugging, deploying and optimizing code, as well as automation in Java, Unix Shell scripting, complex SQL queries and stored procedures.
Proficiency in cloud platforms and containerization technologies (e.g. Docker, Kubernetes)
Experience with CI/CD tools (e.g. Jenkins, GitLab, Maven, Copilot)
Hands-on experience with and good understanding of algorithms, data structures and software design.
Experience in setting up and managing UAT environments
Hands-on experience with UNIX operating system internals and/or networking.
Leadership:
Excellent communication and interpersonal skills, with the ability to work effectively with both technical and non-technical stakeholders
Strong problem-solving and decision-making abilities
Focus on delivering results
Ability to lead and mentor technical teams
PREFERRED QUALIFICATIONS
Hands-on experience with coding, debugging, deploying and optimizing code, as well as automation
Experience with distributed systems design, maintenance, and troubleshooting
Ability to work independently with little supervision.
Coding beyond simple scripts and solving novel problems from first principles.
Working knowledge of Microsoft Copilot