Site Reliability Engineer, New York, New York, USA (IT/Tech)

Location: New York

We are seeking a highly skilled Site Reliability Engineer (SRE) to join our Infrastructure Management team. The ideal candidate will be responsible for automating processes, enhancing system reliability, and reducing operational toil through innovative solutions. This role requires a strong foundation in scripting and automation tools, with a focus on creating self‑healing systems that ensure optimal performance and availability.

Responsibilities

Design, implement, and maintain automation frameworks to improve system reliability and performance.
Develop and maintain Bash (shell) and Python scripts to automate routine tasks and processes.
Utilize Ansible for configuration management and deployment automation.
Implement auto‑healing mechanisms to proactively address system failures and reduce downtime.
Collaborate with development and operations teams to identify and eliminate toil in existing processes.
Monitor system performance and reliability metrics, providing insights and recommendations for improvements.
Participate in on‑call rotations to support production systems and respond to incidents as needed.
Document processes, procedures, and best practices to ensure knowledge sharing within the team.
Stay current with industry trends and emerging technologies to continuously enhance our infrastructure capabilities.

Mandatory Skills

Proven expertise in Site Reliability Engineering (SRE) principles and practices.
Strong scripting skills in Bash (shell) and Python.
Experience with automation tools, particularly Ansible.
Solid understanding of system architecture, networking, and cloud technologies.
Ability to troubleshoot complex systems and provide effective solutions under pressure.
Excellent communication and collaboration skills, with a focus on teamwork.

Preferred Skills

Familiarity with containerization technologies such as Docker and orchestration tools like Kubernetes.
Experience with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack).
Knowledge of CI/CD pipelines and DevOps practices.
Understanding of security best practices in infrastructure management.

Qualifications

Bachelor's degree in Computer Science, Information Technology, or a related field.
Relevant certifications in cloud technologies, automation, or SRE are a plus.
Demonstrated ability to work in a fast‑paced, dynamic environment.
