Site Reliability Engineer, Dallas area, Texas, USA (IT/Tech)

Overview

We are seeking a Site Reliability Engineer who will be responsible for ensuring the reliability, scalability, and performance of critical infrastructure systems and driving best practices in operational excellence. You will take a leadership role in implementing automation, optimizing system monitoring, troubleshooting complex issues, and enhancing the security and reliability of our IT systems. The ideal candidate will be a technical leader with deep expertise in managing large-scale infrastructure and a strong drive for improving operational efficiency.

Office hours and location

Full-time, Monday to Friday
Location: Argyle, TX (onsite)

Duties and Responsibilities

Take ownership of the design, implementation, and maintenance of the organization’s infrastructure and systems, ensuring their optimal performance and reliability. This includes managing self-hosted systems, performing routine server maintenance, and conducting system upgrades.
Lead and manage the resolution of complex incidents, ensuring minimal downtime and disruption. Drive the incident response process, including root cause analysis, post‑mortem reporting, and implementing preventative measures to improve system reliability.
Implement and optimize monitoring and alerting across the infrastructure. Proactively monitor systems and services to identify and resolve potential issues before they affect end‑users, ensuring high levels of system uptime and availability.
Drive automation initiatives to streamline operational tasks, including server provisioning, configuration management, and incident response processes. Lead the effort to build and enhance internal automation tools and scripts to improve operational efficiency.
Collaborate with the security team to improve security posture by enforcing security best practices, ensuring secure access management, and automating security patching and vulnerability management. Contribute to compliance initiatives by ensuring systems meet industry standards and regulations.
Work closely with development teams, DevOps, and other cross‑functional teams to implement and maintain reliable systems and ensure the smooth deployment of applications. Participate in the design and development of highly reliable, scalable, and fault‑tolerant systems.
Oversee and support the provisioning, de‑provisioning, and management of user accounts and SaaS platforms. Work closely with IT and HR teams to ensure seamless access management during onboarding and offboarding.
Design, implement, and test disaster recovery and backup processes to ensure data integrity and availability. Continuously improve these processes to minimize recovery time objectives (RTO) and recovery point objectives (RPO).
Create and maintain comprehensive documentation for systems, processes, and incident response procedures. Share best practices and lessons learned through knowledge‑sharing sessions, ensuring high levels of collaboration across teams.
Stay current with the latest technologies and best practices in cloud computing, containerization, automation, and site reliability engineering. Lead initiatives to evaluate, implement, and integrate new technologies that can improve system performance and reduce operational costs.

Required Experience and Job Qualifications

3+ years of experience in Site Reliability Engineering, DevOps, or IT operations with significant hands‑on experience in managing large‑scale systems.
Proficient with Linux/Unix systems, particularly RHEL and Ubuntu‑like distributions.
Experience with cloud infrastructure, primarily AWS, and the ability to manage and troubleshoot cloud‑based environments.
Strong understanding of networking concepts, such as TCP/IP, DNS, and HTTP/HTTPS, and hands‑on experience troubleshooting network connectivity issues.
Hands‑on experience with tools like Apache, Nginx, MySQL, PostgreSQL, Docker, Kubernetes, Zabbix, Ansible, Puppet, and Terraform.
Experience with infrastructure automation tools (e.g., Ansible, Puppet, Chef, Terraform) and configuration management best practices.
Proficient in scripting and automation (e.g., Python, Bash, or Ruby).
Experience with monitoring tools (e.g.,…

