Site Reliability Engineer

Job in Dallas, Dallas County, Texas, 75215, USA
Listing for: Aircraft Performance Group
Full Time position
Listed on 2026-03-01
Job specializations:
  • IT/Tech
    Cloud Computing, Systems Engineer, SRE/Site Reliability, IT Project Manager
Salary/Wage Range or Industry Benchmark: 100,000 - 125,000 USD yearly
Job Description & How to Apply Below

Overview

We are seeking a Site Reliability Engineer who will be responsible for ensuring the reliability, scalability, and performance of critical infrastructure systems and driving best practices in operational excellence. You will take a leadership role in implementing automation, optimizing system monitoring, troubleshooting complex issues, and enhancing the security and reliability of our IT systems. The ideal candidate will be a technical leader with deep expertise in managing large-scale infrastructure and a strong drive for improving operational efficiency.

Office hours and location
  • Full-time, Monday to Friday
  • Location: Argyle, TX – Onsite
Duties and Responsibilities
  • Take ownership of the design, implementation, and maintenance of the organization’s infrastructure and systems, ensuring their optimal performance and reliability. This includes managing self-hosted systems, performing routine server maintenance, and conducting system upgrades.
  • Lead and manage the resolution of complex incidents, ensuring minimal downtime and disruption. Drive the incident response process, including root cause analysis, post‑mortem reporting, and implementing preventative measures to improve system reliability.
  • Implement and optimize system monitoring and alerting systems across the infrastructure. Proactively monitor systems and services to identify and resolve potential issues before they affect end‑users, ensuring high levels of system uptime and availability.
  • Drive automation initiatives to streamline operational tasks, including server provisioning, configuration management, and incident response processes. Lead the effort to build and enhance internal automation tools and scripts to improve operational efficiency.
  • Collaborate with the security team to improve security posture by enforcing security best practices, ensuring secure access management, and automating security patching and vulnerability management. Contribute to compliance initiatives by ensuring systems meet industry standards and regulations.
  • Work closely with development teams, DevOps, and other cross‑functional teams to implement and maintain reliable systems and ensure the smooth deployment of applications. Participate in the design and development of highly reliable, scalable, and fault‑tolerant systems.
  • Oversee and support the provisioning, de‑provisioning, and management of user accounts and SaaS platforms. Work closely with IT and HR teams to ensure seamless access management during onboarding and offboarding.
  • Design, implement, and test disaster recovery and backup processes to ensure data integrity and availability. Continuously improve these processes to minimize recovery time objectives (RTO) and recovery point objectives (RPO).
  • Create and maintain comprehensive documentation for systems, processes, and incident response procedures. Share best practices and lessons learned through knowledge‑sharing sessions, ensuring high levels of collaboration across teams.
  • Stay current with the latest technologies and best practices in cloud computing, containerization, automation, and site reliability engineering. Lead initiatives to evaluate, implement, and integrate new technologies that can improve system performance and reduce operational costs.
Required Experience and Job Qualifications
  • 3+ years of experience in Site Reliability Engineering, DevOps, or IT operations with significant hands‑on experience in managing large‑scale systems.
  • Proficient with Linux/Unix systems, particularly RHEL and Ubuntu‑like distributions.
  • Experience with cloud infrastructure, primarily AWS, and the ability to manage and troubleshoot cloud‑based environments.
  • Strong understanding of networking concepts, such as TCP/IP, DNS, and HTTP/HTTPS, and hands‑on experience troubleshooting network connectivity issues.
  • Hands‑on experience with tools like Apache, Nginx, MySQL, PostgreSQL, Docker, Kubernetes, Zabbix, Ansible, Puppet, and Terraform.
  • Experience with infrastructure automation tools (e.g., Ansible, Puppet, Chef, Terraform) and configuration management best practices.
  • Proficient in scripting and automation (e.g., Python, Bash, or Ruby).
  • Experience with monitoring tools (e.g.,…