Site Reliability Engineer (SRE)
Job in Hudson, Summit County, Ohio, 44236, USA
Listed on 2026-03-01
Listing for: Open Practice
Full Time
Job specializations:
IT/Tech
Cloud Computing, Systems Engineer, SRE/Site Reliability, IT Support
Location: On-site / Hybrid
Employment Type: Full-time
Experience Level: Mid–Senior
We’re looking for a Site Reliability Engineer to help operate and scale a multi-tenant, web-based application running on AWS. This is a hands‑on role for someone who’s comfortable jumping into an already‑established architecture, making incremental improvements, and solving real production problems.
You’ll work closely with engineering and product teams to keep our platform reliable, performant, and scalable as customer usage grows. This is not a “greenfield rewrite” role — we need someone scrappy, practical, and effective inside real-world constraints.
What You’ll Do
- Ensure the reliability, availability, and performance of a multi-tenant production system
- Scale and operate AWS-based infrastructure supporting a Java web application
- Monitor and troubleshoot issues across application, database, cache, and data warehouse layers
- Improve observability through metrics, logging, and alerting
- Participate in on‑call rotations and lead incident response and root cause analysis
- Identify performance bottlenecks and scaling limits in a shared‑tenant environment
- Automate operational tasks and reduce toil where it matters most
- Work within existing frameworks and tooling to make systems safer and more scalable
- Partner with developers to improve deployments, capacity planning, and failure handling
- Implement automated load and fuzz testing
- Define key service level objectives (SLOs)
Tech Stack
- AWS (EC2, ECS, RDS, ElastiCache, Redshift, and related services)
- Java‑based web applications
- MySQL (performance tuning, scaling, reliability)
- Amazon ElastiCache (Redis/Memcached)
- Amazon Redshift
- Monitoring and alerting tools (Graphite, Grafana, CloudWatch)
Requirements
- 3+ years of experience in SRE, DevOps, or production operations roles
- Strong understanding of AWS infrastructure and cloud‑native scaling patterns
- Experience supporting Java applications in production
- Solid knowledge of MySQL performance, replication, and scaling strategies
- Experience operating cache layers and data stores at scale
- Understanding of multi‑tenant architectures, including isolation, noisy‑neighbor issues, and capacity planning
- Strong Linux fundamentals and troubleshooting skills
- Ability to stay calm, think clearly, and prioritize during incidents
- A “get‑things‑done” mindset — pragmatic, resourceful, and comfortable with imperfect systems
Nice to Have
- Experience scaling multi‑tenant SaaS platforms
- Familiarity with Redshift performance tuning and data workflows
- Infrastructure‑as‑code experience (Terraform)
- CI/CD and GitLab pipeline experience
- Prior ownership of on‑call rotations and incident processes
- Experience improving reliability without large architectural rewrites
What We Value
- Engineers who work within reality, not just ideal architectures
- Incremental improvements that reduce risk and improve uptime
- Clear communication during incidents
- Ownership, accountability, and practical problem‑solving