Principal Site Reliability Engineer
Listed on 2026-01-19
IT/Tech
Systems Engineer, Cloud Computing, SRE/Site Reliability, Data Engineer
Job Description
Our Team
Building off our Cloud momentum, Oracle has formed a new organization – Oracle Health Data, Analytics Platform. This team will focus on product development and product strategy for Oracle Health, while building out a complete platform that supports modernized, automated healthcare. This is a net-new line of business, built with an entrepreneurial spirit that promotes an energetic and creative environment. We are unencumbered and will need your contribution to make it a world-class engineering center with a focus on excellence.
Oracle Health Data, Analytics Platform has a rare opportunity to play a critical role in how Oracle Health products impact and disrupt the healthcare industry by transforming how healthcare and technology intersect.
You will have the opportunity to:
- Reach billions of people with our products & services
- Create technology that truly impacts the world
- Have an immediate impact on developing technology
- Enjoy unlimited growth potential with inspiring work
- Work with the best minds in the industry
- Enjoy working in an open, diverse, and productive environment
This role provides technical leadership for the core data platforms behind Oracle Health’s Data & Analytics Platform. As a Principal Site Reliability Engineer (SRE), you will own shared, mission-critical systems used by multiple products and teams.
You will lead the design and operation of large-scale, stateful distributed platforms, including Hadoop ecosystem components (HDFS, YARN, HBase) deployed on Oracle Big Data Service (BDS), Kafka, and Storm. These multi-tenant platforms are deployed and operated through Ansible- and Terraform-based automation and require strong architectural ownership to manage scale, change, and broad blast radius.
What You’ll Do
Platform Ownership & Technical Leadership
- Own the end-to-end reliability, scalability, and operability of shared data platforms
- Define platform standards, architectural direction, and operational guardrails
- Influence cross-team technical decisions and long-term platform strategy
- Drive long-term platform evolution and influence reliability strategy across the data ecosystem
- Lead platform architecture and design reviews
- Clearly articulate system behavior, dependencies, and failure modes
- Make principled trade-offs between reliability, performance, cost, and complexity
- Provide guidance and guardrails that enable downstream teams to use platforms safely and effectively
- Establish capacity models, scaling strategies, and operational best practices
- Design platforms that behave predictably under load, failure, and change
- Own platform lifecycle events: upgrades, expansions, decommissioning, and recovery
- Operate and evolve stateful distributed systems where data placement, replication, and recovery are critical
- Reason about failure modes such as back pressure, rebalancing, region movement, replication lag, and rolling upgrades
- Operate and maintain Kerberized platforms, including authentication, authorization, and secure service-to-service communication
- Treat security as a first-class architectural concern
- Design and evolve an Ansible- and Terraform-driven automation framework
- Treat automation as production software: versioned, reviewed, tested, and improved
- Eliminate operational toil by encoding reliability and safety into the platform
- Serve as the ultimate escalation point for complex or ambiguous incidents
- Focus on eliminating entire classes of failure, not just resolving individual issues
- Represent SRE and platform engineering in high-visibility and sensitive forums
- Communicate clearly with engineering leadership and partner teams
The team operates within the Oracle Health Data & Analytics Platform, supporting one of Oracle Health’s core products, HealtheIntent. We operate the big data and streaming infrastructure that enables downstream teams to deliver reliable customer-facing solutions at scale, while continuously improving operability and efficiency.
Required Experience
- 8 years operating large-scale,…