Software Engineer, Tooling - Principal Level. San Francisco area, California, USA. Software Development.

Position: Software Engineer (Tooling - Principal Level)

As a Software Engineer on the SWEET (Software Engineering, Engagement and Tooling) team at MuleSoft, you will be part of a high-impact team focused on architecting, building, and scaling the infrastructure, tools, and platforms that improve the resiliency, reliability, performance, and scalability of distributed systems running on the MuleSoft Anypoint Platform. This is a software engineering-driven role, where you'll write production-grade code to automate operations, enhance observability, and strengthen service resilience, especially in high-security environments such as FedRAMP and Protected B.

The candidate must be a U.S. citizen (U.S.-born or naturalized) working on U.S. soil, must not hold dual citizenship, and must be able to meet the customer and government screening standards applicable to this role.

Your work will span the entire stack: from shaping engineering practices and building proactive failure-prevention mechanisms to streamlining deployment pipelines and improving the end-to-end reliability of mission-critical services. As stewards of observability, incident management, release automation, and reliability engineering, our team’s mission is to embed resiliency and reliability into every layer of the system and consistently exceed industry standards for uptime, latency, and performance.

What You’ll Be Doing

Engineering Resiliency and Reliability:
Design and develop systems, libraries, and tools that strengthen the resiliency and reliability of distributed services running on the MuleSoft Anypoint Platform.
Observability by Design:
Develop and extend monitoring, logging, and alerting capabilities using industry-standard observability platforms (e.g., metrics, tracing, and log aggregation tools) to ensure issues are detected and diagnosed before they impact customers.
Automation at Scale:
Write production-grade code in Python, Go, or similar languages to automate operational tasks, scale deployment pipelines, and implement self-healing systems.
Incident Response & Prevention:
Participate in on-call rotations, drive root cause analysis, and deliver software-based solutions that prevent recurrence and reduce mean time to recovery (MTTR).
Platform and Infrastructure Development:
Build internal platforms, shared APIs, and systems that enhance developer velocity while improving overall system resilience and operability.
CI/CD and Deployment Engineering:
Optimize and evolve our CI/CD pipelines using Jenkins, Spinnaker, and infrastructure-as-code tools such as Terraform and Kubernetes to enable safe and frequent delivery.
Security and Compliance as Code:
Develop and maintain automated solutions to meet FedRAMP, Protected B, and other regulatory requirements—integrating security and compliance directly into deployment workflows.
Collaborative Reliability Advocacy:
Work closely with product engineers, platform teams, and security stakeholders to influence architectural decisions and bake reliability into all layers of the stack.
Runbooks and Design Documentation:
Create and maintain high-quality documentation for systems, processes, and playbooks to promote operational excellence and team scalability.

Requirements:

10+ years of experience in Software Engineering, with a particular focus on developing production-quality, maintainable, and testable code for infrastructure and platform automation.
Proven proficiency in coding with Java, Python, Go, and Bash.
Hands‑on experience with infrastructure as code, CI/CD pipelines, and deployment automation using tools like Terraform, Jenkins, and Spinnaker.
Proven experience architecting, developing, and operating systems in cloud-native environments (AWS) and managing containerized workloads with Kubernetes.
Strong understanding of observability engineering, including instrumentation, metrics, logging, and distributed tracing, with experience in OpenTelemetry, Grafana, Splunk, Sumo Logic, or similar platforms.
Solid knowledge of distributed systems, network protocols (TCP/IP, DNS, HTTP, TLS), and API design standards (REST, RAML, OAS).
Demonstrated ability to diagnose complex system issues, design for fault tolerance and high availability, and continuously…

