Senior Platform Engineer – USDS
Listed on 2025-12-31
IT/Tech
Systems Engineer, Data Engineer
Responsibilities
About the Team
The Cyber Defense & Engineering team runs and operates security infrastructure, platforms, and technologies, and supports cross-functional teams in protecting our users, products, and infrastructure. The team is responsible for enhancing security tools and identifying vulnerabilities, with a specific focus on content assurance and the application of large language models (LLMs). You’ll collaborate cross-functionally with partners inside and outside TikTok to strengthen the security of our products and users, helping to establish TikTok as the most trusted platform.
In order to enhance collaboration and cross-functional partnerships, at this time, our organization follows a hybrid work schedule that requires employees to work in the office 3 days a week, or as directed by their manager/department. We regularly review our hybrid work model, and the specific requirements may change at any time.
About the Role
We are seeking a hands‑on Platform Engineer to architect, build, and operate the greenfield on‑premise infrastructure that powers our next‑generation AI initiatives. This is a unique opportunity to build an AI‑native platform from the ground up. You will bridge the gap between classic security services such as SPIFFE/SPIRE, which enables all our service‑to‑service communication, and modern AI workloads, leveraging a stack that includes Kubernetes, GPUs, vector databases, and streaming technologies like Kafka and Flink.
You will solve complex challenges in distributed computing, network latency, and automated provisioning to foster a culture of innovation and velocity. You will also be responsible for designing multi‑tenant cloud architecture.
- Architect and operate a highly available, on‑premise, Kubernetes‑based GPU compute cluster. You will manage the scheduling and orchestration of high‑performance workloads.
- Build robust CI/CD pipelines specifically for LLM applications.
- Infrastructure as Code (IaC): Lead the design and implementation of IaC (using Terraform, Ansible or Saltstack) to fully automate the provisioning of bare‑metal servers and the network and storage layers, ensuring the environment is reproducible and idempotent.
- Lead and perform hands‑on technical work, including architecture design and code development for an on‑premise, highly scalable, and parallelized infrastructure. The role includes developing internal tools to manage the entire lifecycle of a large‑scale RAG pipeline.
- Architect, implement, and manage a high‑performance compute cluster for LLM workloads. This involves the selection and configuration of specialized hardware like GPUs, as well as the design of a robust network fabric to facilitate efficient inter‑node communication for parallel processing.
- Implement security best practices for a private data centre environment. This includes configuring network firewalls, managing access controls, and encrypting data at rest and in transit.
- Establish comprehensive monitoring and alerting systems to track the health and performance of the compute cluster and LLM workloads. This involves analysing metrics related to GPU utilisation, memory usage, network throughput, and model inference latency. You will proactively resolve performance issues to enhance platform reliability and operational support for internal teams.
- Collaborate with internal stakeholders to optimise resource utilisation and improve the platform’s efficiency. You’ll work closely with data scientists and machine learning engineers to understand their compute needs and ensure the infrastructure is optimised for their specific workloads.
Minimum Qualifications:
- Bachelor’s degree in Computer Science, Information Technology, or a related field, with 5 years of experience in platform, systems, or infrastructure engineering.
- Proven expertise in infrastructure automation using tools like Terraform, Ansible or Saltstack, with strong hands‑on experience in automating deployments and managing bare‑metal hardware and virtual machines.
- Deep experience with on‑premises infrastructure, with a solid understanding of large‑scale data processing, distributed computing and other…