Evaluations Engineer
Job in City Of London, Central London, Greater London, England, UK
Listed on 2026-01-10
Listing for: COL Limited
Full Time position
Job specializations:
- Software Development: AI Engineer, Data Scientist, Machine Learning / ML Engineer
Job Description & How to Apply Below
Overview
Applications deadline:
We're accepting applications until 03 January 2026. We encourage early submissions and will start interviews in December 2025.
We’re looking for Evaluation Engineers who will run and own “evaluation campaigns” (pre-deployment testing for unreleased frontier models), build out our evaluation infrastructure, and automate the evals pipeline.
You will get to work with frontier labs like OpenAI, Anthropic, and Google DeepMind, and be among the first to interact with new models.
The ideal candidate loves rigorously testing frontier AI models, and enjoys building efficient pipelines and automating them.
Key Responsibilities
- Run and own "evaluation campaigns": We run an evals campaign approximately every two weeks, with thousands of runs across hundreds of distinct environments. Our typical workflow for this process includes:
  - running all of our evaluations at scale;
  - going through the resulting transcripts quickly, using LLM assistance to find the most interesting evidence for the capabilities and propensities of the model (a toy sketch of this kind of LLM-assisted triage follows this list);
  - finding novel AI behaviors that no one else has observed before, e.g. the non-standard language described in our anti-scheming paper;
  - diving deeper into this evidence and running targeted follow-up experiments;
  - compiling the most interesting pieces of evidence and sharing them with the AI developer;
  - engaging with feedback and answering questions from the AI developer.
- Automate the evaluation pipeline: by improving our infrastructure, building more efficient processes, building and improving agentic workflows that quickly scan the results and provide preliminary conclusions, and more. We already use automation across all parts of the pipeline, i.e. building, running, and analyzing the evals. This includes both classic pipeline automation and LLM-based workflows.
- Improve our evaluations: by building new and better evaluations for frontier risks, or by incorporating publicly available evaluations.
- Develop a larger vision for what the best possible evaluation pipeline should look like a year from now.
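Purely as an illustration of the transcript-triage idea above, here is a minimal sketch of using an LLM to pre-screen a single transcript. The client library, model name, prompt, and transcript format are assumptions for the example, not a description of our actual pipeline.

```python
# Illustrative only: one way to have an LLM pre-screen eval transcripts for
# "interesting" behavior before a human reads them.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

TRIAGE_PROMPT = (
    "You are reviewing a transcript from an AI evaluation run. "
    "In two sentences, say whether the model showed any surprising capability "
    "or propensity (e.g. deception, unusual language), and rate how interesting "
    "it is for a human reviewer on a 1-5 scale."
)

def triage(transcript: str) -> str:
    """Return a short LLM-written triage note for one transcript."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": TRIAGE_PROMPT},
            {"role": "user", "content": transcript},
        ],
    )
    return response.choices[0].message.content

# Usage sketch: run triage() over thousands of transcripts, sort by the LLM's
# rating, and send only the most interesting ones to a human for follow-up.
```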
What We're Looking For
- We don't require a formal background or industry experience and welcome self-taught candidates.
- Software engineering skills: Our entire stack uses Python. We're looking for candidates with strong software engineering experience. Ideally, you have experience shipping and maintaining production Python code, and know how to factor messy problems into clean abstractions that others can use and extend.
- Process optimisation: You always try to improve workflows. Pre-deployment evaluations are very fast-paced, so ideally you love shaving friction off your workflows wherever possible.
- Data Analysis & Pattern Recognition: You can extract signal from large, messy datasets. You're comfortable with quantitative analysis and know when qualitative assessment is more appropriate. You can identify anomalies and unexpected model behaviors.
- Writing and communication: You succinctly convey qualitative and quantitative findings to technical and non-technical audiences.
- AI power-user: You are curious about the capabilities and propensities of frontier AI models. You have experience using different models, know which ones to use for which tasks, know when not to use AI, and always experiment with new AI workflows.
- (Bonus) We use Inspect as our primary evals framework, and we value experience with it (a minimal illustrative Inspect task is sketched after this list).
- We want to emphasize that people who feel they don’t fulfill all of these characteristics but think they would be a good fit for the position, nonetheless, are strongly encouraged to apply. We believe that excellent candidates can come from a variety of backgrounds and are excited to give you opportunities to shine.
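For candidates unfamiliar with Inspect, a minimal task looks roughly like the sketch below, assuming a recent inspect_ai release. The dataset, scorer choice, and model string are placeholders for illustration, not our actual evaluations.

```python
# Minimal illustrative Inspect task.
from inspect_ai import Task, task, eval
from inspect_ai.dataset import Sample
from inspect_ai.scorer import includes
from inspect_ai.solver import generate

@task
def toy_capability_check():
    # One hand-written sample; real campaigns use hundreds of environments.
    return Task(
        dataset=[Sample(input="What is 17 * 24?", target="408")],
        solver=generate(),   # a single model turn, no agent scaffold
        scorer=includes(),   # checks the target string appears in the output
    )

if __name__ == "__main__":
    # Placeholder model string; Inspect resolves "provider/model" identifiers.
    eval(toy_capability_check(), model="openai/gpt-4o-mini")
```

Running a task like this at scale and reviewing the resulting logs is the kind of loop this role builds and automates.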
- Start Date: Target of 2-3 months after the first interview.
- Time Allocation: Full-time.
- Location: The office is in London, in a building shared with the London Initiative for Safe AI (LISA) offices. This is an in-person role; in rare situations, we may consider partially remote arrangements on a case-by-case basis.
- Work Visas: We can sponsor UK visas.
- Salary: 100k - 200k GBP (~135k - 270k…