LLM Evaluation Engineering Lead

Job in Redwood City, San Mateo County, California, 94061, USA
Listing for: DeepRec.ai
Full Time position
Listed on 2026-01-11
Job specializations:
  • Software Development
    AI Engineer, Machine Learning / ML Engineer
Job Description & How to Apply Below

LLM Evaluations Engineering Lead – SF Bay Area (Onsite)

Full-time / Permanent

We’re partnering with a deep‑tech AI company building autonomous, agentic systems for complex physical and real‑world environments. The team operates at the edge of what’s possible today, designing AI systems that plan, act, recover, and improve over long horizons in high‑stakes settings.

They’re hiring an LLM Evaluations Engineering Lead to own the evaluation, verification, and regression layer for agentic LLM systems running end‑to‑end workflows. This is not a metrics‑only role; you’ll be building the guardrails that determine whether the system is actually getting better.

Why this role matters

As agentic LLM systems move into long‑horizon planning and execution, evals become the bottleneck. Without trustworthy evals, there is no way to tell whether:

  • Agents are actually improving
  • Changes introduce silent regressions
  • Uncertainty is shrinking or compounding
  • "Success" reflects real‑world outcomes, not proxy metrics

When evals are wrong, downstream systems fail. This role sits directly on that fault line.

What you’ll do
  • Build eval harnesses for agentic LLM systems (offline + in‑workflow)
  • Design evals for planning, execution, recovery, and safety
  • Implement verifier‑driven scoring and regression gates
  • Turn eval failures into training signals (SFT / DPO / RL)
What they’re looking for
  • Strong experience building evaluation systems for ML models (LLMs strongly preferred)
  • Excellent software engineering fundamentals:
    • Python
    • Data pipelines
    • Test harnesses
    • Distributed execution
    • Reproducibility
  • Deep understanding of agentic failure modes, including:
    • Tool misuse
    • Hallucinated evidence
    • Reward hacking
    • Brittle formatting and schema drift
  • Ability to reason about what to measure, not just how to measure it
  • Comfortable operating between research experimentation and production systems
Why join
  • Work on frontier agentic AI systems with real‑world consequences
  • Own a foundational layer that determines system reliability and progress
  • High autonomy, strong technical peers, and meaningful equity
  • Build evals that actually matter, not academic benchmarks