Senior Software Engineer, AI Evaluation Infra
Listed on 2026-02-06
Software Development
AI Engineer, Machine Learning / ML Engineer
Overview
nTop is pioneering the future of engineering design with advanced software that pushes the boundaries of performance and delivers mission-critical components faster than ever before. We focus on Aerospace & Defense, where programs face an impossible reality: deliver next-gen aircraft faster, with fewer experts, and with zero tolerance for failure. nTop changes how aircraft get designed. Our platform collapses months of configuration iteration into hours, letting teams explore thousands of validated variants instead of locking in the first concept.
Teams cut development cycles by 50% and protect PWin with simulation-backed proposals. Defense primes and startups alike choose nTop when mission success isn't negotiable.
We are looking for Software Engineers to solve the hardest problems in physical design exploration. Our users are the world's most demanding builders of physical goods, from aircraft and race cars to energy turbines. Your focus will be on developing software for deeply parametric engineering, physical simulation, and managing immense design spaces. We reduce the crippling cost of late-stage design changes, making building with atoms as fast and agile as building with bits.
If you're motivated by solving tough engineering challenges alongside a team that learns and grows together, you'll thrive here. We're seeking teammates who are eager to experiment, innovate, and make a meaningful impact with technology.
nTop is hiring a Sr Software Engineer with a focus on Evaluation and Observability. You will own reliably measuring whether our AI systems are ready for production: you will design, implement, and maintain the rigorous evaluation frameworks that ensure the accuracy, groundedness, and reliability of our system. This role is NYC-based (hybrid) and reports to the VP of Engineering.
What You'll Do
As our Sr Software Engineer in AI Evals Infra & Observability, you will be the quality gate for our AI systems, focusing on the entire data-to-answer pipeline. Your responsibilities will include:
- Design evaluation frameworks: Develop metrics and benchmarks to systematically measure AI model performance, including accuracy, robustness, safety, and reliability.
- Develop automated tools: Build automated evaluation pipelines that run tests at scale to assess AI performance under various conditions, including adversarial and edge-case scenarios, and integrate with third-party eval platforms and tools.
- Implement human feedback loops: Design human annotation protocols and quality control mechanisms to incorporate human judgment into the evaluation process, especially for subjective tasks.
- Analyze model behavior: Conduct in-depth analysis to understand AI model performance, identify weaknesses, and pinpoint failure modes.
- Build production systems: Extend or integrate evaluation tooling into production environments by creating dashboards, alerts, and observability tools to monitor models after deployment.
- Golden Dataset Management: Collaborate with domain experts to curate and manage high-quality "Golden Question-Answer-Context" datasets essential for ground-truth RAG evaluation.
- Prompt and System Optimization: Translate evaluation results into clear, actionable recommendations for Engineers to optimize the LLM integration, prompt templates, and data chunking strategies.
- Collaborate across teams: Work closely with product managers and software engineers to ensure that evaluation methodologies align with business goals and to communicate technical findings to stakeholders.
We are looking for a hands-on engineer with 2-3 years of professional experience in machine learning, MLOps, or software quality assurance, with a specific focus on modern LLM applications.
- Experience building, testing, or evaluating production-grade RAG systems or other complex information retrieval/NLP systems.
- Containerization & Infrastructure: Proven experience with Docker for containerizing applications, setting up consistent evaluation environments, and managing dependencies.
- Programming & Tools: Expert proficiency in Python and experience with NLP/ML libraries and data processing tools.
- MLOps and CI/CD: Practical experience…