Senior Software Engineer, AI Evaluation Infra

Job in New York, New York County, New York, 10261, USA
Listing for: nTopology
Full Time position
Listed on 2026-02-06
Job specializations:
  • Software Development
    AI Engineer, Machine Learning/ML Engineer
Job Description & How to Apply Below
Location: New York

Overview

nTop is pioneering the future of engineering design with advanced software that pushes the boundaries of performance and delivers mission-critical components faster than ever before. We focus on Aerospace & Defense, where programs face an impossible reality: deliver next-gen aircraft faster, with fewer experts, and zero tolerance for failure. nTop changes how aircraft get designed. Our platform collapses months of configuration iteration into hours, letting teams explore thousands of validated variants instead of locking in the first concept.

Teams cut development cycles by 50% and protect PWin with simulation-backed proposals. Defense primes and startups choose nTop when mission success isn't negotiable.

We are looking for Software Engineers to solve the hardest problems in physical design exploration. Our users are the world's most demanding builders of physical goods, from aircraft and race cars to energy turbines. Your focus will be on developing software for deeply parametric engineering, physical simulation, and managing immense design spaces. We reduce the crippling cost of late-stage design changes, making building with atoms as fast and agile as building with bits.

If you're motivated by solving tough engineering challenges alongside a team that learns and grows together, you'll thrive. We're seeking teammates who are eager to experiment, innovate, and make a meaningful impact with technology.

nTop is hiring a Sr Software Engineer with a focus on Evaluation and Observability. You will own reliably measuring whether our AI systems are ready for production. You will design, implement, and maintain the rigorous evaluation frameworks that ensure the accuracy, groundedness, and reliability of our system. This role is NYC-based, hybrid, and reports to the VP of Engineering.

What You'll Do

As our Sr Software Engineer in AI Evals Infra & Observability, you will be the quality gate for our AI systems, focusing on the entire data-to-answer pipeline. Your responsibilities will include:

  • Design evaluation frameworks: Develop metrics and benchmarks to systematically measure AI model performance, including accuracy, robustness, safety, and reliability.
  • Develop automated tools: Build automated evaluation pipelines that run tests at scale to assess AI performance under various conditions, including adversarial and edge-case scenarios, and/or integrate with third-party eval platforms and tools.
  • Implement human feedback loops: Design human annotation protocols and quality control mechanisms to incorporate human judgment into the evaluation process, especially for subjective tasks.
  • Analyze model behavior: Conduct in-depth analysis to understand AI model performance, identify weaknesses, and pinpoint failure modes.
  • Build production systems: Extend the evaluation process to production environments, building on or integrating external tools, by creating dashboards, alerts, and observability tools to monitor models after deployment.
  • Golden Dataset Management: Collaborate with domain experts to curate and manage high-quality "Golden Question-Answer-Context" datasets essential for ground-truth RAG evaluation.
  • Prompt and System Optimization: Translate evaluation results into clear, actionable recommendations for Engineers to optimize the LLM integration, prompt templates, and data chunking strategies.
  • Collaborate across teams: Work closely with product managers and software engineers to ensure that evaluation methodologies align with business goals and to communicate technical findings to stakeholders.
Required Experience

We are looking for a hands-on engineer with 2-3 years of professional experience in machine learning, MLOps, or software quality assurance, specifically focused on modern LLM applications.

  • Experience building, testing, or evaluating production-grade RAG systems or other complex information retrieval/NLP systems.
  • Containerization & Infrastructure:
    Proven experience with Docker for containerizing applications, setting up consistent evaluation environments, and managing dependencies.
  • Programming & Tools:
    Expert proficiency in Python and experience with NLP/ML libraries and data processing tools.
  • MLOps and CI/CD: Practical experience…
Position Requirements
10+ Years work experience