
Research Internship: Diagnostic & Perceptual Evaluation Framework for Generative Speech

Job in Indiana Borough, Indiana County, Pennsylvania, 15705, USA
Listing for: Agigo AG
Apprenticeship/Internship position
Listed on 2026-03-12
Job specializations:
  • Software Development
    AI Engineer, Machine Learning/ ML Engineer
Salary/Wage Range or Industry Benchmark: 10,000 - 60,000 USD per year
Job Description

Research Internship: Diagnostic & Perceptual Evaluation Framework for Generative Speech

Full-time | Voice & Conversational AI | Global Enterprise AI Platform

Duration: 4-8 Months

About AGIGO

AGIGO™ is the first enterprise-grade conversational AI platform that empowers enterprises to transform customer engagement and business performance with high-agency AI agents - agents that match well-trained human customer agents in naturalness, responsiveness, and autonomous task resolution. Built for on-premises or hybrid deployment, with no reliance on third-party services, our proprietary platform gives enterprises full control, observability, and data sovereignty. Its unified core, tunable base models, and end-to-end design toolchain deliver context-aware, adaptable agents that engage directly with customers in real time.

Founded February 2025 in Switzerland by a team of 18 experienced AI pioneers, AGIGO is driven by a bold vision to lead the next major wave in AI by transforming how businesses interact with their customers.

Your Research Mission

The objective of this internship is to design and build a next-generation diagnostic and perceptual evaluation framework for generative speech models - a system that not only tells us if a model is better, but why. You will combine robust objective metrics with novel techniques for automated failure diagnosis and perceptual correlation. The resulting framework will become a core internal tool, guiding model selection and optimization across AGIGO’s voice-synthesis development and deployment cycle.

Phase 1:
Foundational Objective Metrics

In the initial phase of your project, you will implement a state-of-the-art suite of automated metrics that provides a comprehensive, objective view of model performance and robustness, going far beyond the conventional Word Error Rate (WER):

Aggregated WER: An ensemble of diverse ASR models (autoregressive and non-autoregressive (AR/NAR) models of different architectures) is used to measure intelligibility robustness.
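One way the aggregation could be sketched, assuming per-model transcripts are already available (the edit-distance WER below is a minimal stand-in for a production implementation such as the `jiwer` library; the model names in the usage are hypothetical):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate via Levenshtein edit distance over word tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)


def aggregated_wer(reference: str, transcripts: dict[str, str]) -> dict:
    """Aggregate WER over transcripts from several ASR models (AR and NAR)."""
    scores = {name: wer(reference, hyp) for name, hyp in transcripts.items()}
    return {"per_model": scores,
            "mean": sum(scores.values()) / len(scores),
            "worst": max(scores.values())}
```

Reporting both the mean and the worst-case score matters here: a sample that only one ASR architecture mis-transcribes is exactly the kind of borderline case an ensemble is meant to surface.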

Semantic Error Rate (SER):
You will implement a metric that goes beyond simple word matching. By comparing the semantic embeddings (e.g., from a T5 or BERT model) of the ground truth text and the ASR-transcribed text, this metric can tolerate minor transcription differences ("the car" vs. "a car") while heavily penalizing meaning-altering errors, e.g., hallucinations or repeated n-grams.

Signal & Perceptual Proxy Suite

  • Integrate standardized metrics such as STOI, PESQ, and SI-SNR from TorchAudio-Squim to assess signal-level fidelity.
  • Integrate non-intrusive, neural-network-based perceptual metrics, such as DNSMOS.
  • Implement speaker similarity metrics using pre-trained speaker-verification models to quantify performance in voice cloning tasks (optional, if time allows).
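A possible skeleton for wiring the suite together, assuming each real metric (TorchAudio-Squim's STOI/PESQ/SI-SNR, a DNSMOS model, a speaker-verification embedder) is wrapped as a callable; the stubs in the usage are purely illustrative:

```python
from typing import Callable

# Each metric is a function from an audio buffer to a scalar score.
MetricFn = Callable[[list[float]], float]


def run_proxy_suite(audio: list[float],
                    metrics: dict[str, MetricFn]) -> dict[str, float]:
    """Apply every registered signal/perceptual metric to one utterance,
    returning a flat name -> score report for downstream aggregation."""
    return {name: fn(audio) for name, fn in metrics.items()}
```

A registry like this keeps the framework open: adding a new perceptual proxy later means registering one callable, not touching the evaluation loop.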
Phase 2:
Automated Failure Diagnostics & Adversarial Testing

This is where the project becomes truly innovative. The goal is to automatically find and categorize the subtle failures that plague even the best TTS models. You will develop a classifier to detect common TTS failure modes on generated audio:

Hallucination Detector: Identifies repeated phrases, word omissions, and truncated sentences.
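A first-pass heuristic for this detector could compare the input text against an ASR transcript of the generated audio; the thresholds and n-gram size below are illustrative assumptions, not tuned values:

```python
def detect_hallucinations(reference: str, transcript: str,
                          ngram: int = 3) -> dict:
    """Flag repeated n-grams, word omissions, and truncation by
    comparing the reference text to an ASR transcript of the audio."""
    ref, hyp = reference.lower().split(), transcript.lower().split()

    # Repeated n-grams in the transcript (looping / babbling).
    grams = [tuple(hyp[i:i + ngram]) for i in range(len(hyp) - ngram + 1)]
    repeats = {g for g in grams if grams.count(g) > 1}

    # Reference words that never appear in the transcript (omissions).
    omissions = [w for w in ref if w not in hyp]

    # Crude truncation check: transcript far shorter than the reference.
    truncated = len(hyp) < 0.6 * len(ref)

    return {"repeated_ngrams": repeats,
            "omissions": omissions,
            "truncated": truncated}
```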

Prosody Mismatch Detector: A model trained to detect when the intonation of a sentence does not match its punctuation, e.g., a question spoken as a statement.
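Before training a full model, a rule-based baseline can establish the idea: questions should end with rising pitch. This sketch assumes an F0 contour (Hz per frame, unvoiced frames removed) extracted by an external pitch tracker, and its tail fraction is an arbitrary illustrative choice:

```python
def question_intonation_mismatch(text: str, f0: list[float],
                                 tail_frac: float = 0.25) -> bool:
    """Compare the mean F0 of the final `tail_frac` of the contour
    against the rest; flag a mismatch when the punctuation says
    'question' but the contour does not rise, or vice versa."""
    split = int(len(f0) * (1 - tail_frac))
    body, tail = f0[:split], f0[split:]
    if not body or not tail:
        return False  # contour too short to judge
    rising = sum(tail) / len(tail) > sum(body) / len(body)
    is_question = text.rstrip().endswith("?")
    return is_question != rising
```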

Artifact Detector: A model that specifically listens for common synthesis artifacts like metallic ringing or hissing.
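A trained model would operate on spectral features, but even a single cheap feature illustrates the input side: hissy, noisy audio shows a far higher zero-crossing rate than clean voiced speech, making ZCR one plausible (assumed, not prescribed) feature for such a classifier:

```python
def zero_crossing_rate(samples: list[float]) -> float:
    """Fraction of adjacent sample pairs whose sign flips.
    High ZCR segments are candidates for hiss/noise artifacts."""
    if len(samples) < 2:
        return 0.0
    flips = sum(1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0))
    return flips / (len(samples) - 1)
```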

Automated Challenge Set Generation: A system to automatically find or generate difficult text samples (e.g., tongue twisters, complex numerical expressions) that are likely to cause a given model to fail, creating a constantly evolving stress test. An LLM could potentially be used to pursue this line of research. (Optional, if time allows.)
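A seeded, template-based generator is one simple starting point before an LLM takes over; the templates and twisters below are illustrative examples only:

```python
import random


def generate_challenge_texts(n: int, seed: int = 0) -> list[str]:
    """Produce hard TTS inputs: tongue twisters and dense numeric
    expressions that often trip up text normalization front-ends.
    Seeded so a challenge set is reproducible across model runs."""
    rng = random.Random(seed)
    twisters = [
        "She sells seashells by the seashore.",
        "Red lorry, yellow lorry, red lorry, yellow lorry.",
    ]
    samples = []
    for _ in range(n):
        if rng.random() < 0.5:
            samples.append(rng.choice(twisters))
        else:
            dollars, cents = rng.randint(1000, 9999), rng.randint(0, 99)
            day, month = rng.randint(1, 28), rng.randint(1, 12)
            samples.append(f"The invoice totals ${dollars}.{cents:02d}, "
                           f"due on {day}/{month}/202{rng.randint(5, 9)}.")
    return samples
```

An evolving stress test would then keep only the samples on which the model under test actually fails, and mutate those to produce the next generation.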

Predictive Evaluation: Can you analyze a model's internal states or confidence scores before synthesis to predict whether it is likely to fail on a given piece of text? This could be used to build a fallback or self-correction mechanism directly into the TTS engine.
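The simplest version of this idea thresholds the model's own token log-probabilities; the thresholds here are arbitrary illustrative values that a real system would calibrate against observed failures:

```python
def predict_failure(token_logprobs: list[float],
                    mean_thresh: float = -1.5,
                    min_thresh: float = -4.0) -> bool:
    """Flag a synthesis request as risky when the model's token-level
    confidence looks weak, so a fallback path (re-synthesis, a second
    model) can be triggered before any audio is produced."""
    if not token_logprobs:
        return True  # nothing to go on: treat as risky
    mean_lp = sum(token_logprobs) / len(token_logprobs)
    return mean_lp < mean_thresh or min(token_logprobs) < min_thresh
```

Using both the mean and the minimum catches two distinct failure shapes: globally uncertain generations, and single low-confidence tokens that often mark the onset of a loop or omission.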

Multi-Lingual Generalization: How do these advanced metrics and diagnostic tools generalize across different languages…

Position Requirements
Less than 1 Year work experience