February 13, 2026

Speechify AI Research Lab Researcher Has PFluxTTS Paper Accepted at ICASSP 2026

Speechify announces the acceptance of its researcher’s PFluxTTS paper at ICASSP 2026, covering hybrid flow-matching TTS, robust cross-lingual voice cloning, and 48 kHz audio demos.

Speechify today announced that Speechify AI Research Lab researcher Vikentii Pankov is an author of “PFluxTTS: Hybrid Flow Matching TTS with Robust Cross Lingual Voice Cloning and Inference Time Model Fusion,” a paper accepted at the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2026.

The work introduces PFluxTTS, a hybrid text-to-speech system designed to improve production readiness for voice cloning and multilingual prompting. The paper targets three persistent gaps in flow-matching-based speech generation: the tradeoff between stability and naturalness, the difficulty of preserving speaker identity across languages, and limited waveform fidelity when reconstructing full-bandwidth audio from lower-rate acoustic features.

A preprint of the paper is publicly available on arXiv, and accompanying audio demonstrations are available on the project website. 

What does this ICASSP 2026 acceptance signal about Speechify’s research direction?

ICASSP is one of the leading conferences for speech, audio, and signal processing research, and acceptance reflects peer-reviewed recognition of technical contributions that advance the state of the art. In the context of Speechify’s broader strategy, this acceptance reinforces its position as a voice-first AI company that invests in foundational research, not only product features.

Speechify builds and improves voice technologies across text-to-speech, speech-to-text, and speech-to-speech workflows that power real user experiences, including long-form listening, high-speed playback, dictation, and document-based voice interaction. When Speechify researchers publish work accepted at major conferences, it signals that Speechify is participating in the research frontier that will shape how voice systems are built and evaluated over the coming years.

What is PFluxTTS and what problem is it solving?

PFluxTTS is described as a hybrid flow-matching text-to-speech system that combines two model styles in a single inference process. According to the paper, one path is duration-guided, which tends to improve alignment stability and reduce issues like word skipping. The other path is alignment-free, which tends to improve fluency and perceived naturalness. PFluxTTS combines both through inference-time vector field fusion, meaning the system mixes the two models’ guidance during generation rather than committing to one model family.

This matters because many teams building voice products find that a model that sounds good in short demos can still fail in real workflows, especially when prompts are noisy, cross-lingual, or conversational. In production, a voice system must remain intelligible, preserve identity, and keep timing stable across varied content and recording conditions.

How does PFluxTTS improve cross-lingual voice cloning reliability?

Cross-lingual voice cloning is hard because speaker identity is not a single static vector. Real speaker traits vary over time, across phonetic contexts, and across recording conditions. The paper argues that fixed-dimensional speaker embeddings can discard time-varying timbre cues that become important when the prompt language differs from the target language.

PFluxTTS addresses this by conditioning on a sequence of speech prompt embeddings within a FLUX-based decoder, which is designed to better preserve speaker traits across languages without requiring prompt transcripts.

The result is a system designed to preserve what the speaker sounds like, even when the prompt is in one language and the generated speech is in another, and even when prompts are captured in the wild rather than under studio conditions.
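To make the distinction concrete, here is a minimal PyTorch-style sketch of the idea, not the paper’s actual architecture: the SequencePromptConditioner class, the layer sizes, and the tensor shapes are illustrative assumptions, contrasting cross-attention over a per-frame prompt sequence with a single pooled speaker vector.

```python
import torch
import torch.nn as nn

class SequencePromptConditioner(nn.Module):
    """Illustrative sketch only: condition decoder states on a *sequence*
    of speech-prompt embeddings via cross-attention, instead of collapsing
    the prompt into one fixed speaker vector (which discards time-varying
    timbre cues). Dimensions are arbitrary placeholders."""

    def __init__(self, dim: int = 256, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, decoder_states, prompt_seq):
        # decoder_states: (batch, T_dec, dim)    -- frames being generated
        # prompt_seq:     (batch, T_prompt, dim) -- per-frame prompt embeddings
        out, _ = self.attn(query=decoder_states, key=prompt_seq, value=prompt_seq)
        return decoder_states + out  # residual conditioning

# Contrast: a fixed-dimensional speaker embedding averages the prompt
# over time, keeping only a static identity summary:
#   speaker_vec = prompt_seq.mean(dim=1)  # (batch, dim)
```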

What does “inference-time model fusion” mean in plain English?

Most systems pick one model family and live with its weaknesses. PFluxTTS instead runs a hybrid approach at generation time. The paper describes fusing two independently trained vector fields during a single ODE integration, so the system can lean on the duration-guided path early to stabilize alignment, then allow the alignment-free path to dominate the later steps for fluency and naturalness.

Put simply, the system is designed to start safe and stable, then finish expressive and natural, which is a practical way to reduce the “either stable or natural” compromise that teams often face when deploying voice models at scale.
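As a rough illustration of how such a fusion can work at generation time, the sketch below mixes two vector fields inside a simple Euler ODE loop. The fused_sample function, the linear weighting schedule, and the step count are illustrative assumptions, not the paper’s actual method.

```python
import torch

def fused_sample(v_duration, v_free, x0, n_steps: int = 32):
    """Sketch of inference-time vector field fusion under assumed details.

    v_duration, v_free: callables (x, t) -> vector field estimate from the
    duration-guided and alignment-free flow-matching models, respectively.
    x0: initial noise sample; integration runs t = 0 -> 1 via Euler steps.
    """
    x, dt = x0, 1.0 / n_steps
    for i in range(n_steps):
        t = torch.tensor(i * dt)
        # Example schedule (made up for illustration): rely on the
        # duration-guided field early for stable alignment, then shift
        # toward the alignment-free field for fluency and naturalness.
        w = 1.0 - t  # weight on the duration-guided field
        v = w * v_duration(x, t) + (1.0 - w) * v_free(x, t)
        x = x + dt * v  # one Euler step of dx/dt = v(x, t)
    return x
```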

How does PFluxTTS address audio quality and 48 kHz reconstruction?

Many TTS pipelines generate mel-spectrogram features at a resolution that does not fully represent high-frequency detail, then rely on a vocoder to reconstruct audio. The paper introduces a modified PeriodWave vocoder that incorporates a super-resolution approach to reconstruct 48 kHz waveforms from low-rate mel features.
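For intuition about the numbers involved, the short sketch below works through the frame-rate arithmetic. The 24 kHz source rate and 256-sample hop are common defaults assumed for illustration, not the paper’s configuration.

```python
# Back-of-the-envelope arithmetic for why 48 kHz reconstruction from
# low-rate mel features requires extra upsampling. Values are assumed
# defaults, not the paper's settings.

mel_sample_rate = 24_000  # audio rate the mel features were computed at (Hz)
hop_length = 256          # samples per mel frame at that rate
frame_rate = mel_sample_rate / hop_length   # 93.75 mel frames per second

target_rate = 48_000      # desired output waveform rate (Hz)
upsample_factor = target_rate / frame_rate  # 512 output samples per mel frame

print(f"{frame_rate:.2f} frames/s -> x{upsample_factor:.0f} upsampling for 48 kHz")
# A vocoder targeting 24 kHz output would only need x256; producing 48 kHz
# from the same features doubles the bandwidth the model must generate,
# which is where a super-resolution stage comes in.
```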

For users and developers, higher-bandwidth reconstruction can translate into clearer sibilants, cleaner transients, and more realistic high-frequency texture, especially in professional narration or long-form listening, where artifacts become more noticeable over time.

What performance claims does the paper report?

The arXiv abstract reports that, on cross-lingual in-the-wild data, PFluxTTS outperforms multiple open-source baselines named in the abstract, matches a leading baseline in naturalness while improving on intelligibility metrics, and shows higher speaker similarity than a major commercial reference in the reported setup.

Speechify encourages researchers, developers, and partners to evaluate the work directly through the public preprint and the audio demos, which are designed to make the results audible and comparable under realistic cross-lingual prompting conditions.

The PFluxTTS preprint is available on arXiv under identifier 2602.04160, and the project site hosts the paper summary and audio samples. 

Why does this matter for the future of Speechify’s Voice AI?

Voice AI is moving from novelty demos to daily infrastructure. That shift raises the bar. Systems must remain stable over long sessions, handle multilingual prompts, preserve speaker identity, and deliver predictable latency and intelligibility under real-world conditions.

Speechify’s research focus is aligned with those production requirements. Work like PFluxTTS reflects the direction of modern speech research: hybrid architectures that close the gap between stability and naturalness, stronger voice cloning methods that work across languages, and end-to-end pipelines that improve final audio quality, not only intermediate features.

Speechify will continue to invest in research that advances practical voice AI, publish findings in top venues, and translate those advances into product quality for users and reliable voice infrastructure for developers building voice-first experiences.

About Speechify

Speechify is a voice-first AI company that helps people read, write, and understand information using speech. Trusted by over 50 million users worldwide, Speechify powers AI reading, AI writing, AI podcasts, AI notetaking, AI meetings, and AI productivity across consumer and enterprise platforms. Speechify’s proprietary voice research and model work supports lifelike speech in more than 60 languages and is used globally across a wide range of knowledge work and accessibility use cases.