
A new kind of AI

Moshi is a next-generation speech-to-speech conversational model, designed to understand complex conversations and respond to them fluidly and naturally, with unprecedented expressiveness and spontaneity.
Unlike traditional AI systems, it offers instant voice interactions, enhanced by speech synthesis that adds a human and emotional dimension to every exchange.

Open Science by Kyutai

Built by Kyutai, a French AI research lab partially funded by the founder of Scaleway, Moshi is part of an Open Science initiative. This approach enables the community and businesses to benefit from the latest advancements in AI, while fostering innovation and large-scale customization. Moshi represents the future of conversational applications, accessible to everyone.

Effortlessly accessible

Thanks to our Managed Inference service, deploying Moshi within the Scaleway ecosystem is effortless. This model benefits from complete isolation of inference computations and network, ensuring optimal performance regardless of other users' activity, as well as full audio confidentiality. With no bandwidth limitations, Moshi is ready to provide dynamic voice interactions at any time.

Key features

Open Source

Open Science is at the heart of Kyutai and Moshi's philosophy. You can explore the full research paper for an in-depth understanding of Moshi and access the inference source code, released under the Apache 2.0 license. You can also adapt the model by fine-tuning its weights, which are available under the CC BY 4.0 license.
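As a hint of what this open licensing enables in practice, here is a minimal sketch of fetching the published weights from a public model hub with the huggingface_hub library; the repository identifier below is an assumption for illustration and should be checked against Kyutai's official releases.

```python
# Sketch: downloading openly licensed Moshi weights for local experimentation.
# The repository name is an assumption for illustration; check Kyutai's official
# Hugging Face organization for the exact identifiers.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="kyutai/moshiko-pytorch-bf16",  # assumed repo id, verify before use
)
print(f"Weights downloaded to: {local_dir}")
```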

A full speech-to-speech model

Moshi is an experimental yet advanced Speech-to-Speech conversational model that receives the user's voice and generates both text and a vocal response. Its innovative “Inner Monologue” mechanism enhances the coherence and quality of the generated speech, strengthening its ability to reason and respond accurately.
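To make the "Inner Monologue" idea more concrete, here is a deliberately simplified Python sketch of the generation order it implies, with stub functions standing in for the real neural components; none of the names below come from Kyutai's code.

```python
# Toy illustration of the "Inner Monologue" ordering: at each step the model
# predicts a text token first, then the audio tokens that realize it.
# These stubs are purely illustrative and do not call any real model.

def predict_text_token(history):
    """Stub: the real model predicts the next text token from the full history."""
    return f"<text_{len(history)}>"

def predict_audio_tokens(history, text_token):
    """Stub: the real model predicts acoustic tokens conditioned on the text token."""
    return [f"<audio_{len(history)}_{k}>" for k in range(8)]  # e.g. 8 codebooks

history = []
for step in range(3):  # three generation steps
    text = predict_text_token(history)           # inner monologue: text first...
    audio = predict_audio_tokens(history, text)  # ...then the matching audio tokens
    history.append((text, audio))
    print(step, text, audio)
```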

Say it with emotion

Moshi can modulate its intonation to adapt to various emotional contexts. Whether you ask it to whisper a mysterious story or speak with the energy of a fearless pirate, it can express over 92 different intonations, adding a powerful and immersive emotional dimension to conversations.

End-to-end seamlessness

Moshi natively integrates WebSocket protocol support, enabling real-time management of vocal inputs and outputs. This ensures natural, continuous, and expressive interactions without any noticeable latency.
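As an illustration of what a streaming client built on that support could look like, the sketch below sends audio frames over a WebSocket while listening for the model's replies. The endpoint URL, audio framing, and the omission of authentication are assumptions for illustration, not the actual Managed Inference protocol.

```python
# Minimal duplex WebSocket client sketch using the `websockets` library.
# The endpoint and audio framing are illustrative placeholders; refer to the
# Managed Inference documentation for the real protocol and authentication.
import asyncio
import websockets

ENDPOINT = "wss://example-moshi-deployment.example/api"  # placeholder URL

async def send_audio(ws):
    # A real client would stream microphone audio (e.g. 24 kHz, 16-bit PCM).
    for _ in range(10):
        await ws.send(b"\x00" * 3840)   # ~80 ms of silence at 24 kHz, 16-bit mono
        await asyncio.sleep(0.08)

async def receive_audio(ws):
    async for message in ws:
        print(f"received {len(message)} bytes from the model")

async def main():
    async with websockets.connect(ENDPOINT) as ws:  # authentication omitted
        receiver = asyncio.create_task(receive_audio(ws))
        await send_audio(ws)       # stream a short burst of (silent) audio
        await asyncio.sleep(1.0)   # leave time for a reply to arrive
        receiver.cancel()          # stop listening; the connection closes on exit

asyncio.run(main())
```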

Designed and trained in France

To make the training of Moshi feasible, Kyutai relied on our Nabu2023 supercomputer. This cluster of 1,016 Nvidia H100 GPUs (~4 PFLOPS) is hosted at DC5, a data center in the greater Paris region known for its efficient cooling.

Fine acoustic processing

The Mimi acoustic model, integrated into Moshi, processes audio in real time at 24 kHz and compresses it to a bandwidth of 1.1 kbps, while maintaining ultra-low latency of 80ms. Despite this high compression rate, Mimi outperforms non-streaming codecs such as SpeechTokenizer (50 Hz, 4 kbps) and SemantiCodec (50 Hz, 1.3 kbps), providing a smooth and accurate experience.
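As a quick sanity check of those figures, the arithmetic below assumes, following the Moshi research paper, that Mimi emits 8 codebook tokens drawn from 2048-entry codebooks at 12.5 frames per second; those constants come from the paper, not from this page.

```python
# Back-of-the-envelope check of Mimi's advertised figures, assuming (per the
# Moshi research paper) 8 codebooks of 2048 entries at 12.5 frames per second.
import math

frame_rate_hz = 12.5     # assumed token frame rate
num_codebooks = 8        # assumed number of residual codebooks
codebook_size = 2048     # assumed entries per codebook

bits_per_token = math.log2(codebook_size)                  # 11 bits
bitrate_bps = frame_rate_hz * num_codebooks * bits_per_token
frame_duration_ms = 1000 / frame_rate_hz

print(f"bitrate: {bitrate_bps / 1000:.1f} kbps")           # -> 1.1 kbps
print(f"frame duration: {frame_duration_ms:.0f} ms")       # -> 80 ms of latency per frame
```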

A state-of-the-art model

Current voice dialogue systems rely on chains of independent components (voice activity detection, speech recognition, text processing, and voice synthesis). This results in several seconds of latency and the loss of non-linguistic information, such as emotions or non-verbal sounds. Additionally, these systems segment dialogues into turn-based interactions, overlooking interruptions or overlapping speech.

Kyutai's approach with Moshi aims to solve these issues by generating speech (audio together with its text transcript) directly from the user's voice, without passing through a chain of separate text-based components.

The user's and the AI's voices are modeled separately, allowing for more natural and dynamic dialogues. The model predicts text first, before generating sounds, enhancing linguistic quality while enabling real-time speech recognition and synthesis. With a theoretical latency of 160ms, Moshi is the first real-time, full-duplex voice language model.
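One way to read the 160 ms figure, consistent with the research paper's description, is as one 80 ms Mimi frame plus roughly one frame of delay between the text and audio streams; the decomposition below is a sketch under that assumption.

```python
# Sketch of where the theoretical 160 ms latency could come from, assuming one
# 80 ms Mimi frame plus one frame of acoustic delay between text and audio.
mimi_frame_ms = 80            # duration of one Mimi frame (see "Fine acoustic processing")
acoustic_delay_frames = 1     # assumed delay between the text and audio token streams

theoretical_latency_ms = mimi_frame_ms * (1 + acoustic_delay_frames)
print(f"theoretical latency: {theoretical_latency_ms} ms")  # -> 160 ms
```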

Deep dive into Moshi

Open models and pricing

Three models have been released: the audio codec Mimi, along with two pre-trained Moshi models featuring artificially generated voices: a synthetic masculine voice named Moshiko and a synthetic feminine voice named Moshika.

All these models have been published under the CC BY 4.0 license. This license allows others to distribute, fine-tune, and modify these models, even for commercial purposes, provided they give credit to Kyutai for the original creation.

| Model   | Supported languages       | Quantization | GPU        | Price      |
|---------|---------------------------|--------------|------------|------------|
| Moshiko | English (masculine voice) | FP8          | L4-1-24G   | €0.93/hour |
| Moshiko | English (masculine voice) | FP8, BF16    | H100-1-80G | €3.40/hour |
| Moshika | English (feminine voice)  | FP8          | L4-1-24G   | €0.93/hour |
| Moshika | English (feminine voice)  | FP8, BF16    | H100-1-80G | €3.40/hour |

For a deeper understanding of Moshi, the full research paper is also available.

Deploy Moshi in 2 steps

Use one of our clients to interact with Moshi

Frequently asked questions

What are the current limitations of Moshi?

Moshi has a limited context window, and conversations longer than 5 minutes will be stopped. It also has a limited knowledge base covering the years 2018 to 2023, which can lead to repetitive or inconsistent responses during prolonged interactions.

How do I use Scaleway's Managed Inference service with Moshi?

You can find a comprehensive getting-started guide here, covering deployment, security, and billing. If you need further assistance, feel free to reach out through the #inference-beta channel of our Slack community.

What is Moshi's safety score?

To evaluate toxicity in generated content, the ALERT benchmark by Simone Tedeschi was applied to Moshi. Moshi scores 83.05 (Falcon: 88.11, GPT-4: 99.18); a higher score indicates a less "toxic" model.