Scaleway

Optimized model deployment – including your own

Choose from a Model Library featuring quantized LLMs, VLMs, embedding models, and more – or, soon, deploy your own model (e.g., from Hugging Face). Skip the complexities of open-weight quantization and enjoy efficient inference.

Guaranteed throughput with dedicated Instances

Dedicated GPU infrastructure ensures consistent and predictable performance, with unlimited tokens at a fixed hourly rate – guaranteeing the stable inference speeds critical for high-load, latency-sensitive applications such as chatbots.

Secured Private Networking in a European Cloud

Access your AI endpoints over a private, low-latency connection within a Virtual Private Cloud. Data sovereignty is ensured—your prompts and responses remain private, stored only in Europe, and inaccessible to third parties.

Open-weights language and embedding models

Pixtral-12b-2409

A vision-language model that analyzes your images and offers insights without compromising instruction following. Another fantastic model from Mistral AI, distributed under the Apache 2.0 license.

Predictable pricing

Pick among off-the-shelf optimized models and get a dedicated inference endpoint right away.

You are charged for usage of the GPU type you choose.


| Model                          | Quantization | GPU        | Price      | Approx. per month |
|--------------------------------|--------------|------------|------------|-------------------|
| Llama3.1-8b-instruct           | BF16, FP8    | L4-1-24G   | €0.93/hour | ~€679/month       |
| Llama3.1-70b-instruct          | FP8          | H100-1-80G | €3.40/hour | ~€2482/month      |
| Llama3.1-Nemotron-70b-instruct | FP8          | H100-1-80G | €3.40/hour | ~€2482/month      |
| Mistral-7b-instruct-v0.3       | BF16         | L4-1-24G   | €0.93/hour | ~€679/month       |
| Pixtral-12b-2409               | BF16         | H100-1-80G | €3.40/hour | ~€2482/month      |
| Mistral-nemo-instruct-2407     | FP8          | H100-1-80G | €3.40/hour | ~€2482/month      |
| Mixtral-8x7b-instruct-v0.1     | FP8          | H100-1-80G | €3.40/hour | ~€2482/month      |
| BGE-Multilingual-Gemma2        | FP32         | L4-1-24G   | €0.93/hour | ~€679/month       |
| Qwen2.5-coder-32b-instruct     | INT8         | H100-1-80G | €3.40/hour | ~€2482/month      |
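The approximate monthly figures follow directly from the hourly rate of the dedicated GPU, assuming an always-on deployment at roughly 730 hours per month; a quick sketch:

```python
# Approximate monthly cost of an always-on dedicated deployment:
# hourly GPU rate x ~730 hours per month (24 h x ~30.4 days).
HOURS_PER_MONTH = 730

def monthly_cost(hourly_rate_eur: float) -> int:
    """Rounded monthly cost in euros for an always-on deployment."""
    return round(hourly_rate_eur * HOURS_PER_MONTH)

print(monthly_cost(0.93))  # L4-1-24G   -> 679
print(monthly_cost(3.40))  # H100-1-80G -> 2482
```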

More models and conditions available on this page.

Benefit from a secured European Cloud ecosystem

Virtual Private Cloud

Your AI endpoints are accessible through a secure, low-latency connection to your resources hosted at Scaleway, thanks to a resilient regional Private Network.

Learn more

Access Management

We make generative AI endpoints compatible with Scaleway's Identity and Access Management, so your deployments comply with your enterprise architecture requirements.

Learn more

Cockpit

Identify bottlenecks on your deployments, view inference requests in real time and even report your energy consumption with a fully managed observability solution.

Learn more

Frequently asked questions

How can I start using this service?

You'll find a comprehensive getting-started guide here, including details on deployment, security, and billing.
If you need support, don't hesitate to reach out through the dedicated Slack community channel #inference-beta.

What are Scaleway's security protocols for AI services?

Scaleway's AI services implement robust security measures to ensure customer data privacy and integrity. Our measures and policies are published in our documentation.

Can I use the OpenAI libraries and APIs?

Scaleway lets you seamlessly transition applications that already use OpenAI. You can use any of the official OpenAI libraries – for example, the OpenAI Python client library – to interact with your Scaleway Managed Inference deployments. Find the supported APIs and parameters here.
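Since the endpoints speak the OpenAI chat-completions format, a standard request payload works as-is. A minimal stdlib sketch – the endpoint URL and API key below are placeholders, and the official OpenAI Python client pointed at your deployment via its `base_url` parameter works equally well:

```python
import json
import urllib.request

# Placeholders: substitute your deployment's endpoint URL and your API key.
ENDPOINT = "<your-endpoint-url>/v1/chat/completions"
API_KEY = "<your-api-key>"

def build_chat_request(model: str, user_message: str) -> dict:
    """Build an OpenAI-compatible chat-completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": 128,
    }

payload = build_chat_request("mistral-7b-instruct-v0.3", "Hello!")

# Sending the request requires a live deployment, so it is left commented out:
# req = urllib.request.Request(
#     ENDPOINT,
#     data=json.dumps(payload).encode(),
#     headers={"Authorization": f"Bearer {API_KEY}",
#              "Content-Type": "application/json"},
# )
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```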

What are the advantages over mutualized LLM API services?
  • Complete isolation of computing and networking resources to ensure maximum control for sensitive applications.
  • Consistent and predictable performance, unaffected by the activity of other users.
  • No strict rate limits—usage is only constrained by the maximum load your deployment can handle.
  • Access to a wider range of models.
  • More cost-effective with high utilization.
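To illustrate the cost-effectiveness point, you can compare the fixed hourly rate of a dedicated deployment against a hypothetical pay-per-token price; the €0.90 per million tokens below is an assumed figure for the sketch, not a Scaleway price:

```python
# Break-even utilization between a dedicated deployment (fixed hourly rate)
# and pay-per-token pricing. The per-million-token price is HYPOTHETICAL.
def break_even_tokens_per_hour(hourly_rate_eur: float,
                               price_per_million_tokens_eur: float) -> float:
    """Tokens per hour above which the dedicated deployment is cheaper."""
    return hourly_rate_eur / price_per_million_tokens_eur * 1_000_000

# H100 deployment at €3.40/hour vs an assumed €0.90 per million tokens:
threshold = break_even_tokens_per_hour(3.40, 0.90)
print(f"{threshold:,.0f} tokens/hour")  # roughly 3.8 million tokens/hour
```

Above that sustained throughput, the unlimited-token dedicated deployment comes out cheaper than per-token billing under these assumptions.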
Do you have pay-per-token hosted models?

Managed Inference deploys AI models and creates dedicated endpoints on a secure production infrastructure.

Alternatively, Scaleway has a selection of hosted models in its datacenters, priced per million tokens consumed, available via API. Find all details on the Generative APIs page.

I've got a request, where can I share it?

Tell us the good and the bad about your experience here. Thank you for your time!

Get started with tutorials