Scaleway

Optimized model deployment – including your own

Choose from a Model Library featuring quantized LLMs, VLMs, embedding models, and more – or, soon, deploy your own model (e.g., from Hugging Face). Skip the complexities of open-weight quantization and enjoy efficient inference.

Guaranteed throughput with dedicated Instances

Dedicated GPU infrastructure ensures consistent and predictable performance, with unlimited tokens at a fixed hourly rate – guaranteeing the stable inference speeds critical for high-load, latency-sensitive applications such as chatbots.

Secured Private Networking in a European Cloud

Access your AI endpoints over a private, low-latency connection within a Virtual Private Cloud. Data sovereignty is ensured—your prompts and responses remain private, stored only in Europe, and inaccessible to third parties.

Open-weights language and embedding models

Pixtral-12b-2409

A vision-language model that analyzes your images and offers insights without compromising instruction following. Another fantastic model from Mistral AI, distributed under the Apache 2.0 license.

Predictable pricing

Pick among off-the-shelf optimized models and get a dedicated inference endpoint right away.

You are charged for usage of the GPU type you choose.


| Model                          | Quantization | GPU        | Price      | Approx. per month |
|--------------------------------|--------------|------------|------------|-------------------|
| Llama3.1-8b-instruct           | BF16, FP8    | L4-1-24G   | €0.93/hour | ~€679/month       |
| Llama3.1-70b-instruct          | FP8          | H100-1-80G | €3.40/hour | ~€2482/month      |
| Llama3.1-Nemotron-70b-instruct | FP8          | H100-1-80G | €3.40/hour | ~€2482/month      |
| Mistral-7b-instruct-v0.3       | BF16         | L4-1-24G   | €0.93/hour | ~€679/month       |
| Pixtral-12b-2409               | BF16         | H100-1-80G | €3.40/hour | ~€2482/month      |
| Mistral-nemo-instruct-2407     | FP8          | H100-1-80G | €3.40/hour | ~€2482/month      |
| Mixtral-8x7b-instruct-v0.1     | FP8          | H100-1-80G | €3.40/hour | ~€2482/month      |
| BGE-Multilingual-Gemma2        | FP32         | L4-1-24G   | €0.93/hour | ~€679/month       |
| Qwen2.5-coder-32b-instruct     | INT8         | H100-1-80G | €3.40/hour | ~€2482/month      |
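The approximate monthly figures follow directly from the hourly rate of the dedicated GPU, assuming an always-on deployment at roughly 730 hours per month; a quick sketch:

```python
# Approximate monthly cost of an always-on dedicated deployment:
# hourly GPU rate x ~730 hours per month (24 h x ~30.4 days).
HOURS_PER_MONTH = 730

def monthly_cost(hourly_rate_eur: float) -> int:
    """Rounded monthly cost in euros for an always-on deployment."""
    return round(hourly_rate_eur * HOURS_PER_MONTH)

print(monthly_cost(0.93))  # L4-1-24G   -> 679
print(monthly_cost(3.40))  # H100-1-80G -> 2482
```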

More models and conditions available on this page.

Benefit from a secured European Cloud ecosystem

Virtual Private Cloud

Your AI endpoints are accessible through a secure, low-latency connection to your resources hosted at Scaleway, thanks to a resilient regional Private Network.

Learn more

Access Management

We make generative AI endpoints compatible with Scaleway's Identity and Access Management, so your deployments comply with your enterprise architecture requirements.

Learn more

Cockpit

Identify bottlenecks on your deployments, view inference requests in real time and even report your energy consumption with a fully managed observability solution.

Learn more

Frequently asked questions

How can I start using this service?

You'll find a comprehensive getting-started guide here, including details on deployment, security, and billing.
If you need support, don't hesitate to reach out through the dedicated Slack community channel #inference-beta.

What are Scaleway's security protocols for AI services?

Scaleway's AI services implement robust security measures to ensure customer data privacy and integrity. Our measures and policies are published in our documentation.

Can I use the OpenAI libraries and APIs?

Scaleway lets you seamlessly transition applications that already use OpenAI. You can use any of the official OpenAI libraries – for example, the OpenAI Python client library – to interact with your Scaleway Managed Inference deployments. Find the supported APIs and parameters here.
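Since the endpoints speak the OpenAI chat-completions format, a standard request payload works as-is. A minimal stdlib sketch – the endpoint URL and API key below are placeholders, and the official OpenAI Python client pointed at your deployment via its `base_url` parameter works equally well:

```python
import json
import urllib.request

# Placeholders: substitute your deployment's endpoint URL and your API key.
ENDPOINT = "<your-endpoint-url>/v1/chat/completions"
API_KEY = "<your-api-key>"

def build_chat_request(model: str, user_message: str) -> dict:
    """Build an OpenAI-compatible chat-completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": 128,
    }

payload = build_chat_request("mistral-7b-instruct-v0.3", "Hello!")

# Sending the request requires a live deployment, so it is left commented out:
# req = urllib.request.Request(
#     ENDPOINT,
#     data=json.dumps(payload).encode(),
#     headers={"Authorization": f"Bearer {API_KEY}",
#              "Content-Type": "application/json"},
# )
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```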

What are the advantages over mutualized LLM API services?
  • Complete isolation of computing and networking resources to ensure maximum control for sensitive applications.
  • Consistent and predictable performance, unaffected by the activity of other users.
  • No strict rate limits—usage is only constrained by the maximum load your deployment can handle.
  • Access to a wider range of models.
  • More cost-effective with high utilization.
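To illustrate the cost-effectiveness point, you can compare the fixed hourly rate of a dedicated deployment against a hypothetical pay-per-token price; the €0.90 per million tokens below is an assumed figure for the sketch, not a Scaleway price:

```python
# Break-even utilization between a dedicated deployment (fixed hourly rate)
# and pay-per-token pricing. The per-million-token price is HYPOTHETICAL.
def break_even_tokens_per_hour(hourly_rate_eur: float,
                               price_per_million_tokens_eur: float) -> float:
    """Tokens per hour above which the dedicated deployment is cheaper."""
    return hourly_rate_eur / price_per_million_tokens_eur * 1_000_000

# H100 deployment at €3.40/hour vs an assumed €0.90 per million tokens:
threshold = break_even_tokens_per_hour(3.40, 0.90)
print(f"{threshold:,.0f} tokens/hour")  # roughly 3.8 million tokens/hour
```

Above that sustained throughput, the unlimited-token dedicated deployment comes out cheaper than per-token billing under these assumptions.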
Do you have pay-per-token hosted models?

Managed Inference deploys AI models and creates dedicated endpoints on a secure production infrastructure.

Alternatively, Scaleway has a selection of hosted models in its datacenters, priced per million tokens consumed, available via API. Find all details on the Generative APIs page.

I've got a request, where can I share it?

Tell us the good and the bad about your experience here. Thank you for your time!

Get started with tutorials