
How to query text models

Reviewed on 28 August 2024 • Published on 28 August 2024

Scaleway’s Generative APIs service allows users to interact with powerful text models hosted on the platform.

There are several ways to interact with text models:

  • The Scaleway console will soon provide a complete playground for testing models, adjusting parameters, and observing how these changes affect the output in real time.
  • Via the Chat API

Before you start

To complete the actions presented below, you must have:

  • Access to the service, which is restricted while in beta. You can request access by filling out the form on Scaleway’s betas page.
  • A Scaleway account logged into the console
  • Owner status or IAM permissions allowing you to perform actions in the intended Organization
  • A valid API key for API authentication
  • Python 3.7+ installed on your system

Accessing the Playground

Scaleway’s Playground is in development. Stay tuned!

Querying text models via API

The Chat API is an OpenAI-compatible REST API for generating and manipulating conversations.

You can query the models programmatically using your favorite tools or languages. In the following example, we will use the OpenAI Python client.

Installing the OpenAI SDK

Install the OpenAI SDK using pip:

pip install openai

Initializing the client

Initialize the OpenAI client with your base URL and API key:

from openai import OpenAI

# Initialize the client with your base URL and API key
client = OpenAI(
    base_url="https://api.scaleway.ai/v1",  # Scaleway's Generative APIs service URL
    api_key="<SCW_API_KEY>",  # Your unique API key from Scaleway
)

Generating a chat completion

You can now create a chat completion, for example with the llama-3.1-8b-instruct model:

# Create a chat completion using the 'llama-3.1-8b-instruct' model
response = client.chat.completions.create(
    model="llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "Describe a futuristic city with advanced technology and green energy solutions."}],
    temperature=0.2,  # Adjusts creativity
    max_tokens=100,   # Limits the length of the output
    top_p=0.7,        # Controls diversity through nucleus sampling
)

# Print the generated response
print(response.choices[0].message.content)

This code sends a message to the model and returns an answer based on your input. The temperature, max_tokens, and top_p parameters control the response’s creativity, length, and diversity, respectively.

A conversation style may include a default system prompt. You can set this prompt by making the first message of the conversation use the role system. For example:

[
    {
        "role": "system",
        "content": "You are Xavier Niel."
    },
    {
        "role": "user",
        "content": "Hello, what is your name?"
    }
]
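
As a minimal sketch (reusing the client initialized above), this conversation could be sent as follows; the message contents are just the example values shown above:

# Sketch: the system prompt is passed as the first message of the conversation
response = client.chat.completions.create(
    model="llama-3.1-8b-instruct",
    messages=[
        {"role": "system", "content": "You are Xavier Niel."},
        {"role": "user", "content": "Hello, what is your name?"},
    ],
)
print(response.choices[0].message.content)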

Model parameters and their effects

The following parameters influence the output of the model (see the example after this list):

  • messages: A list of message objects that represent the conversation history. Each message should have a role (e.g., “system”, “user”, “assistant”) and content.
  • temperature: Controls the output’s randomness. Lower values (e.g., 0.2) make the output more deterministic, while higher values (e.g., 0.8) make it more creative.
  • max_tokens: The maximum number of tokens (words or parts of words) in the generated output.
  • top_p: Recommended for advanced use cases only. You usually only need to use temperature. top_p controls the diversity of the output, using nucleus sampling, where the model considers the tokens with top probabilities until the cumulative probability reaches top_p.
  • stop: A string or list of strings where the model will stop generating further tokens. This is useful for controlling the end of the output.
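
For example, here is a minimal sketch (reusing the client from above) where stop cuts the output off before a fourth list item is generated; the prompt is illustrative only:

# Sketch: use 'stop' to truncate the output before a fourth list item
response = client.chat.completions.create(
    model="llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "List the planets of the solar system, one per line, numbered."}],
    temperature=0.2,
    max_tokens=100,
    stop=["4."],  # generation halts before this string is emitted
)
print(response.choices[0].message.content)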

If you encounter an error such as 403 Forbidden, refer to the API documentation for troubleshooting tips.
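
The OpenAI Python client raises typed exceptions for HTTP errors, so you can handle them in code. Here is a minimal sketch for catching the most common ones, reusing the client from above (the exception classes come from the openai package, not from Scaleway):

import openai

try:
    response = client.chat.completions.create(
        model="llama-3.1-8b-instruct",
        messages=[{"role": "user", "content": "Hello!"}],
    )
except openai.AuthenticationError:
    # 401: the api_key value is missing or invalid
    print("Check the API key passed to the client.")
except openai.PermissionDeniedError:
    # 403 Forbidden: the key is valid but lacks the required permissions
    print("Check your IAM permissions for this Organization.")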

Streaming

By default, the outputs are returned to the client only after the generation process is complete. However, a common alternative is to stream the results back to the client as they are generated. This is particularly useful in chat applications, where it allows the client to view the results incrementally as each token is produced. Following is an example using the chat completions API:

from openai import OpenAI

client = OpenAI(
    base_url="https://api.scaleway.ai/v1",  # Scaleway's Generative APIs service URL
    api_key="<SCW_API_KEY>",  # Your unique API key from Scaleway
)

response = client.chat.completions.create(
    model="llama-3.1-8b-instruct",
    messages=[{
        "role": "user",
        "content": "Sing me a song",
    }],
    stream=True,
)

# Print each token as soon as it is received
for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

Async

The service also supports asynchronous mode for any chat completion.

import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="https://api.scaleway.ai/v1",  # Scaleway's Generative APIs service URL
    api_key="<SCW_API_KEY>",  # Your unique API key from Scaleway
)

async def main():
    stream = await client.chat.completions.create(
        model="llama-3.1-8b-instruct",
        messages=[{
            "role": "user",
            "content": "Sing me a song",
        }],
        stream=True,
    )
    async for chunk in stream:
        # Guard against the final chunk, whose delta has no content
        if chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="")

asyncio.run(main())
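
Asynchronous mode also makes it easy to issue several independent completions concurrently. The following is a sketch of one way to do so with asyncio.gather; the helper function ask and the prompts are illustrative, not part of the API:

import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="https://api.scaleway.ai/v1",  # Scaleway's Generative APIs service URL
    api_key="<SCW_API_KEY>",  # Your unique API key from Scaleway
)

async def ask(prompt: str) -> str:
    # One non-streaming completion per prompt
    response = await client.chat.completions.create(
        model="llama-3.1-8b-instruct",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

async def main():
    # Run several independent completions concurrently
    answers = await asyncio.gather(
        ask("Give me a haiku about the sea."),
        ask("Give me a haiku about mountains."),
    )
    for answer in answers:
        print(answer)

asyncio.run(main())
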
See also
How to query embedding models