Fixing common issues with Generative APIs

Reviewed on 16 January 2025 | Published on 16 January 2025

Below are common issues that you may encounter when using Generative APIs, their causes, and recommended solutions.

400: Bad Request - You exceeded maximum context window for this model

Cause

  • You provided an input exceeding the maximum context window (also known as context length) for the model you are using.
  • You provided a long input and requested a long output (in the max_completion_tokens field), which together exceed the maximum context window of the model you are using.

Solution

  • Reduce your input size below what is supported by the model (see the pre-flight check sketch below).
  • Use a model supporting longer context window values.
  • Use Managed Inference, where the context window can be increased for several configurations with additional GPU vRAM. For instance, the llama-3.3-70b-instruct model in fp8 quantization can be served with:
    • 15k tokens context window on H100 Instances
    • 128k tokens context window on H100-2 Instances
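
As an illustration, the sketch below checks that the prompt plus the requested completion stays within the model's context window before sending the request. The 128,000-token limit is an assumption to adapt to the model you use, and tiktoken's cl100k_base encoding is only an approximation of the tokenizer Llama models actually use.

import os
from openai import OpenAI
import tiktoken  # approximate tokenizer; adjust for your model's own vocabulary

MAX_CONTEXT_WINDOW = 128000      # assumed value; check your model's documented context window
MAX_COMPLETION_TOKENS = 4096

client = OpenAI(
    base_url="https://api.scaleway.ai/v1",
    api_key=os.getenv("SCW_SECRET_KEY"),
)

def safe_completion(prompt: str):
    # Rough token count; exact numbers depend on the model's own tokenizer.
    encoder = tiktoken.get_encoding("cl100k_base")
    prompt_tokens = len(encoder.encode(prompt))
    if prompt_tokens + MAX_COMPLETION_TOKENS > MAX_CONTEXT_WINDOW:
        raise ValueError(
            f"Prompt ({prompt_tokens} tokens) plus max_completion_tokens "
            f"({MAX_COMPLETION_TOKENS}) exceeds the {MAX_CONTEXT_WINDOW}-token context window."
        )
    return client.chat.completions.create(
        model="llama-3.3-70b-instruct",
        messages=[{"role": "user", "content": prompt}],
        max_completion_tokens=MAX_COMPLETION_TOKENS,
    )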

403: Forbidden - Insufficient permissions to access the resource

Cause

  • You are not providing valid credentials
  • You do not have the required IAM permission sets
  • You are not connecting to the right endpoint URL

Solution

  • Ensure you provide an API secret key in your API request or third-party library configuration (see the configuration sketch below)
  • Ensure the IAM user or application you are connecting with has the right IAM permission sets (either GenerativeApisFullAccess or a narrower one)
  • If you have access only to a specific Project, ensure you are connecting to this Project URL:
    • The URL format is: https://api.scaleway.ai/{project_id}/v1
    • If no project_id is specified in the URL (https://api.scaleway.ai/v1), your default Project will be used.
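
For reference, a minimal client configuration along these lines is shown below, using the OpenAI-compatible Python client. The Project ID is a placeholder, and SCW_SECRET_KEY is assumed to hold an IAM API secret key with sufficient permissions.

import os
from openai import OpenAI

# Placeholder Project ID; replace with your own.
PROJECT_ID = "00000000-0000-0000-0000-000000000000"

client = OpenAI(
    # Project-scoped endpoint; use "https://api.scaleway.ai/v1" to target your default Project.
    base_url=f"https://api.scaleway.ai/{PROJECT_ID}/v1",
    # IAM API secret key of a user or application with GenerativeApisFullAccess (or a narrower set).
    api_key=os.getenv("SCW_SECRET_KEY"),
)

response = client.chat.completions.create(
    model="llama-3.3-70b-instruct",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)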

416: Range Not Satisfiable - max_completion_tokens is limited for this model

Cause

  • You provided a value for max_completion_tokens that is too high and not supported by the model you are using.

Solution

  • Remove the max_completion_tokens field from your request or client library, or reduce its value below what is supported by the model (see the sketch below).
    • As an example, when using init_chat_model from LangChain, you should edit the max_tokens value in the following configuration:
      llm = init_chat_model("llama-3.3-70b-instruct", max_tokens=8000, model_provider="openai", base_url="https://api.scaleway.ai/v1", temperature=0.7)
  • Use a model supporting a higher max_completion_tokens value.
  • Use Managed Inference, where these limits on completion tokens do not apply (your completion tokens amount will still be limited by the maximum context window supported by the model).
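
Equivalently, with the OpenAI-compatible client you can set or lower the field directly. The 4096 value below is an assumption; check the maximum documented for the model you use, or omit the field entirely to fall back to the model default.

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.scaleway.ai/v1",
    api_key=os.getenv("SCW_SECRET_KEY"),
)

response = client.chat.completions.create(
    model="llama-3.3-70b-instruct",
    messages=[{"role": "user", "content": "Summarize the plot of Hamlet."}],
    # Keep this at or below the model's documented limit, or remove it entirely.
    max_completion_tokens=4096,
)
print(response.choices[0].message.content)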

429: Too Many Requests - You exceeded your current quota of requests/tokens per minute

Cause

  • You performed too many API requests over a given minute
  • You consumed too many tokens (input and output) with your API requests over a given minute

Solution

  • Smooth out your API request rate by limiting the number of API requests you perform over a given minute, so that you remain below your Organization quotas for Generative APIs (see the backoff sketch below).
  • Add a payment method and validate your identity to automatically increase your quotas based on standard limits.
  • Ask our support to raise your quota.
  • Reduce the size of the input or output tokens processed by your API requests.
  • Use Managed Inference, where these quotas do not apply (your throughput will only be limited by the number of Inference Deployments you provision).
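
One way to smooth out the request rate is to retry with exponential backoff when a 429 is returned. The sketch below assumes the openai Python client (v1 or later), which raises openai.RateLimitError on 429 responses; the retry count and delays are arbitrary examples.

import os
import time
import openai
from openai import OpenAI

client = OpenAI(
    base_url="https://api.scaleway.ai/v1",
    api_key=os.getenv("SCW_SECRET_KEY"),
)

def create_with_backoff(messages, max_retries=5):
    # Retry on 429 responses with exponential backoff to stay below per-minute quotas.
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="llama-3.3-70b-instruct",
                messages=messages,
            )
        except openai.RateLimitError:
            time.sleep(2 ** attempt)  # wait 1s, 2s, 4s, 8s, 16s between attempts
    raise RuntimeError("Still rate limited after retries; consider requesting a quota increase.")

print(create_with_backoff([{"role": "user", "content": "Hello"}]).choices[0].message.content)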

429: Too Many Requests - You exceeded your current threshold of concurrent requests

Cause

  • You kept too many API requests open at the same time (number of HTTP sessions open in parallel)

Solution

  • Smooth out your API request rate by limiting the number of API requests you perform at the same time (e.g. requests that have not yet received a complete response and are still open), so that you remain below your Organization quotas for Generative APIs (see the concurrency sketch below).
  • Use Managed Inference, where concurrent request limits do not apply. Note that exceeding the number of concurrent requests your Inference Deployment can handle may impact performance metrics.
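
A common way to cap parallel requests is a semaphore around the client calls. The sketch below uses the asynchronous OpenAI-compatible client; the limit of 5 concurrent requests is an assumption to align with your own quota.

import os
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="https://api.scaleway.ai/v1",
    api_key=os.getenv("SCW_SECRET_KEY"),
)

MAX_CONCURRENT_REQUESTS = 5  # assumed ceiling; align it with your Organization quota
semaphore = asyncio.Semaphore(MAX_CONCURRENT_REQUESTS)

async def ask(prompt: str) -> str:
    # The semaphore caps how many HTTP sessions are open at the same time.
    async with semaphore:
        response = await client.chat.completions.create(
            model="llama-3.3-70b-instruct",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

async def main():
    prompts = [f"Give me one fact about the ocean (#{i})." for i in range(20)]
    answers = await asyncio.gather(*(ask(p) for p in prompts))
    print(answers[0])

asyncio.run(main())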

504: Gateway Timeout

Cause

  • The query takes too long to process (even if the input stays within the supported context window and maximum token limits)
  • The model goes into an infinite loop while processing the input (which is a known structural issue with several AI models)

Solution

  • Set a stricter maximum token limit to prevent overly long responses.
  • Reduce the size of the input tokens, or split the input into multiple API requests (see the chunking sketch below).
  • Use Managed Inference, where no query timeout is enforced.
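
As an illustration, the sketch below splits a long document into chunks and summarizes each one in a separate request, with a strict completion limit. The chunk size is character-based and arbitrary; a token-aware splitter would give more even chunks.

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.scaleway.ai/v1",
    api_key=os.getenv("SCW_SECRET_KEY"),
)

def summarize_in_chunks(text: str, chunk_size: int = 20000) -> str:
    # Naive character-based split; each chunk is processed in its own request.
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    partial_summaries = []
    for chunk in chunks:
        response = client.chat.completions.create(
            model="llama-3.3-70b-instruct",
            messages=[{"role": "user", "content": f"Summarize the following text:\n\n{chunk}"}],
            max_completion_tokens=512,  # stricter limit to prevent overly long responses
        )
        partial_summaries.append(response.choices[0].message.content)
    return "\n".join(partial_summaries)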

Structured output (e.g., JSON) is not working correctly

Description

  • Structured output response contains invalid JSON
  • Structured output response is valid JSON but content is less relevant

Causes

  • Incorrect field naming in the request, such as using "format" instead of the correct "response_format" field.
  • Lack of a JSON schema, which can lead to ambiguity in the output structure.
  • The maximum token limit is lower than what the model needs to complete its response.
  • Temperature is not set or set too high.

Solution

  • Ensure the proper field "response_format" is used in the query.
  • Provide a JSON schema in the request to guide the model’s structured output.
  • Ensure the max_tokens value is higher than the response completion_tokens value. If this is not the case, the model's answer may be cut off before it can finish the proper JSON structure (and lack closing JSON brackets }, for example). Note that if the max_tokens value is not set in the query, default values apply for each model.
  • Ensure the temperature value is set to a lower value for the model. If this is not the case, the model may output invalid JSON characters. Note that if the temperature value is not set in the query, default values apply for each model. As examples:
    • for llama-3.3-70b-instruct, temperature should be set lower than 0.6
    • for mistral-nemo-instruct-2407, temperature should be set lower than 0.3
  • Refer to the documentation on structured outputs for examples and additional guidance, or see the minimal sketch below.
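
A minimal sketch of a structured output request is shown below. The json_schema response format follows the OpenAI-compatible convention, and the schema, temperature, and max_tokens values are illustrative assumptions; check the structured outputs documentation for the formats supported by your model.

import os
import json
from openai import OpenAI

client = OpenAI(
    base_url="https://api.scaleway.ai/v1",
    api_key=os.getenv("SCW_SECRET_KEY"),
)

# Illustrative schema constraining the model to a simple JSON object.
book_schema = {
    "name": "book",
    "schema": {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "year": {"type": "integer"},
        },
        "required": ["title", "year"],
    },
}

response = client.chat.completions.create(
    model="llama-3.3-70b-instruct",
    messages=[{"role": "user", "content": "Give me one classic novel as JSON."}],
    response_format={"type": "json_schema", "json_schema": book_schema},  # "response_format", not "format"
    temperature=0.3,   # keep temperature low to avoid invalid JSON characters
    max_tokens=512,    # leave room for the full JSON structure to be closed
)
print(json.loads(response.choices[0].message.content))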

Multiple successive "role": "user" messages

Cause

  • Successive messages with "role": "user" are sent in the API request instead of alternating between "role": "user" and "role": "assistant".

Solution

  • Ensure the "messages" array alternates between "role": "user" and "role": "assistant".
  • If multiple "role": "user" messages need to be sent, concatenate them into one "role": "user" message or intersperse them with appropriate "role": "assistant" responses.

Example error message (for Mistral models)

{
  "object": "error",
  "message": "After the optional system message, conversation roles must alternate user/assistant/user/assistant/...",
  "type": "BadRequestError",
  "param": null,
  "code": 400
}
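
A small helper along these lines can be used to merge consecutive user messages before sending the request; it is a minimal sketch, not part of any client library.

def merge_consecutive_user_messages(messages: list[dict]) -> list[dict]:
    # Concatenate successive "user" messages so that roles alternate user/assistant.
    merged = []
    for message in messages:
        if merged and message["role"] == "user" and merged[-1]["role"] == "user":
            merged[-1]["content"] += "\n" + message["content"]
        else:
            merged.append(dict(message))
    return merged

messages = [
    {"role": "user", "content": "Here is my first question."},
    {"role": "user", "content": "And a follow-up sent before you answered."},
    {"role": "assistant", "content": "Here is my answer to both."},
    {"role": "user", "content": "Thanks, one more thing."},
]
print(merge_consecutive_user_messages(messages))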

Token consumption is not displayed in Cockpit metrics

Causes

  • Cockpit is isolated by project_id and only displays token consumption related to one Project.
  • Cockpit Tokens Processed graphs over time can take up to an hour to update (to provide more accurate average consumption over time). The overall Tokens Processed counter is updated in real time.

Solution

  • Ensure you are connecting to the Cockpit corresponding to your Project. Cockpits are currently isolated by project_id, which you can see in their URL: https://PROJECT_ID.dashboard.obs.fr-par.scw.cloud/. This Project should correspond to the one used in the URL you used to perform Generative APIs requests, such as https://api.scaleway.ai/{PROJECT_ID}/v1/chat/completions. You can list your projects and their IDs in your Organization dashboard.

Example error behavior

  • When displaying the wrong Cockpit for the Project:
    • Counter for Tokens Processed or API Requests should display a value of 0
    • Graph across time should be empty
  • When displaying the Cockpit of a specific Project, but waiting for average token consumption to display:
    • Counter for Tokens Processed or API Requests should display a correct value (different from 0)
    • Graph across time should be empty

Embedding vectors cannot be stored in a database or used with a third-party library

Cause

The embedding model you are using generates vector representations with a fixed number of dimensions, which is too high for your database or third-party library.

  • For example, the embedding model bge-multilingual-gemma2 generates vector representations with 3584 dimensions. However, when storing vectors using the PostgreSQL pgvector extension, indexes (in hnsw or ivfflat formats) only support up to 2000 dimensions.

Solution

  • Use a vector store supporting higher dimension counts, such as Qdrant.
  • Do not use indexes for vectors, or disable them in your third-party library. This may limit performance in vector similarity search for significant volumes.
    • When using the LangChain PGVector integration, no index is created by default, so it should not raise errors.
    • When using the Mastra library with vectorStoreName: "pgvector", specify the indexConfig type as flat to avoid creating any index on vector dimensions:
      await vectorStore.createIndex({
        indexName: 'papers',
        dimension: 3584,
        indexConfig: { "type": "flat" },
      });
  • Use a model with a lower number of dimensions. Using Managed Inference, you can for instance deploy the sentence-t5-xxl model, which represents vectors with 768 dimensions. A sketch for checking your model's output dimension follows this list.
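
For reference, the sketch below generates an embedding through the OpenAI-compatible endpoint and checks its dimension against the pgvector index limit. The model name and the 2000-dimension threshold mirror the examples above; adapt them to your own setup.

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.scaleway.ai/v1",
    api_key=os.getenv("SCW_SECRET_KEY"),
)

PGVECTOR_INDEX_MAX_DIMENSIONS = 2000  # hnsw/ivfflat index limit in pgvector

response = client.embeddings.create(
    model="bge-multilingual-gemma2",
    input="Vector databases store embedding representations of documents.",
)
vector = response.data[0].embedding
print(len(vector))  # 3584 for bge-multilingual-gemma2

if len(vector) > PGVECTOR_INDEX_MAX_DIMENSIONS:
    print("Too many dimensions for an hnsw/ivfflat index: store vectors without an index, "
          "use a store such as Qdrant, or switch to a lower-dimension model.")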

Previous messages are not taken into account by the model

Causes

  • Previous messages are not sent to the model
  • The content sent exceeds the maximum context window for this model

Solution

  • LLMs are completely “stateless” and thus do not store previous messages or conversations. For example, when building a chatbot application, each time a new message is sent by the user, all preceding messages in the conversation need to be sent through the API payload. Example payload for a multi-turn conversation:
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.scaleway.ai/v1",
    api_key=os.getenv("SCW_SECRET_KEY")
)
response = client.chat.completions.create(
    model="llama-3.1-8b-instruct",
    messages=[
        {
            "role": "user",
            "content": "What is the solution to 1+1= ?"
        },
        {
            "role": "assistant",
            "content": "2"
        },
        {
            "role": "user",
            "content": "Double this number"
        }
    ]
)
print(response.choices[0].message.content)

This snippet will output the model response, which is 4.

  • When exceeding the maximum context window, you should receive a 400 - BadRequestError detailing the context length value you exceeded. In this case, you should reduce the size of the content you send to the API (see the error-handling sketch below).
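
A minimal error-handling sketch is shown below, assuming the openai Python client (v1 or later), which raises openai.BadRequestError on 400 responses. The long_conversation variable is a hypothetical message history that may have grown past the context window.

import os
import openai
from openai import OpenAI

client = OpenAI(
    base_url="https://api.scaleway.ai/v1",
    api_key=os.getenv("SCW_SECRET_KEY")
)

# Hypothetical conversation history that may have grown past the context window.
long_conversation = [{"role": "user", "content": "..."}]

try:
    response = client.chat.completions.create(
        model="llama-3.1-8b-instruct",
        messages=long_conversation,
    )
    print(response.choices[0].message.content)
except openai.BadRequestError as error:
    # Typically raised when the payload exceeds the maximum context window.
    print(f"Request rejected: {error}")
    # Drop or summarize the oldest messages, then retry with a smaller payload.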

Best practices for optimizing model performance

Input size management

  • Avoid overly long input sequences; break them into smaller chunks if needed.
  • Use summarization techniques for large inputs to reduce token count while maintaining relevance.

Use proper parameter configuration

  • Double-check parameters like "temperature", "max_tokens", and "top_p" to ensure they align with your use case.
  • For structured output, always include a "response_format" and, if possible, a detailed JSON schema.

Debugging silent errors

  • For cases where no explicit error is returned:
    • Verify all fields in the API request are correctly named and formatted.
    • Test the request with smaller and simpler inputs to isolate potential issues.