# Fixing common issues with Generative APIs
Reviewed on 16 January 2025 • Published on 16 January 2025
Below are common issues that you may encounter when using Generative APIs, their causes, and recommended solutions.
## 429: Too Many Requests - You exceeded your current quota of requests/tokens per minute
### Cause
- You performed too many API requests over a given minute
- You consumed too many tokens (input and output) with your API requests over a given minute
### Solution
- Ask our support to raise your quota
- Smooth out your API request rate by limiting the number of API requests you perform in parallel (see the sketch after this list)
- Reduce the size of the input or output tokens processed by your API requests
- Use Managed Inference, where these quotas do not apply (your throughput will only be limited by the number of Inference Deployments you provision)
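For example, a retry-with-exponential-backoff wrapper can smooth out bursts of requests. The snippet below is a minimal sketch using the OpenAI-compatible Python client; the endpoint, API key, and model name are placeholders to adapt to your setup:

```python
import time

from openai import OpenAI, RateLimitError

client = OpenAI(
    base_url="https://api.scaleway.ai/{PROJECT_ID}/v1",  # placeholder: replace {PROJECT_ID} with your Project ID
    api_key="YOUR_API_KEY",                              # placeholder: replace with your API key
)

def chat_with_backoff(messages, max_retries=5):
    """Retry on 429 responses, doubling the wait time between attempts."""
    delay = 1.0
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="llama-3.1-8b-instruct",  # placeholder model name
                messages=messages,
            )
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            time.sleep(delay)
            delay *= 2
```

Limiting the number of requests running in parallel (for example with a semaphore or a worker pool) has the same smoothing effect.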
## 504: Gateway Timeout
### Cause
- The query takes too long to process (even if the context length stays within the supported context window and maximum token limits)
- The model goes into an infinite loop while processing the input (which is a known structural issue with several AI models)
### Solution
- Set a stricter maximum token limit to prevent overly long responses (see the sketch after this list).
- Reduce the size of the input tokens, or split the input into multiple API requests.
- Use Managed Inference, where no query timeout is enforced.
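As an illustration, the `max_tokens` parameter caps the response length. This is a minimal sketch with the OpenAI-compatible Python client; the endpoint, API key, and model name are placeholders:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.scaleway.ai/{PROJECT_ID}/v1",  # placeholder endpoint
    api_key="YOUR_API_KEY",                              # placeholder API key
)

response = client.chat.completions.create(
    model="llama-3.1-8b-instruct",  # placeholder model name
    messages=[{"role": "user", "content": "Summarize the following text: ..."}],
    max_tokens=512,  # cap the response length so the request completes before the gateway timeout
)
print(response.choices[0].message.content)
```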
## Structured output (e.g., JSON) is not working correctly
### Cause
- Incorrect field naming in the request, such as using `"format"` instead of the correct `"response_format"` field.
- Lack of a JSON schema, which can lead to ambiguity in the output structure.
### Solution
- Ensure the proper field `"response_format"` is used in the query.
- Provide a JSON schema in the request to guide the model's structured output (see the sketch after this list).
- Refer to the documentation on structured outputs for examples and additional guidance.
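For instance, a request using `response_format` with a JSON schema could look like the sketch below (OpenAI-compatible Python client; the endpoint, API key, model name, and schema are placeholders, and the exact supported fields are described in the structured outputs documentation):

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.scaleway.ai/{PROJECT_ID}/v1",  # placeholder endpoint
    api_key="YOUR_API_KEY",                              # placeholder API key
)

response = client.chat.completions.create(
    model="llama-3.1-8b-instruct",  # placeholder model name
    messages=[{"role": "user", "content": "Extract the city and country from: 'Paris is the capital of France.'"}],
    response_format={  # use "response_format", not "format"
        "type": "json_schema",
        "json_schema": {
            "name": "location",  # placeholder schema name
            "schema": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "country": {"type": "string"},
                },
                "required": ["city", "country"],
            },
        },
    },
)
print(response.choices[0].message.content)  # JSON string matching the schema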
## Multiple successive `"role": "user"` messages
### Cause
- Successive messages with `"role": "user"` are sent in the API request instead of alternating between `"role": "user"` and `"role": "assistant"`.
### Solution
- Ensure the `"messages"` array alternates between `"role": "user"` and `"role": "assistant"`.
- If multiple `"role": "user"` messages need to be sent, concatenate them into one `"role": "user"` message or intersperse them with appropriate `"role": "assistant"` responses (see the sketch after the example error message below).
Example error message (for Mistral models)
{"object": "error","message": "After the optional system message, conversation roles must alternate user/assistant/user/assistant/...","type": "BadRequestError","param": null,"code": 400}
## Token consumption is not displayed in Cockpit metrics
### Causes
- Cockpit is isolated by `project_id` and only displays token consumption related to one Project.
- Cockpit `Tokens Processed` graphs over time can take up to an hour to update (to provide more accurate average consumption over time). The overall `Tokens Processed` counter is updated in real time.
### Solution
- Ensure you are connecting to the Cockpit corresponding to your Project. Cockpits are currently isolated by `project_id`, which you can see in their URL: `https://PROJECT_ID.dashboard.obs.fr-par.scw.cloud/`. This Project should correspond to the one used in the URL of your Generative APIs requests, such as `https://api.scaleway.ai/{PROJECT_ID}/v1/chat/completions`. You can list your Projects and their IDs in your Organization dashboard.
Example error behavior
- When displaying the wrong Cockpit for the Project:
  - The counter for Tokens Processed or API Requests should display a value of 0
  - The graph across time should be empty
- When displaying the Cockpit of a specific Project, but waiting for average token consumption to display:
  - The counter for Tokens Processed or API Requests should display a correct value (different from 0)
  - The graph across time should be empty
## Best practices for optimizing model performance

### Input size management

- Avoid overly long input sequences; break them into smaller chunks if needed.
- Use summarization techniques for large inputs to reduce token count while maintaining relevance.

### Use proper parameter configuration

- Double-check parameters like `"temperature"`, `"max_tokens"`, and `"top_p"` to ensure they align with your use case (see the sketch after this section).
- For structured output, always include a `"response_format"` and, if possible, a detailed JSON schema.

### Debugging silent errors

- For cases where no explicit error is returned:
  - Verify all fields in the API request are correctly named and formatted.
  - Test the request with smaller and simpler inputs to isolate potential issues.
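As an illustration of parameter configuration, a request with explicit sampling and length settings might look like the sketch below (OpenAI-compatible Python client; the endpoint, API key, model name, and values are placeholders to adapt to your use case):

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.scaleway.ai/{PROJECT_ID}/v1",  # placeholder endpoint
    api_key="YOUR_API_KEY",                              # placeholder API key
)

response = client.chat.completions.create(
    model="llama-3.1-8b-instruct",  # placeholder model name
    messages=[{"role": "user", "content": "List three key points of this report: ..."}],
    temperature=0.2,  # lower values give more deterministic answers
    top_p=0.9,        # nucleus sampling cutoff
    max_tokens=256,   # upper bound on the response length
)
print(response.choices[0].message.content)
```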