Fixing common issues with Generative APIs
Reviewed on 16 January 2025 • Published on 16 January 2025
Below are common issues that you may encounter when using Generative APIs, their causes, and recommended solutions.
429: Too Many Requests - You exceeded your current quota of requests/tokens per minute
Cause
- You performed too many API requests over a given minute
- You consumed too many tokens (input and output) with your API requests over a given minute
Solution
- Ask our support to raise your quota
- Smooth out your API request rate by limiting the number of requests you perform in parallel (see the retry sketch after this list)
- Reduce the size of the input or output tokens processed by your API requests
- Use Managed Inference, where these quotas do not apply (your throughput is only limited by the number of Inference Deployments you provision)
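Where retrying is acceptable, a common way to smooth out the request rate is exponential backoff. Below is a minimal sketch assuming an OpenAI-compatible Python client; the endpoint URL, API key, and model name are placeholders, not confirmed values.

```python
# Minimal sketch: retry on 429 with exponential backoff.
# The endpoint, key, and model below are placeholders (assumptions).
import time
from openai import OpenAI, RateLimitError

client = OpenAI(base_url="https://<your-endpoint>/v1", api_key="<your-api-key>")

def chat_with_backoff(messages, max_retries=5):
    delay = 1.0
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="<model-name>",  # placeholder
                messages=messages,
            )
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            time.sleep(delay)  # wait before retrying
            delay *= 2         # double the delay after each 429
```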
504: Gateway Timeout
Cause
- The query takes too long to process (even if the context length stays within the supported context window and maximum output tokens)
- The model goes into an infinite loop while processing the input (which is a known structural issue with several AI models)
Solution
- Set a stricter maximum token limit to prevent overly long responses (see the sketch after this list).
- Reduce the size of the input tokens, or split the input into multiple API requests.
- Use Managed Inference, where no query timeout is enforced.
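As an illustration of the first point, here is a minimal sketch that caps output length with "max_tokens". It assumes an OpenAI-compatible Python client; the endpoint and model name are placeholders.

```python
# Minimal sketch: capping output length to reduce the risk of
# hitting a gateway timeout. Endpoint and model are placeholders.
from openai import OpenAI

client = OpenAI(base_url="https://<your-endpoint>/v1", api_key="<your-api-key>")

response = client.chat.completions.create(
    model="<model-name>",  # placeholder
    messages=[{"role": "user", "content": "Summarize the following text: ..."}],
    max_tokens=512,  # stricter cap on generated tokens
)
```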
Structured output (e.g., JSON) is not working correctly
Cause
- Incorrect field naming in the request, such as using "format" instead of the correct "response_format" field.
- Lack of a JSON schema, which can lead to ambiguity in the output structure.
Solution
- Ensure the proper field "response_format" is used in the query.
- Provide a JSON schema in the request to guide the model's structured output (see the sketch after this list).
- Refer to the documentation on structured outputs for examples and additional guidance.
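The following is a minimal sketch of a structured output request using a JSON schema. It assumes an OpenAI-compatible Python client and support for the "json_schema" response format; the endpoint, model name, and schema are placeholders for illustration, and exact field support may vary by model.

```python
# Minimal sketch: structured JSON output with a schema (assumes
# "json_schema" response_format support; endpoint/model are placeholders).
from openai import OpenAI

client = OpenAI(base_url="https://<your-endpoint>/v1", api_key="<your-api-key>")

response = client.chat.completions.create(
    model="<model-name>",  # placeholder
    messages=[{"role": "user", "content": "Extract the name and age from: Anna is 30."}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "person",
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"},
                },
                "required": ["name", "age"],
            },
        },
    },
)
print(response.choices[0].message.content)  # e.g. {"name": "Anna", "age": 30}
```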
Multiple successive "role": "user" messages
Cause
- Successive messages with "role": "user" are sent in the API request instead of alternating between "role": "user" and "role": "assistant".
Solution
- Ensure the "messages" array alternates between "role": "user" and "role": "assistant".
- If multiple "role": "user" messages need to be sent, concatenate them into one "role": "user" message or intersperse them with appropriate "role": "assistant" responses (see the sketch after this list).
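Below is a minimal sketch of the concatenation approach; the helper name is hypothetical, and the logic shown is one way to merge consecutive user turns before sending the request.

```python
# Minimal sketch: merge consecutive "user" messages so that roles
# alternate. The helper name is hypothetical.
def merge_consecutive_user_messages(messages):
    merged = []
    for msg in messages:
        if merged and msg["role"] == "user" and merged[-1]["role"] == "user":
            # Append to the previous user turn instead of sending
            # two "user" messages in a row.
            merged[-1]["content"] += "\n" + msg["content"]
        else:
            merged.append(dict(msg))
    return merged

messages = [
    {"role": "user", "content": "First question."},
    {"role": "user", "content": "Second question."},
]
print(merge_consecutive_user_messages(messages))
# [{'role': 'user', 'content': 'First question.\nSecond question.'}]
```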
Example error message (for Mistral models)
{"object": "error","message": "After the optional system message, conversation roles must alternate user/assistant/user/assistant/...","type": "BadRequestError","param": null,"code": 400}
Best practices for optimizing model performance
Input size management
- Avoid overly long input sequences; break them into smaller chunks if needed (see the sketch after this list).
- Use summarization techniques for large inputs to reduce token count while maintaining relevance.
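A minimal sketch of simple chunking by character budget follows. Real token counts are model-specific, so this is an approximation rather than an exact token split.

```python
# Minimal sketch: split a long input into chunks by a rough character
# budget. Character counts only approximate token counts.
def chunk_text(text: str, max_chars: int = 4000) -> list[str]:
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

# Each chunk can then be sent in its own API request, for example to
# summarize a long document piece by piece.
long_input = open("document.txt").read()  # hypothetical input file
chunks = chunk_text(long_input)
```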
Use proper parameter configuration
- Double-check parameters like "temperature", "max_tokens", and "top_p" to ensure they align with your use case (see the sketch after this list).
- For structured output, always include a "response_format" and, if possible, a detailed JSON schema.
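For instance, here is a minimal sketch of an explicit parameter configuration, again assuming an OpenAI-compatible Python client with placeholder endpoint and model; the values shown are illustrative, not recommendations for every use case.

```python
# Minimal sketch: explicit sampling parameters. Values are illustrative.
from openai import OpenAI

client = OpenAI(base_url="https://<your-endpoint>/v1", api_key="<your-api-key>")

response = client.chat.completions.create(
    model="<model-name>",   # placeholder
    messages=[{"role": "user", "content": "Write a haiku about the sea."}],
    temperature=0.7,  # lower values make output more deterministic
    top_p=0.9,        # nucleus sampling cutoff
    max_tokens=128,   # hard cap on output length
)
```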
Debugging silent errors
- For cases where no explicit error is returned:
- Verify all fields in the API request are correctly named and formatted.
- Test the request with smaller and simpler inputs to isolate potential issues (see the sketch below).
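A minimal baseline request can help isolate silent failures: if the sketch below works while your full request does not, the difference between the two is the likely cause. Endpoint and model are placeholders.

```python
# Minimal sketch: the simplest valid request, used as a baseline when
# isolating silently failing requests. Endpoint/model are placeholders.
from openai import OpenAI

client = OpenAI(base_url="https://<your-endpoint>/v1", api_key="<your-api-key>")

response = client.chat.completions.create(
    model="<model-name>",  # placeholder
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
```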