Generative APIs - Concepts
API rate limits
API rate limits define the maximum number of requests a user can make to the Generative APIs within a specific time frame. Rate limiting helps to manage resource allocation, prevent abuse, and ensure fair access for all users. Understanding and adhering to these limits is essential for maintaining optimal performance in applications that use these APIs.
Context window
A context window is the maximum amount of prompt data the model takes into account when generating a response. Models with a longer context window let you provide more information and obtain more relevant responses. Context length is measured in tokens.
Function calling
Function calling allows a large language model (LLM) to interact with external tools or APIs, executing specific tasks based on user requests. The LLM identifies the appropriate function, extracts the required parameters, and returns the results as structured data, typically in JSON format.
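For illustration, here is a minimal sketch of function calling through an OpenAI-compatible chat client. The endpoint URL, API key, model name, and `get_weather` tool are placeholders, not confirmed values:

```python
from openai import OpenAI

# Placeholder endpoint, key, and model; replace with your actual values.
client = OpenAI(base_url="https://api.scaleway.ai/v1", api_key="SCW_API_KEY")

# Describe the external tool so the model can decide when to call it.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="llama-3.1-8b-instruct",  # placeholder model name
    messages=[{"role": "user", "content": "What is the weather in Paris?"}],
    tools=tools,
)

# If the model chose to call the tool, its arguments arrive as JSON text.
message = response.choices[0].message
if message.tool_calls:
    call = message.tool_calls[0]
    print(call.function.name, call.function.arguments)
```

Your application then executes the named function itself and can pass the result back to the model in a follow-up message.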
Embeddings
Embeddings are numerical representations of text data that capture semantic information in a dense vector format. In Generative APIs, embeddings are essential for tasks such as similarity matching, clustering, and serving as inputs for downstream models. These vectors enable the model to understand and generate text based on the underlying meaning rather than just the surface-level words.
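As a sketch, embeddings can be requested through an OpenAI-compatible endpoint and compared with cosine similarity; the endpoint and embedding model name below are placeholders:

```python
import math

from openai import OpenAI

client = OpenAI(base_url="https://api.scaleway.ai/v1", api_key="SCW_API_KEY")

def embed(text: str) -> list[float]:
    # Placeholder model name; use an embedding model exposed by your provider.
    result = client.embeddings.create(model="bge-multilingual-gemma2", input=text)
    return result.data[0].embedding

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Texts with similar meaning should yield vectors with high cosine similarity.
print(cosine_similarity(embed("How do I reset my password?"),
                        embed("Password reset instructions")))
```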
Error handling
Error handling refers to the strategies and mechanisms in place to manage and respond to errors during API requests. This includes handling network issues, invalid inputs, or server-side errors. Proper error handling ensures that applications using Generative APIs can gracefully recover from failures and provide meaningful feedback to users.
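A minimal sketch of such error handling, assuming the OpenAI Python client and its exception types; the endpoint and model name are placeholders:

```python
import time

from openai import APIConnectionError, APIStatusError, OpenAI, RateLimitError

client = OpenAI(base_url="https://api.scaleway.ai/v1", api_key="SCW_API_KEY")

def ask_with_retries(prompt: str, max_attempts: int = 3) -> str:
    for attempt in range(max_attempts):
        try:
            response = client.chat.completions.create(
                model="llama-3.1-8b-instruct",  # placeholder model name
                messages=[{"role": "user", "content": prompt}],
            )
            return response.choices[0].message.content
        except RateLimitError:
            # Rate limited: back off exponentially before retrying.
            time.sleep(2 ** attempt)
        except APIConnectionError:
            # Network issue: retry after a short pause.
            time.sleep(1)
        except APIStatusError as err:
            # Other server-side errors: surface a meaningful message.
            raise RuntimeError(f"API returned status {err.status_code}") from err
    raise RuntimeError("Request failed after retries")
```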
Parameters
Parameters are settings that control the behavior and performance of generative models. These include temperature, max tokens, and top-p sampling, among others. Adjusting parameters allows users to tweak the model’s output, balancing factors like creativity, accuracy, and response length to suit specific use cases.
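For example, these parameters can be passed on a chat completion request (a sketch; endpoint and model name are placeholders):

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.scaleway.ai/v1", api_key="SCW_API_KEY")

response = client.chat.completions.create(
    model="llama-3.1-8b-instruct",  # placeholder model name
    messages=[{"role": "user", "content": "Suggest a name for a bakery."}],
    temperature=0.7,  # randomness: lower values are more deterministic
    max_tokens=50,    # upper bound on the number of generated tokens
    top_p=0.9,        # nucleus sampling: sample from the top 90% probability mass
)
print(response.choices[0].message.content)
```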
Inter-token latency (ITL)
The inter-token latency (ITL) corresponds to the average time elapsed between two consecutive generated tokens. It is usually expressed in milliseconds.
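For example, if the last token of a 101-token response arrives 2,000 ms after the first token, the average ITL is 2,000 ms / 100 = 20 ms.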
JSON mode
JSON mode allows you to guide the language model in outputting well-structured JSON data.
To activate JSON mode, provide the `response_format` parameter with `{"type": "json_object"}`.
JSON mode is useful for applications like chatbots or APIs, where a machine-readable format is essential for easy processing.
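A minimal sketch of JSON mode with an OpenAI-compatible client (endpoint and model name are placeholders):

```python
import json

from openai import OpenAI

client = OpenAI(base_url="https://api.scaleway.ai/v1", api_key="SCW_API_KEY")

response = client.chat.completions.create(
    model="llama-3.1-8b-instruct",  # placeholder model name
    messages=[
        {"role": "system", "content": "Reply in JSON with keys 'city' and 'country'."},
        {"role": "user", "content": "Where is the Eiffel Tower?"},
    ],
    response_format={"type": "json_object"},  # activate JSON mode
)

# The response content can be parsed directly as JSON.
data = json.loads(response.choices[0].message.content)
print(data)
```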
Prompt engineering
Prompt engineering involves crafting specific and well-structured inputs (prompts) to guide the model towards generating the desired output. Effective prompt design is crucial for generating relevant responses, particularly in complex or creative tasks. It often requires experimentation to find the right balance between specificity and flexibility.
Retrieval Augmented Generation (RAG)
Retrieval Augmented Generation (RAG) is a technique that enhances generative models by integrating information retrieval methods. By fetching relevant data from external sources before generating a response, RAG ensures that the output is more accurate and contextually relevant, especially in scenarios requiring up-to-date or specific information.
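A minimal RAG sketch, combining the embeddings and chat examples above; the document store, endpoint, and model names are illustrative placeholders:

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.scaleway.ai/v1", api_key="SCW_API_KEY")

documents = [
    "Our support desk is open Monday to Friday, 9:00 to 17:00.",
    "Invoices are issued on the first day of each month.",
]

def embed(text: str) -> list[float]:
    # Placeholder embedding model name.
    result = client.embeddings.create(model="bge-multilingual-gemma2", input=text)
    return result.data[0].embedding

def top_document(question: str) -> str:
    # Retrieve the document whose embedding is closest to the question's
    # (dot product is used here, assuming normalized embeddings).
    q = embed(question)
    return max(documents, key=lambda d: sum(x * y for x, y in zip(q, embed(d))))

question = "When can I reach support?"
context = top_document(question)

# Augment the prompt with the retrieved context before generating.
response = client.chat.completions.create(
    model="llama-3.1-8b-instruct",  # placeholder model name
    messages=[
        {"role": "system", "content": f"Answer using this context: {context}"},
        {"role": "user", "content": question},
    ],
)
print(response.choices[0].message.content)
```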
Stop words
Stop words are a parameter that tells the model to stop generating further tokens once one of the specified sequences has been produced. This is useful for controlling where the model's output ends, as generation is cut off at the first occurrence of any of these strings.
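For instance, with the OpenAI-compatible `stop` parameter (endpoint and model name are placeholders):

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.scaleway.ai/v1", api_key="SCW_API_KEY")

# Generation halts at the first occurrence of any of the stop sequences.
response = client.chat.completions.create(
    model="llama-3.1-8b-instruct",  # placeholder model name
    messages=[{"role": "user", "content": "List three colors, numbered."}],
    stop=["4."],  # cut the output off before a fourth item appears
)
print(response.choices[0].message.content)
```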
Streaming
Streaming is a parameter that allows responses to be delivered in real time, showing parts of the output as they are generated rather than waiting for the full response. Scaleway follows the server-sent events (SSE) standard. This behavior usually enhances the user experience by providing immediate feedback and a more interactive conversation.
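A minimal streaming sketch (endpoint and model name are placeholders):

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.scaleway.ai/v1", api_key="SCW_API_KEY")

# With stream=True, the server delivers chunks as server-sent events.
stream = client.chat.completions.create(
    model="llama-3.1-8b-instruct",  # placeholder model name
    messages=[{"role": "user", "content": "Tell me a short story."}],
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)  # display text as it arrives
```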
Structured outputs
Structured outputs enable you to format the model's responses to suit specific use cases. To activate structured outputs, provide the `response_format` parameter with `"type": "json_schema"` and define its `"json_schema": {}`.
By customizing the structure, such as using lists, tables, or key-value pairs, you ensure that the data returned is in a form that is easy to extract and process.
By specifying the expected response format through the API, you can make the model consistently deliver the output your system requires.
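A sketch of a structured-output request; the schema, endpoint, and model name below are illustrative placeholders:

```python
import json

from openai import OpenAI

client = OpenAI(base_url="https://api.scaleway.ai/v1", api_key="SCW_API_KEY")

# A hypothetical schema that forces the reply into a fixed, machine-readable shape.
schema = {
    "name": "city_info",
    "schema": {
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "population": {"type": "integer"},
        },
        "required": ["city", "population"],
    },
}

response = client.chat.completions.create(
    model="llama-3.1-8b-instruct",  # placeholder model name
    messages=[{"role": "user", "content": "Give basic facts about Paris."}],
    response_format={"type": "json_schema", "json_schema": schema},
)
print(json.loads(response.choices[0].message.content))
```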
Temperature
Temperature is a parameter that controls the randomness of the model’s output during text generation. A higher temperature produces more creative and diverse outputs, while a lower temperature makes the model’s responses more deterministic and focused. Adjusting the temperature allows users to balance creativity with coherence in the generated text.
Time to First Token (TTFT)
Time to First Token (TTFT) measures the time elapsed from the moment a request is made to the point when the first token of the generated text is returned. TTFT is a crucial performance metric for evaluating the responsiveness of generative models, especially in interactive applications where users expect immediate feedback.
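Both TTFT and ITL can be estimated from a streamed response. In this sketch, each received chunk is treated as one token, which is an approximation; the endpoint and model name are placeholders:

```python
import time

from openai import OpenAI

client = OpenAI(base_url="https://api.scaleway.ai/v1", api_key="SCW_API_KEY")

start = time.monotonic()
stream = client.chat.completions.create(
    model="llama-3.1-8b-instruct",  # placeholder model name
    messages=[{"role": "user", "content": "Explain photosynthesis briefly."}],
    stream=True,
)

# Record an arrival time for every content-bearing chunk.
timestamps = []
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        timestamps.append(time.monotonic())

ttft = timestamps[0] - start
itl = (timestamps[-1] - timestamps[0]) / max(len(timestamps) - 1, 1)
print(f"TTFT: {ttft * 1000:.0f} ms, average ITL: {itl * 1000:.0f} ms")
```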
Tokens
Tokens are the basic units of text that a generative model processes. Depending on the tokenization strategy, these can be words, subwords, or even characters. The number of tokens directly affects the context window size and the computational cost of using the model. Understanding token usage is essential for optimizing API requests and managing costs effectively.
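To monitor token consumption, the OpenAI-compatible `usage` field can be read from each response (a sketch; endpoint and model name are placeholders):

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.scaleway.ai/v1", api_key="SCW_API_KEY")

response = client.chat.completions.create(
    model="llama-3.1-8b-instruct",  # placeholder model name
    messages=[{"role": "user", "content": "Define tokenization in one sentence."}],
)

# The usage field reports how many tokens the request consumed.
usage = response.usage
print(usage.prompt_tokens, usage.completion_tokens, usage.total_tokens)
```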