LLM Inference - Concepts

Allowed IPs

Allowed IPs are single IP addresses or IP blocks that are permitted to access a deployment remotely. They let you define which hosts and networks can connect to your LLM Inference endpoints. You can add, edit, or delete allowed IPs. If no allowed IPs are defined, all IP addresses are allowed by default.

Access control is handled directly at the network level by Load Balancers, making the filtering more efficient and universal and relieving the LLM Inference server from this task.
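
To illustrate the matching logic only (the actual filtering is done by the Load Balancers, not by client code), here is a minimal sketch using Python’s standard ipaddress module. The function name is_allowed and the sample addresses are hypothetical.

```python
import ipaddress


def is_allowed(client_ip: str, allowed: list[str]) -> bool:
    """Return True if client_ip matches any allowed IP or CIDR block.

    An empty allow list accepts every address, mirroring the default
    behaviour described above.
    """
    if not allowed:
        return True
    address = ipaddress.ip_address(client_ip)
    # strict=False tolerates entries with host bits set, e.g. "192.0.2.10/24".
    return any(
        address in ipaddress.ip_network(entry, strict=False)
        for entry in allowed
    )


# Hypothetical allow list: one block and one single address.
rules = ["192.0.2.0/24", "203.0.113.7"]
print(is_allowed("192.0.2.55", rules))
print(is_allowed("198.51.100.4", rules))
```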

Context size

The context size refers to the length of the input text, typically measured in tokens, that a Large Language Model (LLM) uses to generate predictions or responses. It determines how much of the given prompt or query the model can take into account.
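
As a rough sketch of what a size limit implies, the snippet below keeps only the most recent tokens that fit. It assumes whitespace-separated words as a stand-in for tokens; real models use their own subword tokenizers, so actual counts will differ. The function name truncate_to_context is hypothetical.

```python
def truncate_to_context(tokens: list[str], context_size: int) -> list[str]:
    """Keep only the most recent tokens that fit in context_size."""
    if context_size <= 0:
        return []
    return tokens[-context_size:]


# Whitespace splitting is only an approximation of real tokenization.
tokens = "the quick brown fox jumps over the lazy dog".split()
print(truncate_to_context(tokens, 4))
```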

Deployment

A deployment makes a trained language model available for real-world applications. It encompasses tasks such as integrating the model into existing systems, optimizing its performance, and ensuring scalability and reliability.

Embedding models

Embedding models convert textual data into numerical vectors, a form of representation learning. These vectors capture semantic information about the text and are often used as input to downstream machine-learning models or algorithms.
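
One common way to compare two embedding vectors is cosine similarity. The sketch below uses made-up three-dimensional vectors for illustration; real embedding models return vectors with hundreds or thousands of dimensions, and the values here are not the output of any actual model.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Similarity is undefined for a zero vector; treat as 0.
    return dot / (norm_a * norm_b)


# Invented vectors standing in for the embeddings of three words.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

# Semantically close texts should score higher than unrelated ones.
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))
```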

Endpoint

In the context of LLMs, an endpoint refers to a network-accessible URL or interface through which clients can interact with the model for inference tasks. It exposes methods for sending input data and receiving model predictions or responses.
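
The sketch below shows what preparing a request for such an endpoint might look like with Python’s standard library. The URL, the bearer-token header, and the JSON field name are placeholders chosen for illustration, not the documented request schema of any particular service. The request is only built here, not sent.

```python
import json
import urllib.request


def build_request(endpoint_url: str, api_key: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying a prompt as JSON."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")  # placeholder schema
    return urllib.request.Request(
        endpoint_url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        },
        method="POST",
    )


# Hypothetical endpoint URL and key.
request = build_request("https://example.invalid/v1/generate", "my-secret-key", "Hello")
print(request.get_method(), request.full_url)
```

Sending the request would be a matter of passing it to urllib.request.urlopen once a real endpoint URL and credentials are available.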

Fine-tuning

Fine-tuning involves further training a pre-trained language model on domain-specific or task-specific data to improve performance on a particular task. This process often includes updating the model’s parameters using a smaller, task-specific dataset.

Few-shot prompting

Few-shot prompting guides a language model by including just a handful of examples in the prompt itself. It relies on the model’s ability to generalize from these few in-context examples, without any additional training, to produce coherent and contextually relevant outputs.
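
A few-shot prompt can be assembled as plain text. The sketch below is one possible layout; the "Input:" and "Output:" labels and the function name build_few_shot_prompt are assumptions for illustration, and models may respond better to other formats.

```python
def build_few_shot_prompt(
    examples: list[tuple[str, str]], query: str, instruction: str = ""
) -> str:
    """Join an optional instruction, worked examples, and the new query."""
    parts = [instruction] if instruction else []
    for text, answer in examples:
        parts.append(f"Input: {text}\nOutput: {answer}")
    # The final block is left open so the model completes the answer.
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)


examples = [
    ("The delivery was fast and the staff were friendly.", "positive"),
    ("The package arrived broken.", "negative"),
]
print(build_few_shot_prompt(examples, "Great value for the price.",
                            instruction="Classify the sentiment."))
```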

Hallucinations

Hallucinations in LLMs refer to instances where a generative AI model produces responses that, while grammatically coherent, contain inaccuracies or nonsensical information. These inaccuracies are termed “hallucinations” because the model creates false or misleading content. Hallucinations can occur because of constraints in the training data, biases embedded within the models, or the complex nature of language itself.

Inference

Inference is the process of deriving logical conclusions or predictions from available data. This concept involves using statistical methods, machine learning algorithms, and reasoning techniques to make decisions or draw insights based on observed patterns or evidence. Inference is fundamental in various AI applications, including natural language processing, image recognition, and autonomous systems.

Large Language Model Applications

LLM Applications are applications or software tools that leverage the capabilities of LLMs for various tasks, such as text generation, summarization, or translation. These apps provide user-friendly interfaces for interacting with the models and accessing their functionalities.

Large Language Models

LLMs are advanced artificial intelligence systems capable of understanding and generating human-like text on various topics. These models, such as Llama-2, are trained on vast amounts of data to learn the patterns and structures of language, enabling them to generate coherent and contextually relevant responses to queries or prompts. LLMs have applications in natural language processing, text generation, translation, and other tasks requiring sophisticated language understanding and production.

Prompt

In the context of LLMs, a prompt refers to the input provided to the model to generate a desired response. It typically consists of a sentence, paragraph, or series of keywords or instructions that guide the model in producing text relevant to the given context or task. The quality and specificity of the prompt greatly influence the generated output, as the model uses it to understand the user’s intent and create responses accordingly.

Quantization

Quantization is a technique used to reduce the precision of numerical values in a model’s parameters or activations to improve efficiency and reduce memory footprint during inference. It involves representing floating-point values with fewer bits while minimizing the loss of accuracy. LLMs provided for deployment are named with suffixes that denote their quantization levels, such as :int8, :fp8, and :fp16.
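
To make the idea concrete, here is a minimal sketch of symmetric 8-bit integer quantization of a list of weights. This is one common scheme shown for illustration; it is not necessarily the method used to produce the :int8 models, and the function names are hypothetical.

```python
def quantize_int8(values: list[float]) -> tuple[list[int], float]:
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate float values from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Each restored value differs from the original by at most about half a step.
print(q, restored)
```

Each weight now needs 8 bits instead of the 16 or 32 bits of a floating-point value, at the cost of a small rounding error bounded by roughly half the scale.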

Retrieval Augmented Generation (RAG)

RAG is an architecture that combines information retrieval with language generation to enhance the capabilities of LLMs. It involves retrieving relevant context or knowledge from external sources and incorporating it into the generation process to produce more informative and contextually grounded outputs.
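
The retrieve-then-generate flow can be sketched in a few lines. The example below ranks documents by simple word overlap with the query, a deliberately naive stand-in for the embedding-based search that real RAG systems typically use, and then places the best match in the prompt. The function names and the prompt layout are assumptions for illustration.

```python
def retrieve(query: str, documents: list[str], top_k: int = 1) -> list[str]:
    """Return the top_k documents sharing the most words with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def build_rag_prompt(query: str, documents: list[str], top_k: int = 1) -> str:
    """Prepend the retrieved context to the question before generation."""
    context = "\n".join(retrieve(query, documents, top_k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


docs = [
    "Allowed IPs restrict which hosts can reach an endpoint.",
    "Quantization reduces the precision of model weights.",
]
print(build_rag_prompt("What does quantization reduce?", docs))
```

The resulting prompt would then be sent to the LLM, which can ground its answer in the retrieved context.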