Processing images and getting structured outputs with Pixtral vision model
- AI
- vision-model
- image-processing
- Pixtral
- Mistral
- structured-data
In today’s data-driven world, the ability to extract structured information from visual content is becoming increasingly valuable across various industries. From analyzing medical images to interpreting financial charts, from processing historical documents to cataloging diverse product lines, the applications are vast and varied.
Pixtral, developed by Mistral AI, is a powerful vision model capable of understanding and describing images with remarkable accuracy. By leveraging Pixtral, we can automate the process of extracting structured information from a wide range of visual inputs, including photographs, scanned documents, graphs, and more.
This tutorial will guide you through the process of using the Pixtral vision model to analyze images and automatically generate structured outputs. We’ll use Python to interact with the model and structure our data, making it easy to integrate this solution into your existing workflows. While we’ll use a product catalog as an example, the techniques demonstrated here can be adapted to various use cases across different industries.
Before you start
To complete the actions presented below, you must have:
- A Scaleway account logged into the console
- A Python environment (version 3.8 or higher)
- An API key from Scaleway Identity and Access Management
- Access to a Scaleway Managed Inference endpoint with Pixtral deployed, or to the Scaleway Generative APIs service
- The `openai` and `pydantic` Python libraries installed
Setting up the environment
Before we dive into using Pixtral, let’s set up our Python environment and install the necessary libraries.
- Create a new directory for your project:

  ```bash
  mkdir pixtral-image-processor
  cd pixtral-image-processor
  ```

- Create a virtual environment and activate it:

  ```bash
  python3 -m venv venv
  source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
  ```

- Install the required libraries:

  ```bash
  pip install openai pydantic
  ```
Defining the data model
We’ll start by defining our data model using `pydantic`. This will ensure that our structured output has a consistent format and that all required fields are present.
Create a new file called `models.py` and add the following code:
```python
from pydantic import BaseModel, Field
from typing import List, Optional
from datetime import date


class Dimension(BaseModel):
    length: float
    width: float
    height: float
    unit: str = Field(..., pattern="(cm|in|mm)")


class Price(BaseModel):
    amount: float
    currency: str = Field(..., pattern="[A-Z]{3}")


class Review(BaseModel):
    user_id: str
    rating: int = Field(..., ge=1, le=5)
    comment: Optional[str] = None
    date: date


class Product(BaseModel):
    id: str
    name: str
    description: str
    category: str
    subcategory: Optional[str] = None
    brand: str
    sku: str
    price: Price
    dimensions: Dimension
    weight: float
    weight_unit: str = Field(..., pattern="(kg|g|lb|oz)")
    in_stock: bool
    available_colors: List[str] = Field(..., min_items=1)
    features: List[str] = Field(..., min_items=1)
    image_urls: List[str] = Field(..., min_items=1)
    reviews: List[Review] = Field(default_factory=list)
    average_rating: Optional[float] = None


class ProductCatalog(BaseModel):
    products: List[Product]
    total_products: int
    last_updated: date
```
This model defines the structure for our product catalog, which we’ll use as an example of structured output from image processing.
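To see this validation in action, here is a quick, optional check you can run from the project directory (the values below are purely illustrative): well-formed data is accepted, while out-of-range values raise a `ValidationError`.

```python
from datetime import date

from pydantic import ValidationError

from models import Price, Review

# A well-formed Price passes validation.
price = Price(amount=49.99, currency="EUR")
print(price.model_dump())

# A rating outside the allowed 1-5 range is rejected by the model.
try:
    Review(user_id="demo-user", rating=7, comment="Rating out of range", date=date.today())
except ValidationError as e:
    print(f"Validation failed as expected: {e}")
```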
Setting up the Pixtral client
Next, we’ll set up the client to interact with the Pixtral model. Create a new file called `pixtral_client.py` and add the following code:
```python
from openai import OpenAI
import os

MODEL = "pixtral-12b-2409"
API_KEY = os.environ.get("SCALEWAY_API_KEY")
BASE_URL = os.environ.get("SCALEWAY_INFERENCE_ENDPOINT_URL")  # use https://api.scaleway.ai/v1 for Scaleway Generative APIs

client = OpenAI(
    base_url=BASE_URL,
    api_key=API_KEY,
)


def get_pixtral_client():
    return client
```
Make sure to set the `SCALEWAY_API_KEY` and `SCALEWAY_INFERENCE_ENDPOINT_URL` environment variables with your actual API key from Scaleway IAM and the appropriate endpoint URL for Scaleway Managed Inference or the Generative APIs service.
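Before wiring up image processing, you can optionally confirm that your key and endpoint work with a plain text request. This is a minimal sketch, not part of the tutorial files; it assumes the environment variables above are set and that your deployment answers standard chat completions.

```python
from pixtral_client import MODEL, get_pixtral_client

client = get_pixtral_client()

# A simple text-only request to verify the endpoint URL and API key.
response = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": "Reply with the single word: OK"}],
)
print(response.choices[0].message.content)
```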
Creating the image processor
Now, let’s create the main script that will use Pixtral to analyze images and generate our structured output. Create a file called `process_images.py` and add the following code:
```python
import json
from datetime import date

from models import ProductCatalog
from pixtral_client import MODEL, get_pixtral_client


def process_images(image_urls):
    client = get_pixtral_client()

    prompt = """Extract detailed information from the provided images. Create an entry for each image.
Include the following details for each item:
- A descriptive name and detailed description
- Appropriate category and subcategory
- Realistic dimensions, weight, and pricing (if applicable)
- At least 3 key features or characteristics
- Any visible attributes (e.g., colors, materials)
- Generate 2 hypothetical reviews or interpretations
Ensure all information is consistent with what can be seen or reasonably inferred from the images."""

    messages = [
        {
            "role": "system",
            "content": "You are a helpful assistant. Only reply in JSON.",
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                *[{"type": "image_url", "image_url": {"url": url}} for url in image_urls],
            ],
        },
    ]

    try:
        response = client.chat.completions.create(
            messages=messages,
            model=MODEL,
            response_format={
                "type": "json_schema",
                "json_schema": {
                    "strict": True,
                    "name": "Processed Image Data",
                    "schema": ProductCatalog.model_json_schema(),
                },
            },
        )
        # The structured answer is returned as a JSON string in the message content.
        processed_data = json.loads(response.choices[0].message.content)
        structured_output = ProductCatalog(**processed_data)
        structured_output.last_updated = date.today()
        structured_output.total_products = len(structured_output.products)
        return structured_output
    except Exception as e:
        print(f"Error processing images: {e}")
        return None


def save_output_to_json(output, filename):
    with open(filename, 'w') as f:
        json.dump(output.model_dump(), f, indent=2, default=str)


if __name__ == "__main__":
    image_urls = [
        "https://picsum.photos/id/26/800/600",  # Sample image 1
        "https://picsum.photos/id/3/800/600",   # Sample image 2
    ]

    processed_output = process_images(image_urls)
    if processed_output:
        save_output_to_json(processed_output, "processed_image_data.json")
        print(f"Image processing complete. Structured data for {processed_output.total_products} items generated.")
    else:
        print("Failed to process images.")
```
This script does the following:
- Imports the necessary modules and models.
- Defines a function to process images using the Pixtral model.
- Creates a prompt that instructs the model on how to analyze the images and what information to extract.
- Sends the images and prompt to the Pixtral model and receives the generated structured data.
- Validates the received data against our `pydantic` models.
- Saves the generated structured output to a JSON file.
Running the image processor
To use the image processor:
- Set the environment variables for your Scaleway API key and inference endpoint URL:

  ```bash
  export SCALEWAY_API_KEY="your_api_key_here"
  export SCALEWAY_INFERENCE_ENDPOINT_URL="your_endpoint_url_here"
  ```

- Run the script:

  ```bash
  python process_images.py
  ```
The script will process the sample images and generate a `processed_image_data.json` file containing the extracted structured information.
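If you want to reload the generated file later, for example in a downstream service, you can re-validate it against the same `pydantic` model. A minimal sketch, assuming the file was produced by the script above:

```python
import json

from models import ProductCatalog

# Read the JSON written by process_images.py and validate it against the schema.
with open("processed_image_data.json") as f:
    catalog = ProductCatalog.model_validate(json.load(f))

print(f"Loaded {catalog.total_products} products, last updated {catalog.last_updated}")
```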
Customizing the image processor
You can easily customize the image processor for your specific needs:
- Modify the `prompt` in `process_images.py` to extract different or additional information from the images.
- Update the `models.py` file to change the structure of your output data to fit your specific use case.
- Add error handling and logging to make the script more robust for production use.
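For example, if your images are local files rather than public URLs, a common adaptation is to send them as base64-encoded data URLs. The helper below is a hypothetical sketch rather than part of the original script; it assumes JPEG input and that your endpoint accepts data URLs in the `image_url` field, as OpenAI-compatible APIs generally do.

```python
import base64


def local_image_to_data_url(path, mime_type="image/jpeg"):
    """Encode a local image file as a data URL usable in the image_url field."""
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


# Example: mix a (hypothetical) local file and a remote URL in the same call.
image_urls = [
    local_image_to_data_url("catalog/chair.jpg"),
    "https://picsum.photos/id/26/800/600",
]
```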
Conclusion
In this tutorial, we’ve explored how to leverage Mistral’s Pixtral vision model to process images and generate structured outputs following a strict and complex JSON schema. This approach can be applied to a wide range of industries and use cases, from cataloging products to analyzing medical images, from interpreting financial charts to processing historical documents.
By combining the power of AI vision models with structured data validation, we’ve created a flexible and extensible solution that can be adapted to various image processing needs. As always, verify AI-generated information for accuracy before using it in critical applications or decision-making processes.