Multimodal RAG Implementation Guide: Image and Chart Search Mechanisms with Python Code

As engineers building RAG (Retrieval-Augmented Generation) systems, we are constantly fighting the risk of overlooking information. When working with PDFs and technical documents in particular, text alone cannot capture a document’s full meaning: the line graph that occupies half a page, the complex system configuration diagram, the screenshot. In traditional text-based RAG systems, these elements have been treated as mere noise, or as “gaps” lost to OCR.

However, with the evolution of LLMs (Large Language Models) and image-understanding models, the situation has changed dramatically. “Multimodal RAG,” which maps text and images into the same “semantic space” and searches across entire documents, has become a realistic solution. In this article, we go beyond introducing the concept: we examine the internal workings of a real system, walk through concrete implementation code, and look at application cases in business settings.

What’s Missing with Text Alone

In traditional RAG architectures, images in documents were either converted to text via OCR or ignored. OCR has limitations, though. Consider a bar chart in which the blue bars are twice as tall as the red bars. OCR can extract fragmentary text such as “blue,” “red,” “100,” and “200,” but the visual relationship “blue is twice red” is difficult to reconstruct from that text alone.

This is where technologies like CLIP (Contrastive Language-Image Pre-training) come into play. CLIP is a model trained to place images and text in the same high-dimensional vector space, so the vector for the text “a photo of a cat” and the vector for an actual cat photo end up close together.

Multimodal RAG uses this mechanism to vectorize each image in documents and store them in a vector database. When users ask questions like “Find graphs with declining sales trends,” the query text is vectorized and similar images (graphs with declining trends) are searched in the image vector space. This makes it possible to retrieve information that would never have been hit by text search.
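This query-to-image matching reduces to a nearest-neighbor search by cosine similarity in the shared embedding space. The following is a minimal sketch of that idea; the 3-dimensional vectors and file names are made up for illustration (real CLIP embeddings are 512-dimensional and come from the encoder):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: in a real system these come from a CLIP encoder.
query_vec = [0.9, 0.1, 0.0]  # text query: "graphs with declining sales trends"
image_vecs = {
    "declining_sales_chart.png": [0.8, 0.2, 0.1],
    "org_chart.png":             [0.0, 0.9, 0.4],
}

# Rank stored image vectors against the text query vector.
ranked = sorted(
    image_vecs.items(),
    key=lambda kv: cosine_similarity(query_vec, kv[1]),
    reverse=True,
)
print(ranked[0][0])  # → declining_sales_chart.png
```

Because text and images live in one space, the same similarity function compares a text query against image vectors directly; no keyword ever has to appear.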

Technical Explanation: Internal Operation of Multimodal RAG

The multimodal RAG pipeline is built by adding an “image processing path” to a traditional text RAG pipeline. The following diagram shows its typical data flow:

graph TD
    A[Input Document PDF] --> B{Parser}
    B --> C[Text Chunks]
    B --> D[Image Extraction]
    C --> E[Text Encoder]
    D --> F[Image Encoder CLIP etc.]
    E --> G[Vector DB Text Collection]
    F --> G
    H[User Query] --> I[Query Encoder]
    I --> J[Hybrid Search]
    G --> J
    J --> K[Search Results Text+Images]
    K --> L[LLM Answer Generation]
    L --> M[Final Answer + Image References]

The key to this architecture is “integrated search of text and images”. Rather than simply having separate indexes for text and images, they are stored in the same vector database or collections linked by metadata, allowing the LLM to integrate search results and generate answers.

When I actually design systems, I emphasize “metadata management”. Without maintaining context information about which page and section an image belonged to, the LLM cannot accurately determine “what this image represents.” Therefore, when vectorizing, adding the text before and after images (captions and surrounding explanations) as metadata is an essential technique for improving accuracy.
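As a sketch of this technique, the record stored next to each image vector might look like the following. The field names (`caption`, `context`, etc.) are my own convention, not from any particular library:

```python
def build_image_metadata(
    doc_id: str,
    page: int,
    source: str,
    caption: str = "",
    text_before: str = "",
    text_after: str = "",
) -> dict:
    """Bundle an image's surrounding context into the metadata stored
    alongside its vector, so the LLM can tell what the image represents."""
    return {
        "type": "image",
        "doc_id": doc_id,
        "page": page,
        "source": source,
        # Context fields: the caption plus the text immediately around the image.
        "caption": caption,
        "context": " ".join(filter(None, [text_before, caption, text_after])),
    }

meta = build_image_metadata(
    doc_id="doc_001",
    page=4,
    source="sample_report.pdf",
    caption="Figure 2: Quarterly sales by region",
    text_before="Sales declined for the third consecutive quarter.",
)
print(meta["context"])
```

When the LLM later receives this record as part of the search results, the `context` string tells it what the image depicts even before it sees the pixels.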


Implementation Example: Multimodal Search System with Python

Let’s look at a specific implementation. Here, we’ll build a flow using Python to extract images from PDFs, vectorize them using the CLIP model, and perform searches.

This code is a prototype for verification purposes, but it includes error handling and logging, enough structure for engineers to grasp the basics.

Prerequisites:

  • Python 3.9+
  • Required libraries: langchain, chromadb, sentence-transformers, pdf2image, pillow
  • pdf2image additionally requires the poppler utilities to be installed on the system

import logging
import os
import shutil
import tempfile
from pathlib import Path
from typing import List, Optional, Tuple

import chromadb
from pdf2image import convert_from_path
from PIL import Image
from sentence_transformers import SentenceTransformer

# Logging configuration
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s"
)
logger = logging.getLogger(__name__)

class MultimodalRAG:
    def __init__(self, collection_name: str = "multimodal_docs"):
        """
        Initialize the multimodal RAG system.
        Set up the CLIP model and ChromaDB.
        """
        try:
            logger.info("Initializing model and database...")
            # Load a CLIP model that can encode both images and text.
            # For multilingual queries, consider 'clip-ViT-B-32-multilingual-v1'
            # (text side only; images still go through 'clip-ViT-B-32').
            self.model = SentenceTransformer('clip-ViT-B-32')
            logger.info("Model loading completed")
            
            # Set up ChromaDB (in-memory; use chromadb.PersistentClient(path=...)
            # to persist the index across runs)
            self.chroma_client = chromadb.Client()
            self.collection = self.chroma_client.get_or_create_collection(
                name=collection_name,
                metadata={"hnsw:space": "cosine"}
            )
            logger.info("Database connection completed")
            
        except Exception as e:
            logger.error(f"Error during initialization: {e}")
            raise

    def _extract_images_from_pdf(self, pdf_path: str) -> List[Tuple[Image.Image, int]]:
        """
        Helper method to extract images from PDF.
        Convert PDF to images page by page using pdf2image.
        *Ideally, use layout analysis library (Unstructured, etc.) to extract only chart parts,
         but here we simply treat entire pages as images.
        """
        images = []
        try:
            logger.info(f"Extracting images from PDF: {pdf_path}")
            # Convert PDF to image list
            pil_images = convert_from_path(pdf_path)
            
            for i, img in enumerate(pil_images):
                images.append((img, i + 1)) # (image object, page number)
            
            logger.info(f"Extracted {len(images)} page(s) of images")
            return images
        except Exception as e:
            logger.error(f"Image extraction error: {e}")
            return []

    def index_document(self, pdf_path: str, doc_id: str):
        """
        Index a document.
        Extract text and images, vectorize each, and save to DB.
        """
        if not os.path.exists(pdf_path):
            logger.error(f"File not found: {pdf_path}")
            return

        try:
            # 1. Image extraction and embedding
            images = self._extract_images_from_pdf(pdf_path)
            
            image_embeddings = []
            image_ids = []
            image_metadatas = []
            
            for img, page_num in images:
                # Encode the PIL image directly.
                # In production, upload the image to object storage (e.g. S3)
                # and record its URI in the metadata below so that answers
                # can link back to the original image.
                emb = self.model.encode(img)
                image_embeddings.append(emb.tolist())
                image_ids.append(f"{doc_id}_img_page_{page_num}")
                image_metadatas.append({
                    "type": "image",
                    "source": pdf_path,
                    "page": page_num,
                    "doc_id": doc_id
                })

            # 2. Text extraction and embedding (dummy text for simplicity)
            # In a real system, extract page text with pypdf, PyMuPDF, etc.
            text_chunks = [
                f"This is a text summary from page {i+1} of {pdf_path}."
                for i in range(len(images))
            ]
            
            text_embeddings = self.model.encode(text_chunks)
            text_ids = [f"{doc_id}_text_page_{i+1}" for i in range(len(text_chunks))]
            text_metadatas = [
                {"type": "text", "source": pdf_path, "page": i+1, "doc_id": doc_id}
                for i in range(len(text_chunks))
            ]

            # 3. Add to database
            if image_embeddings:
                self.collection.add(
                    ids=image_ids,
                    embeddings=image_embeddings,
                    metadatas=image_metadatas
                )
            
            # Note: text_embeddings is a NumPy array, whose truth value is
            # ambiguous, so test the chunk list instead of the array itself.
            if len(text_chunks) > 0:
                self.collection.add(
                    ids=text_ids,
                    embeddings=text_embeddings.tolist(),
                    metadatas=text_metadatas
                )
            
            logger.info(f"Document {doc_id} indexing completed")

        except Exception as e:
            logger.error(f"Unexpected error during indexing: {e}")
            raise

    def search(self, query: str, n_results: int = 3) -> dict:
        """
        Search for text and images related to the query.
        """
        try:
            logger.info(f"Executing search query: {query}")
            query_embedding = self.model.encode(query).tolist()
            
            results = self.collection.query(
                query_embeddings=[query_embedding],
                n_results=n_results
            )
            
            return results
        except Exception as e:
            logger.error(f"Error during search: {e}")
            return {}

# Execution example
if __name__ == "__main__":
    # Initialization
    rag = MultimodalRAG()
    
    # Dummy PDF file path (specify existing file in practice)
    # Here, assume non-existent path for error handling demonstration
    dummy_pdf_path = "sample_report.pdf"
    
    # Fallback processing if PDF doesn't exist (for demo)
    if not os.path.exists(dummy_pdf_path):
        logger.warning("Sample PDF not found. Skipping processing.")
        # Normally, create sample data here
    else:
        # Register document
        rag.index_document(dummy_pdf_path, doc_id="doc_001")
        
        # Execute search
        query = "Which pages have upward-trending graphs?"
        search_results = rag.search(query)
        
        logger.info(f"Search results: {search_results}")

The key point of this code is the use of SentenceTransformer('clip-ViT-B-32'). This model can encode both text and images, eliminating the need for separate models and simplifying implementation.

Regarding error handling, the code checks for file existence and catches exceptions during image processing to prevent unexpected system crashes. In production environments, more advanced layout analysis libraries (e.g., unstructured or marker) can be used instead of pdf2image to accurately extract only chart portions for vectorization, further improving search accuracy.

Business Use Case: Report Search for Securities Analysts

Consider a concrete business scenario: a securities analyst at a brokerage firm who reads through earnings materials (PDFs) for hundreds of companies every day. These materials contain numerous line graphs showing sales trends and pie charts showing market share.

With traditional keyword search, they could search for the word “decrease” but couldn’t directly find downward-sloping graphs. However, by introducing multimodal RAG, analysts can now submit natural language queries like “Extract graphs of companies with declining profit margins in recent years.”

The system vectorizes the query intent (“declining,” “profit margins”) and rapidly compares it with tens of thousands of graph images in the database. As a result, subtle trends not explicitly stated in text and visual insights shown only in charts can be discovered much faster than human visual inspection. This promises clear ROI (return on investment) through reduced research time and improved analysis comprehensiveness.
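Under the hood, this “rapid comparison” is a top-k nearest-neighbor search. Vector databases accelerate it with approximate indexes such as HNSW; the brute-force sketch below shows the same idea in plain Python, with invented 2-dimensional vectors and file names standing in for real embeddings:

```python
import heapq
import math

def top_k(query: list, vectors: dict, k: int = 3) -> list:
    """Return the k labels whose vectors have the highest cosine
    similarity to the query (brute force; real systems use ANN indexes)."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.hypot(*a) * math.hypot(*b))
    return heapq.nlargest(k, vectors, key=lambda label: cos(query, vectors[label]))

# Hypothetical chart embeddings indexed from earnings reports.
charts = {
    "company_A_profit.png": [0.95, 0.05],  # downward-trending profit chart
    "company_B_share.png":  [0.10, 0.90],  # market-share pie chart
    "company_C_profit.png": [0.85, 0.20],  # downward-trending profit chart
}
query = [1.0, 0.0]  # hypothetical embedding of "declining profit margins"
print(top_k(query, charts, k=2))  # → the two profit charts, A then C
```

Swapping the brute-force scan for an HNSW index (as ChromaDB does internally) keeps this query fast even over tens of thousands of chart images.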


Frequently Asked Questions

Q: What is the biggest difference between multimodal RAG and traditional RAG?
A: While traditional RAG only handles text information, multimodal RAG can search and understand visual information such as images, charts, and graphs in the same vector space as text. This enables question answering for visual elements within documents.

Q: Which part of the implementation requires the most computational cost?
A: The image vectorization (embedding) process. Especially when processing high-resolution images or PDFs with many pages, GPU resources are often required, and processing time tends to be longer compared to text-only processing.
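One practical lever for this cost is batching: encoding images in fixed-size batches keeps memory bounded and keeps the GPU saturated. sentence-transformers exposes this via the batch_size argument of encode(); the chunking itself, library-free, looks like:

```python
from typing import Iterator, List, TypeVar

T = TypeVar("T")

def batched(items: List[T], batch_size: int) -> Iterator[List[T]]:
    """Yield fixed-size batches from a list (the last batch may be smaller)."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

pages = [f"page_{i}.png" for i in range(10)]
print([len(b) for b in batched(pages, batch_size=4)])  # → [4, 4, 2]
```

Each batch would then be passed to the encoder in one call (e.g. model.encode(batch)), trading a little latency per batch for much higher overall throughput.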

Q: In what business scenarios does it demonstrate effectiveness?
A: It demonstrates great effectiveness in document-intensive tasks where text alone is insufficient, such as equipment drawing search in manufacturing, analysis of financial earnings materials (including graphs), and collation of medical reports and images.


Summary

  • Multimodal RAG enables document search including images and charts, utilizing visual information that was often overlooked in traditional text-only RAG.
  • By leveraging models like CLIP to place text and images in a common vector space, semantic cross-search is realized.
  • Implementation requires appropriate error handling and library selection in each step of image extraction, vectorization, and metadata management.
  • It has the potential to dramatically improve operational efficiency in business scenarios where visual information plays an important role, such as securities analysis and manufacturing manual search.

Recommended Resources

  • Book: “Building Vector Search Applications: AI Architecture with Open-Source Tools”
    • Explains everything from vector search basics to applications in detail from an architectural design perspective.
  • Tool: LlamaIndex
    • Rich in modules for building multimodal RAG, from data loading (LlamaParse) to indexing and search.




