Making Images and Charts Searchable: Multimodal RAG Solves the Unstructured Data Challenge

Have you ever struggled to find the specific numbers you need in a sea of PDF documents? I once hit a wall while building a system to analyze securities company reports. While OCR (Optical Character Recognition) could extract the text, the crucial "line graphs showing sales trends" and "pie charts comparing market share" were simply ignored as mere images.

Traditional RAG (Retrieval-Augmented Generation) works by vectorizing and searching text data. However, real-world business documents contain far more than text. For a long time, we have treated these treasure troves of unstructured data as mere collections of pixels, leaving them unsearchable.

This is where Multimodal RAG comes in: technology that can understand visual information as context. By handling not just text but images, charts, and layout information in an integrated way, it marks a turning point that lets AI agents take on more advanced tasks. In this article, we unravel the internal structure of Multimodal RAG for engineers and walk through implementation with working Python code.

The "Blind Spot" That Text-Only RAG Misses

Why is Multimodal RAG needed now? The answer lies in the structural limitations of traditional approaches.

Existing text-based RAG systems essentially perform text extraction when parsing documents like PDFs. However, this process has two major problems.

First is the loss of layout information. For example, the relationship “description text corresponding to the image on the left” can be severed by text conversion. While AI can read image descriptions, it loses clues to identify which graph the description refers to.

Second is the complete absence of information within images. The most important insights in business reports are often condensed in charts. Even without text saying “20% increase year-over-year,” you can read the increase from the height of bar charts. However, text-only RAG processes these graphs as “blanks” or “noise.”
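To make this concrete, consider the arithmetic a bar chart encodes visually. Given bar heights extracted from a chart (the values below are hypothetical), the year-over-year change is trivial to compute, yet a text-only pipeline never sees these numbers at all:

```python
# Hypothetical sales figures read off a bar chart (units: millions).
# A text-only RAG pipeline processes this chart as a blank.
sales_by_year = {"2023": 100.0, "2024": 120.0}

prev, curr = sales_by_year["2023"], sales_by_year["2024"]
yoy_change_pct = (curr - prev) / prev * 100
print(f"Year-over-year change: {yoy_change_pct:+.0f}%")
```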

To solve this, we need to give AI not just the ability to “read” documents but to “see” them. That is Multimodal RAG.

How Multimodal RAG Works and Its Architecture

There are broadly two approaches to implementing Multimodal RAG.

  1. Image Summarization Approach: Extract images from the document, have a Vision LLM (e.g., GPT-4o) generate detailed text descriptions (captions), and vectorize those captions alongside the ordinary text, just as in text-only RAG.
  2. Multimodal Embedding Approach: Use a model that maps text and images into the same latent space (vector space), such as CLIP or a comparable multimodal embedding model, and compute similarity directly between image vectors and text vectors.
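The embedding approach reduces retrieval to nearest-neighbor search in a shared vector space. Here is a minimal sketch with hand-made stand-in vectors (a real system would obtain them from a multimodal encoder such as CLIP, not from hard-coded lists):

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Stand-in image embeddings; in practice these come from a multimodal
# encoder that places text and images in the same vector space.
image_vectors = {
    "sales_trend_chart.png": [0.9, 0.1, 0.2],
    "org_chart.png":         [0.1, 0.8, 0.3],
}
query_vector = [0.85, 0.15, 0.25]  # e.g. the embedding of "declining sales trend"

best = max(image_vectors, key=lambda k: cosine(query_vector, image_vectors[k]))
print(best)
```

Because both modalities live in one space, a text query can rank images directly, with no captioning step in between.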

From a practical perspective in 2026, considering accuracy and controllability, the most robust option is a hybrid configuration: base the system on the Image Summarization Approach and bring in the embedding approach where needed. Converting images to text once lets you leverage the existing ecosystem of powerful text search engines.

The diagram below illustrates a typical Multimodal RAG data flow.

graph TD
    A[Input Document PDF] --> B(Parser LlamaIndex/Unstructured)
    B --> C{Element Separation}
    C -->|Text| D[Text Chunking]
    C -->|Images| E[Image Extraction]
    D --> F[Text Embedding Model]
    E --> G[Vision LLM Image Summarization]
    F --> H[Vector Database]
    G --> I[Summary Text]
    I --> D
    H --> J[Search & Generation LLM]

The important point in this flow is not just storing images but converting them to “semantic information” through Vision LLM and re-injecting them as searchable text. This enables accurate hits when users ask questions like “Find graphs showing declining sales trends,” as words like “decline” and “trend” would be included in image summaries.
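The effect can be illustrated with a toy keyword match over generated captions. The summary strings below are hypothetical stand-ins for Vision LLM output, and the scoring is naive term overlap rather than real vector search:

```python
# Hypothetical captions a Vision LLM might generate for extracted charts.
summaries = {
    "fig1.png": "Line graph showing a declining sales trend from 2022 to 2024.",
    "fig2.png": "Pie chart comparing market share across four competitors.",
}

def score(query: str, text: str) -> int:
    # Naive overlap of query terms; a production system would use
    # embeddings and a vector database instead.
    terms = query.lower().split()
    return sum(term in text.lower() for term in terms)

query = "graphs showing declining sales trends"
hits = sorted(summaries, key=lambda k: score(query, summaries[k]), reverse=True)
print(hits[0])
```

The chart itself contains no searchable words; it is the generated caption that lets the query "declining sales" land on the right figure.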

Python Implementation: Using LlamaIndex and OpenAI

Now let’s move to concrete implementation. Here, we build a system that can search including images within PDFs using Python, LlamaIndex, and OpenAI APIs.

The code is not a throwaway demo: it is structured for practical use, with error handling and logging included.

Prerequisites

Install necessary libraries.

pip install llama-index-core llama-index-readers-file llama-index-llms-openai llama-index-multi-modal-llms-openai llama-index-embeddings-openai python-dotenv

Implementation Code

The following code is a script that reads PDFs from a specified directory, extracts images to create summaries, and builds an index.

import logging
import os
import sys
from typing import List, Optional

from dotenv import load_dotenv
from llama_index.core import (
    Settings,
    SimpleDirectoryReader,
    VectorStoreIndex,
    StorageContext,
)
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.schema import BaseNode, ImageDocument, ImageNode, TextNode
from llama_index.multi_modal_llms.openai import OpenAIMultiModal
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

# Logging configuration
logging.basicConfig(
    stream=sys.stdout,
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s"
)
logger = logging.getLogger(__name__)

# Load environment variables
load_dotenv()

class MultimodalRAGPipeline:
    def __init__(
        self,
        input_dir: str,
        model_name: str = "gpt-4o",
        embed_model_name: str = "text-embedding-3-small",
        persist_dir: str = "./storage"
    ):
        """
        Initialize Multimodal RAG pipeline
        
        Args:
            input_dir (str): Directory path where PDFs are stored
            model_name (str): LLM model name to use
            embed_model_name (str): Embedding model name to use
            persist_dir (str): Directory to save index
        """
        self.input_dir = input_dir
        self.persist_dir = persist_dir
        
        # Check API key
        if not os.getenv("OPENAI_API_KEY"):
            logger.error("OPENAI_API_KEY is not set.")
            raise ValueError("Missing OpenAI API Key")

        # Configure LLM and Embedding models
        try:
            self.llm = OpenAI(model=model_name)
            self.embed_model = OpenAIEmbedding(model=embed_model_name)
            
            # Configure Multi-modal LLM (for image understanding)
            self.multi_modal_llm = OpenAIMultiModal(model=model_name)
            
            Settings.llm = self.llm
            Settings.embed_model = self.embed_model
            
            logger.info(f"Model initialization complete: LLM={model_name}, Embedding={embed_model_name}")
            
        except Exception as e:
            logger.error(f"Error during model initialization: {e}")
            raise

    def load_documents(self) -> List[BaseNode]:
        """
        Load documents and extract images and text
        
        Returns:
            List[BaseNode]: List of extracted nodes
        """
        logger.info(f"Loading documents from directory '{self.input_dir}'...")
        
        try:
            # SimpleDirectoryReader loads image files as ImageDocument
            # objects and PDF pages as text Documents
            reader = SimpleDirectoryReader(
                self.input_dir,
                required_exts=[".pdf", ".jpg", ".png"],
                recursive=True,
            )
            documents = reader.load_data()
            logger.info(f"Document loading successful: {len(documents)} documents")
            
            # Configure node parser (for text)
            text_parser = SentenceSplitter(
                chunk_size=1024,
                chunk_overlap=20
            )
            
            text_nodes = []
            image_nodes = []
            
            for doc in documents:
                if isinstance(doc, (ImageDocument, ImageNode)):
                    image_nodes.append(doc)
                else:
                    # Split text documents into sentence-level chunks
                    text_nodes.extend(text_parser.get_nodes_from_documents([doc]))
            
            logger.info(f"Node splitting complete: Text={len(text_nodes)}, Images={len(image_nodes)}")
            
            return text_nodes + image_nodes

        except FileNotFoundError:
            logger.error(f"Directory not found: {self.input_dir}")
            raise
        except Exception as e:
            logger.error(f"Unexpected error during document loading: {e}")
            raise

    def create_image_summaries(self, image_nodes: List[ImageNode]) -> List[BaseNode]:
        """
        Generate summary text using Vision LLM for image nodes
        
        Args:
            image_nodes (List[ImageNode]): List of image nodes
            
        Returns:
            List[BaseNode]: List of nodes containing summary text
        """
        if not image_nodes:
            logger.info("No image nodes to summarize.")
            return []

        logger.info(f"Starting summary generation for {len(image_nodes)} images...")
        processed_nodes = []

        for img_node in image_nodes:
            try:
                # Get image path
                image_path = img_node.metadata.get("file_path")
                if not image_path or not os.path.exists(image_path):
                    logger.warning(f"Image file not found: {image_path}, skipping.")
                    continue

                # Create prompt
                prompt = """
                Please describe this image in detail. Especially for graphs, extract numerical trends and patterns,
                and for tables, extract key data points and convert to text.
                Provide the description in English, including specific keywords that make it searchable.
                """
                
                # Image understanding and summary generation by Vision LLM
                response = self.multi_modal_llm.complete(
                    prompt=prompt,
                    image_documents=[img_node]
                )
                
                summary_text = response.text
                logger.info(f"Image summary generation successful ({os.path.basename(image_path)}): {summary_text[:50]}...")
                
                # Create a text node with the summary, keeping a reference to
                # the original image (BaseNode is abstract, so use TextNode)
                summary_node = TextNode(
                    text=summary_text,
                    metadata={
                        **img_node.metadata,
                        "is_image_summary": True,
                        "original_image_path": image_path
                    }
                )
                processed_nodes.append(summary_node)

            except Exception as e:
                logger.error(f"Error during image summary generation: {e}")
                continue

        logger.info(f"Image summary generation complete: {len(processed_nodes)} nodes")
        return processed_nodes

    def build_index(self, nodes: List[BaseNode]):
        """
        Build vector index from nodes
        
        Args:
            nodes (List[BaseNode]): List of nodes to index
        """
        logger.info(f"Starting index build from {len(nodes)} nodes...")
        
        try:
            storage_context = StorageContext.from_defaults()
            index = VectorStoreIndex(
                nodes=nodes,
                storage_context=storage_context
            )
            
            # Persist index
            index.storage_context.persist(persist_dir=self.persist_dir)
            logger.info(f"Index build successful. Saved to: {self.persist_dir}")
            
            return index
            
        except Exception as e:
            logger.error(f"Error during index build: {e}")
            raise

    def run(self):
        """
        Execute complete pipeline
        """
        try:
            # 1. Document loading
            nodes = self.load_documents()
            
            # 2. Separate text and image nodes
            text_nodes = [n for n in nodes if not isinstance(n, (ImageDocument, ImageNode))]
            image_nodes = [n for n in nodes if isinstance(n, (ImageDocument, ImageNode))]
            
            # 3. Image summarization
            image_summary_nodes = self.create_image_summaries(image_nodes)
            
            # 4. Combine all nodes
            all_nodes = text_nodes + image_summary_nodes
            
            # 5. Build index
            index = self.build_index(all_nodes)
            
            logger.info("Multimodal RAG pipeline completed successfully!")
            return index
            
        except Exception as e:
            logger.error(f"Pipeline execution failed: {e}")
            raise

# Execution example
if __name__ == "__main__":
    try:
        pipeline = MultimodalRAGPipeline(
            input_dir="./data",  # Directory containing PDFs
            persist_dir="./storage"  # Index save destination
        )
        index = pipeline.run()
        
    except Exception as e:
        logger.error(f"Application error: {e}")
        sys.exit(1)

Business Use Case: Financial Report Analysis Automation

Let’s introduce a concrete business application. Consider automating analysis of financial reports in the securities industry.

Traditional analysis involved manually reading hundreds of pages of reports, extracting important charts and numerical data. However, with Multimodal RAG:

  1. Automated Processing: PDF reports are automatically parsed, with both text and charts converted to searchable data
  2. Intelligent Search: When asking “Show companies with sales growth,” not only text mentions but also trend lines in graphs are understood
  3. Comparative Analysis: Multiple company reports can be cross-referenced to automatically extract comparative information

This dramatically reduces analyst workload while enabling more comprehensive information gathering.

Summary

Multimodal RAG is not just an extension of search technology but a paradigm shift that gives AI the ability to “see” and “understand” documents.

Key takeaways:

  • Beyond Text: most enterprise data is unstructured (figures of around 80% are often cited), and much of it is visual; Multimodal RAG unlocks this value
  • Image Summarization: Converting visual information to searchable text enables practical implementation
  • Business Value: Particularly effective in document-heavy industries like finance, legal, and healthcare
  • Technical Stack: Combination of LlamaIndex, Vision LLM, and vector databases

The era of truly intelligent document processing has arrived. Start implementing Multimodal RAG today.

Frequently Asked Questions

Q: What are the implementation costs for Multimodal RAG?

Costs mainly consist of LLM API usage fees and vector database maintenance. When using high-performance models like GPT-4o for image understanding, token counts tend to increase compared to text-only RAG, making prompt optimization and caching strategies important.
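A back-of-the-envelope estimate makes the cost structure tangible. All prices and token counts below are hypothetical placeholders, not actual OpenAI pricing; substitute your provider's current rates:

```python
# Rough cost model for the image-summarization step.
# All constants are HYPOTHETICAL; check current provider pricing.
PRICE_PER_1K_INPUT_TOKENS = 0.005   # USD, assumed
PRICE_PER_1K_OUTPUT_TOKENS = 0.015  # USD, assumed
TOKENS_PER_IMAGE = 1000             # assumed vision input tokens per image
SUMMARY_OUTPUT_TOKENS = 300         # assumed caption length in tokens

def estimate_cost(num_images: int) -> float:
    # Input cost: image tokens consumed by the Vision LLM
    input_cost = num_images * TOKENS_PER_IMAGE / 1000 * PRICE_PER_1K_INPUT_TOKENS
    # Output cost: generated caption tokens
    output_cost = num_images * SUMMARY_OUTPUT_TOKENS / 1000 * PRICE_PER_1K_OUTPUT_TOKENS
    return input_cost + output_cost

print(f"${estimate_cost(500):.2f} for 500 charts")
```

Note that summaries are generated once at indexing time and stored in the vector database, so under this design the cost is per document, not per query.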

Q: How can chart accuracy be improved?

While ensuring image resolution is important, for complex charts, an effective approach is to use object detection models to split charts into “graph area,” “legend area,” and “title area” before inputting to LLM, rather than processing the entire image at once.

Q: Is it usable in security-critical industries?

Yes. Instead of cloud versions using OpenAI or Anthropic APIs, you can operate within internal networks by hosting open-source models like Llama 3.2 Vision or Qwen2-VL in on-premise environments.

Books & Articles

  • “Building LLM Applications” - Practical guide for LLM application development
  • “Multimodal Machine Learning” - Technical guide for multimodal AI

AI Implementation Support & Development Consultation

Struggling with Multimodal RAG implementation or document processing automation? We offer free individual consultations.

Book a Free Consultation

Our team of experienced engineers provides support from architecture design to implementation.
