TL;DR

Traditional RAG fails on visual data. This project uses Cohere’s multimodal embeddings + Gemini 2.5 Flash to build a RAG system that understands both text and images — enabling accurate answers from charts, tables, and visuals inside PDFs.

📉 The Problem: Traditional RAG's Visual Blindspot

Traditional Retrieval-Augmented Generation (RAG) systems rely on text embeddings to retrieve information from documents. But what if your most valuable insights are hidden in charts, tables, and images?

Whether you're analyzing financial PDFs, investment research reports, or market slides, much of the relevant information lives in visuals:

Numerical breakdowns in pie/bar charts (e.g., portfolio allocations)
Trend visualizations in line graphs (e.g., market performance)
Structured data in complex tables (e.g., comparison matrices)
Process flows in diagrams (e.g., system architectures)
Spatial relationships in maps or layouts

A purely text-based approach fails to capture this crucial layer of information.

💡 The Solution: Multimodal RAG

Multimodal RAG augments traditional RAG by combining text and image understanding. This approach enables:

🔍 Image + Text search from the same document

🧠 Unified vector index with mixed modality support

🤖 Context-aware answers via Gemini using either matched text or matched image
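To make the unified index concrete, here is a minimal sketch in plain Python. It assumes the embeddings for text chunks and page images already live in the same vector space; the `Entry` class, the `search` function, and the toy 3-dimensional vectors are illustrative names and values, not part of the project's code.

```python
import math
from dataclasses import dataclass


@dataclass
class Entry:
    kind: str      # "text" or "image"
    payload: str   # a text chunk, or a reference to a page image
    vector: list   # embedding in the shared vector space


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(entries, query_vector, k=1):
    """Return the k entries most similar to the query, regardless of modality."""
    ranked = sorted(entries, key=lambda e: cosine(e.vector, query_vector), reverse=True)
    return ranked[:k]


# Toy vectors standing in for real embeddings (hypothetical values).
store = [
    Entry("text", "Intro paragraph about ETFs", [0.9, 0.1, 0.0]),
    Entry("image", "page_4.png (bar chart)", [0.1, 0.9, 0.2]),
]
best = search(store, [0.0, 1.0, 0.1])[0]
print(best.kind, best.payload)
```

Because each entry records its modality, the caller can decide after retrieval whether to hand the model a text chunk or a page image.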

🔧 Key Technologies

Cohere's Embed v4.0: Embeds both text and images in the same vector space
Gemini 2.5 Flash: Processes queries with context (text or image) to generate factual, human-like responses
FAISS: Efficiently indexes and searches vectors from both modalities, including support for approximate nearest-neighbor search.

🧭 End-to-End Multimodal RAG Workflow

Below is the high-level system flow for the Multimodal RAG pipeline:

📌 From PDF upload to image/text embedding, vector search, and Gemini-powered answer generation — everything is stitched together using Streamlit, Cohere, FAISS, and Gemini 2.5 Flash.

🎥 Multimodal RAG – Video Demo

Here's a 9-minute visual walkthrough of the system in action:

https://youtu.be/qI3lYZ6-79k?si=ji-zVAxIpQAhYBPC

See it live: Charts being analyzed, tables being interpreted, and complex visuals being understood all in real-time!

Architecture Comparison

🖼️ Multimodal RAG Architecture

In this pipeline, both the text and each page image are embedded using Cohere, stored in FAISS, and served as context to Gemini 2.5 Flash. This allows questions grounded in visuals to be answered — something traditional RAG setups can't handle.

📝 Text-Only RAG Architecture

This approach extracts text from the PDF, embeds it, and uses it for retrieval — but completely misses information embedded inside charts or graphics.

Results: Side-by-Side Comparison

We tested both Text-Only and Multimodal RAG apps on the same ETF PDF document:

The results are clear: Text-only RAG struggled with questions grounded in visual data, while Multimodal RAG handled image-based content effectively.

| ❓ Query | 📄 Text-Only App | 🖼️ Multimodal App |
|---|---|---|
| What did Warren Buffett say about ETF? | ✅ Answered from intro text | ✅ Same |
| What is AUM of Invesco? | ❌ Missed (image) | ✅ Pulled from bar chart |
| How much did BlackRock earn through Technology services? | ❌ Missed (in chart) | ✅ Answered using image block |
| How much Percentage is Apple in S&P? | ❌ Missed (pie chart) | ✅ Extracted % from visual |
| During Covid pandemic what was the top 10 weight in S&P 500? | ❌ Missed (timeline chart) | ✅ Parsed from infographic |
| What was the difference between Dotcom bubble and Covid crash? | ❌ Missed (context lost) | ✅ Interpreted from visual timeline |
| How to track Bitcoin in ETFs? | ❌ Missed (table data) | ✅ Interpreted from tables |

Code Walkthrough: Multimodal Processing

💻 Full Source Code: [GitHub Repository](https://github.com/SridharSampath/multimodal-rag-demo)

1. PDF to Image Conversion

images = pdf2image.convert_from_path(pdf_path, dpi=200)

This gives us a list of page-wise PIL images, which are embedded next.
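The `dpi` setting determines how large each page image is, which in turn affects upload size and embedding cost. As a rough guide, pixel size is page size in inches multiplied by DPI. The sketch below assumes US Letter pages (8.5 × 11 inches), and `page_pixels` is a hypothetical helper, not a pdf2image function.

```python
def page_pixels(width_in, height_in, dpi):
    """Pixel dimensions of a page rendered at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


# Assumed page size: US Letter, 8.5 x 11 inches.
letter_200 = page_pixels(8.5, 11, 200)
letter_100 = page_pixels(8.5, 11, 100)
print(letter_200)  # (1700, 2200)
print(letter_100)  # (850, 1100)
```

Halving the DPI cuts the pixel count to a quarter, so it is worth checking whether small chart labels stay legible before lowering it.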

2. Embedding with Cohere

if content_type == "text":
    response = cohere.embed(input_type="search_document", texts=[text])
else:
    base64_img = convert_image_to_base64(image)
    response = cohere.embed(
        input_type="search_document",
        inputs=[{"content": [{"type": "image", "image": base64_img}]}]
    )

The output is added to FAISS as a float32 vector.
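The snippet above calls `convert_image_to_base64` without showing its body. One plausible shape for such a helper is sketched below using only the standard library; `to_data_uri` is a hypothetical name, and the data-URI format is an assumption, so check the Cohere documentation for the exact string the `image` field expects.

```python
import base64


def to_data_uri(image_bytes, mime_type="image/png"):
    """Wrap raw image bytes in a base64 data URI."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# With Pillow, the bytes would typically come from:
#   buffer = io.BytesIO(); image.save(buffer, format="PNG"); buffer.getvalue()
sample = b"\x89PNG\r\n\x1a\n"  # PNG signature only, not a full image
uri = to_data_uri(sample)
print(uri)
```

Note that hex-encoding raw pixel data (for example `image.tobytes().hex()`) is not equivalent: it drops the image container format and is not base64.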

3. Gemini Answering Logic

if isinstance(content, Image.Image):
    response = gemini.generate_content([query, content])
else:
    response = gemini.generate_content(f"Question: {query}\n\nContext: {content}")

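When the best match is a page image, Gemini 2.5 Flash receives the image itself alongside the query, so it can read charts, titles, and layouts directly. When the match is text, the retrieved chunk is passed as plain context.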
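The branching can be isolated from the model call so it is easy to test. The sketch below is an illustration under assumptions: `build_request`, `answer`, and `fake_generate` are hypothetical names, and text matches are detected with `isinstance(content, str)` rather than the `Image.Image` check used in the project.

```python
def build_request(query, content):
    """Return the list of parts to send to the model for one retrieved match."""
    if isinstance(content, str):
        # Text match: fold the retrieved chunk into a single prompt string.
        return [f"Question: {query}\n\nContext: {content}"]
    # Non-text match (e.g. a PIL image): pass the object alongside the query.
    return [query, content]


def answer(query, content, generate):
    """Build the request and hand it to the supplied generate callable."""
    return generate(build_request(query, content))


# Stand-in generator for demonstration; a real one would call the model.
def fake_generate(parts):
    return f"received {len(parts)} part(s)"


text_reply = answer("What is the AUM?", "AUM is discussed on page 3.", fake_generate)
image_reply = answer("What is the AUM?", object(), fake_generate)
print(text_reply)
print(image_reply)
```

In a real app, `generate` could be a thin wrapper such as `lambda parts: gemini.generate_content(parts).text`.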

🚀 Getting Started – Minimal Example

Here’s a compact script to get you up and running with multimodal RAG using Cohere + Gemini:

⚠️ *Note: This is a minimal gist to demonstrate the core flow. The full working code with UI, modular structure, and search logic is available in the [GitHub repository](https://github.com/SridharSampath/multimodal-rag-demo).*

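import base64
import io

import cohere
import faiss
import google.generativeai as genai
import numpy as np
from pdf2image import convert_from_path

# Initialize APIs (assumes the Cohere v2 client and the google-generativeai package)
co = cohere.ClientV2(api_key="your-cohere-key")
genai.configure(api_key="your-gemini-key")
gemini = genai.GenerativeModel("gemini-2.5-flash")

# Convert each PDF page to a PIL image
def pdf_to_images(pdf_path):
    return convert_from_path(pdf_path, dpi=200)

# Encode a PIL image as a base64 PNG data URI
def image_to_base64(image):
    buffer = io.BytesIO()
    image.save(buffer, format="PNG")
    encoded = base64.b64encode(buffer.getvalue()).decode("utf-8")
    return f"data:image/png;base64,{encoded}"

# Create embeddings; pass input_type="search_query" when embedding a question
def get_embedding(content, content_type="text", input_type="search_document"):
    if content_type == "text":
        response = co.embed(
            model="embed-v4.0",
            input_type=input_type,
            embedding_types=["float"],
            texts=[content],
        )
    else:
        response = co.embed(
            model="embed-v4.0",
            input_type=input_type,
            embedding_types=["float"],
            inputs=[{"content": [{"type": "image", "image": image_to_base64(content)}]}],
        )
    # Float embeddings; the attribute name may differ across SDK versions
    return response.embeddings.float[0]

# Index and query
images = pdf_to_images("your.pdf")
page_vectors = np.array([get_embedding(img, "image") for img in images], dtype=np.float32)
index = faiss.IndexFlatL2(page_vectors.shape[1])  # dimension taken from the embeddings
index.add(page_vectors)

def answer_query(query):
    query_emb = get_embedding(query, input_type="search_query")
    D, I = index.search(np.array([query_emb], dtype=np.float32), k=1)
    result = images[I[0][0]]  # best-matching page image
    return gemini.generate_content([query, result]).text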
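A note on the distance metric: `IndexFlatL2` ranks by squared Euclidean distance. If the vectors are normalized to unit length first, that ranking matches cosine similarity, because ‖a − b‖² = 2 − 2(a · b) for unit vectors. The check below verifies the identity on toy 2-dimensional vectors in plain Python. The helper names are illustrative, and whether the embedding API already returns unit-length vectors is not assumed here.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


a = normalize([3.0, 4.0])
b = normalize([1.0, 0.0])
lhs = squared_l2(a, b)      # what an L2 index would rank by
rhs = 2 - 2 * dot(a, b)     # the same quantity expressed via cosine similarity
print(round(lhs, 6), round(rhs, 6))
```

With NumPy arrays, the equivalent normalization step would be dividing each row by its norm before calling `index.add` and `index.search`.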

⚙️ Project Setup

What You'll Need

🔑 API Keys:
- Cohere embed-v4.0 → Create Cohere Account
- Gemini 2.5 Flash → Try Gemini on Google AI Studio
💻 System Requirements:
- Python 3.8+
- Poppler (for PDF image conversion)

| File | Purpose |
|---|---|
| `app.py` | Streamlit UI for uploading and querying |
| `core/embeddings.py` | Calls Cohere for text/image embeddings |
| `core/document_utils.py` | PDF parsing, image conversion, FAISS indexing |
| `core/search.py` | Embedding-based search + Gemini response |
| `config.py` | API keys and model settings |

# Clone repository
git clone https://github.com/SridharSampath/multimodal-rag-demo
cd multimodal-rag-demo

# Install dependencies
pip install -r requirements.txt

# Run the app
streamlit run app.py
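For the API keys, one common approach is to read them from environment variables rather than hard-coding them. The sketch below shows one way a settings module could do that; the `Settings` class, `load_settings` function, and the variable names `COHERE_API_KEY` and `GEMINI_API_KEY` are assumptions for illustration and may not match the repository's `config.py`.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    cohere_api_key: str
    gemini_api_key: str
    embed_model: str = "embed-v4.0"
    chat_model: str = "gemini-2.5-flash"


def load_settings(env=None):
    """Build Settings from a mapping (defaults to the process environment)."""
    env = os.environ if env is None else env
    required = ("COHERE_API_KEY", "GEMINI_API_KEY")  # hypothetical variable names
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return Settings(env["COHERE_API_KEY"], env["GEMINI_API_KEY"])


# Demo with placeholder values; a real run would call load_settings() with no arguments.
settings = load_settings({"COHERE_API_KEY": "demo", "GEMINI_API_KEY": "demo"})
print(settings.embed_model, settings.chat_model)
```

Failing fast on a missing key gives a clearer error than an authentication failure deep inside the first API call.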

⚠️ System Dependency: Poppler

This project uses pdf2image to convert PDF pages into images, which requires Poppler:

Windows:

Download from GitHub - Poppler Windows Releases
Extract to a folder like C:\poppler
Add C:\poppler\Library\bin to your system's PATH
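To confirm the PATH change took effect, a quick check from Python can help. This sketch looks for `pdftoppm`, one of the Poppler executables that pdf2image invokes; the `poppler_available` helper is a hypothetical name.

```python
import shutil


def poppler_available():
    """True if Poppler's pdftoppm executable is on PATH."""
    return shutil.which("pdftoppm") is not None


if poppler_available():
    print("Poppler found")
else:
    print("Poppler not found on PATH; pdf2image will not be able to convert pages")
```

If editing PATH is not an option, `convert_from_path` also accepts a `poppler_path` argument pointing at the folder that contains the Poppler binaries.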

🧪 Demo Screenshots – Multimodal vs. Text-Only RAG

Visual comparison of the same queries across two apps:

1. ❓ Query: “What is AUM of Invesco?”

Multimodal App: Found in Bar chart
Text-Only App: Missed (text doesn’t mention it)

2. ❓ Query: “How much did BlackRock earn through Technology services?”

Multimodal App: Pulled the value from the image of the BlackRock income statement
Text-Only App: Missed (text doesn’t mention it)

3. 🍎 Query: “How much Percentage is Apple in S&P?”

Multimodal App: Found in pie chart
Text-Only App: Gave approximate data

4. 🦠 Query: “During Covid pandemic what was the top 10 weight in S&P 500?”

Multimodal App: Parsed timeline chart
Text-Only App: Missed specific figure

5. 💰 Query: “How to track Bitcoin in ETFs?”

Multimodal App: Found in Table Image

Text-Only App: Missed specific figure

⚠️ Limitations and Considerations

While multimodal RAG offers significant advantages, be aware of:

Computational overhead - Processing and embedding images requires more resources
API costs - Multimodal embedding APIs typically cost more than text-only equivalents
OCR dependency - Chart text recognition still relies on OCR quality
Image resolution impact - Low-resolution images may reduce embedding quality
Complex visualization challenges - Very complex visualizations might still be misinterpreted
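One way to soften the overhead and API-cost points is to avoid embedding the same page twice. The sketch below caches embeddings by a hash of the content. It is an illustration under assumptions: `cached_embed` and `fake_embed` are hypothetical names, and the in-memory dict could be swapped for a file or database in a real app.

```python
import hashlib


def cached_embed(content_bytes, embed_fn, cache):
    """Return the embedding for content_bytes, calling embed_fn only on a cache miss."""
    key = hashlib.sha256(content_bytes).hexdigest()
    if key not in cache:
        cache[key] = embed_fn(content_bytes)
    return cache[key]


# Stand-in for a paid embedding call; records how often it is invoked.
calls = []


def fake_embed(data):
    calls.append(data)
    return [float(len(data))]


cache = {}
first = cached_embed(b"page-1", fake_embed, cache)
second = cached_embed(b"page-1", fake_embed, cache)
print(len(calls))  # 1
```

Re-uploading an unchanged PDF then costs no additional embedding calls for pages already seen.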

🙌 Closing Thoughts

If you're building LLM apps for financial document QA, research assistants, or compliance analytics, text alone is not enough. Multimodal RAG retrieves over page images as well as text, so answers can draw on the charts, tables, and diagrams in a document, not just its prose.

Try it out and let me know your thoughts!

🚀 Let's Connect!

If you found this useful, feel free to connect with me:
🔗 LinkedIn - Sridhar Sampath
🔗 Hashnode Blog

🚀 Beyond Text: Building Multimodal RAG Systems with Cohere and Gemini
