🚀 Beyond Text: Building Multimodal RAG Systems with Cohere and Gemini
Build a powerful Multimodal RAG system that understands both text and visuals using Cohere and Gemini - and stop missing the insights hidden in charts

TL;DR
Traditional RAG fails on visual data. This project uses Cohere’s multimodal embeddings + Gemini 2.5 Flash to build a RAG system that understands both text and images — enabling accurate answers from charts, tables, and visuals inside PDFs.
📉 The Problem: Traditional RAG's Visual Blindspot
Traditional Retrieval-Augmented Generation (RAG) systems rely on text embeddings to retrieve information from documents. But what if your most valuable insights are hidden in charts, tables, and images?
Whether you're analyzing financial PDFs, investment research reports, or market slides, much of the relevant information lives in visuals:
Numerical breakdowns in pie/bar charts (e.g., portfolio allocations)
Trend visualizations in line graphs (e.g., market performance)
Structured data in complex tables (e.g., comparison matrices)
Process flows in diagrams (e.g., system architectures)
Spatial relationships in maps or layouts
A purely text-based approach fails to capture this crucial layer of information.
💡 The Solution: Multimodal RAG
Multimodal RAG augments traditional RAG by combining text and image understanding. This approach enables:
🔍 Image + Text search from the same document
🧠 Unified vector index with mixed modality support
🤖 Context-aware answers via Gemini using either matched text or matched image
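To make the "unified index" idea concrete, here is a minimal sketch that uses plain Python lists as a stand-in for the FAISS index. The toy 3-dimensional vectors, the `add_entry` and `search` helpers, and the metadata fields are all assumptions for illustration, not code from the project. The point is only that each vector is stored next to a record of where it came from, so a hit can be routed to either text or image context.

```python
import math

# Each entry keeps the vector alongside what it came from, so a search hit
# can be routed to the right kind of context (page text or page image).
entries = []  # list of (vector, metadata) pairs

def add_entry(vector, modality, page):
    """Store one embedding together with its modality and page number."""
    entries.append((vector, {"modality": modality, "page": page}))

def search(query_vector, k=1):
    """Return metadata of the k entries closest to the query (L2 distance)."""
    ranked = sorted(entries, key=lambda entry: math.dist(entry[0], query_vector))
    return [meta for _, meta in ranked[:k]]

# Toy 3-dimensional vectors standing in for real embeddings (hypothetical values)
add_entry([0.9, 0.1, 0.0], "text", page=1)
add_entry([0.1, 0.8, 0.1], "image", page=2)
add_entry([0.0, 0.2, 0.9], "image", page=3)

top = search([0.0, 0.9, 0.1], k=1)
print(top)  # the closest entry is the page-2 image
```

In the real pipeline the same bookkeeping would sit beside the FAISS index: FAISS returns row numbers, and a parallel list maps each row back to its text chunk or page image.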
🔧 Key Technologies
Cohere's Embed v4.0: Embeds both text and images in the same vector space
Gemini 2.5 Flash: Processes queries with context (text or image) to generate factual, human-like responses
FAISS: Indexes and searches vectors from both modalities, with support for both exact and approximate nearest-neighbor search
🧭 End-to-End Multimodal RAG Workflow
Below is the high-level system flow for the Multimodal RAG pipeline:

📌 From PDF upload to image/text embedding, vector search, and Gemini-powered answer generation — everything is stitched together using Streamlit, Cohere, FAISS, and Gemini 2.5 Flash.
🎥 Multimodal RAG – Video Demo
Here's a 9-minute visual walkthrough of the system in action:
See it live: Charts being analyzed, tables being interpreted, and complex visuals being understood all in real-time!
Architecture Comparison
🖼️ Multimodal RAG Architecture
In this pipeline, both the text and each page image are embedded using Cohere, stored in FAISS, and served as context to Gemini 2.5 Flash. This allows questions grounded in visuals to be answered — something traditional RAG setups can't handle.

📝 Text-Only RAG Architecture
This approach extracts text from the PDF, embeds it, and uses it for retrieval — but completely misses information embedded inside charts or graphics.

Results: Side-by-Side Comparison
We tested both Text-Only and Multimodal RAG apps on the same ETF PDF document:
The results are clear: Text-only RAG struggled with questions grounded in visual data, while Multimodal RAG handled image-based content effectively.
| ❓ Query | 📄 Text-Only App | 🖼️ Multimodal App |
| --- | --- | --- |
| What did Warren Buffett say about ETF? | ✅ Answered from intro text | ✅ Same |
| What is AUM of Invesco? | ❌ Missed (image) | ✅ Pulled from bar chart |
| How much did BlackRock earn through Technology services? | ❌ Missed (in chart) | ✅ Answered using image block |
| How much Percentage is Apple in S&P? | ❌ Missed (pie chart) | ✅ Extracted % from visual |
| During Covid pandemic what was the top 10 weight in S&P 500? | ❌ Missed (timeline chart) | ✅ Parsed from infographic |
| What was the difference between Dotcom bubble and Covid crash? | ❌ Missed (context lost) | ✅ Interpreted from visual timeline |
| How to track Bitcoin in ETFs? | ❌ Missed (table data) | ✅ Interpreted from tables |
Code Walkthrough: Multimodal Processing
💻 Full Source Code: GitHub Repository
1. PDF to Image Conversion
images = pdf2image.convert_from_path(pdf_path, dpi=200)
This gives us a list of page-wise PIL images, which are embedded next.
2. Embedding with Cohere
if content_type == "text":
    response = cohere.embed(input_type="search_document", texts=[text])
else:
    base64_img = convert_image_to_base64(image)
    response = cohere.embed(
        input_type="search_document",
        inputs=[{"content": [{"type": "image", "image": base64_img}]}]
    )
The output is added to FAISS as a float32 vector.
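The snippet above calls `convert_image_to_base64` without showing it. A minimal sketch of what such a helper could look like follows. It assumes the image is a PIL image (anything with a `.save(fp, format=...)` method works) and that the API accepts a base64 data URI; the exact format used in the repository is not shown in this article, so treat the `data:image/png;base64,` prefix as an assumption.

```python
import base64
import io

def convert_image_to_base64(image, fmt="PNG"):
    """Serialize an image to an in-memory file and return a base64 data URI.

    `image` is expected to behave like a PIL image, i.e. to expose
    save(fp, format=...). The data-URI prefix is an assumption about
    what the embedding API expects.
    """
    buffer = io.BytesIO()
    image.save(buffer, format=fmt)
    encoded = base64.b64encode(buffer.getvalue()).decode("ascii")
    return f"data:image/{fmt.lower()};base64,{encoded}"
```

Encoding the serialized PNG bytes matters here: calling `tobytes()` on a PIL image returns raw pixel data, which is not a valid image file.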
3. Gemini Answering Logic
if isinstance(content, Image.Image):
    response = gemini.generate_content([query, content])
else:
    response = gemini.generate_content(f"Question: {query}\n\nContext: {text}")
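When the best match is a page image, Gemini 2.5 Flash receives the image itself and reads the charts, titles, and layout directly. When the match is text, it answers from the supplied text context.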
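The branching above can be isolated into a small helper that only decides what to send to the model. This is a sketch, not the repository's code: `build_gemini_input` is a hypothetical name, and it checks for `str` rather than `Image.Image` so that it runs without Pillow installed.

```python
def build_gemini_input(query, content):
    """Return the payload for generate_content, depending on the match type.

    Text matches become a single prompt string; anything else (for example
    a PIL image) is passed alongside the query as a two-item list.
    """
    if isinstance(content, str):
        return f"Question: {query}\n\nContext: {content}"
    return [query, content]
```

The result could then be passed straight through, for example `gemini.generate_content(build_gemini_input(query, content))`.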
🚀 Getting Started – Minimal Example
Here’s a compact script to get you up and running with multimodal RAG using Cohere + Gemini:
⚠️ Note: This is a minimal gist to demonstrate the core flow. The full working code with UI, modular structure, and search logic is available in the [GitHub repository](https://github.com/SridharSampath/multimodal-rag-demo).
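import base64
import io

import cohere
import faiss
import google.generativeai as genai
import numpy as np
from pdf2image import convert_from_path

# Initialize APIs (the v2 client is used because the `inputs` image format belongs to the v2 embed API)
co = cohere.ClientV2(api_key="your-cohere-key")
genai.configure(api_key="your-gemini-key")
gemini = genai.GenerativeModel("gemini-2.5-flash")

# Convert each PDF page to a PIL image (requires Poppler)
def pdf_to_images(pdf_path):
    return convert_from_path(pdf_path, dpi=200)

# Encode a PIL image as a base64 PNG data URI
def image_to_base64(image):
    buffer = io.BytesIO()
    image.save(buffer, format="PNG")
    encoded = base64.b64encode(buffer.getvalue()).decode("utf-8")
    return f"data:image/png;base64,{encoded}"

# Create an embedding for a text string or a PIL image
def get_embedding(content, content_type="text", input_type="search_document"):
    if content_type == "text":
        response = co.embed(
            model="embed-v4.0",
            input_type=input_type,
            embedding_types=["float"],
            texts=[content],
        )
    else:
        response = co.embed(
            model="embed-v4.0",
            input_type=input_type,
            embedding_types=["float"],
            inputs=[{"content": [{"type": "image", "image": image_to_base64(content)}]}],
        )
    return response.embeddings.float[0]

# Index every page image; the index dimension is read from the embeddings
# instead of being hard-coded, so it always matches the model's output size
images = pdf_to_images("your.pdf")
page_vectors = np.array([get_embedding(img, "image") for img in images], dtype=np.float32)
index = faiss.IndexFlatL2(page_vectors.shape[1])
index.add(page_vectors)

# Retrieve the closest page and let Gemini answer from it
def answer_query(query):
    # Queries are embedded with input_type="search_query", documents with "search_document"
    query_emb = get_embedding(query, input_type="search_query")
    D, I = index.search(np.array([query_emb], dtype=np.float32), k=1)
    result = images[I[0][0]]
    return gemini.generate_content([query, result]).text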
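One design note on the index: `faiss.IndexFlatL2` ranks by Euclidean distance. The article does not say whether the embeddings are normalized, so the following is only a sketch of a common option, not a description of the project. If vectors are scaled to unit length before indexing, squared L2 distance equals `2 - 2 * cosine similarity`, so L2 ranking and cosine ranking agree. The helper names below are hypothetical, and the example uses plain Python lists to stay dependency-free.

```python
import math

def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        return list(vec)
    return [x / norm for x in vec]

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

a, b = [3.0, 4.0], [4.0, 3.0]
a_hat, b_hat = normalize(a), normalize(b)

# For unit vectors, these two quantities are equal up to floating-point error
print(squared_l2(a_hat, b_hat), 2 - 2 * cosine(a, b))
```

With NumPy arrays, the same normalization would be applied to both the page vectors and the query vector before calling `index.add` and `index.search`.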
⚙️ Project Setup
What You'll Need
🔑 API Keys:
Cohere embed-v4.0 → Create Cohere Account
Gemini 2.5 Flash → Try Gemini on Google AI Studio
💻 System Requirements:
Python 3.8+
Poppler (for PDF image conversion)
| File | Purpose |
| --- | --- |
| app.py | Streamlit UI for uploading, querying |
| core/embeddings.py | Calls Cohere for text/image embeddings |
| core/document_utils.py | PDF parsing, image conversion, FAISS indexing |
| core/search.py | Embedding-based search + Gemini response |
| config.py | API Keys & Model Settings |
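The contents of `config.py` are not shown in this article. As a rough sketch of one way to keep API keys and model settings in a single place, the following reads keys from environment variables. The variable names `COHERE_API_KEY` and `GEMINI_API_KEY`, the `load_settings` function, and the dictionary layout are all assumptions; only the two model names come from the article.

```python
import os

def load_settings(env=os.environ):
    """Collect API keys and model names, failing early if a key is missing.

    The environment variable names are hypothetical; adjust them to match
    however the project actually stores its credentials.
    """
    required = ("COHERE_API_KEY", "GEMINI_API_KEY")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "cohere_api_key": env["COHERE_API_KEY"],
        "gemini_api_key": env["GEMINI_API_KEY"],
        "embed_model": "embed-v4.0",
        "gemini_model": "gemini-2.5-flash",
    }
```

Failing at startup with a clear message tends to be easier to debug than an authentication error in the middle of a query.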
# Clone repository
git clone https://github.com/SridharSampath/multimodal-rag-demo
cd multimodal-rag-demo
# Install dependencies
pip install -r requirements.txt
# Run the app
streamlit run app.py
⚠️ System Dependency: Poppler
This project uses pdf2image to convert PDF pages into images, which requires Poppler:
Windows:
Download from GitHub - Poppler Windows Releases
Extract to a folder like C:\poppler
Add C:\poppler\Library\bin to your system's PATH
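After updating PATH, it can help to confirm that Poppler is actually visible to Python before running the app. The sketch below checks for the `pdftoppm` and `pdfinfo` executables, which pdf2image calls under the hood. The `poppler_available` helper is a hypothetical name, not part of the project.

```python
import shutil

def poppler_available(which=shutil.which):
    """Return True if Poppler's pdftoppm and pdfinfo are both on PATH."""
    return all(which(tool) is not None for tool in ("pdftoppm", "pdfinfo"))

if not poppler_available():
    print("Poppler not found on PATH; pdf2image will not be able to convert PDFs.")
```

If the check fails on Windows right after editing PATH, opening a new terminal usually picks up the change.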
🧪 Demo Screenshots – Multimodal vs. Text-Only RAG
Visual comparison of the same queries across two apps:
1. ❓ Query: “What is AUM of Invesco?”
Multimodal App: Found in Bar chart
Text-Only App: Missed (text doesn’t mention it)


2. ❓ Query: “How much did BlackRock earn through Technology services?”
Multimodal App: Pulled value from image (BlackRock income statement)
Text-Only App: Missed (text doesn’t mention it)


3. 🍎 Query: “How much Percentage is Apple in S&P?”
Multimodal App: Found in pie chart
Text-Only App: Gave approximate data


4. 🦠 Query: “During Covid pandemic what was the top 10 weight in S&P 500?”
Multimodal App: Parsed timeline chart
Text-Only App: Missed specific figure


5. 💰 Query: “How to track Bitcoin in ETFs?”
Multimodal App: Found in Table Image
Text-Only App: Missed specific figure


⚠️ Limitations and Considerations
While multimodal RAG offers significant advantages, be aware of:
Computational overhead - Processing and embedding images requires more resources
API costs - Multimodal embedding APIs typically cost more than text-only equivalents
OCR dependency - Chart text recognition still relies on OCR quality
Image resolution impact - Low-resolution images may reduce embedding quality
Complex visualization challenges - Very complex visualizations might still be misinterpreted
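On the resolution and overhead points, one common compromise is to cap the longest side of each page image before embedding while keeping the aspect ratio, so charts are not distorted. The sketch below only computes the target size. The `fit_within` name and the default cap of 1024 pixels are assumptions chosen for illustration, not values taken from the project or from any API limit.

```python
def fit_within(width, height, max_side=1024):
    """Return (width, height) scaled so the longest side is at most max_side.

    Images already within the limit are left at their original size, and
    the aspect ratio is preserved when scaling down.
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))
```

With Pillow, the result can be passed straight to `image.resize(fit_within(*image.size))`. A lower cap reduces payload size and cost, but small chart labels may become harder to read.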
🙌 Closing Thoughts
If you're building LLM apps for financial document QA, research assistant bots, or compliance analytics, text alone is not enough. Multimodal RAG retrieves over page images as well as text, so answers can draw on the charts, tables, and diagrams in a document instead of only its extracted text.
Try it out and let me know your thoughts!
🚀 Let's Connect!
If you found this useful, feel free to connect with me:
🔗 LinkedIn - Sridhar Sampath
🔗 Hashnode Blog



