Nemotron ColEmbed V2 Raises Multimodal Retrieval Bar with ViDoRe V3’s Top Model

Nemotron ColEmbed V2: Raising the Bar for Multimodal Retrieval

NVIDIA’s Nemotron ColEmbed V2 family is a significant advance in multimodal retrieval: the models achieve state-of-the-art results on the ViDoRe V1, V2, and V3 benchmarks. The nemotron-colembed-vl-8b-v2 model in particular ranks #1 on the ViDoRe V3 leaderboard with an NDCG@10 of 63.42.

Why This Matters

Accurate multimodal retrieval is essential for searching documents that mix text, images, and structured visual elements. However, existing models often struggle to capture fine-grained semantic relationships between queries and documents, which reduces accuracy. The Nemotron ColEmbed V2 family addresses this with a late-interaction embedding approach: rather than compressing each query and document into a single vector, it keeps one embedding per token and compares queries and documents token by token.
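The token-level comparison can be written compactly. What follows is the standard MaxSim formulation from ColBERT, given as a sketch: the symbols S, q_i, and d_j are notation introduced here for illustration, and the source does not say whether these models normalize embeddings or apply any extra scaling.

```latex
S(q, d) = \sum_{i=1}^{|q|} \max_{1 \le j \le |d|} \mathbf{q}_i^{\top} \mathbf{d}_j
```

Here q_i is the embedding of the i-th query token and d_j is the embedding of the j-th document token. Each query token contributes only its best match in the document, and the contributions are summed to give the relevance score.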

Key Insights

The Nemotron ColEmbed V2 models set the state of the art on the ViDoRe V3 benchmark, where nemotron-colembed-vl-8b-v2 ranks #1 with an NDCG@10 of 63.42.
The late-interaction mechanism introduced by ColBERT has been extended to a multimodal setting, enabling fine-grained interactions between query and document tokens.
The models use a bi-encoder architecture and are trained with contrastive learning, which raises the similarity between a query and its relevant documents while lowering the similarity to non-relevant ones.
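To make the contrastive objective concrete, here is a minimal pure-Python sketch of an InfoNCE-style loss over late-interaction scores. It is an illustration, not NVIDIA's training code: the use of in-batch negatives, the temperature value, and the function names maxsim and info_nce_loss are all assumptions introduced for this example.

```python
import math


def maxsim(query_tokens, doc_tokens):
    """Late-interaction score: each query token takes its best-matching
    document token (by dot product), and the maxima are summed."""
    return sum(
        max(sum(q * d for q, d in zip(q_vec, d_vec)) for d_vec in doc_tokens)
        for q_vec in query_tokens
    )


def info_nce_loss(queries, docs, temperature=0.05):
    """Mean cross-entropy where queries[i] is paired with docs[i] and every
    other document in the batch acts as a negative (an assumption here).
    The temperature default is a hypothetical value."""
    losses = []
    for i, query in enumerate(queries):
        logits = [maxsim(query, doc) / temperature for doc in docs]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        top = max(logits)
        log_norm = top + math.log(sum(math.exp(l - top) for l in logits))
        losses.append(log_norm - logits[i])
    return sum(losses) / len(losses)


# Toy batch: two queries (one token each) and two documents.
queries = [[[1.0, 0.0]], [[0.0, 1.0]]]
docs = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 1.0]]]

aligned = info_nce_loss(queries, docs, temperature=1.0)
swapped = info_nce_loss(queries, docs[::-1], temperature=1.0)
print(f"aligned pairs: {aligned:.4f}, swapped pairs: {swapped:.4f}")
```

The loss is lower when each query is paired with its matching document than when the pairs are swapped, which is the behavior training tries to reinforce.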

Working Example

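# Import necessary libraries
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Load the pre-trained Nemotron ColEmbed V2 model and tokenizer.
# Assumption: the checkpoint loads through the generic Auto* classes and ships
# custom code, hence trust_remote_code=True. Check the model card for the
# officially supported loading and encoding interface.
model_id = "nvidia/nemotron-colembed-vl-8b-v2"
model = AutoModel.from_pretrained(model_id, trust_remote_code=True).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Define a sample query and document
query = "What is the main topic of this document?"
document = "This document discusses the application of multimodal retrieval in natural language processing."

# Preprocess the query and document using the tokenizer
query_inputs = tokenizer(query, return_tensors="pt")
document_inputs = tokenizer(document, return_tensors="pt")

# Compute L2-normalized per-token embeddings (no gradients needed at inference).
# Assumption: the first model output has shape (1, num_tokens, dim).
with torch.no_grad():
    query_embeddings = F.normalize(model(**query_inputs)[0], dim=-1)
    document_embeddings = F.normalize(model(**document_inputs)[0], dim=-1)

# Late-interaction (MaxSim) score: match each query token to its most similar
# document token, then sum those maxima over all query tokens.
token_similarities = query_embeddings[0] @ document_embeddings[0].T
score = token_similarities.max(dim=1).values.sum()

# Print the relevance score
print(score.item())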

Practical Applications

Use Case: The Nemotron ColEmbed V2 models can be used in multimedia search engines, cross-modal retrieval systems, and conversational AI applications to improve the accuracy of multimodal retrieval.
Pitfall: Late-interaction models need more storage than single-vector models, because the index holds one embedding per token for every document in the corpus rather than one embedding per document.
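A back-of-the-envelope calculation shows how quickly the storage pitfall can grow. This is a rough sketch: the corpus size, vectors per document, embedding dimensions, and 2-byte precision below are hypothetical values chosen for illustration, not published figures for the Nemotron ColEmbed V2 models.

```python
def index_size_bytes(num_docs, vectors_per_doc, dim, bytes_per_value=2):
    """Raw embedding storage, ignoring index overhead and compression."""
    return num_docs * vectors_per_doc * dim * bytes_per_value


# Hypothetical corpus of one million document pages.
NUM_DOCS = 1_000_000

# Multi-vector (late interaction): assume 1,000 token vectors of 128 dims per page.
multi_vector = index_size_bytes(NUM_DOCS, vectors_per_doc=1_000, dim=128)

# Single-vector baseline: assume one 2,048-dim vector per page.
single_vector = index_size_bytes(NUM_DOCS, vectors_per_doc=1, dim=2_048)

print(f"multi-vector index:  {multi_vector / 1e9:.1f} GB")
print(f"single-vector index: {single_vector / 1e9:.1f} GB")
print(f"ratio: {multi_vector / single_vector:.1f}x")
```

Under these assumed numbers the multi-vector index is roughly sixty times larger, which is why deployments often weigh options such as lower-precision storage or fewer vectors per document.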
