EJB Vision-Language Model Benchmark
A modular, reference-free evaluation framework for Vision-Language Models, inspired by CLIPScore
Overview
EJB VLM Benchmark is a comprehensive evaluation framework for assessing Vision-Language Models (VLMs) such as GPT-4-Vision, Claude-3, and Gemini-Vision. Built on the foundation of CLIPScore (Hessel et al., EMNLP 2021), this benchmark enables reference-free evaluation — meaning you can assess image caption quality without needing ground-truth captions.
What Makes This Different?
Traditional caption evaluation metrics (BLEU, CIDEr, METEOR) require reference captions for comparison. CLIPScore revolutionized this by measuring semantic similarity between images and generated captions directly, eliminating the need for references.
EJB VLM Benchmark extends this concept to evaluate modern VLMs:
Traditional Metrics: Compare generated captions ↔ reference captions
CLIPScore: Compare generated captions ↔ images (reference-free!)
VLM Benchmark: Generate captions with VLMs → Evaluate with CLIPScore
This benchmark uses the clipscore.py implementation directly from Hessel et al. (2021); it does NOT reimplement the metric. All CLIPScore computations use the exact code released with the paper to ensure accuracy and reproducibility.
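For reference, CLIPScore is defined in Hessel et al. (2021) as a rescaled, clipped cosine similarity between the CLIP embeddings of the image and the candidate caption: CLIPScore(c, v) = 2.5 * max(cos(E_c, E_v), 0). A minimal sketch of that computation, using placeholder embeddings in place of real CLIP features:

```python
import numpy as np

def clipscore(caption_emb: np.ndarray, image_emb: np.ndarray, w: float = 2.5) -> float:
    """Rescaled, clipped cosine similarity (Hessel et al., 2021)."""
    cos = np.dot(caption_emb, image_emb) / (
        np.linalg.norm(caption_emb) * np.linalg.norm(image_emb)
    )
    return w * max(cos, 0.0)

# Placeholder unit-norm vectors stand in for CLIP image/text features.
caption_emb = np.array([0.6, 0.8])
image_emb = np.array([0.8, 0.6])
print(f"CLIPScore: {clipscore(caption_emb, image_emb):.4f}")  # CLIPScore: 2.4000
```

The clipping at zero discards anti-correlated embeddings, and the rescaling factor w = 2.5 stretches typical cosine values into a more readable range.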
Key Features
- ✅ Reference-Free Evaluation: No ground-truth captions needed
- ✅ API Integration: Built-in support for LLM7.io and extensible to other APIs
- ✅ YAML Configuration: Complete parametrization through config files
- ✅ Jinja2 Templates: Reusable, parametrized prompts
- ✅ Batch Processing: Efficient evaluation of large datasets
- ✅ Modular Design: Easy to extend and customize
Reference-Free Evaluation with CLIPScore
The foundation of this benchmark is CLIPScore's ability to evaluate captions without reference captions. Here are examples using the original CLIPScore implementation:
Basic CLIPScore (Reference-Free)
Evaluate captions against images directly:
python clipscore.py example/good_captions.json example/images/

Output:
CLIPScore: 0.8584
Better captions receive higher scores. Worse captions get lower scores:
python clipscore.py example/bad_captions.json example/images/

Output:
CLIPScore: 0.7153
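The captions file maps image identifiers to candidate captions. The exact key format is defined by the original CLIPScore repository's example files; the sketch below is illustrative only (filenames and captions are invented):

```json
{
  "image1": "A brown dog running across a grassy field",
  "image2": "Two people riding bicycles along a city street"
}
```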
Why Reference-Free Matters
- ❌ No need for ground-truth captions or annotations
- ✅ Measures semantic alignment between visual and textual content
- ✅ Works on any image dataset without pre-existing labels
- ✅ Ideal for real-world applications and custom datasets
Installation
# Install dependencies (note: OpenAI's CLIP is installed from its repository; the PyPI package named "clip" is unrelated)
pip install torch pillow requests pyyaml jinja2 tqdm numpy scikit-learn
pip install git+https://github.com/openai/CLIP.git
# Or use requirements file
pip install -r requirements.txt
Quick Start: VLM Evaluation
Evaluate Vision-Language Models using the benchmark:
# Navigate to benchmark directory
cd ejb_vlm_benchmark/
# Set up your API key
export LLM7_API_KEY="your-key-here"
# Run evaluation on a single image
python -c "
from ejb_vlm_benchmark import VLMBenchmark
benchmark = VLMBenchmark('config_example.yaml')
result = benchmark.evaluate_single('path/to/image.jpg')
print(f'CLIPScore: {result[\"clipscore\"]:.4f}')
print(f'Caption: {result[\"generated_caption\"]}')
"
Configuration
api:
  api_key: "your-api-key"
  base_url: "https://api.llm7.io/v1"
model:
  model_name: "gpt-4-vision-preview"
  temperature: 0.0
  max_tokens: 1000
evaluation:
  task_type: "caption_generation"
  evaluation_metrics:
    - "clipscore"
  use_references: false  # Reference-free!
  prompt_template_name: "default_caption"
  prompt_template_params:
    instruction: "Describe this image in detail"
    style: "descriptive and concise"
Pre-configured Templates
Available in configs/:
- detailed_description_config.yaml - Detailed image descriptions
- vqa_config.yaml - Visual Question Answering
- comparison_config.yaml - Image comparisons
- llm_judge_config.yaml - LLM-based quality assessment
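The prompt_template_params from the configuration are rendered into the Jinja2 template named by prompt_template_name. A minimal sketch of how such rendering works (the template text here is illustrative, not the shipped default_caption template):

```python
from jinja2 import Template

# Illustrative template body; the real templates live in templates/*.j2.
template = Template("{{ instruction }}. Keep the style {{ style }}.")

prompt = template.render(
    instruction="Describe this image in detail",
    style="descriptive and concise",
)
print(prompt)
# Describe this image in detail. Keep the style descriptive and concise.
```

Because the templates are parametrized, the same template file can drive many experiments by changing only the YAML configuration.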
Usage Examples
Evaluate a Dataset
from ejb_vlm_benchmark import VLMBenchmark
from ejb_vlm_benchmark.utils import find_images
# Initialize benchmark
benchmark = VLMBenchmark("ejb_vlm_benchmark/config_example.yaml")
# Find all images
images = find_images("path/to/images/")
# Evaluate
results = benchmark.evaluate_dataset(images)
print(f"Mean CLIPScore: {results['metrics']['mean_clipscore']:.4f}")
print(f"Evaluated {results['metrics']['total_images']} images")
# Results saved automatically to outputs/
Architecture
ejb_vlm_benchmark/
├── config.py # Parametrized configuration system
├── api_client.py # LLM7.io API client
├── prompt_templates.py # Jinja2 template manager
├── evaluator.py # Main evaluation logic + CLIPScore
├── utils.py # Utility functions
├── configs/ # Pre-configured templates
│ ├── vqa_config.yaml
│ ├── detailed_description_config.yaml
│ └── ...
└── templates/ # Jinja2 prompt templates
├── custom_caption.j2
└── conversation.j2
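The find_images helper used in the dataset example above could be implemented roughly as follows; this is a hypothetical sketch based on its usage, and the actual utils.py may differ:

```python
from pathlib import Path

# Assumed set of recognized extensions; the real utility may support more.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".webp"}

def find_images(root: str) -> list[str]:
    """Recursively collect image file paths under a directory."""
    return sorted(
        str(p)
        for p in Path(root).rglob("*")
        if p.suffix.lower() in IMAGE_EXTENSIONS
    )
```

Returning a sorted list keeps batch evaluation order deterministic across runs, which helps when comparing results files.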
Use Cases
- Model Benchmarking: Compare different VLMs on the same dataset
- Prompt Engineering: Test different prompts and measure their impact
- Quality Assurance: Monitor VLM output quality in production
- Research: Investigate VLM capabilities on custom datasets
- Custom Datasets: Evaluate on new/unlabeled image collections without ground-truth captions
Contact
For questions about the VLM Benchmark extension, contact: eduardojbarriosgarcia@gmail.com