EJB Vision-Language Model Benchmark
A modular, reference-free evaluation framework for Vision-Language Models, inspired by CLIPScore
Overview
EJB VLM Benchmark is a comprehensive evaluation framework for assessing Vision-Language Models (VLMs) such as GPT-4-Vision, Claude-3, and Gemini-Vision. Built on the foundation of CLIPScore (Hessel et al., EMNLP 2021), this benchmark enables reference-free evaluation — meaning you can assess image caption quality without needing ground-truth captions.
What Makes This Different?
Traditional caption evaluation metrics (BLEU, CIDEr, METEOR) require reference captions for comparison. CLIPScore revolutionized this by measuring semantic similarity between images and generated captions directly, eliminating the need for references.
EJB VLM Benchmark extends this concept to evaluate modern VLMs:
Traditional Metrics: Compare generated captions ↔ reference captions
CLIPScore: Compare generated captions ↔ images (reference-free!)
VLM Benchmark: Generate captions with VLMs → Evaluate with CLIPScore
This benchmark uses the clipscore.py implementation directly from Hessel et al. (2021); it does NOT reimplement the metric. All CLIPScore computations use the exact code released with the paper to ensure accuracy and reproducibility.
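For reference, CLIPScore is defined in Hessel et al. (2021) as a rescaled, clipped cosine similarity between the CLIP embeddings of the image and the candidate caption: CLIPScore(c, v) = 2.5 * max(cos(E_c, E_v), 0). A minimal sketch of that computation, using placeholder embeddings in place of real CLIP features:

```python
import numpy as np

def clipscore(caption_emb: np.ndarray, image_emb: np.ndarray, w: float = 2.5) -> float:
    """Rescaled, clipped cosine similarity (Hessel et al., 2021)."""
    cos = np.dot(caption_emb, image_emb) / (
        np.linalg.norm(caption_emb) * np.linalg.norm(image_emb)
    )
    return w * max(cos, 0.0)

# Placeholder unit-norm vectors stand in for CLIP image/text features.
caption_emb = np.array([0.6, 0.8])
image_emb = np.array([0.8, 0.6])
print(f"CLIPScore: {clipscore(caption_emb, image_emb):.4f}")  # CLIPScore: 2.4000
```

The clipping at zero discards anti-correlated embeddings, and the rescaling factor w = 2.5 stretches typical cosine values into a more readable range.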
Key Features
- ✅ Reference-Free Evaluation: No ground-truth captions needed
- ✅ API Integration: Built-in support for LLM7.io and extensible to other APIs
- ✅ YAML Configuration: Complete parametrization through config files
- ✅ Jinja2 Templates: Reusable, parametrized prompts
- ✅ Batch Processing: Efficient evaluation of large datasets
- ✅ Modular Design: Easy to extend and customize
Reference-Free Evaluation with CLIPScore
The foundation of this benchmark is CLIPScore's ability to evaluate captions without reference captions. Here are examples using the original CLIPScore implementation:
Basic CLIPScore (Reference-Free)
Evaluate captions against images directly:
python clipscore.py example/good_captions.json example/images/

Output:
CLIPScore: 0.8584
Better captions receive higher scores. Worse captions get lower scores:
python clipscore.py example/bad_captions.json example/images/

Output:
CLIPScore: 0.7153
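The captions file maps image identifiers to candidate captions. The exact key format is defined by the original CLIPScore repository's example files; the sketch below is illustrative only (filenames and captions are invented):

```json
{
  "image1": "A brown dog running across a grassy field",
  "image2": "Two people riding bicycles along a city street"
}
```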
Why Reference-Free Matters
- ❌ No need for ground-truth captions or annotations
- ✅ Measures semantic alignment between visual and textual content
- ✅ Works on any image dataset without pre-existing labels
- ✅ Ideal for real-world applications and custom datasets
Installation
# Install dependencies (note: OpenAI's CLIP is installed from its repository; the PyPI package named "clip" is unrelated)
pip install torch pillow requests pyyaml jinja2 tqdm numpy scikit-learn
pip install git+https://github.com/openai/CLIP.git
# Or use requirements file
pip install -r requirements.txt
Quick Start: VLM Evaluation
Evaluate Vision-Language Models using the benchmark:
# Navigate to benchmark directory
cd ejb_vlm_benchmark/
# Set up your API key
export LLM7_API_KEY="your-key-here"
# Run evaluation on a single image
python -c "
from ejb_vlm_benchmark import VLMBenchmark
benchmark = VLMBenchmark('config_example.yaml')
result = benchmark.evaluate_single('path/to/image.jpg')
print(f'CLIPScore: {result[\"clipscore\"]:.4f}')
print(f'Caption: {result[\"generated_caption\"]}')
"
Configuration
api:
  api_key: "your-api-key"
  base_url: "https://api.llm7.io/v1"
model:
  model_name: "gpt-4-vision-preview"
  temperature: 0.0
  max_tokens: 1000
evaluation:
  task_type: "caption_generation"
  evaluation_metrics:
    - "clipscore"
  use_references: false  # Reference-free!
  prompt_template_name: "default_caption"
  prompt_template_params:
    instruction: "Describe this image in detail"
    style: "descriptive and concise"
Pre-configured Templates
Available in configs/:
- detailed_description_config.yaml - Detailed image descriptions
- vqa_config.yaml - Visual Question Answering
- comparison_config.yaml - Image comparisons
- llm_judge_config.yaml - LLM-based quality assessment
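The prompt_template_params from the configuration are rendered into the Jinja2 template named by prompt_template_name. A minimal sketch of how such rendering works (the template text here is illustrative, not the shipped default_caption template):

```python
from jinja2 import Template

# Illustrative template body; the real templates live in templates/*.j2.
template = Template("{{ instruction }}. Keep the style {{ style }}.")

prompt = template.render(
    instruction="Describe this image in detail",
    style="descriptive and concise",
)
print(prompt)
# Describe this image in detail. Keep the style descriptive and concise.
```

Because the templates are parametrized, the same template file can drive many experiments by changing only the YAML configuration.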
Usage Examples
Evaluate a Dataset
from ejb_vlm_benchmark import VLMBenchmark
from ejb_vlm_benchmark.utils import find_images
# Initialize benchmark
benchmark = VLMBenchmark("ejb_vlm_benchmark/config_example.yaml")
# Find all images
images = find_images("path/to/images/")
# Evaluate
results = benchmark.evaluate_dataset(images)
print(f"Mean CLIPScore: {results['metrics']['mean_clipscore']:.4f}")
print(f"Evaluated {results['metrics']['total_images']} images")
# Results saved automatically to outputs/
Architecture
ejb_vlm_benchmark/
├── config.py # Parametrized configuration system
├── api_client.py # LLM7.io API client
├── prompt_templates.py # Jinja2 template manager
├── evaluator.py # Main evaluation logic + CLIPScore
├── utils.py # Utility functions
├── configs/ # Pre-configured templates
│ ├── vqa_config.yaml
│ ├── detailed_description_config.yaml
│ └── ...
└── templates/ # Jinja2 prompt templates
├── custom_caption.j2
└── conversation.j2
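The find_images helper used in the dataset example above could be implemented roughly as follows; this is a hypothetical sketch based on its usage, and the actual utils.py may differ:

```python
from pathlib import Path

# Assumed set of recognized extensions; the real utility may support more.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".webp"}

def find_images(root: str) -> list[str]:
    """Recursively collect image file paths under a directory."""
    return sorted(
        str(p)
        for p in Path(root).rglob("*")
        if p.suffix.lower() in IMAGE_EXTENSIONS
    )
```

Returning a sorted list keeps batch evaluation order deterministic across runs, which helps when comparing results files.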
Use Cases
- Model Benchmarking: Compare different VLMs on the same dataset
- Prompt Engineering: Test different prompts and measure their impact
- Quality Assurance: Monitor VLM output quality in production
- Research: Investigate VLM capabilities on custom datasets
- Custom Datasets: Evaluate on new/unlabeled image collections without ground-truth captions
Contact
For questions about the VLM Benchmark extension, contact: eduardojbarriosgarcia@gmail.com