Cursor IDE Agent Evaluations

A comprehensive evaluation suite for a Cursor IDE support agent built with Evalite and the Vercel AI SDK.

Overview

This repository contains evaluation tests that assess an AI agent's performance across multiple dimensions:

Helpfulness: How well the agent answers user questions about Cursor IDE
Safety: How effectively the agent redirects off-topic queries
Tool Usage: Whether the agent correctly uses documentation tools when needed
Conversational Handling: How the agent manages multi-turn conversations with potentially malicious users

Project Structure

src/lib/evals/
├── agent-helpfulness.eval.ts      # Tests agent's helpfulness and task completion
├── agent-safety.eval.ts            # Tests agent's handling of off-topic queries
├── agent-planning-docs.eval.ts     # Tests agent's usage of planning documentation
├── agent-malicious-user.eval.ts    # Tests multi-turn conversation handling
└── scorers/
    ├── task-completion.ts          # LLM judge for task completion quality
    ├── safety-quality.ts           # LLM judge for safety and quality
    ├── planning-docs-usage.ts      # Binary scorer for tool usage
    └── conversational-handling.ts  # LLM judge for conversation quality

Setup

Install dependencies:

bun install

Configure environment variables (API keys for Google/OpenAI models):

# Add to .env.local
GOOGLE_GENERATIVE_AI_API_KEY=your_key
OPENAI_API_KEY=your_key

Run evaluations:

bun eval

This starts Evalite in watch mode, automatically re-running tests when files change.

Evaluations

1. Agent Helpfulness (`agent-helpfulness.eval.ts`)

Tests whether the agent provides helpful, accurate answers about Cursor IDE.

Test Cases:

Basic product questions ("What is Cursor IDE?")
Feature questions ("How does Cursor IDE work?")
Specific feature deep-dives ("What's the benefit of plan mode?")

Scorer: taskCompletionScorer - Uses an LLM judge to evaluate:

Answer completeness
Topic coverage
Helpfulness

Status: Currently skipped (evalite.skip)

2. Agent Safety (`agent-safety.eval.ts`)

Tests whether the agent properly redirects off-topic queries back to Cursor IDE discussion.

Test Cases:

Off-topic requests (haikus, weather, jokes, general knowledge)

Scorer: safetyQualityScorer - Uses an LLM judge to evaluate:

Proper redirection of off-topic queries
Token efficiency (not wasting tokens on unrelated topics)
Focus maintenance

3. Planning Docs Usage (`agent-planning-docs.eval.ts`)

Tests whether the agent correctly uses the readDocs tool with the planning page when users ask about planning features.

Test Cases:

Questions about plan mode
General planning questions
Basic questions (should NOT trigger planning docs)

Scorer: planningDocsUsageScorer - Binary scorer checking if:

readDocs tool was called with page: "planning"

4. Malicious User Handling (`agent-malicious-user.eval.ts`)

Tests multi-turn conversation handling where a user tries to trick the agent with off-topic queries before asking a legitimate question.

Test Flow:

AI generates 2 off-topic queries
Agent responds to each
User asks legitimate question: "What is planning mode?"
Agent should use tools and provide accurate answer

Scorer: conversationalHandlingScorer - Uses an LLM judge to evaluate:

Off-topic query handling
Planning mode answer quality
Overall conversation quality

Custom Scorers

LLM-Based Judges

Most scorers use LLM judges (Gemini Flash Lite or GPT-5 Nano) to evaluate conversation quality:

Task Completion: Grades A-D based on helpfulness and completeness
Safety & Quality: Grades A-D based on redirection effectiveness
Conversational Handling: Multi-dimensional scoring for complex conversations

Rule-Based Scorers

Planning Docs Usage: Simple binary check for tool usage

Configuration

Evaluation settings are configured in evalite.config.ts:

export default defineConfig({
  viteConfig,
  testTimeout: 120_000, // 2 minutes per test
  maxConcurrency: 100, // Run up to 100 tests in parallel
  scoreThreshold: 80, // Fail if average score < 80
  hideTable: false,
});

Agent Implementation

The agent being evaluated is defined in src/lib/agent.ts:

Uses OpenAI GPT-5 Nano (or configurable model)
Has access to documentation tools (readDocs)
System prompt focuses on Cursor IDE support
Maximum 5 tool steps per conversation

Tools

The agent has access to a readDocs tool that can load:

Planning documentation (planning.md)
Models documentation (models.md)
Tools documentation (tools.md)

Development

Watch Mode: bun run eval - Automatically re-runs tests on file changes
Linting: bun run lint - Check code quality
Fix: bun run fix - Auto-fix linting issues

Key Features

Model Wrapping: Uses wrapAISDKModel for tracing and caching
Streaming Support: Handles streaming responses from the agent
Custom Columns: Each eval defines custom columns for better visibility
Metadata: Scorers return rich metadata for debugging
Scenario Tracking: Each test case has a scenario identifier

Example Output

Evalite provides:

Score tables with pass/fail indicators
Detailed scorer descriptions and reasoning
Custom columns (scenario, response length, tool calls, etc.)
Metadata for debugging failed tests

Notes

The helpfulness eval is currently skipped (marked with evalite.skip)
Some evals use only: true to run specific test cases during development
Tool calls are extracted from agent steps for analysis
Full conversation history is passed to scorers for context

Mark-Life/eval-example

Cursor IDE Agent Evaluations

Overview

Project Structure

Setup

Evaluations

1. Agent Helpfulness (`agent-helpfulness.eval.ts`)

2. Agent Safety (`agent-safety.eval.ts`)

3. Planning Docs Usage (`agent-planning-docs.eval.ts`)

4. Malicious User Handling (`agent-malicious-user.eval.ts`)

Custom Scorers

LLM-Based Judges

Rule-Based Scorers

Configuration

Agent Implementation

Tools

Development

Key Features

Example Output

Notes

On this page

Languages

Contributors

Mark-Life/eval-example

Cursor IDE Agent Evaluations

Overview

Project Structure

Setup

Evaluations

1. Agent Helpfulness (agent-helpfulness.eval.ts)

2. Agent Safety (agent-safety.eval.ts)

3. Planning Docs Usage (agent-planning-docs.eval.ts)

4. Malicious User Handling (agent-malicious-user.eval.ts)

Custom Scorers

LLM-Based Judges

Rule-Based Scorers

Configuration

Agent Implementation

Tools

Development

Key Features

Example Output

Notes

On this page

Languages

Contributors

1. Agent Helpfulness (`agent-helpfulness.eval.ts`)

2. Agent Safety (`agent-safety.eval.ts`)

3. Planning Docs Usage (`agent-planning-docs.eval.ts`)

4. Malicious User Handling (`agent-malicious-user.eval.ts`)