# Cursor IDE Agent Evaluations

A comprehensive evaluation suite for a Cursor IDE support agent, built with Evalite and the Vercel AI SDK.
## Overview
This repository contains evaluation tests that assess an AI agent's performance across multiple dimensions:
- **Helpfulness**: How well the agent answers user questions about Cursor IDE
- **Safety**: How effectively the agent redirects off-topic queries
- **Tool Usage**: Whether the agent correctly uses documentation tools when needed
- **Conversational Handling**: How the agent manages multi-turn conversations with potentially malicious users
## Project Structure

```
src/lib/evals/
├── agent-helpfulness.eval.ts      # Tests agent's helpfulness and task completion
├── agent-safety.eval.ts           # Tests agent's handling of off-topic queries
├── agent-planning-docs.eval.ts    # Tests agent's usage of planning documentation
├── agent-malicious-user.eval.ts   # Tests multi-turn conversation handling
└── scorers/
    ├── task-completion.ts         # LLM judge for task completion quality
    ├── safety-quality.ts          # LLM judge for safety and quality
    ├── planning-docs-usage.ts     # Binary scorer for tool usage
    └── conversational-handling.ts # LLM judge for conversation quality
```
## Setup

1. Install dependencies:

   ```bash
   bun install
   ```

2. Configure environment variables (API keys for the Google/OpenAI models):

   ```
   # Add to .env.local
   GOOGLE_GENERATIVE_AI_API_KEY=your_key
   OPENAI_API_KEY=your_key
   ```

3. Run evaluations:

   ```bash
   bun eval
   ```

   This starts Evalite in watch mode, automatically re-running tests when files change.
## Evaluations

### 1. Agent Helpfulness (`agent-helpfulness.eval.ts`)
Tests whether the agent provides helpful, accurate answers about Cursor IDE.
**Test Cases:**
- Basic product questions ("What is Cursor IDE?")
- Feature questions ("How does Cursor IDE work?")
- Specific feature deep-dives ("What's the benefit of plan mode?")
**Scorer:** `taskCompletionScorer`, an LLM judge that evaluates:
- Answer completeness
- Topic coverage
- Helpfulness
**Status:** Currently skipped (`evalite.skip`)

### 2. Agent Safety (`agent-safety.eval.ts`)
Tests whether the agent properly redirects off-topic queries back to Cursor IDE discussion.
**Test Cases:**
- Off-topic requests (haikus, weather, jokes, general knowledge)
**Scorer:** `safetyQualityScorer`, an LLM judge that evaluates:
- Proper redirection of off-topic queries
- Token efficiency (not wasting tokens on unrelated topics)
- Focus maintenance
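The "token efficiency" dimension can also be approximated with a simple rule-based heuristic for quick local checks. The sketch below is illustrative, not taken from this repo: it scores a redirection higher the shorter it is, on the assumption that a good refusal rarely needs more than about 60 words.

```typescript
// Illustrative heuristic (not from this repo): shorter redirections
// score higher, since a good off-topic refusal should be brief.
function tokenEfficiency(response: string, maxWords = 60): number {
  const words = response.trim().split(/\s+/).filter(Boolean).length;
  if (words <= maxWords) return 1;
  // Linearly penalize overshoot; the score floors at 0 once the
  // response is twice the word budget.
  return Math.max(0, 1 - (words - maxWords) / maxWords);
}
```

A heuristic like this is cheaper and more deterministic than an LLM judge, so it can complement (not replace) the judge's qualitative assessment.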
### 3. Planning Docs Usage (`agent-planning-docs.eval.ts`)

Tests whether the agent correctly uses the `readDocs` tool with the `planning` page when users ask about planning features.

**Test Cases:**
- Questions about plan mode
- General planning questions
- Basic questions (should NOT trigger planning docs)
**Scorer:** `planningDocsUsageScorer`, a binary scorer that checks whether the `readDocs` tool was called with `page: "planning"`.
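The check itself can be a small pure function over the agent's steps. The sketch below assumes the step shape produced by the Vercel AI SDK (an array of steps, each carrying `toolCalls` with a `toolName` and its arguments); the names are illustrative, not copied from `planning-docs-usage.ts`.

```typescript
// Assumed step/tool-call shape, based on the Vercel AI SDK's step output.
type ToolCall = { toolName: string; args: Record<string, unknown> };
type AgentStep = { toolCalls: ToolCall[] };

// Returns 1 if readDocs was called with page "planning", else 0.
function scorePlanningDocsUsage(steps: AgentStep[]): number {
  const used = steps.some((step) =>
    step.toolCalls.some(
      (call) => call.toolName === "readDocs" && call.args.page === "planning",
    ),
  );
  return used ? 1 : 0;
}
```

Because the scorer is deterministic, it is cheap to run on every test case, including the "basic questions" cases where the expected score is 0.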
### 4. Malicious User Handling (`agent-malicious-user.eval.ts`)
Tests multi-turn conversation handling where a user tries to trick the agent with off-topic queries before asking a legitimate question.
**Test Flow:**

1. An LLM generates two off-topic queries
2. The agent responds to each
3. The user then asks a legitimate question: "What is planning mode?"
4. The agent should use its tools and provide an accurate answer

**Scorer:** `conversationalHandlingScorer`, an LLM judge that evaluates:
- Off-topic query handling
- Planning mode answer quality
- Overall conversation quality
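The flow above can be sketched as a helper that threads conversation history through each turn. The function and message shape below are hypothetical; in the real eval the off-topic queries come from a model and the responder is the agent itself.

```typescript
// Hypothetical helper sketching the multi-turn test flow: alternate
// user queries with agent replies, ending with the legitimate question.
type Message = { role: "user" | "assistant"; content: string };

function buildConversation(
  offTopicQueries: string[],
  respond: (history: Message[]) => string,
): Message[] {
  const history: Message[] = [];
  for (const query of [...offTopicQueries, "What is planning mode?"]) {
    history.push({ role: "user", content: query });
    // The agent sees the full history so far, including its own
    // earlier redirections.
    history.push({ role: "assistant", content: respond(history) });
  }
  return history;
}
```

Passing the accumulated history on every turn is what lets the scorer judge whether the agent stayed on-topic across the whole conversation rather than on the final answer alone.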
## Custom Scorers

### LLM-Based Judges
Most scorers use LLM judges (Gemini Flash Lite or GPT-5 Nano) to evaluate conversation quality:
- **Task Completion**: Grades A-D based on helpfulness and completeness
- **Safety & Quality**: Grades A-D based on redirection effectiveness
- **Conversational Handling**: Multi-dimensional scoring for complex conversations
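Letter grades from the judges have to be converted to numbers before Evalite can compare averages against the score threshold. The mapping below is an assumption for illustration only; the actual values live in the scorer files.

```typescript
// Assumed grade-to-score mapping (illustrative, not from this repo).
const GRADE_SCORES: Record<string, number> = { A: 1, B: 0.75, C: 0.5, D: 0 };

function gradeToScore(grade: string): number {
  // Tolerate whitespace and lowercase from the judge's raw output.
  const score = GRADE_SCORES[grade.trim().toUpperCase()];
  if (score === undefined) throw new Error(`Unexpected grade: ${grade}`);
  return score;
}
```

Failing loudly on an unexpected grade is deliberate: a silent 0 would be indistinguishable from a genuine D when debugging a flaky judge.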
### Rule-Based Scorers

- **Planning Docs Usage**: A simple binary check for tool usage

## Configuration

Evaluation settings are configured in `evalite.config.ts`:
```typescript
export default defineConfig({
  viteConfig,
  testTimeout: 120_000, // 2 minutes per test
  maxConcurrency: 100, // Run up to 100 tests in parallel
  scoreThreshold: 80, // Fail if average score < 80
  hideTable: false,
});
```

## Agent Implementation
The agent being evaluated is defined in `src/lib/agent.ts`:
- Uses OpenAI GPT-5 Nano (or a configurable model)
- Has access to documentation tools (`readDocs`)
- System prompt focuses on Cursor IDE support
- Maximum of 5 tool steps per conversation
## Tools

The agent has access to a `readDocs` tool that can load:

- Planning documentation (`planning.md`)
- Models documentation (`models.md`)
- Tools documentation (`tools.md`)
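At its core, such a tool just maps a page name to a documentation file. This is a minimal sketch under that assumption; the actual tool in `src/lib/agent.ts` (and how it wires into the AI SDK's `tool()` helper) may differ.

```typescript
// Assumed page-to-file mapping for the readDocs tool (illustrative).
const DOC_PAGES = {
  planning: "planning.md",
  models: "models.md",
  tools: "tools.md",
} as const;

type DocPage = keyof typeof DOC_PAGES;

// Resolve a page name to its markdown file, rejecting unknown pages
// so the model gets a clear error instead of an empty document.
function resolveDocFile(page: string): string {
  if (!(page in DOC_PAGES)) {
    throw new Error(`Unknown docs page: ${page}`);
  }
  return DOC_PAGES[page as DocPage];
}
```

Constraining the tool's `page` parameter to this closed set (for example with an enum schema) is what makes the binary planning-docs scorer reliable: there is exactly one correct call for planning questions.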
## Development

- **Watch mode**: `bun run eval` automatically re-runs tests on file changes
- **Linting**: `bun run lint` checks code quality
- **Fix**: `bun run fix` auto-fixes linting issues
## Key Features

- **Model Wrapping**: Uses `wrapAISDKModel` for tracing and caching
- **Streaming Support**: Handles streaming responses from the agent
- **Custom Columns**: Each eval defines custom columns for better visibility
- **Metadata**: Scorers return rich metadata for debugging
- **Scenario Tracking**: Each test case has a scenario identifier
## Example Output
Evalite provides:
- Score tables with pass/fail indicators
- Detailed scorer descriptions and reasoning
- Custom columns (scenario, response length, tool calls, etc.)
- Metadata for debugging failed tests
## Notes

- The helpfulness eval is currently skipped (marked with `evalite.skip`)
- Some evals use `only: true` to run specific test cases during development
- Tool calls are extracted from agent steps for analysis
- Full conversation history is passed to scorers for context