GitHunt
MA

Mark-Life/eval-example

evals examples. dev evals with evalite, prod evals with langfuse

Cursor IDE Agent Evaluations

A comprehensive evaluation suite for a Cursor IDE support agent built with Evalite and the Vercel AI SDK.

Overview

This repository contains evaluation tests that assess an AI agent's performance across multiple dimensions:

  • Helpfulness: How well the agent answers user questions about Cursor IDE
  • Safety: How effectively the agent redirects off-topic queries
  • Tool Usage: Whether the agent correctly uses documentation tools when needed
  • Conversational Handling: How the agent manages multi-turn conversations with potentially malicious users

Project Structure

src/lib/evals/
├── agent-helpfulness.eval.ts      # Tests agent's helpfulness and task completion
├── agent-safety.eval.ts            # Tests agent's handling of off-topic queries
├── agent-planning-docs.eval.ts     # Tests agent's usage of planning documentation
├── agent-malicious-user.eval.ts    # Tests multi-turn conversation handling
└── scorers/
    ├── task-completion.ts          # LLM judge for task completion quality
    ├── safety-quality.ts           # LLM judge for safety and quality
    ├── planning-docs-usage.ts      # Binary scorer for tool usage
    └── conversational-handling.ts  # LLM judge for conversation quality

Setup

  1. Install dependencies:
bun install
  1. Configure environment variables (API keys for Google/OpenAI models):
# Add to .env.local
GOOGLE_GENERATIVE_AI_API_KEY=your_key
OPENAI_API_KEY=your_key
  1. Run evaluations:
bun eval

This starts Evalite in watch mode, automatically re-running tests when files change.

Evaluations

1. Agent Helpfulness (agent-helpfulness.eval.ts)

Tests whether the agent provides helpful, accurate answers about Cursor IDE.

Test Cases:

  • Basic product questions ("What is Cursor IDE?")
  • Feature questions ("How does Cursor IDE work?")
  • Specific feature deep-dives ("What's the benefit of plan mode?")

Scorer: taskCompletionScorer - Uses an LLM judge to evaluate:

  • Answer completeness
  • Topic coverage
  • Helpfulness

Status: Currently skipped (evalite.skip)

2. Agent Safety (agent-safety.eval.ts)

Tests whether the agent properly redirects off-topic queries back to Cursor IDE discussion.

Test Cases:

  • Off-topic requests (haikus, weather, jokes, general knowledge)

Scorer: safetyQualityScorer - Uses an LLM judge to evaluate:

  • Proper redirection of off-topic queries
  • Token efficiency (not wasting tokens on unrelated topics)
  • Focus maintenance

3. Planning Docs Usage (agent-planning-docs.eval.ts)

Tests whether the agent correctly uses the readDocs tool with the planning page when users ask about planning features.

Test Cases:

  • Questions about plan mode
  • General planning questions
  • Basic questions (should NOT trigger planning docs)

Scorer: planningDocsUsageScorer - Binary scorer checking if:

  • readDocs tool was called with page: "planning"

4. Malicious User Handling (agent-malicious-user.eval.ts)

Tests multi-turn conversation handling where a user tries to trick the agent with off-topic queries before asking a legitimate question.

Test Flow:

  1. AI generates 2 off-topic queries
  2. Agent responds to each
  3. User asks legitimate question: "What is planning mode?"
  4. Agent should use tools and provide accurate answer

Scorer: conversationalHandlingScorer - Uses an LLM judge to evaluate:

  • Off-topic query handling
  • Planning mode answer quality
  • Overall conversation quality

Custom Scorers

LLM-Based Judges

Most scorers use LLM judges (Gemini Flash Lite or GPT-5 Nano) to evaluate conversation quality:

  • Task Completion: Grades A-D based on helpfulness and completeness
  • Safety & Quality: Grades A-D based on redirection effectiveness
  • Conversational Handling: Multi-dimensional scoring for complex conversations

Rule-Based Scorers

  • Planning Docs Usage: Simple binary check for tool usage

Configuration

Evaluation settings are configured in evalite.config.ts:

export default defineConfig({
  viteConfig,
  testTimeout: 120_000, // 2 minutes per test
  maxConcurrency: 100, // Run up to 100 tests in parallel
  scoreThreshold: 80, // Fail if average score < 80
  hideTable: false,
});

Agent Implementation

The agent being evaluated is defined in src/lib/agent.ts:

  • Uses OpenAI GPT-5 Nano (or configurable model)
  • Has access to documentation tools (readDocs)
  • System prompt focuses on Cursor IDE support
  • Maximum 5 tool steps per conversation

Tools

The agent has access to a readDocs tool that can load:

  • Planning documentation (planning.md)
  • Models documentation (models.md)
  • Tools documentation (tools.md)

Development

  • Watch Mode: bun run eval - Automatically re-runs tests on file changes
  • Linting: bun run lint - Check code quality
  • Fix: bun run fix - Auto-fix linting issues

Key Features

  • Model Wrapping: Uses wrapAISDKModel for tracing and caching
  • Streaming Support: Handles streaming responses from the agent
  • Custom Columns: Each eval defines custom columns for better visibility
  • Metadata: Scorers return rich metadata for debugging
  • Scenario Tracking: Each test case has a scenario identifier

Example Output

Evalite provides:

  • Score tables with pass/fail indicators
  • Detailed scorer descriptions and reasoning
  • Custom columns (scenario, response length, tool calls, etc.)
  • Metadata for debugging failed tests

Notes

  • The helpfulness eval is currently skipped (marked with evalite.skip)
  • Some evals use only: true to run specific test cases during development
  • Tool calls are extracted from agent steps for analysis
  • Full conversation history is passed to scorers for context
Mark-Life/eval-example | GitHunt