152 results for “topic:evals”
From the team behind Gatsby, Mastra is a framework for building AI-powered applications and agents with a modern TypeScript stack.
AI Observability & Evaluation
Python SDK for AI agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frameworks including CrewAI, Agno, OpenAI Agents SDK, Langchain, Autogen, AG2, and CamelAI.
Build, Evaluate, and Optimize AI Systems. Includes evals, RAG, agents, fine-tuning, synthetic data generation, dataset management, MCP, and more.
AI observability platform for production LLM and agent systems.
Evaluation and Tracking for LLM Experiments and AI Agents
Laminar - open-source observability platform purpose-built for AI agents. YC S24.
Open-source, production-ready customer service with built-in evals and monitoring
Evaluate your LLM-powered apps with TypeScript
🥤 RAGLite is a Python toolkit for Retrieval-Augmented Generation (RAG) with DuckDB or PostgreSQL
Harbor is a framework for running agent evaluations and creating and using RL environments.
Cloud-synced dashboards for OpenCode and Claude Code. Track sessions, search with semantic lookup, export eval datasets.
Test Generation for Prompts
[NeurIPS 2024] Official code for HourVideo: 1-Hour Video Language Understanding
Vivaria is METR's tool for running evaluations and conducting agent elicitation research.
A Node.js package and GitHub Action for evaluating MCP (Model Context Protocol) tool implementations using LLM-based scoring. This helps ensure your MCP server's tools are working correctly and performing well.
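As an aside, the general pattern behind LLM-based scoring of a tool call can be sketched in a few lines. The snippet below is a generic illustration only, not this package's API: the scoreToolResult and ToolCallRecord names, the judge prompt, and the gpt-4o-mini model choice are all assumptions made for the example.

```ts
// Illustrative sketch of LLM-as-judge scoring for a single MCP tool call.
// All names here (ToolCallRecord, scoreToolResult) are hypothetical.
import OpenAI from "openai";

interface ToolCallRecord {
  toolName: string;
  arguments: Record<string, unknown>;
  result: string; // text the MCP tool returned
}

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

export async function scoreToolResult(call: ToolCallRecord): Promise<number> {
  // Ask a judge model to grade how well the tool's output satisfies its input.
  const prompt = [
    `Tool: ${call.toolName}`,
    `Arguments: ${JSON.stringify(call.arguments)}`,
    `Result: ${call.result}`,
    "",
    "On a scale of 1-5, how well does the result satisfy the arguments?",
    "Reply with a single digit and nothing else.",
  ].join("\n");

  const response = await client.chat.completions.create({
    model: "gpt-4o-mini", // assumed judge model; any chat model works
    messages: [{ role: "user", content: prompt }],
  });

  // Constraining the reply to one digit keeps parsing trivial.
  const text = response.choices[0].message.content ?? "";
  const score = Number.parseInt(text.trim(), 10);
  if (Number.isNaN(score) || score < 1 || score > 5) {
    throw new Error(`Judge returned an unparseable score: "${text}"`);
  }
  return score;
}
```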
Evalica, your favourite evaluation toolkit
Agent ensembles to design, generate, and select the best code for every task.
AI system design guide for engineers building production AI systems and evals.
An MCP Evaluation Library
AgentEval is a comprehensive .NET toolkit for AI agent evaluation: tool usage validation, RAG quality metrics, stochastic evaluation, and model comparison. It is built first for Microsoft Agent Framework (MAF) and Microsoft.Extensions.AI. What RAGAS, PromptFoo, and DeepEval do for Python, AgentEval does for .NET.
A library for evaluating Retrieval-Augmented Generation (RAG) systems using traditional methods.
Benchmarking Large Language Models for FHIR
Evals for Nuxt to test AI model competency at Nuxt.
Go Artificial Intelligence (GAI) helps you work with foundation models, large language models, and other AI models.
A benchmark suite for evaluating how coding models solve real React Native tasks.
An opinionated list of awesome Pydantic-AI frameworks, libraries, software and resources.
Code release for "CURIE: Evaluating LLMs On Multitask Scientific Long Context Understanding and Reasoning", ICLR 2025
Open source framework for evaluating AI Agents