152 results for “topic:evals”
From the team behind Gatsby, Mastra is a framework for building AI-powered applications and agents with a modern TypeScript stack.
AI Observability & Evaluation
Python SDK for AI agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frameworks including CrewAI, Agno, OpenAI Agents SDK, Langchain, Autogen, AG2, and CamelAI.
Build, Evaluate, and Optimize AI Systems. Includes evals, RAG, agents, fine-tuning, synthetic data generation, dataset management, MCP, and more.
AI observability platform for production LLM and agent systems.
Evaluation and Tracking for LLM Experiments and AI Agents
Laminar - open-source observability platform purpose-built for AI agents. YC S24.
Open-source, production-ready customer service with built-in evals and monitoring
Evaluate your LLM-powered apps with TypeScript
🥤 RAGLite is a Python toolkit for Retrieval-Augmented Generation (RAG) with DuckDB or PostgreSQL
Harbor is a framework for running agent evaluations and creating and using RL environments.
Cloud-synced dashboards for OpenCode and Claude Code. Track sessions, search with semantic lookup, export eval datasets.
Test Generation for Prompts
[NeurIPS 2024] Official code for HourVideo: 1-Hour Video Language Understanding
Vivaria is METR's tool for running evaluations and conducting agent elicitation research.
A Node.js package and GitHub Action for evaluating MCP (Model Context Protocol) tool implementations using LLM-based scoring. This helps ensure your MCP server's tools are working correctly and performing well.
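As an aside, the general pattern behind LLM-based scoring of a tool call can be sketched in a few lines. The snippet below is a generic illustration only, not this package's API: the scoreToolResult and ToolCallRecord names, the judge prompt, and the gpt-4o-mini model choice are all assumptions made for the example.

```ts
// Illustrative sketch of LLM-as-judge scoring for a single MCP tool call.
// All names here (ToolCallRecord, scoreToolResult) are hypothetical.
import OpenAI from "openai";

interface ToolCallRecord {
  toolName: string;
  arguments: Record<string, unknown>;
  result: string; // text the MCP tool returned
}

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

export async function scoreToolResult(call: ToolCallRecord): Promise<number> {
  // Ask a judge model to grade how well the tool's output satisfies its input.
  const prompt = [
    `Tool: ${call.toolName}`,
    `Arguments: ${JSON.stringify(call.arguments)}`,
    `Result: ${call.result}`,
    "",
    "On a scale of 1-5, how well does the result satisfy the arguments?",
    "Reply with a single digit and nothing else.",
  ].join("\n");

  const response = await client.chat.completions.create({
    model: "gpt-4o-mini", // assumed judge model; any chat model works
    messages: [{ role: "user", content: prompt }],
  });

  // Constraining the reply to one digit keeps parsing trivial.
  const text = response.choices[0].message.content ?? "";
  const score = Number.parseInt(text.trim(), 10);
  if (Number.isNaN(score) || score < 1 || score > 5) {
    throw new Error(`Judge returned an unparseable score: "${text}"`);
  }
  return score;
}
```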
Evalica, your favourite evaluation toolkit
Agent ensembles to design, generate, and select the best code for every task.
AI system design guide for engineers building production AI systems and evals.
An MCP Evaluation Library
AgentEval is a comprehensive .NET toolkit for AI agent evaluation: tool usage validation, RAG quality metrics, stochastic evaluation, and model comparison. It is built first for Microsoft Agent Framework (MAF) and Microsoft.Extensions.AI. What RAGAS, PromptFoo, and DeepEval do for Python, AgentEval does for .NET.
A library for evaluating Retrieval-Augmented Generation (RAG) systems using traditional methods.
Benchmarking Large Language Models for FHIR
Evals for Nuxt to test AI model competency at Nuxt.
Go Artificial Intelligence (GAI) helps you work with foundation models, large language models, and other AI models.
A benchmark suite for evaluating how coding models solve real React Native tasks.
An opinionated list of awesome Pydantic-AI frameworks, libraries, software and resources.
Code release for "CURIE: Evaluating LLMs On Multitask Scientific Long Context Understanding and Reasoning", ICLR 2025
Open source framework for evaluating AI Agents