2,129 results for “topic:evaluation”
The open source AI engineering platform. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-quality AI agents, LLM applications, and ML models while controlling costs and managing access to models and data.
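As a taste of the experiment-tracking API underneath the platform, here is a minimal sketch with MLflow's Python client; the run name, parameter, and metric are invented for illustration, assuming `mlflow` is installed:

```python
# Minimal MLflow tracking sketch; names and values are illustrative only.
import mlflow

with mlflow.start_run(run_name="agent-eval-demo"):  # hypothetical run name
    mlflow.log_param("model", "gpt-4o-mini")        # hypothetical parameter
    mlflow.log_metric("answer_correctness", 0.87)   # hypothetical metric
```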
🪢 Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23
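A hedged sketch of the observability side, assuming the Langfuse Python SDK v2, where the decorator is imported from `langfuse.decorators` (newer releases expose it at the package top level):

```python
# Trace a function with Langfuse's @observe decorator (v2 import path assumed).
from langfuse.decorators import observe

@observe()
def answer(question: str) -> str:
    # Call your LLM here; the decorator records inputs, outputs, and timing.
    return "42"

answer("What is the meaning of life?")
```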
Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
LLM-powered framework for deep document understanding, semantic retrieval, and context-aware answers using the RAG paradigm.
Supercharge Your LLM Application Evaluations 🚀
Test your prompts, agents, and RAGs. AI Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.
🤘 awesome-semantic-segmentation
Easily fine-tune, evaluate and deploy gpt-oss, Qwen3, DeepSeek-R1, or any open source LLM / VLM!
OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) over 100+ datasets.
Next-generation AI Agent Optimization Platform: CozeLoop addresses challenges in AI agent development by providing full-lifecycle management, from development and debugging through evaluation and monitoring.
🧊 Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC W23 🍓
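The "one line of code" claim typically means a proxy-style integration: point an existing SDK at the platform's gateway and every request is logged. Below is a sketch of that pattern with the OpenAI Python SDK, where the gateway URL and auth header are hypothetical placeholders rather than this platform's documented values:

```python
# Proxy-style observability sketch; base_url and header are hypothetical.
from openai import OpenAI

client = OpenAI(
    base_url="https://gateway.example.com/v1",              # placeholder gateway
    default_headers={"X-Observability-Key": "sk-example"},  # placeholder auth
)
# Subsequent client.chat.completions.create(...) calls flow through the proxy.
```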
Build, Evaluate, and Optimize AI Systems. Includes evals, RAG, agents, fine-tuning, synthetic data generation, dataset management, MCP, and more.
AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation
Python package for the evaluation of odometry and SLAM
Arbitrary expression evaluation for golang
The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.
Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks
One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks
Building a modern functional compiler from first principles. (http://dev.stephendiehl.com/fun/)
SuperCLUE: A comprehensive benchmark for general-purpose Chinese large models | A Benchmark for Foundation Models in Chinese
Klipse is a JavaScript plugin for embedding interactive code snippets in tech blogs.
The platform for LLM evaluations and AI agent testing
An open-source visual programming environment for battle-testing prompts to LLMs.
End-to-end Automatic Speech Recognition for Mandarin and English in TensorFlow
A unified evaluation framework for large language models
Laminar - open-source observability platform purpose-built for AI agents. YC S24.
A streamlined and customizable framework for efficient large model (LLM, VLM, AIGC) evaluation and performance benchmarking.
🤗 Evaluate: A library for easily evaluating machine learning models and datasets.
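For a concrete taste of the 🤗 Evaluate API, here is a minimal sketch, assuming the `evaluate` package (and scikit-learn, which backs the accuracy metric) is installed:

```python
# Load a metric from the Hugging Face Hub and compute it on toy data.
import evaluate

accuracy = evaluate.load("accuracy")
result = accuracy.compute(predictions=[0, 1, 1, 0], references=[0, 1, 0, 0])
print(result)  # {'accuracy': 0.75} (3 of 4 predictions match)
```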
UpTrain is an open-source unified platform to evaluate and improve Generative AI applications. We provide grades for 20+ preconfigured checks (covering language, code, and embedding use-cases), perform root-cause analysis on failure cases, and give insights on how to resolve them.
Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends