2,129 results for “topic:evaluation”
The open source AI engineering platform. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-quality AI agents, LLM applications, and ML models while controlling costs and managing access to models and data.
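As a taste of the experiment-tracking API underneath the platform, here is a minimal sketch with MLflow's Python client; the run name, parameter, and metric are invented for illustration, assuming `mlflow` is installed:

```python
# Minimal MLflow tracking sketch; names and values are illustrative only.
import mlflow

with mlflow.start_run(run_name="agent-eval-demo"):  # hypothetical run name
    mlflow.log_param("model", "gpt-4o-mini")        # hypothetical parameter
    mlflow.log_metric("answer_correctness", 0.87)   # hypothetical metric
```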
🪢 Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23
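A hedged sketch of the observability side, assuming the Langfuse Python SDK v2, where the decorator is imported from `langfuse.decorators` (newer releases expose it at the package top level):

```python
# Trace a function with Langfuse's @observe decorator (v2 import path assumed).
from langfuse.decorators import observe

@observe()
def answer(question: str) -> str:
    # Call your LLM here; the decorator records inputs, outputs, and timing.
    return "42"

answer("What is the meaning of life?")
```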
Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
LLM-powered framework for deep document understanding, semantic retrieval, and context-aware answers using the RAG paradigm.
Supercharge Your LLM Application Evaluations 🚀
Test your prompts, agents, and RAGs. AI Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.
🤘 awesome-semantic-segmentation
Easily fine-tune, evaluate and deploy gpt-oss, Qwen3, DeepSeek-R1, or any open source LLM / VLM!
OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) over 100+ datasets.
Next-generation AI Agent Optimization Platform: CozeLoop addresses challenges in AI agent development by providing full-lifecycle management, from development and debugging through evaluation and monitoring.
🧊 Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC W23 🍓
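The "one line of code" claim typically means a proxy-style integration: point an existing SDK at the platform's gateway and every request is logged. Below is a sketch of that pattern with the OpenAI Python SDK, where the gateway URL and auth header are hypothetical placeholders rather than this platform's documented values:

```python
# Proxy-style observability sketch; base_url and header are hypothetical.
from openai import OpenAI

client = OpenAI(
    base_url="https://gateway.example.com/v1",              # placeholder gateway
    default_headers={"X-Observability-Key": "sk-example"},  # placeholder auth
)
# Subsequent client.chat.completions.create(...) calls flow through the proxy.
```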
Build, Evaluate, and Optimize AI Systems. Includes evals, RAG, agents, fine-tuning, synthetic data generation, dataset management, MCP, and more.
AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation
Python package for the evaluation of odometry and SLAM
Arbitrary expression evaluation for golang
The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.
Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks
One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks
Building a modern functional compiler from first principles. (http://dev.stephendiehl.com/fun/)
SuperCLUE: A comprehensive benchmark for general-purpose Chinese large models | A Benchmark for Foundation Models in Chinese
Klipse is a JavaScript plugin for embedding interactive code snippets in tech blogs.
The platform for LLM evaluations and AI agent testing
An open-source visual programming environment for battle-testing prompts to LLMs.
End-to-end Automatic Speech Recognition for Mandarin and English in TensorFlow
A unified evaluation framework for large language models
Laminar - open-source observability platform purpose-built for AI agents. YC S24.
A streamlined and customizable framework for efficient large model (LLM, VLM, AIGC) evaluation and performance benchmarking.
🤗 Evaluate: A library for easily evaluating machine learning models and datasets.
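For a concrete taste of the 🤗 Evaluate API, here is a minimal sketch, assuming the `evaluate` package (and scikit-learn, which backs the accuracy metric) is installed:

```python
# Load a metric from the Hugging Face Hub and compute it on toy data.
import evaluate

accuracy = evaluate.load("accuracy")
result = accuracy.compute(predictions=[0, 1, 1, 0], references=[0, 1, 0, 0])
print(result)  # {'accuracy': 0.75} (3 of 4 predictions match)
```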
UpTrain is an open-source unified platform to evaluate and improve Generative AI applications. We provide grades for 20+ preconfigured checks (covering language, code, and embedding use-cases), perform root-cause analysis on failure cases, and give insights on how to resolve them.
Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends