"topic:llm-eval" — Search

67 results for “topic:llm-eval”

Test your prompts, agents, and RAGs. AI Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.

TypeScript11.4k1.1kUpdated just now

cici-cdcicdevaluationevaluation-frameworkllmllm-evalllm-evaluationllm-evaluation-frameworkllmopspentestingprompt-engineeringprompt-testingpromptsragred-teamingtestingvulnerability-scanners

Arize-ai/phoenix

AI Observability & Evaluation

Jupyter Notebook8.8k742Updated just now

agentsai-monitoringai-observabilityaiengineeringanthropicdatasetsevalslangchainllamaindexllm-evalllm-evaluationllmopsllmsopenaiprompt-engineeringsmolagents

Giskard-AI/giskard-oss

🐢 Open-Source Evaluation & Testing library for LLM Agents

Python5.2k404Updated 4 hours ago

agent-evaluationai-red-teamai-securityai-testingfairness-aillmllm-evalllm-evaluationllm-securityllmopsml-testingml-validationmlopsrag-evaluationred-team-toolsresponsible-aitrustworthy-ai

truera/trulens

Evaluation and Tracking for LLM Experiments and AI Agents

Python3.1k251Updated just now

agent-evaluationagentopsai-agentsai-monitoringai-observabilityevalsexplainable-mlllm-evalllm-evaluationllmopsllmsmachine-learningneural-networks

datachain-ai/datachain

Analytics, Versioning and ETL for multimodal data: video, audio, PDFs, images

Python2.7k135Updated 9 hours ago

aicvdata-analyticsdata-wranglingembeddingsllmllm-evalmachine-learningmlopsmultimodal

uptrain-ai/uptrain

UpTrain is an open-source unified platform to evaluate and improve Generative AI applications. We provide grades for 20+ preconfigured checks (covering language, code, embedding use-cases), perform root cause analysis on failure cases and give insights on how to resolve them.

Python2.3k202Updated 1 day ago

autoevaluationevaluationexperimentationhallucination-detectionjailbreak-detectionllm-evalllm-promptingllm-testllmopsmachine-learningmonitoringopenai-evalsprompt-engineeringroot-cause-analysis

AI-QL/tuui

A desktop MCP client designed as a tool unitary utility integration, accelerating AI adoption through the Model Context Protocol (MCP) and enabling cross-vendor LLM API orchestration.

TypeScript1.1k103Updated 1 day ago

agentagentic-aiaiai-playgroundanthropicclaudedeepseekdxtllmllm-evalmcpmcp-clientmcp-hostmcp-inspectormcpbmodel-context-protocolopenai-apipromptqwentesting

athina-ai/athina-evals

Python SDK for running evaluations on LLM generated responses

Python29821Updated 1 week ago

evaluationevaluation-frameworkevaluation-metricsllm-evalllm-evaluationllm-evaluation-toolkitllm-opsllmops

verifywise-ai/verifywise

Complete AI governance and LLM Evals platform with support for EU AI Act, ISO 42001, NIST AI RMF and 20+ more AI frameworks and regulations. Join our Discord channel: https://discord.com/invite/d3k3E4uEpR

TypeScript23488Updated just now

aiai-auditingai-complianceai-governanceai-governance-modelai-riskauditauditingcomplianceeu-ai-actgovernancegrciso27001iso42001llm-evalllm-evaluationnist-ai-rmfrisk-management

Re-Align/just-eval

A simple GPT-based evaluation tool for multi-aspect, interpretable assessment of LLMs.

Python907Updated 2 months ago

evaluationgpt4llmllm-evalllm-evaluationllm-evaluation-toolkit

parea-ai/parea-sdk-py

Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)

Python829Updated 2 months ago

generative-aigood-first-issuellmllm-evalllm-evaluationllm-evaluation-frameworkllm-evaluation-toolkitllm-toolsllmopsllms-benchmarkingmetricsprompt-engineering

QuesmaOrg/BinaryAudit

An open-source benchmark for evaluating AI agents' ability to find backdoors hidden in compiled binaries.

Shell713Updated 2 days ago

aibenchmarkbinary-analysiscybersecurityllm-evalreverse-engineering

kuk/rulm-sbs2

Бенчмарк сравнивает русские аналоги ChatGPT: Saiga, YandexGPT, Gigachat

Jupyter Notebook612Updated 3 months ago

llm-evalrussian-specific

grigio/llm-eval-simple

llm-eval-simple is a simple LLM evaluation framework with intermediate actions and prompt pattern selection

Python591Updated 1 week ago

llmllm-evalllm-evaluation-benchmark

whitecircle-ai/circle-guard-bench

First-of-its-kind AI benchmark for evaluating the protection capabilities of large language model (LLM) guard systems (guardrails and safeguards)

Python513Updated 2 days ago

aibenchmarkbenchmarkingguardrailguardrailsjailbreaklarge-language-modellarge-language-modelsllmllm-as-a-judgellm-evalllm-evaluationllm-jailbreaksllm-securitysafeguard

izam-mohammed/ragrank

🎯 Your free LLM evaluation toolkit helps you assess the accuracy of facts, how well it understands context, its tone, and more. This helps you see how good your LLM applications are.

Python4514Updated 2 weeks ago

evaluationlanguage-modelllmllm-evalllmopsmachine-learningprompt-engineeringrag

multinear/multinear

Develop reliable AI apps

Python441Updated 3 months ago

evaluationllmllm-evalllm-evaluationllm-evaluation-frameworkllmsllms-benchmarkingreliability

Striveworks/valor

Valor is a lightweight, numpy-based library designed for fast and seamless evaluation of machine learning models.

Python404Updated 1 day ago

classificationcomputer-visionevaluationevaluation-metricsimage-segmentationllm-evalmlopsmodel-evaluationnlpobject-detectiontext-generation

alan-turing-institute/prompto

An open source library for asynchronous querying of LLM endpoints

Python362Updated 3 weeks ago

deep-learninghut23large-language-modelsllm-evalllm-evaluationllmsmachine-learningnatural-language-processingnlppythontransformertransformers

thedataquarry/structured-outputs

Structured output benchmarks comparing DSPy and BAML with different LLMs

Python271Updated 1 week ago

bamldspyinformation-extractionllmllm-evalllm-evaluationstructured-evaluationstructured-output

genia-dev/vibraniumdome

LLM Security Platform.

Python265Updated 3 months ago

adversarial-attackschatgptlarge-language-modelllmllm-agentllm-evalllm-evaluationllm-firewallllm-frameworkllm-inferencellm-securityllm-servingllmopsopenaiprompt-engineeringprompt-injectionprompt-injection-toolpromptssecurity

Supahands/llm-comparison-backend

This is an opensource project allowing you to compare two LLM's head to head with a given prompt, this section will be regarding the backend of this project, allowing for llm api's to be incorporated and used in the front-end

Python223Updated 1 month ago

aichatgptllmllm-apillm-comparisonllm-eval

paradime-io/dbt-llm-evals

The warehouse-native LLM evaluation package for dbt™ - monitor AI quality without data egress

Python203Updated 1 week ago

aianthropicbigquerydatabricksdbtdbt-coredbt-packagesgeminillmllm-evalllm-evaluationopenaisnowflakevertex-ai

amplifying-ai/ai-product-bench

No description provided.

HTML193Updated 1 week ago

datasetevalsllm-eval

honeyhiveai/realignArchived

Realign is a testing and simulation framework for AI applications.

Python181Updated 2 months ago

aiaiengineeringalignmentevaluationllm-evalllm-evaluationllm-evaluation-frameworkllmopsllmsprompt-engineeringragred-teamingsimulation

attogram/ollama-multirun

Run a prompt against all, or some, of your models running on Ollama. Creates web pages with the output, performance statistics and model info. All in a single Bash shell script.

Shell162Updated 2 weeks ago

aiai-evaluation-toolsattogram-projectbash-scriptllm-evalllm-evaluationllm-evaluation-metricsollamaollama-appollama-interfacestatic-site-generator

Human-Centric-Machine-Learning/prediction-powered-ranking

Code for "Prediction-Powered Ranking of Large Language Models", NeurIPS 2024.

Jupyter Notebook91Updated 1 year ago

llm-evalllm-evaluationllm-evaluation-frameworkprediction-powered-inferencerank-setsranking-algorithm

IAAR-Shanghai/GuessArena

[ACL 2025] GuessArena: Guess Who I Am? A Self-Adaptive Framework for Evaluating LLMs in Domain-Specific Knowledge and Reasoning

Python91Updated 3 months ago

benchmarkchatgptdeepseekdomain-specific-evalevaluation-frameworkgamearenaguessarenaknowledge-evaluationlarge-language-modelsllm-evalopenaiqwenreasoning-evaluationreliable-evaluation

pyladiesams/eval-llm-based-apps-jan2025

Create an evaluation framework for your LLM based app. Incorporate it into your test suite. Lay the monitoring foundation.

Jupyter Notebook86Updated 2 months ago

llmllm-evalllm-evalsllm-evaluation-frameworkllm-evaluation-metricsllm-monitoringllm-testllm-testingllmopsllmsworkshop

prompt-foundry/python-sdk

The prompt engineering, prompt management, and prompt evaluation tool for Python

Python80Updated 7 months ago

llmllm-evalllm-evaluationopen-aiprompt-engineeringprompt-evaluationprompt-managementpythonpython3

Page 1 of 3