92 results for “topic:llm-benchmarking”
Salesforce Enterprise Deep Research
Hallucinations (Confabulations) Document-Based Benchmark for RAG. Includes human-verified questions and answers.
A Business-Driven Real-World Financial Benchmark for Evaluating LLMs
A comprehensive review of code-domain benchmarks in LLM research.
A comprehensive guide to LLM evaluation methods, designed to help identify the most suitable evaluation techniques for various use cases, promote best practices in LLM assessment, and critically assess the effectiveness of these methods.
A benchmark for prompt injection detection systems.
Examining how large language models (LLMs) perform across various synthetic regression tasks when given (input, output) examples in their context, without any parameter updates.
AlgoTune is a NeurIPS 2025 benchmark made up of 154 math, physics, and computer science problems. The goal is to write code that solves each problem faster than existing implementations.
A collection of LLM-related papers, theses, tools, datasets, courses, open-source models, and benchmarks.
A dynamic forecasting benchmark for LLMs
[CVPR 2025] Program synthesis for 3D spatial reasoning
LLM-KG-Bench is a framework and task collection for automated benchmarking of Large Language Models (LLMs) on Knowledge Graph (KG) related tasks.
(NeurIPS 2025) Official implementation for "MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?"
Take your LLM to the optometrist.
[AAAI 2025] ORQA is a new QA benchmark designed to assess the reasoning capabilities of LLMs in the specialized technical domain of Operations Research. The benchmark evaluates whether LLMs can emulate the knowledge and reasoning skills of OR experts when presented with complex optimization modeling tasks.
[MM 2025] A Multimodal Finance Benchmark for Expert-level Understanding and Reasoning
BizFinBench.v2: A Unified Offline–Online Bilingual Benchmark for Expert-Level Financial Capability Evaluation of LLMs
RTL-Repo: A Benchmark for Evaluating LLMs on Large-Scale RTL Design Projects - IEEE LAD'24
Test your local LLMs on the AIME problems
Benchmark evaluating LLMs on their ability to create and resist disinformation. Includes comprehensive testing across major models (Claude, GPT-4, Gemini, Llama, etc.) with standardized evaluation metrics.
LisanBench is a lightweight benchmark for LLMs that stresses forward planning, vocabulary depth, constraint adherence, attention, and long-context "stamina" all at once.
LLM Locust combines the simplicity of Locust with deep support for LLM-specific benchmarking
An app and set of methodologies designed to evaluate the performance of various Large Language Models (LLMs) on the text-to-SQL task. Our goal is to offer a standardized way to measure how well these models generate SQL queries from natural language descriptions.
FM-Leaderboard-er allows you to create a leaderboard to find the best LLM/prompt for your own business use case, based on your data, tasks, and prompts.
Official Codebase for "ER-Reason: A Benchmark Dataset for LLM-Based Clinical Reasoning in the Emergency Room"
A platform for monitoring the chip situation.
Structured Prompting Enables More Robust Evaluation of Language Models
[Paper][EMNLP 2025] SKA-Bench: A Fine-Grained Benchmark for Evaluating Structured Knowledge Understanding of LLMs
A framework for analyzing how AGI/ASI might emerge from decentralized, adaptive systems rather than from a single model deployment. It also aims to present orientation as a dynamic, self-evolving Magna Carta, helping to guide the emergence of such phenomena.
We introduce a benchmark for testing how well LLMs can find vulnerabilities in cryptographic protocols. By combining LLMs with symbolic reasoning tools like Tamarin, we aim to improve the efficiency and thoroughness of protocol analysis, paving the way for future AI-powered cybersecurity defenses.