26 results for “topic:gsm8k”
Small and Efficient Mathematical Reasoning LLMs
[ACL 2023] Solving Math Word Problems via Cooperative Reasoning induced Language Models (LLMs + MCTS + Self-Improvement)
Reproducible Language Agent Research
DICE: Detecting In-distribution Data Contamination with LLM's Internal State
Evaluating LLMs' Mathematical and Coding Competency through Ontology-guided Interventions
Fine-tuning Qwen3 with GRPO and SFT using Unsloth: Reasoning and Non-Reasoning Datasets
Short-CoT distilled GSM8K dataset generated with OpenAI gpt-oss-120b.
GSM8K-Consistency is a benchmark database for analyzing the consistency of Arithmetic Reasoning on GSM8K.
An evaluation of prompting techniques (Zero-Shot CoT, Few-Shot, Self-Consistency) on the Mistral-7B model for mathematical reasoning. This project systematically benchmarks 7 distinct methods on the GSM8K dataset.
Multi-path reasoning with dynamic chains and consensus scoring for improved GSM8K benchmark performance.
Nano R1 is a reasoning model trained with reinforcement learning, focused on decision-making and adaptability in dynamic environments. Developed in Python and hosted on Hugging Face.
Hard Reasoning Benchmark filtered with disagreement scores
AlphaZero-style RL training for LLMs using MCTS on mathematical reasoning tasks (GSM8K). Student model explores reasoning paths guided by teacher ensembles and reward signals.
You Don't Need Prompt Engineering Anymore: The Prompting Inversion
A minimal JEPA-based language model demonstrating latent-space reasoning on GSM8K using a single decoder-only Transformer.
Comprehensive benchmarking framework for RLVR/RLHF libraries on GSM8K mathematical reasoning dataset
Transforming weak prompts into reasoning machines using Textual Gradients and AdalFlow. Runs on Colab.
A tool to evaluate and compare local LLMs running on Ollama or LM Studio under identical conditions using deepeval's public benchmarks (MMLU, TruthfulQA, GSM8K).
In this project, we fine-tune GPT-OSS-20B on OpenAI's GSM8K dataset.
Empirical study comparing process-based vs. outcome-based LLM supervision on GSM8K — evaluating accuracy, interpretability, and legal/regulatory compliance implications (GDPR, EU AI Act).
Dataset management and caching for AI research benchmarks
GSM8K benchmark results for fine-tuned Qwen3-4B and Qwen3-8B
Developing an autonomous system for prompt selection for Large Language Models (LLMs), enhancing performance across tasks by balancing generality and specificity. This project automates diverse, high-quality prompt creation and selection, reducing manual intervention and maximizing LLM utility across applications.
Dataset management library for ML experiments—loaders for SciFact, FEVER, GSM8K, HumanEval, MMLU, TruthfulQA, HellaSwag; git-like versioning with lineage tracking; transformation pipelines; quality validation with schema checks and duplicate detection; GenStage streaming for large datasets. Built for reproducible AI research.
STaR Self-Taught Reasoner implementation on GSM8K — Zero-Shot CoT vs Vanilla SFT vs STaR with Llama 3.2-3B