26 results for “topic:gsm8k”
Small and Efficient Mathematical Reasoning LLMs
[ACL 2023] Solving Math Word Problems via Cooperative Reasoning induced Language Models (LLMs + MCTS + Self-Improvement)
Reproducible Language Agent Research
DICE: Detecting In-distribution Data Contamination with LLM's Internal State
Evaluating LLMs' Mathematical and Coding Competency through Ontology-guided Interventions
Fine-tuning Qwen3 with GRPO and SFT using Unsloth: Reasoning and Non-Reasoning Datasets
Short-CoT distilled GSM8K dataset generated with OpenAI gpt-oss-120b.
GSM8K-Consistency is a benchmark database for analyzing the consistency of Arithmetic Reasoning on GSM8K.
An evaluation of prompting techniques (Zero-Shot CoT, Few-Shot, Self-Consistency) on the Mistral-7B model for mathematical reasoning. This project systematically benchmarks 7 distinct methods on the GSM8K dataset.
Multi-path reasoning with dynamic chains and consensus scoring for improved GSM8K benchmark performance.
Nano R1 is a reasoning model trained with reinforcement learning, focused on decision-making and adaptability in dynamic environments. Developed in Python and hosted on Hugging Face.
Hard Reasoning Benchmark filtered with disagreement scores
AlphaZero-style RL training for LLMs using MCTS on mathematical reasoning tasks (GSM8K). Student model explores reasoning paths guided by teacher ensembles and reward signals.
You Don't Need Prompt Engineering Anymore: The Prompting Inversion
A minimal JEPA-based language model demonstrating latent-space reasoning on GSM8K using a single decoder-only Transformer.
Comprehensive benchmarking framework for RLVR/RLHF libraries on GSM8K mathematical reasoning dataset
Transforming weak prompts into reasoning machines using Textual Gradients and AdalFlow. Runs on Colab.
A tool to evaluate and compare local LLMs running on Ollama or LM Studio under identical conditions using deepeval's public benchmarks (MMLU, TruthfulQA, GSM8K).
In this project, we fine-tune GPT-OSS-20B on OpenAI's GSM8K dataset.
Empirical study comparing process-based vs. outcome-based LLM supervision on GSM8K — evaluating accuracy, interpretability, and legal/regulatory compliance implications (GDPR, EU AI Act).
Dataset management and caching for AI research benchmarks
GSM8K benchmark results for fine-tuned Qwen3-4B and Qwen3-8B
Developing an autonomous system for prompt selection for Large Language Models (LLMs), enhancing performance across tasks by balancing generality and specificity. This project automates diverse, high-quality prompt creation and selection, reducing manual intervention and maximizing LLM utility across applications.
Dataset management library for ML experiments—loaders for SciFact, FEVER, GSM8K, HumanEval, MMLU, TruthfulQA, HellaSwag; git-like versioning with lineage tracking; transformation pipelines; quality validation with schema checks and duplicate detection; GenStage streaming for large datasets. Built for reproducible AI research.
STaR Self-Taught Reasoner implementation on GSM8K — Zero-Shot CoT vs Vanilla SFT vs STaR with Llama 3.2-3B