1,432 results for “topic:ai-safety”
A curated list of awesome responsible machine learning resources.
Safe RLHF: Constrained Value Alignment via Safe Reinforcement Learning from Human Feedback
Secrets of RLHF in Large Language Models Part I: PPO
UQLM (Uncertainty Quantification for Language Models) is a Python package for UQ-based LLM hallucination detection
Deliver safe & effective language models
Cordum (cordum.io) is a platform-only control plane for autonomous AI agents and external workers. It uses NATS for the bus, Redis for state and payload pointers, and CAP v2 wire contracts for jobs, results, and heartbeats. Workers and product packs live outside this repo.
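The Cordum entry above describes a jobs/results/heartbeats architecture over NATS and Redis. Below is a minimal, hypothetical worker-loop sketch of that pattern; the subject names (cap.v2.jobs/results/heartbeats), the message fields, and the payload-pointer key are illustrative assumptions, since the actual CAP v2 wire contracts are not shown in this listing.

```python
# Hypothetical worker sketch: subscribe to a jobs subject, resolve the payload
# pointer from Redis, publish a result, and heartbeat periodically.
# Subject names and message shapes are assumptions, not the real CAP v2 contract.
import asyncio
import json

import nats
import redis.asyncio as redis


async def run_worker(worker_id: str = "worker-1"):
    nc = await nats.connect("nats://localhost:4222")
    r = redis.Redis()

    async def handle_job(msg):
        job = json.loads(msg.data)                 # assumed shape: {"id", "payload_key"}
        payload = await r.get(job["payload_key"])  # payload pointer lives in Redis
        result = {"job_id": job["id"], "status": "ok", "bytes": len(payload or b"")}
        await nc.publish("cap.v2.results", json.dumps(result).encode())

    await nc.subscribe("cap.v2.jobs", cb=handle_job)

    while True:  # periodic heartbeat so the control plane can track liveness
        await nc.publish("cap.v2.heartbeats", json.dumps({"worker": worker_id}).encode())
        await asyncio.sleep(5)


asyncio.run(run_worker())
```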
PromptInject is a framework that assembles prompts in a modular fashion to provide a quantitative analysis of the robustness of LLMs to adversarial prompt attacks. 🏆 Best Paper Awards @ NeurIPS ML Safety Workshop 2022
Open Source LLM toolkit to build trustworthy LLM applications. TigerArmor (AI safety), TigerRAG (embedding, RAG), TigerTune (fine-tuning)
Intervention layer with audit logs for OpenClaw agents. Browser-aware. Trajectory-aware. Human-routable.
Aligning AI With Shared Human Values (ICLR 2021)
[NeurIPS '23 Spotlight] Thought Cloning: Learning to Think while Acting by Imitating Human Thinking
🦀 Prevents outdated Rust code suggestions from AI assistants. This MCP server fetches current crate docs, uses embeddings/LLMs, and provides accurate context via a tool call.
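As a rough illustration of the pattern that entry describes (an MCP server exposing a docs-fetching tool an assistant can call), here is a sketch using the official `mcp` Python SDK's FastMCP helper. The tool name, the docs.rs URL layout, and the omission of the embeddings/LLM layer are simplifying assumptions, not that repository's actual implementation.

```python
# Sketch of an MCP tool that fetches current crate docs so an assistant can
# ground its Rust suggestions. URL layout and tool name are assumptions.
import urllib.request

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("crate-docs")


@mcp.tool()
def fetch_crate_docs(crate: str) -> str:
    """Return the latest docs.rs page for a crate as raw HTML."""
    url = f"https://docs.rs/crate/{crate}/latest"  # assumed URL layout
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    mcp.run()  # serves the tool over stdio for an MCP-compatible client
```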
An unrestricted attack based on diffusion models that can achieve both good transferability and imperceptibility.
LangFair is a Python library for conducting use-case level LLM bias and fairness assessments
[AAAI 2025 oral] Official repository of Imitate Before Detect: Aligning Machine Stylistic Preference for Machine-Revised Text Detection
RuLES: a benchmark for evaluating rule-following in language models
📚 A curated list of papers & technical articles on AI Quality & Safety
[ICLR'24 Spotlight] A language model (LM)-based emulation framework for identifying the risks of LM agents with tool use
Toolkits for creating a human-in-the-loop approval layer to monitor and guide AI agent workflows in real time.
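A minimal sketch of that approval-layer idea: a decorator that pauses before each agent tool call and asks a human to approve it. The `with_approval` wrapper and the example tool are illustrative only, not the API of any toolkit in this list.

```python
# Human-in-the-loop gate: every wrapped tool call is shown to a reviewer and
# only executes after explicit approval.
from typing import Any, Callable


def with_approval(tool: Callable[..., Any]) -> Callable[..., Any]:
    """Ask a human to approve each call before the agent's tool runs."""
    def gated(*args, **kwargs):
        print(f"Agent wants to call {tool.__name__} with {args} {kwargs}")
        if input("Approve? [y/N] ").strip().lower() != "y":
            return {"status": "rejected", "tool": tool.__name__}
        return tool(*args, **kwargs)
    return gated


@with_approval
def delete_file(path: str) -> dict:
    # A risky action the agent should not take unsupervised.
    return {"status": "deleted", "path": path}
```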
Code accompanying the paper Pretraining Language Models with Human Preferences
BeaverTails is a collection of datasets designed to facilitate research on safety alignment in large language models (LLMs).
How to Make Safe AI? Let's Discuss! 💡|💬|🙌|📚
Measuring how well CLI agents like Claude Code or Codex CLI can post-train base LLMs on a single H100 GPU in 10 hours
An attack to induce hallucinations in LLMs
Governance gateway for AI agents — bounded, auditable, session-aware control with MCP proxy, shell proxy & HTTP API. Works with Cursor, Claude Code, Codex, and any MCP-compatible agent.
[CCS'24] SafeGen: Mitigating Unsafe Content Generation in Text-to-Image Models
Dialectical reasoning architecture for LLMs (Thesis → Antithesis → Synthesis)
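The Thesis → Antithesis → Synthesis flow above can be sketched as three chained model calls. The prompts and the `complete` callable below are illustrative assumptions, not that repository's actual architecture.

```python
# Dialectical loop: produce an answer, critique it, then reconcile both.
from typing import Callable


def dialectical_answer(question: str, complete: Callable[[str], str]) -> str:
    """Run thesis -> antithesis -> synthesis over any text-completion callable."""
    thesis = complete(f"Answer directly: {question}")
    antithesis = complete(f"Critique this answer and argue the opposite case:\n{thesis}")
    synthesis = complete(
        "Reconcile the answer and the critique into one balanced final reply.\n"
        f"Answer: {thesis}\nCritique: {antithesis}"
    )
    return synthesis


# Usage: dialectical_answer("Is this deployment safe?", complete=my_llm_call)
# where my_llm_call wraps whatever chat-completion client you use.
```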
A curated list of resources for activation engineering
Reading list for adversarial perspective and robustness in deep reinforcement learning.
A curated list of awesome academic research, books, code of ethics, courses, databases, data sets, frameworks, institutes, maturity models, newsletters, principles, podcasts, regulations, reports, responsible scale policies, tools and standards related to Responsible, Trustworthy, and Human-Centered AI.