1,432 results for “topic:ai-safety”
A curated list of awesome responsible machine learning resources.
Safe RLHF: Constrained Value Alignment via Safe Reinforcement Learning from Human Feedback
Secrets of RLHF in Large Language Models Part I: PPO
UQLM (Uncertainty Quantification for Language Models) is a Python package for UQ-based LLM hallucination detection
Deliver safe & effective language models
Cordum (cordum.io) is a platform-only control plane for autonomous AI agents and external workers. It uses NATS for the bus, Redis for state and payload pointers, and CAP v2 wire contracts for jobs, results, and heartbeats. Workers and product packs live outside this repo.
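The Cordum entry above describes a jobs/results/heartbeats architecture over NATS and Redis. Below is a minimal, hypothetical worker-loop sketch of that pattern; the subject names (cap.v2.jobs/results/heartbeats), the message fields, and the payload-pointer key are illustrative assumptions, since the actual CAP v2 wire contracts are not shown in this listing.

```python
# Hypothetical worker sketch: subscribe to a jobs subject, resolve the payload
# pointer from Redis, publish a result, and heartbeat periodically.
# Subject names and message shapes are assumptions, not the real CAP v2 contract.
import asyncio
import json

import nats
import redis.asyncio as redis


async def run_worker(worker_id: str = "worker-1"):
    nc = await nats.connect("nats://localhost:4222")
    r = redis.Redis()

    async def handle_job(msg):
        job = json.loads(msg.data)                 # assumed shape: {"id", "payload_key"}
        payload = await r.get(job["payload_key"])  # payload pointer lives in Redis
        result = {"job_id": job["id"], "status": "ok", "bytes": len(payload or b"")}
        await nc.publish("cap.v2.results", json.dumps(result).encode())

    await nc.subscribe("cap.v2.jobs", cb=handle_job)

    while True:  # periodic heartbeat so the control plane can track liveness
        await nc.publish("cap.v2.heartbeats", json.dumps({"worker": worker_id}).encode())
        await asyncio.sleep(5)


asyncio.run(run_worker())
```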
PromptInject is a framework that assembles prompts in a modular fashion to provide a quantitative analysis of the robustness of LLMs to adversarial prompt attacks. 🏆 Best Paper Awards @ NeurIPS ML Safety Workshop 2022
Open Source LLM toolkit to build trustworthy LLM applications. TigerArmor (AI safety), TigerRAG (embedding, RAG), TigerTune (fine-tuning)
Intervention layer with audit logs for OpenClaw agents. Browser-aware. Trajectory-aware. Human-routable.
Aligning AI With Shared Human Values (ICLR 2021)
[NeurIPS '23 Spotlight] Thought Cloning: Learning to Think while Acting by Imitating Human Thinking
🦀 Prevents outdated Rust code suggestions from AI assistants. This MCP server fetches current crate docs, uses embeddings/LLMs, and provides accurate context via a tool call.
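As a rough illustration of the pattern that entry describes (an MCP server exposing a docs-fetching tool an assistant can call), here is a sketch using the official `mcp` Python SDK's FastMCP helper. The tool name, the docs.rs URL layout, and the omission of the embeddings/LLM layer are simplifying assumptions, not that repository's actual implementation.

```python
# Sketch of an MCP tool that fetches current crate docs so an assistant can
# ground its Rust suggestions. URL layout and tool name are assumptions.
import urllib.request

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("crate-docs")


@mcp.tool()
def fetch_crate_docs(crate: str) -> str:
    """Return the latest docs.rs page for a crate as raw HTML."""
    url = f"https://docs.rs/crate/{crate}/latest"  # assumed URL layout
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    mcp.run()  # serves the tool over stdio for an MCP-compatible client
```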
An unrestricted attack based on diffusion models that can achieve both good transferability and imperceptibility.
LangFair is a Python library for conducting use-case level LLM bias and fairness assessments
[AAAI 2025 oral] Official repository of Imitate Before Detect: Aligning Machine Stylistic Preference for Machine-Revised Text Detection
RuLES: a benchmark for evaluating rule-following in language models
📚 A curated list of papers & technical articles on AI Quality & Safety
[ICLR'24 Spotlight] A language model (LM)-based emulation framework for identifying the risks of LM agents with tool use
Toolkits for creating a human-in-the-loop approval layer to monitor and guide AI agent workflows in real time.
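A minimal sketch of that approval-layer idea: a decorator that pauses before each agent tool call and asks a human to approve it. The `with_approval` wrapper and the example tool are illustrative only, not the API of any toolkit in this list.

```python
# Human-in-the-loop gate: every wrapped tool call is shown to a reviewer and
# only executes after explicit approval.
from typing import Any, Callable


def with_approval(tool: Callable[..., Any]) -> Callable[..., Any]:
    """Ask a human to approve each call before the agent's tool runs."""
    def gated(*args, **kwargs):
        print(f"Agent wants to call {tool.__name__} with {args} {kwargs}")
        if input("Approve? [y/N] ").strip().lower() != "y":
            return {"status": "rejected", "tool": tool.__name__}
        return tool(*args, **kwargs)
    return gated


@with_approval
def delete_file(path: str) -> dict:
    # A risky action the agent should not take unsupervised.
    return {"status": "deleted", "path": path}
```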
Code accompanying the paper Pretraining Language Models with Human Preferences
BeaverTails is a collection of datasets designed to facilitate research on safety alignment in large language models (LLMs).
How to Make Safe AI? Let's Discuss! 💡|💬|🙌|📚
Measuring how well CLI agents like Claude Code or Codex CLI can post-train base LLMs on a single H100 GPU in 10 hours
An attack to induce hallucinations in LLMs
Governance gateway for AI agents — bounded, auditable, session-aware control with MCP proxy, shell proxy & HTTP API. Works with Cursor, Claude Code, Codex, and any MCP-compatible agent.
[CCS'24] SafeGen: Mitigating Unsafe Content Generation in Text-to-Image Models
Dialectical reasoning architecture for LLMs (Thesis → Antithesis → Synthesis)
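The Thesis → Antithesis → Synthesis flow above can be sketched as three chained model calls. The prompts and the `complete` callable below are illustrative assumptions, not that repository's actual architecture.

```python
# Dialectical loop: produce an answer, critique it, then reconcile both.
from typing import Callable


def dialectical_answer(question: str, complete: Callable[[str], str]) -> str:
    """Run thesis -> antithesis -> synthesis over any text-completion callable."""
    thesis = complete(f"Answer directly: {question}")
    antithesis = complete(f"Critique this answer and argue the opposite case:\n{thesis}")
    synthesis = complete(
        "Reconcile the answer and the critique into one balanced final reply.\n"
        f"Answer: {thesis}\nCritique: {antithesis}"
    )
    return synthesis


# Usage: dialectical_answer("Is this deployment safe?", complete=my_llm_call)
# where my_llm_call wraps whatever chat-completion client you use.
```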
A curated list of resources for activation engineering
Reading list for adversarial perspective and robustness in deep reinforcement learning.
A curated list of awesome academic research, books, code of ethics, courses, databases, data sets, frameworks, institutes, maturity models, newsletters, principles, podcasts, regulations, reports, responsible scale policies, tools and standards related to Responsible, Trustworthy, and Human-Centered AI.