ARGUS-AI

Production-Grade LLM Observability in 3 Lines of Code

ARGUS-AI is the G-ARVIS scoring engine for monitoring LLM application quality in production. It evaluates every LLM response across six orthogonal dimensions: Groundedness, Accuracy, Reliability, Variance, Inference Cost, and Safety.

Your LLM app is degrading right now. You just can't see it yet.

import argus_ai

argus = argus_ai.init()
result = argus.evaluate(prompt=prompt, response=response, context=context)

That's it. Every LLM call now has a quality score.

Why ARGUS

LLM outputs degrade silently. Models update, prompts drift, context windows overflow, and costs creep. Traditional monitoring catches latency and errors. It does not catch a model that starts hallucinating 12% more after a provider update, or a prompt that silently loses grounding when context exceeds 80K tokens.

G-ARVIS catches it.

Dimension	What It Measures	Why It Matters
Groundedness	Is the response grounded in provided context?	Hallucination detection
Accuracy	Does it match ground truth / internal consistency?	Factual correctness
Reliability	Format consistency, completeness, latency SLA	Structural quality
Variance	Output determinism and confidence signals	Consistency across runs
Inference Cost	Token efficiency, cost-per-word, latency-to-value	Budget control
Safety	PII leakage, toxicity, injection, harmful content	Compliance and trust

Plus three agentic evaluation metrics for autonomous workflows:

Metric	Formula	What It Measures
ASF (Agent Stability Factor)	completion × (1 - failure_rate) × consistency	Workflow completion reliability
ERR (Error Recovery Rate)	recovered / failed	Self-healing capability
CPCS (Cost Per Completed Step)	total_cost / completed_steps	Economic efficiency per step

Install

pip install argus-ai

With provider integrations:

pip install argus-ai[anthropic]    # Anthropic Claude wrapper
pip install argus-ai[openai]       # OpenAI wrapper
pip install argus-ai[prometheus]   # Prometheus export
pip install argus-ai[opentelemetry] # OTEL export
pip install argus-ai[all]          # Everything

Quick Start

Basic Evaluation

import argus_ai

argus = argus_ai.init(profile="enterprise")

result = argus.evaluate(
    prompt="What causes climate change?",
    response="Greenhouse gas emissions from fossil fuels are the primary driver.",
    context="Climate change is driven by human activities releasing greenhouse gases.",
    model_name="claude-sonnet-4",
    latency_ms=1200.0,
    input_tokens=45,
    output_tokens=30,
    cost_usd=0.002,
)

print(f"Composite: {result.garvis_composite:.3f}")  # 0.847
print(f"Passing:   {result.passing}")                # True
print(f"Safety:    {result.safety:.3f}")             # 0.950

Agentic Workflow Evaluation

from argus_ai.types import AgenticEvalRequest

argus = argus_ai.init(profile="agentic")

workflow = AgenticEvalRequest(
    prompt="Research competitors and generate report",
    response="Report generated with 5 competitor analyses.",
    steps_planned=8,
    steps_completed=7,
    steps_failed=2,
    steps_recovered=1,
    retries=3,
    total_cost_usd=0.45,
)

result, agentic_metrics = argus.evaluate_agentic(workflow)

for m in agentic_metrics:
    print(f"{m.name}: {m.score:.3f}")
# AgentStabilityFactor: 0.612
# ErrorRecoveryRate: 0.600
# CostPerCompletedStep: 0.357

Drop-In Provider Wrappers

Anthropic Claude:

from argus_ai.integrations.anthropic import InstrumentedAnthropic

argus = argus_ai.init()
client = InstrumentedAnthropic(argus=argus)

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain transformers"}],
)

# G-ARVIS score automatically attached
print(response._argus_score.garvis_composite)

OpenAI:

from argus_ai.integrations.openai import InstrumentedOpenAI

client = InstrumentedOpenAI(argus=argus_ai.init())

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain transformers"}],
)

print(response._argus_score.garvis_composite)

Threshold Monitoring with Alerting

from argus_ai.monitoring.thresholds import ThresholdConfig
from argus_ai.monitoring.alerts import AlertRule, AlertSeverity

config = ThresholdConfig(
    composite_min=0.80,
    safety_min=0.90,
    window_size=100,
    breach_ratio=0.15,
)

rules = [
    AlertRule(
        dimension="safety",
        threshold=0.85,
        severity=AlertSeverity.CRITICAL,
        message="Safety below critical threshold",
    ),
]

argus = argus_ai.init(
    profile="healthcare",
    thresholds=config,
    alert_rules=rules,
    exporters=["console", "prometheus"],
    on_alert=lambda msg, result: pagerduty.trigger(msg),
)

Decorator Instrumentation

from argus_ai.sdk.decorators import argus_evaluate

argus = argus_ai.init()

@argus_evaluate(argus, model_name="claude-sonnet-4")
def generate(prompt: str) -> str:
    return anthropic_client.messages.create(
        model="claude-sonnet-4-20250514",
        messages=[{"role": "user", "content": prompt}]
    ).content[0].text

Weight Profiles

G-ARVIS weights are configurable per deployment scenario:

Profile	G	A	R	V	I	S	Best For
`enterprise`	0.20	0.20	0.15	0.15	0.10	0.20	General production
`healthcare`	0.25	0.25	0.15	0.10	0.05	0.20	HIPAA workloads
`finance`	0.20	0.25	0.20	0.10	0.05	0.20	SOX/GAAP compliance
`consumer`	0.15	0.15	0.20	0.15	0.20	0.15	Cost-sensitive apps
`agentic`	0.15	0.15	0.25	0.20	0.10	0.15	Autonomous agents

Custom weights:

from argus_ai.scoring.garvis import GarvisWeights

argus = argus_ai.init(weights=GarvisWeights(safety=0.40, accuracy=0.30))

Metrics Export

Prometheus

argus = argus_ai.init(exporters=["prometheus"])

Exposes: argus_garvis_composite, argus_garvis_{dimension}, argus_evaluation_duration_ms, argus_alerts_total

OpenTelemetry

argus = argus_ai.init(exporters=["opentelemetry"])

Compatible with Datadog, New Relic, Honeycomb, Grafana Cloud, and any OTLP backend.

Architecture

See ARCHITECTURE.md for the full system design and open core split.

argus-ai (Open Source)          ARGUS Platform (Commercial)
├── G-ARVIS Scorer              ├── Autonomous Correction Loop
├── 3-Line SDK                  ├── Prompt Optimizer
├── Threshold Monitor           ├── LLM-as-Judge Evaluation
├── ASF/ERR/CPCS Metrics        ├── Multi-Run Variance Analysis
├── Prometheus/OTEL Export      ├── Dashboard UI
└── Anthropic/OpenAI Wrappers   └── SOC2/HIPAA Compliance

Performance

G-ARVIS heuristic scoring runs in sub-5ms per evaluation with zero external dependencies at runtime.

Benchmark	Value
Single evaluation	< 3ms
Batch (1000 requests)	< 2.5s
Memory overhead	< 5MB
Dependencies (core)	3 (pydantic, numpy, structlog)

Roadmap

Contributing

Contributions welcome. See CONTRIBUTING.md for guidelines.

git clone https://github.com/anilatambharii/argus-ai.git
cd argus-ai
pip install -e ".[dev]"
pytest tests/ -v

About

Built by Anil Prasad at Ambharii Labs.

G-ARVIS framework published in "Field Notes: Production AI" on LinkedIn.

ARGUS: Autonomous Runtime Guardian for Unified Systems.

License

Apache 2.0. See LICENSE.

anilatambharii/argus-ai