pravinva/superannuation-agent-multi-country

Agentic AI Retirement Advisor serving AU US UK and IN customers

agentbricks agentic-ai claude databricks llm mlflow

Building Production-Ready Agentic AI Applications on Databricks: A Deep Dive into the Multi-Country Pension Advisor

Author: Pravin Varma | Databricks Delivery Solutions Architect
Published: 2025

This project demonstrates how to build enterprise-grade agentic AI applications on Databricks using production best practices. We'll explore the ReAct agent framework, MLflow integration, observability patterns, and how to build scalable, maintainable agentic systems.

Executive Summary

In this reference implementation, we've built a production-ready agentic AI pension advisory system that showcases the best practices for developing agentic AI applications on Databricks. The system processes retirement planning queries across multiple countries (Australia, USA, UK, India) with enterprise-grade observability, cost optimization, and regulatory compliance.

Key Highlights:

ReAct Agentic Framework with intelligent reasoning and tool orchestration
MLflow Integration for prompt versioning, experiment tracking, and model governance
Production Observability with Lakehouse Monitoring and real-time dashboards
Cost Optimization achieving 80% cost reduction through intelligent cascade classification
Prompt Registry with versioning and A/B testing capabilities
Regulatory Compliance with comprehensive audit trails and LLM-as-a-Judge validation

Cost Efficiency: Processes queries at $0.003 per query (vs. traditional advisor at $150/hour), demonstrating massive cost savings while maintaining high-quality, personalized responses.

Architecture Overview

The ReAct Agent Pattern

At the heart of this system lies the ReAct (Reasoning-Acting-Observing) agentic loop, a proven pattern for building intelligent agents that can reason about problems, select appropriate tools, and iteratively refine their responses.

┌─────────────────────────────────────────────────────────────┐
│                     Streamlit UI Layer                        │
│              Multi-country Pension Advisory Portal            │
└───────────────────────────┬─────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────┐
│              ReAct Agentic Loop (react_loop.py)               │
│  ┌────────────────────────────────────────────────────────┐  │
│  │ 1. REASON: Analyze query → Select tools               │  │
│  │ 2. ACT: Execute SQL functions (Unity Catalog)          │  │
│  │ 3. OBSERVE: Analyze results → Refine understanding     │  │
│  │ 4. ITERATE: Continue until sufficient information      │  │
│  └────────────────────────────────────────────────────────┘  │
└───────────────────────────┬─────────────────────────────────┘
                            │
       ┌────────────────────┼────────────────────┐
       ▼                    ▼                    ▼
┌─────────────┐     ┌──────────────┐     ┌─────────────┐
│   Unity     │     │  Foundation  │     │   MLflow    │
│  Catalog    │     │    Models     │     │             │
│             │     │              │     │ • Traces    │
│ • Member    │     │ • Claude 4   │     │ • Prompts   │
│   Profiles  │     │ • GPT-4      │     │ • Metrics   │
│ • SQL Tools │     │ • Llama 3    │     │ • Dashboard │
│ • Audit Log │     │ • BGE (emb)   │     │ • Artifacts │
└─────────────┘     └──────────────┘     └─────────────┘

Why ReAct?

The ReAct pattern provides several advantages over traditional RAG or single-shot LLM approaches:

Iterative Reasoning: Agents can break down complex queries into multiple tool calls
Tool Orchestration: Dynamically selects the right tools based on context
Self-Correction: Observes results and adjusts strategy if needed
Transparency: Full visibility into reasoning steps for debugging and auditing

Our implementation (react_loop.py) separates the agentic loop from orchestration, making it reusable and testable.

Code Flow Architecture

Understanding how the components interact is crucial for maintaining and extending the system:

┌─────────────────────────────────────────────────────────────────┐
│                        app.py (UI Layer)                        │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │ User selects country & member profile                    │  │
│  │ User enters query → clicks "Get Recommendation"         │  │
│  └───────────────────────────┬──────────────────────────────┘  │
└──────────────────────────────┼──────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│              agent_processor.py (Orchestration)                 │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │ agent_query() function:                                  │  │
│  │ 1. Initialize observability (MLflow)                    │  │
│  │ 2. Track Phase 1: Data Retrieval                       │  │
│  │ 3. Create SuperAdvisorAgent instance                    │  │
│  │ 4. Track Phase 2: Anonymization                        │  │
│  │ 5. Call agent.process_query()                           │  │
│  │ 6. Track Phases 3-6: Classification, Tools, Synthesis  │  │
│  │ 7. Track Phase 7: Name Restoration                     │  │
│  │ 8. Track Phase 8: Audit Logging                        │  │
│  │ 9. End observability run                                │  │
│  └───────────────────────────┬──────────────────────────────┘  │
└───────────────────────────────┼──────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│                   agent.py (Agent Class)                        │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │ SuperAdvisorAgent.process_query():                       │  │
│  │ 1. Get member profile (via tools.get_member_profile())   │  │
│  │ 2. Build context with anonymization                      │  │
│  │ 3. Create AgentState                                     │  │
│  │ 4. Delegate to react_loop.run_agentic_loop()            │  │
│  └───────────────────────────┬──────────────────────────────┘  │
└──────────────────────────────┼──────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│              react_loop.py (Core Agentic Loop)                  │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │ ReactAgenticLoop.run_agentic_loop():                    │  │
│  │                                                          │  │
│  │ PHASE 3: Classification                                 │  │
│  │  └─> classifier.classify_query_topic()                  │  │
│  │      ├─ Stage 1: Regex patterns                         │  │
│  │      ├─ Stage 2: Embedding similarity                    │  │
│  │      └─ Stage 3: LLM fallback                           │  │
│  │                                                          │  │
│  │ PHASE 4: Tool Selection & Execution                     │  │
│  │  └─> reason_and_select_tools()                          │  │
│  │      └─> act_execute_tools()                            │  │
│  │          └─> tools.call_tool()                           │  │
│  │              └─> Unity Catalog SQL functions            │  │
│  │                                                          │  │
│  │ PHASE 5: Response Synthesis                            │  │
│  │  └─> synthesize_response()                              │  │
│  │      └─> agent.generate_response()                      │  │
│  │          └─> Foundation Model API (Claude)              │  │
│  │                                                          │  │
│  │ PHASE 6: Validation                                     │  │
│  │  └─> observe_and_validate()                             │  │
│  │      └─> validator.validate()                           │  │
│  │          └─> Foundation Model API (Judge LLM)           │  │
│  │                                                          │  │
│  │ Return result_dict with response, citations, etc.       │  │
│  └───────────────────────────┬──────────────────────────────┘  │
└──────────────────────────────┼──────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│              agent_processor.py (Completion)                    │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │ 1. Extract results from result_dict                      │  │
│  │ 2. Calculate cost breakdown                              │  │
│  │ 3. Log to governance table                               │  │
│  │ 4. End MLflow run                                        │  │
│  │ 5. Return structured response to app.py                  │  │
│  └──────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘

Component Responsibilities

app.py (UI Layer)

Streamlit interface for user interaction
Country and member profile selection
Query input and result display
Progress tracking UI

agent_processor.py (Orchestration Layer)

High-level query orchestration
Phase tracking (8 phases)
MLflow observability initialization
Governance logging (Unity Catalog)
Cost calculation and reporting
Error handling and recovery

agent.py (Agent Class)

Agent instance coordination
Tool initialization (SuperAdvisorTools)
Validator initialization (LLMJudgeValidator)
Prompts registry initialization
Creates ReactAgenticLoop instance
Data preparation (member profile, context building)
Utility methods (currency formatting, authority mapping)

react_loop.py (Core Agentic Loop)

Implements ReAct pattern (REASON → ACT → OBSERVE)
Query classification (3-stage cascade)
Tool selection and execution
Response synthesis (LLM generation)
Response validation (LLM-as-a-Judge)
Iterative refinement loop

Supporting Components:

classifier.py: 3-stage cascade classification
validation.py: LLM-as-a-Judge validation
tools.py: Unity Catalog function wrappers
country_config.py: Country-specific configurations
prompts_registry.py: Prompt versioning and MLflow tracking
observability.py: MLflow and Lakehouse Monitoring
utils/formatting.py: Currency formatting, SQL escaping
utils/audit.py: Governance logging utilities
utils/lakehouse.py: Unity Catalog SQL utilities
utils/progress.py: Real-time UI progress tracking

Key Design Patterns

Separation of Concerns:

agent_processor.py: Infrastructure and orchestration (phase tracking, logging, error handling)
agent.py: Agent instance coordination (tools, validators, prompts)
react_loop.py: Core agentic logic (REASON → ACT → OBSERVE)
utils/: Reusable utilities (formatting, SQL, audit)

Utility Organization:

utils/formatting.py: General formatting utilities (currency, SQL escaping)
utils/audit.py: Audit logging and governance utilities
utils/lakehouse.py: Unity Catalog and SQL operations
utils/progress.py: Real-time UI progress tracking

Country Configuration:

country_config.py: Single source of truth for country-specific settings
No hardcoded country logic in agent or react_loop
Easy to add new countries by extending configuration

Production Best Practices Demonstrated

1. MLflow Integration for Agentic AI

Challenge: Agentic AI applications need to track prompts, model versions, tool selections, and validation results across multiple iterations.

Solution: Comprehensive MLflow integration at every stage:

Prompt Registry with Versioning

All prompts are centralized in prompts_registry.py and automatically versioned in MLflow:

# Register prompts with MLflow
registry = PromptsRegistry()
registry.register_prompts_with_mlflow(run_name="prompts_v1.2.0")

# MLflow automatically tracks:
# - Prompt versions
# - Prompt content (artifacts)
# - Registration timestamps
# - A/B test comparisons

Benefits:

✅ Reproducibility: Full history of prompt changes
✅ A/B Testing: Easy comparison of prompt variations
✅ Rollback: Quickly revert to previous versions
✅ Collaboration: Team members can iterate on prompts safely

Experiment Tracking

Every query execution is tracked as an MLflow run:

with mlflow.start_run(run_name=f"query-{session_id}"):
    mlflow.log_param("country", country)
    mlflow.log_param("validation_mode", "llm_judge")
    mlflow.log_metric("runtime_sec", elapsed)
    mlflow.log_metric("total_cost_usd", cost)
    mlflow.log_dict(judge_verdict, "validation.json")
    mlflow.log_dict(cost_breakdown, "cost_breakdown.json")

What We Track:

Parameters: Country, user ID, tools used, validation mode
Metrics: Runtime, cost, token counts, validation confidence
Artifacts: Validation results, cost breakdowns, error logs
Traces: Full agent reasoning steps (via MLflow Traces)

2. Intelligent Cost Optimization

Challenge: LLM inference can be expensive, especially when processing every query through multiple models.

Solution: 3-Stage Cascade Classification System (classifier.py)

Stage 1: Regex Patterns (80% of queries)
├─ Cost: $0
├─ Latency: <1ms
└─ Accuracy: 95%+

Stage 2: Embedding Similarity (15% of queries)
├─ Cost: ~$0.0001
├─ Latency: ~100ms
├─ Model: databricks-bge-large-en
└─ Accuracy: 98%+

Stage 3: LLM Fallback (5% of queries)
├─ Cost: ~$0.001
├─ Latency: ~300ms
├─ Model: databricks-gpt-oss-120b
└─ Accuracy: 99%+

Result: 80% cost reduction compared to pure LLM classification while maintaining 99%+ accuracy.

3. Production Observability

Challenge: Agentic AI systems are complex—how do you monitor what's happening in production?

Solution: Multi-layer observability stack:

MLflow Dashboard

Per-query tracking: Every query logged with full context
Cost analysis: Real-time cost breakdowns
Quality metrics: Validation confidence, pass rates
Error tracking: Failed queries with full context

Lakehouse Monitoring

Aggregated metrics: Daily/weekly trends
Anomaly detection: Cost spikes, quality degradation
Automated alerts: Threshold-based notifications

Streamlit Governance UI

Real-time dashboards: Live metrics and trends
Classification analytics: Stage distribution and cost savings
Quality monitoring: Pass rates, confidence distributions
Cost analysis: Detailed breakdowns and projections

Key Insight: Observability isn't just logging—it's actionable insights that help you improve the system continuously.

4. Prompt Registry Pattern

Challenge: Prompts evolve over time. How do you manage versions, test changes, and roll back safely?

Solution: Centralized prompt registry with MLflow versioning (prompts_registry.py)

Features:

Single Source of Truth: All prompts in one place
Automatic Versioning: Every change tracked in MLflow
A/B Testing: Easy comparison of prompt variations
Country-Specific: Dynamic prompt generation per country
Audit Trail: Full history of prompt changes

Example:

# Get system prompt for Australia
registry = PromptsRegistry()
system_prompt = registry.get_system_prompt(country="Australia")

# Automatically includes:
# - Country-specific regulatory context
# - Currency and terminology
# - Special instructions
# - Version tracking

5. LLM-as-a-Judge Validation

Challenge: How do you ensure AI-generated responses are accurate and safe?

Solution: Multi-mode validation system (validation.py)

Validation Modes:

LLM Judge (Default)
- Comprehensive semantic validation
- Checks regulatory accuracy, safety, completeness
- Provides confidence scores
- Model: Claude Sonnet 4
Deterministic
- Fast rule-based checks
- Catches critical violations instantly
- Zero LLM cost
Hybrid
- Deterministic first, LLM fallback
- Best of both worlds

Safety Feature: Failed validations are hidden from users and placed in an internal review queue, ensuring users never see potentially incorrect advice.

6. Country-Agnostic Architecture

Challenge: How do you scale to multiple countries without code duplication?

Solution: Configuration-driven architecture (country_config.py)

Add a new country in 10 minutes:

CountryConfig(
    code="CA",
    name="Canada",
    currency="CAD",
    retirement_account_term="RRSP",
    regulatory_context="Follow Canadian RRSP rules...",
    available_tools=["tax", "benefit", "projection"]
)

Benefits:

✅ Zero code changes required
✅ Single source of truth for country settings
✅ Easy to test and validate
✅ Maintainable and scalable

Getting Started

Prerequisites

Databricks Workspace with Unity Catalog enabled
SQL Warehouse access
Foundation Model API access (Claude Sonnet 4, Claude Haiku 4, BGE)
Python 3.9+

Installation

Clone the repository

git clone https://github.com/pravinva/multi-country-pension-advisor.git
cd multi-country-pension-advisor

Install dependencies
```
pip install -r requirements.txt
```

Set up Unity Catalog

# Run SQL scripts in order:
# 1. sql_ddls/super_advisory_demo_schema.sql
# 2. sql_ddls/super_advisory_demo_functions.sql

Configure environment

Update config.py:

CATALOG_NAME = "super_advisory_demo"
SCHEMA_NAME = "pension_calculators"
SQL_WAREHOUSE_ID = "your_warehouse_id"
MLFLOW_PROD_EXPERIMENT_PATH = "/Users/your_email/pension-advisor-prod"

Run the application
```
streamlit run app.py
```

Demo Notebooks

Interactive Databricks notebooks demonstrating the complete system:

Setup & Data (_resources/)

00-setup.py - Initialize catalog and schemas
00-load-data.py - Generate demo data with Faker
bundle_config.yaml - dbdemos configuration

Infrastructure (01-setup/)

01-unity-catalog-setup.py - Create UC functions
02-governance-setup.py - Configure row-level security and audit logging

Agent Framework (02-agent-demo/)

01-agent-overview.py - Architecture and design patterns
02-build-agent.py - Build and test the agent
03-tool-integration.py - Unity Catalog function calling
04-validation.py - LLM-as-a-Judge validation

Observability (03-monitoring-demo/)

01-mlflow-tracking.py - Experiment tracking and metrics
02-observability.py - Latency and tool usage monitoring
03-dashboard.py - Governance dashboard

User Interface (04-ui-demo/)

01-streamlit-ui.py - Interactive web interface demo

Technology Stack

Agent Framework

ReAct Pattern: Custom implementation in react_loop.py
Tool Orchestration: Dynamic tool selection based on query analysis
Iterative Reasoning: Multi-step problem-solving

MLflow Integration

Prompt Registry: Version-controlled prompt management
Experiment Tracking: Per-query execution tracking
Model Governance: Full audit trail of model usage

Foundation Models

Claude Sonnet 4: Main synthesis and validation
Claude Haiku 4: Fast classification fallback
BGE Large: Embedding generation for semantic similarity
GPT-OSS-120B: Cost-effective LLM fallback

Data Platform

Unity Catalog: Member profiles, audit logs, citation registry
SQL Functions: 18 country-specific pension calculators
Lakehouse Monitoring: Aggregated metrics and dashboards

Observability

MLflow: Experiment tracking and prompt versioning
Lakehouse Monitoring: Time-series metrics and alerts
Streamlit UI: Real-time governance dashboards

Key Capabilities

Intelligent Off-Topic Detection

The 3-stage cascade classifier ensures that only relevant queries consume expensive LLM resources:

Stage 1: Regex patterns catch 80% of queries instantly (<1ms, $0)
Stage 2: Embedding similarity handles 15% semantically (~100ms, $0.0001)
Stage 3: LLM fallback for ambiguous cases (~300ms, $0.001)

Result: 80% cost reduction vs. pure LLM classification.

Privacy Protection

Built-in PII anonymization pipeline:

Extract member names from Unity Catalog
Replace with tokens during LLM processing
Restore real names in final response
Log only anonymized versions to MLflow

Regulatory Compliance

Every response includes:

Regulatory citations: Proper references to country-specific laws
Audit trail: Complete logging to Unity Catalog
Validation results: LLM-as-a-Judge confidence scores
Cost transparency: Per-query cost breakdowns

Cost Analysis

Typical Query Cost Breakdown

Operation	Model	Tokens	Cost
Planning & Synthesis	Claude Opus 4.1	~2,000	$0.0029
Validation	Claude Sonnet 4	~500	$0.0004
Total per Query			~$0.003

Cost Comparison

Traditional Financial Advisor: $150-300/hour
This System: $0.003/query (~50,000 queries = $150)
Cost Savings: 99.9% reduction

Real-Time Cost Display

Each response shows detailed cost breakdown:

Total Cost: $0.003245
├─ Main LLM (Opus 4.1): $0.002891
├─ Validation (Sonnet 4): $0.000354
└─ Classification: $0.0001

Governance & Compliance

Audit Trail

Every query is logged to Unity Catalog with:

Query text: Original user question
Agent response: Generated answer
Validation results: Judge verdict and confidence
Cost breakdown: Detailed cost analysis
Regulatory citations: References to laws and regulations
Tool usage: Which SQL functions were called

MLflow Artifacts

Prompt versions: Tracked for reproducibility
Validation results: Full judge reasoning
Cost breakdowns: Token-level analysis
Error logs: Debugging information

Lakehouse Monitoring

Daily metrics: Query volume, pass rates, costs
Anomaly detection: Cost spikes, quality degradation
Automated alerts: Threshold-based notifications

Development Guide

Project Structure

multi-country-pension-advisor/
├── agent.py                 # Main agent orchestrator
├── react_loop.py            # ReAct agentic loop implementation
├── agent_processor.py       # Execution orchestration & MLflow
├── classifier.py            # 3-stage cascade classifier
├── prompts_registry.py      # Prompt registry with MLflow
├── validation.py            # LLM-as-a-Judge validation
├── country_config.py        # Country-agnostic configuration
├── observability.py         # MLflow & Lakehouse Monitoring
├── tools.py                 # Unity Catalog function wrappers
├── app.py                   # Streamlit UI
├── agents/
│   └── orchestrator.py      # Phase tracking and timing utilities
├── prompts/
│   ├── template_manager.py  # Jinja2 template management
│   └── templates/           # Jinja2 template files
├── validation/
│   ├── json_parser.py       # JSON extraction and parsing
│   └── token_calculator.py  # Token counting utilities
├── ui/
│   ├── theme_config.py      # Country-specific UI themes
│   ├── html_builder.py      # HTML component builders
│   └── tab_base.py          # Base class for monitoring tabs
├── shared/
│   └── logging_config.py    # Centralized logging configuration
└── utils/
    ├── formatting.py        # Currency formatting, SQL escaping
    ├── audit.py             # Governance logging
    ├── lakehouse.py         # Unity Catalog utilities
    └── progress.py          # Real-time progress tracking

New Packages (Refactored)

agents/ - Agent infrastructure

orchestrator.py: Context manager for automatic phase tracking, timing, and error handling
- Reduces boilerplate in agent_processor.py from 400+ lines to <150 lines
- Provides reusable track_phase() context manager
- Automatic MLflow logging and progress tracking

prompts/ - Prompt management

template_manager.py: Jinja2 template engine for all prompts
- Eliminates hard-coded prompts throughout the codebase
- Country-specific prompt generation with caching
- Integration with country_config for dynamic variables

validation/ - Validation utilities

json_parser.py: Extract and parse JSON from LLM responses
token_calculator.py: Token counting for cost estimation

ui/ - UI components

theme_config.py: Centralized country theme definitions (colors, flags, gradients)
- Extracted from ui_components.py and ui_dashboard.py
- Single source of truth for all country styling
html_builder.py: Reusable HTML builders for cards, badges, metrics
- Generic card builders
- Specialized components (member cards, question cards, activity items)
- System status banners
tab_base.py: Abstract base class for monitoring tabs
- Eliminates ~425 lines of duplicated code across 5 tabs
- Automatic data loading, error handling, and empty state management
- Consistent UI patterns across all tabs

shared/ - Shared infrastructure

logging_config.py: Centralized logging configuration
- Structured logging with ColoredFormatter
- Configurable log levels and file output
- Consistent logging patterns across all modules

Logging Setup

Configuration

All modules use structured logging via shared/logging_config.py:

from shared.logging_config import get_logger, setup_logging

# Initialize logging (in app.py or main entry point)
setup_logging(
    log_level=logging.INFO,      # DEBUG, INFO, WARNING, ERROR, CRITICAL
    enable_file=True,             # Write logs to file
    log_file="logs/app.log"      # Log file path
)

# Get logger in any module
logger = get_logger(__name__)

# Use structured logging
logger.info("Processing query")
logger.error("Failed to execute tool", exc_info=True)  # Includes stack trace
logger.debug(f"Response: {response}")

Log Levels

DEBUG: Detailed information for diagnosing problems
INFO: Confirmation that things are working as expected
WARNING: Indication of potential problems
ERROR: Due to a more serious problem, the function has failed
CRITICAL: Serious error, the program may be unable to continue

Best Practices

Production: Use INFO level with file logging enabled
Development: Use DEBUG level for detailed tracing
Error handling: Always use exc_info=True for exceptions
Consistency: All production code uses logger, no print statements

Code Flow Architecture

Understanding how the components interact is crucial for maintaining and extending the system:

┌─────────────────────────────────────────────────────────────────┐
│                        app.py (UI Layer)                        │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │ User selects country & member profile                    │  │
│  │ User enters query → clicks "Get Recommendation"         │  │
│  └───────────────────────────┬──────────────────────────────┘  │
└───────────────────────────────┼──────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│              agent_processor.py (Orchestration)                 │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │ agent_query() function:                                  │  │
│  │ 1. Initialize observability (MLflow)                    │  │
│  │ 2. Track Phase 1: Data Retrieval                       │  │
│  │ 3. Create SuperAdvisorAgent instance                    │  │
│  │ 4. Track Phase 2: Anonymization                        │  │
│  │ 5. Call agent.process_query()                           │  │
│  │ 6. Track Phases 3-6: Classification, Tools, Synthesis  │  │
│  │ 7. Track Phase 7: Name Restoration                     │  │
│  │ 8. Track Phase 8: Audit Logging                        │  │
│  │ 9. End observability run                                │  │
│  └───────────────────────────┬──────────────────────────────┘  │
└───────────────────────────────┼──────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│                   agent.py (Agent Class)                        │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │ SuperAdvisorAgent.process_query():                       │  │
│  │ 1. Get member profile (via tools.get_member_profile())   │  │
│  │ 2. Build context with anonymization                      │  │
│  │ 3. Create AgentState                                     │  │
│  │ 4. Delegate to react_loop.run_agentic_loop()            │  │
│  └───────────────────────────┬──────────────────────────────┘  │
└───────────────────────────────┼──────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│              react_loop.py (Core Agentic Loop)                  │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │ ReactAgenticLoop.run_agentic_loop():                    │  │
│  │                                                          │  │
│  │ PHASE 3: Classification                                 │  │
│  │  └─> classifier.classify_query_topic()                   │  │
│  │      ├─ Stage 1: Regex patterns                         │  │
│  │      ├─ Stage 2: Embedding similarity                    │  │
│  │      └─ Stage 3: LLM fallback                           │  │
│  │                                                          │  │
│  │ PHASE 4: Tool Selection & Execution                     │  │
│  │  └─> reason_and_select_tools()                          │  │
│  │      └─> act_execute_tools()                            │  │
│  │          └─> tools.call_tool()                         │  │
│  │              └─> Unity Catalog SQL functions            │  │
│  │                                                          │  │
│  │ PHASE 5: Response Synthesis                            │  │
│  │  └─> synthesize_response()                              │  │
│  │      └─> agent.generate_response()                      │  │
│  │          └─> Foundation Model API (Claude)             │  │
│  │                                                          │  │
│  │ PHASE 6: Validation                                     │  │
│  │  └─> observe_and_validate()                             │  │
│  │      └─> validator.validate()                          │  │
│  │          └─> Foundation Model API (Judge LLM)          │  │
│  │                                                          │  │
│  │ Return result_dict with response, citations, etc.       │  │
│  └───────────────────────────┬──────────────────────────────┘  │
└──────────────────────────────┼──────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│              agent_processor.py (Completion)                    │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │ 1. Extract results from result_dict                      │  │
│  │ 2. Calculate cost breakdown                              │  │
│  │ 3. Log to governance table                               │  │
│  │ 4. End MLflow run                                        │  │
│  │ 5. Return structured response to app.py                  │  │
│  └──────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘

Supporting Components

Configuration & Utilities:

country_config.py: Centralized country configurations (currency, authorities, regulations)
prompts_registry.py: Prompt versioning and MLflow tracking
utils/formatting.py: Currency formatting, SQL escaping utilities
utils/audit.py: Governance logging and audit trail utilities
utils/lakehouse.py: Unity Catalog SQL execution utilities
utils/progress.py: Real-time UI progress tracking

Core Logic:

classifier.py: 3-stage cascade classification (Regex → Embedding → LLM)
validation.py: LLM-as-a-Judge validation with multiple modes
tools.py: Unity Catalog function wrappers
observability.py: MLflow and Lakehouse Monitoring integration

Key Design Patterns

Separation of Concerns:

agent_processor.py: Infrastructure and orchestration (phase tracking, logging, error handling)
agent.py: Agent instance coordination (tools, validators, prompts)
react_loop.py: Core agentic logic (REASON → ACT → OBSERVE)
utils/: Reusable utilities (formatting, SQL, audit)

Utility Organization:

utils/formatting.py: General formatting utilities (currency, SQL escaping)
utils/audit.py: Audit logging and governance utilities
utils/lakehouse.py: Unity Catalog and SQL operations
utils/progress.py: Real-time UI progress tracking

Country Configuration:

country_config.py: Single source of truth for country-specific settings
No hardcoded country logic in agent or react_loop
Easy to add new countries by extending configuration

Adding a New Country

The system is designed to support new countries with minimal configuration changes. Here's a complete example of adding Canada as a fifth country:

Step 1: Update Configuration (config/config.yaml)

Add the new country to the countries list:

countries:
  - code: "AU"
    name: "Australia"
    enabled: true
  - code: "US"
    name: "United States"
    enabled: true
  - code: "UK"
    name: "United Kingdom"
    enabled: true
  - code: "IN"
    name: "India"
    enabled: true
  - code: "CA"
    name: "Canada"
    enabled: true

Step 2: Add Country Configuration (country_config.py)

Define the country-specific settings:

CountryConfig(
    code="CA",
    name="Canada",
    currency="CAD",
    currency_symbol="$",
    retirement_account_term="RRSP",
    authority="CRA",
    regulatory_context="""
        Follow Canadian Revenue Agency (CRA) regulations for RRSPs.
        Key rules:
        - Contribution limit: 18% of previous year's income (max $31,560 for 2024)
        - Withdrawal taxed as income
        - First Home Buyers' Plan: withdraw up to $35,000
        - Age 71: must convert to RRIF
    """,
    available_tools=["rrsp_balance", "tax_calculation", "withdrawal_eligibility"]
)

Step 3: Create SQL Functions (Unity Catalog)

Create country-specific calculation functions:

-- RRSP Balance Query
CREATE OR REPLACE FUNCTION super_advisory_demo.pension_calculators.CA_get_rrsp_balance(
    member_id STRING
)
RETURNS STRUCT<
    balance DOUBLE,
    contribution_room DOUBLE,
    currency STRING
>
LANGUAGE SQL
RETURN SELECT
    balance,
    contribution_room,
    'CAD' as currency
FROM super_advisory_demo.member_data.member_profiles
WHERE user_id = member_id AND country = 'CA';

-- Tax Calculation
CREATE OR REPLACE FUNCTION super_advisory_demo.pension_calculators.CA_calculate_tax(
    withdrawal_amount DOUBLE,
    member_id STRING
)
RETURNS STRUCT<
    tax_amount DOUBLE,
    net_amount DOUBLE,
    tax_rate DOUBLE
>
LANGUAGE SQL
...

Step 4: Add Test Member Data

Insert test member profile:

INSERT INTO super_advisory_demo.member_data.member_profiles
(user_id, name, age, country, super_balance, contribution_rate, employer_match)
VALUES ('CA001', 'John Smith', 45, 'CA', 125000, 0.10, 0.05);

Step 5: Add UI Theme (Optional)

Update ui/theme_config.py for country-specific UI styling:

COUNTRY_COLORS = {
    "CA": {
        "primary": "#FF0000",      # Red from Canadian flag
        "secondary": "#FFFFFF",     # White from Canadian flag
        "currency": "$",
        "flag": "CA"
    },
    ...
}

Step 6: Test the Integration

# Test query for Canadian member
from agent_processor import agent_query

result = agent_query(
    member_id="CA001",
    user_query="What is my RRSP contribution room?",
    withdrawal_amount=None
)

That's it! The system will automatically:

Route queries to Canadian-specific functions
Apply CRA regulatory context
Format currency as CAD
Use Canadian terminology (RRSP, CRA)
Track Canada-specific metrics

Key Benefits:

Zero code changes to agent logic
Configuration-driven approach
Full isolation between countries
Easy to test and validate
Maintainable and scalable

Testing & Evaluation

Offline Evaluation

The system includes built-in offline evaluation capabilities for batch testing and validation. Offline evaluation allows you to test multiple queries at once and analyze results systematically.

Using Offline Evaluation via UI

Navigate to Governance → Config Tab
- Open the Streamlit app and go to the Governance page
- Click on the "⚙️ Config" tab
Upload Evaluation CSV
- Scroll to "🧪 Offline Evaluation" section
- Upload a CSV file with the required columns (see format below)
- Preview the data before running
Run Evaluation
- Click "▶️ Run Offline Evaluation"
- Results are processed and logged to MLflow
- View summary statistics and sample results

CSV Format

The evaluation CSV must contain the following columns:

user_id,country,query_str
AU001,AU,"How much superannuation will I have at retirement?"
US002,US,"What's my 401k balance?"
UK003,UK,"When can I access my pension?"

Column Requirements:

user_id: Member ID (must exist in member_profiles table)
country: Country code (AU, US, UK, IN) or display name (Australia, USA, etc.)
query_str: The query to evaluate (can also be query_string)

Example CSV:

user_id,country,query_str
AU001,AU,"What is my preservation age?"
AU002,AU,"Can I withdraw my super before age 60?"
US001,US,"What are the RMD rules for my 401k?"
UK001,UK,"When can I access my state pension?"

Results & Logging

MLflow Integration:

All evaluation runs are logged to MLFLOW_OFFLINE_EVAL_PATH (configurable in config.py)
Each query in the batch creates a separate MLflow run
Full observability: costs, validation results, tool usage tracked per query

Result Structure:

{
    "total_queries": 10,
    "sample_result": {
        "row_index": 0,
        "user_id": "AU001",
        "country": "AU",
        "query": "What is my preservation age?",
        "session_id": "uuid-here",
        "result": {
            "answer": "...",
            "citations": [...],
            "cost": 0.003,
            "validation_passed": True,
            ...
        }
    }
}

Command-Line Usage

You can also run offline evaluation from the command line:

# Offline evaluation (batch CSV)
python run_evaluation.py --mode offline --csv_path eval_queries.csv

# Online evaluation (single query)
python run_evaluation.py --mode online \
    --query_str "What is my preservation age?" \
    --user_id AU001 \
    --country AU

Evaluation Code Location

Evaluation Script: run_evaluation.py
UI Integration: ui_components.py → render_configuration_tab() (lines 747-812)
MLflow Path: Configured in config.py → MLFLOW_OFFLINE_EVAL_PATH

Functions:

offline_eval(csv_path): Batch evaluation on CSV file
online_eval(query_str, user_id, country): Single query evaluation
run_csv_evaluation(uploaded_csv): Streamlit wrapper for UI

Governance Dashboard Guide

The Governance page provides comprehensive monitoring and observability through 5 specialized tabs. Each tab focuses on different aspects of system performance, cost, and quality.

Tab 1: 🔒 Governance Dashboard

Purpose: High-level overview of system health and recent activity at a glance.

What You See:

Key Metrics Cards (Last 24 Hours)

Total Queries: Total number of queries processed in the last 24 hours
- How to read: Higher numbers indicate active usage
- What to watch: Sudden drops may indicate system issues
Pass Rate: Percentage of queries that passed validation
- How to read: Should be consistently above 80% for healthy operation
- What to watch: Drops below 70% indicate quality issues requiring investigation
Avg Cost: Average cost per query in USD
- How to read: Typically $0.003-$0.010 per query
- What to watch: Spikes may indicate expensive queries or classification failures
Health Score: Overall system health (0-100)
- How to read: Above 80 = Healthy, 60-80 = Needs attention, Below 60 = Critical
- Calculation: Based on pass rate, error rate, latency, and cost efficiency

Status indicators: Shows overall system state (Operational, Degraded, Critical)
Tool usage: Displays most frequently used tools (tax, benefit, projection)
Country distribution: Shows query volume by country

Audit Trail

Complete query history: Full list of all queries with timestamps
Columns: User ID, Country, Query, Response Preview, Cost, Validation Status, Runtime
Sorting: Click column headers to sort
Filtering: Use search boxes to filter by user, country, or query text
Purpose: Compliance auditing and debugging specific queries

Recent Activity Feed

Last 10 queries: Quick view of most recent activity
Color coding: Green (passed), Red (failed), Yellow (pending)
Purpose: Quick health check without scrolling through full audit trail

Quick Trend Charts

Query Volume Over Time: Line chart showing query frequency
Cost Trend: Area chart showing spending patterns
Purpose: Identify usage patterns and cost spikes

Use Cases:

Morning health check: Review key metrics and system status
Compliance audit: Use audit trail to review specific queries
Pattern analysis: Use trend charts to identify usage patterns

Tab 2: 🔬 MLflow Traces

Purpose: Deep dive into MLflow experiment tracking, prompt versions, and individual query execution details.

What You See:

MLflow Experiments Tab

Recent runs: List of all MLflow runs with timestamps
Run details: Click any run to see:
- Parameters: Country, user ID, validation mode, tools used
- Metrics: Runtime, cost, token counts, validation confidence
- Artifacts: Full validation results, cost breakdowns, error logs
Purpose: Debug individual queries and compare runs

Prompt Registry Tab

Registered prompts: All prompt versions tracked in MLflow
Version history: See how prompts evolved over time
Prompt content: View full prompt text for each version
Registration button: Manually trigger prompt registration
Purpose: Prompt versioning, A/B testing, and rollback capabilities

How to Interpret:

High confidence scores: Good quality responses
Low confidence scores: May need prompt refinement
Cost spikes: Check artifacts for expensive queries
Failed validations: Review reasoning in artifacts

Use Cases:

Prompt iteration: Compare different prompt versions
Quality investigation: Deep dive into failed validations
Cost optimization: Identify expensive queries and optimize

Tab 3: ⚙️ Config

Purpose: System configuration and offline evaluation management.

What You See:

SQL Warehouse Configuration

Warehouse selection: Dropdown to select active SQL warehouse
Purpose: Configure which warehouse is used for SQL function execution

LLM Configuration

Main Advisory LLM (Claude Opus 4.1):
- Temperature: Controls creativity (0.0-1.0)
  - Lower (0.0-0.3): More factual, consistent
  - Higher (0.7-1.0): More creative, varied
- Max Tokens: Maximum response length (100-4000)
  - Lower: Shorter, more focused responses
  - Higher: More detailed responses
Judge Validation LLM (Claude Sonnet 4):
- Temperature: Lower recommended (0.0-0.2) for consistent validation
- Max Tokens: Typically 100-1000 for validation responses

How to Interpret:

Temperature too high: Inconsistent responses, hallucinations
Temperature too low: Rigid, repetitive responses
Max tokens too low: Truncated responses
Max tokens too high: Unnecessary cost

Offline Evaluation

CSV upload: Upload evaluation datasets
Preview: See data before running
Run evaluation: Execute batch queries
Results: View summary and sample results
Purpose: Batch testing and validation

Use Cases:

Parameter tuning: Adjust LLM settings based on performance
Batch testing: Evaluate multiple queries systematically
Cost optimization: Test different configurations

Tab 4: 💰 Cost Analysis

Purpose: Comprehensive cost analysis, trends, and projections.

What You See:

Key Cost Metrics

Total Cost: Sum of all query costs
Median Cost: Median cost per query (less sensitive to outliers)
Max Cost: Highest single query cost
Cost Std Dev: Variability in query costs
Last Run Cost: Cost of most recent query with % change vs average

How to Interpret:

Median vs Average: Large difference indicates cost outliers
High Std Dev: Inconsistent query costs, may need optimization
Max Cost: Identify expensive queries for optimization

Cost Distribution Charts

Cost Distribution Histogram: Shows how costs are distributed
- Peak at low costs: Most queries are efficient
- Long tail: Some expensive queries exist
Cost Over Time: Trend line showing cost patterns
- Upward trend: May indicate classification issues
- Spikes: Investigate specific time periods

Cost by Country

Bar chart: Average cost per country
Purpose: Identify country-specific cost patterns
Use Cases: Optimize country-specific configurations

Cost Projections

Monthly projection: Based on current rate
Annual projection: Long-term cost estimate
Purpose: Budget planning and forecasting

Use Cases:

Budget planning: Use projections for cost estimates
Cost optimization: Identify expensive queries and optimize
Country analysis: Compare costs across countries

Tab 5: 📊 Observability

Purpose: Real-time monitoring of performance, quality, classification, and system health.

What You See:

Real-Time Performance Metrics (Left Column)

Key Metrics (Last 24h):

Total Queries: Count of queries processed
Unique Users: Number of distinct users
Total Cost: Aggregate spending
Average Latency: Mean response time
Pass Rate: Validation success rate

Charts:

Query Volume Over Time: Hourly query frequency (line chart)
- Peaks: Identify usage patterns
- Drops: May indicate system issues
Cost Trend: Cost spending over time (area chart)
- Spikes: Investigate expensive periods
Latency Trend: Response time over time (line chart)
- Spikes: Performance degradation

Performance by Country:

Table: Metrics broken down by country
Purpose: Country-specific performance analysis

Query Distribution:

Pie chart: Percentage of queries by country
Purpose: Understand usage patterns

Classification Analytics (Right Column)

Stage Distribution:

Stage 1 (Regex): Percentage handled by regex patterns
- Target: 75-85% of queries
- Low percentage: May need regex pattern updates
Stage 2 (Embedding): Percentage handled by embeddings
- Target: 10-20% of queries
Stage 3 (LLM): Percentage requiring LLM fallback
- Target: <10% of queries
- High percentage: May indicate classification issues

Cost Savings Analysis:

Actual Cost: Current classification cost
Pure LLM Cost: What it would cost with LLM-only classification
Savings: Percentage saved (typically 80-90%)
Purpose: Validate cost optimization effectiveness

Classification Funnel:

Visual flow: Shows cascade progression
Purpose: Understand how queries flow through stages

Latency by Stage:

Bar chart: Average latency per stage
Purpose: Performance comparison

Quality Monitoring (Left Column)

Validation Pass Rate:

Overall Pass Rate: Percentage of queries passing validation
Pass Rate Trend: Pass rate over time (line chart)
- Downward trend: Quality degradation
- Drops: Investigate specific time periods

Confidence Analysis:

Average Confidence: Mean validation confidence score
Confidence Distribution: Histogram of confidence scores
- Peak near 1.0: High quality responses
- Peak near 0.5-0.7: May need prompt refinement
Confidence by Verdict: Average confidence for passed vs failed
- Purpose: Understand validation patterns

Common Violations:

Violation types: Most frequent validation failures
Frequency: How often each violation occurs
Purpose: Identify recurring issues

Quality by Country:

Table: Pass rate and confidence by country
Purpose: Country-specific quality analysis

System Health (Right Column)

Health Score:

Overall Score: 0-100 health rating
Components: Based on pass rate, error rate, latency, cost
Status: Healthy (80+), Degraded (60-80), Critical (<60)

Anomaly Detection:

Cost Anomalies: Unusual cost spikes
Latency Anomalies: Unusual response times
Quality Anomalies: Unusual validation failures
Purpose: Early warning system

Error Rate:

Current Error Rate: Percentage of failed queries
Error Trend: Error rate over time
Purpose: Monitor system stability

System Alerts:

Active Alerts: Current issues requiring attention
Alert History: Past alerts and resolutions

Use Cases:

Performance monitoring: Track latency and throughput
Quality assurance: Monitor validation pass rates
Cost optimization: Validate classification savings
Anomaly detection: Identify unusual patterns early

Interpreting Charts and Metrics

General Guidelines

Green Indicators (Good):

High pass rates (>80%)
Low latency (<3 seconds)
Low cost (<$0.005 per query)
High health scores (>80)

Yellow Indicators (Warning):

Pass rates 70-80%
Latency 3-5 seconds
Cost $0.005-$0.010 per query
Health scores 60-80

Red Indicators (Critical):

Pass rates <70%
Latency >5 seconds
Cost >$0.010 per query
Health scores <60

Chart Reading Tips

Line Charts:

Upward trends: Increasing over time
Downward trends: Decreasing over time
Spikes: Sudden increases (investigate)
Drops: Sudden decreases (investigate)

Bar Charts:

Taller bars: Higher values
Compare heights: Relative differences
Clusters: Group similar values

Histograms:

Peaks: Most common values
Distribution shape: Normal, skewed, bimodal
Outliers: Values far from the peak

Pie Charts:

Larger slices: Higher percentages
Compare slices: Relative proportions

Documentation

Core Guides

README.md (this file) - Overview and production practices
CLASSIFIER_GUIDE.md - Intelligent off-topic detection
MONITORING_GUIDE.md - Lakehouse Monitoring setup
COUNTRY_CONFIG_MIGRATION.md - Adding new countries

Production Best Practices

Prompt Registry Pattern: Centralized prompt management
MLflow Integration: Experiment tracking and versioning
Observability Stack: Multi-layer monitoring approach
Cost Optimization: Intelligent cascade classification

Key Learnings

What Makes This Production-Ready?

Observability First: Every query is tracked, logged, and monitored
Cost Awareness: Intelligent optimization without sacrificing quality
Version Control: Prompts and models are versioned and reproducible
Safety Mechanisms: Failed validations don't reach users
Scalability: Country-agnostic architecture for easy expansion
Regulatory Compliance: Full audit trails and citation tracking

Databricks Platform Advantages

Unity Catalog: Single source of truth for data and functions
Foundation Model API: No API keys, workspace authentication
MLflow: Built-in experiment tracking and model governance
Lakehouse Monitoring: Native time-series metrics and alerts
SQL Functions: Reusable, versioned calculation logic

Contributing

Issues and pull requests welcome! This is a reference implementation demonstrating production best practices for agentic AI on Databricks.

Maintained by: Pravin Varma

License

MIT License - See LICENSE file for details

Acknowledgments

Built with:

Databricks Platform - Unified analytics and AI platform
MLflow - Open-source ML lifecycle management
Unity Catalog - Unified governance for data and AI
Foundation Model API - Managed LLM endpoints

This reference implementation demonstrates how to build production-ready agentic AI applications on Databricks using industry best practices. Use it as a blueprint for your own agentic AI projects.

For questions or feedback, open an issue or reach out via GitHub.

Languages

Python98.8%Jinja1.2%

Contributors

Created October 22, 2025

Updated January 17, 2026

pravinva/superannuation-agent-multi-country

Building Production-Ready Agentic AI Applications on Databricks: A Deep Dive into the Multi-Country Pension Advisor

Executive Summary

Architecture Overview

The ReAct Agent Pattern

Why ReAct?

Code Flow Architecture

Component Responsibilities

Key Design Patterns

Production Best Practices Demonstrated

1. MLflow Integration for Agentic AI

Prompt Registry with Versioning

Experiment Tracking

2. Intelligent Cost Optimization

3. Production Observability

MLflow Dashboard

Lakehouse Monitoring

Streamlit Governance UI

4. Prompt Registry Pattern

5. LLM-as-a-Judge Validation

6. Country-Agnostic Architecture

Getting Started

Prerequisites

Installation

Demo Notebooks

Technology Stack

Agent Framework

MLflow Integration

Foundation Models

Data Platform

Observability

Key Capabilities

Intelligent Off-Topic Detection

Privacy Protection

Regulatory Compliance

Cost Analysis

Typical Query Cost Breakdown

Cost Comparison

Real-Time Cost Display

Governance & Compliance

Audit Trail

MLflow Artifacts

Lakehouse Monitoring

Development Guide

Project Structure

New Packages (Refactored)

Logging Setup

Code Flow Architecture

Supporting Components

Key Design Patterns

Adding a New Country

Step 1: Update Configuration (config/config.yaml)

Step 2: Add Country Configuration (country_config.py)

Step 3: Create SQL Functions (Unity Catalog)

Step 4: Add Test Member Data

Step 5: Add UI Theme (Optional)

Step 6: Test the Integration

Testing & Evaluation

Offline Evaluation

Using Offline Evaluation via UI

CSV Format

Results & Logging

Command-Line Usage

Evaluation Code Location

Governance Dashboard Guide

Tab 1: 🔒 Governance Dashboard

Key Metrics Cards (Last 24 Hours)

System Status Banner

Audit Trail

Recent Activity Feed

Quick Trend Charts

Tab 2: 🔬 MLflow Traces

MLflow Experiments Tab

Prompt Registry Tab

Tab 3: ⚙️ Config

SQL Warehouse Configuration

LLM Configuration

Offline Evaluation

Tab 4: 💰 Cost Analysis

Key Cost Metrics