danijeun/foresee-app
AI-powered AutoML platform that transforms CSV files into ML insights in 60 seconds. Upload data → Get AI-ranked target recommendations (Gemini 2.5) → Auto-train 3 models (LR, DT, XGBoost) → Download professional PDF reports with visualizations. Built with React, Flask, Snowflake & Google Gemini AI.
ForeSee - AI-Powered AutoML Platform
Automated Machine Learning Analysis & Reporting with Google Gemini AI
ForeSee is an intelligent web application that transforms raw data into actionable ML insights. Upload a CSV file and get professional ML analysis reports with AI-powered target variable recommendations, automated model training, and comprehensive PDF reports—all in minutes.
🎯 What Does ForeSee Do?
From CSV to ML Insights in 5 Simple Steps:
- Upload your dataset (CSV format)
- AI Analysis - Google Gemini automatically suggests the best target variables to predict
- Select your prediction target from ranked recommendations
- Auto-Train - The system automatically trains 3 ML models (Logistic Regression, Decision Tree, XGBoost)
- Download a comprehensive PDF report with insights, metrics, and recommendations
✨ Key Features
🤖 AI-Powered Target Selection (Google Gemini 2.5)
- Uses Google Gemini 2.5 Flash (`gemini-2.5-flash-preview-05-20`) to intelligently analyze your dataset
- Recommends the top 5 most valuable prediction targets with importance scores (1-100)
- Distinguishes between target variables (outcomes) and features (predictors)
- Provides detailed business rationale, predictability assessment, and suggested features
- Runs in parallel with EDA for faster results
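As an illustration of what the ranked output can look like, here is a small sketch that parses and sorts LLM-suggested targets by importance score. The field names are assumptions for illustration, not the agent's actual schema:

```python
import json

def rank_target_suggestions(llm_response_text, top_n=5):
    """Parse the JSON a Gemini prompt might return and rank targets by
    importance score, descending. Field names here are illustrative."""
    suggestions = json.loads(llm_response_text)
    ranked = sorted(suggestions, key=lambda s: s["importance_score"], reverse=True)
    return ranked[:top_n]

# Hypothetical LLM response for a churn dataset
raw = json.dumps([
    {"column": "monthly_spend", "importance_score": 80, "problem_type": "regression"},
    {"column": "churned", "importance_score": 95, "problem_type": "classification"},
    {"column": "signup_month", "importance_score": 40, "problem_type": "classification"},
])
top = rank_target_suggestions(raw)
print(top[0]["column"])  # highest-ranked target: churned
```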
📊 Automatic Exploratory Data Analysis (EDA)
- Snowflake-based comprehensive statistical analysis
- Analyzes all column types: numeric, categorical, datetime, text
- Detects data types, missing values, duplicates, and cardinality
- Calculates metrics: mean, std, quartiles, skewness, kurtosis, top values
- Stores all results in Snowflake for querying and persistence
- Parallel execution with Target Analysis for optimal performance
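The per-column metrics above can be sketched in plain Python; in the app itself they are computed in Snowflake SQL, so this is only an illustration of what gets stored:

```python
import statistics

def numeric_column_stats(values):
    """Illustrative per-column metrics mirroring what the EDA agent stores
    (mean, std, quartiles, completeness)."""
    clean = [v for v in values if v is not None]
    q1, q2, q3 = statistics.quantiles(clean, n=4)  # quartile cut points
    return {
        "count": len(values),
        "null_count": len(values) - len(clean),
        "completeness": len(clean) / len(values),
        "mean": statistics.fmean(clean),
        "std": statistics.stdev(clean),
        "min": min(clean),
        "max": max(clean),
        "q1": q1, "q2": q2, "q3": q3,
    }

stats = numeric_column_stats([10, 12, None, 14, 16, 18])
print(stats["mean"], stats["null_count"])  # 14.0 1
```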
🚀 Multi-Model Machine Learning
Trains 3 models sequentially after target selection:
| Model | Description | Key Metrics |
|---|---|---|
| Logistic Regression | Fast, interpretable baseline | Accuracy, Precision, Recall, F1, ROC-AUC |
| Decision Tree | Non-linear pattern detection | Tree depth, leaves, feature importance |
| XGBoost | State-of-the-art gradient boosting | N-estimators, max depth, learning rate |
Each model provides:
- Performance metrics (train & test)
- Confusion matrices
- Feature importance rankings
- Model-specific recommendations
- Data quality assessments
📄 Professional PDF Reports (AI-Generated)
- Natural language insights generated by Google Gemini
- Executive summary with best-performing model
- Data quality and EDA insights
- Model performance comparisons
- Feature importance analysis
- Actionable recommendations
- Professional charts and visualizations
❄️ Snowflake Data Platform
- Isolated workflow schemas - Each upload creates a `WORKFLOW_<UUID>` schema
- Scalable data warehouse for enterprise datasets
- SQL-based data processing and storage
- Persistent storage for all EDA and ML results
- Clean separation between workflows
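A minimal sketch of how a per-upload schema name might be derived; the actual naming logic lives in the app's workflow manager:

```python
import uuid

def new_workflow_schema():
    """Derive an isolated per-upload schema name (illustrative).
    Unquoted Snowflake identifiers are case-insensitive, so uppercase the
    UUID and drop hyphens to keep the name a plain identifier."""
    return f"WORKFLOW_{uuid.uuid4().hex.upper()}"

schema = new_workflow_schema()
print(schema)  # e.g. WORKFLOW_3F2A...
```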
⚡ Modern Web Interface (React + Tailwind)
- Drag-and-drop file upload
- Real-time progress tracking with rotating status messages
- Interactive podium display for top 3 target recommendations
- Responsive design for all devices
- Smooth animations with AOS (Animate On Scroll)
- In-browser PDF viewing and download
🏗️ Architecture
┌─────────────────────────────────────────────────────────────────┐
│ FRONTEND (React 19 + Vite) │
│ │
│ • Drag & drop file upload • Podium target display │
│ • Real-time progress tracking • PDF viewer │
│ • Target variable selection • Responsive UI │
└─────────────────────┬───────────────────────────────────────────┘
│ REST API (CORS enabled)
│
┌─────────────────────▼───────────────────────────────────────────┐
│ BACKEND (Flask 3.0 API) │
│ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ MULTI-AGENT SYSTEM │ │
│ │ │ │
│ │ 1️⃣ EDA Agent (Parallel) │ │
│ │ → Analyze dataset │ │
│ │ → Store stats in Snowflake │ │
│ │ │ │
│ │ 2️⃣ Target Variable Agent (Parallel) - Gemini 2.5 │ │
│ │ → Sample data │ │
│ │ → LLM analysis │ │
│ │ → Rank top 5 targets (importance scores) │ │
│ │ │ │
│ │ 3️⃣ ML Training Agents (Sequential after target select) │ │
│ │ → Logistic Regression Agent │ │
│ │ → Decision Tree Agent │ │
│ │ → XGBoost Agent │ │
│ │ │ │
│ │ 4️⃣ Natural Language Agent - Gemini 2.5 │ │
│ │ → Collect EDA & ML results │ │
│ │ → Generate insights (LLM) │ │
│ │ → Create PDF report (ReportLab) │ │
│ └────────────────────────────────────────────────────────────┘ │
│ │
│ Services: │
│ • Workflow Manager • Snowflake Ingestion │
│ • EDA Service • Config Management │
└─────────────────────┬───────────────────────────────────────────┘
│ Snowflake Connector
│
┌─────────────────────▼───────────────────────────────────────────┐
│ SNOWFLAKE DATA PLATFORM │
│ │
│ Isolated Schemas: WORKFLOW_<UUID> │
│ │
│ Tables per Workflow: │
│ • WORKFLOW_METADATA → Workflow info │
│ • WORKFLOW_EDA_SUMMARY → EDA results │
│ • COLUMN_STATS → Column metrics │
│ • LOGISTIC_REGRESSION_SUMMARY → LR model results │
│ • DECISION_TREE_SUMMARY → DT model results │
│ • XGBOOST_SUMMARY → XGB model results │
│ • RAW_DATA_TABLE → Original CSV data │
└─────────────────────────────────────────────────────────────────┘
🔄 Application Workflow
Phase 1: Upload & Parallel Analysis (45-60s)
User uploads CSV
│
├─→ Store in Snowflake (temp file → Snowflake table)
│
├─→ 🧵 Thread 1: EDA Agent
│ └─→ Analyze all columns
│ └─→ Save to WORKFLOW_EDA_SUMMARY
│
└─→ 🧵 Thread 2: Target Variable Agent (Gemini 2.5)
└─→ Sample 100 rows
└─→ LLM analysis
└─→ Return top 5 targets (ranked)
Time Saved: ~40% faster than sequential execution
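The two-thread phase above can be sketched with `ThreadPoolExecutor`, which the Performance Optimizations section notes the backend uses; the agent bodies here are placeholders for the real Snowflake and Gemini calls:

```python
from concurrent.futures import ThreadPoolExecutor

def run_eda(table):
    # Placeholder for the EDA agent's Snowflake analysis.
    return f"eda:{table}"

def suggest_targets(table):
    # Placeholder for the Gemini-backed target variable agent.
    return f"targets:{table}"

def parallel_analysis(table):
    """Run both phase-1 agents concurrently and wait for both results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        eda_future = pool.submit(run_eda, table)
        target_future = pool.submit(suggest_targets, table)
        return eda_future.result(), target_future.result()

eda, targets = parallel_analysis("RAW_DATA_TABLE")
print(eda, targets)
```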
Phase 2: Target Selection (User interaction)
Frontend displays 3 recommendations in podium:
🥇 Gold (Rank 1 - Highest importance)
🥈 Silver (Rank 2)
🥉 Bronze (Rank 3)
+ "Other Options" button → Shows all 5 recommendations
Each recommendation includes:
• Importance Score (1-100)
• Problem Type (regression/classification)
• Why Important (business value)
• Predictability (HIGH/MEDIUM/LOW)
• Suggested Features (top predictors)
User selects target → Saved to workflow_metadata
Phase 3: Sequential ML Training (10-15s)
Automatic training after target selection:
1. Logistic Regression Agent
├─→ Feature engineering
├─→ Train/test split (80/20)
├─→ Model training (max_iter=1000)
├─→ Performance evaluation
└─→ Save to LOGISTIC_REGRESSION_SUMMARY
2. Decision Tree Agent
├─→ Feature engineering
├─→ Train/test split (80/20)
├─→ Model training (max_depth=10)
├─→ Performance evaluation
└─→ Save to DECISION_TREE_SUMMARY
3. XGBoost Agent
├─→ Feature engineering
├─→ Train/test split (80/20)
├─→ Model training (n_estimators=100, max_depth=6)
├─→ Performance evaluation
└─→ Save to XGBOOST_SUMMARY
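The sequential phase can be sketched with the stack's scikit-learn models; the XGBoost step is analogous and omitted here. The synthetic dataset is illustrative, while the 80/20 split, `max_iter=1000`, and `max_depth=10` follow the phase description above:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy dataset standing in for the uploaded table
X, y = make_classification(n_samples=200, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)  # 80/20 split

results = {}
for name, model in [
    ("logistic_regression", LogisticRegression(max_iter=1000)),
    ("decision_tree", DecisionTreeClassifier(max_depth=10)),
]:
    model.fit(X_train, y_train)          # train sequentially, as described
    results[name] = accuracy_score(y_test, model.predict(X_test))

print(results)
```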
Phase 4: Report Generation (5-10s)
Natural Language Agent (Gemini 2.5)
│
├─→ Collect EDA insights from Snowflake
├─→ Collect ML results (all 3 models)
├─→ Generate narrative with Gemini LLM
├─→ Create visualizations (Matplotlib)
├─→ Generate PDF (ReportLab)
└─→ Save to backend/pdf/
Total Time: ~60-85 seconds from upload to PDF
🛠️ Technology Stack
Frontend
| Library | Version | Purpose |
|---|---|---|
| React | 19.1.1 | UI framework |
| React Router | 7.9.3 | Navigation |
| Vite | 7.1.7 | Build tool & dev server |
| Tailwind CSS | 4.1.14 | Styling framework |
| AOS | 2.3.4 | Scroll animations |
| ESLint | 9.36.0 | Code linting |
Backend
| Library | Version | Purpose |
|---|---|---|
| Flask | 3.0.0 | REST API framework |
| Flask-CORS | 4.0.0 | Cross-origin support |
| python-dotenv | 1.0.1 | Environment config |
AI & Machine Learning
| Library | Version | Purpose |
|---|---|---|
| google-generativeai | 0.8.3 | Google Gemini 2.5 API |
| scikit-learn | 1.5.0 | Logistic Regression, Decision Tree |
| XGBoost | 2.1.0 | Gradient boosting |
| SHAP | 0.44.0 | Model explainability |
| pandas | 2.2.0 | Data manipulation |
| NumPy | 1.26.0 | Numerical operations |
Data Platform
| Library | Version | Purpose |
|---|---|---|
| snowflake-connector-python | 3.12.0 | Snowflake connectivity |
| snowflake-snowpark-python | 1.39.1 | Snowpark DataFrame API |
Reporting & Visualization
| Library | Version | Purpose |
|---|---|---|
| ReportLab | 4.0.7 | PDF generation |
| Matplotlib | 3.8.0 | Charts & visualizations |
📋 Project Structure
foresee-app/
├── frontend/ # React Application
│ ├── src/
│ │ ├── pages/
│ │ │ ├── Home.jsx # Landing page
│ │ │ ├── Foresee.jsx # Main app (upload, analysis, results)
│ │ │ ├── AboutUs.jsx # Team information
│ │ │ └── Help.jsx # User guide
│ │ ├── components/
│ │ │ ├── TopBanner.jsx # Navigation header
│ │ │ └── Footer.jsx # Footer
│ │ ├── App.jsx # Main component & routing
│ │ └── main.jsx # Entry point
│ ├── package.json # Frontend dependencies
│ └── vite.config.js # Vite configuration
│
├── backend/ # Flask API + ML Agents
│ ├── app.py # Main Flask API (1949 lines)
│ │
│ ├── agents/ # AI/ML Agents
│ │ ├── eda_agent/ # EDA Agent (Snowflake-based)
│ │ │ ├── agent.py # Main EDA orchestration
│ │ │ ├── config.py # EDA configuration
│ │ │ ├── database/
│ │ │ │ ├── connection.py # Snowflake connection
│ │ │ │ ├── schema.py # Schema management
│ │ │ │ └── storage.py # Results storage
│ │ │ ├── metrics/ # Metric calculators
│ │ │ │ ├── basic_metrics.py # Basic stats
│ │ │ │ ├── numeric_metrics.py # Numeric stats
│ │ │ │ ├── categorical_metrics.py
│ │ │ │ ├── datetime_metrics.py
│ │ │ │ ├── text_metrics.py
│ │ │ │ └── target_metrics.py
│ │ │ └── utils/
│ │ │ ├── helpers.py
│ │ │ ├── logger.py
│ │ │ └── validators.py
│ │ │
│ │ ├── target_variable_agent.py # Gemini-powered target suggestions
│ │ ├── logistic_regression_agent.py
│ │ ├── decision_tree_agent.py
│ │ ├── xgboost_agent.py
│ │ └── natural_language_agent.py # Gemini-powered PDF generation
│ │
│ ├── services/
│ │ ├── workflow_manager.py # Workflow & schema management
│ │ ├── snowflake_ingestion.py # CSV → Snowflake
│ │ ├── eda_service.py # EDA orchestration
│ │ └── config.py # Configuration loader
│ │
│ ├── insights/ # Generated JSON insights
│ └── pdf/ # Generated PDF reports
│
├── requirements.txt # Python dependencies
├── .env # Environment variables (not tracked)
├── .gitignore
├── start.bat # Windows startup script
├── start.sh # Linux/Mac startup script
└── README.md
🚀 Getting Started
Prerequisites
- Python 3.11+ (Download)
- Node.js 18+ (Download)
- Snowflake Account (Sign up)
- Google Gemini API Key (Get free key)
Installation
1. Clone the repository

```bash
git clone https://github.com/yourusername/foresee-app.git
cd foresee-app
```

2. Set up backend
```bash
# Create virtual environment
python -m venv myenv

# Activate virtual environment
# Windows:
myenv\Scripts\activate
# macOS/Linux:
source myenv/bin/activate

# Install Python dependencies
pip install -r requirements.txt
```

3. Configure environment variables

Create a .env file in the project root:
```env
# Snowflake Configuration
SNOWFLAKE_ACCOUNT=your_account_identifier
SNOWFLAKE_USER=your_username
SNOWFLAKE_PASSWORD=your_password
SNOWFLAKE_DATABASE=your_database
SNOWFLAKE_SCHEMA=PUBLIC
INGESTION_WAREHOUSE=your_warehouse

# Google Gemini API
GEMINI_API_KEY=your_gemini_api_key_here
```

Get your Gemini API key:
- Visit https://aistudio.google.com/app/apikey
- Sign in with your Google account
- Click "Create API Key"
- Copy and paste the key into .env
4. Set up frontend

```bash
cd frontend
npm install
cd ..
```

🎬 Running the Application
Option 1: Quick Start (Recommended) ⭐
Windows:

```bash
start.bat
```

macOS/Linux:

```bash
chmod +x start.sh   # First time only
./start.sh
```

This automatically:
- Activates the Python virtual environment
- Starts the Flask backend (port 5000)
- Starts the Vite frontend (port 5173)

Option 2: Using npm

```bash
cd frontend
npm run dev:all
```

Uses `concurrently` to run both servers simultaneously.
Option 3: Manual (Two Terminals)
Terminal 1 - Backend:

```bash
# Activate virtual environment
myenv\Scripts\activate        # Windows
# or
source myenv/bin/activate     # macOS/Linux

# Start Flask server
cd backend
python app.py
```

Terminal 2 - Frontend:

```bash
cd frontend
npm run dev
```

Access the Application

- Frontend: http://localhost:5173
- Backend API: http://localhost:5000
- Health Check: http://localhost:5000/api/health
📖 Usage Guide
1. Upload Dataset
- Navigate to "Foresee" in the top menu
- Drag & drop your CSV file or click "Choose File"
- Click "Upload & Analyze"
The system will:
- Upload data to Snowflake
- Run parallel EDA + Target Analysis (~45-60s)
- Display progress with rotating status messages
2. Select Target Variable
After analysis, you'll see:
Podium Display (Top 3):
- 🥇 Gold - Most important target (Rank 1)
- 🥈 Silver - Second best (Rank 2)
- 🥉 Bronze - Third option (Rank 3)
Click "Other Options" to see all 5 recommendations.
Each recommendation shows:
- Importance Score (1-100) - Quantitative ranking
- Problem Type - regression/classification
- Why Important - Business value explanation
- Predictability - HIGH/MEDIUM/LOW feasibility
- Suggested Features - Best predictor columns
3. Model Training (Automatic)
After selecting a target, the system automatically:
- Trains Logistic Regression model
- Trains Decision Tree model
- Trains XGBoost model
- Generates Natural Language Insights (Gemini)
- Creates PDF Report (ReportLab)
Total Time: ~15-25 seconds (training plus report generation, per the workflow phases above)
4. View/Download Report
When complete:
- Click "View Report" → Opens PDF in browser
- Click "Download Report" → Saves PDF to your computer
📊 What's in the PDF Report?
1. Executive Summary
- Dataset overview (rows, columns)
- Selected target variable
- Best-performing model
- Key findings
2. Data Quality Analysis
- Missing value analysis
- Duplicate detection
- Column type breakdown
- Data completeness metrics
3. Exploratory Data Analysis
- Numeric column statistics (mean, std, quartiles, skewness, kurtosis)
- Categorical distributions (top values, cardinality)
- Datetime patterns
- Text metrics
4. Model Performance Comparison (example values)
| Model | Accuracy | Precision | Recall | F1 Score | ROC-AUC |
|---|---|---|---|---|---|
| Logistic Regression | 0.85 | 0.82 | 0.88 | 0.85 | 0.91 |
| Decision Tree | 0.83 | 0.80 | 0.87 | 0.83 | 0.89 |
| XGBoost | 0.88 | 0.86 | 0.90 | 0.88 | 0.94 |
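Selecting the "best-performing model" for the executive summary reduces to a comparison like the following; metric names and values here are illustrative, mirroring the example table above:

```python
def best_model(metrics, key="test_f1_score"):
    """Return the model name with the highest value for the given metric."""
    return max(metrics, key=lambda name: metrics[name][key])

metrics = {
    "Logistic Regression": {"test_f1_score": 0.85, "test_roc_auc": 0.91},
    "Decision Tree":       {"test_f1_score": 0.83, "test_roc_auc": 0.89},
    "XGBoost":             {"test_f1_score": 0.88, "test_roc_auc": 0.94},
}
print(best_model(metrics))  # XGBoost
```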
5. Feature Importance
- Top 10 most important features
- Feature importance scores
- Model-specific interpretations
6. Model-Specific Insights
- Confusion matrices
- Decision tree depth/leaves
- XGBoost hyperparameters
- Performance summaries
7. Recommendations
- Best model selection advice
- Data quality improvements
- Feature engineering suggestions
- Next steps for deployment
🎯 API Reference
Core Endpoints
| Method | Endpoint | Description |
|---|---|---|
| GET | /api/health | Health check |
| POST | /api/upload | Upload CSV & run parallel analysis |
| GET | /api/workflows | List all workflows |
| DELETE | /api/workflow/<id> | Delete workflow & schema |
| POST | /api/query | Execute SQL query on workflow |
Target Variable Selection
| Method | Endpoint | Description |
|---|---|---|
| GET | /api/target-suggestions/<workflow_id>/<table_name> | Get AI recommendations (Gemini) |
| POST | /api/workflow/<id>/select-target | Save target & auto-train models |

POST Body:

```json
{
  "target_variable": "column_name",
  "table_name": "table_name",
  "problem_type": "classification",
  "importance_score": 95
}
```

Manual Model Training (Optional)
| Method | Endpoint | Description |
|---|---|---|
| POST | /api/workflow/<id>/train-logistic-regression | Train LR model |
| POST | /api/workflow/<id>/train-decision-tree | Train DT model |
| POST | /api/workflow/<id>/train-xgboost | Train XGBoost model |
Model Results
| Method | Endpoint | Description |
|---|---|---|
| GET | /api/workflow/<id>/logistic-regression-results | Get LR results |
| GET | /api/workflow/<id>/decision-tree-results | Get DT results |
| GET | /api/workflow/<id>/xgboost-results | Get XGB results |
Report Generation
| Method | Endpoint | Description |
|---|---|---|
| POST | /api/workflow/<id>/generate-insights | Generate insights & PDF (Gemini) |
| GET | /api/workflow/<id>/report/view | View PDF in browser |
| GET | /api/workflow/<id>/report/download | Download PDF |
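A client-side sketch of calling the select-target endpoint with the documented POST body. The URL and field values are placeholders, and the `select_target` call itself requires a running backend:

```python
import json
from urllib import request

API_BASE = "http://localhost:5000/api"  # adjust for your deployment

def build_select_target_payload(target, table, problem_type, score):
    """Assemble the documented body for /api/workflow/<id>/select-target."""
    return {
        "target_variable": target,
        "table_name": table,
        "problem_type": problem_type,
        "importance_score": score,
    }

def select_target(workflow_id, payload):
    """POST the payload to a running backend and return the JSON response."""
    req = request.Request(
        f"{API_BASE}/workflow/{workflow_id}/select-target",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.load(resp)

payload = build_select_target_payload("churned", "RAW_DATA_TABLE", "classification", 95)
print(payload["problem_type"])  # classification
```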
🗄️ Snowflake Database Schema
Each workflow creates an isolated schema: WORKFLOW_<UUID>
Tables Created Per Workflow
1. WORKFLOW_METADATA
```sql
CREATE TABLE WORKFLOW_METADATA (
    key VARCHAR,
    value VARIANT,
    updated_at TIMESTAMP
);
```

Stores workflow-level metadata (selected target, configuration).
2. WORKFLOW_EDA_SUMMARY
```sql
CREATE TABLE WORKFLOW_EDA_SUMMARY (
    analysis_id VARCHAR PRIMARY KEY,
    table_name VARCHAR,
    total_rows INTEGER,
    total_columns INTEGER,
    duplicate_rows INTEGER,
    target_column VARCHAR,
    analysis_type VARCHAR,
    created_at TIMESTAMP
);
```

3. COLUMN_STATS
```sql
CREATE TABLE COLUMN_STATS (
    column_name VARCHAR,
    data_type VARCHAR,
    null_count INTEGER,
    unique_count INTEGER,
    completeness FLOAT,
    -- Numeric metrics
    mean FLOAT,
    std FLOAT,
    min FLOAT,
    max FLOAT,
    q1 FLOAT,
    q2 FLOAT,
    q3 FLOAT,
    skewness FLOAT,
    kurtosis FLOAT,
    -- Categorical metrics
    mode VARCHAR,
    top_values VARIANT,
    cardinality INTEGER,
    -- Text metrics
    avg_length FLOAT,
    max_length INTEGER,
    -- Datetime metrics
    date_range VARIANT
);
```

4. LOGISTIC_REGRESSION_SUMMARY
```sql
CREATE TABLE LOGISTIC_REGRESSION_SUMMARY (
    analysis_id VARCHAR PRIMARY KEY,
    table_name VARCHAR,
    target_variable VARCHAR,
    model_type VARCHAR,
    problem_type VARCHAR,
    test_accuracy FLOAT,
    test_precision FLOAT,
    test_recall FLOAT,
    test_f1_score FLOAT,
    test_roc_auc FLOAT,
    train_accuracy FLOAT,
    total_samples INTEGER,
    total_features INTEGER,
    n_classes INTEGER,
    confusion_matrix ARRAY,
    top_features ARRAY,
    performance_summary VARCHAR,
    recommendations VARCHAR,
    created_at TIMESTAMP
);
```

5. DECISION_TREE_SUMMARY
Same as Logistic Regression, plus these additional columns:

```sql
tree_depth INTEGER,
n_leaves INTEGER,
max_depth INTEGER,
min_samples_split INTEGER,
min_samples_leaf INTEGER
```

6. XGBOOST_SUMMARY
Same as Logistic Regression, plus these additional columns:

```sql
n_estimators INTEGER,
max_depth INTEGER,
learning_rate FLOAT,
subsample FLOAT,
colsample_bytree FLOAT
```

⚙️ Configuration
Backend Configuration (backend/services/config.py)
```python
import os
from dotenv import load_dotenv

load_dotenv()

class Config:
    SNOWFLAKE_ACCOUNT = os.getenv("SNOWFLAKE_ACCOUNT")
    SNOWFLAKE_USER = os.getenv("SNOWFLAKE_USER")
    SNOWFLAKE_PASSWORD = os.getenv("SNOWFLAKE_PASSWORD")
    SNOWFLAKE_DATABASE = os.getenv("SNOWFLAKE_DATABASE")
    SNOWFLAKE_SCHEMA = os.getenv("SNOWFLAKE_SCHEMA")
    INGESTION_WAREHOUSE = os.getenv("INGESTION_WAREHOUSE")
```

Frontend Configuration (frontend/src/pages/Foresee.jsx)
```js
const API_BASE_URL = "http://localhost:5000/api";
```

Change this to your backend URL in production.
Flask Configuration (backend/app.py)

```python
ALLOWED_EXTENSIONS = {'csv'}
app.config['MAX_CONTENT_LENGTH'] = 500 * 1024 * 1024  # 500MB max
```

🤖 Google Gemini Integration
Models Used
| Agent | Model | Purpose |
|---|---|---|
| Target Variable Agent | gemini-2.5-flash-preview-05-20 | Analyze data & rank targets |
| Natural Language Agent | gemini-2.5-flash-preview-05-20 | Generate PDF insights |
API Configuration
```python
import os
import google.generativeai as genai

genai.configure(api_key=os.getenv('GEMINI_API_KEY'))
model = genai.GenerativeModel('models/gemini-2.5-flash-preview-05-20')

# Generate content
response = model.generate_content(prompt)
```

Rate Limits & Pricing
- Free Tier: 15 requests/minute, 1500 requests/day
- Paid Tier: Higher limits available
- Check current pricing: https://ai.google.dev/pricing
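When the free-tier limit is hit (a RESOURCE_EXHAUSTED error), a simple retry with exponential backoff is a common remedy. This sketch is illustrative, not the app's actual error handling; the exception type and delays are assumptions:

```python
import time

def call_with_backoff(fn, retries=3, base_delay=1.0):
    """Retry a callable on rate-limit-style errors with exponential backoff.
    RuntimeError stands in for the SDK's rate-limit exception."""
    for attempt in range(retries):
        try:
            return fn()
        except RuntimeError:
            if attempt == retries - 1:
                raise                          # out of retries: give up
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...

# Simulated flaky API: fails twice, then succeeds
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("RESOURCE_EXHAUSTED")
    return "ok"

result = call_with_backoff(flaky, base_delay=0.01)
print(result)  # ok
```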
🔒 Security & Privacy
Data Isolation
- Each workflow creates an isolated Snowflake schema (`WORKFLOW_<UUID>`)
- No data mixing between workflows
- Automatic cleanup on workflow deletion
API Security
- CORS enabled for frontend-backend communication
- File size limits (500MB max)
- File type validation (CSV only)
- Secure filename sanitization
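Filename sanitization can be illustrated as follows; the app more likely relies on a helper such as werkzeug's `secure_filename` than on this exact logic:

```python
import re

def sanitize_filename(name):
    """Illustrative sanitizer: strip path components, neutralize unusual
    characters, and refuse hidden/empty names."""
    name = name.replace("\\", "/").rsplit("/", 1)[-1]   # drop any path parts
    name = re.sub(r"[^A-Za-z0-9._-]", "_", name)        # neutralize odd chars
    return name.lstrip(".") or "upload.csv"             # no hidden/empty names

print(sanitize_filename("../../etc/passwd ; rm -rf.csv"))
```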
Data Privacy
- Data stored in your Snowflake account (not ours)
- AI models (Gemini) don't retain your data
- Stateless API calls
- No data sent to third parties
Temporary Files
- Uploaded files stored in system temp directory
- Automatically deleted after Snowflake upload
- No persistent local storage
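The temp-file lifecycle described above, as a sketch with the Snowflake loader stubbed out:

```python
import os
import tempfile

def ingest_csv(file_bytes):
    """Spill the upload to a temp file, hand it to the (stubbed) Snowflake
    loader, then delete it so no persistent local copy remains."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(file_bytes)
        # upload_to_snowflake(path)  # real ingestion would happen here
        return os.path.exists(path)
    finally:
        os.remove(path)  # cleanup runs even if the upload step fails

existed_during_upload = ingest_csv(b"a,b\n1,2\n")
print(existed_during_upload)  # True, but the file is gone afterwards
```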
🐛 Troubleshooting
Backend won't start
```bash
# Check Python version
python --version   # Should be 3.11+

# Verify .env file exists
cat .env    # macOS/Linux
type .env   # Windows

# Test Snowflake connection
python -c "from services.config import Config; print(Config.SNOWFLAKE_ACCOUNT)"
```

Frontend can't connect to backend
```bash
# Verify backend is running
curl http://localhost:5000/api/health

# Check CORS is enabled in backend/app.py
# CORS(app) should be present

# Verify API_BASE_URL in frontend matches backend port
```

Upload fails
Possible causes:
- Invalid CSV format - Verify the file has headers and proper encoding
- Snowflake credentials - Check the .env variables
- Warehouse not running - Start the warehouse in the Snowflake UI
- Insufficient credits - Check Snowflake billing
Gemini API errors
```bash
# Verify API key is set
echo $GEMINI_API_KEY    # macOS/Linux
echo %GEMINI_API_KEY%   # Windows

# Test API key manually
curl -H "Content-Type: application/json" \
  -d '{"contents":[{"parts":[{"text":"Hello"}]}]}' \
  "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash-preview-05-20:generateContent?key=YOUR_API_KEY"
```

Common errors:
- INVALID_ARGUMENT: Invalid API key
- RESOURCE_EXHAUSTED: Rate limit exceeded (wait 1 minute)
- PERMISSION_DENIED: API not enabled (enable in Google Cloud Console)
Model training fails
Possible causes:
- Target has only 1 unique value - Select different target
- Target has excessive nulls (>50%) - Data quality issue
- Insufficient samples (<50 rows) - Upload larger dataset
- All features are null - Data quality issue
📈 Performance Optimizations
Parallel Execution
- EDA + Target Analysis run in parallel using `ThreadPoolExecutor`
- Saves ~40% time compared to sequential execution
- Typical time saved: 20-30 seconds
Snowflake Optimizations
- Uses the `PUT` command for fast bulk loading
- Isolated schemas reduce query overhead
- Warehouse auto-suspend to reduce costs
Frontend Optimizations
- Vite for fast builds and HMR (Hot Module Replacement)
- Code splitting with React Router
- Lazy loading of heavy components
📝 Development Guidelines
Python Code Style
- Follow PEP 8 style guide
- Use type hints for function parameters
- Docstrings for all functions/classes
- Maximum line length: 100 characters
React Code Style
- Use functional components with hooks
- ESLint for code linting
- Consistent file naming (PascalCase for components)
- PropTypes for type checking (optional)
Git Workflow
```bash
# Create feature branch
git checkout -b feature/your-feature-name

# Make changes and commit
git add .
git commit -m "Add: your feature description"

# Push to remote
git push origin feature/your-feature-name

# Open Pull Request on GitHub
```

🤝 Contributing
Contributions are welcome! Please follow these steps:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
📝 License
This project is licensed under the MIT License.
🙏 Acknowledgments
Technologies
- Snowflake - Enterprise data platform
- Google Gemini - AI-powered insights (gemini-2.5-flash-preview-05-20)
- React - Modern web framework
- Flask - Lightweight Python API
- scikit-learn & XGBoost - ML frameworks
- ReportLab - Professional PDF generation
Team
Built with ❤️ by the Foresee Team
📞 Support
For issues, questions, or suggestions:
- 🐛 Issues: GitHub Issues
- 📧 Email: support@foresee-app.com
- 📚 Documentation: GitHub Wiki
🚀 Roadmap
Planned Features
- Support for more ML models (Random Forest, Neural Networks)
- Advanced hyperparameter tuning (GridSearchCV)
- Time series forecasting support
- Interactive charts in PDF reports
- Model deployment API (FastAPI)
- Scheduled re-training
- User authentication (OAuth 2.0)
- Multi-user workspaces
- Excel file support (.xlsx)
- Real-time model monitoring dashboard
- SHAP value visualizations
- Model versioning & comparison
Happy Analyzing! 🚀📊
Transform your data into insights with the power of AI.